Bio Perl
Table of Contents
1. General introduction
    1.1. Bioperl documentation
    1.2. General bioperl classes
2. Sequences
    2.1. The Bio::SeqIO class
        2.1.1. Format Converter
    2.2. Sequence classes
        2.2.1. Introduction
        2.2.2. Building mechanisms summary
        2.2.3. A deeper insight into the Bio::Seq class
        2.2.4. Bioperl Sequence classes structure
    2.3. Features and Location classes
        2.3.1. Feature
        2.3.2. Code reading: extracting CDS
        2.3.3. Tag system
        2.3.4. Location
        2.3.5. Graphical view of features
    2.4. Sequence analysis tools
3. Alignments
    3.1. AlignIO
    3.2. SimpleAlign
    3.3. Code reading: protal2dna
4. Analysis
    4.1. Blast
        4.1.1. Running Blast
        4.1.2. Parsing Blast
        4.1.3. Bio::Tools::BPlite family parsers
        4.1.4. PSI-BLAST (Position Specific Iterative Blast)
        4.1.5. bl2seq: Blast 2 sequences
        4.1.6. Blast Internal classes structure
    4.2. Genscan
5. Databases
    5.1. Database classes
    5.2. Accessing a local database with golden
6. Perl Reminders
    6.1. UML
    6.2. Perl reminders to use bioperl modules
        6.2.1. References
        6.2.2. Filehandles and streams
        6.2.3. Exceptions
        6.2.4. Getopt::Std
        6.2.5. Classes
        6.2.6. BEGIN block
    6.3. Perl reminders for a further advanced understanding of bioperl modules
        6.3.1. Modules
        6.3.2. Compiler instructions
        6.3.3. Tie
A. Solutions
    A.1. Sequences
    A.2. Alignments
    A.3. Analysis
    A.4. Databases
List of Figures
2.1. Bio::Seq class structure
2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components
2.3. Bio::Annotation package structure
2.4. Bioperl Sequence classes structure
2.5. Features Classes structure
2.6. Correspondence between an EMBL entry and bioperl tags
2.7. Location Classes Structure
2.8. Graphical view of some features of the SwissProt entry BACR_HALHA
3.1. AlignIO Classes diagram
3.2. Align Classes diagram
4.1. Blast Classes diagram
4.2. BPLite Classes diagram
4.3. Blast internal classes diagram
4.4. Genscan Classes Structure
5.1. Database Classes structure
6.1. UML meanings
A.1. Bio::SeqIO structure
List of Examples
2.1. SwissProt -> Fasta
2.2. Loading a sequence from a remote server
2.3. Find the references to the PDB database entries
3.1. Format conversions with AlignIO
3.2. Basic methods of SimpleAlign
3.3. Filter gap columns
4.1. StandAloneBlast run
4.2. StandAloneBlast parsing
4.3. Parsing from a Blast file
4.4. Parsing with BPLite
4.5. Running PSI-blast
4.6. PSI-Blast SearchIO class
4.7. Running bl2seq
4.8. Genscan parsing
4.9. Genscan parsing, with sub-sequences
5.1. Database class use
5.2. Database Index creation
List of Exercises
2.1. Bio::SeqIO
2.2. A universal converter
2.3. Display a sequence in fasta format
2.4. More on annotations
2.5. Transmembrane helices
2.6. extractcds
2.7. extractcds version 2
3.1. Create an alignment without gaps
4.1. Running Blast on a Swissprot entry
4.2. Running Blast: Setting parameters
4.3. Running Blast: Saving output
4.4. Running a Remote Blast
4.5. Display Blast hits
4.6. Class of a Blast report
4.7. Parse a Blast output file
4.8. Parse Blast results on standard input
4.9. Filtering hits by length
4.10. Filtering hits by position
4.11. Display the best hit by databank
4.12. Multiple queries
4.13. Extracting the subject sequence
4.14. Extracting alignments
4.15. Locate EST in a genome
4.16. Record Blast hits as sequence features
4.17. Print information from a BPLite report
4.18. Create a Bio::Tools::BPlite from a file
4.19. Create a Bio::Tools::BPlite from standard input
4.20. Parse a PSI-blast report
4.21. Build a Bio::SimpleAlign object
4.22. Code reading: Bio::Tools::Genscan module
5.1. Parse a Genscan report and build a database entry
5.2. Parse a Genscan report and build a database entry with the genomic sequence
5.3. Build a small bioperl module (for the golden program)
Chapter 2. Sequences
2.1. The Bio::SeqIO class
2.1.1. Format Converter
Example 2.1. SwissProt -> Fasta
pseudocode:
    Construct a SwissProt formatted sequences input stream
    Construct a fasta formatted sequences output stream
    for all sequences:
        read the sequence from input stream
        write it to output stream
in bioperl:
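#! /local/bin/perl -w
use strict;
use Bio::SeqIO;

# input stream: SwissProt entries read from seqs.sp
my $in  = Bio::SeqIO->new(-file => '<seqs.sp',    -format => 'swiss');
# output stream: fasta records written to seqs.fasta
my $out = Bio::SeqIO->new(-file => '>seqs.fasta', -format => 'fasta');
while (my $seq = $in->next_seq()) {
    $out->write_seq($seq);
}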
from a file:
use Bio::SeqIO;
# object interface
my $seqin = Bio::SeqIO->new(-file => 'seq.fasta', -format => 'fasta');
my $seq3 = $seqin->next_seq();
# filehandle interface
my $seqin2 = Bio::SeqIO->newFh(-file => 'seq.fasta', -format => 'fasta');
my $seq4 = <$seqin2>;
# reading from a pipe: output of the golden program
my $seqin3 = Bio::SeqIO->newFh(-file => 'golden sp:TAUD_ECOLI |', -format => 'swiss');
my $seq5 = <$seqin3>;
Figure 2.2. Relation of a SwissProt entry and the corresponding Bio::Seq object components
in bioperl:
# PDB structures entries
my @structures = ();
foreach my $link ( $annotation->get_Annotations('dblink') ) {
    if ($link->database() eq 'PDB') {
        push (@structures, $link->primary_id());
    }
}
print "\nPDB Structures: ", join (" ", @structures), "\n";
Go to
Work on Exercise 2.6 before you continue.
Tip
Use Figure 2.2 again. Use the documentation of the appropriate objects. Here is a summary solution [codes/useSeq1.0.pl] of the exercises in this section.
They are created either by parsing database entries (see Bio::SeqIO::FTHelper [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO/FTHelper.html] or Figure A.1) or by parsing the output of programs. For instance (see Exercise 4.16), a Blast HSP is actually an instance of class Bio::SeqFeature::SimilarityPair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html], which is a sub-class of Bio::SeqFeature::FeaturePair [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/FeaturePair.html], which in turn is a feature. The same holds for Genscan predictions (Exercise 5.1).
Documentation on features.
    Features table official documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html]
    bioperl tutorial [http://bio.perl.org/Core/bptutorial.html#III_7_1_Representing_sequence_an]
Features Classes.
    Bio::SeqFeatureI
    Bio::SeqFeature::Generic
    Bio::SeqFeature::FeaturePair
    Bio::SeqFeature::SimilarityPair
    Bio::SeqFeature::Similarity
    Bio::SeqFeature::Gene::
    Bio::SeqFeature::Gene::ExonI
    Bio::SeqFeature::Gene::Exon
    Bio::SeqFeature::Gene::Transcript
    Bio::SeqFeature::Gene::TranscriptI
    Bio::SeqFeature::Gene::GeneStructureI
    Bio::SeqFeature::Gene::GeneStructure
Tip
Have a look at the Bio::SeqFeatureI [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/SeqFeatureI.html] documentation. Figure 2.5 shows the architecture of the Bio::SeqFeature class.
Tip
Figure 2.6 shows the relation of features in an Embl entry and the Bio::SeqFeature class.
This system enables you to refer to feature qualifiers (see documentation [http://www.ebi.ac.uk/embl/Documentation/FT_definitions/featu
The tag system also makes it possible to create new qualifier types. Examples in this tutorial: Blast hits as features (see Figure 4.1 and Exercise 4.16); creation of a database from a Genscan parsing (Exercise 5.1). See the methods related to tags in the Bio::SeqFeature::Generic [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/Generic.html] documentation.
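As a minimal sketch of the tag system, the snippet below builds a generic feature and attaches qualifiers to it. It assumes a bioperl installation that provides Bio::SeqFeature::Generic; the coordinates, the gene name and the note texts are made up for illustration. The accessors get_tag_values and get_all_tags are the names used by recent bioperl releases; older releases such as 1.0 are believed to use each_tag_value and all_tags instead, so check the documentation of your version.

```perl
use strict;
use warnings;
use Bio::SeqFeature::Generic;

# Hypothetical CDS feature: coordinates and qualifier values are illustrative only.
my $feat = Bio::SeqFeature::Generic->new(
    -start   => 100,
    -end     => 171,
    -strand  => -1,
    -primary => 'CDS',
    -tag     => { gene => 'demoA' },
);

# Qualifiers are stored as tag/value pairs; one tag may hold several values.
$feat->add_tag_value('note', 'predicted by Genscan');
$feat->add_tag_value('note', 'second note');

# List every tag together with all of its values.
foreach my $tag (sort $feat->get_all_tags()) {
    my @values = $feat->get_tag_values($tag);
    print "/$tag=", join(' | ', @values), "\n";
}

# has_tag tells whether a qualifier is present before you try to read it.
print "has translation? ", ($feat->has_tag('translation') ? "yes" : "no"), "\n";
```

Because any tag name is accepted, the same calls can record information that has no predefined qualifier, which is how Blast hits or Genscan predictions can be stored as features.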
2.3.4. Location
Locations are useful for everything that looks like a range within a sequence (cf. Bio::RangeI): sequences associated with features, sequences belonging to an alignment, and so on. Coordinate types. There are several coordinate types and representations: you can use fuzzy locations, ranges, lists, ... to specify the begin or the end (see the documentation
taken
which specifies a location whose begin is before 30 and whose end is at 90.
Coding:
    .. : EXACT
    ^ : BETWEEN (a site between 2 bases). Examples: 123^124 is a site between bases 123 and 124; 145^177 is a site between 2 adjacent bases, somewhere between bases 145 and 177.
    . : WITHIN (a base within a range). Example: (102.110) is a site within the 102 .. 110 range, inclusive.
    < : BEFORE
    > : AFTER
Location Classes.
    Bio::LocationI
    Bio::Location::Simple
    Bio::Location::SplitLocationI and Bio::Location::Split
    Bio::Location::FuzzyLocationI and Bio::Location::Fuzzy
Coordinate policies, for fuzzy location start and end determination:
    interface: Bio::Location::CoordinatePolicyI
    default: Bio::Location::WidestCoordPolicy
    alternatives: Bio::Location::NarrowestCoordPolicy, Bio::Location::AvWithinCoordPolicy
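To make the coding above concrete, here is a small sketch in plain Perl that classifies the endpoints of a location string. It does not use the bioperl location classes: the function names classify_endpoint and describe_location and the returned hash keys are hypothetical, chosen for this illustration, and only the simple forms listed above are handled.

```perl
use strict;
use warnings;

# Classify one endpoint: returns (type, min, max); undef marks an open bound.
sub classify_endpoint {
    my ($text) = @_;
    return ('BEFORE', undef, $1) if $text =~ /^<(\d+)$/;
    return ('AFTER',  $1, undef) if $text =~ /^>(\d+)$/;
    return ('WITHIN', $1, $2)    if $text =~ /^\((\d+)\.(\d+)\)$/;
    return ('EXACT',  $1, $1)    if $text =~ /^(\d+)$/;
    die "unrecognised endpoint '$text'\n";
}

# Split a location string into its two endpoints and describe each of them.
sub describe_location {
    my ($loc) = @_;
    my ($start, $sep, $end);
    if ($loc =~ /^(.+?)(\.\.|\^)(.+)$/) {
        ($start, $sep, $end) = ($1, $2, $3);
    } else {
        ($start, $sep, $end) = ($loc, '..', $loc);   # a single position
    }
    my ($s_type, $s_min, $s_max) = classify_endpoint($start);
    my ($e_type, $e_min, $e_max) = classify_endpoint($end);
    return {
        location_type => ($sep eq '^' ? 'BETWEEN' : 'EXACT'),
        start_type    => $s_type,
        end_type      => $e_type,
        start_min     => $s_min,
        start_max     => $s_max,
        end_min       => $e_min,
        end_max       => $e_max,
    };
}

for my $text ('<30..90', '123^124', '(102.110)') {
    my $d = describe_location($text);
    printf "%-10s location=%s start=%s end=%s\n",
        $text, $d->{location_type}, $d->{start_type}, $d->{end_type};
}
```

Keeping a minimum and a maximum for each endpoint is what lets a coordinate policy pick a single start and end for a fuzzy location, for instance the smallest start and the largest end for the widest interpretation.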
Figure 2.8 shows some features of the SwissProt entry BACR_HALHA. Here [codes/feature_graph.pl] is the code that generates the image. The helix image and the domain arrows are additional glyph types that can be found here [modules/].
Figure 2.8. Graphical view of some features of the SwissProt entry BACR_HALHA
$seqobj->translate;
and Bio::Tools::SeqWords [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html]:
$seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);
$seq_word->count_words($word_length);
Bio::SeqUtils [http://doc.bioperl.org/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html]:
$util = Bio::SeqUtils->new;
$polypeptide_3char = $util->seq3($seqobj);
or:
$polypeptide_3char = Bio::SeqUtils->seq3($seqobj);
-MAKE =>custom);
Chapter 3. Alignments
3.1. AlignIO
The AlignIO class is designed in the same way as the SeqIO class, so format conversions can be done as follows:
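Since the interface mirrors Bio::SeqIO, a conversion might be sketched as below. This assumes a bioperl installation providing Bio::AlignIO; the helper name convert_alignment, the file names and the choice of formats (fasta to clustalw) are illustrative assumptions, not part of the course material.

```perl
use strict;
use warnings;
use Bio::AlignIO;

# Read every alignment from $in_file and write it to $out_file in another format.
# Returns the number of alignments converted.
sub convert_alignment {
    my ($in_file, $in_format, $out_file, $out_format) = @_;
    my $in  = Bio::AlignIO->new(-file => $in_file,     -format => $in_format);
    my $out = Bio::AlignIO->new(-file => ">$out_file", -format => $out_format);
    my $count = 0;
    while (my $aln = $in->next_aln()) {
        $out->write_aln($aln);
        $count++;
    }
    return $count;
}

# Example call: perl convert.pl aln.fasta fasta aln.aln clustalw
convert_alignment(@ARGV) if @ARGV == 4;
```

The loop has the same shape as a Bio::SeqIO conversion, with next_aln and write_aln in place of next_seq and write_seq. Remember that some formats can only be read, so the output format must be one that bioperl can write.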
AlignIO class structure. Warning: some alignment formats are available only as input formats.
3.2. SimpleAlign
First, let's see some basic methods of the SimpleAlign class:
SimpleAlign class diagram. It contains the AlignI interface, which declares all the methods that Alignment classes must implement. The diagram also includes some classes that can create SimpleAlign objects:
The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions, as could be done with the UnivAln class of the old bioperl release. In order to filter columns by properties, you have to extract the columns yourself, filter them and reconstruct the new sequences. The following example filters gap columns.
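The extract, filter and reconstruct steps can be sketched in plain Perl as follows. This is not the course's own example: it assumes the aligned sequences are already available as equal-length strings (for instance taken from each sequence object of the alignment), treats '-' and '.' as gap characters, and drops every column that contains at least one gap. The function name remove_gap_columns is hypothetical.

```perl
use strict;
use warnings;

# Remove every column in which at least one sequence carries a gap character.
# Takes equal-length aligned strings, returns the filtered strings in the same order.
sub remove_gap_columns {
    my @rows = @_;
    return () unless @rows;
    my $len = length $rows[0];
    die "sequences differ in length\n" if grep { length($_) != $len } @rows;

    # Steps 1 and 2: extract each column and keep it only if it has no gap.
    my @kept;
    for my $col (0 .. $len - 1) {
        my $column = join '', map { substr($_, $col, 1) } @rows;
        push @kept, $column unless $column =~ /[-.]/;
    }

    # Step 3: rebuild one new sequence per row from the kept columns.
    return map {
        my $row = $_;
        join '', map { substr($_, $row, 1) } @kept;
    } 0 .. $#rows;
}

my @filtered = remove_gap_columns('MK-LV', 'MRALI', 'MKALV');
print "$_\n" for @filtered;
```

Changing the test on $column is enough to filter on another column property, for example keeping only fully conserved columns.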
Chapter 4. Analysis
4.1. Blast
4.1.1. Running Blast
4.1.1.1. Running local Blast with Bio::Tools::Run::StandAloneBlast
Example 4.1. StandAloneBlast run
use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
# read the query sequence from the fasta file given on the command line
my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'   => 'blastp',
                                                    'database'  => 'swissprot',
                                                    _READMETHOD => "Blast");
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
}
=> blastp,
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}
$result belongs to the Bio::Search::Result::GenericResult class, where you can find the documentation of the methods available on results.
my $result = $blast_report->next_result;
while (my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(), "\n";
    while (my $hsp = $hit->next_hsp()) {
        print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}
Solution A.16
4.1.2.5. Extracting data from a Blast report
Exercise 4.13. Extracting the subject sequence
Extract as a string the subject sequence from HSPs with identity above a given level.
Solution A.18
The old Bio::Tools::BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html] family of parsers (BPlite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPlite.html], Bio::Tools::BPpsilite [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPpsilite.html], Bio::Tools::BPbl2seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/BPbl2seq.html]) is still maintained in the 1.0 release, although it will be deprecated one day. It is actually still necessary for blasting two sequences and for performing a PSI-BLAST (Position Specific Iterative Blast). It is also still the default parser returned by Bio::Tools::Run::StandAloneBlast, as shown below (although this might change soon).
use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                    'database' => 'swissprot');
# without _READMETHOD, the report is parsed by BPlite
my $blast_report = $factory->blastall($query);
while (my $subject = $blast_report->nextSbjct()) {
    print $subject->name(), "\n";
    while (my $hsp = $subject->nextHSP()) {
        print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
    }
}
my $query = <$query_in>;
$factory = Bio::Tools::Run::StandAloneBlast->new('database' => 'swissprot');
$factory->j(2);    # number of PSI-Blast iterations
$blast_report = $factory->blastpgp($query);
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "iteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    while (my $hit = $result->nextSbjct()) {
        print "\thit name: ", $hit->name(), "\n";
        # nextHSP returns one HSP per call, so loop until it is exhausted
        while (my $hsp = $hit->nextHSP) {
            print "\tstart: ", $hsp->hit->start(), " end: ", $hsp->hit->end(), "\n";
            print "\tscore: ", $hsp->score(), "\n";
        }
    }
}
You can also load a report from a file, or from standard input:
my $blast_report = Bio::Tools::BPpsilite->new(-fh=>\*FH);
# assumption: the first query is read from $ARGV[0] (only $ARGV[1] is legible in the source)
my $query1_in = Bio::SeqIO->newFh(-file => $ARGV[0], -format => 'fasta');
my $query1 = <$query1_in>;
my $query2_in = Bio::SeqIO->newFh(-file => $ARGV[1], -format => 'fasta');
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw');
$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp');
$report = $factory->bl2seq($query1, $query2);
while (my $hsp = $report->next_feature) {
    # build a two-sequence alignment from each HSP
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(-seq   => $hsp->qs,
                                          -id    => $query1->display_id,
                                          -start => 1,
                                          -end   => $query1->length);
    my $sbjctSeq = Bio::LocatableSeq->new(-seq   => $hsp->ss,
                                          -id    => $report->sbjctName,
                                          -start => 1,
                                          -end   => $query2->length);
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}
The following diagram shows the internal class structure. It is provided for information only: you do not need to understand it to use the bioperl Blast classes.
4.2. Genscan
Example 4.8. Genscan parsing
The following example shows how to use the Bio::Tools::Genscan [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Genscan.html] parser. (data: Genscan output file [data/genscan.out])
use Bio::Tools::Genscan;
my $genscan_file = $ARGV[0];
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while (my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
}
# Genscan: example with sub-sequences
use Bio::Tools::Genscan;
my $genscan_file = $ARGV[0];
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
while (my $gene = $genscan->next_prediction()) {
    my $prot = $gene->predicted_protein;
    print "protein (", ref($prot), ") :\n", $prot->seq, "\n\n";
    # display genscan predicted cds (if -cds genscan option)
    my $predicted_cds = $gene->predicted_cds;
    print "predicted CDS: \n", $predicted_cds->seq, "\n\n";
    foreach my $exon ($gene->exons()) {
        my $loc = $exon->location;
        print "exon - primary_tag: ", $exon->primary_tag,
              " coding? ", $exon->is_coding(),
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }
    foreach my $intron ($gene->introns()) {
        my $loc = $intron->location;
        print "intron primary_tag: ", $intron->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }
    foreach my $utr ($gene->utrs()) {
        my $loc = $utr->location;
        print "utr primary_tag: ", $utr->primary_tag,
              " start-end: ", $loc->start, "-", $loc->end, "\n";
    }
    print "---------------------------------\n";
}
1. What is the purpose of the Bio::SeqAnalysisParserI class?
2. Look at the _parse_predictions method in Bio::Tools::Genscan.
3. What happens during the creation of a Bio::Tools::Genscan instance? Which methods are called? At which level of the class hierarchy?
Chapter 5. Databases
5.1. Database classes
Example 5.1. Database class use
use Bio::Index::Swissprot;
use Bio::SeqIO;

my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta');
my $Index_File_Name = shift;
my $inx = Bio::Index::Swissprot->new( -filename => $Index_File_Name);
foreach my $id (@ARGV) {
    my $seq = $inx->fetch($id);
    print $out $seq;
}
containing an entry for each Genscan predicted CDS; put the exons of this CDS as Features in the current entry
(b) index this Embl formatted database
(c) use your indexed database
Outline of the exercise:
Solution A.27
Exercise 5.2. Parse a Genscan report and build a database entry with the genomic sequence
Starting from Exercise 5.1, add the genomic DNA sequence as the sequence of the entry (+- delta at the beginning and at the end). Example:
SQ   Sequence 2448 BP; 897 A; 487 C; 411 G; 653 T; 0 other;
     ttaagttcca ttgatgtatt tcaaagggtt cagagtttta tcgttttaca aagaaatgat        60
     gagtgttcat gactgtaaaa tccaccttca tcttccactt tcagtttaac ggctccggct       120
(data: Arabidopsis sequence - part of the chromosome V [data/arabidop-chr5-1quart.fasta])
Build an additional feature describing all the CDS.
Put the translation as a tag for this CDS (see add_tag_value [http://doc.bioperl.org/releases/bioperl1.0/Bio/SeqFeature/Generic.html#POD14]), which yields, for instance:
FT   CDS             join(complement(100..171),complement(284..358),
FT                   complement(457..492),complement(626..673),
FT                   complement(1460..1511),complement(2473..2513),
FT                   complement(2638..2716),complement(2813..2969),
FT                   complement(3046..3178),complement(3263..3390),
FT                   complement(3476..3635))
FT                   /translation="MASTEGLMPITRAFLASYYDKYPFSPLSDDVSRLSSDMASLIKLL
FT                   TVQSPPSQGETSLIDEANRQPPHKIDENMWKNREQMEEILFLLSPSRWPVQLREPSTSE
FT                   DAEFASILRTLKDSFDNAFTAMISFQTKNSERIFSTVMTYMPQDFRGTLIRQQKERSER
FT                   NKQAEVDALVSSGGSIRDTYALLWKQQMERRRQLAQLGSATGVYKTLVKYLVGVPQVLL
FT                   DFIRQINDDDGDNEEYKEIIVQAGRTYEDIGFSVEYINASGEKTLILPYRRYEADQGNF
FT                   STLMAGNYKLVWDNSYSTFFKKTLRYKVDCIAPVVEPDPEPEPLN"
Regarding the location of the feature, see Section 2.3.4 in this tutorial, as well as add_sub_Location [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Location/Split.html#POD2].
Solution A.28
use LocalDBAccess;
use Bio::SeqIO;

my $seq = LocalDBAccess->get_Seq_by_id ( $ARGV[0] );
my $out = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'fasta' );
print $out $seq;
Solution A.29
6.1. UML
Figure 6.1. UML meanings
See for instance exercises 2.4 and 2.7. ref() returns the type of the variable (or its class, see below).
open (INFILE, $infile) || die "cannot open $infile:$!";
open (OUTFILE, "> $outfile") || die "cannot open $outfile:$!";
$line = <INFILE>;           # read a line from the input file
print OUTFILE $sequence;    # write to the output file
Filehandles such as INFILE or STDOUT belong to a specific namespace and cannot be used in an assignment or as a parameter. To do so, you have to pass a reference to the typeglob:
my $banque = Bio::SeqIO->newFh ( -fh => \*STDOUT, -format => 'embl');
Typeglobs live in a general namespace, on top of the other namespaces for scalars, arrays, hashes and functions. You can use them to create aliases (see the "Advanced Perl Programming" book).
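As a minimal sketch of passing a bareword filehandle to a subroutine through a typeglob reference, the following uses only core Perl. The helper write_line and the handle name OUT are hypothetical, and an in-memory file is used so that nothing is written to disk:

```perl
use strict;
use warnings;

# Hypothetical helper: writes one line to whatever filehandle it receives.
sub write_line {
    my ($fh, $text) = @_;
    print {$fh} $text, "\n";
}

# Open an in-memory file so that the example does not touch the disk.
my $buffer = '';
open(OUT, '>', \$buffer) or die "cannot open in-memory file: $!";

# The bareword OUT cannot be passed as is; pass a reference to its typeglob.
write_line(\*OUT, 'ACGT');
close(OUT);

print $buffer;
```

The same \*OUT form is what the -fh parameter of Bio::SeqIO->newFh receives in the line above.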
6.2.3. Exceptions
When using a bioperl module, the following can happen:
-------------------- EXCEPTION --------------------
MSG: Attempting to set the sequence to [==] which does not look healthy
STACK Bio::PrimarySeq::seq /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:243
STACK Bio::PrimarySeq::new /local/lib/perl5/site_perl/5.6.0/Bio/PrimarySeq.pm:218
STACK Bio::Seq::new /local/lib/perl5/site_perl/5.6.0/Bio/Seq.pm:132
STACK Bio::SeqIO::fasta::next_primary_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.p
STACK Bio::SeqIO::fasta::next_seq /local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/fasta.pm:85
STACK toplevel solution/blast_features.pl:22
-------------------------------------------
bioperl uses the perl exception mechanisms (die and eval) to handle usage errors. An exception is raised by the perl module and can be caught:
throw : in bioperl, this is the method for raising exceptions. It is provided by the Bio::Root::RootI [http://doc.bioperl.org/releases/bioperl-1.0/Bio//Root/RootI.html] class:
sub throw {
    my ($self,$string) = @_;
    my $std = $self->stack_trace_dump();
    my $out = "-------------------- EXCEPTION --------------------\n".
              "MSG: ".$string."\n".$std.
              "-------------------------------------------\n";
    die $out;
}
You can catch exceptions with eval (don't forget the ; at the end of the eval block):
eval {
    # this instruction may throw an exception
    # if parameters are incorrect
    $report = $factory->bl2seq($querySeq, $subjectSeq );
};
if( $@ ) {
    print "Caught exception";
} else {
    print "no exception";
}
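The die/eval mechanism can also be tried without bioperl. The following is a minimal sketch in core Perl; check_sequence and try_sequence are hypothetical names, and check_sequence merely stands in for a bioperl method that throws:

```perl
use strict;
use warnings;

# Hypothetical validator, standing in for a method that raises an exception.
sub check_sequence {
    my ($seq) = @_;
    die "MSG: sequence [$seq] does not look healthy\n" if $seq !~ /^[A-Za-z]+$/;
    return length $seq;
}

# Run the validator inside eval and report either the result or the error.
sub try_sequence {
    my ($seq) = @_;
    my $len = eval { check_sequence($seq) };
    return $@ ? "caught: $@" : "ok: $len";
}

print try_sequence('ACGT'), "\n";
print try_sequence('==');
```

When the block dies, eval returns undef and the message is left in $@, which is what the test on $@ relies on.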
In bioperl, exceptions are also used to implement interface classes; the following example, in Bio::PrimarySeqI [http://doc.bioperl.org/releases/bioperl-1.0/Bio/PrimarySeqI.html]:
sub seq {
    my ($self) = @_;
    if( $self->can('throw') ) {
        $self->throw("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    } else {
        confess("Bio::PrimarySeqI definition of seq - implementing class did not provide this method");
    }
}
is a means of forcing the subclass to implement a seq method. It is thus a way to encode abstract methods.
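The same idea can be sketched outside bioperl. In the following illustration all package names (My::ShapeI, My::Square, My::Blob) are hypothetical; the interface class provides an area method that only dies, so a subclass that forgets to override it fails at call time:

```perl
use strict;
use warnings;

package My::ShapeI;    # hypothetical interface class
sub new { my ($class, %args) = @_; return bless {%args}, $class; }
sub area {
    my ($self) = @_;
    # "abstract" method: reached only if the subclass did not override it
    die ref($self) . " did not implement area\n";
}

package My::Square;    # implements the interface
our @ISA = ('My::ShapeI');
sub area { my ($self) = @_; return $self->{side} ** 2; }

package My::Blob;      # forgets to implement area
our @ISA = ('My::ShapeI');

package main;

print My::Square->new(side => 3)->area, "\n";
my $ok = eval { My::Blob->new->area; 1 };
print "error: $@" unless $ok;
```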
6.2.4. Getopt::Std
To handle command line options, see man Getopt::Std. Example:
use Getopt::Std;

my %opts = ();
getopts('i:o:I:O:', \%opts);
my $infile  = $opts{I} || 'infile';
my $outfile = $opts{O} || 'outfile';
my $inform  = $opts{i} || 'swiss';
my $outform = $opts{o} || 'fasta';
Another example:
use Getopt::Std;

getopts('f:plgh');
if ($opt_f) {
    $format = $opt_f;
} else {
    $format = "Genbank";
}
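To experiment with getopts without retyping command lines, the argument list can be supplied from within the script. This is only a sketch: parse_formats is a hypothetical wrapper, and the option letters and defaults are borrowed from the first example above:

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical wrapper: parses the given list instead of the real command line.
sub parse_formats {
    local @ARGV = @_;               # getopts always works on @ARGV
    my %opts;
    getopts('hi:o:', \%opts) or die "bad options\n";
    return {
        help    => exists $opts{h} ? 1 : 0,
        inform  => $opts{i} || 'swiss',
        outform => $opts{o} || 'fasta',
        files   => [@ARGV],         # what remains after option parsing
    };
}

my $conf = parse_formats('-i', 'embl', 'entry.dat');
print "$conf->{inform} -> $conf->{outform} (@{ $conf->{files} })\n";
```

In the option string, a letter followed by a colon takes a value, while a bare letter such as h is a simple flag.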
6.2.5. Classes
Inheritance, example in Bio::Seq [http://doc.bioperl.org/releases/bioperl-1.0/Bio/Seq.html]:
@ISA = qw(Bio::Root::RootI Bio::SeqI);
In bioperl, you use classes in the following ways:
By instantiating a class, most often with a new method (this is a coding convention; see also newFh [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqIO.html#newFh], from_searchResult [http://doc.bioperl.org/releases/bioperl-1.0/Bio/SeqFeature/SimilarityPair.html#from_searchResult], ...):
$seq_stats = Bio::Tools::SeqStats->new ($seqobj);
As a library component.
imports symbols tagged as obj - here: $Blast, a static Bio::Tools::Blast object for a restricted use of the module methods.
6.3.3. Tie
tie: associates a class with a variable. Whenever a variable is "tied", read and write statements (access, assignment, ...) trigger a call to predefined subroutines (FETCH, STORE, PRINT, READLINE...). You can find an example in Bio::SeqIO:
sub fh {
    my $self = shift;
    my $class = ref($self) || $self;
    my $s = Symbol::gensym;
    tie $$s,$class,$self;
    return $s;
}
which returns a filehandle tied to the Bio::SeqIO::Fh class. As explained in Tying-FileHandles, the Bio::SeqIO class thus has to redefine these methods:
sub READLINE {
    my $self = shift;
    return $self->{seqio}->next_seq() unless wantarray;
    my (@list, $obj);
    push @list, $obj while $obj = $self->{seqio}->next_seq();
    return @list;
}

sub PRINT {
    my $self = shift;
    $self->{seqio}->write_seq(@_);
}
These methods enable the programmer to read and print through the variable:
$fh = $obj->fh;         # make a tied filehandle
$sequence = <$fh>;      # read a sequence object (calls READLINE)
print $fh $sequence;    # write a sequence object (calls PRINT)
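The tied-filehandle mechanism can be reproduced in a few lines of core Perl. In this sketch My::Counter is a hypothetical class, unrelated to bioperl: reading from the handle yields successive integers, and printing to it stores the items in the object:

```perl
use strict;
use warnings;
use Symbol;

package My::Counter;    # hypothetical class: reading yields 1, 2, 3, ...
sub TIEHANDLE {
    my ($class, $max) = @_;
    return bless { n => 0, max => $max, log => [] }, $class;
}
sub READLINE {
    my ($self) = @_;
    return undef if $self->{n} >= $self->{max};   # signals "end of file"
    return ++$self->{n};
}
sub PRINT {
    my ($self, @items) = @_;
    push @{ $self->{log} }, @items;               # remember what was printed
    return 1;
}

package main;

my $fh  = Symbol::gensym;                  # anonymous glob, as in Bio::SeqIO::fh
my $obj = tie *$fh, 'My::Counter', 3;      # tie returns the underlying object
my @read;
while (defined(my $item = <$fh>)) {        # each read calls READLINE
    push @read, $item;
}
print $fh 'a', 'b';                        # calls PRINT
print "read: @read; printed: @{ $obj->{log} }\n";
```

Unlike the Bio::SeqIO version, this READLINE handles only scalar context, which keeps the sketch short.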
Appendix A. Solutions
A.1. Sequences
Solution A.1. Bio::SeqIO structure
Exercise 2.1
#! /local/bin/perl -w
# exercise 1.1 : UnivConvert via file
use strict;
use Bio::SeqIO;

my $inform  = shift @ARGV;
my $outform = shift @ARGV;
my $infile  = shift @ARGV;
my $outfile = shift @ARGV;
if ( ! defined $outfile ) {
    $outfile = $infile;
    $outfile =~ s/\..*$// if $outfile =~ /\./;
    $outfile .= "." . $outform;
}
my $in = Bio::SeqIO->newFh ( -file   => $infile,
                             -format => $inform );
my $out = Bio::SeqIO->newFh ( -file   => ">$outfile",
                              -format => $outform );
print $out $_ while <$in>;
# exercise 1.1 : UnivConvert with option and stream
use strict;
use Getopt::Std;
use File::Basename;    # provides basename, used in usage()
use Bio::SeqIO;

my %opts = ();
getopts('hi:o:', \%opts);
usage() if (exists $opts{h});
my $inform  = lc($opts{i}) || 'swiss';
my $outform = lc($opts{o}) || 'fasta';
(format_ok($inform) && format_ok($outform)) || exit 1;
my $in = Bio::SeqIO->newFh ( -fh     => \*STDIN,
                             -format => $inform );
my $out = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                              -format => $outform );
print $out $_ while <$in>;

sub format_ok {
    my $form = shift @_;
    my %formats = qw(fasta 1 swiss 1 embl 1 genbank 1 scf 1 pir 1 gcg 1
                     raw 1 ace 1 bsml 1 fastq 1 phd 1 qual 1);
    if (! exists $formats{$form}) {
        print STDERR $form, " -- unknown format\n";
        usage ();
        return 0; # false
    }
    return 1; # true
}

sub usage {
    my $prog = basename $0;
    chomp($prog);
    print "\n";
    print "$prog convert files from one format into another format\n";
    print "\n";
    print "$prog [-i informat] [-o outformat] < infile\n";
    print "\n";
    print "-i informat -- infile format (default swiss)\n";
    print "-o outformat -- outfile format (default fasta)\n";
    print "\n";
    print "-h -- print this message\n";
    print "\n";
    print "Available formats: fasta, swiss, EMBL, GenBank, SCF, Pir, GCG, raw and ACE\n";
}
A.2. Alignments
Solution A.6. Create an alignment without gaps
Exercise 3.1
#! /local/bin/perl -w
###
### Create a new alignment without gaps at the beginning and the end of an alignment
use strict;
use Bio::AlignIO;

# get alignment
my $in;
if ( $#ARGV > -1 ) {
    $in = new Bio::AlignIO ( -file => $ARGV[0], -format => 'clustalw' );
} else {
    $in = new Bio::AlignIO ( -fh => \*STDIN, -format => 'clustalw' );
}
my $out = newFh Bio::AlignIO ( -fh => \*STDOUT, -format => 'clustalw' );
my $aln = $in->next_aln();

# find max column index of starting gaps and
# min column index of ending gaps
my $startgaps = 0;
my $endgaps = 0;
my $gap_char = $aln->gap_char();
foreach my $seq ($aln->each_seq) {
    my $str = $seq->seq();
    $str =~ /^(\Q$gap_char\E*)/;
    my $len = length($1);
    $startgaps = ($startgaps > $len) ? $startgaps : $len;
    # use the alignment's gap character here too, not a hard-coded '-'
    $str =~ /(\Q$gap_char\E*)$/;
    $len = length($1);
    $endgaps = ($endgaps > $len) ? $endgaps : $len;
}

# cut the starting and ending block containing gaps
print $out $aln->slice($startgaps+1, $aln->length()-$endgaps);
A.3. Analysis
Solution A.7. Running Blast on a Swissprot entry
Exercise 4.1
use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;

# a) solution if you have a local program to fetch
#    database entries
#    (ftp://ftp.pasteur.fr/pub/GenSoft/unix/db_soft/golden/)
#my $Seq_in = Bio::SeqIO->new (-file => 'golden sp:TAUD_ECOLI |',
#                              -format => 'swiss');
#my $query = $Seq_in->next_seq();

# b) else:
use Bio::DB::SwissProt;
my $database = new Bio::DB::SwissProt;
my $query = $database->get_Seq_by_id('TAUD_ECOLI');

my $factory = Bio::Tools::Run::StandAloneBlast->new(
    program     => 'blastp',
    database    => 'swissprot',
    _READMETHOD => "Blast" );
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(),
          " significance: ", $hit->significance(), "\n";
}
    database    => 'swissprot',
    _READMETHOD => "Blast" );
$factory->outfile('blast.out');
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    print "\thit name: ", $hit->name(),
          " significance: ", $hit->significance(), "\n";
}
while ( my @rids = $factory->each_rid ) {
    print STDERR "waiting...\n" ;
    # RID = Remote Blast ID (e.g: 1017772174-16400-6638)
    foreach my $rid ( @rids ) {
        my $rc = $factory->retrieve_blast($rid);
        if( !ref($rc) ) {
            if( $rc < 0 ) {
                # retrieve_blast returns -1 on error
                $factory->remove_rid($rid);
            }
            # retrieve_blast returns 0 on job not finished
            sleep 5;
        } else {
            #---- Blast done ----
            $factory->remove_rid($rid);
            my $result = $rc->next_result;
            print "database: ", $result->database_name(), "\n";
            while( my $hit = $result->next_hit ) {
                print "hit name is: ", $hit->name, "\n";
                while( my $hsp = $hit->next_hsp ) {
                    print "score is: ", $hsp->score, "\n";
                }
            }
        }
    }
}
my $factory = Bio::Tools::Run::StandAloneBlast->new(
    program     => 'blastp',
    database    => 'swissprot',
    _READMETHOD => "Blast" );
my $blast_report = $factory->blastall($query);
print "Class of blast_report is: ", ref($blast_report), "\n";
One way to use this code is by feeding a blastall output directly to the script through a Unix pipe:
blastall -p blastp -d swissprot -i data/1prot.fasta | perl solution/blast_parse_stdin.pl
    -file => $ARGV[0]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $dbname = dbname($hit->name());
    if ($done{$dbname}) {
        next;
    } else {
        $done{$dbname} = 1;
    }
    print "\thit name: ", $hit->name(), "\n";
    while( my $hsp = $hit->next_hsp()) {
        print "\t\tE: ", $hsp->evalue(),
              " frac_identical: ", $hsp->frac_identical(), "\n";
    }
}
my $seq_in = Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');
my $seq = $seq_in->next_seq();
my $contig_seq = Bio::LocatableSeq->new(
    -seq   => $seq->seq,
    -id    => $seq->display_id,
    -start => 1,
    -end   => $seq->length );
my $aln = Bio::SimpleAlign->new();
$aln->add_seq($contig_seq);

my $blast_report = new Bio::SearchIO (-format => 'blast', -file => $ARGV[1]);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
    my $i = 0;
    while( my $hsp = $hit->next_hsp()) {
        $i++;
        my $start = $hsp->hit->start();
        my $end = $hsp->hit->end();
        my $name = $result->query_name . "_HSP_$i";
        my $seq = fill_gaps($hsp->query_string, $start, $aln->length);
        my $query_seq = Bio::LocatableSeq->new(
            -seq   => $seq,
            -id    => $name,
            -start => $start,
            -end   => $end );
        $aln->add_seq($query_seq);
    }
}
my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
print $out $aln;

sub fill_gaps {
    # add gaps from 0 to start and from start + seq length to end
    my ($seq, $start, $length) = @_;
    my $gapped_seq = "";
    for (my $i = 0; $i < $start - 1; $i++) {
        $gapped_seq .= "-";
    }
    $gapped_seq .= $seq;
    for (my $i = $start + length($seq); $i <= $length; $i++) {
        $gapped_seq .= "-";
    }
    return $gapped_seq;
}
Example of use:
perl solution/blast_locate.pl data/contig data/blast-est.out
foreach my $db (@dbs) {
    my $factory = Bio::Tools::Run::StandAloneBlast->new(database => $db,
                                                        program  => 'blastp');
    my $report = $factory->blastall($query);
    while (my $sbjct = $report->nextSbjct) {
        print STDERR "name : ", $sbjct->name, "\n";
        while (my $hsp = $sbjct->nextHSP) {
            print STDERR "GFF:\n", $hsp->query()->gff_string(), "\n";
            $query->add_SeqFeature($hsp);
        }
        last;
    }
}
foreach my $feat ($query->all_SeqFeatures()) {
    print STDERR "Feature from : ", $feat->start, " to : ", $feat->end,
                 " Primary tag : ", $feat->primary_tag,
                 ", produced by ", $feat->source_tag(), "\n";
    $feat->primary_tag("BLAST");
}
$out = Bio::SeqIO->newFh(-fh => \*STDOUT , -format => 'swiss');
print $out $query;
for my $iteration (1 .. $blast_report->number_of_iterations()) {
    print "\n-----------------------------\niteration: $iteration\n";
    my $result = $blast_report->round($iteration);
    my $hitarray_ref = $result->newhits;
    foreach my $hit (@{ $hitarray_ref }) {
        print "NEW hit name: $hit\n";
    }
    my $oldhitarray_ref = $result->oldhits;
    foreach my $hit (@previous_hitarray) {
        if (grep /\Q$hit\E/, @{ $oldhitarray_ref }) {
            next;
        }
        print "DISAPPEARED hit name: $hit\n";
    }
    push (@previous_hitarray, @{ $oldhitarray_ref });
    push (@previous_hitarray, @{ $hitarray_ref });
}
# the first file name was lost in the listing; $ARGV[0] is inferred from $ARGV[1] below
my $query1_in = Bio::SeqIO->newFh ( -file   => $ARGV[0],
                                    -format => 'fasta' );
my $query1 = <$query1_in>;
my $query2_in = Bio::SeqIO->newFh ( -file   => $ARGV[1],
                                    -format => 'fasta' );
my $query2 = <$query2_in>;

my $out = Bio::AlignIO->newFh(-format => 'clustalw' );
$factory = Bio::Tools::Run::StandAloneBlast->new(program => 'blastp');
$report = $factory->bl2seq($query1, $query2);
while (my $hsp = $report->next_feature) {
    my $aln = Bio::SimpleAlign->new();
    my $querySeq = Bio::LocatableSeq->new(
        -seq   => $hsp->qs,
        -id    => $query1->display_id,
        -start => 1,
        -end   => $query1->length );
    my $sbjctSeq = Bio::LocatableSeq->new(
        -seq   => $hsp->ss,
        -id    => $report->sbjctName,
        -start => 1,
        -end   => $query2->length );
    $aln->add_seq($querySeq);
    $aln->add_seq($sbjctSeq);
    print $out $aln;
}
A.4. Databases
Solution A.27.
Exercise 5.1
    # add all detected UTRs as SeqFeature to the entry
    foreach my $utr ($gene->utrs()) {
        $entry->add_SeqFeature($utr);
    }
    # print the entry to the database flatfile in embl format
    print $banque $entry;
}
Solution A.28. Parse a Genscan report and build a database entry with the genomic sequence
Exercise 5.2
#! /local/bin/perl -w
# Genscan, build a database
# - 1 entry per gene with genomic DNA
# + 1 feature for the whole CDS
# + 1 feature per exon
use strict;
use Bio::Tools::Genscan;
use Bio::SeqIO;
use Bio::Seq::RichSeq;
use Bio::Location::Split;
use Bio::SeqFeature::Generic;

my $genscan_file = shift @ARGV;
my $seqin = Bio::SeqIO->new ( -fh => \*STDIN, -format => 'fasta');
# sequence submitted to genscan
my $seq = $seqin->next_seq();
# genscan report
my $genscan = Bio::Tools::Genscan->new(-file => $genscan_file);
# database to construct
my $banque = Bio::SeqIO->newFh ( -fh     => \*STDOUT,
                                 -format => 'embl');
# +- delta of the subsequence for each entry in database
my $delta = 100;
# date of construction
my $date = gmtime;
# print $banque $seq;

while (my $gene = $genscan->next_prediction()) {
    # create entry object as database sequence
    my $entry = Bio::Seq::RichSeq->new();
    $entry->add_date($date);
    $entry->molecule('DNA');
    $entry->division('PLN');
    # extract id of genscan prediction
    my $cds = $gene->predicted_cds;
    my @ids = split(/\|/, $cds->id());
    # annotation
    my $start = 0;
    my $end = 0;
    my $loc;
    my $first = 1;
    my $newloc = new Bio::Location::Split();
    my $strand = 1;
    my $nocds = 0;
    foreach my $exon ($gene->exons()) {
        $loc = $exon->location();
        if ($first) {
            $first = 0;
            $start = $loc->start-$delta;
        }
        if ($start < 0) {
            $start = 0;
        }
        $loc->start($loc->start-$start);
        $loc->end ($loc->end-$start);
        $exon->add_tag_value('source', 'genscan');
        $exon->primary_tag =~ m/(.*)Exon/;
        $exon->primary_tag('Exon');
        $exon->add_tag_value(lc($1), 1);
        # add all detected exons as SeqFeature to the entry
        $entry->add_SeqFeature($exon);
        # construct location for CDS
        if ($strand == -1 && $loc->strand == 1) {
            $newloc->warn("exons are coded on different strands");
            $nocds = 1;
        }
        if ($loc->strand == -1) {
            $loc->strand(1);
            $newloc->strand(-1);
        }
        $newloc->add_sub_Location($loc);
    }
    $end = $loc->end+$delta;
    if ($end > $seq->length()) {
        $end = $seq->length();
    }
    if (! $nocds) {
        # construct a SeqFeature CDS for the entry (join all exons)
        my $newcds = Bio::SeqFeature::Generic->new ( -primary  => 'CDS',
                                                     -location => $newloc);
        # add the translation to the CDS Feature
        $newcds->add_tag_value ('translation',
                                $gene->predicted_protein()->seq());
        # add the CDS as Feature to the entry
        $entry->add_SeqFeature($newcds);
    }
    # construct the sequence for the entry
    # subsequence of the sequence submitted to genscan +- delta at each side
    my $seqstr;
    if ( $start > $end ) {
        $seqstr = $seq->subseq($end, $start);
    } else {
        $seqstr = $seq->subseq($start, $end);
    }
    my $newseq = Bio::PrimarySeq->new ( -seq     => $seqstr,
                                        -id      => $ids[1] . " ",
                                        -moltype => 'dna');
    $entry->primary_seq($newseq);
    # print the entry to the database flatfile in embl format
    print $banque $entry;
}
Solution A.29. Build a small bioperl module (for the golden program)
Exercise 5.3
# Build a small bioperl module
package LocalDBAccess;
use strict;
use vars qw(@ISA %AVAIL_LOCALDB);
use Bio::DB::RandomAccessI;
use Bio::SeqIO;
#use Bio::Seq;

BEGIN {
    %AVAIL_LOCALDB = (
        'sprot'    => 'swiss',
        'sptrnrdb' => 'swiss',
        'gb'       => 'genbank',
        'embl'     => 'embl'
    );
}

@ISA = qw(Bio::DB::RandomAccessI);

sub get_Seq_by_id {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-i');
}

sub get_Seq_by_acc {
    my ($self, @args) = @_;
    return $self->get_Seq(@args, '-a');
}

sub get_Seq {
    my ($self, $entry, $access) = @_;
    my ($db, $id) = split(':', $entry);
    if ($access !~ /^-[ia]$/) {
        $access = "";
    }
    if (!$id) {
        $self->throw ("no database specified");
    }
    if (!_has_local_DB($db)) {
        $self->throw ("$db -- no such database available");
    }
    my $in = Bio::SeqIO->new(
        -file   => "golden $access $entry 2> /dev/null |",
        -format => _get_seq_format($db));
    my $seq = $in->next_seq();
    # $in->DESTROY();
    # if ($? >> 8) {
    #     $self->throw ("$id -- no such entry in database $db");
    # }
    $self->throw ("$id -- no such entry in database $db") if (! defined $seq);
    return $seq;
}

sub _has_local_DB {
    my ($db) = @_;
    if (exists $AVAIL_LOCALDB{$db}) {
        return 1;
    }
    return 0;
}

sub _get_seq_format {
    my ($db) = @_;
    return $AVAIL_LOCALDB{$db};
}

1;