Gene Finding
Rapid developments in DNA sequencing techniques have led to a tremendous increase in the size of biological databases. Once the whole DNA of an organism has been sequenced, the next major task is to predict the protein-coding DNA (exons) present in that sequence. This task, known as gene finding, is one of the major challenges in the analysis of newly sequenced genomes. The most critical aspect of such studies is the accuracy with which the exact protein-coding DNA is predicted. Although state-of-the-art tools report high accuracy, their performance varies with the length of the sequence being analysed. This work began with an investigation of the relationship between prediction accuracy and the length of the DNA sequence analysed for gene prediction. The preliminary results implied that length and accuracy are correlated. Hence, in this work, we have developed different models specialized for different length ranges, from less than 500 nucleotides to greater than 10000 nucleotides. From a set of features that can discriminate between exons and introns, we have identified the features that are most powerful for each length range. The models were trained on these features, and a tool was developed that assigns each input sequence to the model tuned for its length range and returns that model's prediction. The proposed approach, which employs AdaBoost.M1 with random forests as the base classifier, shows a considerable improvement in prediction accuracy.
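The length-based assignment described above can be sketched as follows. This is a minimal illustration, not the tool itself: only the first range (less than 500 nucleotides) and the last range (greater than 10000 nucleotides) are stated in the text, so the intermediate cut points, the names `LENGTH_BOUNDS`, `length_range_index` and `predict`, and the representation of each trained model as a plain callable are all assumptions made for the example.

```python
from bisect import bisect_right

# Exclusive upper bounds (in nucleotides) of all but the last length range.
# Only "< 500" and "> 10000" come from the text; 1000 and 5000 are
# illustrative assumptions. 10001 is used so that a length of exactly
# 10000 stays below the final "> 10000" range.
LENGTH_BOUNDS = [500, 1000, 5000, 10001]


def length_range_index(sequence: str) -> int:
    """Return the index of the length range that the sequence falls into."""
    return bisect_right(LENGTH_BOUNDS, len(sequence))


def predict(sequence: str, models):
    """Send the sequence to the model tuned for its length range.

    `models` is a hypothetical list of callables, one per length range,
    each standing in for a trained classifier.
    """
    model = models[length_range_index(sequence)]
    return model(sequence)
```

With five placeholder models, a 600-nucleotide sequence would be handled by the second one (index 1), while any sequence longer than 10000 nucleotides would be handled by the last.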
ACKNOWLEDGEMENT
This thesis would not have been possible without the assistance and support of many people. I would sincerely like to thank my supervisor, Dr. Achuthsankar S. Nair, HOD, Dept. of Computational Biology & Bioinformatics, University of Kerala, for offering me this thesis topic and for supporting and guiding me throughout my research. His teaching will have a lasting impact on my future academic and professional career. I would also like to thank my internal guide, Ms. Muneera C.R., Associate Professor, Dept. of Electronics & Communication, GEC Thrissur, and my external guide, Ms. Baharak Goli, Research Scholar, Dept. of Computational Biology & Bioinformatics, University of Kerala, for their time and support during the completion of my thesis. I take this opportunity to thank Dr. Sheeba V.S., HOD, Dept. of Electronics & Communication, GEC Thrissur, and the project coordinators, Mr. Mohammed Salih K.K., Assistant Professor, Dept. of Electronics & Communication, GEC Thrissur, and Mr. Roy Francis, Assistant Professor, Dept. of Electronics & Communication, GEC Thrissur. Last but not least, I would also like to acknowledge the invaluable support and encouragement of my family and friends. I greatly appreciate their support.
LIST OF TABLES
Title
FrameD Result
GeneMark Result
Distribution
Mapping Schemes
Physicochemical properties of nucleotides
LIST OF FIGURES
Figure No.  Title
1.1  Accuracy of FrameD
1.2  Accuracy of GeneMark
2.1  DNA Structure
2.2  DNA Replication
2.3  Central Dogma
2.4  Codons for Amino Acids
2.5  Eukaryotic DNA
2.6  Classical Approaches to Gene Finding
DNA  Deoxyribo Nucleic Acid
Feature
Spectral Content
Paired Spectral Content
Spectral Rotation
Positional Frequency Distribution of Nucleotides
AMDF
CV  Cross Validation