Gene Finding

This thesis aims to improve the accuracy of gene prediction from DNA sequences by developing models specialized for different ranges of sequence lengths. Existing gene prediction tools have varying accuracy depending on the length of the input sequence. The author investigates the relationship between accuracy and sequence length, finding a correlation between the two. Different models are developed for ranges of lengths from less than 500 nucleotides to over 10,000 nucleotides. Features discriminating exons from introns are identified for each length range. Models trained on these features using Adaboost.M1 and random forests show improved prediction accuracy compared to existing tools. The thesis presents the developed length-specific gene finding tool and evaluates its performance on test datasets.

Uploaded by

Vineetha Mary Ipe


ABSTRACT

Rapid advances in DNA sequencing techniques have led to a tremendous increase in the size of biological databases. Once the whole DNA of an organism has been sequenced, the next major task is to predict the protein-coding DNA (exons) present in that sequence. This task, known as gene finding, is one of the major challenges in the analysis of newly sequenced genomes. The most critical requirement is accuracy in predicting the exact protein-coding DNA. Although state-of-the-art tools report high accuracy, their performance varies with the length of the sequence being analysed. This work began with an investigation of the relationship between accuracy and the length of the DNA sequence being analysed for gene prediction. The preliminary results implied that length and accuracy are correlated. Hence, in this work we have developed different models specialized for different ranges of length, from less than 500 nucleotides to greater than 10000 nucleotides. From a set of features that could discriminate between exons and introns, we have identified the features most powerful for each length range. Based on these features we have trained the models and developed a tool that assigns an input sequence to the model tuned for its length range and returns the prediction. The proposed work, which employs Adaboost.M1 with random forests as the base classifier, shows a considerable improvement in prediction accuracy.
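The length-based routing described above can be illustrated with a minimal Python sketch. It is not the thesis implementation. Only the outer limits (500 and 10000 nucleotides) come from the abstract. The intermediate bin edges, the treatment of sequences lying exactly on an edge, and the names `EDGES`, `length_bin`, `make_stub_model`, `MODELS` and `predict` are assumptions made for illustration. The stub models stand in for the trained classifiers, which are not reproduced here.

```python
from bisect import bisect_right
from typing import Callable, List

# Assumed bin edges in nucleotides. Only 500 and 10000 appear in the abstract;
# 1000 and 5000 are hypothetical intermediate edges.
EDGES: List[int] = [500, 1000, 5000, 10000]


def length_bin(seq: str, edges: List[int] = EDGES) -> int:
    """Return the index of the length range that contains len(seq).

    Bin 0 holds sequences shorter than edges[0]; the last bin holds
    sequences at or above edges[-1] (an assumed boundary convention).
    """
    return bisect_right(edges, len(seq))


def make_stub_model(name: str) -> Callable[[str], str]:
    """Build a placeholder for a model trained on one length range."""
    def model(seq: str) -> str:
        return f"{name} handled {len(seq)} nt"
    return model


# One placeholder model per length range (hypothetical).
MODELS = [make_stub_model(f"model_{i}") for i in range(len(EDGES) + 1)]


def predict(seq: str) -> str:
    """Route the sequence to the model for its length range."""
    return MODELS[length_bin(seq)](seq)
```

Keeping the edges in a sorted list and looking up the bin with `bisect_right` means adding or moving a length range only requires changing `EDGES` and supplying a matching model.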

ACKNOWLEDGEMENT

This thesis would not have been possible without the assistance and support of many people. I would sincerely like to thank my supervisor Dr. Achuthsankar S. Nair, HOD, Dept. of Computational Biology & Bioinformatics, University of Kerala, for offering me this thesis topic and for supporting and guiding me throughout my research. His teaching will definitely have a continuing impact on my future academic and professional career. I would also like to thank my internal guide Ms. Muneera C.R., Associate Professor, Dept. of Electronics & Communication, GEC Thrissur, and my external guide Ms. Baharak Goli, Research Scholar, Dept. of Computational Biology & Bioinformatics, University of Kerala, for their time and support during the completion of my thesis. I would like to take this opportunity to thank Dr. Sheeba V.S., HOD, Dept. of Electronics & Communication, GEC Thrissur, and the project coordinators Mr. Mohammed Salih K.K., Assistant Professor, Dept. of Electronics & Communication, GEC Thrissur, and Mr. Roy Francis, Assistant Professor, Dept. of Electronics & Communication, GEC Thrissur. Last but not least, I would also like to acknowledge the invaluable support and encouragement of my family and friends. I greatly appreciate their support.

LIST OF TABLES

Table No.   Title
3.1         FrameD Result
3.2         GeneMark Result
3.3         Distribution
3.4         Mapping Schemes
3.5         Physicochemical properties of nucleotides
3.6         Summary of Filters
3.7         Feature Vector
3.8         Attribute Selection
3.9         Comparison of Various Classifier Methods
4.1         Self Consistency Test Results
4.2         Independent Dataset Test Results
5.1         Comparison of Prediction Accuracy

LIST OF FIGURES

Figure No.  Title
1.1         Accuracy of FrameD
1.2         Accuracy of GeneMark
2.1         DNA Structure
2.2         DNA Replication
2.3         Central Dogma
2.4         Codons for Amino Acids
2.5         Eukaryotic DNA
2.6         Classical Approaches to Gene Finding
3.1         General Schematic Representation of Work
3.2         Spectral Content with Nucleotide Position
3.3         Feature Extraction using Mapping Techniques
3.4         Tool Developed
3.5         Working of Length Specific Gene Finding Tool

LIST OF ABBREVIATIONS AND ACRONYMS

DNA         Deoxyribo Nucleic Acid
f           Feature
SC          Spectral Content
PSC         Paired Spectral Content
SR          Spectral Rotation
PFDN        Positional Frequency Distribution of Nucleotides
AMDF        Average Magnitude Difference Function

Cross Validation
