Bioinfo PPT Unit 1 Half
APPLICATIONS
BT203CC
CONTENTS
INTRODUCTION
Bioinformatics basics
INTRODUCTION
Bioinformatics is a subdiscipline of
biology and computer science
concerned with the acquisition, storage,
analysis, and dissemination of
biological data, most often DNA and
amino acid sequences. Bioinformatics
uses computer programs for a variety
of applications, including determining
gene and protein functions, establishing
evolutionary relationships, and
predicting the three-dimensional shapes
of proteins.
Bioinformatics is a field of computational science that has to do with
the analysis of sequences of biological molecules. It usually refers to
genes, DNA, RNA, or protein, and is particularly useful in
comparing genes and other sequences in proteins and other
sequences within an organism or between organisms, looking at
evolutionary relationships between organisms, and using the patterns
that exist across DNA and protein sequences to figure out what their
function is. You can think about bioinformatics as essentially the
linguistics part of genetics. That is, the linguistics people are looking
at patterns in language, and that's what bioinformatics people do--
looking for patterns within sequences of DNA or protein.
Christopher P. Austin, M.D.
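As a minimal illustration of "looking for patterns within sequences", the Python sketch below computes the GC content of a DNA string and finds every position of a short motif in it. The sequence and the motif are made-up examples, not data from any database, and the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_motif(seq: str, motif: str) -> list[int]:
    """Return 0-based start positions of every (possibly overlapping) motif match."""
    seq, motif = seq.upper(), motif.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)  # step by one so overlaps are found
    return positions


# Hypothetical example sequence, chosen only for illustration.
dna = "ATGCGATATAGCGC"
print(gc_content(dna))         # 0.5
print(find_motif(dna, "ATA"))  # [5, 7]
```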
LET’S DIVE IN
Definition:
The term “Bioinformatics” was coined by Paulien Hogeweg and Ben Hesper in
1970 as "the study of informatic processes in biotic systems". Bioinformatics is
the use of information technology in biotechnology for the storage, warehousing,
and analysis of DNA sequence data.
OPERATING
SYSTEMS
An Operating System (OS)
is a collection of software
that manages computer
hardware and provides
services for programs.
Specifically, it hides
hardware complexity,
manages computational
resources, and provides
isolation and protection .
OPERATING SYSTEMS
• An operating system, or "OS," is software that communicates with the hardware and
allows other programs to run.
• Every desktop computer, tablet, and smartphone includes an operating system that
provides basic functionality for the device.
• While each OS is different, most provide a graphical user interface, or GUI, that includes
a desktop and the ability to manage files and folders.
• They also allow you to install and run programs written for the operating system.
Windows and Linux can be installed on standard PC hardware, while macOS (formerly OS X) is
designed to run on Apple systems. Therefore, the hardware you choose affects which
operating system(s) you can run.
OPERATING SYSTEMS USED IN BIOINFORMATICS
The vision for Cloud BioLinux is to offer a base image of genome analysis resources for
cloud computing platforms such as Amazon EC2. This Science as a Service (ScaaS) model
allows life science software and supporting data sets to be incorporated, developed, and
optimized on compute clouds. The project is driven by the observation that commonly used
bioinformatics tools are hard to build and maintain, require large amounts of resources,
or are simply too numerous to choose from.
Debian Med is a "Debian Pure Blend" that aims to develop Debian into an operating
system particularly well suited to the requirements of medical practice and research.
Its goal is a complete system for all tasks in medical care, built entirely on
free software.
Ubuntu Biology Packages
Ubuntu Linux includes many packages useful for biologists.
• Seaview - A sequence editor with multiple alignment (via ClustalW) and other
capabilities
Other Biology software not in Ubuntu
• LDhat - A package written in the C language for the analysis of recombination
rates from population genetic data.
• Visualization Toolkit (VTK) - An open-source software toolkit for image processing and
data visualization.
• Squint - A sequence editor with automated alignment, translation, etc.
• DAWG - DNA Alignment With Gaps - A sequence simulation program similar to Seq-Gen
that allows the user to simulate the evolution of DNA sequences along a phylogeny.
Includes all common models of sequence evolution as well as a model of indel formation.
• FaMoZ - Software written in C and Tcl/Tk which uses likelihood calculations and
simulations to perform parentage studies with dominant, codominant and cytoplasmic
markers.
• ApE - A Plasmid Editor. ApE is an easy-to-use DNA sequence editor with many nice
features. ApE is written in Tcl/Tk script. You will need to download the appropriate Tclkit
version for your Linux distribution.
DATABASES
• A database is a computerized
archive used to store and organize
data in such a way that information
can be retrieved easily via a variety
of search criteria.
• Data retrieval is the main purpose of
all databases.
• A database makes it easy to handle and
share large amounts of data, and
supports large-scale analysis through easy
access and data updating.
BIOLOGICAL DATABASES
• Biological databases are libraries of life sciences information, collected from
scientific experiments, published literature, high-throughput experiment
technology and computational analysis.
o Primary databases
o Secondary databases
o Specialized databases
PRIMARY DATABASES
• Primary databases contain original biological data.
• They are archives of raw sequence or structural data submitted by
the scientific community.
• Examples: GenBank, the European Molecular Biology Laboratory (EMBL)
database, and the DNA Data Bank of Japan (DDBJ)
o Freely available on the Internet
o Closely collaborate and exchange new data daily
o Together they constitute the International Nucleotide Sequence
Database Collaboration
• Protein Data Bank (PDB) - for the three-dimensional
structures of biological macromolecules
SECONDARY DATABASES
• A secondary database contains additional information derived from the
analysis of data available in primary sources.
SPECIALISED DATABASES
• Those that cater to a particular research interest.
• For example, FlyBase, the HIV Sequence Database, and the Ribosomal Database
Project.
FLAT FILE STORAGE DATA FORMATS
• By the time GenBank, EMBL and DDBJ formed a collaboration (1986),
sequence databases had moved to a defined flat file format with a shared
feature table format and annotation standards.
• The flat file formats from the sequence databases are still used to access
and display sequence and annotation. They are also convenient for storage of
local copies.
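To make the idea of a flat file concrete, here is a minimal Python sketch that reads one simplified, GenBank-style record: keywords such as LOCUS and ACCESSION start in the first column, the sequence follows the ORIGIN line, and "//" ends the record. The record itself is invented for illustration (DEMO0001 is not a real accession), and the parser deliberately ignores continuation lines and the feature table, so it should be read as a sketch rather than a complete reader for the real format.

```python
# A made-up, heavily simplified record in GenBank-like flat file layout.
RECORD = """\
LOCUS       DEMO0001     12 bp    DNA
DEFINITION  Hypothetical demo record.
ACCESSION   DEMO0001
ORIGIN
        1 atgcgatata gc
//
"""


def parse_flat_record(text: str) -> dict[str, str]:
    """Parse top-level keyword fields and the sequence from one simplified record."""
    fields: dict[str, str] = {}
    sequence_parts: list[str] = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if in_sequence:
            # Sequence lines: a position number followed by blocks of bases.
            sequence_parts.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line[:1].strip():           # a keyword starts in the first column
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
        # Indented continuation lines are skipped in this sketch.
    fields["SEQUENCE"] = "".join(sequence_parts).upper()
    return fields


record = parse_flat_record(RECORD)
print(record["ACCESSION"], record["SEQUENCE"])
```

Because each field sits on its own labelled line, a record like this stays readable by eye and easy to process with a few lines of code, which is part of why flat files remain convenient for local copies.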
DATABASE MANAGEMENT
SYSTEMS
• Contain raw data records
• Provide operational instructions that help identify hidden connections
among data records
• Allow easy execution of searches
• Combine different records to form final search reports
• Are typically relational or object-oriented databases
EXAMPLES OF DATABASE MANAGEMENT
SYSTEMS
• Database management systems (DBMSs) are computer software applications that interact with
the user, other applications, and the database itself to capture and analyze data. A
general-purpose DBMS is designed to allow the definition, creation, querying, update,
and administration of databases.
• Well-known DBMSs include MySQL, PostgreSQL, Microsoft SQL Server, Oracle, SAP
and IBM DB2.
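The definition, creation, querying, and update steps can be sketched in a few lines of Python. This example assumes SQLite (through Python's standard sqlite3 module) rather than any of the systems named above, simply because it runs without a server. The table name, column names, and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Definition and creation (DDL): describe the structure of the data.
conn.execute(
    "CREATE TABLE sequences ("
    " accession TEXT PRIMARY KEY,"
    " organism  TEXT NOT NULL,"
    " length    INTEGER NOT NULL)"
)

# Population: made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [("DEMO0001", "Homo sapiens", 1200),
     ("DEMO0002", "Mus musculus", 950),
     ("DEMO0003", "Homo sapiens", 430)],
)

# Update: change an existing record.
conn.execute("UPDATE sequences SET length = 1000 WHERE accession = 'DEMO0002'")

# Querying: retrieve the records that match a search criterion.
rows = conn.execute(
    "SELECT accession, length FROM sequences"
    " WHERE organism = ? ORDER BY length DESC",
    ("Homo sapiens",),
).fetchall()
print(rows)
conn.close()
```

The CREATE TABLE statement is an instance of data definition; the remaining statements manipulate and query data inside the structure it defines.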
Data definition (DDL)
• The DBMS must be able to accept data definitions in source form and convert them into
the appropriate object form. In other words, the DBMS must include a DDL processor or DDL
compiler.
• Depending on the type of data at hand, there are two basic ways of searching:
o using descriptive words to search text databases
o using a nucleotide or protein sequence to search sequence databases
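The second way of searching, using a sequence as the query, can be illustrated with a toy Python sketch that ranks database entries by the number of short subsequences (k-mers) they share with the query. This is only an assumption-laden illustration: real sequence search tools use alignment-based scoring that is far more elaborate, and the sequences, names, and choice of k = 3 here are made up.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_kmers(
    query: str, database: dict[str, str], k: int = 3
) -> list[tuple[str, int]]:
    """Rank database entries by how many distinct k-mers they share with the query."""
    query_kmers = kmers(query, k)
    scores = [
        (name, len(query_kmers & kmers(seq, k))) for name, seq in database.items()
    ]
    # Highest score first; ties are broken alphabetically by entry name.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Made-up sequences, for illustration only.
DB = {
    "seqA": "ATGCGTACGT",
    "seqB": "TTTTTTTTTT",
    "seqC": "GGATGCGTAA",
}
print(rank_by_shared_kmers("ATGCGTA", DB))
```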
TEXT-BASED DATABASE SEARCHING
• The advantage of these retrieval systems is that they not only return
matches to a query, but also provide handy pointers to additional
important information in related databases
TEXT-BASED DATABASE SEARCHING: BASIC SEARCH
CONCEPTS
• Boolean Search – An advanced query that combines two or more terms using the
Boolean operators AND, OR, and NOT (default: AND)
• Broadening the Search – If a search produces no useful entries, change
or remove terms
• Narrowing the Search – If a search produces too many entries, change
or add terms
• Wild Card – The character * prepended or appended to a search term makes the search
less specific, e.g., to look for all authors whose last name begins with Zav, search using Zav*
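These search concepts map directly onto set operations. The Python sketch below treats each record as a set of index words and shows AND, OR, NOT, and a trailing wild card. The records and the `matching` helper are hypothetical, and real retrieval systems index and parse queries in more sophisticated ways.

```python
import fnmatch

# Made-up records: each is the set of lowercase index words for one entry.
RECORDS = {
    "rec1": {"zavala", "kinase", "human"},
    "rec2": {"zavodny", "kinase", "mouse"},
    "rec3": {"smith", "kinase", "human"},
}


def matching(term: str) -> set[str]:
    """Return IDs of records with at least one word matching term (* is a wild card)."""
    pattern = term.lower()
    return {
        rec_id
        for rec_id, words in RECORDS.items()
        if any(fnmatch.fnmatchcase(word, pattern) for word in words)
    }


print(sorted(matching("kinase") & matching("human")))  # AND -> ['rec1', 'rec3']
print(sorted(matching("human") | matching("mouse")))   # OR  -> ['rec1', 'rec2', 'rec3']
print(sorted(matching("kinase") - matching("human")))  # NOT -> ['rec2']
print(sorted(matching("zav*")))                        # wild card -> ['rec1', 'rec2']
```

Adding a term with AND narrows the result set, while OR or a wild card broadens it, which mirrors the narrowing and broadening advice above.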
ENTREZ
• Entrez is a molecular biology database and retrieval system developed by the National
Center for Biotechnology Information (NCBI). It is an entry point for exploring distinct but
integrated databases.
o Protein sequence databases - Swiss-Prot, PIR, PRF, PDB, and translated protein sequences from
DNA sequence databases
o Literature database, PubMed - provides excellent and easy access to MEDLINE and pre-MEDLINE
articles.
o Taxonomy database - allows retrieval of DNA and protein sequences for any taxonomic group.
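Entrez databases can also be queried programmatically through NCBI's E-utilities web interface. As a hedged sketch, the Python snippet below only builds an ESearch request URL and does not send it; the search term is an arbitrary example, the `esearch_url` helper is hypothetical, and NCBI's usage guidelines should be consulted before making real requests.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Example: a Boolean text query against the PubMed literature database.
print(esearch_url("pubmed", "bioinformatics AND databases", retmax=5))
```

Changing `db` to another Entrez database name, such as "protein" or "taxonomy", targets the corresponding database with the same query syntax.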
DBGET
• An integrated data retrieval system developed and maintained by:
o The Institute for Chemical Research (Kyoto University)
o The Human Genome Center (University of Tokyo)
SRS
• SRS (Sequence Retrieval System) - Data retrieval tool developed by EBI
o Integrates 80 molecular biology databases
o Open-source software (can be installed locally)
o Has an associated scripting language called Icarus
NUCLEIC ACID DATABASES
• There are three major sites for finding information about nucleic acids (DNA and/or
RNA sequences) on the Web, and all of them contain basically the same information.
• The methods and databases that you will want to use will depend mainly on how
much data you want and in what form.
• The databases EMBL, GenBank, and DDBJ are the three primary nucleotide
sequence databases.
• There is comparatively little error checking, and there is a fair amount of redundancy.
• The entries in the EMBL, GenBank and DDBJ databases are synchronized on a
daily basis, and the accession numbers are managed in a consistent manner
between these three centers.
• The nucleotide databases have reached such large sizes that they are available
in subdivisions that allow searches or downloads that are more limited, and
hence less time-consuming.
• There are no legal restrictions on the use of the data in these databases.
However, there are patented sequences in the databases.
PROTEIN DATABASES
• A protein database is a collection of data that has been constructed from physical,
chemical and biological information on sequence, domain structure, function,
three‐dimensional structure and protein‐protein interactions.
• Each protein database can be further classified into more specialized categories
according to the type of information sought.
• Huge amounts of data on protein structures, functions, and
particularly sequences are being generated. Searching databases is
often the first step in the study of a new protein. It has the following
uses:
Examples of RNA Structural Databases
• Viral RNA Structure Database - provides information on the RNA structures of various
viruses.
THANK YOU
Rajasree Sasidharan
2nd Semester
MSc Biotechnology