Bioinfo PPT Unit 1 Half

The document discusses bioinformatics and operating systems used for bioinformatics applications. It provides details on common operating systems like Ubuntu and their biology software packages. It also discusses other specialized bioinformatics operating systems and software tools.

Uploaded by

srajasree98

BIOINFORMATICS-COMPUTER

APPLICATIONS
BT203CC
CONTENTS
INTRODUCTION

Bioinformatics basics

Introduction to operating systems

Operating systems used in bioinformatics

Databases; Database management system; Data retrieval

Protein and Nucleic acid databases

Structural databases

INTRODUCTION

Bioinformatics is a subdiscipline of
biology and computer science
concerned with the acquisition, storage,
analysis, and dissemination of
biological data, most often DNA and
amino acid sequences. Bioinformatics
uses computer programs for a variety
of applications, including determining
gene and protein functions, establishing
evolutionary relationships, and
predicting the three-dimensional shapes
of proteins.

Bioinformatics is a field of computational science that has to do with
the analysis of sequences of biological molecules. It usually refers to
genes, DNA, RNA, or protein, and is particularly useful in
comparing genes and other sequences in proteins and other
sequences within an organism or between organisms, looking at
evolutionary relationships between organisms, and using the patterns
that exist across DNA and protein sequences to figure out what their
function is. You can think about bioinformatics as essentially the
linguistics part of genetics. That is, the linguistics people are looking
at patterns in language, and that's what bioinformatics people do--
looking for patterns within sequences of DNA or protein.

Christopher P. Austin, M.D.
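The idea of looking for patterns within sequences can be illustrated with a minimal Python sketch. The sequence, the motif and the function names below are made up for illustration and do not come from the slides.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_motif(seq, motif):
    """Return 0-based start positions of every (possibly overlapping) motif hit."""
    seq, motif = seq.upper(), motif.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        # Restart one base later so overlapping hits are also reported.
        start = seq.find(motif, start + 1)
    return positions


dna = "ATGCGATATAGCATATA"  # made-up example sequence
print(find_motif(dna, "ATA"))
print(round(gc_content(dna), 3))
```

Real analyses usually rely on libraries such as Biopython or BioPerl, but the underlying task is the same: scanning strings of bases or residues for recurring patterns.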

The term “Bioinformatics” was coined by Paulien Hogeweg and Ben Hesper in
1970 as "the study of informatic processes in biotic systems". Bioinformatics is
the use of information technology in biotechnology for data storage, data
warehousing, and the analysis of DNA sequences.

Definition:

Bioinformatics is conceptualizing biology in terms of molecules and applying
“information techniques” (applied mathematics, computer science and
statistics) to understand and organize the information associated with these
molecules, on a large scale.

OPERATING
SYSTEMS
An Operating System (OS)
is a collection of software
that manages computer
hardware and provides
services for programs.
Specifically, it hides
hardware complexity,
manages computational
resources, and provides
isolation and protection .

OPERATING SYSTEMS
• An operating system, or "OS," is software that communicates with the hardware and
allows other programs to run.

• It is comprised of system software, or the fundamental files your computer needs
to boot up and function.

• Every desktop computer, tablet, and smartphone includes an operating system that
provides basic functionality for the device.

• Common desktop operating systems include Windows, OS X, and Linux.

• While each OS is different, most provide a graphical user interface, or GUI, that includes
a desktop and the ability to manage files and folders.

• They also allow you to install and run programs written for the operating system.
Windows and Linux can be installed on standard PC hardware, while OS X is designed to
run on Apple systems. Therefore, the hardware you choose affects what operating
system(s) you can run.
OPERATING SYSTEMS USED IN BIOINFORMATICS

•biobuntu, which comes preloaded with amapalign, biococoa, biomode, bioperl,
biopython, biosquid, blast2, boxshade, chemtool, clustalw, clustalx, dialign, easychem,
fastdnaml, fastlink, garlic, gchempaint, gcubin, gff2aplot, gff2ps, EMBOSS, gmt,
gromacs, hmmer, ifeffit, ifrit, imview, kalign, leksbot, libbioruby, meltinggui, mipe,
molphy, mozillabiofox, mpqc/support, mummer, muscle, mustang, ncbiepcr,
ncbitoolsbin, ncbitoolsx11, njplot, openbabel, perlprimer, phylip, poa, polyxmass,
primer3, probcons, proda, pybliographer, rasmol, readseq, seaview, sibsim4,
sigmaalign, sim4, sixpack, tcoffee, tigrglimmer, treepuzzle, treetool, treeviewx,
viewmol, wise, xbs, xdrawchem and xmakemol

•Bio-Linux - a fully featured, powerful, configurable and easy to maintain bioinformatics
workstation. Bio-Linux provides more than 500 bioinformatics programs on an Ubuntu
Linux base. There is a graphical menu for bioinformatics programs, as well as easy
access to the Bio-Linux bioinformatics documentation system and sample data useful
for testing programs. You can also install Bio-Linux packages to handle new-generation
sequence data types.

The vision for Cloud BioLinux is to offer a base image of genome analysis resources for
cloud computing platforms, such as Amazon EC2. This Science as a Service (ScaaS) model
allows life science software and supporting data sets to be incorporated, developed and
optimized on compute clouds. The project is driven by the observation that
commonly used bioinformatics tools are hard to build and maintain, require large
amounts of resources, or are just too numerous to choose from.

BioBrew is a collection of open-source applications for life scientists and an in-house
project at Bioinformatics.Org. The BioBrew Roll for Rocks can be used to create
Rocks/BioBrew Linux, a distribution customized for both cluster and bioinformatics
computing: it automates cluster installation, includes all the HPC software a cluster
enthusiast needs, and contains popular bioinformatics applications.

Debian Med is a "Debian Pure Blend" with the aim to develop Debian into an operating
system that is particularly well fit for the requirements for medical practice and research.
The goal of Debian Med is a complete system for all tasks in medical care which is built
completely on free software.

Ubuntu Biology Packages
Ubuntu Linux includes many packages useful for biologists.

•BioPerl - a collection of Perl modules that are useful for programmers in
bioinformatics/biology

•Clustalw - This program performs an alignment of multiple nucleotide or amino
acid sequences. ClustalX is a GUI front end for ClustalW.

•Emboss - a suite of bioinformatics tools.

•Muscle - A program for multiple alignment of protein sequences. MUSCLE stands
for multiple sequence comparison by log-expectation. In the authors' tests, MUSCLE
achieved the highest scores of all tested programs on several alignment accuracy
benchmarks, and it is also one of the fastest programs available.

•Seaview - A sequence editor with multiple alignment (via ClustalW) and other
capabilities
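The alignment programs listed above work on many sequences at once. As a much simpler illustration of what "alignment" means, here is a minimal sketch of pairwise global alignment scoring in the style of Needleman-Wunsch. It is not the algorithm used by ClustalW or MUSCLE, and the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (Needleman-Wunsch style)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

Only the score is computed here; recovering the aligned strings would require keeping the full table and tracing back through it.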

Continued…..

•Phylip - A Package of programs for inferring phylogenies.

•Tree View X - Tree View X is an open source program to display phylogenetic
trees on Linux, Unix, Mac OS X, and Windows platforms. It can read and display NEXUS
and Newick format tree files (such as those output by PAUP*, ClustalX, TREE-PUZZLE,
and other programs).

•Tree-puzzle - TREE-PUZZLE is a computer program to reconstruct phylogenetic
trees from molecular sequence data by maximum likelihood.

•DIALIGN2 - a command line tool to perform multiple alignment of protein or DNA
sequences.

•UGENE - an integrated bioinformatics suite with a handy visual interface.
It integrates tools for HMM profile search, multiple sequence alignment, PCR primer
design, protein secondary structure prediction and others.
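The Newick format mentioned under Tree View X stores a tree as nested parentheses. A minimal sketch of reading leaf names from such a string is shown here; the tree is a made-up example, and the regular expression assumes simple unquoted labels, so it is not a full Newick parser.

```python
import re

# A leaf label directly follows "(" or ","; internal node labels follow ")".
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_names(newick):
    """Return leaf labels of a Newick string in left-to-right order."""
    return LEAF_PATTERN.findall(newick)


tree = "((human:0.1,chimp:0.1):0.05,mouse:0.3);"  # made-up example tree
print(leaf_names(tree))
print(tree.count("("), "internal nodes")
```

Branch lengths (the numbers after each colon) are skipped in this sketch; tree viewers use them to scale the drawing.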

Other Biology software not in Ubuntu

•LDhat - LDhat is a package written in the C language for the analysis of recombination
rates from population genetic data.

•STRUCTURE - Population genetic analysis of admixture. Uses include inferring the
presence of distinct populations, assigning individuals to populations, studying hybrid
zones, identifying migrants and admixed individuals, and estimating population allele
frequencies in situations where many individuals are migrants or admixed.

•Mesquite - Mesquite is Java-based software for evolutionary biology, designed to
help biologists analyze comparative data about organisms.

•Visualization Toolkit (VTK) - an open-source software toolkit for image processing and
data visualization.

•Code biofox - Code biofox aims at implementing various bioinformatics tools as an
extension for the Firefox browser.

Continued…..
•Squint - A sequence editor with automated alignment, translation, etc.

•simuPOP - Highly modular and customizable forward population genetic
simulation software.

•ms - A widely used program for coalescent simulations in population genetics.

•libsequence - A C++ library designed to aid writing applications for genomics
and evolutionary genetics. The library is intended to be viewed as a "BioC++"
akin to the bioperl project, although the focus is on biological computation, such
as the analysis of SNP data and data generated from coalescent simulation.

•Geneious - NCBI search, BLAST, sequence alignment, tree-building and
publications, all in a single intuitive application that updates your data.

Continued…..

•DAWG - DNA Alignment With Gaps - A sequence simulation program similar to Seq-Gen,
that allows the user to simulate the evolution of DNA sequences along a phylogeny.
Includes all common models of sequence evolution as well as a model of indel formation.

•FaMoZ - Software written in C and Tcl/Tk which uses likelihood calculations and
simulations to perform parentage studies with dominant, codominant and cytoplasmic
markers.

•Insight Toolkit (ITK) - an open-source software toolkit for performing registration
and segmentation. Created to support the Visible Human Project.

•ApE - A Plasmid Editor. ApE is an easy to use DNA sequence editor with many nice
features. ApE is written in Tcl/Tk script. You will need to download the appropriate Tclkit
version for your Linux distribution.

DATABASES

• A database is a computerized
archive used to store and organize
data in such a way that information
can be retrieved easily via a variety
of search criteria.
• Data retrieval is the main purpose of
all databases.
• A database makes it easier to handle and
share large amounts of data, and it
supports large-scale analysis through easy
access and data updating.

BIOLOGICAL DATABASES
• Biological databases are libraries of life sciences information, collected from
scientific experiments, published literature, high-throughput experiment
technology and computational analysis.

• They contain information from genomics, proteomics, and microarray gene
expression.

• Information contained in biological databases includes gene function,
structure, localization (both cellular and chromosomal), biological sequences
and structures.

• Based on their contents, biological databases can be roughly divided into
three categories:

o Primary databases
o Secondary databases
o Specialized databases

PRIMARY DATABASES
• They contain original biological data
• They are archives of raw sequence or structural data submitted by
the scientific community
• Examples: GenBank, European Molecular Biology Laboratory (EMBL), and
DNA Data Bank of Japan (DDBJ)
o Freely available on the Internet
o Closely collaborate and exchange new data daily
o Together they constitute the International Nucleotide Sequence
Database Collaboration
• Protein Data Bank (PDB) - for the three-dimensional
structures of biological macromolecules

SECONDARY DATABASES
• A secondary database contains additional information derived from the
analysis of data available in primary sources.

• The data in secondary databases are analysed in a variety of ways, so these
databases contain different information in different formats.

• Translated protein sequence databases containing functional annotation
belong to this category.

• SWISS-PROT provides detailed sequence annotation that includes
structure, function, and protein family assignment.

Some secondary databases:
TrEMBL, Pfam, PROSITE, Profiles, SCOP, CATH

SPECIALISED DATABASES
• Those that cater to a particular research interest.
• For example, Flybase, HIV sequence database, and Ribosomal Database
Project

FLAT FILE STORAGE DATA FORMATS
• When GenBank, EMBL and DDBJ formed a collaboration (1986),
sequence databases had moved to a defined flat file format with a shared
feature table format and annotation standards.

• The flat file formats from the sequence databases are still used to access
and display sequence and annotation. They are also convenient for storage of
local copies.
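To give a feel for what a flat file looks like, here is a minimal sketch that reads a simplified, GenBank-style record. The record is invented for illustration and is not a real database entry. The parser only handles keyword lines, continuation lines, the sequence block after ORIGIN and the closing "//"; real flat files carry many more sections (for example the feature table), and libraries such as Biopython or BioPerl are the usual way to read them.

```python
RECORD = "\n".join([
    "LOCUS       EXAMPLE1     12 bp    DNA     linear   SYN 01-JAN-2000",
    "DEFINITION  Made-up record used only to illustrate the layout,",
    "            not a real database entry.",
    "ACCESSION   EXAMPLE1",
    "ORIGIN",
    "        1 atgcgatata gc",
    "//",
])


def parse_flat_record(text):
    """Parse one simplified flat-file record into a dict of fields."""
    fields, sequence = {}, []
    in_origin, last_key = False, None
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if in_origin:
            # Sequence lines mix position numbers, spaces and bases; keep letters.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line[:1].strip():               # a keyword starts in the first column
            parts = line.split(None, 1)
            key = parts[0]
            if key == "ORIGIN":
                in_origin = True
            else:
                fields[key] = parts[1].strip() if len(parts) > 1 else ""
                last_key = key
        elif last_key and line.strip():      # indented continuation line
            fields[last_key] += " " + line.strip()
    fields["sequence"] = "".join(sequence)
    return fields


record = parse_flat_record(RECORD)
print(record["ACCESSION"], record["sequence"])
```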

DATABASE MANAGEMENT
SYSTEMS
• Contain raw data records
• Operational instructions to help identify hidden connections
among data records
• Easy execution of the searches
• To combine different records to form final search reports
• Relational databases or Object-oriented databases

EXAMPLES OF DATABASE MANAGEMENT
SYSTEMS
• Database management systems (DBMSs) are computer software applications that interact with
the user, other applications, and the database itself to capture and analyze data. A
general-purpose DBMS is designed to allow the definition, creation, querying, update,
and administration of databases.

• Well-known DBMSs include MySQL, PostgreSQL, Microsoft SQL Server, Oracle, SAP
and IBM DB2.

Data definition (DDL)
• The DBMS must be able to accept data definition in source form and convert them into
appropriate object form. In other words, the DBMS must include DDL processor or DDL
compiler.

Data manipulation (DML)


• The DBMS must be able to handle requests to retrieve, update or delete existing data in
the database or add new data to the database

Optimization and Execution


• DML requests must be processed by the optimizer, a component whose purpose is to
determine an efficient way of implementing the request. The optimized requests are then
executed under the control of the run-time manager.

Data security and Integrity


• The DBMS must monitor user requests and reject any attempt to violate the security and
integrity constraints defined by the DBA (database administrator). The DBMS must also
enforce certain recovery and concurrency controls.

Recovery and Concurrency


• The DBMS software component that handles these is known as the transaction
manager or transaction processing (TP) monitor. It also supports backup and restore of data.
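The DBMS functions listed above can be tried out with SQLite through Python's built-in sqlite3 module. This is a minimal sketch: the table, column names and accession values are invented for illustration, and SQLite stands in for the larger systems named earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Data definition (DDL): describe the structure and its integrity constraints.
conn.execute("""
    CREATE TABLE sequences (
        accession TEXT PRIMARY KEY,
        organism  TEXT NOT NULL,
        length    INTEGER NOT NULL CHECK (length > 0)
    )
""")

# Data manipulation (DML): add and update rows inside one transaction.
rows = [("EX0001", "Homo sapiens", 1200), ("EX0002", "Mus musculus", 950)]
with conn:
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", rows)
    conn.execute("UPDATE sequences SET length = 1000 WHERE accession = ?", ("EX0002",))

# Integrity and recovery: the second insert violates the CHECK constraint,
# so the whole transaction (including the first insert) is rolled back.
try:
    with conn:
        conn.execute("INSERT INTO sequences VALUES (?, ?, ?)", ("EX0003", "Danio rerio", 500))
        conn.execute("INSERT INTO sequences VALUES (?, ?, ?)", ("EX0004", "Gallus gallus", -5))
except sqlite3.IntegrityError:
    pass

result = conn.execute(
    "SELECT accession, length FROM sequences ORDER BY accession"
).fetchall()
print(result)
```

Only the two rows from the successful transaction remain, which is the behaviour the transaction manager is meant to guarantee.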
DATA RETRIEVAL
• The amount of biologically relevant data accessible via the WWW is increasing at a
very rapid rate. It is important for Scientists to have easy and efficient ways of
wading through the data and finding what is important for their research. Knowing
how to access and search for information in the database is essential.

• Depending on the type of data at hand, there are two basic ways of searching:
o using descriptive words to search text databases
o using a nucleotide or protein sequence to search sequence databases

TEXT-BASED DATABASE SEARCHING

• There are three important data retrieval systems of particular relevance
to molecular biologists:

o Entrez (at NCBI)
o Sequence Retrieval System, SRS (at EBI)
o DBGET/LinkDB (in Japan)

• The advantage of these retrieval systems is that they not only return
matches to a query, but also provide handy pointers to additional
important information in related databases

TEXT-BASED DATABASE SEARCHING: BASIC SEARCH CONCEPTS

• Boolean Search – An advanced query search that combines two or more terms with the
Boolean operators AND, OR, or NOT (the default is AND)

• Broadening the Search – If the results of a search produce no useful entries, change
or remove terms

• Narrowing the Search – If the results of a search produce too many entries, change
or add terms

• Proximity Searching – To search with multiword terms or phrases, place quotes
around the terms

• Wild Card – The character * prepended or appended to a search term makes a search
less specific, e.g., to look for all authors whose last name begins with Zav, search using Zav*

ENTREZ
• Entrez is a molecular biology database and retrieval system developed by the National
Center for Biotechnology Information (NCBI). It is an entry point for exploring distinct but
integrated databases.

• The Entrez system provides access to:

o Nucleotide sequence databases – GenBank/DDBJ/EMBL

o Protein sequence databases - Swiss-Prot, PIR, PRF, PDB, and translated protein sequences from
DNA sequence databases

o Genome and chromosome mapping data

o Molecular Modelling 3-D structures Database

o Literature database, PubMed - provides excellent and easy access to MEDLINE and pre-MEDLINE
articles.

o Taxonomy database - allows retrieval of DNA and protein sequences for any taxonomic group.

o Specialized Databases – OMIM, dbSNP, UniSTS, etc.
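Besides the web pages, NCBI exposes Entrez to programs through the E-utilities web service, which the slides do not cover. The sketch below only builds an ESearch query URL and sends nothing over the network; the helper name `esearch_url` and the example query are illustrative assumptions, and NCBI's usage guidelines should be checked before issuing real requests.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database and a text query."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Boolean operators and field tags such as [Organism] go inside the term.
url = esearch_url("nucleotide", "insulin AND human[Organism]", retmax=5)
print(url)
```

Opening such a URL returns a list of matching record identifiers, which can then be passed to other E-utilities to fetch the records themselves.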

DBGET
• An integrated data retrieval system developed and maintained by the
Institute for Chemical Research (Kyoto University) and the Human Genome
Center (University of Tokyo)

• Databases covered include:

o Nucleic acid sequences – GenBank, EMBL
o Protein sequences – SWISS-PROT, PIR
o 3D structures – PDB
o Sequence motifs – PROSITE
o Enzyme reactions – LIGAND
o Literature – LITDB, Medline, etc.

SRS
• SRS (Sequence Retrieval System) is a data retrieval tool developed by EBI
• It integrates 80 molecular biology databases
• It is open source software that can be installed locally
• SRS has an associated scripting language called Icarus

NUCLEIC ACID DATABASES

• There are three major sites for finding information about nucleic acids (DNA and/or
RNA sequences) on the Web, and all of them contain basically the same information.

• The methods and databases that you will want to use will depend mainly on how
much data you want and in what form.

• The databases EMBL, GenBank, and DDBJ are the three primary nucleotide
sequence databases:

• They include sequences submitted directly by scientists and genome sequencing
groups, and sequences taken from literature and patents.

• There is comparatively little error checking and there is a fair amount of redundancy.

• The entries in the EMBL, GenBank and DDBJ databases are synchronized on a
daily basis, and the accession numbers are managed in a consistent manner
between these three centers.

• The nucleotide databases have reached such large sizes that they are available
in subdivisions that allow searches or downloads that are more limited, and
hence less time-consuming.

• For example, GenBank has currently 17 divisions.

• There are no legal restrictions on the use of the data in these databases.
However, there are patented sequences in the databases.

PROTEIN DATABASES
• A protein database is a collection of data that has been constructed from physical,
chemical and biological information on sequence, domain structure, function,
three‐dimensional structure and protein‐protein interactions.

• Collectively, protein databases may form a protein sequence database.

• Protein databases can generally be divided into two types.

o The first type is a universal database, which covers the proteins present in all
known biological species.
o The second type is a specialized database, which deals with the proteins
belonging to a specific group or family of proteins of certain species.

• Each protein database can be further classified into more specialized categories
according to the type of information sought.

• The NCBI Protein database is a collection of sequences from several sources,
including translations from annotated coding regions in GenBank, RefSeq and TPA,
as well as records from SwissProt, PIR, PRF, and PDB.

• Huge amounts of data for protein structures, functions, and
particularly sequences are being generated. Searching databases is
often the first step in the study of a new protein. It has the following
uses:

o Comparison between proteins or between protein families
provides information about the relationship between proteins
within a genome or across different species and hence offers
much more information than can be obtained by studying only
an isolated protein.

o Secondary databases derived from experimental databases are
also widely available. These databases reorganize and annotate
the data or provide predictions.

o The use of multiple databases often helps researchers
understand the structure and function of a protein.
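One simple way to compare two proteins is to measure how many aligned positions carry the same residue. The sketch below assumes the two sequences have already been aligned (gaps written as "-") and uses one common definition of percent identity, in which columns containing a gap are ignored; other definitions exist. The sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned fragments: 5 identical residues out of 6 gap-free columns.
print(round(percent_identity("MKT-LLV", "MKSALLV"), 1))
```

High identity between proteins from different species is often taken as a first hint that they are related, which is the kind of relationship described in the comparison point above.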
STRUCTURAL DATABASES
Structural databases provide structural information on various proteins and RNAs.

Examples Of Protein Structural Databases

•PDB (Protein Data Bank) - The Protein Data Bank is a repository of the 3-D structural
data of large biological molecules, such as proteins and nucleic acids. The data are
typically obtained by X-ray crystallography or NMR spectroscopy.

•CATH - CATH classifies protein domain structures in a hierarchical clustering,
grouping protein domains by sequence and structure similarity.

•SCOP (Structural Classification of Proteins) - SCOP classifies proteins on the basis of
their structures. The classification of protein structures in the database is based on
evolutionary relationships and on the principles that govern their three-dimensional
structure.

Examples of RNA Structural Databases

•SCOR (Structural Classification of RNA) - a database designed to provide a
comprehensive perspective and understanding of RNA motif structure, function, tertiary
interactions and their relationships.

•RNA STRAND - a comprehensive collection of RNA secondary structures that
provides ways of analysing, searching and updating the database.

•FSDB - The frameshift database is a comprehensive compilation of
experimentally known or computationally predicted data
about programmed ribosomal frameshifting. The database
provides a graphical view of the frameshift cassettes and
the genes utilizing frameshifting for their expression.

•Viral RNA Structure Database - provides information on the RNA structures of various
viruses.

THANK YOU

Rajasree Sasidharan

2nd Semester

MSc Biotechnology
