Introduction to Bioinformatics
THAMPI
Assistant Professor
Dept. of CSE
LBS College of Engineering
Kasaragod, Kerala-671542
[email protected]
Introduction
Bioinformatics is a new discipline that addresses the need to manage and interpret the data
massively generated by genomic research over the past decade. This discipline
represents the convergence of genomics, biotechnology and information technology, and
encompasses analysis and interpretation of data, modeling of biological phenomena, and
development of algorithms and statistics. Bioinformatics is by nature a cross-disciplinary
field that began in the 1960s with the efforts of Margaret O. Dayhoff, Walter M. Fitch,
Russell F. Doolittle and others and has matured into a fully developed discipline. However,
bioinformatics is wide-encompassing and is therefore difficult to define. For many, including
myself, it is still a nebulous term that encompasses molecular evolution, biological modeling,
biophysics, and systems biology. For others, it is plainly computational science applied to a
biological system. Bioinformatics is also a thriving field that is currently in the forefront of
science and technology. Our society is investing heavily in the acquisition, transfer and
exploitation of data and bioinformatics is at the center stage of activities that focus on the
living world. It is currently a hot commodity, and students in bioinformatics will benefit from
employment demand in government, the private sector, and academia.
With the advent of computers, humans have become data gatherers, measuring every aspect
of our lives and deriving inferences from these activities. In this new culture, everything can
and will become data (from internet traffic and consumer taste to the mapping of galaxies or
human behavior). Everything can be measured (in pixels, Hertz, nucleotide bases, etc), turned
into collections of numbers that can be stored (generally in bytes of information), archived in
databases, disseminated (through cable or wireless conduits), and analyzed. We are expecting
giant pay-offs from our data: proactive control of our world (from earthquakes and disease to
finance and social stability), and clear understanding of chemical, biological and
cosmological processes. Ultimately, we expect a better life. Unfortunately, data brings clutter
and noise and its interpretation cannot keep pace with its accumulation. One problem with
data is its multi-dimensionality and how to uncover underlying signal (patterns) in the most
parsimonious way (generally using nonlinear approaches). Another problem relates to what
we do with the data. Scientific discovery is driven by falsifiability and imagination and not by
purely logical processes that turn observations into understanding. Data alone will not generate
knowledge through purely inductive principles.
The gathering, archival, dissemination, modeling, and analysis of biological data fall within
a relatively young field of scientific inquiry, currently known as bioinformatics.
Bioinformatics was spurred by the wide accessibility of computers with increased compute
power and by the advent of genomics. Genomics made it possible to acquire nucleic acid
sequence and structural information from a wide range of genomes at an unprecedented pace
and made this information accessible to further analysis and experimentation. For example,
sequences were matched to those coding for globular proteins of known structure (defined by
crystallography) and were used in high-throughput combinatorial approaches (such as DNA
microarrays) to study patterns of gene expression. Inferences from sequences and
biochemical data were used to construct metabolic networks. These activities have generated
terabytes of data that are now being analyzed with computer, statistical, and machine learning
techniques. The sheer number of sequences and information derived from these endeavors
has given the false impression that imagination and hypothesis do not play a role in
the acquisition of biological knowledge. However, bioinformatics becomes a science only when
fueled by hypothesis-driven research and within the context of the complex and ever-changing living world.
The science that relates to bioinformatics has many components. It usually relates to
biological molecules and therefore requires knowledge in the fields of biochemistry,
molecular biology, molecular evolution, thermodynamics, biophysics, molecular engineering,
and statistical mechanics, to name a few. It requires the use of computer science,
mathematical, and statistical principles. Bioinformatics is at the crossroads of experimental
and theoretical science. Bioinformatics is not only about modeling or data mining; it is
about understanding the molecular world that fuels life from evolutionary and mechanistic
perspectives. It is truly interdisciplinary and is constantly changing. Much like biotechnology and
genomics, bioinformatics is moving from applied to basic science, from developing tools to
developing hypotheses.
The Genome
The hereditary information that an organism passes to its offspring is represented in each of
its cells. The representation is in the form of DNA molecules. Each DNA molecule is a long
chain of chemical structures called nucleotides of four different types, which can be viewed
abstractly as characters from the alphabet {A, C, T, G}. The totality of this information is
called the genome of the organism. In humans the genome consists of approximately 3 billion
nucleotides. A major task of molecular biology is to: extract the information contained in the
genomes of different organisms; elucidate the structure of the genome; apply this knowledge
to the diagnosis and, ultimately, the treatment of genetic diseases (about 4000 such diseases in
humans have been identified); and, by comparing the genomes of different species, explain the
process and mechanisms of evolution. These tasks require the invention of new algorithms.
Genetics
Genetics is the study of heredity. It began with the observations of Mendel on the inheritance
of simple traits such as the color of peas. Mendel worked out the fundamental mathematical
rules for the inheritance of such traits. The central abstraction in genetics is the concept of a
gene. In classical genetics an organism's genes were abstract attributes for which no
biochemical basis was known. Over the past 40 years it has become possible to understand
the mechanisms of heredity at the molecular level. Classical genetics was based on breeding
experiments on plants and animals. Modern genetics is based on the molecular study of fungi,
bacteria and viruses.
Cells
In the early 19th century development of the microscope enabled cells to be observed. A cell
is an assortment of chemicals inside a sac bounded by a fatty layer called the plasma
membrane. Bacteria and protozoa consist of single cells. Plants and animals are complex
collections of cells of different types, with similar cells joined together to form tissues. A
human has trillions of cells. In a eukaryotic cell the genetic material is inside a nucleus
separated from the rest of the cell by a membrane, and the cell contains a number of discrete
functional units called organelles. The part of the cell outside the nucleus is called the
cytoplasm. In a prokaryotic cell there is no nucleus and the structure is more homogeneous.
Bacteria are prokaryotes; plants, animals and protozoa are eukaryotes. A eukaryotic cell has
diameter 10-100 microns, a bacterial cell 1-10 microns.
The genetic material in cells is contained in structures called chromosomes. In most
prokaryotes, each cell has a single chromosome. In eukaryotes, all the cells of any individual
contain the same number of chromosomes: 8 in fruit flies, 46 in humans and bats, 84 in the
rhinoceros. In mammals and many other eukaryotic organisms the chromosomes occur in
homologous (similar in shape and structure) pairs, except for the sex chromosomes. The male
has an X chromosome and a Y chromosome, and the female has two X chromosomes. Cells
with homologous pairs of chromosomes are called diploid. Cells with unpaired chromosomes
are called haploid. In polyploid organisms the chromosomes come in homologous triplets,
quadruplets etc. Mitosis is the process of cell division. In mitosis each chromosome gets
duplicated, the chromosomes line up in the central part of the cell, then separate into two
groups which move to the ends of the cell, and a cell membrane forms separating the two
sister cells. During mitosis chromosomes can be observed well under the microscope. They
can be distinguished from one another by their banding structure, and certain abnormalities
observed. For example, Down Syndrome occurs when there is an extra copy of chromosome
21.
The sperm and egg cells (gametes) in sexually reproducing species are haploid. The process
of meiosis produces four sperm/egg cells from a single diploid cell. We describe the main
steps in meiosis. The original diploid cell has two homologous copies of each chromosome
(with the exception of the X and Y chromosomes). One of these is called the paternal copy
(because it was inherited from the father) and the other is called the maternal copy. Each of
the two copies duplicates itself. The four resulting copies are then brought together, and
material may be exchanged between the paternal and maternal copies. This process is called
recombination or crossover. Then a cell division occurs, with the two paternal copies
(possibly containing some material from the maternal copies by virtue of recombination)
going to one of the daughter cells, and the two maternal copies to the other daughter cell.
Each daughter cell then divides, with one of the two copies going to each of its daughter
cells. The result is that four haploid sperm or egg cells are produced. In fertilization a sperm
cell and an egg cell combine to form a diploid zygote, from which a new individual develops.
Inheritance of Simple Traits
Many traits have two different versions: for example, pea seeds may be either green or
yellow, and either smooth or wrinkled. In the 19th century Mendel postulated that each such
trait has an associated gene that can exist in two versions, called alleles. An individual has two
copies of the gene for the trait. If the two copies are the same allele the individual is
homozygous for the trait, otherwise heterozygous. On the basis of breeding experiments with
peas Mendel postulated that each parent contributes a random one of its two alleles to the
child. He also postulated incorrectly that the alleles for different traits are chosen
independently.
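To make Mendel's rules concrete, the short Python sketch below simulates his law of segregation for a single trait: each parent contributes a random one of its two alleles to each offspring. The allele symbols and the 10,000-offspring sample size are illustrative choices, not part of Mendel's original work.

    import random
    from collections import Counter

    def cross(parent1, parent2, n=10000):
        # Each parent passes a random one of its two alleles to each offspring
        # (Mendel's law of segregation).
        offspring = Counter()
        for _ in range(n):
            genotype = "".join(sorted(random.choice(parent1) + random.choice(parent2)))
            offspring[genotype] += 1
        return offspring

    # Cross two heterozygous parents (Y = yellow allele, g = green allele).
    # Genotypes YY : Yg : gg appear in roughly a 1 : 2 : 1 ratio.
    print(cross("Yg", "Yg"))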
Proteins
A protein is a very large biological molecule composed of a chain of smaller molecules
called amino acids. Thousands of different proteins are present in a cell, the synthesis of each
type of protein being directed by a different gene. Proteins make up much of the cellular
structure (our hair, skin, and fingernails consist largely of protein). Some proteins are
enzymes which catalyze (make possible or greatly increase the rate of) chemical reactions
within the cell. As we shall discuss later, some proteins are transcription factors which
regulate the manner in which genes direct the production of other proteins. Proteins on the
surfaces of cells act as receptors for hormones and other signaling molecules. The genes
control the ability of cells to make enzymes. Thus genes control the functioning of cells.
DNA
DNA was discovered in 1869. Most of the DNA in cells is contained in the chromosomes.
DNA is chemically very different from protein. DNA is structured as a double helix
consisting of two long strands that wind around a common axis. Each strand is a very long
chain of nucleotides of four types: A, C, T and G. There are four different types of
nucleotides, distinguished by rings called bases, together with a common portion consisting
of a sugar called deoxyribose and a phosphate group. On the sugar there are two sites, called
the 3′ and 5′ sites. Each phosphate is bonded to two successive sugars, at the 3′ site of one
and the 5′ site of the other. These phosphate-sugar links form the backbones of the two
chains. The bases of the two chains are weakly bonded together in complementary pairs, each
of the form CG or AT. Thus the chains have directionality (from 3′ to 5′), and the sequence
of bases on one chain determines the sequence on the other, by the rule of complementarity.
The linear ordering of the nucleotides determines the genetic information. Because of
complementarity the two strands contain the same information, and this redundancy is the
basis of DNA repair mechanisms. A gene is a part of the sequence of nucleotides which
codes for a protein in a manner that will be detailed below. Variations in the sequence of the
gene create different alleles, and mutations (changes in the sequence, due, for example, to
disturbance by ultraviolet light) create new alleles. The genome is the totality of DNA stored
in chromosomes typical of each species. The genome contains most of the information
needed to specify an organism's properties. DNA undergoes replication, repair,
rearrangement and recombination. These reactions are catalyzed by enzymes. We describe
DNA replication. During replication, the two strands unwind at a particular point along the
double helix. In the presence of an enzyme called DNA polymerase the unwound chain serves
as a template for the formation of a complementary sequence of nucleotides, which are
adjoined to the complementary strand one-by-one. Many short segments are formed, and
these are bonded together in a reaction catalyzed by an enzyme called DNA ligase. There are
mechanisms for repairing errors that occur in this replication process, which occurs during
mitosis.
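As a small illustration of the rule of complementarity described above, the following Python sketch computes the sequence of the opposite strand for a given strand; the example sequence is arbitrary.

    # Complementary base pairing: A pairs with T, C pairs with G.
    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def reverse_complement(strand):
        # The opposite strand runs in the opposite direction, so we reverse
        # the sequence and replace every base by its complement.
        return "".join(COMPLEMENT[base] for base in reversed(strand))

    print(reverse_complement("ATGCCGTA"))  # prints TACGGCAT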
Definition of Bioinformatics
Roughly, bioinformatics describes any use of computers to handle biological information.
In practice the definition used by most people is narrower; bioinformatics to them is a
synonym for "computational molecular biology"- the use of computers to characterize the
molecular components of living things.
"Classical" bioinformatics:
"The mathematical, statistical and computing methods that aim to solve biological problems
using DNA and amino acid sequences and related information.
The "loose" definition:
There are other fields (for example, medical imaging and image analysis) which might be
considered part of bioinformatics. There is also a whole other discipline of biologically-inspired
computation: genetic algorithms, AI, neural networks. Often these areas interact in
strange ways. Neural networks, inspired by crude models of the functioning of nerve cells in
the brain, are used in a program called PHD to predict, surprisingly accurately, the secondary
structures of proteins from their primary sequences. What almost all bioinformatics has in
common is the processing of large amounts of biologically-derived information, whether
DNA sequences or breast X-rays.
Even though the three terms bioinformatics, computational biology and bioinformation
infrastructure are oftentimes used interchangeably, broadly, the three may be defined as
follows:
1. bioinformatics refers to database-like activities, involving persistent sets of data that
are maintained in a consistent state over essentially indefinite periods of time;
2. computational biology encompasses the use of algorithmic tools to facilitate
biological analyses; while
3. bioinformation infrastructure comprises the entire collective of information
management systems, analysis tools and communication networks supporting
biology. Thus, the latter may be viewed as a computational scaffold of the former
two.
Bioinformatics is currently defined as the study of information content and information flow
in biological systems and processes. It has evolved to serve as the bridge between
observations (data) in diverse biologically-related disciplines and the derivations of
understanding (information) about how the systems or processes function, and subsequently
the application (knowledge). A more pragmatic definition in the case of diseases is the
understanding of dysfunction (diagnostics) and the subsequent applications of the knowledge
for therapeutics and prognosis.
The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics
as:
"Bioinformatics is the field of science in which biology, computer science, and information
technology merge into a single discipline. There are three important sub-disciplines within
bioinformatics: the development of new algorithms and statistics with which to assess
relationships among members of large data sets; the analysis and interpretation of various
types of data including nucleotide and amino acid sequences, protein domains, and protein
structures; and the development and implementation of tools that enable efficient access and
management of different types of information."
A Bioinformaticist versus a Bioinformatician (1999):
Bioinformatics has become such a mainstay of genomics, proteomics, and all the other *.omics (such
as phenomics) that many information technology companies have entered the business or are
considering entering it, creating a convergence of IT (information technology) and BT
(biotechnology).
A bioinformaticist is an expert who not only knows how to use bioinformatics tools, but also
knows how to write interfaces for effective use of the tools.
A bioinformatician, on the other hand, is a trained individual who only knows how to use
bioinformatics tools, without a deeper understanding.
Thus, a bioinformaticist is to *.omics as a mechanical engineer is to an automobile. A
bioinformatician is to *.omics as a technician is to an automobile.
Proteomics:
Michael J. Dunn, the Editor-in-Chief of Proteomics, defines the "proteome" as "the Protein
complement of the genome" and proteomics to be concerned with "Qualitative and
quantitative studies of gene expression at the level of the functional proteins themselves",
that is, "an interface between protein biochemistry and molecular biology". Characterizing the
many tens of thousands of proteins expressed in a given cell type at a given time (whether
measuring their molecular weights or isoelectric points, identifying their ligands or
determining their structures) involves the storage and comparison of vast amounts of data.
Inevitably this requires bioinformatics.
Pharmacogenomics:
Pharmacogenomics is the application of genomic approaches and technologies to the
identification of drug targets. Examples include trawling entire genomes for potential
receptors by bioinformatic means, investigating patterns of gene expression in both
pathogens and hosts during infection, and examining the characteristic expression patterns
found in tumours or patients' samples for diagnostic purposes (or in the pursuit of potential
cancer therapy targets).
Pharmacogenetics:
All individuals respond differently to drug treatments; some positively, others with little
obvious change in their conditions and yet others with side effects or allergic reactions. Much
of this variation is known to have a genetic basis. Pharmacogenetics is a subset of
pharmacogenomics which uses genomic/bioinformatic methods to identify genomic
correlates, for example SNPs (Single Nucleotide Polymorphisms), characteristic of particular
patient response profiles, and uses those markers to inform the administration and development
of therapies. Strikingly, such approaches have been used to "resurrect" drugs previously thought
to be ineffective but subsequently found to work in a subset of patients, or to optimize the
doses of chemotherapy for particular patients.
Cheminformatics:
The Web advertisement for Cambridge Healthtech Institute's Sixth Annual Cheminformatics
conference describes the field thus: "the combination of chemical synthesis, biological
screening, and data-mining approaches used to guide drug discovery and development" but
this, again, sounds more like a field being identified by some of its most popular (and
lucrative) activities, rather than by including all the diverse studies that come under its
general heading.
The story of one of the most successful drugs of all time, penicillin, seems bizarre, but the
way we discover and develop drugs even now has similarities, being the result of chance,
observation and a lot of slow, intensive chemistry. Until recently, drug design always seemed
doomed to continue to be a labour-intensive, trial-and-error process. The possibility of using
information technology, to plan intelligently and to automate processes related to the
chemical synthesis of possible therapeutic compounds is very exciting for chemists and
biochemists. The rewards for bringing a drug to market more rapidly are huge, so naturally
this is what a lot of cheminformatics work is about. The span of academic cheminformatics
is wide and is exemplified by the interests of the cheminformatics groups at the Centre for
Molecular and Biomolecular Informatics at the University of Nijmegen in the Netherlands.
These interests include: synthesis planning; reaction and structure retrieval; 3-D structure
retrieval; modelling; computational chemistry; and visualisation tools and utilities.
Trinity University's Cheminformatics Web page, for another example, concerns itself with
cheminformatics as the use of the Internet in chemistry.
Medical Informatics:
"Biomedical Informatics is an emerging discipline that has been defined as the study,
invention, and implementation of structures and algorithms to improve communication,
understanding and management of medical information." Medical informatics is more
concerned with structures and algorithms for the manipulation of medical data, rather than
with the data itself. This suggests that one difference between bioinformatics and medical
informatics as disciplines lies with their approaches to the data; there are bioinformaticists
interested in the theory behind the manipulation of that data and there are bioinformatics
scientists concerned with the data itself and its biological implications. Medical informatics,
for practical reasons, is more likely to deal with data obtained at "grosser" biological levels (that is, information from super-cellular systems, right up to the population level), while most
bioinformatics is concerned with information about cellular and biomolecular structures and
systems.
Scientific use of the Internet, which had been growing since the 1980s, expanded dramatically
following the release of the WWW by CERN in the early 1990s.
HTML:
The WWW is a graphical interface based on hypertext by which text and graphics can be
displayed and highlighted. Each highlighted element is a pointer to another document or an
element in another document which can reside on any internet host computer. Page display,
hypertext links and other features are coded using a simple, cross-platform HyperText
Markup Language (HTML) and viewed on UNIX workstations, PCs and Apple Macs as
WWW pages using a browser.
Java:
The first graphical WWW browser - Mosaic for X and the first molecular biology WWW
server - ExPASy were made available in 1993. In 1995, Sun Microsystems released Java, an
object-oriented, portable programming language based on C++. In addition to being a
standalone programming language in the classic sense, Java provides a highly interactive,
dynamic content to the Internet and offers a uniform operational level for all types of
computers, provided they implement the 'Java Virtual Machine' (JVM). Thus, programs can
be written, transmitted over the internet and executed on any other type of remote machine
running a JVM. Java is also integrated into Netscape and Microsoft browsers, providing both
the common interface and programming capability which are vital in sorting through and
interpreting the gigabytes of bioinformatics data now available and increasing at an
exponential rate.
XML:
The new XML standard is a project of the World Wide Web Consortium (W3C) which
extends the power of the WWW to deliver not only HTML documents but an unlimited range
of document types using customised markup. This will enable the bioinformatics community
to exchange data objects such as sequence alignments, chemical structures, spectra etc.,
together with appropriate tools to display them, just as easily as they exchange HTML
documents today. Both Microsoft and Netscape support this new technology in their latest
browsers.
CORBA:
Another new technology, called CORBA, provides a way of bringing together many existing
or 'legacy' tools and databases with a common interface that can be used to drive them and
access data. CORBA frameworks for bioinformatics tools and databases have been developed
by, for example, NetGenics and the European Bioinformatics Institute (EBI).
Representatives from industry and the public sector under the umbrella of the Object
Management Group are working on open CORBA-based standards for biological information
representation. The Internet offers scientists a universal platform on which to share and search
for data and the tools to ease data searching, processing, integration and interpretation. The
same hardware and software tools are also used by companies and organisations in more
private yet still global Intranet networks. One such company, Oxford GlycoSciences in the
UK, has developed a bioinformatics system as a key part of its proteomics activity.
ROSETTA:
ROSETTA focuses on protein expression data and sets out to identify the specific proteins
which are up- or down-regulated in a particular disease; characterise these proteins with
respect to their primary structure, post-translational modifications and biological function;
evaluate them as drug targets and markers of disease; and develop novel drug candidates.
OGS uses a technique called fluorescent IPG-PAGE to separate and measure different protein
types in a biological sample such as a body fluid or purified cell extract. After separation,
each protein is collected and then broken up into many different fragments using controlled
techniques. The mass and sequence of these fragments is determined with great accuracy
using a technique called mass spectrometry. The sequence of the original protein can then be
theoretically reconstructed by fitting these fragments back together in a kind of jigsaw. This
reassembly of the protein sequence is a task well-suited to signal processing and statistical
methods.
ROSETTA is built on an object-relational database system which stores demographic and
clinical data on sample donors and tracks the processing of samples and analytical results. It
also interprets protein sequence data and matches this data with that held in public, client and
proprietary protein and gene databases. ROSETTA comprises a suite of linked HTML pages
which allow data to be entered, modified and searched and allows the user easy access to
other databases. A high level of intelligence is provided through a sophisticated suite of
proprietary search, analytical and computational algorithms. These algorithms facilitate
searching through the gigabytes of data generated by the Company's proteome projects,
matching sequence data, carrying out de novo peptide sequencing and correlating results with
clinical data. These processing tools are mostly written in C, C++ or Java to run on a variety
of computer platforms and use the networking protocol of the internet, TCP/IP, to co-ordinate
the activities of a wide range of laboratory instrument computers, reliably identifying samples
and collecting data for analysis.
The need to analyse ever increasing numbers of biological samples using increasingly
complex analytical techniques is insatiable. Searching for signals and trends in noisy data
continues to be a challenging task, requiring great computing power. Fortunately this power
is available with today's computers, but of key importance is the integration of analytical
data, functional data and biostatistics. The protein expression data in ROSETTA forms only
part of an elaborate network of the type of data which can now be brought to bear in Biology.
The need to integrate different information systems into a collaborative network with a
friendly face is bringing together an exciting mixture of talents in the software world and has
brought the new science of bioinformatics to life.
Many bioinformatics systems in use today are similar to the original systems gathered together
by researchers 15-20 years ago. Many are simple extensions of the original academic systems,
which have served the needs of both
academic and commercial users for many years. These systems are now beginning to fall
behind as they struggle to keep up with the pace of change in the pharma industry. Databases
are still gathered, organised, disseminated and searched using flat files. Relational databases
are still few and far between, and object-relational or fully object oriented systems are rarer
still in mainstream applications. Interfaces still rely on command lines, fat client interfaces,
which must be installed on every desktop, or HTML/CGI forms. Whilst these tools were in the
hands of bioinformatics specialists, pharmaceutical companies were relatively undemanding of
them. Now that the problems have expanded to cover the mainstream discovery process, much
more flexible and scalable solutions are needed to serve pharma R&D informatics
requirements.
There are different views of the origin of bioinformatics. From T. K. Attwood and D. J. Parry-Smith's "Introduction to Bioinformatics", Prentice-Hall, 1999 [Longman Higher Education;
ISBN 0582327881]: "The term bioinformatics is used to encompass almost all computer
applications in biological sciences, but was originally coined in the mid-1980s for the
analysis of biological sequence data."
From Mark S. Boguski's article in the "Trends Guide to Bioinformatics" Elsevier, Trends
Supplement 1998 p1: "The term "bioinformatics" is a relatively recent invention, not
appearing in the literature until 1991 and then only in the context of the emergence of
electronic publishing... The National Center for Biotechnology Information (NCBI), is
celebrating its 10th anniversary this year, having been written into existence by US
Congressman Claude Pepper and President Ronald Reagan in 1988. So bioinformatics has, in
fact, been in existence for more than 30 years and is now middle-aged."
1953 - Watson & Crick propose the double helix model for DNA, based on X-ray data
obtained by Franklin & Wilkins.
1954 - Perutz's group develop heavy atom methods to solve the phase problem in
protein crystallography.
1955 - The sequence of the first protein to be analysed, bovine insulin, is announced by
F. Sanger.
1969 - The ARPANET is created by linking computers at Stanford and UCLA.
1970 - The details of the Needleman-Wunsch algorithm for sequence comparison are
published.
1972 - The first recombinant DNA molecule is created by Paul Berg and his group.
1973 - The Brookhaven Protein Data Bank is announced
(Acta Cryst. B, 1973, 29:1764). Robert Metcalfe receives his Ph.D. from Harvard
University. His thesis describes Ethernet.
1974 - Vint Cerf and Robert Kahn develop the concept of connecting networks of
computers into an "internet" and develop the Transmission Control Protocol (TCP).
1975 - Microsoft Corporation is founded by Bill Gates and Paul Allen.
Two-dimensional electrophoresis, in which separation of proteins on SDS
polyacrylamide gel is combined with separation according to isoelectric point, is
announced by P. H. O'Farrell.
1988 - The National Center for Biotechnology Information (NCBI) is established at
the National Library of Medicine.
The Human Genome Initiative is started (Commission on Life Sciences, National
Research Council, Mapping and Sequencing the Human Genome, National Academy
Press: Washington, D.C., 1988).
The FASTA algorithm for sequence comparison is published by Pearson and Lipman.
An Internet computer virus designed by a student infects 6,000 computers in the US,
including military machines.
1989 - The Genetics Computer Group (GCG) becomes a private company.
Oxford Molecular Group, Ltd. (OMG) is founded in the UK by Anthony Marchington, David
Ricketts, James Hiddleston, Anthony Rees, and W. Graham Richards. Primary
products: Anaconda, Asp, Cameleon and others (molecular modeling, drug design,
protein design).
1990 - The BLAST program (Altschul et al.) is implemented.
Molecular Applications Group is founded in California by Michael Levitt and Chris
Lee. Their primary products are Look and SegMod, which are used for molecular
modeling and protein design.
InforMax is founded in Bethesda, MD. The company's products address sequence
analysis, database and data management, searching, publication graphics, clone
construction, mapping and primer design.
1991 - The research institute in Geneva (CERN) announces the creation of the
protocols which make up the World Wide Web.
The creation and use of expressed sequence tags (ESTs) is described.
Incyte Pharmaceuticals, a genomics company headquartered in Palo Alto, California,
is formed.
Myriad Genetics, Inc. is founded in Utah. The company's goal is to lead in the
discovery of major common human disease genes and their related pathways. The
company has discovered and sequenced, with its academic collaborators, the
Biological Database
A biological database is a large, organized body of persistent data, usually associated with
computerized software designed to update, query, and retrieve components of the data stored
within the system. A simple database might be a single file containing many records, each of
which includes the same set of information. For example, a record associated with a
nucleotide sequence database typically contains information such as contact name; the input
sequence with a description of the type of molecule; the scientific name of the source
organism from which it was isolated; and, often, literature citations associated with the
sequence.
For researchers to benefit from the data stored in a database, two additional requirements
must be met:
Easy access to the information; and
A method for extracting only that information needed to answer a specific
biological question.
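As a rough illustration of the kind of record and query described above, the Python sketch below models a tiny in-memory "database" of sequence records; the field names and values are purely illustrative, not a real schema.

    from dataclasses import dataclass, field

    @dataclass
    class SequenceRecord:
        # Illustrative fields of the kind a nucleotide-sequence record might carry.
        accession: str
        contact_name: str
        molecule_type: str
        organism: str
        sequence: str
        citations: list = field(default_factory=list)

    records = [
        SequenceRecord("SEQ000001", "A. Researcher", "DNA", "Homo sapiens",
                       "ATGGCCATTGTAATG", ["J. Example Biol. 1, 1-10 (1994)"]),
    ]

    # Extracting only the information needed to answer a specific question:
    human_dna = [r.accession for r in records
                 if r.organism == "Homo sapiens" and r.molecule_type == "DNA"]
    print(human_dna)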
Currently, a lot of bioinformatics work is concerned with the technology of databases. These
databases include both "public" repositories of gene data like GenBank or the Protein
DataBank (the PDB), and private databases like those used by research groups involved in
gene mapping projects or those held by biotech companies. Making such databases accessible
via open standards like the Web is very important since consumers of bioinformatics data use
a range of computer platforms: from the more powerful and forbidding UNIX boxes favoured
by the developers and curators to the far friendlier Macs often found populating the labs of
computer-wary biologists. DNA and RNA are the nucleic acid macromolecules that store the
hereditary information about an organism. These macromolecules have a defined structure,
which can be analyzed by biologists with the help of bioinformatic tools and databases. A few popular
databases are GenBank from NCBI (National Center for Biotechnology Information),
SwissProt from the Swiss Institute of Bioinformatics and PIR from the Protein Information
Resource.
GenBank:
GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of known
genetic sequences. It has a flat file structure, that is an ASCII text file, readable by both
humans and computers. In addition to sequence data, GenBank files contain information like
accession numbers and gene names, phylogenetic classification and references to published
literature. There are approximately 191,400,000 bases and 183,000 sequences as of June 1994.
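For readers who want to handle such flat files programmatically, the sketch below uses Biopython's SeqIO module to read a GenBank-format file; it assumes Biopython is installed and that "example.gb" is a local GenBank file, both of which are assumptions rather than anything provided with these notes.

    from Bio import SeqIO

    # Iterate over the records in a GenBank flat file and print a few
    # of the fields mentioned above.
    for record in SeqIO.parse("example.gb", "genbank"):
        print(record.id)                            # accession
        print(record.description)                   # definition line
        print(record.annotations.get("organism"))   # source organism
        print(len(record.seq), "bases")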
EMBL:
The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA
sequences collected from the scientific literature and patent applications and directly
submitted from researchers and sequencing groups. Data collection is done in collaboration
with GenBank (USA) and the DNA Database of Japan (DDBJ). The database doubles in size
every 18 months and currently (June 1994) contains nearly 2 million bases
from 182,615 sequence entries.
SwissProt:
This is a protein sequence database that provides a high level of integration with other
databases and also has a very low level of redundancy (meaning fewer identical sequences are
present in the database).
PROSITE:
The PROSITE dictionary of sites and patterns in proteins prepared by Amos Bairoch at the
University of Geneva.
EC-ENZYME:
The 'ENZYME' data bank contains the following data for each type of characterized enzyme
for which an EC number has been provided: EC number, Recommended name, Alternative
names, Catalytic activity, Cofactors, Pointers to the SWISS-PROT entry (or entries) that correspond
to the enzyme, Pointers to disease(s) associated with a deficiency of the enzyme.
PDB:
The X-ray crystallography Protein Data Bank (PDB), compiled at the Brookhaven National
Laboratory.
GDB:
The GDB Human Genome Data Base supports biomedical research, clinical medicine, and
professional and scientific education by providing for the storage and dissemination of data
about genes and other DNA markers, map location, genetic disease and locus information,
and bibliographic information.
OMIM:
The Mendelian Inheritance in Man data bank (MIM) is prepared by Victor McKusick with
the assistance of Claire A. Francomano and Stylianos E. Antonarakis at Johns Hopkins
University.
PIR-PSD:
PIR (Protein Information Resource) produces and distributes the PIR-International Protein
Sequence Database (PSD). It is the most comprehensive and expertly annotated protein
sequence database. The PIR serves the scientific community through on-line access,
distributing magnetic tapes, and performing off-line sequence identification services for
researchers. Release 40.00 (March 31, 1994) contains 67,423 entries and 19,747,297 residues.
Protein sequence databases are classified as primary, secondary and composite depending
upon the content stored in them. PIR and SwissProt are primary databases that contain
protein sequences as 'raw' data. Secondary databases (like Prosite) contain the information
derived from protein sequences. Primary databases are combined and filtered to form non-redundant composite databases.
Genethon Genome Databases
PHYSICAL MAP: computation of the human genetic map using DNA fragments in the form
of YAC contigs. GENETIC MAP: production of micro-satellite probes and the localization of
chromosomes, to create a genetic map to aid in the study of hereditary diseases.
GENEXPRESS (cDNA): catalogue the transcripts required for protein synthesis obtained
from specific tissues, for example neuromuscular tissues.
21 Bdb: LBL's Human Chr 21 database:
This is a W3 interface to LBL's ACeDB-style database for Chromosome 21, 21Bdb, using the
ACeDB gateway software developed and provided by Guy Decoux at INRA.
MGD: The Mouse Genome Databases:
MGD is a comprehensive database of genetic information on the laboratory mouse. This
initial release contains the following kinds of information: Loci (over 15,000 current and
withdrawn symbols), Homologies (1300 mouse loci, 3500 loci from 40 mammalian species),
Probes and Clones (about 10,000), PCR primers (currently 500 primer pairs), Bibliography
(over 18,000 references), Experimental data (from 2400 published articles).
ACeDB (A Caenorhabditis elegans Database) :
Containing data from the Caenorhabditis Genetics Center (funded by the NIH National
Center for Research Resources), the C. elegans genome project (funded by the MRC and
NIH), and the worm community. Contacts: Mary O'Callaghan ([email protected])
and Richard Durbin.
ACeDB is also the name of the generic genome database software in use by an increasing
number of genome projects. The software, as well as the C. elegans data, can be obtained via
ftp.
ACeDB databases are available for the following species: C. elegans, Human Chromosome
21, Human Chromosome X, Drosophila melanogaster, mycobacteria, Arabidopsis, soybeans,
rice, maize, grains, forest trees, Solanaceae, Aspergillus nidulans, Bos taurus, Gossypium
hirsutum, Neurospora crassa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and
Sorghum bicolor.
MEDLINE:
MEDLINE is NLM's premier bibliographic database covering the fields of medicine, nursing,
dentistry, veterinary medicine, and the preclinical sciences. Journal articles are indexed for
MEDLINE, and their citations are searchable, using NLM's controlled vocabulary, MeSH
(Medical Subject Headings). MEDLINE contains all citations published in Index Medicus,
and corresponds in part to the International Nursing Index and the Index to Dental Literature.
Citations include the English abstract when published with the article (approximately 70% of
the current file).
Data quality - data quality has to be of the highest priority. However, because the
data services in most cases lack access to supporting data, the quality of the data must
remain the primary responsibility of the submitter.
Supporting data - database users will need to examine the primary experimental
data, either in the database itself, or by following cross-references back to network-accessible laboratory databases.
Deep annotation - deep, consistent annotation comprising supporting and ancillary
information should be attached to each basic data object in the database.
Timeliness - the basic data should be available on an Internet-accessible server within
days (or hours) of publication or submission.
Integration - each data object in the database should be cross-referenced to
representation of the same or related biological entities in other databases. Data
services should provide capabilities for following these links from one database or
data service to another.
The end user (the biologist) may not be a frequent user of computer technology.
These software tools must be made available over the Internet, given the global
distribution of the scientific research community.
Sequence analysis can also reveal features such as CpG islands and compositional biases. The
identification of these and other biological properties provides clues that aid the search to
elucidate the specific function of your sequence.
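A minimal sketch of how such a compositional feature might be detected follows: it slides a fixed-size window along a sequence and flags windows whose GC fraction and observed-to-expected CpG ratio exceed commonly cited thresholds. The window size and cutoff values here are the usual textbook figures, taken as assumptions rather than from these notes.

    def cpg_island_windows(seq, window=200):
        # Return start positions of windows that look CpG-island-like
        # (GC fraction > 0.5 and observed/expected CpG ratio > 0.6).
        hits = []
        for start in range(len(seq) - window + 1):
            w = seq[start:start + window]
            gc_fraction = (w.count("G") + w.count("C")) / window
            expected_cpg = w.count("C") * w.count("G") / window
            observed_cpg = w.count("CG")
            if expected_cpg and gc_fraction > 0.5 and observed_cpg / expected_cpg > 0.6:
                hits.append(start)
        return hits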
blastp compares an amino acid query sequence against a protein sequence database
blastn compares a nucleotide query sequence against a nucleotide sequence database
blastx compares a nucleotide query sequence translated in all reading frames against a
protein sequence database
tblastn compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames
tblastx compares the six-frame translations of a nucleotide query sequence against the
six-frame translations of a nucleotide sequence database.
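As a hedged example of running one of these searches from a script, the sketch below submits a nucleotide query to NCBI's BLAST service using Biopython's NCBIWWW.qblast; it assumes Biopython is installed and the NCBI service is reachable, and the toy query sequence is purely illustrative.

    from Bio.Blast import NCBIWWW, NCBIXML

    query = "AGCTAGCTAGCTAGGCTAGCTAGGCTAGCTA"        # toy query sequence
    handle = NCBIWWW.qblast("blastn", "nt", query)   # blastn against the nt database
    record = NCBIXML.read(handle)

    # Report the first few hits with their E-values.
    for alignment in record.alignments[:5]:
        best_hsp = alignment.hsps[0]
        print(alignment.title, best_hsp.expect)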
FASTA
FASTA ("FAST-All") is an alignment program for protein and nucleotide sequences created
by Pearson and Lipman in 1988. The program is one of the many heuristic algorithms
proposed to speed up sequence comparison. The basic idea is to add a fast prescreen step to
locate the highly matching segments between two sequences, and then extend these matching
segments to local alignments using more rigorous algorithms such as Smith-Waterman.
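To give a feel for the "more rigorous" dynamic-programming step mentioned above, here is a minimal pure-Python Smith-Waterman scorer (local alignment score only, with a simple match/mismatch/gap scheme and no traceback); the scoring values are arbitrary illustrative choices.

    def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
        # H[i][j] holds the best local-alignment score of any alignment
        # ending at positions i of a and j of b.
        rows, cols = len(a) + 1, len(b) + 1
        H = [[0] * cols for _ in range(rows)]
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                best = max(best, H[i][j])
        return best

    print(smith_waterman_score("ACACACTA", "AGCACACA"))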
EMBOSS
EMBOSS (European Molecular Biology Open Software Suite) is a software-analysis
package. It can work with data in a range of formats and also retrieve sequence data
transparently from the Web. Extensive libraries are also provided with this package, allowing
other scientists to release their software as open source. It provides a set of sequence-analysis
programs, and also supports all UNIX platforms.
Clustalw
It is a fully automated sequence alignment tool for DNA and protein sequences. It returns the
best match over a total length of input sequences, be it a protein or a nucleic acid.
RasMol
It is a powerful research tool to display the structure of DNA, proteins, and smaller
molecules. Protein Explorer, a derivative of RasMol, is an easier to use program.
PROSPECT
PROSPECT (PROtein Structure Prediction and Evaluation Computer ToolKit) is a protein-structure prediction system that employs a computational technique called protein threading
to construct a protein's 3-D model.
PatternHunter
PatternHunter, based on Java, can identify all approximate repeats in a complete genome in a
short time using little memory on a desktop computer. Its features are its advanced patented
algorithm and data structures, and the Java language used to create it. The Java language
version of PatternHunter is just 40 KB, only 1% the size of BLAST, while offering a large
portion of its functionality.
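PatternHunter's own spaced-seed algorithm is patented and considerably more sophisticated, but the underlying idea of indexing subwords can be sketched simply: the toy function below records every k-mer's positions and reports those that occur more than once (exact repeats only; the choice of k is an arbitrary illustrative value).

    from collections import defaultdict

    def exact_repeats(seq, k=12):
        # Index every k-mer by its start positions; k-mers seen at more than
        # one position are candidate repeats.
        index = defaultdict(list)
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append(i)
        return {kmer: positions for kmer, positions in index.items() if len(positions) > 1}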
COPIA
COPIA (COnsensus Pattern Identification and Analysis) is a protein structure analysis tool
for discovering motifs (conserved regions) in a family of protein sequences. Such motifs can
then be used to determine whether new protein sequences belong to the family, to predict the
secondary and tertiary structure and function of proteins, and to study the evolutionary history
of the sequences.
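COPIA's actual algorithm is not reproduced here, but the general notion of a consensus pattern can be illustrated with a toy sketch that takes a set of equal-length, pre-aligned sequences and reports the most common residue in each column; the input sequences are invented for the example.

    from collections import Counter

    def consensus(aligned_sequences):
        # Most common residue at each column of a pre-aligned set of sequences.
        return "".join(Counter(column).most_common(1)[0][0]
                       for column in zip(*aligned_sequences))

    print(consensus(["ACDEFG", "ACDQFG", "TCDEFG"]))  # prints ACDEFG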
Bioinformatics Projects
BioJava
The BioJava Project is dedicated to providing Java tools for processing biological data which
includes objects for manipulating sequences, dynamic programming, file parsers, simple
statistical routines, etc.
BioPerl
The BioPerl project is an international association of developers of Perl tools for
bioinformatics and provides an online resource for modules, scripts and web links for
developers of Perl-based software.
BioXML
A part of the BioPerl project, this is a resource to gather XML documentation, DTDs and
XML-aware tools for biology in one location.
Biocorba
Interface objects have facilitated interoperability between bioperl and other perl packages
such as Ensembl and the Annotation Workbench. However, interoperability between bioperl
and packages written in other languages requires additional support software. CORBA is one
such framework for interlanguage support, and the biocorba project is currently implementing
a CORBA interface for bioperl. With biocorba, objects written within bioperl will be able to
communicate with objects written in biopython and biojava (see the next subsection). For
more information, see the biocorba project website at http://biocorba.org/. The Bioperl
BioCORBA server and client bindings are available in the bioperl-corba-server and
bioperl-corba-client bioperl CVS repositories respectively (see http://cvs.bioperl.org/ for more
information).
Ensembl
Ensembl is an ambitious automated-genome-annotation project at EBI. Much of Ensembl's
code is based on bioperl, and Ensembl developers, in turn, have contributed significant pieces
of code to bioperl. In particular, the bioperl code for automated sequence annotation has been
largely contributed by Ensembl developers. Describing Ensembl and its capabilities is far
beyond the scope of this tutorial. The interested reader is referred to the Ensembl website at
http://www.ensembl.org/.
bioperl-db
Bioperl-db is a relatively new project intended to transfer some of Ensembl's capability of
integrating bioperl syntax with a standalone MySQL database (http://www.mysql.com/) to the
bioperl code-base. More details on bioperl-db can be found in the bioperl-db CVS directory
at http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-db/?cvsroot=bioperl. It is worth
mentioning that most of the bioperl objects mentioned above map directly to tables in the
bioperl-db schema. Therefore object data such as sequences, their features, and annotations
can be easily loaded into the databases, as in $loader->store($newid, $seqobj). Similarly, one
can query the database in a variety of ways and retrieve arrays of Seq objects. See
biodatabases.pod, Bio::DB::SQL::SeqAdaptor, Bio::DB::SQL::QueryConstraint, and
Bio::DB::SQL::BioQuery for examples.
A bioinformatics professional is typically expected to:
elucidate requirements
develop new algorithms
implement computer programs and tools for bio data analysis and display of results
design databases for bio data
participate in data analysis
Such individuals will gain employment in national and private research centres such as the
European Bioinformatics Institute (EBI) at Hinxton, Cambridge, and the European Media Lab,
academic institutions (e.g. the Laboratory of Molecular Biology (LMB) in Cambridge) as
well as in pharmaceutical companies (both multinational, for example GlaxoWellcome,
SmithKline Beecham, AstraZeneca, Roche, etc., and small but influential companies, for example
InPharmatica).
Furthermore, there is a growing demand for graduates with software engineering and
distributed systems skills to develop and maintain the powerful and sophisticated computer
systems that support research, development and production activities in bioinformatics. There
is a heavy use of internet and intranet technology within the sector in order to manage the
storage, analysis and retrieval of biodata whose attributes are
large volume
rapidly increasing volume
high complexity
Bioinformatics Applications
Molecular medicine
The human genome will have profound effects on the fields of biomedical research and
clinical medicine. Every disease has a genetic component. This may be inherited (as is the
case with an estimated 3000-4000 hereditary diseases, including cystic fibrosis and
Huntington's disease) or a result of the body's response to an environmental stress which
causes alterations in the genome (e.g. cancers, heart disease, diabetes).
The completion of the human genome means that we can search for the genes directly
associated with different diseases and begin to understand the molecular basis of these
diseases more clearly. This new knowledge of the molecular mechanisms of disease will
enable better treatments, cures and even preventative tests to be developed.
Personalised medicine
Clinical medicine will become more personalised with the development of the field of
pharmacogenomics. This is the study of how an individual's genetic inheritance affects the
body's response to drugs. At present, some drugs fail to make it to the market because a small
percentage of the clinical patient population show adverse affects to a drug due to sequence
variants in their DNA. As a result, potentially life saving drugs never makes it to the
marketplace. Today, doctors have to use trial and error to find the best drug to treat a
particular patient as those with the same clinical symptoms can show a wide range of
responses to the same treatment. In the future, doctors will be able to analyse a patient's
genetic profile and prescribe the best available drug therapy and dosage from the beginning.
Preventative medicine
With the specific details of the genetic mechanisms of diseases being unraveled, the
development of diagnostic tests to measure a person's susceptibility to different diseases may
become a distinct reality. Preventative actions, such as a change of lifestyle or having treatment
at the earliest possible stage, when it is more likely to be successful, could result in huge
advances in our struggle to conquer disease.
Gene therapy
In the not too distant future, the potential for using genes themselves to treat disease may
become a reality. Gene therapy is the approach used to treat, cure or even prevent disease by
changing the expression of a person's genes. Currently, this field is in its infancy, with
clinical trials ongoing for many different types of cancer and other diseases.
Drug development
At present all drugs on the market target only about 500 proteins. With an improved
understanding of disease mechanisms and using computational tools to identify and validate
new drug targets, more specific medicines that act on the cause, not merely the symptoms, of
the disease can be developed. These highly specific drugs promise to have fewer side effects
than many of today's medicines.
Microbial genome applications
Microorganisms are ubiquitous, that is they are found everywhere. They have been found
surviving and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are
present in the environment, our bodies, the air, food and water. Traditionally, use has been
made of a variety of microbial properties in the baking, brewing and food industries. The
arrival of the complete genome sequences and their potential to provide a greater insight into
the microbial world and its capacities could have broad and far reaching implications for
environment, health, energy and industrial applications. For these reasons, in 1994, the US
Department of Energy (DOE) initiated the MGP (Microbial Genome Project) to sequence
genomes of bacteria useful in energy production, environmental cleanup, industrial
processing and toxic waste reduction. By studying the genetic material of these organisms,
scientists can begin to understand these microbes at a very fundamental level and isolate the
genes that give them their unique abilities to survive under extreme conditions.
Waste cleanup
Deinococcus radiodurans is known as the world's toughest bacterium; it is the most
radiation-resistant organism known. Scientists are interested in this organism because of its
potential usefulness in cleaning up waste sites that contain radiation and toxic chemicals.
Climate change Studies
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil
fuels for energy, are thought to contribute to global climate change. Recently, the DOE
(Department of Energy, USA) launched a program to decrease atmospheric carbon dioxide
levels. One method of doing so is to study the genomes of microbes that use carbon dioxide
as their sole carbon source.
Alternative energy sources
Scientists are studying the genome of the microbe Chlorobium tepidum, which has an unusual
capacity for generating energy from light.
Biotechnology
The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential
for practical applications in industry and government-funded environmental remediation.
These microorganisms thrive in water temperatures above the boiling point and therefore may
provide the DOE, the Department of Defence, and private companies with heat-stable
enzymes suitable for use in industrial processes.
Other industrially useful microbes include Corynebacterium glutamicum, which is of high
industrial interest as a research object because it is used by the chemical industry for the
biotechnological production of the amino acid lysine. Lysine is one of the essential amino acids
in animal nutrition. Biotechnologically produced lysine is added to feed concentrates as a source
of protein, and is an alternative to soybeans or meat and bone meal.
Xanthomonas campestris pv. is grown commercially to produce the exopolysaccharide
xanthan gum, which is used as a viscosifying and stabilising agent in many industries.
Lactococcus lactis is one of the most important micro-organisms involved in the dairy
industry; it is a non-pathogenic bacterium that is critical for manufacturing dairy
products like buttermilk, yogurt and cheese. This bacterium, Lactococcus lactis ssp., is also
used to prepare pickled vegetables, beer, wine, some bread and sausages and other fermented
foods. Researchers anticipate that understanding the physiology and genetic make-up of this
bacterium will prove invaluable for food manufacturers as well as the pharmaceutical
industry, which is exploring the capacity of L. lactis to serve as a vehicle for delivering drugs.
Antibiotic resistance
Scientists have been examining the genome of Enterococcus faecalis, a leading cause of
bacterial infection among hospital patients. They have discovered a virulence region made up
of a number of antibiotic-resistance genes that may contribute to the bacterium's transformation
from a harmless gut bacterium to a menacing invader. The discovery of the region, known as a
pathogenicity island, could provide useful markers for detecting pathogenic strains and help
to establish controls to prevent the spread of infection in wards.
Forensic analysis of microbes
Scientists used their genomic tools to help distinguish the strain of Bacillus anthracis that was
used in the 2001 terrorist attack in Florida from closely related anthrax strains.
The reality of bioweapon creation
Scientists have recently built poliovirus, the virus that causes poliomyelitis, using entirely artificial means. They did
this using genomic data available on the Internet and materials from a mail-order chemical
supply. The research was financed by the US Department of Defense as part of a biowarfare
response program to prove to the world the reality of bioweapons. The researchers also hope
their work will discourage officials from ever relaxing programs of immunisation. This
project has been met with very mixed feelings.
Evolutionary studies
The sequencing of genomes from all three domains of life (eukaryotes, bacteria and archaea)
means that evolutionary studies can be performed in a quest to determine the tree of life and
the last universal common ancestor.
Crop improvement
Comparative genetics of plant genomes has shown that the organisation of their genes has
remained more conserved over evolutionary time than was previously believed. These
findings suggest that information obtained from model crop systems can be used to
suggest improvements to other food crops. At present the complete genomes of Arabidopsis
thaliana (thale cress) and Oryza sativa (rice) are available.
Insect resistance
Genes from Bacillus thuringiensis that can control a number of serious pests have been
successfully transferred to cotton, maize and potatoes. This new ability of the plants to resist
insect attack means that the amount of insecticide used can be reduced and hence the
nutritional quality of the crops increased.
Improve nutritional quality
Scientists have recently succeeded in transferring genes into rice to increase levels of Vitamin
A, iron and other micronutrients. This work could have a profound impact in reducing
occurrences of blindness and anaemia caused by deficiencies in Vitamin A and iron
respectively. Scientists have also inserted a gene from yeast into the tomato; the result is a
plant whose fruit stays longer on the vine and has an extended shelf life.
Development of drought-resistant varieties
Progress has been made in developing cereal varieties that have a greater tolerance for soil
alkalinity, free aluminium and iron toxicities. These varieties will allow agriculture to
succeed in poorer soil areas, thus adding more land to the global production base. Research is
also in progress to produce crop varieties capable of tolerating reduced-water conditions.
Veterinary Science
Sequencing projects of many farm animals including cows, pigs and sheep are now well
under way in the hope that a better understanding of the biology of these organisms will have
huge impacts for improving the production and health of livestock and ultimately have
benefits for human nutrition.
Comparative Studies
Analysing and comparing the genetic material of different species is an important method for
studying the functions of genes, the mechanisms of inherited diseases and species evolution.
Bioinformatics tools can be used to make comparisons between the numbers, locations and
biochemical functions of genes in different organisms.
Organisms that are suitable for use in experimental research are termed model organisms.
They have a number of properties that make them ideal for research purposes including short
life spans, rapid reproduction, being easy to handle, inexpensive and they can be manipulated
at the genetic level.
An example of a model organism for human biology is the mouse. Mouse and human are
closely related, and for the most part there is a one-to-one correspondence between genes in
the two species. Manipulation of the mouse at the molecular level, together with genome
comparisons between the two species, can reveal, and is revealing, detailed information on
the functions of human genes, the evolutionary relationship between the two species and the
molecular mechanisms of many human diseases.
Fig. Application areas of bioinformatics
Bioinformatics in India
Studies by IDC point out that India will be a potential star in the bioscience field in the
coming years, considering factors such as bio-diversity, human resources, infrastructure
facilities and government initiatives. According to IDC, bioscience includes pharma, Bio-IT
(bioinformatics), agriculture and R&D. IDC has reported that pharmaceutical firms and
research institutes in India are looking for cost-effective, high-quality research, development
and manufacturing of drugs, delivered with greater speed.
Bioinformatics has emerged out of the inputs from several different areas such as biology,
biochemistry, biophysics, molecular biology, biostatistics, and computer science. Specially
designed algorithms and organized databases are the core of all informatics operations. The
requirements for such an activity make heavy, high-level demands on both hardware and
software capabilities.
This sector is the quickest-growing field in the country. The rapid growth is because of the
linkages between IT and biotechnology, spurred by the human genome project. Promising
start-ups are already operating in Bangalore, Hyderabad, Pune, Chennai, and Delhi; over 200
companies are functioning in these places. IT majors such as Intel, IBM and Wipro are
entering this segment, spurred by the promise of technological developments.
Government initiatives
Informatics is an essential component of the biotech revolution. From referring to type-culture
collections to comparing gene sequences, access to comprehensive, up-to-date biological
information is crucial in almost every aspect of biotechnology. India, as a hub of scientific
and academic research, was one of the first countries in the world to establish a nationwide
bioinformatics network.
The Department of Biotechnology (DBT) initiated the program on bioinformatics as far back as
1986-87. The Biotechnology Information System Network (BTIS), a division of DBT, has
covered the entire country by connecting 57 key research centers. BTIS provides scientists
with easy access to huge databases. Six national facilities on interactive graphics are
dedicated to molecular modeling and other related areas. More than 100 databases on
biotechnology have been developed. Two major databases, namely the coconut biotechnology
database and the complete genome of the white spot syndrome virus of shrimp, have been
released for public use. Several major international databases for application to genomics and
proteomics have been established in the form of mirror sites under the National Jai Vigyan Mission.
The BTIS proposes to increase the bandwidth of the existing network and provide high-speed
internet connectivity, and to continue its present activities of training, education, mirroring of
public-utility packages, consideration of R&D projects and support to different research
activities in this field. The DBT is planning to set up a National Bioinformatics Institute as an
apex body for the bioinformatics network in the country. The DBT also proposes to bolster
proper education in bioinformatics through the publication of textbooks and monographs by
reputed scientists in the field. Collaboration with industry is also poised to increase in the
coming years.
Opportunities
According to the Confederation of Indian Industry (CII), the global bioinformatics industry
clocked an estimated turnover of $2 billion in 2000 and is expected to reach $60 billion by
2005. If industry and government work together, it is possible to achieve a five percent
global market share by 2005, i.e., a $3 billion opportunity for India.
The past two years have seen many large multinational pharmaceutical companies acquiring
smaller companies and expanding in the biosciences sector. IDC currently forecasts a
compound annual growth rate (from 2001-02 to 2004-05) of about 10 percent in spending
on information technology by bioscience organizations. Considering that the local market is
generally less mature than those in the US and Europe, IDC forecasts more aggressive growth
beyond 2005, as many of these organizations attempt to play "catch-up". Enterprise
applications including data warehousing, knowledge management and storage are being
pursued by these companies as priorities.
IDC expects IT spending in biosciences in India to cross $138 million by 2005, mainly in
the areas of system clusters, storage, application software, and services. The government's
life-science focus also provides much of the backbone needed to develop and deliver
innovative products and technologies. This focus will also help to build fast-growing and
lucrative enterprises, attract international investment, and create additional high-value
employment opportunities.
Hence the focus of the IT sector should be on products and services that align with bioscience
needs. Demonstrating a true understanding of the IT requirements of biotechnology processes
is the key for IT suppliers to bridge the chasm that currently exists between IT and Science.
Advantages India has
India is well placed to take global leadership in genome analysis, as it is in a unique
position in terms of genetic resources. India has several ethnic populations that are valuable
in providing information about disease predisposition and susceptibility, which in turn will
help in drug discovery.
However, India lacks records of clinical information about patients, and sequence data
without clinical information will have little meaning; hence partnership with clinicians is
essential. The real money is in discovering new drugs for ourselves, not in supplying genetic
information and data to foreign companies, which would then use this information to
discover new molecules.
The genomic data provides information about the sequence, but it doesn't give information
about the function. It is still not possible to predict the actual 3-D structure of proteins. This is
a key area of work, as tools to predict correct folding patterns of proteins will help drug
design research substantially. India has the potential to lead if it invests in this area.
Given this, biotech and pharma companies need tremendous software support. Software
expertise is required to write algorithms, develop software for existing algorithms, manage
databases, and support the final process of drug discovery. One issue is the limited
exposure to the IT side of bioinformatics, which is very important. Another issue is that some
companies face a shortage of funds and infrastructure. The turnaround time for an average
biotech company to break even would be around three to five years.
Most venture capitalists and other sources of funding would not be very supportive,
especially if the company is not part of a larger group venture. It would help if the
government took an active role in building infrastructure and funding small and medium
entrepreneurs.
Biotechnology Information Systems (BTIS)
The objective of setting up BTIs
Proteomics-Introduction
Definition: the analysis of complete complements of proteins. Proteomics includes not only
the identification and quantification of proteins, but also the determination of their
localization, modifications, interactions, activities and, ultimately, their function. Initially
encompassing just two-dimensional (2D) gel electrophoresis for protein separation and
identification, proteomics now refers to any procedure that characterizes large sets of
proteins. The explosive growth of this field is driven by multiple forces: genomics and its
revelation of more and more new proteins; powerful protein technologies such as newly
developed mass spectrometry approaches, global (yeast) two-hybrid techniques, and spin-offs
from DNA arrays; and innovative computational tools and methods to process, analyze,
and interpret prodigious amounts of data.
The theme of molecular biology research, in the past, has been oriented around the gene
rather than the protein. This is not to say that researchers have neglected to study proteins, but
rather that the approaches and techniques most commonly used have looked primarily at the
nucleic acids and then later at the protein(s) implicated.
The main reason for this has been that the technologies available, and the inherent
characteristics of nucleic acids, have made the genes the low-hanging fruit. This situation has
changed recently and continues to change as larger-scale, higher-throughput methods are
developed for both nucleic acids and proteins. The majority of processes that take place in a
cell are not performed by the genes themselves, but rather by the proteins that they code for.
A disease can arise when a gene/protein is over- or under-expressed, when a mutation in a
gene results in a malformed protein, or when post-translational modifications alter a protein's
function. Thus, to truly understand a biological process, the relevant proteins must be studied
directly. But there are more challenges in studying proteins than in studying genes, because a
protein's complex 3-D structure, much like the design of a machine, determines its function.
Proteomics is defined as the systematic large-scale analysis of protein expression under
normal and perturbed (stressed, diseased, and/or drugged) states, and generally involves the
separation, identification, and characterization of all of the proteins in a cell or tissue
sample. The meaning of the term has also been expanded, and is now used loosely to refer to
the approach of analyzing which proteins a particular type of cell synthesizes, how much the
cell synthesizes, how cells modify proteins after synthesis, and how all of those proteins
interact.
There are orders of magnitude more proteins than genes in an organism: based on alternative
splicing (several transcripts per gene) and post-translational modifications (over 100 kinds are
known), proteins are estimated to number a million or more.
Fortunately there are features such as folds and motifs, which allow them to be categorized
into groups and families, making the task of studying them more tractable. There is a broad
range of technologies used in proteomics, but the central paradigm has been the use of 2-D
gel electrophoresis (2D-GE) followed by mass spectrometry (MS). 2D-GE is used to first
separate the proteins by isoelectric point and then by size.
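As a rough illustration of the two coordinates of a 2D-GE spot, the short Python sketch below
estimates the isoelectric point and molecular weight of a protein sequence. It assumes the
Biopython library is installed, and the sequence shown is simply a made-up example, not data
from any particular experiment.

# Rough sketch: estimating the two coordinates of a 2D-GE spot, i.e. the
# isoelectric point (pI) and the molecular weight, for a protein sequence.
# Requires Biopython; the sequence is an invented example.
from Bio.SeqUtils.ProtParam import ProteinAnalysis

protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"

analysis = ProteinAnalysis(protein)
print("Isoelectric point (pI): %.2f" % analysis.isoelectric_point())
print("Molecular weight (Da):  %.1f" % analysis.molecular_weight())

A spot on a 2-D gel sits, roughly speaking, at this (pI, mass) coordinate.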
The individual proteins are subsequently removed from the gel and prepared, then analyzed
by MS to determine their identity and characteristics. There are various types of mass
analyzers used in proteomics MS including quadrupole, time-of-flight (TOF), and ion trap,
and each has its own particular capabilities. Tandem arrangements are often used, such as
quadrupole-TOF, to provide more analytical power. The recent development of soft
ionization techniques, namely matrix-assisted laser desorption ionization (MALDI) and
electro-spray ionization (ESI), has allowed large biomolecules to be introduced into the mass
analyzer without completely decomposing their structures, or even without breaking them at
all, depending on the design of the experiment.
There are techniques which incorporate liquid chromatography (LC) with MS, and others that
use LC by itself. Robotics has been applied to automate several steps in the 2D-GE-MS
process, such as spot excision and enzyme digestion. To determine a protein's structure,
X-ray diffraction (XRD) and NMR techniques are being improved to reach higher throughput
and better performance.
For example, automated high-throughput crystallization methods are being used upstream of
XRD to alleviate that bottleneck. For NMR, cryo-probes and flow probes shorten analysis
time and decrease sample volume requirements. The hope is that determining about 10,000
protein structures will be enough to characterize the estimated 5,000 or so folds, which will
feed into more reliable in silico structural prediction methods.
Structure by itself does not provide all of the desired information, but is a major step in the
right direction. Protein chips are being developed for many of the processes in proteomics.
For example, researchers are developing protocols for protein microarrays at institutions such
as Harvard and Stanford as well as at several companies. These chips - grids of attached
peptide fragments, attached antibodies, or gel "pads" with proteins suspended inside - will be
used for various experiments such as protein-protein interaction studies and differential
expression analysis.
They can also be used to filter out high abundance proteins before further experiments; one of
the major challenges in proteomics is isolating and analyzing the low abundance proteins,
which are thought to be the most important. There are many other types of protein chips, and
the number will continue to grow. For example, microfluidics chips can combine the sample
preparation steps prior to MS, such as enzyme digests, with nanoelectrospray ionization, all
on the one chip. Or, the samples can be ionized directly off of the surface of the chip, similar
to a MALDI target. Microfluidics chips are also being combined with NMR.
In the next few years, various protein chips will be used increasingly in diagnostic
applications as well. The bioinformatics side of proteomics includes both databases and
analysis software. There are many public and private databases containing protein data
ranging from sequences, to functions, to post translational modifications. Typically, a
researcher will first perform 2D-GE followed by MS; this will result in a fingerprint,
molecular weight, or even sequence for each protein of interest, which can then be used to
query databases for similarities or other information.
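To make the idea of a peptide-mass fingerprint concrete, the hedged sketch below performs a
crude in-silico tryptic digest (using the simplified rule "cleave after K or R, unless the next
residue is P") and computes the mass of each resulting peptide with Biopython. The protein
sequence and the cleavage rule are simplifications for illustration only; real search engines
model the chemistry much more carefully.

# Rough sketch of an in-silico tryptic digest. Trypsin is modelled with the
# simplified rule "cleave after K or R, but not before P"; peptide masses
# are computed with Biopython's ProteinAnalysis. The sequence is invented.
import re
from Bio.SeqUtils.ProtParam import ProteinAnalysis

protein = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFK"

# Split after K or R unless the next residue is P (simplified trypsin rule).
peptides = re.split(r"(?<=[KR])(?!P)", protein)

for pep in peptides:
    if pep:  # ignore empty fragments produced at the ends
        mass = ProteinAnalysis(pep).molecular_weight()
        print("%-20s %8.1f Da" % (pep, mass))

The resulting list of peptide masses is the kind of "fingerprint" that can be matched against
the theoretical digests stored for every entry in a protein database.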
Swiss-Prot and TrEMBL, developed in collaboration between the Swiss Institute of
Bioinformatics and the European Bioinformatics Institute, are currently the major databases
dedicated to cataloging protein data, but there are dozens of more specialized databases and
tools. New bioinformatics approaches are constantly being introduced. Recent customized
versions of PSI-BLAST can, for example, utilize not only the curated protein entries in
Swiss-Prot but also linguistic analyses of biomedical journal articles to help determine
protein family relationships. Publicly available databases and tools are popular, but there are
also several companies offering subscriptions to proprietary databases, which often include
protein-protein interaction maps generated using the yeast two-hybrid (Y2H) system.
The proteomics market comprises instrument manufacturers, bioinformatics companies,
laboratory product suppliers, service providers, and other biotech-related companies that can
defy categorization. A given company can often span more than one of these areas.
Many of the companies involved in the proteomics market are actually doing drug discovery
as their major focus, while partnering, or providing services or subscriptions, to other
companies to generate short term revenues. The market for proteomics products and services
was estimated to be $1.0B in 2000, growing at a CAGR of 42% to about $5.8B in 2005.
The major drivers will continue to be the biopharmaceutical industry's pursuit of blockbuster
drugs and the recent technological advances, which have allowed large-scale studies of genes
and proteins. Alliances are becoming increasingly important in this field, because it is
challenging for companies to find all of the necessary expertise to cover the different
activities involved in proteomics. Synergies must be created by combining forces. For
example, many companies working with mass spectrometry, both the manufacturers and end
user labs, are collaborating with protein chip related companies. The technologies are a
natural fit for many applications, such as microfluidic chips, which provide nanoelectrospray
ionization into a mass spectrometer.
There are many combinations of diagnostics, instrumentation, chip, and bioinformatics
companies which create effective partnerships. In general, proteomics appears to hold great
promise in the pursuit of biological knowledge. There has been a general realization that the
large-scale approach to biology, as opposed to the strictly hypothesis-driven approach, will
rapidly generate much more useful information.
The two approaches are not mutually exclusive, and the happy medium seems to be the
formation of broad hypotheses, which are subsequently investigated by designing large-scale
experiments and selecting the appropriate data. Proteomics and genomics, and other varieties
of 'omics', will all continue to complement each other in providing the tools and information
for this type of research.
Microarrays
Microarray: A 2D array, typically on a glass, filter, or silicon wafer, upon which genes or
gene fragments are deposited or synthesized in a predetermined spatial order allowing them
to be made available as probes in a high-throughput, parallel manner.
Microarrays that consist of ordered sets of DNA fixed to solid surfaces provide
pharmaceutical firms with a means to identify drug targets. In the future, the emerging
technology promises to help physicians decide the most effective drug treatments for
individual patients.
Microarrays are simply ordered sets of DNA molecules of known sequence. Usually
rectangular, they can consist of a few hundred to hundreds of thousands of features. Each
individual feature goes onto the array at a precisely defined location on the substrate. The
identity of the DNA molecule fixed to each feature never changes. Scientists use that fact in
calculating their experimental results.
Microarray analysis permits scientists to detect thousands of genes in a small sample
simultaneously and to analyze the expression of those genes. As a result, it promises to
enable biotechnology and pharmaceutical companies to identify drug targets - the proteins
with which drugs actually interact. Since it can also help identify individuals with similar
biological patterns, microarray analysis can assist drug companies in choosing the most
appropriate candidates for participating in clinical trials of new drugs. In the future, this
emerging technology has the potential to help medical professionals select the most effective
drugs, or those with the fewest side effects, for individual patients.
Potential of Microarray analysis
The academic research community stands to benefit from microarray technology just as much
as the pharmaceutical industry. The ability to use it in place of existing technology will allow
researchers to perform experiments faster and more cheaply, and will enable them to
concentrate on analyzing the results of microarray experiments rather than simply performing
the experiments. This research could then lead to a better understanding of the disease
process. That will require many different levels of research. While the field of expression has
received most attention so far, looking at the gene copy level and protein level is just as
important. Microarray technology has potential applications in each of these three levels.
Identifying drug targets provided the initial market for the microarrays. A good drug target
has extraordinary value for developing pharmaceuticals. By comparing the ways in which
genes are expressed in a normal and a diseased heart, for example, scientists might be able to
identify the genes -- and hence the associated proteins -- that are part of the disease process.
Researchers could then use that information to synthesize drugs that interact with these
proteins, thus reducing the disease's effect on the body.
The expression of thousands of genes can be measured simultaneously and calculated instantly
when an ordered set of DNA molecules of known sequence -- a microarray -- is used.
Consequently, scientists can evaluate an entire set of genes at once, rather than looking at
physiological changes one gene
at a time. For example, Genetics Institute, a biotechnology company in Cambridge,
Massachusetts, built an array consisting of genes for cytokines, which are proteins that affect
cell physiology during the inflammatory response, among other effects. The full set of DNA
molecules contained more than 250 genes. While that number was not large by current
standards of microarrays, it vastly outnumbered the one or two genes examined in typical
pre-microarray experiments. The Genetics Institute scientists used the array to study how
changes experienced by cells in the immune system during the inflammatory response are
reflected in the behavior of all 250 genes at the same time. This experiment established the
potential for using the patterns of response to help locate points in the body at which drugs
could prove most effective.
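As a toy illustration of the kind of comparison described above (normal versus diseased
expression), the sketch below turns invented two-channel intensities into log2 ratios and flags
genes whose expression changes at least two-fold. The gene names, numbers and cut-off are all
made up; real analyses also involve background correction, normalisation and statistics.

# Toy sketch: comparing expression between "normal" and "diseased" samples on
# a two-channel array. Intensities and gene names are invented examples.
import math

# (gene, intensity in normal sample, intensity in diseased sample)
spots = [
    ("cytokine_A", 1200.0,  150.0),
    ("cytokine_B",  300.0,  310.0),
    ("cytokine_C",   80.0,  950.0),
]

for gene, normal, diseased in spots:
    log_ratio = math.log2(diseased / normal)   # >0: up in disease, <0: down
    flag = "differentially expressed" if abs(log_ratio) >= 1.0 else ""
    print("%-12s log2 ratio = %+5.2f  %s" % (gene, log_ratio, flag))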
Microarray Products
Within that basic technological foundation, microarray companies have created a variety of
products and services. They range in price, and involve several different technical
approaches. A kit containing a simple array with limited density can cost as little as $1,100,
while a versatile system favored by R&D laboratories in pharmaceutical and biotechnology
companies costs more than $200,000. The differences among products lie in the basic
components and the precise nature of the DNA on the arrays.
The type of molecule placed on the array units also varies according to circumstances. The
most commonly used molecule is cDNA, or complementary DNA, which is derived from
messenger RNA and cloned. Since each is derived from a distinct messenger RNA, each
feature represents an expressed gene.
Microarray-Identifying interactions
To detect interactions at microarray features, scientists must label the test sample in such a
way that an appropriate instrument can recognize it. Since the minute size of microarray
features limits the amount of material that can be located at any feature, detection methods
must be extremely sensitive.
Other than a few low-end systems that use radioactive or chemiluminescent tagging, most
microarrays use fluorescent tags as their means of identification. These labels can be
delivered to the DNA units in several different ways. One simple and flexible approach
involves attaching a fluorophore such as fluorescein or Cy3 to the oligonucleotide layer.
While relatively simple, this approach has low sensitivity because it delivers only one unit of
label per interaction. Technologists can achieve more sensitivity by multiplexing the labeled
entity -- that is, delivering more than one unit of label per interaction.
Biotech research generates such massive volumes of information, so quickly that we need
newer and swifter ways of crunching the deluge of data (much of it hopelessly jumbled)
churned out by the research labs. Mining, digitising and indexing this enormous quantum
of data requires high-end computational technology and sophisticated software solutions.
There is no such thing as a typical career path in this field. Bioinformaticians need to
perform two critical roles: develop IT tools embodying novel algorithms and analytical
techniques, and apply existing tools to achieve new insights into molecular biology.
However, you must remember that although powerful and highly specialised in itself,
bioinformatics is only a part of biotechnology. Getting a DNA sequence coding for a new
protein does not automatically make it useful. Unless this information is converted into
useful processes and products, it serves no purpose. You can not, for instance, have a
virtual drug or a virtual vaccine. We need real products. And we need to develop our own
new molecules (particularly if we have to survive in the new IPR regime).
Different types of career opportunities are available for students from different streams:
Life Sciences:
Scientific Curator, Gene Analyst, Protein Analyst, Phylogeneticist, Research Scientist /
Associate.
Computer Science / Engineering:
Database programmer, Bioinformatics software developer, Computational biologist,
Network Administrator / Analyst.
Applied Science:
Structural analyst, Molecular Modeler, Bio-statistician, Bio-mechanics, Database
programmer.
Pharmaceutical Science:
Cheminformatician, Pharmacogenetician, Pharmacogenomics, Research Scientist /
Associate.
According to scientists working at companies such as Celera Genomics and Eli Lilly,
the following are "core requirements" for bioinformaticians:
>>A fairly deep background in some aspect of molecular biology. It can be biochemistry,
molecular biology, molecular biophysics, or even molecular modeling, but without a core
knowledge of molecular biology you will "run into brick walls too often."
>>Understanding the central dogma of molecular biology -- how and why DNA sequence
is transcribed into RNA and translated into protein -- is vital.
>>Should have substantial experience with at least one or two major molecular biology
software packages, either for sequence analysis or molecular modeling. The experience of
learning one of these packages makes it much easier to learn to use other software
quickly.
>>Should be comfortable working in a command-line computing environment. Working
in Linux or UNIX will provide this experience.
>>Should have experience with programming in languages such as Java, C, C++, Perl or
Python; with an RDBMS such as Oracle or Sybase; and with CORBA, CGI and web
scripting.
Source:Extracted from the book "Developing Bioinformatics Computer Skills" by Cynthia
Gibbs & Per Jambeck, O'Reilly & Associates, Inc
India:
Starting with a package of Rs 12,000 to Rs 15,000, you can expect Rs 20,000 with a
couple of years of experience under your belt. In fact, the acute shortage of experts in the
field has given rise to active poaching of scientists from premier research institutions. The
going price for bioinformaticians with a year's experience is upwards of Rs 50,000 per
month.
Abroad:
Starting salaries in the USA range between $60,000 and $ 90,000 for professionals with a
couple of years of experience.
Biostatistics / Regulatory Affairs
* Associate: $51,500
* Senior associate: $65,000
* Manager: $85,000
* Clinical research physician: $90,000-$200,000
* Senior laboratory technician: $34,000
* Junior laboratory technician: $21,715
* Associate: $52,000
* Senior associate: $76,000
Quality Assurance
* BS: $29,000
* MS: $34,450
* PhD: $45,700
* Specialist: $54,500
* Engineer: $58,000
* President/General manager: $94,000-$400,000
Median earnings in industries employing the greatest number of biological and medical
scientists in 1997 were:
*Federal government: $48,600
*Pharmaceuticals: $46,300
*Research and testing services: $40,800
*State government: $38,000
Sources: Stax Research (1999), Abbott, Langer & Associates (1999), and U.S. Bureau of
Labor Statistics
they are cross-trained in computing skills. They can also be Business Analysts for life science
companies.
The answer depends on whether you are talking to a computer scientist who does
biology or a molecular biologist who does computing. Most of what you will read in the
popular press is that the importance of interdisciplinary scientists cannot be over-stressed
and that the young people getting the top jobs in the next few years will be those
graduating from truly interdisciplinary programs.
However, there are many types of bioinformatics jobs available, so no one background is
ideal for all of them. The fact is that many of the jobs available currently involve the
design and implementation of programs and systems for the storage, management and
analysis of vast amounts of DNA sequence data. Such positions require in-depth
programming and relational database skills which very few biologists possess and so it is
largely the computational specialists who are filling these roles.
This is not to say the computer-savvy biologist doesn't play an important role. As the
bioinformatics field matures there will be a huge demand for outreach to the biological
community as well as the need for individuals with the in-depth biological background
necessary to sift through gigabases of genomic sequence in search of novel targets. It will
be in these areas that biologists with the necessary computational skills will find their
niche.
Bioinformatics combines the tools and techniques of mathematics, computer science and
biology in order to understand the biological significance of a variety of data. So if you
would like to get into this new scientific field, you should be fond of these classic
disciplines. Because the field is so new, almost everyone in it did something else before.
Some biologists went into bioinformatics by picking up programming, while others entered
via the reverse route.
History of Bioinformatics
Bioinformatics is the application of computer technology to the management of biological
information. Computers are used to gather, store, analyze and integrate biological and genetic
information which can then be applied to gene-based drug discovery and development. The
need for Bioinformatics capabilities has been precipitated by the explosion of publicly
available genomic information resulting from the Human Genome Project. The goal of this
project - determination of the sequence of the entire human genome (approximately three
billion base pairs) - will be reached by the year 2002. The science of Bioinformatics, which is
the melding of molecular biology with computer science, is essential to the use of genomic
information in understanding human diseases and in the identification of new molecular
targets for drug discovery. In recognition of this, many universities, government institutions
and pharmaceutical firms have formed bioinformatics groups, consisting of computational
biologists and bioinformatics computer scientists. Such groups will be key to unraveling the
mass of information generated by large scale sequencing efforts underway in laboratories
around the world.
Modern bioinformatics can be classified into two broad categories: biological science and
computational science. Here is a timeline of historical events for both biology and computer
science.
The history of biology before the discovery of genetic inheritance by G. Mendel in 1865 is
extremely sketchy and inaccurate; Mendel's work marks the start of the history of
bioinformatics. Gregor Mendel is known as the "Father of Genetics". He experimented on the
cross-fertilization of plants of the same species that differed in flower colour, and he carefully
recorded and analyzed the data. Mendel showed that the inheritance of traits could be more
easily explained if it was controlled by factors passed down from generation to generation.
The understanding of genetics has advanced remarkably in the last thirty years. In 1972, Paul
Berg made the first recombinant DNA molecule using ligase. In that same year, Stanley
Cohen, Annie Chang and Herbert Boyer produced the first recombinant DNA organism. In
1973, two important things happened in the field of genomics. The advancement of
computing in the 1960s and 1970s provided the basic methodology of bioinformatics;
however, it was in the 1990s, when the Internet arrived, that the full-fledged bioinformatics
field was born. Here are some of the major events in bioinformatics over the last several
decades; some of the events listed occurred long before the term "bioinformatics" was coined.
BioInformatics Events
1665 Robert Hooke published Micrographia, described the cellular structure of cork. He
also described microscopic examinations of fossilized plants and animals, comparing
their microscopic structure to that of the living organisms they resembled. He argued
for an organic origin of fossils, and suggested a plausible mechanism for their
formation.
1683 Antoni van Leeuwenhoek discovered bacteria.
1686 John Ray, John Ray's in his book "Historia Plantarum" catalogued and described
18,600 kinds of plants. His book gave the first definition of species based upon
common descent.
1843 Richard Owen elaborated the distinction of homology and analogy.
1864 Ernst Haeckel (Häckel) outlined the essential elements of modern zoological
classification.
1865 Gregor Mendel (1822-1884), Austria, established the theory of genetic inheritance.
1902 The chromosome theory of heredity is proposed by Sutton and Boveri, working
independently.
1905 The word "genetics" is coined by William Bateson.
1913 First ever linkage map created by Columbia undergraduate Alfred Sturtevant (working
with T.H. Morgan).
1930 A new technique, electrophoresis, is introduced by Tiselius (Uppsala University,
Sweden) for separating proteins in solution ("The moving-boundary method of studying the
electrophoresis of proteins", published in Nova Acta Regiae Societatis Scientiarum
Upsaliensis, Ser. IV, Vol. 7, No. 4).
1946 Genetic material can be transferred laterally between bacterial cells, as shown by
Lederberg and Tatum.
1952 Alfred Day Hershey and Martha Chase proved that the DNA alone carries genetic
information. This was proved on the basis of their bacteriophage research.
1961 Sidney Brenner, François Jacob and Matthew Meselson identify messenger RNA.
1962 Pauling's theory of molecular evolution
1965 Margaret Dayhoff's Atlas of Protein Sequences
1970 Needleman-Wunsch algorithm
1977 DNA sequencing and software to analyze it (Staden)
1981 Smith-Waterman algorithm developed
1981 The concept of a sequence motif (Doolittle)
1982 GenBank Release 3 made public
1982 Phage lambda genome sequenced
1983 Sequence database searching algorithm (Wilbur-Lipman)
1985 FASTP/FASTN: fast sequence similarity searching
1988 National Center for Biotechnology Information (NCBI) created at NIH/NLM
1988 EMBnet network for database distribution
1990 BLAST: fast sequence similarity searching
1991 EST: expressed sequence tag sequencing
1993 Sanger Centre, Hinxton, UK
1994 EMBL European Bioinformatics Institute, Hinxton, UK
1995 First bacterial genomes completely sequenced
1996 Yeast genome completely sequenced
1997 PSI-BLAST
1998 Worm (multicellular) genome completely sequenced
1999 Fly genome completely sequenced
2000 Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization
of metabolic networks. Nature 2000 Oct 5;407(6804):651-4, PubMed
2000 The genome for Pseudomonas aeruginosa (6.3 Mbp) is published.
2000 The A. thaliana genome (100 Mb) is sequenced.
2001 The human genome (3 Giga base pairs) is published.
Protein Folding
Proteins are the biological molecules that are the building blocks of cells and organs, and the
biochemical processes required to keep living organisms alive are catalyzed and regulated by
proteins called enzymes. Proteins are linear polymers of amino acids that fold into
conformations dictated by the physical and chemical properties of the amino acid chain. The
biological function of a protein is dependent on the protein folding into the correct, or
"native", state. Protein folding is usually a spontaneous process, and often when a protein
unfolds because of heat or chemical denaturation, it will be capable of refolding into the
correct conformation as soon as it is removed from the environment of the denaturant.
Protein folding can go wrong for many reasons. When an egg is boiled, the proteins in the
white unfold and misfold into a solid mass of protein that will not refold or redissolve. This
type of irreversible misfolding is similar to the insoluble protein deposits found in certain
tissues that are characteristic of some diseases, such as Alzheimer's disease. These protein
deposits are aggregates of proteins folded into the wrong shapes.
Determining the process by which proteins fold into particular shapes, characteristic of their
amino acid sequence, is commonly called "the protein folding problem", an area of study at
the forefront of computational biology. One approach to studying the protein folding process
is the application of statistical mechanics techniques and molecular dynamics simulations to
the study of protein folding.
The Stanford University Folding@home project has simulated protein folding with atomistic
detail in an implicit solvent model by using a large scale distributed computing project that
allows timescales thousands to millions of times longer than previously achievable with a
model of this detail. Look at the menu on the left border of the Stanford Folding@home web
page. Click on the "Science" link to read the scientific background behind the protein folding
distributed computing project. (Q1) What are the 3 functions of proteins that are mentioned in
the "What are proteins?" section of the scientific background? (Q2) What are 3 diseases that
are believed to result from protein misfolding? (Q3) What are typical timescales for
molecular dynamics simulations? (Q4) What are typical timescales at which the fastest
proteins fold? (Q5) How does the Stanford group break the microsecond barrier with their
simulations?
Return to the Stanford Folding@home home page. Click on the "Results" link in the left
border of the web page. Look at the information on the folding simulations of the villin
headpiece. (Q6) How many amino acids are in the simulated villin headpiece? (Q7) How
does this compare with the number of amino acids in a typical protein? (Q8) Taking into
consideration the size of the biological molecules in these simulations and the requirements
that necessitated using large scale distributed computing methods for the simulations, what
are the biggest impediments to understanding the protein folding problem?
With the publication of entire genomes that contain sequences to many unknown proteins, it
is possible to imagine someday having the ability to predict the final structure of a protein
based on its sequence. This would require an understanding of the fundamental rules that
govern protein folding. Also, inroads into the mechanisms behind protein folding would
provide important knowledge for fighting disease states where misfolded proteins are
implicated.
BioInformatics - Introduction
To build a house you need bricks and mortar and something else -- the "know-how", or
"information" as to how to go about your business. The Victorians knew this. But when it
came to the "building" of animals and plants, -- the word, and perhaps the concept, of
"information" is difficult to trace in their writings.
Classical scholars tell us that Aristotle did not have this problem. The "eidos", the formgiving essence that shapes the embryo "contributes nothing to the the material body of the
embryo but only communicates its program of development" (see Delbrck's "Aristotle-totletotle" in Of Microbes and Life 1971, pp. 50-55).
William Bateson spoke of a "factor" (gene) having the "power" to bring about the building
of the characters which make up an organism. He used the "information" concept, but not the
word. He was prepared to believe his factors were molecules of the type we would today call
macromolecules, but he did not actually call them "informational macromolecules".
Information has many forms. If you turn down the corner of a page of a book to remind you
where you stopped reading ("book-mark"), then you have left information on the page. In
future you read ("decode") the bookmark with the knowledge that it means "continue here".
A future historian might be interested in where you paused in your reading. Coming across
the book, he/she would notice creases suggesting that a flap had been turned down. Making
assumptions about the code you were employing, a feasible map of the book could then be
made with your pause sites. It might be discovered that you paused at particular sites, say at
the ends of chapters. In this case pauses would be correlated with the distribution of the
book's "primary information". Or perhaps there was a random element to your pausing ...
perhaps when your partner wanted the light out. In this case pausing would be influenced by
your pairing relationship.
A more familiar form of information is the linear form you are now decoding (reading),
which is similar to the form you might decode on the page of a book. If a turned-down flap
on a page is a large one, it might cover up some of the information. Thus, one form of
information might interfere with another form of information. To read the text you would
have to correct (fold back) the "secondary structure" of the page (the flap) so that it no longer
overlapped the text. Thus, there is a conflict. You can either retain the flap and not read the
text, or get rid of the flap and read the text.
In the case of a book page, the text is imposed on an underlying flat two-dimensional base,
the paper. The text (message) and the medium are different. Similarly, in the case of our
genetic material, DNA, the "medium" is a chain of two repeating units (phosphate and a
sugar, deoxyribose) and the most easily recognized "message" is provided by a sequence of
"letters" (bases) attached, like beads, to the chain. As in the case of a written text on paper,
"flaps" in DNA (secondary structure) can conflict with the base sequence (primary structure).
Thus the pressures to form secondary structure and to convey primary information can be in
conflict. Part of a single strand can fold back on itself to form a stem-loop structure:
Fig. A stem-loop in single-stranded DNA: the stem is formed by pairing ACGATGC with the
complementary TGCTACG, while the loop contains the bases CGTA.
For this to happen there have to be matching (complementary) bases. Only the bases in the
loop (CGTA) are unpaired in this structure. The stem consists of paired bases. Thus
Chargaff's parity rule has to apply, to a close approximation, to single strands of DNA. When
one examines DNAs from whatever biological source, one invariably finds that the rule
applies. We refer to this as Chargaff's second parity rule.
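The rule is easy to check computationally. The minimal Python sketch below counts the bases
in a single strand; for long natural sequences the A and T counts, and the G and C counts, come
out approximately equal. The short strand used here is an invented example, so the agreement
is only illustrative.

# Minimal sketch: checking Chargaff's second parity rule (A ~ T and G ~ C
# within a single strand). The strand is an invented example; the rule holds
# only approximately, and only for long natural sequences.
from collections import Counter

single_strand = "ACGATGCCGTATGCTACGATCGGCATAT"

counts = Counter(single_strand)
print("A: %d  T: %d  G: %d  C: %d" % (counts["A"], counts["T"], counts["G"], counts["C"]))
print("A/T ratio: %.2f   G/C ratio: %.2f"
      % (counts["A"] / counts["T"], counts["G"] / counts["C"]))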
Returning to our own written textual form of information, the sentence "Mary had a little
lamb its fleece was white as snow" contains the information that a person called Mary is in
possession of an immature sheep. The same information might be written in Chinese or
Greek. Thus, the sentence contains not only its primary information, but secondary
information about its origin -- e.g. it is likely that the author is more familiar with English
than other languages. Some believe that English is on the way to displacing other languages,
so that eventually it (or the form it evolves to) will constitute the only language used by
human beings on this planet. Similarly, in the course of early evolution it is likely that a
prototypic nucleic acid language displaced contenders.
It would be difficult to discern a relationship between the English, Chinese and Greek
versions of the above sentence, because these languages diverged from primitive root
languages thousands of years ago. However, in England, if a person with a Cockney accent
were to speak the sentence it would sound like "Miree ader liawl laimb sfloyce wors woyt
ers snaa". Cockney English and "regular" English diverged more recently and it is easy to
discern similarities.
Now look at the following text:
yewas htbts llem ws arifea ac wMhitte alidsnoe la
irsnwwis aee ar lal larfoMyce b sos woilmyt erdea
One line of text is the regular English version with the letters shuffled. The other line is the
cockney version with the letters shuffled. Can you tell which is which? If the shuffling was
thorough, the primary information has been destroyed. However, there is still some
information left. With the knowledge that cockneys tend to "drop" their Hs, it can be deduced
that the upper text is more likely to be from someone who spoke regular English. With a
longer text, this could be more precisely quantitated. Languages have characteristic letter
frequencies. You can take a segment ("window") and count the various letters in that
segment.
In this way you can identify a text as English, Cockney, Chinese or Greek, without too much
trouble. We can call this information "secondary information". There may be various other
levels of information in a sequence of symbols. To evaluate the secondary information in
DNA (with only four "letters"), you select a "window" (say 1000 bases) and counts the
number of bases in that window. You can apply the same window to another section of the
DNA, or to another DNA molecule from a different biological species, and repeat the count.
Then you can compare DNA "accents".
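A minimal Python sketch of this windowed counting is given below. The two sequences are
invented extensions of the short examples used later in this section, and the 10-base window is
far smaller than the 1000-base window suggested above; both are chosen only so that the
output stays readable.

# Sketch of comparing DNA "accents": base composition is counted in fixed
# windows and the per-window profiles of two sequences are compared.
# Sequences and the 10-base window are illustrative only; real comparisons
# would use much longer sequences and windows (e.g. 1000 bases).
from collections import Counter

def window_composition(dna, window=10):
    """Return per-window base frequencies for a DNA string."""
    profiles = []
    for start in range(0, len(dna) - window + 1, window):
        segment = dna[start:start + window]
        counts = Counter(segment)
        profiles.append({base: counts[base] / window for base in "ACGT"})
    return profiles

species_a = "TTTTCATTAGTTGGAGATAAATTTACATTAGGAGATAAATT"
species_b = "TTCAGCCTCGTGGGGGACAAGTTCAGCCTCGTGGGCGACAA"

for i, (pa, pb) in enumerate(zip(window_composition(species_a),
                                 window_composition(species_b))):
    print("window %d  A: %s  B: %s" % (i, pa, pb))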
The best understood type of primary information in DNA is the information for proteins. The
DNA sequence of bases (one type of "letter") encodes another type of "letter", the "amino
acids". There are 20 amino acids, with names such as aspartate, glycine, phenylalanine,
serine and valine (which are abbreviated as Asp, Gly, Phe, Ser and Val). Under instructions
received from DNA, amino acids are joined together in the same order as they are encoded in
DNA, to form proteins. The latter, chains of amino acids which fold in complicated ways,
play a major role in determining how we interact with our environment. The proteins
determine our "phenotype". For example, in an organism of a particular species ("A") the
twenty-one-base DNA sequence:
TTTTCATTAGTTGGAGATAAA
read in sets of three bases ("codons"), conveys primary information for a seven amino acid
protein fragment (PheSerLeuValGlyAspLys). All members of the species will tend to have
the same DNA sequence, and differences between members of the species will tend to be rare
and of minor degree. If the protein is fundamental to cell function it is likely that organisms
of another species ("B") will have DNA which encodes the same protein fragment. However,
when we examine their DNA we might find major differences compared with the DNA of the
first species:
TTCAGCCTCGTGGGGGACAAG
This sequence also encodes the above protein fragment, showing that the DNA contains the
same primary information as in the first DNA sequence, but it is "spoken" with a different
"accent". This secondary information might have some biological role. It is theoretical
possible (but unlikely) that all the genes in an organism of species B would have this
"accent", yet otherwise encode the same proteins. In this case, organisms of species A and B
would be both anatomically and functionally (physiologically) identical, while differing
dramatically with respect to secondary information.
On the other hand, consider a single change in the sequence of species A to:
TTTTCATTAGTTGGAGTTAAA
Here the difference would change one of the seven amino acids. It is likely that such minor
changes in a very small number of genes affecting development would be sufficient to cause
anatomical and morphological differentiation within species A (e.g. compare a bulldog and a
poodle, as "varieties" of dogs, which are able to breed with each other). Yet, in this case the
secondary information would be hardly changed.
The view developed in these pages is that, like the Cockney's dropped H's, the role of
secondary information is to initiate, and, for a while, maintain, reproductive isolation. This
can occur because the genetic code is a "redundant" or "degenerate" code; for example, the
amino acid serine is not encoded by just one codon; there are six possible codons (TCT,
TCC, TCA, TCG, AGT, AGC). In the first of the above DNA sequences (A) the amino acid
serine (Ser) is encoded by TCA, whereas AGC is used in the second (B). On the other hand,
the change in species A from GAT (first sequence) to GTT (third sequence) changes the
encoded amino acid from aspartic acid (Asp) to valine (Val), and this should be sufficient to
change the properties of the corresponding protein, and hence change the phenotype.
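A small Python sketch can make the redundancy argument concrete. It uses a partial codon
table containing only the codons that appear in the three sequences above (so it is not a full
genetic code), and shows that sequences A and B translate to the same peptide while the single
GAT-to-GTT change in the variant alters one amino acid.

# Illustrative translation of the three sequences discussed above, using a
# partial codon table that covers only the codons occurring in them.
codon_table = {
    "TTT": "Phe", "TTC": "Phe", "TCA": "Ser", "AGC": "Ser",
    "TTA": "Leu", "CTC": "Leu", "GTT": "Val", "GTG": "Val",
    "GGA": "Gly", "GGG": "Gly", "GAT": "Asp", "GAC": "Asp",
    "AAA": "Lys", "AAG": "Lys",
}

def translate(dna):
    """Translate a DNA string codon by codon using the partial table above."""
    return "".join(codon_table[dna[i:i + 3]] for i in range(0, len(dna), 3))

species_a = "TTTTCATTAGTTGGAGATAAA"   # sequence A from the text
species_b = "TTCAGCCTCGTGGGGGACAAG"   # same protein, different "accent"
variant_a = "TTTTCATTAGTTGGAGTTAAA"   # single change: Asp codon becomes Val

for name, seq in [("A", species_a), ("B", species_b), ("A variant", variant_a)]:
    print(name, "->", translate(seq))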
Thus, the biological interest of linguistic barriers is that they also tend to be reproductive
barriers. Even if a Chinese person and an English person are living in the same territory
("sympatrically"), if they do not speak the same language they are unlikely to marry. The
Chinese tend to marry Chinese and produce more Chinese. The English tend to marry English
and produce more English. Even in England, because of the "class" barriers so colourfully
portrayed by George Bernard Shaw, Cockneys tend to marry Cockneys, and the essence of
the barrier from people speaking "regular" English is the difference in accent. Because of
other ("blending") factors at work in our society it is unlikely that this linguistic speciation
will continue to the extent that Cockney will become an independent language. However, the
point is that when there is "incipient" linguistic speciation, it is the secondary information
(dropped H's), not the primary information, which constitutes the barrier.
Before the genetic code was deciphered in the early 1960s, researchers such as Wyatt (1952)
and Sueoka (1961) studied the base composition of DNAs with a major interest in the
primary information -- how a sequence of bases might be related to a sequence of amino
acids. However, their results have turned out to be of greater interest with respect to the
secondary information in DNA.
Sequence comparison is possibly the most useful computational tool to emerge for molecular
biologists. The World Wide Web has made it possible for a single public database of genome
sequence data to provide services through a uniform interface to a worldwide community of
users. With a commonly used computer program called BLAST, a molecular biologist can
compare an uncharacterized DNA sequence to the entire publicly held collection of DNA
sequences. In the next section, we present an example of how sequence comparison using the
BLAST program can help you gain insight into a real disease.
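For readers who prefer scripting to the web form shown in Figure 1-1, the hedged sketch below
submits a nucleotide query to NCBI BLAST using Biopython. It assumes Biopython is
installed and a network connection is available; the query sequence is a short invented example
(real queries are normally much longer), and the search itself can take a minute or more.

# Hedged sketch: submitting a nucleotide BLAST search programmatically rather
# than through the NCBI web form. Requires Biopython and network access; the
# query is an invented example sequence.
from Bio.Blast import NCBIWWW, NCBIXML

query = "TTTTCATTAGTTGGAGATAAATTTACATTAGGAGATAAATTCAGCCTCGTGGGGGACAAG"

result_handle = NCBIWWW.qblast("blastn", "nt", query)   # submit to NCBI
record = NCBIXML.read(result_handle)                     # parse the XML reply

for alignment in record.alignments[:5]:                  # report the top five hits
    print(alignment.title)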
Figure 1-1. Form for submitting a BLAST search against nucleotide databases at NCBI
sequence labels is a significant find. BLAST differs from simple keyword searching in its
ability to detect partial matches along the entire length of a protein sequence.
The output shows local alignments of two high-scoring matching regions in the protein
sequences of the eyeless and aniridia genes. In each set of three lines, the query sequence (the
eyeless sequence that was submitted to the BLAST server) is on the top line, and the aniridia
sequence is on the bottom line. The middle line shows where the two sequences match. If
there is a letter on the middle line, the sequences match exactly at that position. If there is a
plus sign on the middle line, the two sequences are different at that position, but there is some
chemical similarity between the amino acids (e.g., D and E, aspartic and glutamic acid). If
there is nothing on the middle line, the two sequences don't match at that position.
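The following toy sketch rebuilds such a middle line for two short aligned fragments: a letter
for an exact match, a plus sign for a chemically similar pair, and a space otherwise. The
similarity groups and the two fragments are invented simplifications; BLAST itself judges
similarity from a scoring matrix such as BLOSUM62.

# Toy reconstruction of the "middle line" of a BLAST-style alignment display.
# The similarity groups and the aligned fragments are invented simplifications.
similar_groups = [set("DE"), set("KR"), set("ILVM"), set("ST"), set("FYW")]

def middle_line(query, subject):
    out = []
    for q, s in zip(query, subject):
        if q == s:
            out.append(q)                                 # exact match
        elif any(q in g and s in g for g in similar_groups):
            out.append("+")                               # similar amino acids
        else:
            out.append(" ")                               # no match
    return "".join(out)

query   = "LQRNRTSFTQEQIEALEK"
subject = "LQRNRTSFTNDQIDSLEK"

print(query)
print(middle_line(query, subject))
print(subject)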
In this example, you can see that, if you submit the whole eyeless gene sequence and look (as
standard keyword searches do) for an exact match, you won't find anything. The local
sequence regions make up only part of the complete proteins: the region from 24-169 in
eyeless matches the region from 17-161 in the human aniridia gene, and the region from
398-477 in eyeless matches the region from 222-301 in aniridia. The rest of the sequence doesn't
match! Even the two regions shown, which match closely, don't match 100%, as they would
have to, in order to be found in a keyword search.
However, this partial match is significant. It tells us that the human aniridia gene, which we
don't know much about, is substantially related in sequence to the fruit fly's eyeless gene. And
we do know a lot about the eyeless gene, from its structure and function (it's a DNA binding
protein that promotes the activity of other genes) to its effects on the phenotype--the form of
the grown fruit fly.
BLAST finds local regions that match even in pairs of sequences that aren't exactly the same
overall. It extends matches beyond a single-character difference in the sequence, and it keeps
trying to extend them in all directions until the overall score of the sequence match gets too
small. As a result, BLAST can detect patterns that are imperfectly replicated from sequence
to sequence, and hence distant relationships that are inexact but still biologically meaningful.
Depending on the quality of the match between two labels, you can transfer the information
attached to one label to the other. A high-quality sequence match between two full-length
sequences may suggest the hypothesis that their functions are similar, although it's important
to remember that the identification is only tentative until it's been experimentally verified. In
the case of the eyeless and aniridia genes, scientists hope that studying the role of the eyeless
gene in Drosophila eye development will help us understand how aniridia works in human
eye development.
text might encompass several volumes of data, in the form of painstaking illustrations and
descriptions of each species encountered. Biologists were faced with the problem of how to
organize, access, and sensibly add to this information. It was apparent to the casual observer
that some living things were more closely related than others. A rat and a mouse were clearly
more similar to each other than a mouse and a dog. But how would a biologist know that a rat
was like a mouse (but that rat was not just another name for mouse) without carrying around
his several volumes of drawings? A nomenclature that uniquely identified each living thing
and summed up its presumed relationship with other living things, all in a few words, needed
to be invented.
The solution was relatively simple, but at the time, a great innovation. Species were to be
named with a series of one-word names of increasing specificity. First a very general division
was specified: animal or plant? This was the kingdom to which the organism belonged. Then,
with increasing specificity, came the names for class, genus, and species. This schematic
way of classifying species, as illustrated in Figure 1-3, is now known as the "Tree of Life."
Figure 1-3. The "Tree of Life" represents the nomenclature system that classifies
species
A modern taxonomy of the earth's millions of species is too complicated for even the most
zealous biologist to memorize, and fortunately computers now provide a way to maintain and
access the taxonomy of species. The University of Arizona's Tree of Life project and NCBI's
Taxonomy database are two examples of online taxonomy projects.
Taxonomy was the first informatics problem in biology. Now, biologists have reached a
similar point of information overload by collecting and cataloguing information about
individual genes. The problem of organizing this information and sharing knowledge with the
scientific community at the gene level isn't being tackled by developing a nomenclature. It's
being attacked directly with computers and databases from the start.
The evolution of computers over the last half-century has fortuitously paralleled the
developments in the physical sciences that allow us to see biological systems in increasingly
fine detail. Figure 1-4 illustrates the astonishing rate at which biological knowledge has
expanded in the last 20 years.
Figure 1-4. The growth of GenBank and the Protein Data Bank has been astronomical
Simply finding the right needles in the haystack of information that is now available can be a
research problem in itself. Even in the late 1980s, finding a match in a sequence database was
worth a five-page publication. Now this procedure is routine, but there are many other
questions that follow on our ability to search sequence and structure databases. These
questions are the impetus for the field of bioinformatics.
organism based only on its genome sequence. This is a grand goal, and one that will be
approached only in small steps, by many scientists working together.
Recently, the techniques of x-ray crystallography have been refined to a degree that allows a
complete set of crystallographic reflections for a protein to be obtained in minutes instead of
hours or days. Automated analysis software allows structure determination to be completed in
days or weeks, rather than in months. It has suddenly become possible to conceive of the
same type of high-throughput approach to structure determination that the Human Genome
Project takes to sequence determination. While crystallization of proteins is still the limiting
step, it's likely that the number of protein structures available for study will increase by an
order of magnitude within the next 5 to 10 years.
Parallel computing is a concept that has been around for a long time. Break a problem down
into computationally tractable components, and instead of solving them one at a time, employ
multiple processors to solve each subproblem simultaneously. The parallel approach is now
making its way into experimental molecular biology with technologies such as the DNA
microarray. Microarray technology allows researchers to conduct thousands of gene
expression experiments simultaneously on a tiny chip. Miniaturized parallel experiments
absolutely require computer support for data collection and analysis. They also require the
electronic publication of data, because information in large datasets that may be tangential to
the purpose of the data collector can be extremely interesting to someone else. Finding
information by searching such databases can save scientists literally years of work at the lab
bench.
The output of all these high-throughput experimental efforts can be shared only because of
the development of the World Wide Web and the advances in communication and
information transfer that the Web has made possible.
The increasing automation of experimental molecular biology and the application of
information technology in the biological sciences have led to a fundamental change in the
way biological research is done. In addition to anecdotal research--locating and studying in
detail a single gene at a time--we are now cataloguing all the data that is available, making
complete maps to which we can later return and mark the points of interest. This is happening
in the domains of sequence and structure, and has begun to be the approach to other types of
data as well. The trend is toward storage of raw biological data of all types in public
databases, with open access by the research community. Instead of doing preliminary
research in the lab, scientists are going to the databases first to save time and resources.
The protein-folding problem, one of the major challenges of molecular biology in the 1990s, could be thought of as a version of cryptography. Scientists like Peter Kollman are
the code breakers, trying to uncover a set of rules somehow embedded in a protein's sequence
of amino acids. This chemical alphabet of 20 characters, strung like beads on a chain along
the peptide backbone of a newborn protein, carries a blueprint that specifies the protein's
mature folded shape.
Within a matter of seconds or less after rolling off the protein assembly line (in the
cellular ribosome), the stretched-out chain wraps into a bundle, with twists and turns, helices,
sheets and other 3D features. For proteins, function follows from form - the grooves and crevices of a protein's complex folds are what allow it to latch onto other molecules and carry out its
biological role.
But what are the rules? How is it that a particular sequence of amino acids results in a
particular folded shape? "The protein-folding problem is still the single most exciting
problem in computational biochemistry," says Kollman, professor of pharmaceutical
chemistry at the University of California, San Francisco, and a leader in using a research tool
called "molecular dynamics," a method of computational simulation that tracks the minute
shifts of a protein's structure over time. "To be able to predict the structure of the protein
from just the amino-acid sequence would have tremendous impact in all of biotechnology and
drug design."
In 1997, Kollman and his coworkers Yong Duan and Lu Wang used the CRAY T3D at
Pittsburgh Supercomputing Center to develop molecular dynamics software that exploits
parallel systems like the T3D and CRAY T3E much more effectively than before. Using this
improved software on the T3D and, later, on a T3E at Cray Research, the researchers tracked
the folding of a small protein in water for a full microsecond, 100 times longer in time than
previous simulations. The result is a more complete view of how one protein folds - in effect, a look at what hasn't been seen before - and it offers precious new insight into the
folding process.
Kollman also credits discussions with PSC scientist Michael Crowley as instrumental to his work on
this project.
Radius of Gyration over Time - The Quiet Time for Protein Folding
What did the researchers learn from viewing a full simulated microsecond of protein folding?
A burst of folding in the first 20 nanoseconds quickly collapses the unfolded structure,
suggesting that initiation of folding for a small protein can occur within the first 100
nanoseconds. Over the first 200 nanoseconds, the protein moves back and forth between
compact states and more unfolded forms. The researchers capture this behavior by plotting
the protein's radius of gyration - how much the structure spreads out from its center - as a function of time. "If you look at those curves," notes Kollman, "they're very noisy - the structure is moving, wiggling and jiggling a lot."
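For readers who want to see what is actually being plotted, the radius of gyration of a set of 3D coordinates can be computed in a few lines. The sketch below uses NumPy, treats all atoms as having equal mass, and uses randomly generated toy coordinates; a molecular dynamics package would use real atomic masses and trajectory frames.

import numpy as np

def radius_of_gyration(coords):
    """Radius of gyration of a set of 3D coordinates (equal atom masses
    assumed, for simplicity): RMS distance of the atoms from their centroid."""
    coords = np.asarray(coords, dtype=float)      # shape (n_atoms, 3)
    center = coords.mean(axis=0)                  # geometric center
    return np.sqrt(((coords - center) ** 2).sum(axis=1).mean())

# Toy "trajectory": a compact frame and a more extended frame.
compact  = np.random.normal(scale=1.0, size=(100, 3))
extended = np.random.normal(scale=3.0, size=(100, 3))
print(radius_of_gyration(compact), radius_of_gyration(extended))

A folding simulation produces one such number per saved frame, and plotting them against simulation time gives curves like the noisy ones Kollman describes.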
The folded structures, often called "molten globules," have 3D features, such as partially
formed helices loosely packed together, that bear resemblance to the final folded form. They
are only marginally stable, notes Kollman, and unfold again before settling into other folded
structures.
The next 800 nanoseconds reveal an intriguing "quiet period" in the folding. From about 250
nanoseconds until 400 nanoseconds the fluctuating movement back and forth between
globules and unfolding virtually ceases. "For this period in the later part of the trajectory,"
says Kollman, "everything becomes quiet. And that's where the structure gets closest to the
native state. It's quite happy there for awhile, then it eventually drifts off again for the rest of
the period out to a microsecond."
For Kollman, this behavior suggests that folding may be characterized as a searching process.
"It's a tantalizing idea that the mechanism of protein folding is to bounce around until it
finds something close, and stay there for a period, but if it isn't good enough it eventually
leaves and keeps searching. It might have 10 of these quiet periods before it arrives at a
period where enough of the amino-acid sidechains are in a good enough environment that it
locks into the final structure."
Although only a partial glimpse - even the fastest proteins need 10 to 100 microseconds to fully fold - these results represent a major step forward in protein-folding simulation. New
experimental methods are also providing more detailed looks at the process, offering the
possibility of direct comparison between experiment and simulation, which will further
advance understanding.
The challenge of the protein-folding problem is to be able to predict protein structure more accurately. For the pharmaceutical industry, this holds the prospect of greatly reducing the cost of developing new therapeutic drugs. Recent research, furthermore, suggests that certain diseases, such as "mad cow disease" and possibly Alzheimer's, can be understood as malfunctions in protein folding.
Ultimately, for Kollman and others using molecular dynamics, the goal is to follow the entire
folding process. With the promise of more powerful computing and higher level parallelism,
Kollman sees that goal as within reach. "We're getting some new insights that weren't
available before because the simulations weren't in the right time-scale. Being able to
visualize the folding process of even a small protein in a realistic environment has been a
goal of many researchers. We believe our work marks the beginning of a new era of the
active participation of full-scale simulations in helping to understand the mechanism of
protein folding."
Long years ago we had made a tryst with destiny and now the time comes
when we shall redeem our pledge, not wholly or in full measure, but very
substantially.
Jawaharlal Nehru
DEPARTMENT OF BIOTECHNOLOGY
MINISTRY OF SCIENCE & TECHNOLOGY
GOVERNMENT OF INDIA
1. OVERVIEW
2. PROGRAMME OBJECTIVES
3. CRITICAL EVALUATION OF THE PROGRAMME
3.1 Achievements
3.2 Pitfalls
4. BENEFICIARIES
5. STRATEGIES FOR THE FUTURE
5.1 Generate a National Level Resource Strength in Bioinformatics
5.1.1 Improvement of Bioinformatics Capabilities of India
5.1.2 Resource Integration
5.2 Promote R&D Strengths
5.2.1
5.2.2 Provision for Extramural Support in Bioinformatics Research Projects
5.2.3 Generation of a Bioinformatics Research Support System
5.2.4 Adequate Cover to Intellectual Property
5.3 Promote Entrepreneurial Development
5.3.1 Industry Participation in Building Academia-Industry Interfaces
5.3.2 Exploration of Marketing Channels for Bioinformatics Products &
Services
5.4 Globalise the National Bioinformatics Initiatives
5.4.1 International Collaboration in Resource Sharing
5.4.2 International Participation in Bioinformatics Capital Formation
5.5 Encourage Development of Quality Human Resources
5.5.1 Short term Training
5.5.2 Long-term Training
5.5.3 Continued Education in Bioinformatics
5.6 Restructure the BTIS Organisation for Optimised Performance
5.6.1 Enhanced Autonomy of the Apex Centre
5.6.2 Compilation of Work Areas for the Programme
5.6.3 Generation of Work Groups
5.6.4 Regional Decentralisation of the Network
5.6.5 Restructuring the BTIS into a Product Organisation
6. CONCLUSION
1. OVERVIEW
Growth of biotechnology has accelerated particularly during the last decade due to
accumulation of vast sequence and structure information as a result of sequencing of
genomes and solving of crystal structures. This, coupled with advances in information
technology, has made biotechnology increasingly dependent on computationally intensive
approaches. This has led to the emergence of a super-speciality discipline, called
bioinformatics.
Bioinformatics has become a frontline applied science and is of vital importance to
the study of new biology, which is widely recognised as the defining scientific endeavour of
the twenty-first century. The genomic revolution has underscored the central role of
bioinformatics in understanding the very basics of life processes.
India's predominantly agrarian economy, vast biodiversity and ethnically diverse population make biotechnology a crucial determinant in achieving national development. As India's population crossed the one-billion figure, the country is faced with newer challenges of conservation of biodiversity to ensure food security, healthcare, tackling bio-piracy and safeguarding the IPR of Plant Genetic Resources (PGR) and associated knowledge systems,
environment protection and education. The liberalisation and globalisation of the economy
pose further challenge to society and the government to modernise and respond to the
increasingly competitive international environment. As rapid technological advancements
and innovation impact the basic activities like agriculture, industry, environment and
services, the country has to evolve programmes that would aid in economic development
driven by science and technology. It is therefore of utmost importance that India participates
in and contributes to the ensuing global bioinformatics revolution.
In recognition of its importance, the Department of Biotechnology, Government of
India has identified bioinformatics as an area of high priority during the tenth plan period in
order to ensure that this sector attains levels demanded in the international arena. This can be
achieved through organisational and functional restructuring; integration and optimal
utilisation of the available resources; planned expansion based on actual market demand;
increasing autonomy of the system; transfer of technology from laboratory to the industry;
sustainable development of human resources; and finally, enhancing accountability of the
participating institutions.
Beginning early last decade, India has endeavoured to create an infrastructure that
would enable it to harness biotechnology through the application of bioinformatics.
2. PROGRAMME OBJECTIVES
The principal aim of the bioinformatics programme was to ensure that India emerged as a key international player in the field of bioinformatics, enabling greater access to the information wealth created during the post-genomic era and catalysing the country's attainment of a lead position in medical, agricultural, animal and environmental biotechnology. India should carve a niche in the bioinformatics industry and work to create a bioinformatics industry worth US$10 billion by the end of the 10th Plan period. It was felt that
these could be achieved through a focussed approach in terms of information acquisition,
storage, retrieval and distribution.
The Department of Biotechnology had adopted the following strategies to achieve
these objectives:
o Develop the programme as an array of distributed resource repositories in areas of
specialisation pertinent to the needs of Indias economic development.
o Coordination of the network through an Apex Secretariat
Training: There has been significant success in training scientists and researchers in diverse applications of bioinformatics and computational biology.
3.2 Pitfalls
individual specialisation and thrust of the centres. This is partly due to the absence of a clearly defined framework for operation and partly due to the non-homogeneous pattern of setting up of the centres. There is a need for a more central theme around
which the entire network should function.
4. BENEFICIARIES
o Agriculture, Health and Environment Sector
o Industry
o National resources: Capacity building in conservation and sustainable utilization
(bio prospecting) of biodiversity including protection of IPR and prevention of
biopiracy.
o National resources in higher education
POLICY RECOMMENDATIONS
This policy formulation suggests the following programmes for fulfilling the above-mentioned objectives.
5.1 Generate National Level Resource Strength in Bioinformatics:
5.1.1 Improvement of Bioinformatics Capabilities of India
Towards development of strong bioinformatics and bio-computational capabilities the
programme should focus heavily upon the following activities:
o Modernisation of infrastructure: Internet bandwidth for resource sharing:
All the centres of the network should have access to the Internet and should
have sufficient bandwidth as demanded by the applications. The Internet
connectivity might be in the form of leased lines so as to optimise cost
constraints.
o Uninterrupted Network Access: The computer infrastructure of the centre
should preferably be arranged into a network so as to optimise usage and
internal sharing of information and resources.
o Popularise Use/Access to Public Domain Utilities: Presently, a large number of public domain utilities are available for bioinformatics applications. The
centres are to be encouraged to use these utilities rather than acquiring costly
packages for this purpose. The apex body should ensure that the centre has the
necessary system configuration to use the concerned packages.
o Enhanced Solutions/Tools Development Capabilities: Successful
o The above resource sharing notwithstanding, the Apex Secretariat of the BTIS should
take adequate measures to ensure safeguard of strategic components of the
information such as biodiversity data, software and so on in order to prevent usage
that would be detrimental to the national interest.
5.2 Promote R&D Strengths
5.2.1 Provision of Extramural Support for Bioinformatics Research Projects
Completion of the genome projects and progress in structure elucidation has opened a
new vista for downstream research in bioinformatics. These studies range from modelling of
cellular function, metabolic pathways, validation of drug targets to understanding gene
function in health and disease. Currently, a number of scientists throughout the country
both within the BTIS programme and outside are involved in in-silico studies of biological
processes. The BTIS programme should encourage more such studies. Extramural support to
bioinformatics research projects should include:
o Funding of bioinformatics research projects
o Provide financial support for resource creation and maintenance
o Enable access to bioinformatics utilities required for research
o Provide financial support for travel/training of scientists
5.2.2 Generation of a Bioinformatics Research Support System
Downstream research in bioinformatics relies heavily on the availability of curated
secondary and tertiary databases and knowledge resources. In this regard, the BTIS
programme should undertake the process of acquiring and mining useful information from
primary databases and compiling secondary, tertiary and quaternary databases covering
specialised areas of biology. Such databases and knowledge repositories, would serve as
research support systems for augmenting the bioinformatics research activities.
5.2.3 Adequate Cover to Intellectual Property
In its attempts to promote R&D activities, the BTIS programme should take adequate
measures for protection of intellectual property generated out of the projects.
Industry-academia cooperation can take two forms: (i) the academic institutions can outsource the finishing and packaging of any databases and software they develop, retaining the copyright/patent, and (ii) the collaboration can cover the entire project, in which case the academic institutions and the industry share the copyright/patent.
world. Apart from this, the BTIS centres should maintain a curated archive of the
major genome information for facilitating development of secondary/tertiary
databases and downstream research. The entire genome repository generated in the
process, should be integrated into a single-window platform through a National
Genome Information Network.
tests some fellowships may be awarded for pursuing higher studies such as
M.Tech., Ph.D. in bioinformatics.
5.5 Encourage Human Resource Development, Education, & Awareness
The dearth of adequately trained manpower in bioinformatics is a major global problem. The case is no different for India. The programme should lay adequate emphasis on manpower development to address this problem, through long-term, short-term and continuing training programmes. More M.Tech. and Ph.D. courses shall be introduced
at various institutions.
The Apex Centre should compile a list of work areas for general guidance of the centres,
keeping in view the national requirements, expertise available, funds available and basic
objectives of the Government of India. These would include:
o Identification of high priority research areas that need to be addressed.
o Identification of core interest areas of the Bioinformatics programme.
o Identification of high priority databases and software that need to be developed.
o Standardisation of course curriculum for bioinformatics education.
o Monitoring, analysing and publishing the market demand of different
bioinformatics applications (tools) and utilities
The Apex Centre, based on the reports of the individual centres, should evolve small
functional work groups and foster closer interaction within these individual groups. Each of
the work groups should be under the supervision of Group Coordinators, who would oversee
the overall functioning of the group. On the basis of the current trends of the network,
suggested work groups include:
o Medical Science
o Commercial Biotechnology & Intellectual Property Management
o Computational Biology & Algorithms
o Biodiversity & Environment
o Plant Science, Agriculture & Veterinary Science
o Molecular Biology, Cell Biology & Structural Biology
6. CONCLUSIONS
The chief feature of this policy proposal is to facilitate a paradigm shift for the
DBT's bioinformatics programme from infrastructure generation to resource building at the national, regional and international levels. In this regard, the functional protocol might be
redefined as follows:
o To focus on resource building in bioinformatics using the infrastructure already
generated
Biocomputing in a Nutshell
A Short View onto the Development of Biology
The success of modern molecular biology might be considered a Cartesian dream. Reductionism, René Descartes' belief that complex phenomena can be understood by reducing
them to their constituent parts - despite all its limitations - has turned out to be a home run in
molecular biology.
The developments in modern biology have their roots in the interdisciplinary work of
scientists from many fields. This was a crucial element in the breaking of the code of life;
Max Delbrück, Francis Crick and Maurice Wilkins all had backgrounds in physics. In fact, it was the physicist Erwin Schrödinger (ever heard of Schrödinger's cat?), who in "What is
life" was the first to suggest that the "gene" could be viewed as an information carrier whose
physical structure corresponds to a succession of elements in a hereditary code script. This
later turned out to be the DNA, one of the two types of molecules "on which life is built".
Data bank       Contents               Number of sequences (1996)
EMBL/GenBank    Nucleotide sequences   827,174
SWISS-PROT      Protein sequences      52,205
PDB             Protein structures     4,525
The growth of one typical data bank, SWISS-PROT, is shown below as the number of sequences increases over time.
Introduction to Bioinformatics
What is Bioinformatics?
In the last few decades, advances in molecular biology and the equipment available for
research in this field have allowed the increasingly rapid sequencing of large portions of the
genomes of several species. In fact, to date, several bacterial genomes, as well as those of
some simple eukaryotes (e.g., Saccharomyces cerevisiae, or baker's yeast) and more complex
eukaryotes (C. elegans and Drosophila) have been sequenced in full. The Human Genome
Project, designed to sequence all 24 of the human chromosomes, is also progressing and a
rough draft was completed in the spring of 2000.
Popular sequence databases, such as GenBank and EMBL, have been growing at exponential
rates. This deluge of information has necessitated the careful storage, organization and
indexing of sequence information. Information science has been applied to biology to
produce the field called bioinformatics.
(Figure: landmark sequences completed)
The field includes databases of sequence and structural information, as well as methods to access, search, visualize and retrieve the information.
Sequence data can be used to make predictions of the functions of newly identified
genes, estimate evolutionary distance in phylogeny reconstruction, determine the active sites of enzymes, construct novel mutations and characterize alleles of genetic diseases, to name
just a few uses. Sequence data facilitates:
Analysis of the organization of genes and genomes and their evolution
Protein sequence can be predicted from DNA sequence, which further facilitates prediction of protein properties, structure, and function (proteins are rarely sequenced in their entirety today)
Identification of regulatory elements in genes or RNAs
Identification of mutations that lead to disease, etc.
Bioinformatics is the field of science in which biology, computer science, and information
technology merge into a single discipline. The ultimate goal of the field is to enable the
discovery of new biological insights as well as to create a global perspective from which
unifying principles in biology can be discerned.
There are three important sub-disciplines within bioinformatics involving computational
biology:
the development of new algorithms and statistics with which to assess relationships
among members of large data sets;
the analysis and interpretation of various types of data including nucleotide and amino
acid sequences, protein domains, and protein structures; and
the development and implementation of tools that enable efficient access and
management of different types of information.
One of the simpler tasks in bioinformatics concerns the creation and maintenance of
databases of biological information. Nucleic acid sequences (and the protein sequences
derived from them) comprise the majority of such databases. While the storage and organization of millions of nucleotides is far from trivial, designing a database and developing
an interface whereby researchers can both access existing information and submit new entries
is only the beginning.
The most pressing tasks in bioinformatics involve the analysis of sequence information.
Computational Biology is the name given to this process, and it involves the following:
Data-mining is the process by which testable hypotheses are generated regarding the function
or structure of a gene or protein of interest by identifying similar sequences in better
characterized organisms. For example, new insight into the molecular basis of a disease may
come from investigating the function of homologs of the disease gene in model organisms.
Equally exciting is the potential for uncovering phylogenetic relationships and evolutionary
patterns. The process of evolution has produced DNA sequences that encode proteins with
very specific functions. It is possible to predict the three-dimensional structure of a protein
using algorithms that have been derived from our knowledge of physics, chemistry and most
importantly, from the analysis of other proteins with similar amino acid sequences.
polymerase chain reactions to amplify existing DNA sequences or to modify these sequences.
With these techniques in place, progress in biological research increased exponentially.
For researchers to benefit from all this information, however, two additional things were
required: 1) ready access to the collected pool of sequence information and 2) a way to
extract from this pool only those sequences of interest to a given researcher. Simply
collecting, by hand, all necessary sequence information of interest to a given project from
published journal articles quickly became a formidable task. After collection, the
organization and analysis of this data still remained. It could take weeks to months for a
researcher to search sequences by hand in order to find related genes or proteins.
Computer technology has provided the obvious solution to this problem. Not only can
computers be used to store and organize sequence information into databases, but they can
also be used to analyze sequence data rapidly. The evolution of computing power and storage
capacity has, so far, been able to outpace the increase in sequence information being created.
Theoretical scientists have derived new and sophisticated algorithms which allow sequences
to be readily compared using probability theories. These comparisons become the basis for
determining gene function, developing phylogenetic relationships and simulating protein
models. The physical linking of a vast array of computers in the 1970's provided a few
biologists with ready access to the expanding pool of sequence information. This web of
connections, now known as the Internet, has evolved and expanded so that nearly everyone
has access to this information and the tools necessary to analyze it.
Databases of protein and nucleic acid sequences
In the US, the repository of this information is The National Center for Biotechnology
Information (NCBI)
The database at the NCBI is a collated and interlinked dataset known as the Entrez
Databases
o Description of the Entrez Databases
o Examples of selected database files:
Protein (most protein sequence is derived from conceptual translation)
Chromosome with genes and predicted proteins (Accession #D50617, Yeast Chromosome VI: Entrez)
Genome (C. elegans)
Protein structure (TPI database file or Chime structure)
Expressed sequence tags (ESTs) (summary of current data)
o Neighboring
transcribed away from their promoters, the definitive location of this element can reduce the
number of possible frames to three. There is not a strong consensus between different species surrounding translation start codons. Therefore, locating the appropriate start codon also involves finding a frame in which there are no premature stop codons. Knowledge of a protein's predicted molecular mass can assist this analysis. Incorrect reading frames usually predict
relatively short peptide sequences. Therefore, it might seem deceptively simple to ascertain
the correct frame. In bacteria, such is frequently the case. However, eukaryotes add a new
obstacle to this process: INTRONS!
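The idea of rejecting frames that contain premature stop codons can be sketched briefly in Python. The sequence below is invented, and the scan covers only the three forward frames; a real gene finder would also examine the reverse strand and weigh start codons, codon usage and other evidence.

# Illustrative sketch: scan the three forward reading frames of a DNA
# sequence and report, for each, where the first stop codon appears.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_stop(seq, frame):
    """Return the codon index of the first stop codon in the given frame
    (0, 1 or 2), or None if the frame is open to the end of the sequence."""
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return (i - frame) // 3
    return None

dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"   # made-up example sequence
for frame in range(3):
    print("frame", frame, "first stop at codon", first_stop(dna, frame))

In this toy case one frame stays open to the end of the sequence, which is the kind of evidence used to favour a candidate reading frame.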
Prediction and Identification of the Exon/Intron Splice Sites. In eukaryotes, the reading
frame is discontinuous at the level of the DNA because of the presence of introns. Unless one
is working with a cDNA sequence, these introns must be spliced out and the exons
joined to give the sequence that actually codes for the protein. Intron/exon splice sites can be
predicted on the basis of their common features. Most introns begin with the nucleotides GT
and end with the nucleotides AG. There is a branch sequence near the downstream end of
each intron involved in the splicing event. There is a moderate consensus around this branch
site.
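The GT...AG rule lends itself to a quick, if crude, scan for candidate introns. The regular-expression sketch below flags every GT...AG stretch within a chosen length range and ignores the branch site and all other context, so it will grossly over-predict; the toy sequence and length limits are invented for illustration.

import re

def candidate_introns(seq, min_len=20, max_len=200):
    """Naively list (start, end) spans that begin with GT and end with AG.
    Purely illustrative: real splice-site prediction uses much more context."""
    pattern = re.compile(r"(?=(GT[ACGT]{%d,%d}AG))" % (min_len - 4, max_len - 4))
    return [(m.start(), m.start() + len(m.group(1))) for m in pattern.finditer(seq)]

toy = "AAAGTAAGTCCCCCCCCCCCCCCCCCCCCTTTTTCTAACAGGGG"  # made-up sequence
print(candidate_introns(toy, min_len=10, max_len=60))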
Prediction of Protein 3-D Structure. With the completed primary amino acid sequence in
hand, the challenge of modelling the three-dimensional structure of the protein awaits. This
process uses a wide range of data and CPU-intensive computer analysis. Most often, one is
only able to obtain a rough model of the protein, and several conformations of the protein
may exist that are equally probable. The best analyses will utilize data from all the following
sources.
Pattern comparison
X-ray diffraction data
Physical forces/energy states
All of this information is used to determine the most probable locations of the atoms of the
protein in space and bond angles. Graphical programs can then use this data to depict a three-dimensional model of the protein on the two-dimensional computer screen.
Bioinformatic Databases include (but not limited to):
Literature
o PubMed (Medline)
Nucleic Acid Sequences
o Genomes
o Chromosomes
o Genes and mRNAs (cDNA)
o EST, GSS, SNP, etc.
EST - expressed sequence tag (sequence of an expressed mRNA)
SNP - single nucleotide polymorphism
o RNA
Proteins
Structure
o Proteins (and their interactions)
o Nucleic Acids
o Protein-Nucleic acid interactions (protein and DNA or protein and RNA)
Secondary Databases
o Protein
Protein Families
Motifs, consensus sequences, profiles, etc.
Protein domains and folds
Structural families
RefSeq
Clusters of Orthologous Groups (COGs)
o Nucleic Acid
Unigene
RefSeq
Expression profiles of mRNAs and proteins
o Networks of expression patterns
Interaction Networks (proteins)
Information in these databases needs to be:
Accessible
Searchable
Retrievable
Analyzed using a variety of programs
Integrated or linked with other database information
Historically individual databases were constructed for protein or nucleic acid sequences.
In the case of DNA sequences, the main repositories varied in different parts of the
world:
o Within the US in GenBank
o Japan in DDBJ (DNA Database of Japan),
o Europe in the EMBL database (European Molecular Biology Laboratory)
A variety of protein databases were also developed and collated information.
For years, since the information in each database was not exactly the same (some sequences might be present in one database but not in another), it required a concerted effort to carry out analyses on each separate database to ensure that one examined all
the data.
More recently the databases have shared and integrated information from a variety of
databases to produce complex, comprehensive sites with access to information in most
databases from a single site, and in some cases an integrated database.
Three of the main integrated databases in use today include
A great strength of these databases is that they not only allow you to search them but they
provide links and handy pointers to additional information in related databases. The three
systems differ in their databases and the links they make to other information.
Entrez at the NCBI is a collated and interlinked molecular biology database and retrieval
system which serves as an entry point for exploring distinct but integrated databases. Entrez
provides access to for example nucleotide and protein sequence databases, a molecular
modeling 3-D structure database (MMDB), a genomes and maps database, disease
information, and the literature. Of the three, Entrez is the simplest to use, although it offers
more limited information to search.
The Sequence Retrieval System (SRS) is available from the European Molecular Biology
Organization (EMBO). This integrated database offers greater flexibility in several areas with
a homogeneous interface to over 80 biological databases developed at the European
Bioinformatics Institute (EBI) at Hinxton. The types of databases included for example are
sequence and sequence related, metabolic pathways, transcription factors, applications,
protein 3-D structures, genome, mapping, mutations and locus-specific mutations.
What makes Entrez powerful is that most of its records are linked to other records,
both within a given database (such as PubMed) and between databases.
o Links within the Entrez databases are called "neighbors".
Protein and Nucleotide neighbors are determined by performing similarity searches
using the algorithm BLAST on the amino acid or DNA sequence in the entry and the
results saved as above.
o What this means is that if you find one or a few documents that match what
you are looking for, pressing the "Related Articles/Sequences" button will find
a great many more documents that are likely to be relevant, in order from most
useful to least.
o This allows you to find what you want with much greater speed and accuracy:
instead of having to flip through thousands of documents to assure yourself
that nothing germane to your query was missed, you can find just a few, then
look at their neighbors.
In addition, some documents are linked to others for reasons other than computed
similarity.
o For instance, if a protein sequence was published in a PubMed article, the two
will be linked to one another.
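The neighbor links can also be followed programmatically through NCBI's E-utilities elink service. The sketch below asks for PubMed records related to a single article; the PubMed ID is a placeholder, and the exact JSON layout of the response should be checked against the current E-utilities documentation.

import json
import urllib.parse
import urllib.request

# Sketch only: query NCBI E-utilities "elink" for PubMed neighbors
# ("related articles") of one record. The PMID below is a placeholder;
# the response structure should be confirmed against the E-utilities docs.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
params = {"dbfrom": "pubmed", "db": "pubmed", "id": "123456", "retmode": "json"}

with urllib.request.urlopen(BASE + "?" + urllib.parse.urlencode(params)) as resp:
    data = json.load(resp)

# Print the related-record IDs reported in each link set (first ten only).
for linkset in data.get("linksets", []):
    for linksetdb in linkset.get("linksetdbs", []):
        print(linksetdb.get("linkname"), linksetdb.get("links", [])[:10])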
Example
o Author = Davis RE
o Organism = Schistosoma
o Combined = Davis RE AND Schistosoma
Use of related links options within the Entrez system
Other examples for literature and sequences related to
o a specific protein (telomerase)
o an RNA (snoRNA or SL RNA)
o disease (oroticaciduria)
o obesity
o the molecular cloning of the human cystic fibrosis gene
Use MEDLINE with Keyword searching.
Use neighbor feature to find related articles.
Switch to Nucleotide database to see sequence.
Save a copy of sequence to local disk.
Use MeSH terms to find similar articles.
Search the Nucleotide database by gene name.
Iterative searching: one primary search topic on which you can build constraints
upon the initial search.
Phrase Searching (forcing PubMed to search for a phrase)
PubMed consults a phrase index and groups terms into logical phrases. For
example, if you enter poison ivy, PubMed recognizes these two words as a
phrase and searches it as one search term.
o However, it is possible that PubMed may fail to find a phrase that is essential
to a search.
o For example, if you enter, single cell, it is not in the Phrase List and PubMed
searches for "single" and "cell" separately. To force PubMed to search for a
specific phrase enter double quotes (" ") around the phrase, e.g., "single cell".
Complex Boolean searching: Conjunction of searches or terms using Operators
A search can be performed all at once by specifying the terms to search, their fields,
and the boolean operations to perform on them.
o This is the default (Basic) mode for PubMed
Boolean Syntax
o search term [tag] BOOLEAN OPERATOR search term [tag]
term [field] OPERATOR term [field] ...etc
o OPERATOR must be upper case
AND
OR
NOT
o Default is AND between words
o Use * for truncation
o [field] identifies the specific axis through which the search will be limited
Examples of Boolean Search Statements:
o Find citations on DNA that were authored by Dr. Crick in 1993.
dna [mh] AND crick [au] AND 1993 [dp]
o Find articles that deal with the effects of heat or humidity on multiple
sclerosis, where these words appear in all fields in the citation.
(heat OR humidity) AND multiple sclerosis
o Find English language review articles that discuss the treatment of asthma in
preschool children.
asthma/therapy [mh] AND review [pt] AND child, preschool [mh]
AND english [la]
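The same Boolean queries can be submitted programmatically through the E-utilities esearch service. The sketch below runs the first example above and prints the matching PubMed IDs; the response field names are read from the JSON output and, as with any external service, should be verified against the current E-utilities documentation.

import json
import urllib.parse
import urllib.request

# Sketch: run a Boolean PubMed query through the E-utilities esearch service.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
query = "dna [mh] AND crick [au] AND 1993 [dp]"   # one of the examples above
params = {"db": "pubmed", "term": query, "retmax": "20", "retmode": "json"}

with urllib.request.urlopen(BASE + "?" + urllib.parse.urlencode(params)) as resp:
    result = json.load(resp)["esearchresult"]

print("hits:", result.get("count"))
print("first PubMed IDs:", result.get("idlist", []))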
Genomic Analysis
The introduction of molecular markers (segments of DNA whose pattern of inheritance can
be determined), the polymerase chain reaction (PCR) to amplify genomic DNA fragments,
and high throughput DNA sequencers which determine nucleotide sequences of amplified
DNA by gel or capillary electrophoresis methods, are all examples of advances in genomic
analysis. These advances have led to the sequencing of the genomes of several "model
organisms" including human, bacteria (Escherichia coli), yeast (Saccharomyces cerevisiae),
and plants such as Arabidopsis and rice (Oryza sativa).
Barley researchers and breeders were quick to make use of the molecular markers for barley
fingerprinting based on restriction fragment length polymorphisms (RFLPs) and PCR-generated random amplified polymorphic DNAs (RAPDs), sequence-tagged sites (STSs),
microsatellites and amplified fragment length polymorphisms (AFLPs). The North American
Barley Genome Mapping Project (NABGMP) and other initiatives made great strides in
locating quantitative trait loci (QTL) - specific genomic regions that affect quantitative traits
(Han et al., 1997; Mather et al., 1997) - and in exploiting molecular markers in breeding for
specific traits (Swanston et al., 1999) or for barley malt fingerprinting (Faccioli et al., 1999).
known locations to create microarrays, or gene chips. The same reverse transcription process
that produces cDNAs for microarrays is used to prepare cDNA fragments from cell
populations. These cDNAs can be labelled with different fluorescent tags and allowed to
hybridize with the cDNA on the chip so that differences in mRNA expression between the
cell populations can be examined. Genome-wide analyses of gene expression patterns are
thus enabled, and can be used to study, for example, abiotic stress responses in cereal crops,
including wheat and barley. Jones (personal communication) has used the commercially
available Rice Chip (Affymetrix Inc., Santa Clara, CA, USA) to investigate signalling
pathways and metabolic regulation in barley aleurone layers, and noted an 80% hybridization
of barley cDNA to the rice oligochip.
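As a much-reduced illustration of the downstream analysis, the sketch below takes the two fluorescence intensities measured for each spot (one per cell population), computes log2 ratios, and flags genes whose expression differs by more than two-fold. All gene names, intensity values, and the cutoff are invented.

import numpy as np

# Toy two-channel microarray comparison: intensities for a handful of genes
# in a "control" and a "stressed" sample. All numbers are invented.
genes    = ["geneA", "geneB", "geneC", "geneD"]
control  = np.array([1200.0,  450.0,  980.0,  300.0])
stressed = np.array([1150.0, 1900.0,  240.0,  310.0])

log_ratios = np.log2(stressed / control)          # >0 means up in stressed cells

for gene, ratio in zip(genes, log_ratios):
    call = "up" if ratio > 1 else "down" if ratio < -1 else "unchanged"
    print(f"{gene}: log2 ratio {ratio:+.2f} ({call})")   # |log2| > 1 ~ two-fold change

Real analyses add background correction, normalization across the two dyes, and replicate-based statistics before any gene is called differentially expressed.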
Proteomic Analysis
The need to go beyond nucleic acid analysis, and reach an understanding of total protein
expression and regulation in biological systems is motivating the field of Proteomics (Dutt
and Lee, 2000). The field of proteome analysis has largely been driven by technological
developments which have improved upon the basic strategy of separating proteins by two-dimensional gel electrophoresis (O'Farrell, 1975). As with genomic information or data
derived from microarray analysis, an informatic framework is required to organize proteomic
data, as well as to generate structural and predictive models of the molecules and interactions
being studied.
The use of 2-D gel electrophoresis coupled with mass spectrometry (MS) is the most familiar
proteomics approach, with advances in the MS ionization techniques and mass analyzers
enabling the generation of different types of structural information about proteins of interest.
The techniques of nuclear magnetic resonance spectroscopy (NMR) and X-ray
crystallography rank alongside mass spectrometry as the major tools of proteomics. The
development of protein chips (analogous to microarrays) and protein-protein interaction maps
will offer new ways to rapidly characterize expression levels and relationships among
proteins in normal or perturbed cell populations, helping researchers to bridge the
genotype-phenotype gap.
Perspectives
A combination of functional genomics and proteomics approaches allows researchers to cross
interdisciplinary boundaries in "trawling" the genome, transcriptome, proteome and
metabolome for new insights into structure and function that can be applied to improving the
agronomic and end use traits of malting barley. As Fincher (2001) has pointed out, functional
genomics and related technologies are now being used to re-visit the difficult questions of
cell wall polysaccharide biosynthesis, involving synthase enzymes that had proved
impossible to purify and characterize by classical biochemical methods. In addition to the
resolution of their structure, the genes encoding for the more extensively studied hydrolase
enzymes, responsible for cell wall breakdown and starch degradation, are being identified,
generating knowledge of how characteristics such as thermostability might be enhanced to
improve malting and brewing performance (Kristensen et al., 1999; Ziegler, 1999). Structural
and functional studies can thus be linked back to protein and mRNA expression patterns, and
ultimately to the families of genes that we might wish to conserve or alter in breeding
programs.
Ultimately, the extent to which advances in functional genomics and proteomics will be
embraced and adopted by the malting and brewing industry depends heavily on the industry's
ability to "keep up" with the rapidly moving field, and on the economics (perhaps the most
important of the "-omics") of doing so. Brewing Science has a long and distinguished history,
closely associated with the need of the industry to profitably apply science in pursuit of
product quality and consistency. To this end, generations of chemists, biochemists,
microbiologists, botanists and plant breeders have applied their skills to elucidating and
manipulating the structural and functional characteristics of barley, yeast and hops. In the
case of barley, one of the results of the close linkage between science and industry has been a
relatively strong base of research and breeding activities - certainly more than would be
expected from barley's position among the world's major crops. It is perhaps ironic that
today, when the emerging technologies of genomics and proteomics are allowing us to revisit
and build upon the knowledge generated in the past, industry's engagement in the research
process has become less active. The danger in moving from a position of active engagement
in the generation of new knowledge lies in the potential loss of "absorptive capacity" (Cohen
& Levinthal, 1990). According to these authors, once an organization ceases investing in its
absorptive capacity in a quickly moving field, it may never assimilate and exploit new
information in that field, regardless of the value of that information.
complex data. Similar efforts are being undertaken in other programming languages, with the
goal of eventual interoperability across all environments.
Ensuring Accessibility
To be useful, it's not enough for data to be in the right format - it also has to be made
available on the public Internet, and in such a way that both human beings and computer
programs running automated searches can find it. The situation is somewhat delicate given
the proprietary nature of much of the research data, particularly in areas of interest to the
pharmaceutical industry.
Lincoln Stein has likened the current state of affairs in bioinformatics to that of Italy in the
time of the city states - a collection of fragmented, divided, and inward-looking principalities,
incapable of sustained cooperation. Much of what is truly revolutionary about computational
biology won't get off the ground until there is a seamless framework of data that
researchers can combine, sift through, or rearrange without having to worry about its
provenance. All of these efforts require time, goodwill, and hard work on the part of people
who would probably much rather be doing new research.
Gene sequencing
The most visible and active branch of bioinformatics is the art of gene sequencing,
sometimes called "genomics", including the much-publicized Human Genome Project and its
many offspring. While scientists have long known that DNA molecules carry hereditary
information, only in the 1990s did advances in sequencing technology make it feasible to
sequence the entire genome of anything more complex than a bacterium.
Understanding the significance of gene sequencing ( as well as its limitations ) requires a
little background.
Very few genes in a cell are actually active in protein production at any given time. Different
sections of DNA may be dormant or active over the life of a cell, their expression triggered
by other genes and changes in the cell's internal environment. How genes interact, why they
express at certain times and not others, and how the mechanisms of gene suppression and
activation work are all topics of intense interest in microbiology.
The Mechanics of Sequencing
The goal of genome sequencing projects is to record all of the genetic information contained
in a given organism - that is, create a sequential list of the base pairs comprising the DNA of
a particular plant or animal. Since chromosomes consist of long, unbroken strands of DNA, a
very convenient way to sequence a genome would be to unravel each chromosome and read
off the base pairs like punch tape.
Unfortunately, there is no machine available that can read a single strand of DNA ( ribosomes
are very good at it, but nature has not seen fit to provide them with an output jack ). Instead,
scientists have to use a cruder, shotgun technique that first chops the DNA into short pieces (
which we know how to identify ) and then tries to reassemble the original sequence based on
how the short fragments overlap.
To illustrate the difficulty of the task, imagine that someone gave you a bin containing ten
shredded copies of the United States tax code, and asked you to reconstruct the original. The
only way you could do it would be to hunt through the shredded bits, look for other shreds
with overlapping regions, and assemble larger and larger pieces at a time until you had at
least one copy of the whole thing.
Much of the work involved in DNA sequencing is exactly this kind of painstaking labor, with
the added caveat that the short fragments invariably contain errors, and that reliably
sequencing a single stretch of DNA might involve combining many dozens of duplicate data
sets to arrive at an acceptable level of fidelity.
All of this work is done using computers that sift through sequencing data and apply various
alignment algorithms to look for overlaps between short DNA fragments. Since DNA in its
computer representation is just a long string of letters, these algorithms are close cousins of
text analysis techniques that have been used for years on electronic documents, creating a
curious overlap between genetics and natural language processing.
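To make the "overlapping shreds" idea concrete, the sketch below finds the longest exact suffix-prefix overlap between two fragments and merges them. Real assemblers must tolerate sequencing errors, handle reverse complements, and weigh many competing overlaps; the fragments here are invented.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b
    (at least min_len long), or 0 if there is none."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def merge(a, b, min_len=3):
    """Merge b onto a using their best suffix-prefix overlap."""
    size = overlap(a, b, min_len)
    return a + b[size:]

# Toy fragments from a longer (imaginary) sequence, read left to right.
frag1, frag2 = "ATGGCGTGCA", "GTGCATTACGG"
print(overlap(frag1, frag2))       # 5 ("GTGCA")
print(merge(frag1, frag2))         # ATGGCGTGCATTACGG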
The work of assembling a sequence requires that many data sets be combined together, and
common errors ( dropped sequences, duplications, backwards sequences ) detected and
eliminated from the final result. Much of the work taking place right now on data from the
Human Genome Project is just this kind of careful computerized analysis and correction.
Finding the Genes
Once a reliable DNA sequence has been established, there still remains the task of finding the
actual genes ( coding regions ) embedded within the DNA strand. Since a large proportion of
the DNA in a genome is non-coding ( and presumably unused ), it is important to find and
delimit the coding regions as a first step to figuring out what they do.
This search for coding regions is also done with the help of computer algorithms, and once
again there is a good amount of borrowing from areas like signal processing, cryptography,
and natural language processing - all techniques that are good at distinguishing information
from random noise.
Finding the coding regions is an important step in genome analysis, but it is not the end of the
road. An ongoing debate in genetics has been the question of how much information is
actually encoded in the genome itself, and how much is carried by the complex system
surrounding gene expression.
A useful analogy here (borrowed from Douglas Hofstadter) is that of the ant colony. While a
single ant has only a small repertoire of behaviors, and responds predictably to a handful of
chemical signals, a colony of many thousands of ants will show very subtle and sophisticated
behavior in how it forages for food, raises its young, manages resources and deals with
external threats. No amount of study of an individual ant can reveal anything about the
behavior of a colony, because the properties of the colony don't reside in any single ant - they
arise spontaneously out of the interactions between many thousands of ants, all obeying
simple rules. This phenomenon of emergent behavior is known to play a part in gene
expression, but nobody knows to what extent.
Emergent behavior is very hard to simulate, because there is no way to infer the simple rules
from the complex behavior - we are forced to proceed by trial and error. Still, computers give
us a way to try many different rule sets and test hypotheses about gene interaction that would
take decades to work out on pencil and paper.
Unexplored Territory
Surprisingly enough, for all the intense research taking place in genomics, only a very few
organisms have had their genome fully sequenced ( we are in the proud company of the
puffer fish, the fruit fly, brewer's yeast, and several kinds of bacteria). Many technical and
computational challenges remain before sequencing becomes an automatic process - some
species are still very difficult for us to sequence, and much remains to be learned about the
role and origin of all that non-coding DNA. Progress will require both advances in the lab
and advances in our ability to analyze and process the genome data with computers.
Molecular Structure and Function
Closely related to gene sequencing is a second major field of interest in bioinformatics - the
search for a mapping between the chemical structure of a protein and its function within the
cell.
We noted above that proteins are assembled from amino acids based on instructions encoded
in an organism's DNA. Even though all proteins are made out of the same building blocks,
they come in a bewildering variety of forms and serve many functions within the cell. Some
proteins are purely structural, others have special receptors that fit other molecules, and still
others serve as signals or messages that can pass information from cell to cell. The role a
protein plays depends solely on its shape, and the shape of a protein depends on the sequence
of amino acids that compose it.
Amino acid chains start out as linear molecules, but as more amino acids are added, the chain
begins to fold up. This spontaneous folding results in a complex, three-dimensional structure
unique to the particular protein being generated. The pattern of folding is automatic and
reproducible, so that a given amino acid sequence will always create a protein with a certain
configuration.
The ability to predict the final shape of a protein from its amino acid composition is the Holy
Grail of pharmacology. If we could design protein molecules from scratch, it would become
possible to create potent new drugs and therapies, tailored to the individual. Finding
treatments for a disease would be as simple as creating a protein to fit around the business
ends of an infectious agent and render it harmless.
The protein folding problem, as it is called, is computationally very difficult. It may even be
an intractable problem. An enormous amount of effort continues to go into finding a mapping
between the amino acid sequence of a molecule, its ultimate configuration, and how its
structure affects its function.
Distributed computing plays a critical role in studying protein folding. As computing power
increases, it will become possible to test more sophisticated models of folding behavior, and
more accurately estimate the intramolecular forces within individual proteins to understand
why they fold the way they do.
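As a taste of what estimating intramolecular interactions involves, the sketch below sums a single Lennard-Jones term over all atom pairs of a toy coordinate set. Real force fields add bonded terms, electrostatics, solvent effects and carefully fitted per-atom parameters, so the parameter values and coordinates here are purely illustrative.

import numpy as np

def lennard_jones_energy(coords, epsilon=0.2, sigma=3.4):
    """Sum one Lennard-Jones term over all atom pairs (toy parameters;
    a real force field uses per-atom-type parameters plus many other terms)."""
    coords = np.asarray(coords, dtype=float)
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(coords[i] - coords[j])
            energy += 4 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    return energy

# Four atoms placed at arbitrary positions (angstrom-like units, invented).
atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 4.1, 0.0), (3.9, 4.0, 0.5)]
print("toy nonbonded energy:", lennard_jones_energy(atoms))

A folding simulation evaluates terms like this, plus many others, for every atom pair at every time step, which is why the computational cost grows so quickly and why distributed and parallel computing matter here.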
A better understanding of protein folding is also critical to finding therapies for prion-based
diseases like bovine spongiform encephalopathy ( BSE, or mad cow disease ) which appears
to be caused by a 'rogue' configuration of a common protein that in turn reconfigures other
proteins it comes into contact with. Prion diseases are poorly understood and present a grave
risk, since one protein molecule is presumably enough to infect an entire organism, and
common sterilization techniques are not sufficient to destroy infectious particles.
Molecular Evolution
A third application of bioinformatics, closely related to genomics and protein analysis, is the
study of molecular evolution.
Molecular evolution is a statistical science that uses changes in genes and proteins as a kind
of evolutionary clock. Over time, any species will accumulate minor mutations in its genetic
code, due to inevitable transcription errors and environmental factors like background
radiation. Some of these mutations prove fatal; a tiny fraction prove beneficial, but the vast
majority have no noticeable effect on the organism.
Because this trickle of small changes is slow, cumulative, and essentially random, it can serve
as a useful indicator of how long two species have been out of genetic contact - that is, when
they last shared a common ancestor. By comparing molecules across many species, scientists
can determine the relative genetic distance between them, and learn more about their
evolutionary history.
The study of molecular markers in evolution builds on the same sequencing and analysis
techniques discussed in the section on DNA. Because it relies on tiny changes to the genome,
this kind of statistical analysis requires very precise sequencing data and sophisticated
analytical methods.
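As a rough illustration of how such small accumulated differences become an evolutionary distance, the sketch below compares two hypothetical aligned sequences (not data from the text), computes the proportion of differing sites, and applies the standard Jukes-Cantor correction.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ (alignment gaps are ignored)."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    diffs = sum(1 for a, b in pairs if a != b)
    return diffs / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical aligned stretch of the same gene from two species
species_a = "ATGGCGTACGTTAGC"
species_b = "ATGGCTTACGTAAGC"

p = p_distance(species_a, species_b)
print(f"p-distance: {p:.3f}  Jukes-Cantor distance: {jukes_cantor(p):.3f}")
```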
An ideal drug target can be acted on to disrupt the disease process while causing no harm to
the human. In order to determine the role our potential drug target
plays in a particular disease mechanism we use DNA and protein chips. These chips can
measure the amount of transcript or protein expressed by a cell at different times or in
different states (healthy versus diseased).
Clustering algorithms are used to organise this expression data into different biologically
relevant clusters. We can then compare the expression profiles from the diseased and healthy
cells to help us understand the role our gene or protein plays in a disease process. All of these
computational tools can help to compose a detailed picture about a protein family, its
involvement in a disease process and its potential as a possible drug target.
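A minimal sketch of this kind of clustering, assuming SciPy is available and using an invented expression matrix (rows are genes, columns are healthy and diseased samples); real analyses involve normalisation and thousands of genes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical expression matrix: rows = genes, columns = conditions
# (e.g. two healthy and two diseased samples)
genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
expression = np.array([
    [2.1, 2.3, 8.0, 7.8],   # up-regulated in disease
    [2.0, 2.2, 7.9, 8.1],   # up-regulated in disease
    [5.0, 5.1, 5.0, 4.9],   # unchanged
    [9.0, 8.8, 1.2, 1.0],   # down-regulated in disease
    [8.9, 9.1, 1.1, 1.3],   # down-regulated in disease
])

# Average-linkage hierarchical clustering on a correlation distance
Z = linkage(expression, method="average", metric="correlation")
labels = fcluster(Z, t=3, criterion="maxclust")

for gene, label in zip(genes, labels):
    print(gene, "-> cluster", label)
```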
Following on from the genomics explosion and the huge increase in the number of potential
drug targets, there has been a move from the classical linear approach of drug discovery to a
non-linear, high-throughput approach. The field of bioinformatics has become a major
part of the drug discovery pipeline, playing a key role in validating drug targets. By
integrating data from many inter-related yet heterogeneous resources, bioinformatics can help
in our understanding of complex biological processes and help improve drug discovery.
Identify Target Disease: One needs to know all about the disease and existing or
traditional remedies. It is also important to look at very similar afflictions and their
known treatments.
Target identification alone is not sufficient in order to achieve a successful treatment
of a disease. A real drug needs to be developed. This drug must influence the target
protein in such a way that it does not interfere with normal metabolism. One way to
achieve this is to block activity of the protein with a small molecule. Bioinformatics
methods have been developed to virtually screen the target for compounds that bind
and inhibit the protein. Another possibility is to find other proteins that regulate the
activity of the target by binding and forming a complex.
Study Interesting Compounds: One needs to identify and study the lead compounds
that have some activity against a disease. These may be only marginally useful and
may have severe side effects. These compounds provide a starting point for
refinement of the chemical structures.
Detect the Molecular Bases for Disease: If it is known that a drug must bind to a
particular spot on a particular protein or nucleotide then a drug can be tailor made to
bind at that site. This is often modeled computationally using any of several different
techniques. Traditionally, the primary way of determining what compounds would be
tested computationally was provided by the researchers' understanding of molecular
interactions. A second method is the brute force testing of large numbers of
compounds from a database of available structures.
Rational drug design techniques: These techniques attempt to encode the
researchers' understanding of how to choose likely compounds into a software
package that is capable of modeling a very large number of compounds in an
automated way. Many different algorithms have been used for this type of testing,
many of which were adapted from artificial intelligence applications. The complexity
of biological systems makes it very difficult to determine the structures of large
biomolecules. Ideally an experimentally determined (X-ray or NMR) structure is used,
but biomolecules are very difficult to crystallize.
Refinement of compounds: Once a number of lead compounds have been
found, computational and laboratory techniques have been very successful in refining
the molecular structures to give a greater drug activity and fewer side effects. This is
done both in the laboratory and computationally by examining the molecular
structures to determine which aspects are responsible for both the drug activity and
the side effects.
Quantitative Structure Activity Relationships (QSAR): This computational
technique is used to identify which parts of a compound are responsible for its
activity, in order to guide refinement of the drug. QSAR consists of computing every
possible descriptor of a molecule and then performing an extensive curve fit to find
out which aspects of the molecule correlate well with the drug activity or side effect
severity. This information can then be used to suggest new chemical modifications for
synthesis and testing.
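A minimal sketch of the curve-fitting step, assuming NumPy and an invented table of molecular descriptors and measured activities; real QSAR work uses many more descriptors plus cross-validation and feature selection.

```python
import numpy as np

# Hypothetical descriptor table: each row is a candidate compound, the
# columns are computed descriptors (e.g. logP, molecular weight, polar
# surface area), and y holds the measured activities (e.g. pIC50 values).
X = np.array([
    [1.2, 310.0, 45.0],
    [2.5, 402.0, 60.0],
    [0.8, 280.0, 90.0],
    [3.1, 455.0, 30.0],
    [1.9, 350.0, 75.0],
])
y = np.array([5.1, 6.8, 3.9, 7.5, 5.6])

# Least-squares fit of activity against the descriptors
# (a column of ones is appended for the intercept term)
A = np.hstack([X, np.ones((X.shape[0], 1))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

print("descriptor weights:", coeffs[:-1])
print("intercept:", coeffs[-1])
```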
Solubility of Molecule: One needs to check whether the molecule is water
soluble or readily soluble in fatty tissue, since this affects what part of the body it becomes
concentrated in. The ability to get a drug to the correct part of the body is an
important factor in its potency. Ideally there is a continual exchange of information
between the researchers doing QSAR studies, synthesis and testing. These techniques
are frequently used and often very successful since they do not rely on knowing the
biological basis of the disease which can be very difficult to determine.
Drug Testing: Once a drug has been shown to be effective by an initial assay
technique, much more testing must be done before it can be given to human patients.
Animal testing is the primary type of testing at this stage. Eventually, the compounds
deemed suitable at this stage are sent on to clinical trials. In the clinical
trials, additional side effects may be found and human dosages are determined.
Proteins attach to the DNA and help the strands coil up into a chromosome when the cell
gets ready to divide.
The DNA is organized into stretches of genes, stretches where proteins attach to coil the
DNA into chromosomes, stretches that "turn a gene on" and "turn a gene off", and large
stretches whose purpose is not yet known to scientists.
The genes carry the instructions for making all the thousands of proteins that are found in a
cell. The proteins in a cell determine what that cell will look like and what jobs that cell will
do. The genes also determine how the many different cells of a body will be arranged. In
these ways, DNA controls how many fingers you have, where your legs are placed on your
body, and the color of your eyes.
A chromosome is made up of DNA and the proteins attached to it. There are 23 pairs of
chromosomes in a human cell. One of each pair was inherited from your mother and the
other from your father. DNA is a particular bio-molecule. All of the DNA in a cell is found
in individual pieces, called chromosomes. This would be like muffins. Muffins are made up
of muffin-matter and paper cups. All of the muffin-matter in your kitchen is found in
individual pieces, called muffins.
A revolution has occurred in the last few decades that explains how DNA makes us look like
our parents and how a faulty gene can cause disease. This revolution opens the door to
curing illness, both hereditary and contracted. The door has also been opened to an ethical
debate over the full use of our new knowledge.
Components of DNA
DNA is a polymer. The monomer units of DNA are nucleotides, and the polymer is known as
a "polynucleotide." Each nucleotide consists of a 5-carbon sugar (deoxyribose), a nitrogen
containing base attached to the sugar, and a phosphate group. There are four different types
of nucleotides found in DNA, differing only in the nitrogenous base. The four nucleotides are
given one letter abbreviations as shorthand for the four bases.
A is for adenine
G is for guanine
C is for cytosine
T is for thymine
Purine Bases
Adenine and guanine are purines. Purines are the larger of the two types of bases found in
DNA. Structures are shown below:
Structure of A and G
The 9 atoms that make up the fused rings (5 carbon, 4 nitrogen) are numbered 1-9. All ring
atoms lie in the same plane.
Pyrimidine Bases
Cytosine and thymine are pyrimidines. The 6 ring atoms (4 carbon, 2 nitrogen) are numbered 1-6.
Like purines, all pyrimidine ring atoms lie in the same plane.
Structure of C and T
Deoxyribose Sugar
The deoxyribose sugar of the DNA backbone has 5 carbons and 3 oxygens. The carbon atoms
are numbered 1', 2', 3', 4', and 5' to distinguish them from the numbering of the atoms of the purine
and pyrimidine rings. The hydroxyl groups on the 5'- and 3'- carbons link to the phosphate
groups to form the DNA backbone. Deoxyribose lacks a hydroxyl group at the 2'-position
when compared to ribose, the sugar component of RNA.
Structure of deoxyribose
Nucleosides
A nucleoside is one of the four DNA bases covalently attached to the C1' position of a sugar.
The sugar in deoxynucleosides is 2'-deoxyribose. The sugar in ribonucleosides is ribose.
Nucleosides differ from nucleotides in that they lack phosphate groups. The four different
nucleosides of DNA are deoxyadenosine (dA), deoxyguanosine (dG), deoxycytidine (dC),
and (deoxy)thymidine (dT, or T).
Structure of dA
In dA and dG, there is an "N-glycoside" bond between the sugar C1' and N9 of the purine.
Nucleotides
A nucleotide is a nucleoside with one or more phosphate groups covalently attached to the 3'- and/or 5'-hydroxyl group(s).
DNA Backbone
The DNA backbone is a polymer with an alternating sugar-phosphate sequence. The
deoxyribose sugars are joined at both the 3'-hydroxyl and 5'-hydroxyl groups to phosphate
groups in ester links, also known as "phosphodiester" bonds.
Example of DNA Backbone: 5'-d(CGAAT):
Two DNA strands wind around a common helix axis in a right-handed spiral
The two polynucleotide chains run in opposite directions
The sugar-phosphate backbones of the two DNA strands wind around the helix axis
like the railing of a spiral staircase
The bases of the individual nucleotides are on the inside of the helix, stacked on top
of each other like the steps of a spiral staircase.
Base Pairs
Within the DNA double helix, A forms 2 hydrogen bonds with T on the opposite strand, and
G forms 3 hydrogen bonds with C on the opposite strand.
Example of dA-dT base pair as found within DNA double helix
dA-dT and dG-dC base pairs are the same length, and occupy the same space within a
DNA double helix. Therefore the DNA molecule has a uniform diameter.
dA-dT and dG-dC base pairs can occur in any order within DNA molecules
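Because A always pairs with T and G always pairs with C, the sequence of one strand fully determines the other. A small sketch of this complementarity, applied to the 5'-d(CGAAT) backbone example above.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand):
    """Return the sequence of the opposite, antiparallel strand, read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

strand = "CGAAT"                      # the 5'-d(CGAAT) example above
print(reverse_complement(strand))     # ATTCG, the complementary strand 5'->3'
```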
Experimental Data
Much experimental data can aid the structure prediction process. Some of these are:
Disulphide bonds, which provide tight restraints on the location of cysteines in space
Spectroscopic data, which can give you an idea as to the secondary structure content
of your protein
Site directed mutagenesis studies, which can give insights as to residues involved in
active or binding sites
Knowledge of proteolytic cleavage sites and post-translational modifications, such as
phosphorylation or glycosylation, can suggest residues that must be accessible
Etc.
Remember to keep all of the available data in mind when doing predictive work. Always ask
yourself whether a prediction agrees with the results of experiments. If not, then it may be
necessary to modify what you've done.
If the answer to any of the above questions is yes, then it is worthwhile trying to break your
sequence into pieces, or ignore particular sections of the sequence, etc.
One of the most important advances in sequence comparison in recent years has been the
development of both gapped BLAST and PSI-BLAST (position-specific iterated BLAST).
Both of these have made BLAST much more sensitive, and the latter is able to detect very
remote homologues by taking the results of one search, constructing a profile and then using
this to search the database again to find other homologues (the process can be repeated until
no new sequences are found). It is essential that one compare any new protein sequence to the
database with PSI-BLAST to see if known structures can be found prior to doing any of the
other methods discussed in the next sections.
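As a hedged illustration only: assuming the NCBI BLAST+ suite is installed and a local protein database has been prepared with makeblastdb, an iterated PSI-BLAST search could be launched from Python roughly as follows (the query file name and database name are hypothetical).

```python
import subprocess

# Hypothetical query file and local database name; adjust to your setup.
cmd = [
    "psiblast",
    "-query", "my_protein.fasta",
    "-db", "nr_local",
    "-num_iterations", "3",     # iterate until convergence or 3 rounds
    "-evalue", "0.001",
    "-out", "psiblast_hits.txt",
]
subprocess.run(cmd, check=True)
```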
Other methods also exist for comparing a single sequence to a database (for example FASTA and Smith-Waterman searches).
It is also possible to use multiple sequence information to perform more sensitive searches.
Essentially this involves building a profile from some kind of multiple sequence alignment. A
profile essentially gives a score for each type of amino acid at each position in the sequence,
and generally makes searches more sensitive. Tools for doing this include profile-based search programs such as PSI-BLAST and profile hidden Markov model packages such as HMMER.
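The sketch below shows the idea of a profile in miniature: a toy alignment of invented sequences is converted into a simple log-odds position-specific scoring matrix, which is then used to score candidate sequences. Real profile tools add substitution-matrix pseudocounts, gap handling and calibrated statistics.

```python
import math
from collections import Counter

# Toy multiple alignment of a hypothetical short protein family
alignment = [
    "MKVLA",
    "MKILA",
    "MRVLA",
    "MKVLS",
]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment):
    """Log-odds score for each residue at each column versus a uniform
    background, with a pseudocount of 1 to avoid zero frequencies."""
    nseqs = len(alignment)
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + 1) / (nseqs + len(AMINO_ACIDS))
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, seq):
    return sum(col[aa] for col, aa in zip(pssm, seq))

pssm = build_pssm(alignment)
print(score(pssm, "MKVLA"))   # high score: matches the family
print(score(pssm, "GGGGG"))   # low score: unrelated sequence
```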
A different approach for incorporating multiple sequence information into a database search
is to use a MOTIF. Instead of giving every amino acid some kind of score at every position in
an alignment, a motif ignores all but the most invariant positions in an alignment, and just
describes the key residues that are conserved and define the family. Sometimes this is called a
"signature". For example, "H-[FW]-x-[LIVM]-x-G-x(5)-[LV]-H-x(3)-[DE]" describes a
family of DNA binding proteins. It can be translated as "histidine, followed by either a
phenylalanine or tryptophan, followed by an amino acid (x), followed by leucine, isoleucine,
valine or methionine, followed by any amino acid (x), followed by glycine,... [etc.]".
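Such patterns map naturally onto regular expressions. A minimal sketch, using the pattern quoted above and a hypothetical protein sequence; it does not handle every detail of PROSITE syntax (for example the < and > terminal anchors).

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.replace("-", "")                        # positions are '-' separated
    regex = regex.replace("x", ".")                         # 'x' means any amino acid
    regex = re.sub(r"\((\d+)\)", r"{\1}", regex)            # x(5)   -> .{5}
    regex = re.sub(r"\((\d+),(\d+)\)", r"{\1,\2}", regex)   # x(2,4) -> .{2,4}
    return regex

motif = "H-[FW]-x-[LIVM]-x-G-x(5)-[LV]-H-x(3)-[DE]"
regex = prosite_to_regex(motif)
print(regex)   # H[FW].[LIVM].G.{5}[LV]H.{3}[DE]

# Hypothetical protein sequence to scan
seq = "MKTAHWALAGQQQQQVHKKKDSLN"
for match in re.finditer(regex, seq):
    print("match at position", match.start(), ":", match.group())
```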
PROSITE (ExPASy Geneva) contains a huge number of such patterns, and several sites allow
you to search these data:
ExPASy
EBI
It is best to search a few different databases in order to find as many homologues as possible.
A very important thing to do, and one which is sometimes overlooked, is to compare any new
sequence to a database of sequences for which 3D structure information is available. Whether
or not your sequence is homologous to a protein of known 3D structure is not obvious in the
output from many searches of large sequence databases. Moreover, if the homology is weak,
the similarity may not be apparent at all during the search through a larger database.
One last thing to remember is that one can save a lot of time by making use of pre-prepared
protein alignments. Many of these alignments are hand edited by experts on the particular
protein families, and thus represent probably the best alignment one can get given the data
they contain (i.e. they are not always as up to date as the most recent sequence databases).
These databases include:
SMART (Oxford/EMBL)
PFAM (Sanger Centre/Wash-U/Karolinska Institutet)
COGS (NCBI)
PRINTS (UCL/Manchester)
BLOCKS (Fred Hutchinson Cancer Research Center, Seattle)
SBASE (ICGEB, Trieste)
Generally one can compare a protein sequence to these databases via a variety of techniques.
These can also be very useful for domain assignment.
Locating domains
If you have a sequence of more than about 500 amino acids, you can be nearly certain that it
will be divided into discrete functional domains. If possible, it is preferable to split such large
proteins up and consider each domain separately. You can predict the location of domains in
a few different ways. The methods below are listed (approximately) from most to least
reliable.
If homology to other sequences occurs only over a portion of the probe sequence and
the other sequences are whole (i.e. not partial sequences), then this provides the
strongest evidence for domain structure. You can either do database searches yourself
or make use of well-curated, pre-defined databases of protein domains. Searches of
these databases (see links below) will often assign domains easily.
o SMART (Oxford/EMBL)
o PFAM (Sanger Centre/Wash-U/Karolinska Institutet)
o COGS (NCBI)
o PRINTS (UCL/Manchester)
o BLOCKS (Fred Hutchinson Cancer Research Center, Seattle)
o SBASE (ICGEB, Trieste)
Regions of low complexity often separate domains in multidomain proteins. Long
stretches of repeated residues, particularly proline, glutamine, serine or threonine,
often indicate linker sequences and are usually a good place to split proteins into
domains.
Low complexity regions can be defined using the program SEG, which is generally
available in most BLAST distributions and web servers (a version of SEG is also
contained within the GCG suite of programs); a minimal entropy-based sketch is given after this list.
Transmembrane segments are also very good dividing points, since they can easily
separate extracellular from intracellular domains. There are many methods for
predicting these segments, including:
o TMAP (EMBL)
o PredictProtein (EMBL/Columbia)
o TMHMM (CBS, Denmark)
o TMpred (Baylor College)
o DAS (Stockholm)
Something else to consider is the presence of coiled-coils. These unusual structural
features sometimes (but not always) indicate where proteins can be divided into
domains.
Secondary structure prediction methods will often predict regions of proteins to have
different protein structural classes. For example one region of sequence may be
predicted to contain only alpha helices and another to contain only beta sheets. These
can often, though not always, suggest likely domain structure (e.g. an all-alpha
domain and an all-beta domain).
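As mentioned in the low-complexity item above, the idea behind such filters can be sketched by sliding a window along the sequence and flagging windows with low Shannon entropy. The sequence and cut-off below are invented, and SEG itself uses a more elaborate two-stage criterion.

```python
import math
from collections import Counter

def window_entropy(seq, size=12):
    """Shannon entropy (bits) of each sliding window along the sequence."""
    entropies = []
    for i in range(len(seq) - size + 1):
        counts = Counter(seq[i:i + size])
        h = -sum((n / size) * math.log2(n / size) for n in counts.values())
        entropies.append(h)
    return entropies

# Hypothetical sequence with a glutamine/serine-rich linker in the middle
seq = "MKTAYIAKQRQISFVKSHFSRQQQQQSSSQQQSSLDTGVETLIRNAWEHKG"

for i, h in enumerate(window_entropy(seq)):
    if h < 2.0:   # arbitrary cut-off for this illustration
        print(f"low-complexity window at {i}: {seq[i:i+12]} (H = {h:.2f})")
```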
If you have separated a sequence into domains, then it is very important to repeat all the
database searches and alignments using the domains separately. Searches with sequences
containing several domains may not find all sub-homologies, particularly if the domains are
abundant in the database (e.g. kinases, SH2 domains, etc.). There may also be "hidden"
domains. For example, if there is a stretch of 80 amino acids with few homologues nested in
between a kinase and an SH2 domain, then you may miss matches to it when searching the
whole sequence against a database.
Early secondary structure prediction methods worked from single sequences rather
than families of homologous sequences, and there were relatively few known 3D structures
from which to derive parameters. Probably the most famous early methods are those of Chou
& Fasman, Garnier, Osguthorpe & Robson (GOR) and Lim. Although the authors originally
claimed quite high accuracies (70-80 %), under careful examination, the methods were shown
to be only between 56 and 60% accurate (see Kabsch & Sander, 1984 given below). An early
problem in secondary structure prediction had been the inclusion of structures used to derive
parameters in the set of structures used to assess the accuracy of the method.
Recent improvements
The availability of large families of homologous sequences revolutionised secondary
structure prediction. Traditional methods, when applied to a family of proteins rather than a
single sequence proved much more accurate at identifying core secondary structure elements.
The combination of sequence data with sophisticated computing techniques such as neural
networks has led to accuracies well in excess of 70%. Though this seems a small percentage
increase, these predictions are actually much more useful than those for a single sequence,
since they tend to predict the core accurately. Moreover, the limit of 70-80% may be a
function of secondary structure variation within homologous proteins.
There are numerous automated methods for predicting secondary structure from multiply
aligned protein sequences. Nearly all of these now run via the world wide web.
Manual intervention
It has long been recognised that patterns of residue conservation are indicative of particular
secondary structure types. Alpha helices have a periodicity of 3.6 residues per turn, which
means that for helices with one face buried in the protein core and the other exposed to
solvent, residues at positions i, i+3, i+4 and i+7 (where i is a residue in the helix) will lie on one face
of the helix. Many alpha helices in proteins are amphipathic, meaning that one face is
pointing towards the hydrophobic core and the other towards the solvent. Thus patterns of
hydrophobic residue conservation showing the i, i+3, i+4, i+7 pattern are highly indicative of
an alpha helix.
For example, this helix in myoglobin has this classic pattern of hydrophobic and polar residue
conservation (i = 1):
Similarly, the geometry of beta strands means that adjacent residues have their side chains
pointing in opposite directions. Beta strands that are half buried in the protein core will tend
to have hydrophobic residues at positions i, i+2, i+4, i+8 etc, and polar residues at positions
i+1, i+3, i+5, etc.
For example, this beta strand in CD8 shows this classic pattern:
Beta strands that are completely buried (as is often the case in proteins containing both alpha
helices and beta strands) usually contain a run of hydrophobic residues, since both faces are
buried in the protein core.
This strand from Chemotaxis protein CheY is a good example:
A sensible strategy is to combine automated predictions with a manual inspection of conservation patterns
like those shown above. It has been shown in numerous successful examples
that this strategy often leads to nearly perfect predictions.
In this figure, three automated secondary structure predictions (PHD, SOPMA and SSPRED)
appear below the alignment of 12 glutamyl tRNA reductase sequences. Positions within the
alignment showing a conservation of hydrophobic side-chain character are shown in yellow,
and those showing near total conservation of non-hydrophobic residues (often indicative of
active sites) are coloured green.
Predictions of accessibility performed by PHD (PHD Acc. Pred.) are also shown (b = buried,
e = exposed), as is a prediction I performed by looking for patterns indicative of the three
secondary structure types shown above. For example, positions (within the alignment) 38-45
exhibit the classical amphipathic helix pattern of hydrophobic residue conservation, with
positions i, i+3, i+4 and i+7 showing a conservation of hydrophobicity, with intervening
positions being mostly polar. Positions 13-16 comprise a short stretch of conserved
hydrophobic residues, indicative of a beta-strand, similar to the example from CheY protein
shown above.
By looking for these patterns I built up a prediction of the secondary structure for most
regions of the protein. Note that most methods - automated and manual - agree for many
regions of the alignment.
Given the results of several methods of predicting secondary structure, one can build up a
consensus picture of the secondary structure, such as that shown at the bottom of the
alignment above.
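A minimal sketch of building such a consensus by a per-position majority vote over several predictions; the prediction strings below are invented, and real methods also provide per-residue confidence values that can be used to weight the vote.

```python
from collections import Counter

# Hypothetical per-residue predictions from three methods
# (H = helix, E = strand, C = coil)
predictions = [
    "CCHHHHHHHCCEEEEECC",   # e.g. from PHD
    "CCHHHHHHCCCEEEEECC",   # e.g. from SOPMA
    "CCCHHHHHHCCEEEECCC",   # e.g. from SSPRED
]

def consensus(predictions):
    """Majority vote at each position; ties fall back to coil."""
    result = []
    for column in zip(*predictions):
        state, count = Counter(column).most_common(1)[0]
        result.append(state if count > len(predictions) // 2 else "C")
    return "".join(result)

print(consensus(predictions))
```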
For many proteins (roughly 70%) there will be a suitable structure in the database from which
to build a 3D model. Unfortunately, the lack of sequence similarity means that many of
these go undetected until after 3D structure determination.
The goal of fold recognition
Methods of protein fold recognition attempt to detect similarities between protein 3D
structures that are not accompanied by any significant sequence similarity. There are many
approaches, but the unifying theme is to try to find folds that are compatible with a
particular sequence. Unlike sequence-only comparison, these methods take advantage of the
extra information made available by 3D structure information. In effect, they turn the protein
folding problem on its head: rather than predicting how a sequence will fold, they predict
how well a fold will fit a sequence.
The realities of fold recognition
Despite initially promising results, methods of fold recognition are not always accurate.
Guides to the accuracy of protein fold recognition can be found in the proceedings of the
Critical Assessment of Structure Predictions (CASP) conferences. At the first meeting in
1994 (CASP1) the methods were found to be about 50 % accurate at best with respect to their
ability to place a correct fold at the top of a ranked list. Though many methods failed to
detect the correct fold at the top of a ranked list, a correct fold was often found in the top 10
scoring folds. Even when the methods were successful, alignments of sequence on to protein
3D structure were usually incorrect, meaning that comparative modelling performed using
such models would be inaccurate.
The CASP2 meeting held in December 1996, showed that many of the methods had
improved, though it is difficult to compare the results of the two assessments (i.e. CASP1 &
CASP2) since very different criteria were used to assess correct answers. It would be foolish
and over-ambitious for me to present a detailed assessment of the results here. However, an
important thing to note was that Murzin & Bateman managed to attain near 100% success by
the use of careful human insight, a knowledge of known structures, secondary structure
predictions and thoughts about the function of the target sequences. Their results strongly
support the arguments given below that human insight can be a powerful aid during fold
recognition. A summary of the results from this meeting can be found in the PROTEINS issue
dedicated to the meeting.
The CASP3 meeting was held in December 1998. It showed some progress in the ability of
fold recognition methods to detect correct protein folds and in the quality of alignments
obtained. A detailed summary of the results will appear towards the end of 1999 in the
PROTEINS supplement.
For my talk, I did a crude assessment of 5 methods of fold recognition. I took 12 proteins of
known structure (3 from each folding class) and ran each of the five methods using default
parameters. I then asked how often a correct fold (not allowing trivial, sequence-detectable
folds) was found at the first rank, or in the top 10 scoring folds. I also asked how often the method
found the correct folding class at the first rank.
Perhaps the worst result from this study is shown below:
One method suggested that the sequence for the Probe (left) (a four helix bundle) would best
fit onto the structure shown on the right (an OB fold, comprising a six stranded barrel).
The results suggest that one should use caution when using these methods. In spite of this, the
methods remain very useful.
A practical approach:
Although they are not 100 % accurate, the methods are still very useful. To use the methods I
would suggest the following:
Run as many methods as you can, and run each method on as many sequences (from
your homologous protein family) as you can. The methods almost always give
somewhat different answers with the same sequences. I have also found that a single
method will often give different results for sets of homologous sequences, so I would
also suggest running each method on as many homologues as possible. After all of
these runs, one can build up a consensus picture of the likely fold in a manner similar
to that used for secondary structure prediction above.
Remember the expected accuracy of the methods, and don't use them as black-boxes.
Remember that a correct fold may not be at the top of the list, but that it is likely to be
in the top 10 scoring folds.
Think about the function of your protein, and look into the function of the proteins
that have been found by the various methods. If you see a functional similarity, then
you may have detected a weak sequence homologue, or remote homologue. At
CASP2, as said above, Murzin & Bateman managed to obtain remarkably accurate
predictions by identification of remote homologues.
If your predicted fold has many "relatives", then have a look at what they are. Ask:
Are there non-core secondary structure elements that might not be present in all
members of the fold?
Core secondary structure elements, such as those comprising a beta-barrel, should really be
present in a fold. If your predicted secondary structures can't be made to match up with what
you think is the core of the protein fold, then your prediction of fold may be wrong (but be
careful, since your secondary structure prediction may contain errors). You can also use your
prediction together with the core secondary structure elements to derive an alignment of
predicted and observed secondary structures.
Most places with groups studying structural biology also have commercial packages, such as
Quanta, SYBYL or Insight, which contain more features than the visualisation packages
described above.
PROTEOMICS
After the sequencing of the human genome, it is now clear that much of the complexity of the
human body resides at the level of proteins rather than the DNA sequences. This view is
supported by the unexpectedly low number of human genes (approximately 35,000) and the
estimated number of proteins (which is currently about 300,000 - 450,000 and steadily
rising), which are generated from these genes. For example, it is estimated that on average
human proteins exist as ten to fifteen different post-translationally modified forms, all of
which presumably have different functions. Much of the information processing in healthy
and diseased human cells can only be studied at the protein level, and there is increasing
evidence to link minor changes in expression of some of these modifications with specific
diseases. Together with rapidly improving technology for characterising the proteome, there
is now a unique chance to make an impact in this area.
Proteomics has found applications in many areas, including:
medicine
biotechnology
food sciences
agriculture
animal genetics and horticulture
environmental surveillance
pollution
Within the field of medicine alone in Odense, proteomics has been applied to:
protein changes during normal processes like differentiation, development and ageing
abnormal protein expression in disease development (and is especially suited for
studies of diseases of multigenic origin)
diagnosis, prognosis
identification of novel drug targets
selection of candidate drugs
surrogate markers
targets for gene therapy
toxicology
mechanism of drug action
Challenges in Proteomics
Proteomics for understanding gene function
Proteomics has been the ideal tool for investigating the function of unknown genes for many
years. The deletion (or insertion) of a single gene induces a number of changes in the
expression of groups of proteins with related function. These groups are often proteins in the
same complex or within the same pathway, and so these studies can map the activity of proteins
even when their function is unknown or they are expressed at levels too low to be detected.
The final step in understanding gene function is the transfection of particular genes into
functionally related cells which do not possess the gene (or, for example, where the endogenous
gene is deleted and replaced with a genetically modified version).
Proteomics for understanding the molecular regulation of the cell
In the organism, cells exist in tissues in a highly specialised state and each cell type within
each type of tissue is able to carry out certain particular functions which are often important
or essential for the whole organism to perform correctly. When a cell is removed from its
normal environment, it rapidly undergoes a process of de-differentiation. During this process
the cell loses many of its specialised functions; for this reason alone, much of the early
cancer work done until now has very little value, as it has been based on established cell lines.
Therefore, one of the key tasks currently in hand is a quest to define growth conditions which
press the cell to either retain its differentiated phenotype or even better to induce the cell to
develop into a structure that resembles the tissue. Early successes were the derivation of so
called "raft cultures" for the development of a multilayered keratinocyte tissue that is
indistinguishable from the epidermis (minus the hair and sweat follicles) when examined by
the electron microscope. Currently the quest is to induce the differentiation of other cell
types, (for example hepatocytes) into a structure that possesses the enzymatic functions of the
original tissue. In the case of liver, this would have a huge potential value for studies of for
example drug toxicology and environmental chemical surveillance. The use of microgravity
devices (developed in collaboration with the US National Aeronautics and Space Administration, NASA) has
been shown to induce liver-like structures in vitro and thus appears to offer a solution.
Current studies now in hand have demonstrated that many more specialised enzyme systems
are active in the hepatocytes.
Proteomics for identification of multiprotein complexes
The classical approach to studying multiprotein complexes has been to isolate them, either by
direct purification or by immunoprecipitation and then identify the components of the
complex one by one. Their function has been studied by interlink analysis of protein
expression, or by site directed mutagenesis of particular proteins in the complex. Novel mass
spectrometry based technology now for the first time is making it possible to study a large
number of protein interactions in the same experiment. This capability is used commercially
to "surround" known disease genes with interaction partners which are potentially more
amenable to drug treatment. Furthermore, delineating the role of novel gene products in
protein networks and in large multi-protein complexes in the cell has proven to be one of the
best ways to learn about their function. This knowledge is indispensable both in basic cellular
biology and in biomedicine. Novel proteomics technologies which are now in academic
development in Odense and elsewhere promise to further extend the applicability of
proteomics to measure changes in the cell. These capabilities include the direct and
automated measurement of protein expression level changes (which can be applied to, for
example, cancer patient material) as well as the large scale study of signalling processes
through the analysis of phosphorylation by mass spectrometry. Together, these developments
have the potential to revolutionize the way cellular processes are studied in basic biology and
in disease.
Proteomics for studying cellular dynamics and organization
Post-translational modifications of proteins (of which there are now known to be over 200
different types) govern key cellular processes, such as timing of the cell cycle, orchestration
of cell differentiation and growth, and cell-cell communication processes. Specific labelling
techniques and mass spectrometry are exquisitely suited to study post-translationally
modified proteins, for example phosphoproteins, glycoproteins and acylated proteins, that are
present at defined stages of the cell's life cycle or targeted to specific organelles within the
cell. Specific radioisotopes (e.g. [32P]-orthophosphate) or fluorescent dyes (e.g. for S-nitrosylation)
allow proteome-wide scanning for all proteins with a particular type of
modification, identifying which proteins are modified and what percentage of these proteins
are modified under which growth conditions. Alternatively, a number of mass spectrometry
based methods are in development that will enable systematic studies of whole populations of
post-translationally modified proteins isolated from cells or tissues, leading to the concept of
"modification-specific proteomics". These developments are pivotal for gaining further
insights into the complexity of life. Delineating the dynamics of post-translational
modification of proteins is a prerequisite for understanding the organization and dynamics of
normal or perturbed cellular systems.
Introduction
Proteomics is the study of total protein complements, proteomes, e.g. from a given tissue or
cell type. Nowadays proteomics can be divided into classical and functional proteomics.
Classical proteomics is focused on studying complete proteomes, e.g. from two differentially
treated cell lines, whereas functional proteomics studies more limited protein sets. Classical
proteome analyses are usually carried out by using two-dimensional gel electrophoresis (2DE) for protein separation followed by protein identification by mass spectrometry (MS) and
database searches. The functional proteomics approach uses a subset of proteins isolated from
the starting material, e.g. with an affinity-based method. This protein subset can then be
separated by using normal SDS-PAGE or by 2-DE. Proteome analysis is complementary to
DNA microarray technology: with the proteomics approach it is possible to study changes in
protein expression levels and also protein-protein interactions and post-translational
modifications.
Methods
Two-dimensional electrophoresis
Proteins are separated in 2-DE according to their pI and molecular weight. In 2-DE analysis
the first step is sample preparation; proteins in cells or tissues to be studied have to be
solubilized and DNA and other contaminants must be removed. The proteins are then
separated by their charge using isoelectric focusing. This step is usually carried out by using
immobilized pH-gradient (IPG) strips, which are commercially available. The second
dimension is a normal SDS-PAGE, where the focused IPG strip is used as the sample. After
2-DE separation, proteins can be visualized with normal dyes, like Coomassie or silver
staining.
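Since 2-DE separates proteins by pI and molecular weight, it is often useful to compute the theoretical values for a candidate protein when matching spots. A minimal sketch assuming Biopython is installed; the sequence is hypothetical.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Hypothetical protein sequence from a 2-DE spot of interest
seq = ("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR")

analysis = ProteinAnalysis(seq)
print(f"molecular weight: {analysis.molecular_weight():.1f} Da")
print(f"theoretical pI:   {analysis.isoelectric_point():.2f}")
```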
Proteins are typically identified by peptide mass fingerprinting
with MALDI-TOF, and by sequence tags obtained by nano-ESI tandem mass spectrometry (MS/MS). The
sensitivity of protein identification by MS is in the femtomole range.
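A minimal sketch of the in silico side of peptide mass fingerprinting: digest a hypothetical protein using trypsin's cleavage rule (after K or R, but not before P) and compute approximate average peptide masses that could be compared against a MALDI-TOF spectrum.

```python
import re

# Approximate average masses of amino acid residues (Da); water is added per peptide
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02

def tryptic_peptides(seq):
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", seq) if p]

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[res] for res in peptide) + WATER

# Hypothetical protein from a gel spot
protein = "MKWVTFISLLLLFSSAYSRGVFRRDTHK"
for pep in tryptic_peptides(protein):
    print(f"{pep:20s} {peptide_mass(pep):8.2f} Da")
```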
Protocols
SILVER STAINING
Fixation: 1 h (up to overnight)
Rinse: 20% EtOH, 15 min
Rinse: MQ-water, 15 min
Sensitizer: 90 sec
Rinse: MQ-water, 2 x 20 sec
Silver: 30 min
Rinse: MQ-water, 2 x 10 sec
Important:
-use plenty of MQ-water for rinsing
-make Sensitizer, Silver and Developer just prior to use
-if you stain the gels in plastic boxes, wipe the boxes with ethanol before use
1. Excise the stained protein band from the gel with a scalpel. Cut the gel slice into ca. 1x1
mm cubes, and put the pieces in an Eppendorf tube. (Use small tubes that are rinsed with
ethanol/methanol.)
2. Wash the gel pieces twice in 200 µl 0.2 M NH4HCO3/ACN (1:1) for 15 min at 37 °C.
3. Shrink the gel pieces by adding 100 µl ACN (wait until the pieces become white, about 10
min), and then remove the ACN. Dry in a vacuum centrifuge for 5 min.
4. Rehydrate the gel pieces in 100 µl 20 mM DTT in 0.1 M NH4HCO3 for 30 min at 56 °C.
(3.08 mg DTT/1 ml 0.1 M NH4HCO3)
5. Remove excess liquid, and shrink the pieces with ACN as above (no vacuum is needed).
Remove the ACN, and add 100 µl 55 mM iodoacetamide in 0.1 M NH4HCO3. Incubate 15 min in
the dark at RT. (10.2 mg IAA/1 ml 0.1 M NH4HCO3)
6. Remove excess liquid, wash the pieces twice with 100 µl 0.1 M NH4HCO3 and shrink
again with ACN. Dry in a vacuum centrifuge for 5 min.
7. Add 10 µl 0.04 µg/µl mTr solution (in 0.1 M NH4HCO3, 10% ACN), and allow it to absorb
for 10 min. Then add 0.1 M NH4HCO3, 10% ACN to completely cover the gel pieces.
Incubate overnight at 37 °C in an incubator. (Remember to check the pH of the 0.1 M NH4HCO3;
it should be ~8.)
8. Add 20-50 µl 5% HCOOH (equal volume with the digestion mixture), vortex, incubate
15 min at 37 °C and collect the supernatant for tipping. The HCOOH extraction can be repeated.
OR
8. Add 20-50 µl ACN (equal volume with the digestion mixture), vortex, incubate 5 min
at 37 °C and collect the supernatant. Extract the peptides twice at 37 °C with 150 µl 5%
HCOOH/50% ACN (pool the supernatant and extracts). Dry down the peptide extracts in a
vacuum centrifuge.
Tipping (MALDI)
Many challenges remain in proteomics, from the difficulty of reproducing
results from one laboratory to the next to the lack of high-throughput techniques. Leading the
team of top scientists is Dr Christoph Eckerskorn, who pioneered key analytical approaches
used today in proteomics while at the Max Planck Institute in Germany. When Tecan
Proteomics was established, a majority share in Dr Weber GmbH was also secured. Dr Weber
GmbH develops and commercializes proprietary free-flow electrophoresis, a key enabling
technology in protein fractionation. This technology significantly reduces the complexity of a
proteome, a major hurdle in proteomics research today.
Q. Can you tell us more about proteomics, which I understand is a relatively new term?
A. The more recent developments in recombinant DNA technologies and other biological
techniques have endowed scientists with the unprecedented power of expressing proteins in
large quantities, in a variety of conditions, and in manipulating their structures. While
scientists were usually accustomed to studying proteins one at a time, proteomics represents a
comprehensive approach to studying the total proteomes of different organisms. Thus,
proteomics is not just about identification of proteins in complex biological systems, a huge
task in itself, but also about their quantitative proportions, biological activities, localization in
the living cells and their small compartments, interactions of proteins with each other and
with other biomolecules. And ultimately, their functions. Because even the lower organisms
can feature many thousands of proteins, proteomics-related activities are likely to keep us
busy for two or three decades.
Q. You are a biochemist. What is the connection in proteomics between biology and
chemistry?
A. With the boundaries between traditional scientific disciplines blurring all the time, this is
somewhat difficult to answer. While proteomics has some clear connections to modern
biological research, its current research tools and methodologies are chemical, some might
even say physico-chemical. Prominent among them are large, sophisticated machines like
mass spectrometers, which in some parts of the world are still considered within the physics
domain. But things are continuously changing in this dynamic field. Genomic research, a
distinct area of modern biology, has significantly changed the way in which we view the task
of characterising the various proteomes.
Q. What has changed in the past five or so years?
A. The field of genomics, with its major emphasis on sequencing the basic building blocks of
the central molecule, DNA, already has yielded highly significant information on the
previously unknown secrets of living cells. The stories of newly discovered genotype-phenotype relationships and strategies for understanding genetic traits and genetic diseases
now regularly flood top scientific journals and popular literature alike. In parallel with
providing the blueprint of the human genome, the genomes of many bacterial species, yeast
and fruit flies for instance, have been sequenced, providing a valuable resource for modern
biomedical research. The mouse genome also has been completed recently. Likewise, in the
area of plant sciences, some important genomic advances have been reported. Yet only a part
of this genetic information is of direct use to proteomics.
Q. How is that?
A. While the entire protein sequences are encoded in the respective genomes, numerous
variations occur as well. Some proteins may remain intact as the direct products of genes, but
most will undergo structural alterations due to a variety of epigenetic factors. Thus, a given
proteome is not merely a protein complement dictated by a genome, but rather a reflection of
the dynamic situations in which different cells express themselves under various conditions
of their environment. Here lie some of the most challenging problems of contemporary
proteomics: to understand the dynamics and the altered structural and quantitative attributes of
proteins and their effects on metabolic pathways.
Q. What other areas are you working in?
A. Many proteins must become post-translationally modified to fulfill their biological roles.
To assess such modifications, in a qualitative and quantitative sense, is thus at the heart of
this field. Fortunately, we have been active for a number of years in studying one of the most
prevalent and perhaps the most important posttranslational modification, called
glycosylation, which is the attachment of complex carbohydrate molecules to selected sites
on a protein. Glycosylation is now being increasingly implicated in a number of disease-related conditions, including cancer, cardiovascular disease, neurological disorders and a
great number of other medically interesting conditions.
Also, in collaboration with members of the medical faculty in Indianapolis, we will
investigate the molecular attributes of human diseases. Studying the proteomics of highly
complex multicellular systems of mammals provides some unusually exciting opportunities
and challenges.
Proteomics is bound to become a more quantitative science in the future. Unlike with the
negatively charged nucleic acids that are relatively easy to handle in solutions and on
surfaces, many proteins suffer from surface interactions and consequently, poor recoveries.
We certainly wish to improve acquisition of the quantitative profiles of proteins for different
protein classes.
Q. What are some practical applications of proteomics research?
A. Due to the multilateral importance of proteins in living systems, the scope of biomedical
application is apparently wide. In fine-tuned functioning of the human body, various proteins
act as the catalysts of chemical reactions, cell growth mediators, molecular transporters and
receptors, immune agents against microorganisms and more. Consequently, various human
diseases manifest themselves in the altered concentrations or structures of proteins, so that
finding protein markers of a disease can, for example, result in devising better means of
diagnosis or follow-up therapy. Numerous proteins are now used therapeutically, so
there must be reliable ways of controlling manufacturing quality for such therapeutics, and
even of tracing their action in the human body. Pharmaceutical companies, thus, have
considerable interest in proteomics. Various activities in this area also provide considerable
stimulus to the instrument industry. Consequently, coming up with new ways and better
means to analyze complex protein mixtures is a high priority. These are just a few examples
of how proteomics can impact our future.
CS905 BIOINFORMATICS 4
Introduction to Bioinformatics, Biological Databanks, Sequence Analysis, Structure
Prediction, Protein Folding, Emerging Areas in Bioinformatics.
Krane, S.E. & Raymer, M.: Fundamental Concepts of Bioinformatics, Pearson, 2003
Attwood & Parry-Smith: Introduction to Bioinformatics, Pearson Education, 2003
Gibas & Jambeck: Developing Bioinformatics Computer Skills, O'Reilly, 2003