04 Computer Applications in Pharmacy Full Unit IV

UNIT – IV

Bioinformatics: Introduction, Objective of Bioinformatics, Bioinformatics Databases, Concept of Bioinformatics, Impact of Bioinformatics in Vaccine Discovery

An overview of bioinformatics and its applications

Put simply, bioinformatics is the science of storing, retrieving and analysing large amounts of biological
information. It is a highly interdisciplinary field involving many different types of specialists, including
biologists, molecular life scientists, computer scientists and mathematicians.

The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper to describe "the study
of informatic processes in biotic systems" and it found early use when the first biological sequence
data began to be shared. Whilst the initial analysis methods are still fundamental to many large-
scale experiments in the molecular life sciences, nowadays bioinformatics is considered to be a
much broader discipline, encompassing modelling and image analysis in addition to the classical
methods used for comparison of linear sequences or three-dimensional structures.

A broad overview of the different types of data that fall within the scope of bioinformatics.
Traditionally, bioinformatics was used to describe the science of storing and analysing biomolecular
sequence data, but the term is now used much more broadly, encompassing computational
structural biology, chemical biology and systems biology (both data integration and the modelling of
systems).

The molecular life sciences have become increasingly data driven and reliant on data sharing
through open-access databases. This is as true of the applied sciences as it is of fundamental research.
Furthermore, it is not necessary to be a bioinformatician to make use of bioinformatics databases,
methods and tools. However, as the generation of large data-sets becomes more and more central to
biomedical research, it’s becoming increasingly necessary for every molecular life scientist to
understand what can (and, importantly, what cannot) be achieved using bioinformatics, and to be able
to work with bioinformatics experts to design, analyse and interpret their experiments.
The role of public databases
There are a small number of bioinformatics centres of excellence worldwide that have taken on the
responsibility to collect, catalogue and provide open access to published biological data.
Among these centres are:
• The EMBL-European Bioinformatics Institute (EMBL-EBI)
• The US National Center for Biotechnology Information (NCBI)
• The National Institute of Genetics in Japan (NIG)
This work began in the early 1980s when DNA sequence data began to accumulate in the scientific
literature. The EMBL Data Library (now the European Nucleotide Archive) was developed to store
DNA sequences published in the scientific literature. The NCBI’s GenBank and NIG’s DDBJ
followed.

The role of bioinformatics centres of excellence in making biological data available for the research
community.

Goals of Bioinformatics

To study how normal cellular activities are altered in different disease states, the biological data must
be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics
has evolved such that the most pressing task now involves the analysis and interpretation of various
types of data. This includes nucleotide and amino acid sequences, protein domains, and protein
structures.[16] The actual process of analyzing and interpreting data is referred to as computational
biology. Important sub-disciplines within bioinformatics and computational biology include:
• Development and implementation of computer programs that enable efficient access to,
management and use of, various types of information
• Development of new algorithms (step-by-step computational procedures) and statistical measures that assess
relationships among members of large data sets. For example, there are methods to locate a
gene within a sequence, to predict protein structure and/or function, and to cluster protein
sequences into families of related sequences.

The primary goal of bioinformatics is to increase the understanding of biological processes. What sets
it apart from other approaches, however, is its focus on developing and applying computationally
intensive techniques to achieve this goal. Examples include: pattern recognition, data
mining, machine learning algorithms, and visualization. Major research efforts in the field
include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein
structure alignment, protein structure prediction, prediction of gene expression and protein–protein
interactions, genome-wide association studies, the modeling of evolution and cell division/mitosis.

Bioinformatics now entails the creation and advancement of databases, algorithms, computational
and statistical techniques, and theory to solve formal and practical problems arising from the
management and analysis of biological data.
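As an illustration of sequence alignment in practice, the short Python sketch below performs a global pairwise alignment of two invented DNA fragments using Biopython's PairwiseAligner; it assumes Biopython is installed, and the scoring parameters are arbitrary example values.

```python
# A minimal sketch of global pairwise sequence alignment using Biopython.
# Assumes Biopython is installed (pip install biopython); the sequences are
# invented examples, not real gene data.
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"          # Needleman-Wunsch-style global alignment
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

seq_a = "GATTACAGATTACA"
seq_b = "GATTTACAGATCA"

alignments = aligner.align(seq_a, seq_b)
best = alignments[0]             # highest-scoring alignment
print("Score:", best.score)
print(best)                      # prints the aligned sequences with gaps
```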

*******************

Biological databases and their uses

Biological databases emerged in response to the enormous volumes of data generated by DNA sequencing
technologies. One of the first databases to emerge was GenBank, a collection of all publicly available
nucleotide sequences and their protein translations. It is maintained by the National Institutes of Health (NIH)
and the National Center for Biotechnology Information (NCBI). GenBank paved the way for the
Human Genome Project (HGP), which allowed the complete sequencing and reading of the human genetic
blueprint. The data stored in biological databases is organized for optimal analysis and consists of two
types: raw and curated (or annotated). Biological databases are complex, heterogeneous,
dynamic, and at times inconsistent.

Why are these Important?

Earlier, databases and databanks were considered quite different. Over time, however, "database"
became the preferred term. Data is submitted directly to biological databases for indexing,
organization, and data optimization. They help researchers find relevant biological data by making it
available in a format that is readable on a computer. All biological information is readily accessible
through data mining tools that save time and resources. Biological databases can be broadly classified
as sequence and structure databases. Structure databases are for protein structures, while sequence
databases are for nucleic acid and protein sequences.
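As an illustration of how such databases can be queried programmatically, the sketch below retrieves a single GenBank nucleotide record through the NCBI Entrez utilities using Biopython; it assumes Biopython and internet access, and the accession number is simply a public example record.

```python
# A minimal sketch of retrieving a GenBank nucleotide record through NCBI Entrez
# using Biopython. Assumes Biopython is installed and internet access is
# available; the accession number is just an arbitrary public example.
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"   # NCBI asks for a contact address

# Fetch the record in GenBank format and parse it into a SeqRecord object
handle = Entrez.efetch(db="nucleotide", id="NM_000518", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id, record.description)
print("Sequence length:", len(record.seq))
```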

Kinds of Biological Databases

Biological databases can be further classified as primary, secondary, and composite databases.
Primary databases contain information for sequence or structure only. Examples of primary biological
databases include:
• Swiss-Prot and PIR for protein sequences
• GenBank and DDBJ for genome sequences
• Protein Databank for protein structures
Secondary databases contain information derived from primary databases, such as conserved
sequences, active-site residues, and signature sequences. Structural classifications derived from
Protein Data Bank entries, for example, are stored in secondary databases. Examples include:

• SCOP at Cambridge University
• CATH at University College London
• PROSITE of the Swiss Institute of Bioinformatics
• eMOTIF at Stanford

Composite databases combine data from a variety of primary databases, which eliminates the need
to search each one separately. Each composite database has its own search algorithms and data
structures. The NCBI hosts several such databases, where links to Online Mendelian Inheritance
in Man (OMIM) are found.

The Future

Because of high-performance computational platforms, these databases have become important in
providing the infrastructure needed for biological research, from data preparation to data extraction.
The simulation of biological systems also requires computational platforms, which further
underscores the need for biological databases. The future of biological databases looks bright as
research becomes increasingly digital and data driven.

In terms of research, bioinformatics tools should be streamlined for analyzing the growing amount of
data generated from genomics, metabolomics, proteomics, and metagenomics. Another future trend
will be the annotation of existing data and better integration of databases.

With a large number of biological databases available, the need for integration, advancements, and
improvements in bioinformatics is paramount. Bioinformatics will advance steadily as problems of
nomenclature and standardization are addressed. The growth of biological databases will pave the
way for further studies on proteins and nucleic acids, impacting therapeutics, biomedicine, and
related fields.

***************

The role of bioinformatics in drug and vaccine development.

Vaccines are the pharmaceutical products that offer the best cost-benefit ratio in the prevention or
treatment of diseases. Like any pharmaceutical product, however, vaccines are costly to develop and
produce, and the process takes years to accomplish. Several approaches have been applied to reduce
the time and cost of vaccine development, mainly focusing on the selection of appropriate antigens
or antigenic structures, carriers, and adjuvants.

One of these approaches is the incorporation of bioinformatics methods and analyses into vaccine
development. This section provides an overview of the application of bioinformatics strategies in
vaccine design and development, giving some successful examples of vaccines whose development
has been given a cutting edge by bioinformatics. Reverse vaccinology, immunoinformatics, and
structural vaccinology are described and addressed in the design and development of specific
vaccines against infectious diseases caused by bacteria, viruses, and parasites.

These include some emerging or re‐emerging infectious diseases, as well as therapeutic vaccines to
fight cancer, allergies, and substance abuse, which have been facilitated and improved by using
bioinformatics tools or which are under development based on bioinformatics strategies.
The success of vaccination is reflected in its worldwide impact on improving human and veterinary
health and life expectancy. It has been asserted that vaccination, together with clean water, has had
the greatest effect on mortality reduction and population growth. In addition to the invaluable role of
traditional vaccines in preventing disease, society has seen remarkable scientific and technological
progress over the past century in the improvement of these vaccines and the generation of new ones.

This has been made possible by the fusion of computational technologies with the application of
recombinant DNA technology, the rapid growth of biological and genomic information in
databases, and the possibility of accelerated and massive sequencing of complete genomes. This
has helped expand the concept and application of vaccines beyond their traditional
immunoprophylactic function of preventing infectious diseases, so that they can also serve as
therapeutic products capable of modifying the course of a disease and even curing it.

At present, there are many alternative strategies to design and develop effective and safe new‐
generation vaccines, based on bioinformatics approaches through reverse vaccinology,
immunoinformatics, and structural vaccinology.

Reverse vaccinology
Reverse vaccinology is a methodology that uses bioinformatics tools for the identification of
structures from bacteria, viruses, parasites, cancer cells, or allergens that could induce an immune
response capable of protecting against a specific disease.

Immunoinformatics

The immune response can be classified as cellular or humoral and, depending on the disease, the
appropriate type of response must be induced. If a vaccine that induces a cellular response is
needed, for example a vaccine against tuberculosis or against the parasite causing leishmaniasis [23],
the software must search for antigens that can be presented by major histocompatibility complex
(MHC) molecules and recognized by T lymphocytes. Software for this purpose includes TEpredict,
CTLPred, nHLAPred, ProPred-I, MAPPP, SVMHC, GPS-MBA, PREDIVAC, NetMHC, NetCTL,
MHC2Pred, IEDB, BIMAS, POPI, Epitopemap, iVAX, FRED2, Rankpep, PickPocket, KISS, and
MHC2MIL.

Structural vaccinology

Structural vaccinology focuses on the conformational features of macromolecules, mainly proteins,
that make them good candidate antigens. This approach to vaccine design has been used mainly to
select or design peptide-based vaccines or cross-reactive antigens with the capability of generating
immunity against different antigenically divergent pathogens.
********

A brief timeline of the major events in the history and the origins of bioinformatics.

A Chronological History of Bioinformatics

• 1953 - Watson & Crick proposed the double-helix model for DNA, based on X-ray data
obtained by Franklin & Wilkins.
• 1954 - Perutz's group develops heavy-atom methods to solve the phase problem in
protein crystallography.
• 1955 - The sequence of the first protein to be analysed, bovine insulin, is announced by
Frederick Sanger.
• 1969 - The ARPANET is created by linking computers at Stanford and UCLA.
• 1970 - The details of the Needleman-Wunsch algorithm for sequence comparison are
published.
• 1972 - The first recombinant DNA molecule is created by Paul Berg and his group.
• 1973 - The Brookhaven Protein Data Bank is announced (Acta Cryst. B, 1973, 29:1764).
Robert Metcalfe receives his Ph.D. from Harvard University; his thesis describes
Ethernet.
• 1974 - Vint Cerf and Robert Kahn develop the concept of connecting networks of
computers into an "internet" and develop the Transmission Control Protocol (TCP).
• 1975 - Microsoft Corporation is founded by Bill Gates and Paul Allen.
Two-dimensional electrophoresis, in which separation of proteins on an SDS
polyacrylamide gel is combined with separation according to isoelectric point, is
announced by P. H. O'Farrell.
• 1988 - The National Center for Biotechnology Information (NCBI) is established at the
National Library of Medicine.
The Human Genome Initiative is started (Commission on Life Sciences, National
Research Council, Mapping and Sequencing the Human Genome, National Academy
Press: Washington, D.C., 1988).
The FASTA algorithm for sequence comparison is published by Pearson and Lipman.
An Internet computer virus written by a student infects 6,000 military computers in the US.
• 1989 - The Genetics Computer Group (GCG) becomes a private company.
Oxford Molecular Group, Ltd. (OMG) is founded in the UK by Anthony Marchington, David Ricketts,
James Hiddleston, Anthony Rees, and W. Graham Richards. Primary products: Anaconda,
Asp, Cameleon and others (molecular modelling, drug design, protein design).
• 1990 - The BLAST program (Altschul et al.) is implemented.
Molecular Applications Group is founded in California by Michael Levitt and Chris Lee. Their
primary products, Look and SegMod, are used for molecular modelling and protein design.
InforMax is founded in Bethesda, MD. The company's products address sequence analysis,
database and data management, searching, publication graphics, clone construction, mapping
and primer design.
• 1991 - The research institute in Geneva (CERN) announces the creation of the protocols
which make up the World Wide Web.
The creation and use of expressed sequence tags (ESTs) is described.
Incyte Pharmaceuticals, a genomics company headquartered in Palo Alto, California, is
formed.
Myriad Genetics, Inc. is founded in Utah. The company's goal is to lead in the discovery of
major common human disease genes and their related pathways. With its academic
collaborators, the company went on to discover and sequence the BRCA1 and BRCA2
breast cancer genes.
********

Nucleic acid and protein databases with an example.

The Nucleic Acid Database (NDB) (http://ndbserver.rutgers.edu) is a web portal providing access
to information about 3D nucleic acid structures and their complexes.

Protein sequence databases

The NCBI Protein database is a collection of sequences from several sources, including translations
from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt,
PIR, PRF, and PDB.

DNA databases
Primary databases
International Nucleotide Sequence Database (INSD) consists of the following databases.
• DNA Data Bank of Japan (National Institute of Genetics)
• EMBL (European Bioinformatics Institute)
• GenBank (National Center for Biotechnology Information)
DDBJ (Japan), GenBank (USA) and European Nucleotide Archive (Europe) are repositories for
nucleotide sequence data from all organisms. All three accept nucleotide sequence submissions, and
then exchange new and updated data on a daily basis to achieve optimal synchronisation between
them. These three databases are primary databases, as they house original sequence data. They
collaborate with Sequence Read Archive (SRA), which archives raw reads from high-
throughput sequencing instruments.
Secondary databases
• 23andMe's database
• HapMap
• OMIM (Online Mendelian Inheritance in Man): inherited diseases
• RefSeq
• 1000 Genomes Project: launched in January 2008. The genomes of more than a thousand
anonymous participants from a number of different ethnic groups were analyzed and made
publicly available.
• EggNOG Database: a hierarchical, functionally and phylogenetically annotated
orthology resource based on 5090 organisms and 2502 viruses. It provides multiple
sequence alignments and maximum-likelihood trees, as well as broad functional
annotation.

RNA databases
• miRBase: the microRNA database
• Rfam: a database of RNA families

Amino acid / protein databases

Protein sequence databases


• Database of Interacting Proteins (Univ. of California)
• DisProt: database of experimental evidence of disorder in proteins (Indiana University
School of Medicine, Temple University, University of Padua)
• InterPro: classifies proteins into families and predicts the presence of domains and sites
• MobiDB: database of intrinsic protein disorder annotation (University of Padua)
• neXtProt: a human protein-centric knowledge resource
• Pfam: protein families database of alignments and HMMs (Sanger Institute)
• PRINTS: a compendium of protein fingerprints (Manchester University)
• PROSITE: database of protein families and domains
• Protein Information Resource (Georgetown University Medical Center [GUMC])
• SUPERFAMILY: library of HMMs representing superfamilies and database of
(superfamily and family) annotations for all completely sequenced organisms
• Swiss-Prot: protein knowledgebase (Swiss Institute of Bioinformatics)
• NCBI: protein sequence and knowledgebase (National Center for Biotechnology
Information)

Protein structure databases


• Protein Data Bank (PDB), comprising:
o Protein Data Bank in Europe (PDBe)
o Protein Data Bank Japan (PDBj)
o Research Collaboratory for Structural Bioinformatics (RCSB)
• Structural Classification of Proteins (SCOP)
********

Genome annotation and its importance


DNA annotation or genome annotation is the process of identifying the locations of genes and all of
the coding regions in a genome and determining what those genes do. An annotation (irrespective
of the context) is a note added by way of explanation or commentary. Once a genome is sequenced,
it needs to be annotated to make sense of it.

Process
Genome annotation consists of three main steps:
1. identifying portions of the genome that do not code for proteins
2. identifying elements on the genome, a process called gene prediction
3. attaching biological information to these elements
Automatic annotation tools attempt to perform these steps via computer analysis, as opposed to
manual annotation (a.k.a. curation) which involves human expertise. Ideally, these approaches co-
exist and complement each other in the same annotation pipeline.

A simple method of gene annotation relies on homology-based search tools, such as BLAST, to search
for homologous genes in specific databases; the resulting information is then used to annotate genes
and genomes. However, as information is added to the annotation platform, manual annotators
become capable of deconvoluting discrepancies between genes that are given the same annotation.
Some databases use genome-context information, similarity scores, experimental data, and
integration of other resources to provide genome annotations through their Subsystems approach.
Other databases (e.g. Ensembl) rely on curated data sources as well as a range of different software
tools in their automated genome annotation pipelines.
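As a rough sketch of this homology-based approach, the Python example below submits a short, invented query sequence to NCBI BLAST through Biopython's web interface and prints the top hits; it assumes Biopython is installed, internet access is available, and the public BLAST server's slow response time is acceptable.

```python
# A minimal sketch of a homology search against the NCBI "nt" database using
# Biopython's web BLAST interface. Assumes Biopython and internet access; the
# query sequence is an invented fragment used purely for illustration.
from Bio.Blast import NCBIWWW, NCBIXML

query = "ATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAG"

# Submit the query to NCBI (this can take a while) and parse the XML result
result_handle = NCBIWWW.qblast("blastn", "nt", query)
blast_record = NCBIXML.read(result_handle)

# Report the best-scoring alignments, which could seed a draft annotation
for alignment in blast_record.alignments[:5]:
    hsp = alignment.hsps[0]
    print(f"{alignment.title[:60]}  E-value: {hsp.expect:.2e}")
```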

Bioinformatics in understanding molecular evolution


Molecular evolution is the process of change in the sequence composition of cellular molecules such
as DNA, RNA, and proteins across generations. The field of molecular evolution uses
principles of evolutionary biology and population genetics to explain patterns in these changes.

Molecular systematics is the product of the traditional fields of systematics and molecular genetics. It
uses DNA, RNA, or protein sequences to resolve questions in systematics, i.e. about their
correct scientific classification or taxonomy from the point of view of evolutionary biology.

Molecular systematics has been made possible by the availability of techniques for DNA
sequencing, which allow the determination of the exact sequence of nucleotides or bases in either
DNA or RNA. At present it is still a long and expensive process to sequence the entire genome of an
organism, and this has been done for only a few species. However, it is quite feasible to determine
the sequence of a defined area of a particular chromosome. Typical molecular systematic
analyses require the sequencing of around 1000 base pairs.
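As a toy illustration of molecular systematics, the sketch below builds a simple neighbour-joining tree from a small, invented alignment using Biopython; a real analysis would start from curated sequences of roughly 1000 aligned bases, as noted above.

```python
# A minimal sketch of distance-based tree building with Biopython, assuming a
# small pre-aligned set of sequences. The sequences and taxon names are invented.
from Bio import Phylo
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACTGCTAGCTAG"), id="taxon_A"),
    SeqRecord(Seq("ACTGCTAGCTAG"), id="taxon_B"),
    SeqRecord(Seq("ACTACTAGCTTG"), id="taxon_C"),
    SeqRecord(Seq("ACTACTTGCTTG"), id="taxon_D"),
])

# Pairwise distances from simple sequence identity, then a neighbour-joining tree
dm = DistanceCalculator("identity").get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)
Phylo.draw_ascii(tree)
```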
********

Bioinformatics help in understanding gene regulation

Gene regulation is the complex orchestration of events by which a signal, potentially an extracellular
signal such as a hormone, eventually leads to an increase or decrease in the activity of one or
more proteins. Bioinformatics techniques have been applied to explore various steps in this process.

For example, gene expression can be regulated by nearby elements in the genome. Promoter analysis
involves the identification and study of sequence motifs in the DNA surrounding the coding region
of a gene. These motifs influence the extent to which that region is transcribed into
mRNA. Enhancer elements far away from the promoter can also regulate gene expression, through
three-dimensional looping interactions. These interactions can be determined by bioinformatic
analysis of chromosome conformation capture experiments.
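As a minimal illustration of promoter motif searching, the sketch below scans an invented upstream sequence for a crude TATA-box-like pattern using a regular expression; real promoter analysis relies on position weight matrices and curated motif databases rather than a single consensus string.

```python
# A toy illustration of promoter analysis: scanning an upstream region for a
# simplified TATA-box-like motif with a regular expression. The sequence is
# invented and the pattern is a simplification used only for demonstration.
import re

upstream_region = "GCGCGTATAAAAGGCGCTATAAATCCGGATATAAGGC"
tata_like = re.compile(r"TATA[AT]A[AT]")   # crude consensus pattern

for match in tata_like.finditer(upstream_region):
    print(f"Motif {match.group()} found at position {match.start()}")
```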

Expression data can be used to infer gene regulation: one might compare microarray data from a wide
variety of states of an organism to form hypotheses about the genes involved in each state. In a single-
cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat
shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine
which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes
can be searched for over-represented regulatory elements. Examples of clustering algorithms applied
in gene clustering are k-means clustering, self-organizing maps (SOMs), hierarchical clustering,
and consensus clustering methods.
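As a sketch of this clustering step, the example below groups a synthetic expression matrix with k-means using scikit-learn and NumPy; the data are random numbers standing in for microarray measurements.

```python
# A minimal sketch of clustering genes by expression profile with k-means,
# using scikit-learn and NumPy. The expression matrix is random synthetic data
# standing in for microarray measurements (rows = genes, columns = conditions).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_genes, n_conditions = 100, 6
expression = rng.normal(size=(n_genes, n_conditions))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(expression)

# Genes sharing a label are co-expressed under this model; their promoters
# could then be scanned for over-represented regulatory motifs.
for cluster_id in range(4):
    print(f"Cluster {cluster_id}: {np.sum(labels == cluster_id)} genes")
```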

********

OMIM (Online Mendelian Inheritance in Man)


Online Mendelian Inheritance in Man (OMIM) is a continuously updated catalog of human
genes and genetic disorders and traits, with a particular focus on the gene-phenotype relationship. As
of 28 June 2019, approximately 9,000 of the over 25,000 entries in OMIM represented
phenotypes; the rest represented genes, many of which were related to known phenotypes.

OMIM is the online continuation of Dr. Victor A. McKusick's Mendelian Inheritance in Man (MIM),
which was published in 12 editions between 1966 and 1998. Nearly all of the 1,486 entries in the
first edition of MIM discussed phenotypes.
MIM/OMIM is produced and curated at the Johns Hopkins School of Medicine (JHUSOM).
OMIM became available on the internet in 1987 under the direction of the Welch Medical Library at
JHUSOM with financial support from the Howard Hughes Medical Institute. From 1995 to
2010, OMIM was available on the World Wide Web with informatics and financial support from the
National Center for Biotechnology Information. The current OMIM website (OMIM.org),
which was developed with funding from JHUSOM, is maintained by Johns Hopkins
University with financial support from the National Human Genome Research Institute.
********
The importance of PUBMED

PubMed is a free search engine accessing primarily the MEDLINE database of references and
abstracts on life sciences and biomedical topics. The United States National Library of
Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez
system of information retrieval.
From 1971 to 1997, online access to the MEDLINE database had been primarily through
institutional facilities, such as university libraries. PubMed, first released in January 1996, ushered
in the era of private, free, home- and office-based MEDLINE searching. The PubMed system was
offered free to the public starting in June 1997.

Content

In addition to MEDLINE, PubMed provides access to:

• older references from the print version of Index Medicus, back to 1951 and earlier
• references to some journals before they were indexed in Index Medicus and MEDLINE,
for instance Science, BMJ, and Annals of Surgery
• very recent entries to records for an article before it is indexed with Medical Subject
Headings (MeSH) and added to MEDLINE
• a collection of books available full-text and other subsets of NLM records
• PMC citations
• NCBI Bookshelf

Many PubMed records contain links to full text articles, some of which are freely available, often
in PubMed Central and local mirrors, such as UK PubMed Central.

Information about the journals indexed in MEDLINE, and available through PubMed, is found in
the NLM Catalog.
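As an illustration of programmatic access, the sketch below queries PubMed through the Entrez utilities in Biopython; it assumes Biopython and internet access, and the e-mail address and search term are placeholders.

```python
# A minimal sketch of querying PubMed programmatically through the Entrez
# utilities in Biopython. Assumes Biopython and internet access; the search
# term and email address are placeholders.
from Bio import Entrez

Entrez.email = "your.name@example.org"   # NCBI asks for a contact address

# Search PubMed for recent articles on reverse vaccinology
handle = Entrez.esearch(db="pubmed", term="reverse vaccinology", retmax=5)
result = Entrez.read(handle)
handle.close()

print("Total matching records:", result["Count"])
print("First few PubMed IDs:", result["IdList"])
```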
