04 Computer Applications in Pharmacy Full Unit IV
Put simply, bioinformatics is the science of storing, retrieving and analysing large amounts of biological
information. It is a highly interdisciplinary field involving many different types of specialists, including
biologists, molecular life scientists, computer scientists and mathematicians.
The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper to describe "the study
of informatic processes in biotic systems" and it found early use when the first biological sequence
data began to be shared. Whilst the initial analysis methods are still fundamental to many large-
scale experiments in the molecular life sciences, nowadays bioinformatics is considered to be a
much broader discipline, encompassing modelling and image analysis in addition to the classical
methods used for comparison of linear sequences or three-dimensional structures.
A broad overview of the different types of data that fall within the scope of bioinformatics.
Traditionally, bioinformatics was used to describe the science of storing and analysing biomolecular
sequence data, but the term is now used much more broadly, encompassing computational
structural biology, chemical biology and systems biology (both data integration and the modelling of
systems).
The molecular life sciences have become increasingly data-driven and reliant on data sharing
through open-access databases. This is as true of the applied sciences as it is of fundamental research.
Furthermore, it is not necessary to be a bioinformatician to make use of bioinformatics databases,
methods and tools. However, as the generation of large data-sets becomes more and more central to
biomedical research, it’s becoming increasingly necessary for every molecular life scientist to
understand what can (and, importantly, what cannot) be achieved using bioinformatics, and to be able
to work with bioinformatics experts to design, analyse and interpret their experiments.
The role of public databases
There are a small number of bioinformatics centres of excellence worldwide that have taken on the
responsibility to collect, catalogue and provide open access to published biological data (Figure 3).
Among these centres are:
• The EMBL-European Bioinformatics Institute (EMBL-EBI)
• The US National Center for Biotechnology Information (NCBI)
• The National Institute of Genetics in Japan (NIG)
This work began in the early 1980s when DNA sequence data began to accumulate in the scientific
literature. The EMBL Data Library (now the European Nucleotide Archive) was developed to store
DNA sequences published in the scientific literature. The NCBI’s GenBank and NIG’s DDBJ
followed.
Figure 3. The role of bioinformatics centres of excellence in making biological data available for the research
community.
Goals of Bioinformatics
To study how normal cellular activities are altered in different disease states, the biological data must
be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics
has evolved such that the most pressing task now involves the analysis and interpretation of various
types of data. This includes nucleotide and amino acid sequences, protein domains, and protein
structures.[16] The actual process of analyzing and interpreting data is referred to as computational
biology. Important sub-disciplines within bioinformatics and computational biology include:
• Development and implementation of computer programs that enable efficient access to,
management and use of, various types of information
• Development of new algorithms (mathematical formulas) and statistical measures that assess
relationships among members of large data sets. For example, there are methods to locate a
gene within a sequence, to predict protein structure and/or function, and to cluster protein
sequences into families of related sequences.
The primary goal of bioinformatics is to increase the understanding of biological processes. What sets
it apart from other approaches, however, is its focus on developing and applying computationally
intensive techniques to achieve this goal. Examples include: pattern recognition, data
mining, machine learning algorithms, and visualization. Major research efforts in the field
include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein
structure alignment, protein structure prediction, prediction of gene expression and protein–protein
interactions, genome-wide association studies, the modeling of evolution and cell division/mitosis.
Bioinformatics now entails the creation and advancement of databases, algorithms, computational
and statistical techniques, and theory to solve formal and practical problems arising from the
management and analysis of biological data.
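Sequence alignment, listed above among the major research efforts, can be illustrated with a small dynamic-programming routine. The sketch below computes a Needleman-Wunsch global alignment score; the function name and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not values taken from these notes.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Needleman-Wunsch dynamic programming with a linear gap penalty.
    The scoring parameters are arbitrary illustrative defaults.
    """
    # prev[j] is the best score for aligning the prefix of a seen so far
    # (initially empty) with b[:j]; an empty prefix needs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Only the score is returned here; recovering the alignment itself would require keeping the full table and tracing back through it.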
*******************
Biological databases emerged in response to the huge volumes of data generated by low-cost DNA sequencing
technologies. One of the first databases to emerge was GenBank, a collection of publicly available
nucleotide sequences and their protein translations. It is maintained by the National Center for
Biotechnology Information (NCBI), part of the National Institutes of Health (NIH). GenBank paved the
way for the Human Genome Project (HGP), which allowed complete sequencing and reading of the human genetic
blueprint. The data stored in biological databases is organized for optimal analysis and consists of two
types: raw and curated (or annotated). Biological databases are complex, heterogeneous,
dynamic, and often inconsistent.
Earlier, databases and databanks were considered quite different. However, over time, database
became the preferred term. Data is submitted directly to biological databases for indexing,
organization, and data optimization. They help researchers find relevant biological data by making it
available in a format that is readable on a computer. All biological information is readily accessible
through data mining tools that save time and resources. Biological databases can be broadly classified
as sequence and structure databases. Structure databases are for protein structures, while sequence
databases are for nucleic acid and protein sequences.
Biological databases can be further classified as primary, secondary, and composite databases.
Primary databases contain information for sequence or structure only. Examples of primary biological
databases include:
• Swiss-Prot and PIR for protein sequences
• GenBank and DDBJ for nucleotide sequences
• Protein Data Bank (PDB) for protein structures
Secondary databases contain information derived from primary databases. Secondary databases store
information such as conserved sequences, active site residues, and signature sequences. Protein
Databank data is stored in secondary databases. Examples include:
The Future
In terms of research, bioinformatics tools should be streamlined for analyzing the growing amount of
data generated from genomics, metabolomics, proteomics, and metagenomics. Another future trend
will be the annotation of existing data and better integration of databases.
With a large number of biological databases available, the need for integration, advancements, and
improvements in bioinformatics is paramount. Bioinformatics will advance steadily once
problems of nomenclature and standardization are addressed. The growth of biological databases
will pave the way for further studies on proteins and nucleic acids, impacting therapeutics,
biomedicine, and related fields.
***************
Vaccines are the pharmaceutical products that offer the best cost‐benefit ratio in the prevention or
treatment of diseases. Because a vaccine is a pharmaceutical product, its development and
production are costly and take years to complete. Several approaches have been
applied to reduce the time and cost of vaccine development, mainly focusing on the selection of
appropriate antigens or antigenic structures, carriers, and adjuvants.
One of these approaches is the incorporation of bioinformatics methods and analyses into vaccine
development. This section provides an overview of the application of bioinformatics strategies in
vaccine design and development, including successful examples of vaccines for
which bioinformatics provided a decisive advantage during development. Reverse
vaccinology, immunoinformatics, and structural vaccinology are described in the context of
the design and development of specific vaccines against infectious diseases caused by bacteria,
viruses, and parasites.
These include some emerging or re‐emerging infectious diseases, as well as therapeutic vaccines to
fight cancer, allergies, and substance abuse, which have been facilitated and improved by using
bioinformatics tools or which are under development based on bioinformatics strategies.
The success of vaccination is reflected in its worldwide impact on human and veterinary
health and life expectancy. It has been asserted that vaccination, along with clean water, has had a
major effect on mortality reduction and population growth. In addition to the invaluable role of
traditional vaccines in preventing disease, society has seen remarkable scientific and
technological progress since the last century in the improvement of these vaccines and the generation
of new ones.
This has been made possible by the fusion of computational technologies with the application of
recombinant DNA technology, the fast growth of biological and genomic information in
databases, and the possibility of accelerated and massive sequencing of complete genomes. This
has helped expand the concept and application of vaccines beyond their traditional
immunoprophylactic function of preventing infectious diseases, to serving as therapeutic
products capable of modifying the course of a disease and even curing it.
At present, there are many alternative strategies to design and develop effective and safe new‐
generation vaccines, based on bioinformatics approaches through reverse vaccinology,
immunoinformatics, and structural vaccinology.
Reverse vaccinology
Reverse vaccinology is a methodology that uses bioinformatics tools to identify
structures from bacteria, viruses, parasites, cancer cells, or allergens that could induce an immune
response capable of protecting against a specific disease.
Immunoinformatics
The immune response can be classified as cellular or humoral, and the type of response a vaccine
must induce depends on the disease. If a vaccine that induces a cellular response is
needed, for example a tuberculosis vaccine or a parasite vaccine against leishmaniasis [23], the
software must search for antigens whose peptides can be presented by major histocompatibility complex
(MHC) molecules and recognized by T lymphocytes. Software for this purpose includes TEpredict,
CTLPred, nHLAPred, ProPred‐I, MAPPP, SVMHC, GPS‐MBA, PREDIVAC, NetMHC,
NetCTL, MHC2 Pred, IEDB, BIMAS, POPI, Epitopemap, iVAX, FRED2, Rankpep, PickPocket, KISS, and
MHC2MIL.
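The kind of search described above can be pictured as a sliding-window scan over a protein sequence. The sketch below is a deliberately simplified toy, not how any of the listed tools works: the anchor positions and preferred residues in `ANCHORS` are hypothetical values chosen for illustration, and real predictors use trained scoring models rather than a fixed residue filter.

```python
# Hypothetical anchor preferences for a 9-residue peptide (illustration only):
# 0-based position in the peptide -> residues assumed to favour binding.
ANCHORS = {1: set("LM"), 8: set("VLI")}


def candidate_peptides(protein, length=9, anchors=ANCHORS):
    """Return (start, peptide) for every window that satisfies all anchors."""
    hits = []
    # Slide a fixed-length window along the protein, one residue at a time.
    for start in range(len(protein) - length + 1):
        peptide = protein[start:start + length]
        if all(peptide[pos] in allowed for pos, allowed in anchors.items()):
            hits.append((start, peptide))
    return hits


# Made-up sequence with one window that satisfies both assumed anchors.
print(candidate_peptides("GGKLAAAAAAVGG"))
```

A filter like this only narrows the list of candidates; any peptide it returns would still need scoring by a validated predictor and experimental confirmation.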
Structural vaccinology
A brief timeline of the major events in the history and the origins of bioinformatics.
• 1953 - Watson & Crick proposed the double helix model for DNA based on X-ray data
obtained by Franklin & Wilkins.
• 1954 - Perutz’s group develop heavy atom methods to solve the phase problem in
protein crystallography.
• 1955 - The sequence of the first protein to be analysed, bovine insulin, is announced by
The Nucleic Acid Database (NDB) (http://ndbserver.rutgers.edu) is a web portal providing access
to information about 3D nucleic acid structures and their complexes.
DNA databases
Primary databases
The International Nucleotide Sequence Database (INSD) consists of the following databases:
• DNA Data Bank of Japan (National Institute of Genetics)
• EMBL (European Bioinformatics Institute)
• GenBank (National Center for Biotechnology Information)
DDBJ (Japan), GenBank (USA) and the European Nucleotide Archive (Europe) are repositories for
nucleotide sequence data from all organisms. All three accept nucleotide sequence submissions, and
then exchange new and updated data on a daily basis to achieve optimal synchronisation between
them. These three databases are primary databases, as they house original sequence data. They
collaborate with the Sequence Read Archive (SRA), which archives raw reads from high-throughput
sequencing instruments.
Secondary databases
• 23andMe's database
• HapMap
• OMIM (Online Mendelian Inheritance in Man): inherited diseases
• RefSeq
• 1000 Genomes Project: launched in January 2008. The genomes of more than a thousand
anonymous participants from a number of different ethnic groups were analyzed and made
publicly available.
• EggNOG Database: a hierarchical, functionally and phylogenetically annotated
orthology resource based on 5090 organisms and 2502 viruses. It provides multiple
sequence alignments and maximum-likelihood trees, as well as broad functional
annotation.
RNA databases
• miRBase: the microRNA database
• Rfam: a database of RNA families
Process
Genome annotation consists of three main steps:
1. identifying portions of the genome that do not code for proteins
2. identifying elements on the genome, a process called gene prediction
3. attaching biological information to these elements
Automatic annotation tools attempt to perform these steps via computer analysis, as opposed to
manual annotation (a.k.a. curation) which involves human expertise. Ideally, these approaches
coexist and complement each other in the same annotation pipeline.
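Step 2 above, gene prediction, can be illustrated in its simplest form by scanning for open reading frames (ORFs). The sketch below is a toy that makes several assumptions: it looks only at the forward strand, treats any ATG as a start, and ignores introns and statistical evidence. The function name and the `min_codons` parameter are hypothetical, and real gene finders are considerably more sophisticated.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start and end are 0-based, end-exclusive coordinates, so dna[start:end]
    is the ORF including its stop codon. min_codons counts the start and
    stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

In an annotation pipeline, candidate regions like these would then be passed to step 3, where biological information is attached to them.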
A simple method of gene annotation relies on homology-based search tools, like BLAST, to search
for homologous genes in specific databases; the resulting information is then used to annotate genes
and genomes. However, as information is added to the annotation platform, manual annotators
become capable of deconvoluting discrepancies between genes that are given the same annotation.
Some databases use genome context information, similarity scores, experimental data, and
integrations of other resources to provide genome annotations through their Subsystems approach.
Other databases (e.g. Ensembl) rely on curated data sources as well as a range of different software
tools in their automated genome annotation pipeline.
Molecular systematics is the product of the traditional fields of systematics and molecular genetics. It
uses DNA, RNA, or protein sequences to resolve questions in systematics, i.e. about the
correct scientific classification or taxonomy of organisms from the point of view of evolutionary biology.
Molecular systematics has been made possible by the availability of techniques for DNA
sequencing, which allow the determination of the exact sequence of nucleotides or bases in either
DNA or RNA. Sequencing the entire genome of an organism remains a far larger undertaking than
sequencing a single region. However, it is quite feasible to determine
the sequence of a defined area of a particular chromosome. Typical molecular systematic
analyses require the sequencing of around 1000 base pairs.
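Once such a region has been sequenced in several organisms and aligned, a basic quantity to compute is the distance between each pair of sequences. The sketch below assumes two pre-aligned sequences of equal length and uses the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. The function names are illustrative and the example sequences are made up.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq1.upper(), seq2.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("correction is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTACGTAA")  # made-up aligned fragments
print(p, jukes_cantor(p))
```

The corrected distance is slightly larger than the raw proportion because it allows for multiple substitutions at the same site. Distances of this kind are a common input to tree-building methods.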
********
Gene regulation is the complex orchestration of events by which a signal, potentially an extracellular
signal such as a hormone, eventually leads to an increase or decrease in the activity of one or
more proteins. Bioinformatics techniques have been applied to explore various steps in this process.
For example, gene expression can be regulated by nearby elements in the genome. Promoter analysis
involves the identification and study of sequence motifs in the DNA surrounding the coding region
of a gene. These motifs influence the extent to which that region is transcribed into
mRNA. Enhancer elements far away from the promoter can also regulate gene expression, through
three-dimensional looping interactions. These interactions can be determined by bioinformatic
analysis of chromosome conformation capture experiments.
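A first step in promoter analysis is simply locating occurrences of a known motif in the DNA upstream of a gene. The sketch below is a minimal illustration: the default motif `TATAAA` is used as an assumed example consensus, the function name is hypothetical, and real promoter analysis typically uses position weight matrices rather than exact string matches.

```python
import re


def find_motif(sequence, motif="TATAAA"):
    """Return 0-based start positions of every motif match.

    A zero-width lookahead is used so that overlapping matches are found.
    """
    pattern = re.compile("(?=" + re.escape(motif.upper()) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Made-up upstream sequence containing the motif twice.
print(find_motif("GGTATAAAGGCTATAAAT"))
```

Counting how often a motif occurs across the upstream regions of many genes is the basis for asking whether it is over-represented.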
Expression data can be used to infer gene regulation: one might compare microarray data from a wide
variety of states of an organism to form hypotheses about the genes involved in each state. In a single-
cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat
shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine
which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes
can be searched for over-represented regulatory elements. Examples of clustering algorithms applied
in gene clustering are k-means clustering, self-organizing maps (SOMs), hierarchical clustering,
and consensus clustering methods.
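As an illustration of the first algorithm named above, the sketch below implements a bare-bones k-means on expression profiles, where each profile is a list of expression values for one gene across conditions. It assumes a naive initialisation (the first k profiles become the starting centroids) and a fixed number of iterations. The gene names and values in the example are invented.

```python
def kmeans(profiles, k, iterations=20):
    """Cluster expression profiles; return one cluster label per profile."""
    # Naive, deterministic initialisation: the first k profiles.
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid
        # (squared Euclidean distance).
        for idx, profile in enumerate(profiles):
            dists = [sum((x - c) ** 2 for x, c in zip(profile, centroid))
                     for centroid in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for cluster in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == cluster]
            if members:
                centroids[cluster] = [sum(col) / len(members)
                                      for col in zip(*members)]
    return labels


# Hypothetical genes measured under three conditions.
genes = ["geneA", "geneB", "geneC", "geneD"]
profiles = [[1.0, 1.1, 0.9], [5.0, 5.2, 4.9], [1.2, 0.9, 1.0], [4.8, 5.1, 5.0]]
print(dict(zip(genes, kmeans(profiles, k=2))))
```

Genes that receive the same label have similar expression profiles and are candidates for co-regulation. In practice the result depends on the initialisation and on the choice of k, which is one reason consensus methods are used.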
********
PubMed is a free search engine accessing primarily the MEDLINE database of references and
abstracts on life sciences and biomedical topics. The United States National Library of
Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez
system of information retrieval.
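Entrez can also be queried programmatically. The sketch below builds a search URL for NCBI's E-utilities ESearch service, which is not described in these notes and is introduced here as an assumption about how a reader might automate a PubMed search. The helper name `pubmed_search_url` is hypothetical, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Base address of the ESearch endpoint of NCBI's E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, max_results=20):
    """Build an ESearch URL that asks PubMed for records matching term."""
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # the query, as typed into the search box
        "retmax": max_results,  # maximum number of record IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


print(pubmed_search_url("bioinformatics vaccine", max_results=5))
```

Opening the resulting URL returns a list of PubMed identifiers (PMIDs) for the matching records. Anyone scripting such requests should check NCBI's usage guidelines first.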
From 1971 to 1997, online access to the MEDLINE database was primarily through
institutional facilities, such as university libraries. PubMed, first released in January 1996, ushered
in the era of private, free, home- and office-based MEDLINE searching. The PubMed system was
offered free to the public starting in June 1997.
Content
In addition to MEDLINE, PubMed provides access to:
• older references from the print version of Index Medicus, back to 1951 and earlier
• references to some journals before they were indexed in Index Medicus and MEDLINE,
for instance Science, BMJ, and Annals of Surgery
• very recent entries to records for an article before it is indexed with Medical Subject
Headings (MeSH) and added to MEDLINE
• a collection of books available full-text and other subsets of NLM records
• PMC citations
• NCBI Bookshelf
Many PubMed records contain links to full text articles, some of which are freely available, often
in PubMed Central and local mirrors, such as UK PubMed Central.
Information about the journals indexed in MEDLINE, and available through PubMed, is found in
the NLM Catalog.