03 Databases

Uploaded by

mrguochengzong

Sequence databases

BINF2010

Sara Ballouz, PhD
CSE, UNSW
K17 401A
Slides adapted from Marc Wilkins (BABS) and Bruno Gaeta (CSE)
Learning outcomes
• Discuss the limitations of current sequence databases
• Describe the general structure of a sequence database entry
• Discuss the types of data available in GenBank and UniProt
• Use GenBank and UniProt to retrieve sequence information
• Be aware of the UniProt API for retrieving sequence information programmatically
Sequence data

Life is about information

• DNA and proteins are information-carrying molecules
Sequence data

Reductionist and synthetic approaches in biology
[Figure: Biological System (Organism) at the top and Building Blocks (Genes/Molecules) at the bottom, linked by the Reductionist Approach (Experiments) and the Synthetic Approach (Bioinformatics).]
Kanehisa (2000) Post-genome Informatics
Sequence data

Biology… not so simple!

From Attwood and Parry-Smith, 1999

Databases

Databases: terminology
• Database: collection of information related to a specific subject (e.g., a phone book)
• Record: an entry in a database (e.g., your entry in the phone book)
• Field: a component of a record (e.g., your address & number)
Databases

Databases: types

• Flat-file: store data as text files
• Relational: interconnected tables, use a database management system
Databases

Flat-file databases: example

Databases

Flat-file databases
Pros
• Easy to put together and distribute
• No need for expensive or complicated database management software
Cons
• Detailed targeted searching is difficult
• Searching is not efficient
Databases

Relational database: example tables

Databases

Relational databases
• Require a Relational Database Management System (RDBMS)
• Queried using SQL (or more commonly, a GUI front-end)

SELECT protab1.protein_name, protab2.protein_sequence
FROM protab1, protab2
WHERE protab1.protein_code = protab2.protein_code
AND protab1.protein_code = 'P1002';
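To make the query above concrete, here is a minimal sketch that runs it with Python's built-in sqlite3 module standing in for the RDBMS. The table names protab1 and protab2 come from the slide. The column names are written with underscores, since hyphens are not valid in unquoted SQL identifiers, and the rows of data are made up purely for illustration.

```python
import sqlite3

# In-memory database; table names follow the slide, the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE protab1 (protein_code TEXT PRIMARY KEY, protein_name TEXT);
    CREATE TABLE protab2 (protein_code TEXT PRIMARY KEY, protein_sequence TEXT);
    INSERT INTO protab1 VALUES ('P1001', 'hypothetical protein A');
    INSERT INTO protab1 VALUES ('P1002', 'hypothetical protein B');
    INSERT INTO protab2 VALUES ('P1001', 'MKTAYIAK');
    INSERT INTO protab2 VALUES ('P1002', 'MLSCRLQC');
    """
)

# Join the two tables on the shared key and select a single protein.
# The "?" placeholder passes the value separately from the SQL text.
rows = conn.execute(
    """
    SELECT protab1.protein_name, protab2.protein_sequence
    FROM protab1, protab2
    WHERE protab1.protein_code = protab2.protein_code
      AND protab1.protein_code = ?
    """,
    ("P1002",),
).fetchall()

print(rows)
conn.close()
```

The name and the sequence live in different tables and are only brought together by the shared protein code, which is the point of the relational layout.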
Databases

Sequence data in a database

• Primary data
• e.g., DNA sequence, protein sequence, protein 3D structure coordinates
• Annotations (metadata)
• e.g., authors, literature references, protein function, organism of origin, location of coding regions in DNA sequence, etc.
Sequence data

Sequence database record structure

• A sequence database record contains both sequence and annotations
• Record divided broadly into 3 sections:
• Header
• Feature table
• Sequence
Sequence data

Sequence databases
• Nucleic acids: DNA/RNA: GenBank
• Proteins: UniProt
• Specialized/other:
• Non-coding RNA databases: RNAcentral
• Variation databases: gnomAD
• Cancer genomes: TCGA
Nucleic acid data

Nucleotide sequence database: GenBank

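• Part of an international consortium that manages sequence data, the International Nucleotide Sequence Database Collaboration (INSDC)
• GenBank – NCBI, NIH, USA
• DDBJ – DNA Data Bank of Japan
• EMBL – European Molecular Biology Laboratory (its nucleotide database is now the European Nucleotide Archive, ENA, at EMBL-EBI)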
• A collection of nucleic acid sequence data
• DNA, RNA (mRNA, rRNA, tRNA, microRNA, ncRNA…)
• Some sequences are translated to protein sequence
• Is not actively curated
• Is available to the public at no cost

http://www.ncbi.nlm.nih.gov/Genbank/index.html
Nucleic acid data

GenBank: statistics
Release #261 (June 2024)
• GenBank: 3,387,240,663,231 bases, from 251,094,334 reported sequences
• WGS: 27,900,199,328,333 bases, from 3,380,877,515 reported sequences

https://www.ncbi.nlm.nih.gov/genbank/statistics/
Nucleic acid data

GenBank anatomy: header

LOCUS       HUMSOMI      2667 bp    DNA     linear   PRI 13-JAN-1995
DEFINITION  Human somatostatin I gene and flanks.
ACCESSION   J00306
VERSION     J00306.1  GI:338287
KEYWORDS    neuropeptide Y; somatostatin; somatostatin I; somatostatin-14;
            somatostatin-28.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1126 to 1368; 2246 to 2605)
  AUTHORS   Shen,L.P., Pictet,R.L. and Rutter,W.J.
  TITLE     Human somatostatin I: sequence of the cDNA
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 79 (15), 4575-4579 (1982)
   PUBMED   6126875
REFERENCE   2  (bases 1 to 2667)
  AUTHORS   Shen,L.P. and Rutter,W.J.
  TITLE     Sequence of the human somatostatin I gene
  JOURNAL   Science 224 (4645), 168-171 (1984)
   PUBMED   6142531
COMMENT     Original source text: Human fetal liver DNA, Charon 4A library,
            clone pHSI-1-2.7 [2], and pancreatic somatostatinoma tissue,
Nucleic acid data

GenBank anatomy: features

FEATURES             Location/Qualifiers
     source          1..2667
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /map="3q28"
     prim_transcript 1126..2605
                     /note="som I mRNA"
     CDS             join(1231..1368,2246..2458)
                     /note="preprosomatostatin I"
                     /codon_start=1
                     /protein_id="AAA60566.1"
                     /db_xref="GI:338288"
                     /translation="MLSCRLQCALAALSIVLALGCVTGAPSDPRLRQFLQKSLAAAAG
                     KQELAKYFLAELLSEPNQTENDALEPEDLSQAAEQDEMRLELQRSANSNPAMAPRERK
                     AGCKNFFWKTFTSC"
     sig_peptide     1231..1302
                     /note="prosomatostatin I signal peptide"
     mat_peptide     2372..2455
                     /product="somatostatin-28 peptide"
     mat_peptide     2414..2455
                     /product="somatostatin-14 peptide"
     gene            1231..1368
                     /gene="SST"
     exon            <1231..1368
                     /gene="SST"
                     /note="preprosomatostatin I; G00-119-604"
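The locations in the feature table can be checked against the annotations with a few lines of code. The sketch below is only an illustration, not a full GenBank parser: the helper names parse_location and location_length are made up here, and only plain ranges and join(...) are handled (complement() and other operators are ignored).

```python
import re

def parse_location(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 1-based and inclusive, from a simple
    GenBank location such as '1231..1302' or 'join(1231..1368,2246..2458)'.
    Partial-end markers ('<' and '>') are skipped."""
    pairs = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]

def location_length(location: str) -> int:
    """Total number of bases covered by all segments of a location."""
    return sum(end - start + 1 for start, end in parse_location(location))

# Locations copied from the feature table of J00306.
cds_bp = location_length("join(1231..1368,2246..2458)")
sig_peptide_aa = location_length("1231..1302") // 3
som28_aa = location_length("2372..2455") // 3
som14_aa = location_length("2414..2455") // 3

print(cds_bp, cds_bp // 3, sig_peptide_aa, som28_aa, som14_aa)
```

The CDS spans 351 bases, i.e. 117 codons, which matches the 116-residue /translation plus a stop codon. The two mat_peptide features come out at 28 and 14 residues, in line with their names somatostatin-28 and somatostatin-14.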
Nucleic acid data

GenBank anatomy: sequence

ORIGIN Chromosome 3q28; 1 bp upstream of EcoRI site.
1 gaattcaagg acaggttttc ttaaactttc tttgtttcta ggagatcagg cagagctgaa
61 tttaaccaag aatcttttga tcctttccac atatagatat acaatagtgg tcacatatgt
121 tctgggagtt cctagacctt atatgtctaa actggggctt cctgacataa aactatgctt
181 accggcagga atctgttaga aaactcagag ctcagtagaa ggaacactgg ctttggaatg
241 tggaggtctg gttttgctca aagtgtgcag tatgtgaagg agaacaattt actgaccatt
301 actctgcctt actgattcaa attctgaggt ttattgaata atttcttaga ttgccttcca
361 gctctaaatt tctcagcacc aaaatgaagt ccatttcaat ctctctctct ctctttccct
421 cccgtacata tacacacact catacatata tatggtcaca atagaaaggc aggtagatca
481 gaagtctcag ttgctgagaa agagggaggg agggtgagcc agagtacttc tcccccattg
541 tagagaaaag tgaagttctt ttagagcccc gttacatctt caaggccttt tatgagataa
601 tggaggaaat aaagagggct cagtccttct accgtccata tttcattctc aaatctgtta
661 ttagaggaat gattctgatc tccacctacc atacacatgc cctgttgctt gttgggcctt
721 acactaaaat gttagagtat gatgacagat ggagttgtct gggtacattt gtgtgcattt
781 aagggtgata gtgtatttgc tctttaagag ctgagtgttt gagcctctgt ttgtgtgtaa
841 ttgagtgtgc atgtgtggga gtgaaattgt ggaatgtgta tgctcatagc actgagtgaa
901 aataaaagat tgtataaatc gtggggcatg tggaattgtg tgtgcctgtg cgtgtgcagt
961 attttttttt ttttaagtaa gccactttag atcttgtcac ctcccctgtc ttctgtgatt
1021 gattttgcga ggctaatggt gcgtaaaagg gctggtgaga tctgggggcg cctcctagcc
1081 tgacgtcaga gagagagttt aaaacagagg gagacggttg agagcacaca agccgcttta
1141 ggagcgaggt tcggagccat cgctgctgcc tgctgatccg cgcctagagt ttgaccagcc
1201 actctccagc tcggctttcg cggcgccgag atgctgtcct gccgcctcca gtgcgcgctg
1261 gctgcgctgt ccatcgtcct ggccctgggc tgtgtcaccg gcgctccctc ggaccccaga
1321 ctccgtcagt ttctgcagaa gtccctggct gctgccgcgg ggaagcaggt aaggagactc
1381 cctcgacgtc tcccggattc tccagccctc cctaagcctt gctcctgccc cattggtttg
1441 gacgtaaggg atgctcagtc cttctaaaga gttttggtgc ttttctgggt ccctcagctc
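The number at the start of each ORIGIN line is the position of its first base, which makes it easy to look up a coordinate from the feature table. As a small sketch (the helper name parse_origin_line is an illustration only), the line starting at 1201 can be used to confirm that the CDS annotated at position 1231 begins with a start codon.

```python
def parse_origin_line(line: str) -> tuple[int, str]:
    """Split one ORIGIN line into (position of its first base, bases)."""
    number, *blocks = line.split()
    return int(number), "".join(blocks)

# Line copied from the ORIGIN section of J00306.
line = "1201 actctccagc tcggctttcg cggcgccgag atgctgtcct gccgcctcca gtgcgcgctg"
first_position, bases = parse_origin_line(line)

# GenBank coordinates are 1-based; Python string indices are 0-based.
cds_start = 1231
offset = cds_start - first_position
start_codon = bases[offset:offset + 3]

print(first_position, len(bases), start_codon)
```

The three bases at 1231..1233 are atg, which agrees with the CDS location and with the M at the start of the /translation qualifier.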
Nucleic acid data

GenBank anatomy: sequence types

• Genomic DNA
• Genomic RNA (from RNA viruses)
• Precursor RNA
• mRNA (cDNA)
• Ribosomal RNA (in ribosomes)
• Transfer RNA
• Small nuclear RNA (associated with RNA splicing)
• Small cytoplasmic RNA
• MicroRNA…
Nucleic acid data

GenBank: reliability
GenBank is highly redundant
• There are 27,702,323 human entries in GenBank.
• But there are only ~22,000 protein coding regions.
• Why the difference?

GenBank may contain errors
• Entries are made by researchers.
• Very few researchers update their entries in the database.
• It can be difficult to resolve conflicts between different entries.
Nucleic acid data

GenBank: diversity
• Contains lots of information for a small number of species
• Note many plants… why?

20 most sequenced organisms:
Entries     Bases           Species
8849611     263280740770    Severe acute respiratory syndrome coronavirus 2
1943950     255975379562    Triticum aestivum
111961      205666444201    Hordeum vulgare
1347585     126087260773    Hordeum vulgare subsp. vulgare
520         106587373982    Hordeum bulbosum
164         93011095388     Viscum album
29876       92980158773     Hordeum vulgare subsp. spontaneum
10049258    43637339408     Mus musculus
27810338    36825525087     Homo sapiens
175081      24021381303     Escherichia coli
29811       21128005736     Avena sativa
2640663     20263158983     Arabidopsis thaliana
33665       16333452591     Klebsiella pneumoniae
2243780     16210185539     Bos taurus
1732296     13758456861     Danio rerio
312035      13104402122     Arachis hypogaea
195         11554711366     Sambucus nigra
28766       11286173222     Vicia faba
14812       10342955730     Triticum monococcum
23130       9981582961      Triticum turgidum subsp. durum

https://www.ncbi.nlm.nih.gov/genbank/release/current/
Nucleic acid data

Accessing sequence databases

• Searching the header
• Searching the annotations for keywords (organism, gene name, etc.)

• Searching the sequences

• Searching for sequences similar to a query sequence using programs
such as BLAST
• Searching for sequences containing particular patterns
Protein sequence data

Protein sequence database: UniProt

• Unified protein database incorporating multiple protein databases
• Collaboration between the SIB Swiss Institute of Bioinformatics, the European Bioinformatics Institute (UK) and the Protein Information Resource (USA)

http://www.uniprot.org
Protein sequence data

UniProt: components
• UniProt Knowledgebase (UniProtKB)
• central access point for extensive curated protein information, including function, classification, and cross-references
• UniProt Non-redundant Reference (UniRef)
• combines closely related sequences into a single record to speed searches
• UniProt Archive (UniParc)
• comprehensive repository, reflecting the history of all protein sequences
Protein sequence data

UniProt/Swiss-Prot: statistics
Release 2024_03 (May 2024)
571,609 sequence entries, curated from 299,621 unique references and comprising 206,878,625 amino acids

https://web.expasy.org/docs/relnotes/relstat.html
Protein sequence data

UniProt/TrEMBL: statistics
Release 2024_03 (May 2024)
244,910,918 sequence entries, comprising 86,585,019,224 amino acids

https://www.ebi.ac.uk/uniprot/TrEMBLstats
Protein sequence data

UniProt/Swiss-Prot: top 20 species

Number Frequency Species
1 20435 Homo sapiens (Human)
2 17212 Mus musculus (Mouse)
3 16386 Arabidopsis thaliana (Mouse-ear cress)
4 8199 Rattus norvegicus (Rat)
5 6727 Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)
6 6046 Bos taurus (Bovine)
7 5121 Schizosaccharomyces pombe (strain 972 / ATCC 24843) (Fission yeast)
8 4530 Escherichia coli (strain K12)
9 4472 Caenorhabditis elegans
10 4191 Bacillus subtilis (strain 168)
11 4187 Oryza sativa subsp. japonica (Rice)
12 4160 Dictyostelium discoideum (Social amoeba)
13 3778 Drosophila melanogaster (Fruit fly)
14 3506 Xenopus laevis (African clawed frog)
15 3332 Danio rerio (Zebrafish) (Brachydanio rerio)
16 2309 Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv)
17 2309 Gallus gallus (Chicken)
18 2218 Pongo abelii (Sumatran orangutan) (Pongo pygmaeus abelii)
19 2046 Escherichia coli O157:H7
20 1899 Mycobacterium tuberculosis (strain CDC 1551 / Oshkosh)
Protein sequence data

UniProt/TrEMBL: taxonomic origins

Kingdom Eukaryota
Protein sequence data

UniProt/Swiss-Prot anatomy: header

Alpha-1-antitrypsin protein

Information        Code
ID                 ID   A1AT_HUMAN     STANDARD;      PRT;   418 AA.
Accession          AC   P01009; Q13672; Q5U0M1; Q96BF9; Q96ES1; Q9P1P0;
Date               DT   21-JUL-1986 (Rel. 01, Created)
                   DT   01-OCT-1996 (Rel. 34, Last sequence update)
                   DT   13-SEP-2005 (Rel. 48, Last annotation update)
Description        DE   Alpha-1-antitrypsin precursor (Alpha-1 protease inhibitor)
                   DE   (Alpha-1-antiproteinase).
Gene Name          GN   Name=SERPINA1; Synonyms=AAT, PI; ORFNames=PRO0684, PRO2209;
Organism Species   OS   Homo sapiens (Human).
Classification     OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                   OC   Mammalia; Eutheria; Euarchontoglires; Primates; Hominidae;
                   OC   Homo.
Cross reference    OX   NCBI_TaxID=9606;
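Because every line of a Swiss-Prot flat-file entry starts with a two-letter line code, the header is straightforward to split into fields. The following is a minimal sketch under that assumption; the function name parse_line_codes is made up for this example, and only a few header lines from the entry above are included.

```python
from collections import defaultdict

# A few header lines copied from the A1AT_HUMAN entry shown above.
ENTRY = """\
ID   A1AT_HUMAN     STANDARD;      PRT;   418 AA.
AC   P01009; Q13672; Q5U0M1; Q96BF9; Q96ES1; Q9P1P0;
DE   Alpha-1-antitrypsin precursor (Alpha-1 protease inhibitor)
DE   (Alpha-1-antiproteinase).
GN   Name=SERPINA1; Synonyms=AAT, PI; ORFNames=PRO0684, PRO2209;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606;
"""

def parse_line_codes(text: str) -> dict[str, str]:
    """Group lines by their two-letter code; repeated codes are joined."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return {code: " ".join(values) for code, values in fields.items()}

record = parse_line_codes(ENTRY)

# The first accession on the AC line is the primary accession.
accessions = [acc.strip() for acc in record["AC"].split(";") if acc.strip()]
primary_accession = accessions[0]

# The ID line ends with the sequence length, e.g. "418 AA."
length = int(record["ID"].split()[-2])

print(primary_accession, length, record["DE"])
```

The remaining accessions on the AC line are secondary ones, which is why a single entry can be found under several identifiers.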
Protein sequence data

UniProt/Swiss-Prot anatomy: literature

Reference number RN [1]
Reference type RP NUCLEOTIDE SEQUENCE [MRNA].
External IDs RX MEDLINE=84107980; PubMed=6319097 [NCBI, ExPASy, EBI, Israel, Japan];
Authors RA Bollen A., Herzog A., Cravador A., Herion P., Chuchana P.,
RA van der Straten A., Loriau R., Jacobs P., van Elsen A.;
Title RT "Cloning and expression in Escherichia coli of full-length
RT complementary DNA coding for human alpha 1-antitrypsin.";
Journal information RL DNA 2:255-264(1983).
...
RN [22]
RP X-RAY CRYSTALLOGRAPHY (2.9 ANGSTROMS) OF 26-418.
RX PubMed=9466920 [NCBI, ExPASy, EBI, Israel, Japan];
RA Elliott P.R., Abrahams J.P., Lomas D.A.;
RT "Wild-type alpha 1-antitrypsin is in the canonical inhibitory
RT conformation.";
RL J. Mol. Biol. 275:419-425(1998).
...
RN [49]
RP VARIANT Z-BRISTOL MET-109.
RX MEDLINE=98120621; PubMed=9459000 [NCBI, ExPASy, EBI, Israel, Japan];
RA Lovegrove J.U., Jeremiah S., Gillett G.T., Temple I.K., Povey S.;
RT "A new alpha 1-antitrypsin mutation, Thr-Met 85, (PI ZBristol)
RT associated with novel electrophoretic properties.";
RL Ann. Hum. Genet. 61:385-391(1997).
Protein sequence data

UniProt/Swiss-Prot anatomy: functional information
CC -!- FUNCTION: Inhibitor of serine proteases. Its primary target is
CC elastase, but it also has a moderate affinity for plasmin and
CC thrombin.
CC -!- SUBCELLULAR LOCATION: Secreted.
CC -!- TISSUE SPECIFICITY: Plasma.
CC -!- DOMAIN: The reactive center loop (RCL) extends out from the body
CC of the protein and directs binding to the target protease. The
CC protease cleaves the serpin at the reactive site within the RCL.
CC -!- POLYMORPHISM: The sequence shown is that of the M1V allele which
CC is the most common form of PI (44 to 49%). Other frequent alleles
CC are: M1A 20 to 23%; M2 10 to 11%; M3 14 to 19%.
CC -!- DISEASE: The major physiological function of AAT is the protection
CC of the lower respiratory tract against proteolytic destruction by
CC human leukocyte elastase (HLE).

This information summarises data from the literature.

This knowledge is of enormous scientific value.
NOTE: the information may not be completely up-to-date. Why?
Protein sequence data

UniProt/Swiss-Prot anatomy: references and links

DNA sequence        DR   EMBL; K01396; AAB59375.1; -; mRNA.
                    DR   EMBL; K02212; AAB59495.1; -; Genomic_DNA.
Protein sequence    DR   PIR; A21853; ITHU.
Protein structure   DR   PDB; 1ATU; X-ray; @=45-418.
                    DR   PDB; 1D5S; X-ray; A=44-377, B=378-418.
2-D PAGE            DR   SWISS-2DPAGE; P01009; HUMAN.
                    DR   HSC-2DPAGE; P01009; HUMAN.
Domains             DR   InterPro; IPR000215; Prot_inh_serpin.
                    DR   InterPro; Graphical view of domains.
                    DR   Pfam; PF00079; Serpin; 1.
                    DR   Pfam; Graphical view of domain structure.
Keywords            KW   3D-structure; Acute phase; Sequencing;
                    KW   Serine protease inhibitor; Serpin.
Protein sequence data

UniProt/Swiss-Prot anatomy: features

Secretion signal        FT   SIGNAL        1     24
                        FT   CHAIN        25    418       Alpha-1-antitrypsin.
Protein modifications   FT   CARBOHYD     70     70       N-linked (GlcNAc...).
                        FT   CARBOHYD    107    107       N-linked (GlcNAc...).
                        FT   CARBOHYD    271    271       N-linked (GlcNAc...).
Polymorphisms           FT   VARIANT       4      4       S -> L (in Z-Wrexham).
                        FT                                /FTId=VAR_006978.
Known errors            FT   CONFLICT     12     12       Missing (in Ref. 4).
Structure               FT   TURN         49     50
                        FT   HELIX        51     68
                        FT   STRAND       74     76
                        FT   HELIX        78     89
                        FT   TURN         90     91
Protein sequence data

UniProt/Swiss-Prot anatomy: sequence

The SQ line gives the length, the molecular weight and a checksum (64-bit cyclic redundancy); the amino acid sequence follows.

SQ   SEQUENCE   418 AA;  46737 MW;  7016555F273B7F16 CRC64;
     MPSSVSWGIL LLAGLCCLVP VSLAEDPQGD AAQKTDTSHH DQDHPTFNKI TPNLAEFAFS
     LYRQLAHQSN STNIFFSPVS IATAFAMLSL GTKADTHDEI LEGLNFNLTE IPEAQIHEGF
     QELLRTLNQP DSQLQLTTGN GLFLSEGLKL VDKFLEDVKK LYHSEAFTVN FGDTEEAKKQ
     INDYVEKGTQ GKIVDLVKEL DRDTVFALVN YIFFKGKWER PFEVKDTEEE DFHVDQVTTV
     KVPMMKRLGM FNIQHCKKLS SWVLLMKYLG NATAIFFLPD EGKLQHLENE LTHDIITKFL
     ENEDRRSASL HLPKLSITGT YDLKSVLGQL GITKVFSNGA DLSGVTEEAP LKLSKAVHKA
     VLTIDEKGTE AAGAMFLEAI PMSIPPEVKF NKPFVFLMIE QNTKSPLFMG KVVNPTQK
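Searching a sequence for a particular pattern can be illustrated with this entry. The sketch below scans the alpha-1-antitrypsin sequence for the N-glycosylation consensus N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline). That consensus is not on the slides; it is added here as a commonly used motif, and a match only suggests a candidate site.

```python
import re

# Sequence of A1AT_HUMAN, copied from the SQ block; whitespace is removed.
SEQUENCE = "".join("""
MPSSVSWGIL LLAGLCCLVP VSLAEDPQGD AAQKTDTSHH DQDHPTFNKI TPNLAEFAFS
LYRQLAHQSN STNIFFSPVS IATAFAMLSL GTKADTHDEI LEGLNFNLTE IPEAQIHEGF
QELLRTLNQP DSQLQLTTGN GLFLSEGLKL VDKFLEDVKK LYHSEAFTVN FGDTEEAKKQ
INDYVEKGTQ GKIVDLVKEL DRDTVFALVN YIFFKGKWER PFEVKDTEEE DFHVDQVTTV
KVPMMKRLGM FNIQHCKKLS SWVLLMKYLG NATAIFFLPD EGKLQHLENE LTHDIITKFL
ENEDRRSASL HLPKLSITGT YDLKSVLGQL GITKVFSNGA DLSGVTEEAP LKLSKAVHKA
VLTIDEKGTE AAGAMFLEAI PMSIPPEVKF NKPFVFLMIE QNTKSPLFMG KVVNPTQK
""".split())

# N followed by (not P), (S or T), (not P). The lookahead keeps the match
# to the asparagine itself, so overlapping candidate sites are all reported.
MOTIF = re.compile(r"N(?=[^P][ST][^P])")

# Convert 0-based match offsets to 1-based sequence positions.
sites = [match.start() + 1 for match in MOTIF.finditer(SEQUENCE)]

print(len(SEQUENCE), sites)
```

The sequence length agrees with the 418 AA on the SQ line, and the three positions found (70, 107 and 271) are the same residues annotated as CARBOHYD in the feature table.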
Specialized databases

Specialized sequence databases

• Focus on a specific type of sequence
• Sequences are often modified or specially annotated
• Usage depends on the database
• Examples:
• Non-coding RNA databases (https://rnacentral.org/expert-databases)
• Immunology databases (e.g., ImMunoGeneTics, IMGT)
• Cancer databases (e.g., The Cancer Genome Atlas, TCGA; https://portal.gdc.cancer.gov/)
Specialized databases

• TCGA: 20,000 primary cancer and matched normal samples spanning 33 cancer types (https://portal.gdc.cancer.gov/)
• gnomAD: 76,156 genomes from ~9 ancestries/populations (https://gnomad.broadinstitute.org/)
Specialized databases

Using the right words: ontologies

The Gene Ontology (GO)
[Figure: part of the GO molecular function hierarchy. Terms shown: nucleic acid binding; enzyme; DNA binding; helicase; adenosine triphosphatase; chromatin binding; DNA helicase; ATP-dependent helicase; DNA-dependent adenosine triphosphatase; ATP-dependent DNA helicase.]

http://geneontology.org/docs/introduction-to-go
Programmatic access

Programmatic access to databases: RESTful APIs
• REST: Representational State Transfer
• API: Application Programming Interface
• Many databases are designed so that their content can be accessed in a consistent way, to allow automation and scripting
• Note that the syntax is consistent within one database but will differ between databases
• (Broadly) access involves:
• BaseURL: where the content sits
• Query: search phrase/syntax used to extract the information
• Fields: which part of the database/table to search within
• Format: how you want the output
Programmatic access

Demo: UniProt
Guide: http://www.uniprot.org/help/programmatic_access
BaseURL: https://rest.uniprot.org/uniprotkb/
Examples:
Web browser:
https://rest.uniprot.org/uniprotkb/search?query=reviewed:true+AND+organism_id:9606&format=tsv
Terminal:
curl -H "Accept: text/plain; format=tsv" "https://rest.uniprot.org/uniprotkb/search?query=reviewed:true+AND+organism_id:9606"
In both examples:
• BaseURL: https://rest.uniprot.org/uniprotkb/
• Query: reviewed:true AND organism_id:9606
• Fields: reviewed, organism_id
• Format: tsv (the format=tsv parameter in the browser, the Accept header with curl)
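The same request can be assembled in a script. This is a minimal sketch using only the Python standard library; the function names build_search_url and fetch are made up for this example, and the optional fields and size parameters are included on the assumption that they behave as described in the UniProt programmatic access guide.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://rest.uniprot.org/uniprotkb/search"

def build_search_url(query, fmt="tsv", fields=None, size=None):
    """Combine base URL, query, optional return fields and output format."""
    params = {"query": query, "format": fmt}
    if fields:
        params["fields"] = ",".join(fields)  # columns to return (assumed names)
    if size:
        params["size"] = size                # number of results per page
    return f"{BASE_URL}?{urlencode(params)}"

def fetch(url):
    """Download the response body as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")

# Reviewed (Swiss-Prot) human entries, as in the browser example.
url = build_search_url("reviewed:true AND organism_id:9606")
print(url)

# To actually run the query, call for example:
# text = fetch(build_search_url("reviewed:true AND organism_id:9606", size=5))
```

urlencode percent-encodes the colons and turns spaces into "+", so the printed URL looks slightly different from the hand-written one but carries the same query.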
Programmatic access

Other useful APIs

• Gene Ontology:
• Guide: http://geneontology.org/docs/tools-guide/
• BaseURL: https://api.geneontology.org/api
• Example:
http://api.geneontology.org/api/bioentity/function/GO:0006915
• NCBI (E-utils)
• Guide: https://www.ncbi.nlm.nih.gov/books/NBK25501/
• BaseURL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
• Example:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer+AND+2008[pdat]
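E-utils requests follow the same BaseURL plus parameters pattern. The sketch below only builds the URLs; the helper name eutils_url is made up for this example. The second URL is an assumption about how the GenBank record J00306 used earlier in these slides could be requested with efetch in flat-file format.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def eutils_url(tool, **params):
    """Build a URL for one E-utility (e.g. 'esearch', 'efetch')."""
    return f"{EUTILS_BASE}{tool}.fcgi?{urlencode(params)}"

# The PubMed search from the slide.
search_url = eutils_url(
    "esearch",
    db="pubmed",
    term="science[journal] AND breast cancer AND 2008[pdat]",
)

# Fetching a nucleotide record as GenBank flat-file text.
fetch_url = eutils_url(
    "efetch", db="nuccore", id="J00306", rettype="gb", retmode="text"
)

print(search_url)
print(fetch_url)
```

The square brackets in the search term are percent-encoded as %5B and %5D, which is the form a browser would send as well.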
Programmatic access

Programmatic access: local download

• Common flat file formats
• Tabular (TSV or CSV, or some other delimiter)
• OBO (Open Biomedical Ontologies): biology-oriented language for building ontologies, based on OWL
• OWL (Web Ontology Language)
• JSON (JavaScript Object Notation)
• XML (Extensible Markup Language)
• Relational formats
• SQL or SQLite
• Download either from an FTP site or an API site
• Access through FTP or tools like Aspera
• Search, parse or query local versions
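Once a file has been downloaded, parsing it locally is usually a few lines of code. As a hedged sketch, the stanza below is an abridged OBO-style entry for GO:0003723 (the real entry carries more tags, such as a definition, synonyms and is_a links), and parse_obo_stanza is a made-up helper that turns it into a dictionary and then into JSON.

```python
import json

# Abridged OBO-style stanza; only three tags are kept for illustration.
STANZA = """\
[Term]
id: GO:0003723
name: RNA binding
namespace: molecular_function
"""

def parse_obo_stanza(text):
    """Collect 'tag: value' lines of one stanza; tags may repeat."""
    term = {}
    for line in text.splitlines():
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        term.setdefault(tag, []).append(value)
    return term

term = parse_obo_stanza(STANZA)

# The same information serialised as JSON.
as_json = json.dumps(term, indent=2)
print(as_json)
```

Values are kept in lists because tags such as is_a or synonym can occur several times in a full stanza.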
File formats

OBO vs OWL: example GO:0003723

File formats

JSON: example GO:0003723

Further reading/resources
• http://www.ncbi.nlm.nih.gov/Genbank/index.html
• https://www.ncbi.nlm.nih.gov/genbank/release/current/
• https://web.expasy.org/docs/relnotes/relstat.html
• https://www.ebi.ac.uk/uniprot/TrEMBLstats
• http://geneontology.org
• https://www.ncbi.nlm.nih.gov/books/NBK25501/
• http://www.uniprot.org
• https://portal.gdc.cancer.gov/
• https://gnomad.broadinstitute.org/
• https://rnacentral.org/expert-databases
