03 Databases
BINF2010
Building Blocks
(Genes/Molecules)
Kanehisa (2000) Post-genome Informatics
Sequence data
Biology… not so simple!
Databases: terminology
• Database: a collection of information related to a specific subject (e.g., a phone book)
Databases: types
Flat-file databases
Pros
• Easy to put together and distribute
• No need for expensive or complicated database management software
Cons
• Detailed targeted searching is difficult
• Searching is not efficient
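The "searching is not efficient" point can be illustrated with a minimal sketch. The pipe-delimited record layout, the record IDs (REC001 and so on) and the function name are made up for illustration; they are not a real database format. The point is only that a flat file has no index, so every query reads every line.

```python
# Hypothetical flat-file "database": one record per line, fields separated by "|".
# The layout and the record IDs are invented for this sketch.
FLAT_FILE = """\
REC001|Homo sapiens|oxygen transport
REC002|Homo sapiens|kinase
REC003|Mus musculus|oxygen transport
"""


def search_flat_file(text, organism):
    """Return IDs of records for one organism, plus how many lines were read."""
    hits = []
    lines_read = 0
    for line in text.splitlines():
        lines_read += 1  # no index: every record is inspected
        rec_id, org, _function = line.split("|")
        if org == organism:
            hits.append(rec_id)
    return hits, lines_read


hits, lines_read = search_flat_file(FLAT_FILE, "Mus musculus")
print(hits, lines_read)
```

Even for a single matching record, the whole file is scanned, so the cost grows with the size of the file rather than with the number of hits.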
Relational databases
• Require a Relational Database Management System (RDBMS)
• Queried using SQL (or more commonly, a GUI front-end)
• Annotations (metadata)
• e.g., Authors, literature references, protein function, organism
of origin, location of coding regions in DNA sequence, etc.
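As a rough sketch of how annotations can be stored and queried relationally, the example below uses Python's built-in sqlite3 module as a stand-in RDBMS. The table names, column names and records are assumptions invented for illustration; no real sequence database is claimed to use this schema.

```python
import sqlite3

# In-memory SQLite database standing in for a full RDBMS.
# Schema and records are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT
);
CREATE TABLE sequence (
    seq_id           TEXT PRIMARY KEY,
    organism_id      INTEGER REFERENCES organism(organism_id),
    protein_function TEXT,
    residues         TEXT
);
CREATE INDEX idx_sequence_organism ON sequence(organism_id);
""")
conn.executemany(
    "INSERT INTO organism VALUES (?, ?)",
    [(1, "Homo sapiens"), (2, "Mus musculus")],
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?, ?)",
    [
        ("SEQ001", 1, "oxygen transport", "MVLSPADKTN"),
        ("SEQ002", 1, "kinase", "MGSNKSKPKD"),
        ("SEQ003", 2, "oxygen transport", "MVLSGEDKSN"),
    ],
)


def find_sequences(connection, organism_name, function):
    """Targeted search on two annotation fields using a SQL join."""
    rows = connection.execute(
        """
        SELECT s.seq_id
        FROM sequence AS s
        JOIN organism AS o ON o.organism_id = s.organism_id
        WHERE o.name = ? AND s.protein_function = ?
        ORDER BY s.seq_id
        """,
        (organism_name, function),
    ).fetchall()
    return [row[0] for row in rows]


result = find_sequences(conn, "Homo sapiens", "oxygen transport")
print(result)
```

Splitting organisms and sequences into separate tables stores each organism name once, and the index lets the database find matching rows without scanning the whole table.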
Sequence data
Sequence databases
• Nucleic acids (DNA/RNA): GenBank
• Proteins: UniProt
• Specialized/other:
• Non-coding RNA databases: RNAcentral
• Variation databases: gnomAD
• Cancer genomes: TCGA
Nucleic acid data
http://www.ncbi.nlm.nih.gov/Genbank/index.html
https://www.ncbi.nlm.nih.gov/genbank/statistics/
GenBank: reliability
• GenBank is highly redundant
• GenBank may contain errors
GenBank: diversity
20 most sequenced organisms
• Contains lots of information for a small number of species
• Note many plants…why?

 Entries         Bases  Species
 1943950  255975379562  Triticum aestivum
  111961  205666444201  Hordeum vulgare
 1347585  126087260773  Hordeum vulgare subsp. vulgare
     520  106587373982  Hordeum bulbosum
     164   93011095388  Viscum album
   29876   92980158773  Hordeum vulgare subsp. spontaneum
10049258   43637339408  Mus musculus
27810338   36825525087  Homo sapiens
  175081   24021381303  Escherichia coli
   29811   21128005736  Avena sativa
 2640663   20263158983  Arabidopsis thaliana
   33665   16333452591  Klebsiella pneumoniae
 2243780   16210185539  Bos taurus
 1732296   13758456861  Danio rerio
  312035   13104402122  Arachis hypogaea
     195   11554711366  Sambucus nigra
   28766   11286173222  Vicia faba
   14812   10342955730  Triticum monococcum
   23130    9981582961  Triticum turgidum subsp. durum

https://www.ncbi.nlm.nih.gov/genbank/release/current/
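One way to start exploring the per-organism figures is to compare bases per entry. The sketch below copies four rows from the slide and assumes the first number is the entry count and the second the base count (the slide does not label the columns); the function name is made up for illustration.

```python
# Four rows copied from the slide; assumed column order: entries, bases, species.
ROWS = [
    (1943950, 255975379562, "Triticum aestivum"),
    (520, 106587373982, "Hordeum bulbosum"),
    (10049258, 43637339408, "Mus musculus"),
    (27810338, 36825525087, "Homo sapiens"),
]


def bases_per_entry(rows):
    """Map each species to its average number of bases per entry."""
    return {species: bases / entries for entries, bases, species in rows}


ratios = bases_per_entry(ROWS)
# Print species from the largest to the smallest average entry size.
for species, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{species:20s} {ratio:15.0f}")
```

Under that column assumption, a species can rank highly with only a few hundred very large entries, which is one observation to weigh when thinking about the question on the slide.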
http://www.uniprot.org
Protein sequence data
UniProt: components
• UniProt Knowledgebase (UniProtKB)
• central access point for extensive curated protein information, including function, classification, and cross-references
• UniProt Non-redundant Reference (UniRef)
• combines closely related sequences into a single record to speed up searches
• UniProt Archive (UniParc)
• comprehensive repository reflecting the history of all protein sequences
UniProt/Swiss-Prot: statistics
Release 2024_03 (May 2024)
https://web.expasy.org/docs/relnotes/relstat.html
UniProt/TrEMBL: statistics
Release 2024_03 (May 2024)
244,910,918 sequence entries, comprising 86,585,019,224 amino acids
https://www.ebi.ac.uk/uniprot/TrEMBLstats
Structure
FT TURN 49 50
FT HELIX 51 68
FT STRAND 74 76
FT HELIX 78 89
FT TURN 90 91
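Feature (FT) lines like these are straightforward to read programmatically. The sketch below parses the five lines shown on the slide and counts residues per secondary-structure type. It assumes each line is whitespace-separated as "FT KEY START END" with inclusive coordinates; real flat-file entries carry more columns and feature types than this handles, and the function names are made up for illustration.

```python
# The five feature lines shown on the slide.
FT_LINES = """\
FT TURN 49 50
FT HELIX 51 68
FT STRAND 74 76
FT HELIX 78 89
FT TURN 90 91
"""


def parse_features(text):
    """Parse 'FT KEY START END' lines into (key, start, end) tuples."""
    features = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[0] != "FT":
            continue  # skip anything that is not a simple feature line
        features.append((parts[1], int(parts[2]), int(parts[3])))
    return features


def residues_per_type(features):
    """Sum the residues covered by each feature key (coordinates inclusive)."""
    totals = {}
    for key, start, end in features:
        totals[key] = totals.get(key, 0) + (end - start + 1)
    return totals


features = parse_features(FT_LINES)
totals = residues_per_type(features)
print(totals)
```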
https://portal.gdc.cancer.gov/
https://gnomad.broadinstitute.org/
Specialized databases
http://geneontology.org/docs/introduction-to-go
Programmatic access
Demo: UniProt
Guide: http://www.uniprot.org/help/programmatic_access
BaseURL: https://rest.uniprot.org/uniprotkb/
Examples:
Web browser:
https://rest.uniprot.org/uniprotkb/search?query=reviewed:true+AND+organism_id:9606&format=tsv
Terminal:
curl -H "Accept: text/plain; format=tsv" \
  "https://rest.uniprot.org/uniprotkb/search?query=reviewed:true+AND+organism_id:9606"
Parts labelled on the slide: Format, BaseURL, Query Fields
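The same request can be sketched in Python using only the standard library. This assumes the base URL https://rest.uniprot.org/uniprotkb/ and the query and format parameters from the examples above; the function names build_search_url and fetch_tsv are made up for illustration. Note that urlencode percent-encodes the colons, which is equivalent to the hand-written URL.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL, taken from the slide.
BASE_URL = "https://rest.uniprot.org/uniprotkb/"


def build_search_url(query, fmt="tsv"):
    """Assemble BaseURL + 'search' + the encoded query and format parameters."""
    params = urlencode({"query": query, "format": fmt})
    return f"{BASE_URL}search?{params}"


def fetch_tsv(query, timeout=30):
    """Download the search result as text (needs network access)."""
    with urlopen(build_search_url(query), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = build_search_url("reviewed:true AND organism_id:9606")
print(url)
```

Calling fetch_tsv("reviewed:true AND organism_id:9606") would perform the actual download; it is left out of the sketch so that the block runs without a network connection.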