Lecture01 Introduction FA24
1. Course overview:
– Staff, students, responses to student survey
– Foundations, frontiers, textbook, homework, quiz
– Final project: teams, mentorship, challenge, relevance,
originality, achievement, presentation
2. Why Computational Biology + Generative AI?
– Why Computation is well-suited to Biology+Genomes
– Why this time it’s different: Gen AI+DeepRepr.Learning
3. Overview of course modules
– Genomes, Expression, Epigenomics,
Networks, Genetics, Evolution, Frontiers
4. Biology primer (in the context of this course)
– Central Dogma of Molecular Biology
– DNA, Epigenomics, RNA, Protein, Networks
– Human genetics, Drug Discovery
I. Administrivia
Fall 2021, 2022: Panopto, and awesome search capabilities
Panopto (Fall 2021)
https://mit.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx?folderID=7c716154-6516-4a49-9a81-adad0135dcb8
Panopto (Fall 2022)
https://mit.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx?folderID=176f8b23-0433-403d-8c26-af090151a28d
Panopto features: bookmarks, closed captions, chapters, playlists, speaker video, search function, 2X speed, automatic chapters (from slide headers), slide navigation
Details on the in-class quiz
• Round 0: Self-introduction (due Week 2 Friday)
• Round 1: Literature search and paper description (due Week 4 Friday)
• Round 2: Team formation, project proposal, feasibility (due Week 6 Friday)
• Round 3: Office Hours, Update, Feedback (meet Week 8 + Week 10 Fridays)
• Round 4: Midcourse report (due Week 12 Friday)
• Round 5: Final report + slides (due Week 14 Friday)
Course at a Glance
Details on the final project
• Milestones ensure sufficient planning / feedback
– Set-up: find project matching your skills and interests
– Team: common interests and complementary skills
– Inspiration: last year’s projects, and recent papers
– Proposal: establish milestones, deliverables, expectations
– Midcourse: see endpoint, outline report, methods, figures
• Periodic mentoring sessions
– Senior students and postdocs can serve as your mentors
– Group discussions to share ideas, guidance, feedback
– Peer-review: think critically about peer proposals, receive
feedback/suggestions, respond to critiques, adjust course
• Real-world experience, condensed in a single term
– Grant/fellowship proposals, peer review, yearly reports,
budgeting time/effort, collaboration, paper writing, giving a talk
Comm Lab: Help communicating your research!
A free resource for peer feedback from trained EECS
grad students and postdocs.
[Figure labels: Committee (peer review); Priestess (nurse); Patient]
Alois Alzheimer, 1911: AD Plaques+Tangles
Human genome, genetic studies, GWAS 2007
Single-cell: 430 donors, 2M cells
Three major paradigm shifts: Data, Genomes, AI
Hypothesis-driven research:
• Formulate hypothesis, then gather data
• Lots of thinking before targeted study
• Problem: Highly biased, little novelty
Data-driven research:
• Gather data, ask questions later
• Systematic datasets, build resources, massive data sharing, comprehensive
Roadmap, Nature ’15
Claussnitzer, NEJM ’15
[Figure labels: 5. Disseminate results; Lean]
• scRNA of ApoE33, ApoE34, ApoE44 individuals
• Cholesterol transport & biosynthesis in oligos
• Cholesterol accumulates in ER, myelination decreases
• Causality: Lack of myelination recapitulated in ApoE4 iPSC-derived oligodendrocytes
Blanchard, Nature, 2022
With: Joel Blanchard, Leyla Akay, Jose Davila-Velderrain, Djuna von Maydel, Li-Huei Tsai
Reverse cancer w/ immunotherapy: scRNA + epig + TFs personalized combination treatment
Jackie Yang,
David Liu (Dana Farber),
Kunal Rai (MD Anderson),
Genevieve Boland (MGH)
What is GenAI and how can it help cure disease?
Manolis Kellis
GenAI Key idea: Representation learning
• Learn complex scenes/objects from simpler parts (figure example: facial structure)
• Bottom-up building of world representations
• Convolutional filters learn motifs (PSSM)
Deep Learning Architectures: Graph Neural Networks GNNs
Graph Convolutional Networks
Positional encodings
Multi-modal generative AI: Image Text Translation
Paint a classroom of students listening to a lecture on multi-modality with astronauts and knights and princesses where the lecturer is a giant bear
The image depicts a bright classroom scene. There are multiple rows of wooden
desks, each accommodating two students, and the room is filled with children who
appear to be in elementary school. The students are wearing casual clothing, with a
variety of patterns including stripes and plaids. The majority of the children have
their hands raised, signaling eagerness to participate or answer a question. In the
background, there is a teacher standing next to a whiteboard, which is partially
obscured in the image. The whiteboard appears to be blank. The room has large
windows that allow plenty of natural light to fill the space, and there are white walls
and a green chalkboard behind the teacher. The desks have open fronts where books
and notebooks can be stored, and there are papers and books on the desks. The
children's attention is focused on the teacher, indicating an interactive and engaging
class environment
Cross-modal “Visual-Semantic Embeddings”
WSABIE (Weston et al 2010), DeViSE (Frome et al 2013),
Cross-Modal Transfer (Socher et al 2013)
Genomics
• Every person is a point, based on their cellular expression patterns
• Use ‘cartography’ from gene expression to map phenotype
• Reveal impact of gene expression, phenotype, genotype
• Common foundational map: reason about health impact
Multi-modal Embeddings of 2.4 million Human Cells
Every dot is a 20,000-dimensional vector
Integrate 2.4 million ‘documents’
Impact:
• Understand gene relationships
• Understand impact of phenotype
• Understand impact of age, sex
• Understand pathway correlations
• Understand gene co-variation
• Map phenotype to cell space
Functional knowledge graph of 20,000 human genes
[Figure labels: Cardiovascular; Metastatic melanoma]
Large language models, Biological language models, Single-cell AI models, Geometric deep learning
• Papers + Grants + Patents + Startups + Offices: Flow of Knowledge, resources
• Dynamics over time: Knowledge evolution. Disciplines emerging, maturing, changing
• 100,000 loans, clustered by description: Context-specific predictive algorithms
• Education: MIT, EdX, AP, High-school, YouTube. Multi-modal learning, interdisciplinary links. Match CVs, job descriptions, skill sets
• Google News, Podcasts, Websites, Wikipedia: Ontology creation and labeling. Auto-link creation, paragraph level
• Collaboration, Productivity, Team Progress: Link projects across team members. Within-meeting live track of productivity
The power of Maps for Physical Space Navigation
Maps give us
Landscape
Landmarks
Anchor points
Street names
Highway names
Simplification
Abstraction
Summarization
Decision making
The road ahead: Systematic understanding in biology + work
• AI as a discovery partner: multi-modal foundation models
• Gain insights previously inaccessible to human scientists
• Build rich intuition on biological + therapeutic space
• AI as “Google Maps” for navigating knowledge space
• Visual search + integration through millions of documents
• Hierarchical interactive knowledge representation + manip.
Goals for today: Course Introduction
1. Course overview:
– Staff, students, responses to student survey
– Foundations, frontiers, textbook, homework, quiz
– Final project: teams, mentorship, challenge,
relevance, originality, achievement, presentation
2. Why Computational Biology + Generative AI?
– Why Computation is well-suited to Biology+Genomes
– Why Gen AI + Representation Learning is different
3. Overview of course modules
– Genomes, Expression, Epigenomics,
Networks, Genetics, Evolution, Frontiers
4. Biology primer (in the context of this course)
– Central Dogma of Molecular Biology
– DNA, Epigenomics, RNA, Protein, Networks
– Human genetics, Drug Discovery
Course at a Glance
Challenges in Computational Biology
2 Sequence alignment
3 Database lookup
4 Genome Assembly
6 Comparative Genomics
7 Evolutionary Theory
9 Cluster discovery
10 Gibbs sampling
11 Protein network analysis
12 Metabolic modelling
[Figure: example DNA sequences (TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT) and an RNA transcript]
[Figure: dynamic programming matrix with entries Vk(i), states 1, 2, … over sequence positions x1 x2 x3 … xN]
• Sequence alignment • Hidden Markov Models
• DP: Core computational technique
– Pervasive in computer science, and computational biology
– Fully explore exponential search spaces in poly time!
– Greedy algorithms will not work, back-tracking, saving soln
– Special requirements: Optimal substructure
– Found in: alignment, HMMs, phylogeny, genetics, pop gen…
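To make the dynamic programming idea concrete, here is a minimal sketch of global sequence alignment (Needleman-Wunsch). The scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions, not the course's official parameters. The table F stores the best score of every prefix pair (optimal substructure), and the traceback recovers one optimal alignment from the saved sub-solutions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not fixed by the lecture.
    """
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Traceback: walk from the bottom-right corner back to the origin
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells and each cell takes constant time, so the exponential space of alignments is searched in O(nm) time.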
Gene expression analysis and transcripts
• Computational foundations:
– Unsupervised Learning: Expectation Maximization
– Supervised learning: generative/discriminative models
– Read mapping, significance testing, splice graphs
• Biological frontiers:
– PS2: Modeling conservation, GC content, CpG islands
– L6/L7: Genome annotation and parsing
– L8: Gene expression analysis: cluster genes/conditions
– L9: Regulatory motif discovery: EM, Gibbs sampling, info
Natural 1st step: group similar rows/columns
Clustering
Similar cell types Similarly-behaving groups of genes
[Figure: expression matrices with genes as rows and conditions as columns, before and after clustering]
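As a rough illustration of grouping similar rows and columns, the sketch below runs average-linkage agglomerative clustering on a tiny made-up expression matrix. The gene names, values, distance measure (Euclidean), and linkage choice are all assumptions for illustration; the lecture does not prescribe a specific method here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def agglomerative(rows, k):
    """Repeatedly merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        total = sum(euclidean(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        _, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: rows are genes, columns are conditions
expr = {
    "geneA": [1.0, 1.1, 5.0, 5.2],
    "geneB": [0.9, 1.0, 5.1, 5.0],
    "geneC": [5.0, 5.1, 1.0, 0.9],
    "geneD": [5.2, 4.9, 1.1, 1.0],
}
names = list(expr)
matrix = [expr[g] for g in names]

gene_clusters = agglomerative(matrix, k=2)                  # group similar rows
condition_clusters = agglomerative(list(zip(*matrix)), k=2)  # group similar columns
print([[names[i] for i in c] for c in gene_clusters])
print(condition_clusters)
```

Clustering the transposed matrix groups conditions (for example similar cell types) with exactly the same routine that groups genes.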
• Computational Foundations
– Hidden Markov Models (HMMs): Central tool in CS
– Decoding, evaluation, parsing, likelihood, scoring
– Unsupervised Learning: Expectation Maximization
– Supervised learning: generative/discriminative models
• Biological frontiers:
– PS2: Modeling conservation, GC content, CpG islands
– L6/L7: Genome annotation and parsing
– L8: Gene expression analysis: cluster genes/conditions
– L9: Regulatory motif discovery: EM, Gibbs sampling, info
Motifs summarize TF sequence specificity
• Summarize
information
• Integrate many
positions
• Measure of
information
• Distinguish motif
vs. motif instance
• Assumptions:
– Independence
– Fixed spacing
Starting positions → Motif matrix
• Given aligned sequences, it is easy to compute the profile matrix
[Figure: shared motif across sequences, motif positions 1 to 8]
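The step from aligned motif instances to a profile matrix can be sketched as follows. The example sites, the pseudocount, and the uniform background are hypothetical choices for illustration. The code assumes, as the slide does, independent positions and fixed spacing.

```python
import math

BASES = "ACGT"


def profile_matrix(instances, pseudocount=1.0):
    """Per-position base frequencies from equal-length aligned motif instances."""
    profile = []
    for pos in range(len(instances[0])):
        counts = {b: pseudocount for b in BASES}
        for seq in instances:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        profile.append({b: counts[b] / total for b in BASES})
    return profile


def information_content(column):
    """Bits of information at one position: 2 minus the Shannon entropy."""
    return 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)


def log_odds_score(profile, site, background=0.25):
    """Score a candidate motif instance against a uniform background."""
    return sum(math.log2(profile[i][b] / background) for i, b in enumerate(site))


# Hypothetical aligned motif instances (not taken from the lecture)
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
profile = profile_matrix(sites)
for pos, column in enumerate(profile, start=1):
    print(pos, round(information_content(column), 2))
print(log_odds_score(profile, "TATAAT") > log_odds_score(profile, "GGGGGG"))
```

The profile is the motif; any individual sequence scored against it is a motif instance.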
[Figure: chromatin-state HMM. Observed chromatin marks (K4me1, K4me3, K27ac, K36me3) are called in 200bp intervals based on a Poisson distribution. Each interval is assigned its most likely hidden state (states 1 to 6), and each state has its own high-probability chromatin marks (emission probabilities around 0.7 to 0.9). All probabilities are learned from the data.]
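Finding the most likely hidden state for each interval is a Viterbi decoding problem. The sketch below is a deliberately simplified assumption: two hypothetical states, one observed mark per 200bp bin, and made-up probabilities, whereas the figure describes six states whose parameters are learned from the data.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for the observations (log space)."""
    # V[i][k] = log-probability of the best path that ends in state k at position i
    V = [{k: math.log(start_p[k]) + math.log(emit_p[k][obs[0]]) for k in states}]
    back = [{}]
    for i in range(1, len(obs)):
        V.append({})
        back.append({})
        for k in states:
            prev, best = max(((j, V[i - 1][j] + math.log(trans_p[j][k])) for j in states),
                             key=lambda t: t[1])
            V[i][k] = best + math.log(emit_p[k][obs[i]])
            back[i][k] = prev
    # Follow back-pointers from the best final state
    path = [max(states, key=lambda k: V[-1][k])]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Hypothetical toy model: all names and numbers are illustrative assumptions
states = ["promoter", "transcribed"]
start_p = {"promoter": 0.5, "transcribed": 0.5}
trans_p = {"promoter": {"promoter": 0.9, "transcribed": 0.1},
           "transcribed": {"promoter": 0.1, "transcribed": 0.9}}
emit_p = {"promoter": {"K4me3": 0.9, "K36me3": 0.1},
          "transcribed": {"K4me3": 0.1, "K36me3": 0.9}}

marks = ["K4me3", "K4me3", "K36me3", "K36me3", "K36me3"]
decoded = viterbi(marks, states, start_p, trans_p, emit_p)
print(decoded)
```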
• Phylogenetics / Phylogenomics
– Phylogenetics: Evolutionary models, Tree building, Phylo inference
– Phylogenomics: gene/species trees, reconciliation, coalescent, pops
• Population genomics:
– Learning population history from genetic data (David Reich)
– Statistical genetics: disease mapping in populations (Mark Daly)
– Measuring natural selection in human populations (Pardis Sabeti)
– The missing heritability in genome-wide associations (Yaniv Erlich)
• And we’re done! Last pset Nov 21st, In-class quiz on Nov 22nd
– No lab 4! Then entire focus shifts to projects, Thanksgiving, Frontiers
Characterizing sub-threshold variants in heart arrhythmia
DNA makes RNA makes Protein
DNA: The double helix
• The most noble molecule of our time
DNA: the molecule of heredity
• Self-complementarity sets molecular basis of heredity
– Knowing one strand, creates a template for the other
– “It has not escaped our notice that the specific pairing we have postulated immediately
suggests a possible copying mechanism for the genetic material.” Watson & Crick, 1953
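The statement that one strand serves as a template for the other can be shown in a few lines. This is a minimal sketch using standard Watson-Crick pairing (A with T, C with G); the function name is an illustrative choice.

```python
# Watson-Crick pairing: A<->T, C<->G
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Return the partner strand, written 5' to 3' like the input."""
    return strand.translate(COMPLEMENT)[::-1]


strand = "ATGC"
partner = reverse_complement(strand)
print(partner)
print(reverse_complement(partner) == strand)
```

The reversal is needed because the two strands are anti-parallel: the partner is read in the opposite direction.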
DNA: chemical details
[Figure: two anti-parallel strands with base pairs T–A and C–G, sugar carbons labeled 1’ to 5’]
• Bases hidden on the inside
• Phosphate backbone outside
• Weak hydrogen bonds hold the two strands together
• This allows low-energy opening and re-closing of two strands
• Anti-parallel strands
• Extension 5’→3’: tri-phosphate coming from newly added nucleotide
• The only pairings are:
  • A with T
  • C with G
DNA: the four bases
• Purine (A, G) vs. Pyrimidine (C, T)
• Weak (A, T) vs. Strong (C, G)
• Amino (A, C) vs. Keto (G, T)
“Central dogma” of Molecular Biology
DNA (Epigenomics) makes RNA makes Protein
Chromosomes inside the cell
• Eukaryote cell
• Prokaryote cell
DNA packaging
• Why packaging
– DNA is very long
– Cell is very small
• Compression
– Chromosome is 50,000 times shorter than extended DNA
• Using the DNA
– Before a piece of DNA is used for anything, this compact structure must open locally
• Now emerging:
– Role of accessibility
– State in chromatin itself
– Role of 3D interactions
Diverse epigenetic modifications
Image source: http://nihroadmap.nih.gov/epigenomics/
Diversity of epigenetic modifications
• 100+ different histone modifications
  • Histone protein: H3/H4/H2A/H2B
  • AA residue: Lysine4 (K4)/K36…
  • Chemical modification: Met/Pho/Ubi
  • Number: Me-Me-Me (me3)
  • Shorthand: H3K4me3, H2BK5ac
• In addition:
  • DNA modifications
  • Methyl-C in CpG / Methyl-Adenosine
  • Nucleosome positioning
  • DNA accessibility
• The constant struggle of gene regulation
  • TF/histone/nucleo/GFs/Chrom compete
[Figure labels: modifications on histone tails; DNA wrapped around histone proteins]
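The shorthand above (histone protein, residue, position, modification, number) can be unpacked mechanically. The sketch below is an illustration only: the regular expression, the field names, and the set of modification abbreviations (me, ac, ph, ub) are assumptions covering the examples on the slide, not a complete nomenclature parser.

```python
import re

# Assumed pattern: histone, one-letter residue, position, modification, optional count
MARK_PATTERN = re.compile(r"^(H2A|H2B|H3|H4)([A-Z])(\d+)(me|ac|ph|ub)(\d?)$")


def parse_histone_mark(shorthand):
    """Split a mark such as 'H3K4me3' into its components."""
    m = MARK_PATTERN.match(shorthand)
    if m is None:
        raise ValueError(f"unrecognized histone mark: {shorthand}")
    histone, residue, position, modification, count = m.groups()
    return {
        "histone": histone,
        "residue": residue,
        "position": int(position),
        "modification": modification,
        # Assumption: a missing number means a single modification group
        "count": int(count) if count else 1,
    }


print(parse_histone_mark("H3K4me3"))
print(parse_histone_mark("H2BK5ac"))
```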
Epigenomics Roadmap across 100+ tissues/cell types
Genes control the making of cell parts
– Alternative splicing can yield different exon subsets for the same gene,
and hence different protein products
RNA can be functional
Proteins carry out the cell’s chemistry
Helix-turn-helix: Common motif for DNA-binding proteins that often play a regulatory role as mRNA level transcription factors.
Beta-barrel: Some antiparallel β-sheet domains are better described as β-barrels rather than β-sandwiches, for example streptavidin and porin. Note that some structures are intermediate between the extreme barrel and sandwich arrangements.
Alpha-beta horseshoe: This placental ribonuclease inhibitor is a cytosolic protein that binds extremely strongly to any ribonuclease that may leak into the cytosol. 17-stranded parallel β-sheet curved into an open horseshoe shape, with 16 α-helices packed against the outer surface. It doesn't form a barrel although it looks as though it should. The strands are only very slightly slanted, being nearly parallel to the central `axis'.
Protein building blocks
• Amino Acids
From RNA to protein: Translation
• tRNA
• Ribosome
The Genetic Code
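As a small worked example of the genetic code, the sketch below builds the standard codon table and translates a coding-strand DNA sequence. The example sequence and function names are made up for illustration; the table is the standard nuclear code written in the conventional T, C, A, G order.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks stop)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_dna):
    """DNA coding strand to mRNA: same sequence with U in place of T."""
    return coding_dna.replace("T", "U")


def translate(coding_dna):
    """Translate codon by codon from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        aa = CODON_TABLE[coding_dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


gene = "ATGGCCAAGTAA"  # hypothetical toy open reading frame
print(transcribe(gene))
print(translate(gene))
```

In the cell this lookup is carried out by tRNAs and the ribosome; the dictionary plays the role of the tRNA pool.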
Inheritance
Messages
Reactions
Cellular dynamics and regulation
How cells move through this Central Dogma
DNA
makes
Protein
Animal/Human gene regulation:
One genome Many cell types
ACCAGTTACGACGGTCA
GGGTACTGATACCCCAA
ACCGTTGACCGCATTTA
CAGACGGGGTTTGGGTT
TTGCCCCACACAGGTAC
GTTAGCTACTGGTTTAG
CAATTTACCGTTACAAC
GTTTACAGGGTTACGGT
TGGGATTTGAAAAAAAG
TTTGAGTTGGTTTTTTC
ACGGTAGAACGTACCGT
TACCAGTA
Image Source wikipedia
Eukaryotic Gene Regulation
Diverse roles for regulatory non-coding RNAs
• Activator and
repressor motifs
consistent with
tissues
Pouya Kheradpour
Network components reveal functional modules
• Components with known properties
• Assemble based on engineering goals / principles
• Implement within engineered cells and organisms
• Study behavior & adjust as needed
[Figure labels: Metabolic Pathways; Synthetic]
Jim Collins, Jay Keasling
Over-expressing a single microRNA leads to a new wing
[Figure panels: WT, sense, antisense; labels: wing, haltere, wing w/ bristles]
Brief intro to human genetics
• Human genome: 3.2B letters, 2 copies, 23 chromosomes,
20k genes, ~3M common SNPs, ~500k haplotype blocks
The power and challenge of disease-association studies
rs11209026 (IL23R: cytokine receptor on a subset of effector T-cells)
             A     G
Cases       22   976
Controls    68   932
Chi-sq = 24.5, p = 7.3 × 10^-7
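The chi-square statistic and p-value quoted for rs11209026 can be checked from the allele counts. This is a sketch of a Pearson chi-square test on a 2x2 table with one degree of freedom and no continuity correction (an assumption that happens to reproduce the quoted values); the function name is illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the upper tail is erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Allele counts from the slide: cases (A=22, G=976), controls (A=68, G=932)
chi2, p_value = chi_square_2x2(22, 976, 68, 932)
print(round(chi2, 1))
print(f"{p_value:.1e}")
```

A genome-wide study runs one such test per variant, which is why the significance threshold has to be far stricter than 0.05.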
Genomewide association in schizophrenia
with 40,000 cases
ES
Liver
Brain
Digestive
Heart
T cells B cells
Methylation differences a causal component of AD
[Figure: alternative causal diagrams over G, M, and D]
AD predictive power reduced after removing meQTL effect
Set-wise causality testing
Uncovering the molecular basis of top obesity gene
Lean
Obese