Lecture01 Introduction FA24

The document outlines the goals and structure of a computational biology course, including an overview of the course content, faculty introductions, and the significance of computational biology combined with generative AI. It details course modules, project milestones, grading criteria, and the importance of original research in the field. Additionally, it emphasizes the collaborative nature of the course and the resources available for student support.


Goals for today: Course Introduction

1. Course overview:
– Staff, students, responses to student survey
– Foundations, frontiers, textbook, homework, quiz
– Final project: teams, mentorship, challenge, relevance,
originality, achievement, presentation
2. Why Computational Biology + Generative AI?
– Why Computation is well-suited to Biology+Genomes
– Why this time it’s different: Gen AI+DeepRepr.Learning
3. Overview of course modules
– Genomes, Expression, Epigenomics,
Networks, Genetics, Evolution, Frontiers
4. Biology primer (in the context of this course)
– Central Dogma of Molecular Biology
– DNA, Epigenomics, RNA, Protein, Networks
– Human genetics, Drug Discovery
I. Administrivia

Introduction to the course and its goals


Course organization and content
Homework and Quiz
Term Project
Introductions
• Lecturer: Manolis Kellis
– MIT CSAIL, CompBio, Broad, Disease mechanism, Epigenomics,
Cancer, Brain, Gene Regulation, Evolution, Single-cell genomics

• Lecturer: Eric Alm


– MIT Biological Engineering, Gen AI, Computational, theoretical,
experimental understanding & engineering human microbiome

• TA: Jared Zheng


– MIT CSAIL, Zhang Lab, Chemistry, Biophysics, protein-ligand
interactions, drug discovery, deep generative models, PLMs

• TA: Sarah Gurev


– MIT EECS, Debbie Marks Lab Harvard, Stanford BS in CS,
protein design and evolution

• TA: Benjamin James


– MIT EECS, CSAIL, Computational Biology, Broad, Epigenomics,
Regulatory Circuitry, Single-Cell, Addiction, Neuroscience
Course Information
• Lectures
– TR 1pm – 2:30, Room 32-144
• Recitations/Mentoring/OfficeHours:
– On Friday at 3pm in 32-144
– Recitations at MIT
• Course Website
– https://canvas.mit.edu/courses/28242
– or simply: http://compbio.mit.edu/MLCB (redirects to Canvas)
– All handouts, lectures, notes, etc. will be posted here.
• Course calendar:
– On Google, add public calendar: “MLCB24 Lectures”
Goals for the term

• Introduction to computational biology


– Fundamental problems in computational biology
– Algorithmic/machine learning techniques for data analysis
– Research directions for active participation in the field
– Understanding how methods work
• Ability to tackle research
– Problem set questions: algorithmic rigorous thinking
– Programming assignments:
• hands-on experience w/ real datasets
– Final project experience:
• propose and carry out independent original research
• present findings in conference format (written, oral)
Computation & Biology | Foundations & Frontiers

• Duality #1 (x-axis): Computation and Biology


– Important, relevant, current biology:
• Important biological problems
– Fundamental computer science:
• General techniques, principles
• Duality #2 (y-axis): Foundations and Frontiers
– Foundations:
– well-defined problems, general methodologies
– ‘The classics’ of the field
– Frontiers:
– in-depth look at complex, current problems, open questions
– combine techniques learned
– opens to projects, research directions
Course at a Glance
Fall 2020, 2019, 2018: YouTube, and ease of use anywhere
YouTube Playlist: (Fall 2021)
Fall 2020: https://www.youtube.com/playlist?list=PLypiXJdtIca6dEYlNoZJwBaz__CdsaoKJ
Fall 2019: https://www.youtube.com/playlist?list=PLypiXJdtIca6U5uQOCHjP9Op3gpa177fK
Fall 2018: https://www.youtube.com/playlist?list=PLypiXJdtIca6GBQwDTo4bIEDV8F4RcAgt
Features: bookmarks, closed captions, chapters, playlists
Fall 2021, 2022: Panopto, and awesome search capabilities
Panopto (Fall 2021)
https://mit.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx?folderID=7c716154-6516-4a49-9a81-adad0135dcb8
Panopto (Fall 2022)
https://mit.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx?folderID=176f8b23-0433-403d-8c26-af090151a28d
Features: speaker video, shared screen, search function, automatic transcript, 2X speed, automatic chapters (from slide headers), slide navigation
Details on the in-class quiz

• It’s not a midterm, and it’s not a final exam


– It’s a quiz, friendly, fun, interesting, cute, fuzzy
• Demonstrate mastery of the material in 4 modules
– Understand key points emphasized in lecture
– Understand subtleties revealed in the psets
– Ability to apply new skills to solve practical problems
• Types of questions
– Knowledge questions: T/F justify, multiple choice
– Deeper understanding questions: short answers
– Practical problems: work through simple algorithm
– Design problem(s): new/modified algorithm, need
both knowledge and new idea, argue correctness
Final Project: Original Research in Comp Bio
• A major aspect of the course is preparing you for
original research in computational biology.
– Framing a biological problem computationally
– Gathering relevant literature and datasets
– Solving it using new algorithms, machine learning
– Interpreting the results biologically
• Also ability to present your ideas and research
– Crafting a research proposal (fellowships/grants)
– Working in teams of complementary skill sets
– Review peer proposals, find flaws, suggest improvements
– Receiving feedback and revising your proposal
– Writing up your results in a scientific paper format
– Presenting a research talk to a scientific audience
• Term project experience mirrors this process
Project Milestones

• Round 0: Self-introduction
(due Week 2 Friday)
• Round 1: Literature search and paper
description (due Week 4 Friday)
• Round 2: Team formation, project proposal,
feasibility (due Week 6 Friday)
• Round 3: Office Hours, Update, Feedback
(Meet Week 8 + Week 10 Fridays)
• Round 4: Midcourse report
(due Week 12, Friday)
• Round 5: Final report+slides
(due Week 14, Friday)
Course at a Glance
Details on the final project
• Milestones ensure sufficient planning / feedback
– Set-up: find project matching your skills and interests
– Team: common interests and complementary skills
– Inspiration: last year’s projects, and recent papers
– Proposal: establish milestones, deliverables, expectations
– Midcourse: see endpoint, outline report, methods, figures
• Periodic mentoring sessions
– Senior students and postdocs can serve as your mentors
– Group discussions to share ideas, guidance, feedback
– Peer-review: think critically about peer proposals, receive
feedback/suggestions, respond to critiques, adjust course
• Real-world experience, condensed in a single term
– Grant/fellowships proposals, peer review, yearly reports,
budget time/effort, collaboration, paper writing, give talk
Comm Lab: Help communicating your research!
A free resource for peer feedback from trained EECS
grad students and postdocs.

You can be anywhere in the process:
• Brainstorming
• Outlining
• Revising
• Final polishing

Why people come to CommLab (number of appointments; total: 400 appointments):
• Resume / CV: 63
• Graduate school appl.: 43
• Other (incl. startup plans, RQE): 38
• Faculty package: 35
• Other report or essay: 34
• Oral presentation: 33
• Fellowship / scholarship appl.: 33
• Thesis: 31
• Manuscript: 29
• Poster / visual: 27
• Thesis proposal: 14
• Abstract: 8
• Grant: 7
• Lab report: 5

What people say:
• “Very, very valuable. Thank you!” (Elena Glassman, EECS PhD alumna)
• “I strongly encourage students to schedule a session; it’s a very impressive resource.” (Dirk Englund, professor)
• “The experience and coaching helped me apply successfully for an important fellowship this year.” (Joel Jean, EECS grad)
Finding a research mentor / research advisor
• Chance to meet faculty at MIT/Broad/Harvard:
– Through guest lectures and mentoring
– Topics and papers covered in the lectures
– Experts on: (1) human comparative genomics, (2)
lincRNAs, (3) metabolic modeling, (4) disease mapping,
selection, evolution and ecology (following four modules)
• Chance to meet senior students and postdocs:
– On: coding genes, ncRNAs, regulatory motifs, networks,
epigenomics, phylogenomics (again on each module)
– Mentorship sessions with entire MIT CompBio group
• Your own personal research experience:
– collaborators, datasets
– learn active research directions, frontiers
– living, breathing changing field
Putting it all together
Course Activities: Mens et Manus
• Learning (25 lectures * 1.5 hours)
• Mentoring (4-7 meetings * 1 hour)
[Pie chart: Project 40%, Psets 30%, Quiz 25%, Participation 5%]

• 3 problem sets: [30% of your grade]
– Out on Tuesdays, due Mondays in 2-3 weeks.
– Each problem set covers 1 module, contains ~4 problems.
– Algorithmic problems and programming assignments
• Final project [40% of your grade]
– Introduction to research in computational biology (full term!)
– Includes peer-reviewed NIH-style proposal and much feedback
• Quiz [25% of your grade]
– In-class quiz. No final exam.
• Office hours/recitations/lectures participation: 5% grade
• Collaboration policy [humans and AI]
– Collaboration allowed, but you must:
• Work independently on each problem before discussing it
• Write solutions on your own
• Acknowledge sources and collaborators. No outsourcing.
– ChatGPT / LLM policy
• Acknowledge the way you would for a collaboration partner
• Be transparent, save your chats, possibly submit w/ homework
Goals for today: Course Introduction
1. Course overview:
– Staff, students, responses to student survey
– Foundations, frontiers, textbook, homework, quiz
– Final project: teams, mentorship, challenge, relevance,
originality, achievement, presentation
2. Why Computational Biology + Generative AI?
– Why Computation is well-suited to Biology+Genomes
– Why this time it’s different: Gen AI+DeepRepr.Learning
3. Overview of course modules
– Genomes, Expression, Epigenomics,
Networks, Genetics, Evolution, Frontiers
4. Biology primer (in the context of this course)
– Central Dogma of Molecular Biology
– DNA, Epigenomics, RNA, Protein, Networks
– Human genetics, Drug Discovery
Why Computational Biology?
Why Computational Biology: Last year’s answers
• Lots of data (* lots of data)
• There are rules
• Pattern finding
• It’s all about data
• Ability to visualize
• Simulations, temporal relationships
• Guess + verify (generate hypotheses for testing)
• Propose mechanisms / theory to explain observations
• Networks / combinations of variables
• Efficiency (reduce experimental space to cover)
• Informatics infrastructure (ability to combine datasets)
• Correlations, higher-order relationships
• Cycle from hypothesis generation to testing condensed
• Life itself is digital. Understand cellular instruction set
Why Computational Biology: Live in Zoom Chat F20
• Data-rich in a historically data-poor domain (Matthew West)
• potential to do whatever you want without waiting for experiments (Stuti Khandwala)
• DNA is a massive dataset (Pablo X Villalobos)
• More efficient and in depth way to explore biology (Lilly K Edwards)
• There're tons of biological datasets waiting to be analyzed (Hieu Q Dinh)
• Because you can use other people’s datasets and then get good research done on a budget
(Ari)
• Might be the biggest frontier of computing today (Erez Kaminski)
• More and more sequencing data are coming out (Evelyn Tong)
• New technologies - lots of data - (Manu Ponnapati)
• Biology benefits from approximation (Thomas Xiong)
• The need to integrate multi-omics data to gain more insights (Kathleen Sucipto)
• Its interesting and new (Daniel R Gutierrez)
• Can use expertise from other engineering fields to impact health (Swathi Manda)
• Complex patterns in biological data (Farhan Khodaee)
• impact real human lives, important applications (Lucy Zhang)
• answers questions not easily solvable by traditional experimental biology (Andrew D Hennes)
• Expands our horizons in asking biological questions (Dylan McCormick)
• Computational biology and simulations can help deconvolve results from experiments (Raina
Thomas)
ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA
TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC
TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC
TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT
CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG
AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA
GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT
TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG
TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT
TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA
TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG
CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA
TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT
CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT
GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA
TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC
CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG
TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG
GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA
AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT
TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG
CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC
ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA
GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA
ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT
AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA
GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG
ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG
CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC
GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA
AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA
AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA
TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT
CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT
AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA
GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA
GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA
CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC
CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT
[Slide animation: the DNA sequence above is shown again with overlaid annotations: Genes encode proteins; Regulatory motifs control gene expression. Final overlay: Extracting signal from noise]
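One way to make "extracting signal from noise" concrete is a naive scan for open reading frames (ORFs) in raw sequence. The sketch below is only an illustration: the function name `find_orfs`, the `min_codons` cutoff, and the toy sequence are assumptions for this example, not material from the lecture, and real gene finders use much richer models than start/stop codon matching.

```python
# Illustrative sketch only: a naive forward-strand ORF scan.
# find_orfs, min_codons and the toy sequence are hypothetical names/values.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs for ATG...stop ORFs on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    In each reading frame, only the first ATG after the previous stop is used.
    min_codons counts the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i + 3 - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


toy = "CCATGAAATTTTAGGG"
orfs = find_orfs(toy)
for start, end in orfs:
    print(start, end, toy[start:end])
```

Even on real sequence, a scan like this returns many spurious hits, which is exactly why the slide frames gene and motif discovery as a signal-versus-noise problem.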
The components of genomes and gene regulation

Goal: A systems-level understanding of genomes and gene regulation:


• The genome: Map reads, align genes/genomes, assembly strategies
• The genes: Protein-coding exons, introns, non-coding RNA, RNA folding
• The control regions: Promoters, enhancers, insulators, chromatin states
• The actual words: Regulatory motifs, high-resolution accessibility maps
• The regulators: Transcription factors, chromatin modifiers, nucleosomes
• The dynamics: Changing maps between cell types, across development
• The networks: regulator→enhancer→target, ChIP-seq, correlated activity
• The grammars: TF/motif/mark combinations, predictive models
• Human variation: Human diversity, population genomics, linkage maps
• Evolution: Phylogenetics, phylogenomics, coalescent, human ancestry
• GWAS/QTLs: Genome variation → organismal/molecular phenotypes
• Disease: Personal (epi)genomics, pharmacogenomics, synthetic biology
Goals for today: Course Introduction
1. Course overview:
– Staff, students, responses to student survey
– Foundations, frontiers, textbook, homework, quiz
– Final project: teams, mentorship, challenge, relevance,
originality, achievement, presentation
2. Why Computational Biology + Generative AI?
– Why Computation is well-suited to Biology+Genomes
– Why this time it’s different: Gen AI+DeepRepr.Learning
3. Overview of course modules
– Genomes, Expression, Epigenomics,
Networks, Genetics, Evolution, Frontiers
4. Biology primer (in the context of this course)
– Central Dogma of Molecular Biology
– DNA, Epigenomics, RNA, Protein, Networks
– Human genetics, Drug Discovery
Deep Data and the Next Wave of Medicines
[Figure: Enkoimesis relief, Epidauros, 4th century BC, Archaeological Museum of Piraeus. Labels: God / temple physician; Committee (peer review); Priestess (nurse); Patient]

Timeline:
• Hippocrates, Alkmaion, Asclepius, Humorism, Aristotle
• Renaissance-1900: Anatomy, Microscopy, Vesalius, Leonardo, Cajal
• Alois Alzheimer, 1911: AD Plaques+Tangles
• Human genome, genetic studies, GWAS 2007
• Single-cell: 430 donors, 2M cells
Three major paradigm shifts: Data, Genomes, AI
• Hypothesis-driven research → Data-driven research
– Hypothesis-driven: Formulate hypothesis → gather data. Lots of thinking before → target study. Problem: Highly biased, little novelty
– Data-driven: Gather data → Ask questions later. Systematic datasets, build resources, massive data sharing, comprehensive
• Correlation-based analysis → Causality-based analysis
– Correlation-based: More Coffee → Better Health. More Chocolate → More Nobel Prizes. ‘Epidemiology’ all about correlations
– Causality-based: Genetic variants → Disease outcome. Polygenic risk score → Causal factors. Perturbation experiments → Confirm
• Classical Data Analysis → Generative AI+Deep Learning
– Classical: New methodology for each problem. Human scientist does all the ‘thinking’. Few parameters, targeted models
– Generative AI+Deep Learning: Foundation models, Multi-Modality. Representation learning, hierarchical. Truly ‘understand’ concepts → insights
Dissect mechanisms of disease-associated regions

1. Disease genetics reveals common + rare variants/regions
2. Profile RNA + Epigenome in healthy + disease samples
3. Integrate data to predict driver genes, regions, cell types
4. Validate predictions in human cells + mouse models (cell cultures, mouse models)
5. Disseminate results

Citations shown on slide: Roadmap, Nature 15; Boix, EpiMap, Nature 21; Park, NBT 15; Claussnitzer, NEJM’15; Blanchard, Nature, 2022
Non-coding circuitry helps interpret disease loci
• Expand each GWAS locus (region of association) using SNP linkage disequilibrium (LD)
– Recognize relevant cell types: tissue-specific enhancer enrichment
– Recognize driver TFs: enriched motifs in multiple GWAS loci
– Recognize target genes: linked to causal enhancers
Quon bioRxiv 467852
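The first two steps on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used in the cited work: genotypes are coded as 0/1/2 allele counts, LD is approximated by the squared Pearson correlation of unphased genotypes, the 0.8 cutoff is a common convention chosen here for illustration, and enrichment is scored with a hypergeometric tail, which is one standard choice among several. All function names and toy data are hypothetical.

```python
import math

# Illustrative sketch only; function names, threshold and toy data are assumptions.


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treat as unlinked
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def expand_locus(lead, genotypes, threshold=0.8):
    """Return ids of SNPs whose r^2 with the lead SNP is >= threshold (lead included)."""
    return [snp for snp, g in genotypes.items()
            if r_squared(genotypes[lead], g) >= threshold]


def enrichment_pvalue(k, population, successes, draws):
    """Hypergeometric tail P(X >= k): chance of seeing at least k enhancer
    overlaps when `draws` regions are picked at random from `population`
    regions, of which `successes` are enhancers in the tissue of interest."""
    total = math.comb(population, draws)
    upper = min(successes, draws)
    hits = sum(math.comb(successes, i) * math.comb(population - successes, draws - i)
               for i in range(k, upper + 1))
    return hits / total


genotypes = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpB": [0, 1, 2, 0, 1, 2],  # identical to snpA
    "snpC": [2, 1, 0, 2, 1, 0],  # mirrored coding: r^2 is still 1
    "snpD": [1, 1, 1, 0, 2, 1],  # weakly correlated with snpA
}
linked = expand_locus("snpA", genotypes)
print(linked)
print(enrichment_pvalue(5, population=10, successes=5, draws=5))
```

In practice LD is computed from phased reference panels and enrichment is assessed against matched background regions; the sketch only shows the shape of the computation.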
FTO & Obesity: Uncover & manipulate circuitry → reverse disease phenotypes
[Figure: BMI association (-log10P) vs. SNP genomic position (23 chrs). Speliotes NG 2010]

• Lean: Incr. ARID5B → Lean; C-to-T → Lean; Decrease IRX3, IRX5 → Lean
• Obese: Decr. ARID5B → Obese; T-to-C → Obese; Increase IRX3, IRX5 → Obese
• CRISPR-edit human fat cells → able to burn calories again
• IRX3 KD → Burn calories in their sleep → 54% weight loss. Can’t gain weight
Claussnitzer, NEJM 2015
ApoE4 & Alzheimer’s: Cholesterol transport → Oligo ER accumulation → Myelin → Cognition

• scRNA of ApoE33, ApoE34, ApoE44 individuals
• Cholesterol transport & biosynthesis in oligos
• Cholesterol accumulates in ER, myelination decreases
• Restoring cholesterol transport (cyclodextrin) restores myelination & restores cognition
• Causality: Lack of myelination recapitulated in ApoE4 iPSC-derived oligodendrocytes
Blanchard, Nature, 2022
With: Joel Blanchard, Leyla Akay, Jose Davila-Velderrain, Djuna von Maydel, Li-Huei Tsai
Reverse cancer w/ immunotherapy: scRNA + epig + TFs → personalized combination treatment

# first sample 2011-12-19, last sample 2020-09-02


# mean age 62
# 28 F and 56 M
# 61 PRE, 57 ON, 23 POST, 7 PRO, 3 NA
# 93 ICI (51 PD1, 30 PD1+CTLA4, 9 CTLA4, 3 PDL1)(43 no prior
treatment), 20 targeted, 13 targeted+ICI, 7 other, 7 other+ICI, 2 NA
# 69 responders (R), 74 progressive disease (PD), 8 NA
# 143 mets, 5 normal, 3 melanoma primary

bioRxiv 506051 ’22

Jackie Yang,
David Liu (Dana Farber),
Kunal Rai (MD Anderson),
Genevieve Boland (MGH)
What is GenAI and how can it help cure disease?

Manolis Kellis
GenAI Key idea: Representation learning
• ‘Modern’ deep learning: hierarchical representation learning (feature extraction)
• ‘Classical’ fully-connected neural networks (classification)

In deep learning, the two tasks are coupled:
• The classification task “drives” the feature extraction
• Extremely powerful and general paradigm
– Be creative! The field is still in its infancy!
– New application domains (e.g. beyond images) can have structure that current architectures do not capture/exploit
– Genomics/biology/neuroscience can help drive development of new architectures
Deep learning → many layers of abstraction
Convolutional Neural Networks
• Learn complex scenes/objects from simpler parts
• Bottom-up building of world representations
• Convolution operation: scanning for features in a field
[Figure (Goodfellow 2016): successive layers detect edges and dark spots → eyes, ears, nose → facial structure]
Deep Convolutional Neural Networks for Genomics

Predict probabilities using logistic neuron

Max pool thresholded scores over windows

Threshold scores using ReLU

Scan sequence using filters

Convolutional filters
learn motifs (PSSM)
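The pipeline on this slide, read bottom-up (scan, threshold, pool, predict), can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a trained model: the `tata` filter, the `threshold`, the pooling width, and the scalar `weight` and `bias` of the logistic neuron are all hypothetical values chosen by hand, whereas a real network would learn them.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x

def conv_scan(x, filt):
    """Slide a (width, 4) filter along the sequence: one score per offset."""
    width = filt.shape[0]
    return np.array([np.sum(x[i:i + width] * filt)
                     for i in range(x.shape[0] - width + 1)])

def predict(seq, filt, threshold, pool, weight, bias):
    scores = conv_scan(one_hot(seq), filt)        # scan sequence using filter
    relu = np.maximum(scores - threshold, 0.0)    # threshold scores using ReLU
    pooled = np.array([relu[i:i + pool].max()     # max pool over windows
                       for i in range(0, len(relu), pool)])
    z = weight * pooled.max() + bias              # hand-picked scalar weights
    return 1.0 / (1.0 + np.exp(-z))               # logistic neuron

# Hypothetical PSSM-like filter: +1 for each base matching "TATA", -1 otherwise.
tata = 2.0 * one_hot("TATA") - 1.0
p_hit = predict("GGCTATAAGG", tata, threshold=3.0, pool=3, weight=4.0, bias=-2.0)
p_miss = predict("GGCGCGCCGG", tata, threshold=3.0, pool=3, weight=4.0, bias=-2.0)
```

The sequence containing the motif gets a probability above 0.5 and the one without it falls below, which is the sense in which a convolutional filter behaves like a motif scanner.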
Deep Learning Architectures: Graph Neural Networks (GNNs)
Graph Convolutional Networks
• Idea: Node’s neighborhood defines a computation graph
– Determine node computation graph
– Propagate and transform information
• Basic approach: Average information from neighbors and apply a neural network
– (1) average messages from neighbors
– (2) apply neural network
• Learn how to propagate information across the graph to compute node features [Kipf and Welling, ICLR 2017]
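The two steps of the basic approach (average messages from neighbors, then apply a neural network) can be sketched as a single layer. This is a simplified illustration: it uses a plain mean over each node and its neighbors, whereas Kipf and Welling use a symmetric degree normalization. The toy graph, the features, and the identity weight matrix are hypothetical.

```python
import numpy as np

def gcn_layer(adj, features, weights):
    """One mean-aggregation graph layer: average neighbors, then transform."""
    a_hat = adj + np.eye(adj.shape[0])           # let each node see itself
    deg = a_hat.sum(axis=1, keepdims=True)       # neighborhood sizes
    messages = (a_hat @ features) / deg          # (1) average messages
    return np.maximum(messages @ weights, 0.0)   # (2) linear map + ReLU

# Hypothetical toy graph: a path 0 - 1 - 2 with two features per node.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
weights = np.eye(2)   # identity weights so the averaging is easy to inspect
hidden = gcn_layer(adj, features, weights)
```

Stacking several such layers lets information travel several hops, which is how a node's computation graph grows with depth.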
NLP, words, sentences: Distributional Semantics
• Terms that appear in the same contexts as other words are (probably) semantically related
• Every term is mapped to a high-dimensional vector (the embedding space)
• Ever more sophisticated versions of embeddings, equivalent to matrix factorization
• Word2Vec
• GloVe
• ELMo
• BERT
• GPT

Embedding space calculations: plausibility of semantic claims
[Figure: t-Distributed Stochastic Neighbor Embedding (t-SNE) of the high-dimensional space]


Mapping words to a conceptual embedding space: Word2Vec
• Words with similar contexts should map to similar coordinates in the embedding space
• To achieve this, use prediction context: encoding [embedding, representation learning], decoding [actual prediction]
• Train weights through densely-connected network [dense] and through embeddings [emb] with backpropagation
• Initial embeddings are scattered, but after training, characters group together [and words similarly]
• Use multiple consecutive characters to increase context information → Prediction improves
• From characters to words: need larger context, more layers, higher-dimensional representation
From Words  Sentences  Docs: Attention, Transformer, Re-shaping
• Attention is all you need. NeurIPS 2017
• 125k citations, as of March 1, 2023 (Watson & Crick’s 1953 Nature: 17k)

Key Idea: How important is this word, with respect to ALL other words?
Encoder: reads the entire sequence all at once.
Decoder: reads left to right (but parallelized)

Positional encodings
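The key idea (how important is each word with respect to all other words) corresponds to scaled dot-product attention. The sketch below is a single-head illustration under simplifying assumptions: queries, keys, and values are all the raw token embeddings, with no learned projections and no positional encodings, and the random embeddings are hypothetical. The `causal` flag mimics the left-to-right decoder, which still processes all positions in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)             # token-vs-token relevance
    if causal:
        # Decoder-style mask: a token may not attend to later tokens.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores)                 # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, 8-dim embeddings
out_enc, w_enc = attention(x, x, x)               # encoder: sees everything
out_dec, w_dec = attention(x, x, x, causal=True)  # decoder: left to right
```

In the encoder every row of `w_enc` spreads weight over all four tokens, while in the decoder the first token can only attend to itself.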
Multi-modal generative AI: Image  Text Translation
Paint a classroom of students
listening to a lecture on multi
modality with astronauts and
knights and princesses where the
lecturer is a giant bear

The image depicts a bright classroom scene. There are multiple rows of wooden
desks, each accommodating two students, and the room is filled with children who
appear to be in elementary school. The students are wearing casual clothing, with a
variety of patterns including stripes and plaids. The majority of the children have
their hands raised, signaling eagerness to participate or answer a question. In the
background, there is a teacher standing next to a whiteboard, which is partially
obscured in the image. The whiteboard appears to be blank. The room has large
windows that allow plenty of natural light to fill the space, and there are white walls
and a green chalkboard behind the teacher. The desks have open fronts where books
and notebooks can be stored, and there are papers and books on the desks. The
children's attention is focused on the teacher, indicating an interactive and engaging
class environment
Cross-modal “Visual-Semantic Embeddings”
WSABIE (Weston et al 2010), DeViSE (Frome et al 2013),
Cross-Modal Transfer (Socher et al 2013)

Frome et al. 2013

Socher et al. 2013

WSABIE: Scaling Up To Large Vocabulary Image Annotation
• Improve image annotation/tagging, scale to large annotation vocabulary
• Joint embedding space for images + annotation words
• Ranking loss function: learn representations tuned to ranking annotations of a given image

Cross-Modal Transfer (Socher 2013)
• Zero-Shot Learning Through Cross-Modal Transfer
• Object/concept recognition in one modality (e.g. image) even when description only in other modality (e.g., text)
• NNs to understand relationship between modalities
• Zero-Shot Learning: categories not seen during training
• Semantic mapping of visual → textual features in common space where they can be compared and associated
Latent AI Embedding Space
Cartography and Navigation
Accelerate Discovery Process Itself: AI for the Future of Work
[Figure: embedding-space map with cluster labels including Cancer Research & Biomarkers; Cancer Genetics & Epigenetics; Gene Expression & Genome Stability; Neurobiological Factors in Neurological Disorders; Genomics & Gene Regulation; Genetics & Healthspan; Gene Expression Regulation; Genomics]
Example 1: Embedding 155,011 papers citing our work


• Idea Navigator: Explicit Interactive Embedding Space Exploration
• Multi-Resolution: Team, Institution, Sub-Field, Humanity, Custom
• Multi-Scale: Hours, Day, Week, Month, Year, Humanity, Custom
• Multi-Modal: Video, Audio, Notes, GitHub, Gdocs, Dropbox, Email, Slack
• Density, relatedness, temporality, correlation, bridges, dynamics
• Navigate, explore, transparency, guidance, uncharted territories
• Now: Team collaboration, match ideas, projects, people, data
• Applications: self-reflection, coordination, planning, evaluation, growth
• Training: students, new team members, re-allocation, re-training
• Ultimate Goal: enrich ways we think, create, collaborate, plan, reflect
• Interactivity: summarize, extrapolate, innovate, follow up, connect
Spruce Campbell, Will Hathaway, Dakota Goldberg, Evan Liu, Brian Zheng

Example 2: Embedding 10,428 meeting auto-transcripts
Pathological map of 100s of Patients

• Every person is a point, based on their cellular expression patterns
• Reveal impact of gene expression, phenotype, genotype
• Use ‘cartography’ from gene expression to map phenotype
• Common foundational map → reason about health impact
Multi-modal Embeddings of 2.4 million Human Cells
Every dot is a 20,000-dimensional vector
Integrate 2.4 million ‘documents’
Impact:
• Understand gene relationships
• Understand impact of phenotype
• Understand impact of age, sex
• Understand pathway correlations
• Understand gene co-variation
• Map phenotype to cell space
Functional knowledge graph of 20,000 human genes

Knowledge graph integrates:
• Diseases, phenotypes, drugs, anatomy, exposures
• Biological process, molecular function, localization
• Biological Pathways and Biological Functions

Reveal dimensions of variation:
• Function of every protein, in the context of all knowledge
• Map protein structure, chemical function, gene expression
• Foundational ‘Google Maps’ layout for layering on knowledge
Joint Map of Protein Structure, Function, Text

Biomedical Knowledge, Ontologies, Protein Structure, Drugs, Pathways, Disease


Geometric deep learning drug design: heart, cancer, Alz

• Model families: large language models, biological language models, single-cell AI models, geometric deep learning
• Disease areas shown: cardiovascular, metastatic melanoma, Alzheimer’s
• Reasoning, interactive, user-guided, AI-powered drug design
• Structure-to-Function for proteins+chemistry
• Self-supervised struct. foundation models
Brad Pentelute, Marinka Zitnik, Owen Queen, Yepeng Huang, Tianlong Chen, Tom Hartvigsen, Tom Cobley
Literature Description Map of 20k Human Proteins
Navigate 3 million NY Times articles

• Papers + Grants + Patents + Startups + Offices → Flow of knowledge, resources
• Dynamics over time: Knowledge evolution → Disciplines emerging, maturing, changing
• 100,000 loans, clustered by description → Context-specific predictive algorithms
• Education: MIT, EdX, AP, High-school, YouTube → Multi-modal learning, interdisciplinary links → Match CVs, job descriptions, skill sets
• Google News, Podcasts, Websites, Wikipedia → Ontology creation and labeling → Auto-link creation, paragraph level
• Collaboration, Productivity, Team Progress → Link projects across team members → Within-meeting live track of productivity
The power of Maps for Physical Space Navigation
Maps give us
Landscape
Landmarks
Anchor points
Street names
Highway names
Simplification
Abstraction
Summarization
Decision making
The road ahead: Systematic understanding in biology + work

• AI as a discovery partner: multi-modal foundation models
– Gain insights previously inaccessible to human scientists
– Build rich intuition on biological + therapeutic space
• AI as “Google Maps” for navigating knowledge space
– Visual search + integration through millions of documents
– Hierarchical interactive knowledge representation + manip.
Goals for today: Course Introduction
1. Course overview:
– Staff, students, responses to student survey
– Foundations, frontiers, textbook, homework, quiz
– Final project: teams, mentorship, challenge,
relevance, originality, achievement, presentation
2. Why Computational Biology + Generative AI?
– Why Computation is well-suited to Biology+Genomes
– Why Gen AI + Representation Learning is different
3. Overview of course modules
– Genomes, Expression, Epigenomics,
Networks, Genetics, Evolution, Frontiers
4. Biology primer (in the context of this course)
– Central Dogma of Molecular Biology
– DNA, Epigenomics, RNA, Protein, Networks
– Human genetics, Drug Discovery
Course at a Glance
Challenges in Computational Biology
1 Gene Finding
2 Sequence alignment
3 Database lookup
4 Genome Assembly
5 Regulatory motif discovery
6 Comparative Genomics
7 Evolutionary Theory
8 Gene expression analysis
9 Cluster discovery
10 Gibbs sampling
11 Protein network analysis
12 Metabolic modelling
13 Emerging network properties
[Figure labels: DNA; aligned sequences TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT; RNA transcript]
Aligning and Modeling Genomes

• Foundations vs. frontiers


– Foundations: Classical computational methods / biological topics
– Frontiers: Latest developments, open questions, research areas
– Duality for each: basic problems / fundamental techniques
• Sequence alignment:
– Local/global alignment: infer nucleotide-level evolutionary events
– Database search: scan for regions that may have common ancestry
• Hidden Markov Models
– Hidden Markov Models (HMMs): Central tool in CS
– Decoding, evaluation, parsing, likelihood, scoring
Dynamic Programming Algorithms: Align, HMMs
[Figure: two DP matrices. Sequence alignment: x1…xM against y1…yN. Hidden Markov Models: states 1, 2, … against positions x1…xN, with cell Vk(i)]
• Sequence alignment
• Hidden Markov Models
• DP: Core computational technique
– Pervasive in computer science, and computational biology
– Fully explore exponential search spaces in poly time!
– Greedy algorithms will not work, back-tracking, saving soln
– Special requirements: Optimal substructure
– Found in: alignment, HMMs, phylogeny, genetics, pop gen…
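As a concrete instance of the DP recipe (optimal substructure, table filling, then back-tracking to recover the saved solution), here is a minimal sketch of global sequence alignment in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, linear gap -1) are illustrative assumptions, not the course's parameters.

```python
def global_align(x, y, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    m, n = len(x), len(y)
    # F[i][j] = best score aligning x[:i] with y[:j] (optimal substructure).
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    # Back-track from the corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))

score, top, bottom = global_align("ACGT", "AGT")
```

The table has (M+1)(N+1) cells and each is filled in constant time, so the exponential space of alignments is explored in O(MN) time.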
Gene expression analysis and transcripts

• Computational foundations:
– Unsupervised Learning: Expectation Maximization
– Supervised learning: generative/discriminative models
– Read mapping, significance testing, splice graphs
• Biological frontiers:
– PS2: Modeling conservation, GC content, CpG islands
– L6/L7: Genome annotation and parsing
– L8: Gene expression analysis: cluster genes/conditions
– L9: Regulatory motif discovery: EM, gibbs sampling, info
Natural 1st step: group similar rows/columns
Clustering
• Similar cell types: reveal common ‘conditions’ (Armstrong, Nature Gen 2002)
• Similarly-behaving groups of genes: reveal common gene behaviors (Alizadeh, Nature 2000)
[Figure: expression matrices with genes as rows and conditions as columns]
If labels are known: find more of same type
Classification
• Classify diseases: find features that distinguish known classes (Armstrong, Nature Gen 2002)
• Classify genes in different pathways: find additional members of existing gene classes, predict function of uncharacterized genes (Alizadeh, Nature 2000)
Epigenomics and gene regulation

• Computational Foundations
– Hidden Markov Models (HMMs): Central tool in CS
– Decoding, evaluation, parsing, likelihood, scoring
– Unsupervised Learning: Expectation Maximization
– Supervised learning: generative/discriminative models
• Biological frontiers:
– PS2: Modeling conservation, GC content, CpG islands
– L6/L7: Genome annotation and parsing
– L8: Gene expression analysis: cluster genes/conditions
– L9: Regulatory motif discovery: EM, gibbs sampling, info
Motifs summarize TF sequence specificity

• Summarize information
• Integrate many positions
• Measure of information
• Distinguish motif vs. motif instance

• Assumptions:
– Independence
– Fixed spacing
Starting positions ↔ Motif matrix
• Given aligned sequences → easy to compute profile matrix
[Figure labels: shared motif; sequence positions]

     1    2    3    4    5    6    7    8
A   0.1  0.3  0.1  0.2  0.2  0.4  0.3  0.1
C   0.5  0.2  0.1  0.1  0.6  0.1  0.2  0.7
G   0.2  0.2  0.6  0.5  0.1  0.2  0.2  0.1
T   0.2  0.3  0.2  0.2  0.1  0.3  0.3  0.1

• Given profile matrix → easy to find starting position probabilities

Key idea: Iterative procedure for estimating both, given uncertainty
(learning problem with hidden variables: the starting positions)
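The two "easy" directions can be sketched as one round of an EM-style update, using the profile matrix from this slide. This is a simplified illustration under stated assumptions: exactly one motif occurrence per sequence, a uniform 0.25 background, and a small pseudocount. The example sequence is hypothetical.

```python
PROFILE = {
    "A": [0.1, 0.3, 0.1, 0.2, 0.2, 0.4, 0.3, 0.1],
    "C": [0.5, 0.2, 0.1, 0.1, 0.6, 0.1, 0.2, 0.7],
    "G": [0.2, 0.2, 0.6, 0.5, 0.1, 0.2, 0.2, 0.1],
    "T": [0.2, 0.3, 0.2, 0.2, 0.1, 0.3, 0.3, 0.1],
}

def start_position_probs(seq, profile, background=0.25):
    """E-step: posterior over motif start positions in one sequence."""
    width = len(profile["A"])
    ratios = []
    for start in range(len(seq) - width + 1):
        r = 1.0
        for offset in range(width):
            r *= profile[seq[start + offset]][offset] / background
        ratios.append(r)
    total = sum(ratios)
    return [r / total for r in ratios]

def update_profile(seqs, all_probs, width, pseudocount=0.1):
    """M-step: re-estimate the profile from posterior-weighted counts."""
    counts = {b: [pseudocount] * width for b in "ACGT"}
    for seq, probs in zip(seqs, all_probs):
        for start, p in enumerate(probs):
            for offset in range(width):
                counts[seq[start + offset]][offset] += p
    totals = [sum(counts[b][o] for b in "ACGT") for o in range(width)]
    return {b: [counts[b][o] / totals[o] for o in range(width)]
            for b in "ACGT"}

seqs = ["TTCAGGCAACTT"]   # hypothetical sequence with a motif-like site
probs = [start_position_probs(s, PROFILE) for s in seqs]
new_profile = update_profile(seqs, probs, width=8)
```

Alternating the two functions until the profile stops changing gives the iterative procedure; Gibbs sampling replaces the weighted counts with a sampled start position per sequence.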
Multivariate HMM for Chromatin States
[Figure: DNA divided into 200bp intervals spanning an enhancer, a transcription start site, and a transcribed region. Observed chromatin marks (K4me1, K4me3, K27ac, K36me3) are called based on a Poisson distribution. Most likely hidden state per interval: 1 2 3 4 6 6 6 6 6 5 5 5]
[Figure: high-probability chromatin marks in each state, with emission probabilities of 0.7 to 0.9, e.g. state 1: K4me1, K27ac; state 3: K4me3; state 6: K36me3. All probabilities are learned from the data]
Ernst and Kellis, Nature Biotech 2010
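Recovering the most likely hidden state for each interval is a Viterbi decoding problem. The sketch below is a toy illustration, not the published model: it assumes binarized marks with independent per-mark presence probabilities in each state, and the two states, their emission probabilities, and the transition probabilities are hypothetical values chosen by hand rather than learned.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most likely state path; obs is a list of 0/1 tuples, one per interval.

    emit[s][m] is the assumed probability that mark m is present in state s.
    """
    def log_emit(s, o):
        return sum(math.log(p if bit else 1.0 - p)
                   for p, bit in zip(emit[s], o))

    V = [{s: math.log(start[s]) + log_emit(s, obs[0]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans[r][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + log_emit(s, o)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):      # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical two-state model over two marks: (K4me1, K36me3).
states = ["enhancer", "transcribed"]
start = {"enhancer": 0.5, "transcribed": 0.5}
trans = {"enhancer": {"enhancer": 0.8, "transcribed": 0.2},
         "transcribed": {"enhancer": 0.2, "transcribed": 0.8}}
emit = {"enhancer": (0.8, 0.1), "transcribed": (0.1, 0.9)}
obs = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 1)]
path = viterbi(obs, states, start, trans, emit)
```

Working in log space avoids underflow on genome-length inputs, where the path covers millions of intervals.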
Evolution/phylogeny/populations

• Phylogenetics / Phylogenomics
– Phylogenetics: Evolutionary models, Tree building, Phylo inference
– Phylogenomics: gene/species trees, reconciliation, coalescent, pops
• Population genomics:
– Learning population history from genetic data (David Reich)
– Statistical genetics: disease mapping in populations (Mark Daly)
– Measuring natural selection in human populations (Pardis Sabeti)
– The missing heritability in genome-wide associations (Yaniv Erlich)
• And we’re done! Last pset Nov 21st, In-class quiz on Nov 22nd
– No lab 4! Then entire focus shifts to projects, Thanksgiving, Frontiers
Characterizing sub-threshold variants in heart arrhythmia

Focus on sub-threshold variants (e.g. rs1743292, P = 10^-4.2)
Trait: QRS/QT interval
(1) Large cohorts, (2) many known hits
(3) well-characterized tissue drivers
Protein folding, 3D structure, Chemical Structure, Geometric Deep Learning, GNNs, PLMs
Course at a Glance
Biology primer

Quick introduction to molecular biology


and information transfer within the cell
“Central dogma” of Molecular Biology

DNA
makes

RNA
makes

Protein
DNA: The double helix
• The most noble molecule of our time
DNA: the molecule of heredity
• Self-complementarity sets molecular basis of heredity
– Knowing one strand provides a template for the other
– “It has not escaped our notice that the specific pairing we have postulated immediately
suggests a possible copying mechanism for the genetic material.” Watson & Crick, 1953
DNA: chemical details
• Bases hidden on the inside
• Phosphate backbone outside
• Weak hydrogen bonds hold the two strands together
• This allows low-energy opening and re-closing of two strands
• Anti-parallel strands
• Extension 5’→3’: tri-phosphate coming from newly added nucleotide
The only pairings are:
• A with T
• C with G
[Figure: two anti-parallel sugar-phosphate backbones with sugar carbons numbered 1’ to 5’ and base pairs T–A, C–G]
DNA: the four bases

• Purine (A, G) vs. pyrimidine (C, T)
• Weak (A, T: two hydrogen bonds) vs. strong (C, G: three hydrogen bonds)
• Amino (A, C) vs. keto (G, T)
“Central dogma” of Molecular Biology

DNA Epigenomics
makes

RNA
makes

Protein
Chromosomes inside the cell
• Eukaryote cell

• Prokaryote cell
DNA packaging
• Why packaging
– DNA is very long
– Cell is very small
• Compression
– Chromosome is 50,000 times shorter than extended DNA
• Using the DNA
– Before a piece of DNA is used for anything, this compact structure must open locally
• Now emerging:
– Role of accessibility
– State in chromatin itself
– Role of 3D interactions
Diverse epigenetic modifications

Image source: http://nihroadmap.nih.gov/epigenomics/
Diversity of epigenetic modifications
• 100+ different histone modifications
– Histone protein → H3/H4/H2A/H2B
– AA residue → Lysine4 (K4)/K36…
– Chemical modification → Met/Pho/Ubi
– Number → Me-Me-Me (me3)
– Shorthand: H3K4me3, H2BK5ac
• In addition:
– DNA modifications
– Methyl-C in CpG / Methyl-Adenosine
– Nucleosome positioning
– DNA accessibility
• The constant struggle of gene regulation
– TF/histone/nucleo/GFs/Chrom compete
[Figure labels: histone tails; DNA wrapped around histone proteins]
Epigenomics Roadmap across 100+ tissues/cell types

Diverse epigenomic assays:
1. Histone modifications
• H3K4me3, H3K4me1
• H3K36me3
• H3K27me3, H3K9me3
• H3K27/9ac, +20 more
2. Open chromatin:
• DNase
3. DNA methylation:
• WGBS, RRBS, MRE/MeDIP
4. Gene expression
• RNA-seq, Exon Arrays

Diverse tissues and cells:
1. Adult tissues and cells (brain, muscle, heart, digestive, skin, adipose, lung, blood…)
2. Fetal tissues (brain, skeletal muscle, heart, digestive, lung, cord blood…)
3. ES cells, iPS, differentiated cells (meso/endo/ectoderm, neural, mesench, trophobl)

Art: Rae Senarighi, Richard Sandstrom
Deep sampling of 9 reference epigenomes (e.g. IMR90)

UWash Epigenome Browser, Ting Wang


Chromatin state+RNA+DNAse+28 histone marks+WGBS+Hi-C
Diverse chromatin signatures encode epigenomic state

• Enhancers: H3K4me1, H3K27ac, DNase
• Promoters: H3K4me3, H3K9ac, DNase
• Transcribed: H3K36me3, H3K79me2, H4K20me1
• Repressed: H3K9me3, H3K27me3, DNAmethyl
[Figure tracks: H3K4me3, H3K4me1, H3K27ac, H3K36me3, H4K20me1, H3K79me3, H3K27me3, H3K9me3, H3K9ac, H3K18ac]

• 100s of known modifications, many new still emerging


• Systematic mapping using ChIP-, Bisulfite-, DNase-Seq
Chromatin state annotations across 127 epigenomes

Reveal epigenomic variability: enh/prom/tx/repr/het


Anshul Kundaje
“Central dogma” of Molecular Biology

DNA
makes

RNA
makes

Protein
Genes control the making of cell parts

• The gene is a fundamental unit of inheritance


– Each DNA molecule  10,000+ genes
– 1 gene  1 functional element (one “part” of cell
machinery)
– Every time a “part” is made, the corresponding gene is:
• Copied into mRNA, transported, used as blueprint to make protein
• RNA is a temporary copy
– The medium for transporting genetic information from the
DNA information repository to the protein-making machinery
is an RNA molecule
– The more parts are needed, the more copies are made
– Each mRNA only lasts a limited time before degradation
mRNA: The messenger

• Information changes medium


– single strand vs. double strand
– ribose vs. deoxyribose sugar
A T T A C G G T A C C G T
U A A U G C C A U G G C A
– Compatible base-pairing in
hybrid
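The DNA/RNA hybrid pairing shown in the two sequence rows above can be reproduced with a small lookup table. This is a minimal sketch; the function name `pair_with_rna` is made up for illustration, and the pairing is written position by position as on the slide, without reversing to account for the anti-parallel strand orientation.

```python
# RNA base that pairs with each DNA template base (U replaces T in RNA).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def pair_with_rna(dna_template):
    """Return the RNA bases that pair with a DNA template, position by position."""
    return "".join(DNA_TO_RNA[base] for base in dna_template)

rna = pair_with_rna("ATTACGGTACCGT")
```

Applied to the DNA row on the slide, this yields the RNA row beneath it.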
From DNA to RNA: Transcription
From pre-mRNA to mRNA: Splicing

• In Eukaryotes, not every part of a gene is coding


– Functional exons interrupted by non-translated introns
– During pre-mRNA maturation, introns are spliced out
– In humans, primary transcript can be 10^6 bp long

– Alternative splicing can yield different exon subsets for the same gene,
and hence different protein products
RNA can be functional

• Single Strand allows complex structure


– Self-complementary regions form helical stems
– Three-dimensional structure allows functionality of RNA
• Four types of RNA
– mRNA: messenger of genetic information
– tRNA: codon-to-amino acid specificity
– rRNA: core of the ribosome
– snRNA: splicing reactions
• To be continued…
– We’ll learn more in a dedicated lecture on RNA world
– Once upon a time, before DNA and protein, RNA did all
RNA structure: 2ndary and 3rdary
Splicing machinery made of RNA
“Central dogma” of Molecular Biology

DNA
makes

RNA
makes

Protein
Proteins carry out the cell’s chemistry

• More complex polymer


– Nucleic Acids have 4 building blocks
– Proteins have 20. Greater versatility
– Each amino acid has specific properties
• Sequence  Structure  Function
– The amino acid sequence determines the
three-dimensional fold of protein
– The protein’s function largely depends on
the features of the 3D structure
• Proteins play diverse roles
– Catalysis, binding, cell structure, signaling,
transport, metabolism
Protein structure

• Helix-turn-helix: Common motif for DNA-binding proteins that often play a regulatory role as mRNA level transcription factors
• Beta-barrel: Some antiparallel β-sheet domains are better described as β-barrels rather than β-sandwiches, for example streptavidin and porin. Note that some structures are intermediate between the extreme barrel and sandwich arrangements.
• Alpha-beta horseshoe: This placental ribonuclease inhibitor is a cytosolic protein that binds extremely strongly to any ribonuclease that may leak into the cytosol. 17-stranded parallel β sheet curved into an open horseshoe shape, with 16 α-helices packed against the outer surface. It doesn't form a barrel although it looks as though it should. The strands are only very slightly slanted, being nearly parallel to the central `axis'.
Protein building blocks
• Amino Acids
From RNA to protein: Translation

•tRNA
• Ribosome
The Genetic Code

 Use evolutionary and compositional properties


to computationally discover protein-coding genes
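The genetic code itself is a 64-entry lookup table, and the simplest compositional signal for a protein-coding gene is a long open reading frame. The sketch below builds the standard code from its compact 64-letter form (codons ordered with bases T, C, A, G) and scans the forward strand only. The function names and the example sequence are hypothetical, and a real gene finder would add reverse-strand scanning, codon bias, and conservation.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna):
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def longest_orf(dna):
    """Longest translation that starts at an ATG on the forward strand."""
    best = ""
    for start in range(len(dna) - 2):
        if dna[start:start + 3] == "ATG":
            protein = translate(dna[start:])
            if len(protein) > len(best):
                best = protein
    return best

peptide = longest_orf("CCATGGCCAAATGATT")
```

Note that `translate` also returns a reading frame that runs off the end of the sequence without a stop codon, which a stricter ORF definition would reject.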
Summary: The Central Dogma
DNA makes RNA makes Protein

Inheritance

Messages

Reactions
Cellular dynamics and regulation
How cells move through this Central Dogma

DNA
makes

Gene regulation RNA


makes

Protein
Animal/Human gene regulation:
One genome  Many cell types
ACCAGTTACGACGGTCA
GGGTACTGATACCCCAA
ACCGTTGACCGCATTTA
CAGACGGGGTTTGGGTT
TTGCCCCACACAGGTAC
GTTAGCTACTGGTTTAG
CAATTTACCGTTACAAC
GTTTACAGGGTTACGGT
TGGGATTTGAAAAAAAG
TTTGAGTTGGTTTTTTC
ACGGTAGAACGTACCGT
TACCAGTA

Image Source wikipedia
Eukaryotic Gene Regulation
Diverse roles for regulatory non-coding RNAs

• Small RNA pathways (18-21 nt)


– microRNAs:
• Repress genes by targeting their 3’UTRs by complementarity
• Double-stranded RNA is then recognized and degraded
• Recently found to also target promoter regions in rare cases
– piwiRNAs
• Target and repress transposable elements in germline
– snoRNAs
– 21U-RNAs
• Long non-coding RNAs (1000s nt, many exons)
– Scaffolds for protein/TF binding
– Scaffolds for 3D structure of RNA
Regulation of Gene Expression

• Upstream of genes are promoter regions
• Contain promoter sequences or motifs
• Transcription factors (TFs) bind to motifs
• TFs recruit RNA polymerase
• Gene transcription
[Figure labels: transcription factor, polymerase, promoter, transcription factor binding site, mRNA]
Examples:
Predicted motif drivers
of enhancer modules

• Activator and
repressor motifs
consistent with
tissues
Pouya Kheradpour
Network components reveal functional modules

• Feed-forward loops in developmental patterning


• Cooperation of master reg. & downstream reg.
Zeitlinger et al, Genes & Development 2007
Systematic motif dissection in 2000 enhancers:
5 activators and 2 repressors in 2 cell lines

54000+ measurements (x2 cells, 2x repl)

Kheradpour et al Genome Research 2013


Emerging properties of regulatory networks

• Hierarchical levels of regulatory control


– Small number of backward-pointing edges
• Specific / distinct feedback by microRNAs at each level
– Two classes of TFs: miRNA regulators and miR-regulated
From Systems Biology to Synthetic Biology
• Components with known properties
• Assemble based on engineering goals / principles
• Implement within engineered cells and organisms
• Study behavior & adjust as needed
[Figure panels: Synthetic Regulatory Networks (Jim Collins); Synthetic Metabolic Pathways (Jay Keasling)]
Over-expressing a single microRNA leads to a new wing
[Figure: WT wing and haltere compared with sense and antisense miRNA over-expression, showing a wing w/ bristles and sensory bristles. Note: C, D, E same magnification]
• Discovery of sense/anti-sense miRNAs
• Regulatory switch selects between two developmental programs
• By over-expressing one strand (miRNAas) the balance is tilted
• Wing program launched vs. haltere
Stark et al, Genes & Development 2007
Brief intro to Human Genetics
The role of genetic alterations

DNA
makes

RNA
makes

Protein
Brief intro to human genetics
• Human genome: 3.2B letters, 2 copies, 23 chromosomes,
20k genes, ~3M common SNPs, ~500k haplotype blocks
The power and challenge of disease-association studies

Slide credit: Luke Ward, Mark Daly

• Large associated blocks with many variants: Fine-mapping challenge


• No information on cell type/mechanism, most variants non-coding
 Epigenomic annotations help find relevant cell types / nucleotides
The power of GWAS: reveal new disease genes

IL23R: cytokine receptor on a subset of effector T-cells

rs11209026     A     G
Cases         22   976
Controls      68   932

Chi-sq = 24.5, p = 7.3 × 10^-7
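The statistic on this slide can be checked directly from the 2×2 allele-count table. This is a minimal sketch assuming a Pearson chi-square test with one degree of freedom and no continuity correction, which is the choice that reproduces the slide's value; the function name is made up for illustration.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Rows: cases, controls. Columns: allele A, allele G (counts from the slide).
chi2, p_value = chi_square_2x2(22, 976, 68, 932)
```

A genome-wide scan repeats this test at every SNP, which is why significance thresholds are set far below the usual 0.05.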
Genomewide association in schizophrenia
with 40,000 cases

More than 100 distinct regions of


the genome associated to
schizophrenia!!!
Stephan Ripke
Interpreting non-coding variants

• Disease-associated SNPs enriched for enhancers in relevant cell types


• E.g. lupus SNP in GM enhancer disrupts Ets1 predicted activator
Mechanistic predictions for top disease-associated SNPs
Lupus erythematosus in GM lymphoblastoid cells:
• Disrupt activator Ets-1 motif
→ Loss of GM-specific activation
→ Loss of enhancer function
→ Loss of HLA-DRB1 expression

Erythrocyte phenotypes in K562 leukemia cells:
• Creation of repressor Gfi1 motif
→ Gain K562-specific repression
→ Loss of enhancer function
→ Loss of CCDC162 expression
Characterizing sub-threshold variants in heart arrhythmia

Focus on sub-threshold variants (e.g. rs1743292, P = 10^-4.2)
Trait: QRS/QT interval
(1) Large cohorts, (2) many known hits
(3) well-characterized tissue drivers
GWAS hits in enhancers of relevant cell types
Linking traits to their relevant cell/tissue types

[Figure: tissue and cell-type groups: ES, Liver, Brain, Digestive, Heart, T cells, B cells]
Methylation differences a causal component of AD

Methylation probes altered in AD


are enriched in AD-associated SNPs

GMD
GMD
G  D
AD predictive power reduced
M
after removing meQTL effect
Set-wise causality testing
Uncovering the molecular basis of top obesity gene

[Figure labels: Lean, Obese]
• ARID5B KD (obesity) vs. ARID5B OE (anti-obesity)
• IRX3, IRX5 knock-down (anti-obesity phenotypes) vs. IRX3, IRX5 overexpression (pro-obesity phenotypes)
• C-to-T motif rescue (anti-obesity phenotypes) vs. T-to-C motif disruption (pro-obesity phenotypes)
Model: beige ↔ white adipocyte development
Shift therapeutic focus from brain to adipocytes

Course at a Glance
