TCGA gene expression data classification
Presented by
Tareque Mohmud Chowdhury
(144701)
Supervisor
Md Abu Raihan Mostafa Kamal, PhD
Professor, Department of Computer Science and Engineering (CSE)
Islamic University of Technology (IUT)
Contents
Introduction
Motivation
Objectives
Literature Review
• Gene and gene expression
• Gene products
• Gene regulation
Outline of methodology
References
Introduction
Cancer is the most common human genetic disease. The transition from a normal cell
to a cancerous cell is driven by changes to a cell's DNA, known as mutations.
Genetic mutations can occur randomly during cell division, can occur due to some
extreme environmental stress, or can be inherited from parents.
Cancer arises from the transformation of normal cells into tumor cells in a multi-stage
process that generally progresses from a pre-cancerous lesion to a malignant tumor.
Introduction
A transcriptome is the full range of messenger RNA, or mRNA, molecules expressed
by an organism in a cell, in a tissue or in the entire body.
Motivation
Cancer is one of the leading causes of death worldwide, accounting for nearly 10 million
deaths in 2020, or nearly one in every six deaths. [1]
Early diagnosis gives the best chance of successful treatment. Transcriptome
profile analysis may help to diagnose cancer in its early stages.
Transcriptome profile analysis also helps to understand the root causes behind cancer
development and progression.
Recent reviews [10,11] have shown that little work has been done on transcriptome
profile analysis with Deep Learning (DL), despite the fact that DL is capable of
discovering hidden features in big datasets.
Objective
The objective of the study is to build a deep learning based classifier model that detects
cancer types and cancer stages from transcriptome data (gene products) and/or DNA
methylation data.
Using the model, identify the set of gene products and/or DNA methylation features that
contribute most to each cancer type and to cancer progression.
Finally, validate the identified gene products and/or DNA methylation features by pathway
analysis.
Challenges(1)
The architecture of a Deep Learning Network (DLN) defines its working parameters,
such as the number, size, and type of layers in the network. The best architecture
varies greatly with the dataset. In most cases, an intuitive trial-and-error approach is
used to fine-tune the architecture to reach higher validation accuracy on a
dataset from a given domain.
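To make the size of the trial-and-error problem concrete, the sketch below enumerates a small search space of architecture settings. The parameter names and candidate values are hypothetical and chosen only for illustration; they are not the settings used in this study.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative assumptions.
SEARCH_SPACE = {
    "num_hidden_layers": [1, 2, 3],
    "units_per_layer": [64, 256, 1024],
    "layer_type": ["dense", "conv1d"],
    "dropout": [0.0, 0.5],
}

def enumerate_architectures(space):
    """Yield every combination of hyperparameter values as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

candidates = list(enumerate_architectures(SEARCH_SPACE))
print(len(candidates))  # 3 * 3 * 2 * 2 combinations
```

Even this small grid yields 36 candidate architectures, each of which would need its own training run, which is why manual tuning quickly becomes expensive.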
Challenges(2)
Biological data related to cancer genomics comes in a wide variety of categories.
Examples include DNA methylation data, mRNA nucleotide sequences, mRNA
expression data, miRNA nucleotide sequences, miRNA expression data, and other RNA
expression data. Choosing the right category to feed to the prepared DLN is not
easy.
Moreover, since all categories of data describe the same cancer samples, all or a subset of
them can be fed to the prepared DLN for a better outcome. In that case, the DLN needs to
be updated to accept such a diverse array of input data.
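One simple way to feed several data categories to a single network is to concatenate the per-sample feature vectors (often called early fusion). The sketch below assumes each category is stored as a mapping from sample ID to a list of values; the function name, sample IDs, and numbers are made up for illustration, and other fusion strategies are possible.

```python
def fuse_modalities(modalities):
    """Concatenate per-sample feature vectors from several data categories.

    `modalities` maps a category name to {sample_id: [feature values]}.
    Only samples present in every category are kept.
    """
    names = sorted(modalities)  # fixed category order keeps feature positions stable
    shared = set.intersection(*(set(modalities[n]) for n in names))
    fused = {}
    for sample_id in sorted(shared):
        vector = []
        for n in names:
            vector.extend(modalities[n][sample_id])
        fused[sample_id] = vector
    return fused

# Toy values; sample IDs and numbers are invented for illustration.
data = {
    "mrna_expression": {"S1": [5.1, 0.0, 2.3], "S2": [4.8, 1.2, 0.0]},
    "dna_methylation": {"S1": [0.9, 0.1], "S2": [0.4, 0.7], "S3": [0.2, 0.3]},
}
fused = fuse_modalities(data)
print(fused["S1"])
```

Sample S3 is dropped because it has no mRNA expression values, which reflects a practical cost of combining categories: the usable sample count can shrink.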
Gene and Gene Expression
mRNA is the only type of RNA that is translated into protein. Other RNAs
participate in various cellular activities.
Gene Products
Transcription + Translation = Central Dogma
DNA/Gene → (transcription) → RNA
• mRNA → (translation) → Protein
• ncRNA
− lncRNA
− miRNA
− snRNA
− circRNA
− tRNA
− pseudoGene RNA
− Other RNA
mRNA
Protein Level 3
Gene Regulation
Gene regulation is the process used to control the timing, location and amount in which
genes are expressed.
Gene regulation is carried out by a variety of mechanisms, including through
regulatory proteins/RNAs and chemical modification of DNA (methylation).
A typical animal gene is regulated by its adjacent promoter plus several enhancers that
can be located in 5' and 3' regulatory regions, as well as within introns. Enhancers are,
on average, 500 base pairs in length and contain as many as 10 binding sites for
multiple transcription factors. [2]
Gene Regulation
Figure 1. Regulation of gene expression by transcription factors. Transcriptional activators and repressors bind to specific DNA sequences
present in enhancers and promoters. The regulation of gene expression occurs by enhancing or inhibiting recruitment [3]
Datasets(1)
Two major sources of gene and gene expression data related to cancer are:
• The Cancer Genome Atlas (TCGA), a landmark cancer genomics program that molecularly
characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer
types, available at https://portal.gdc.cancer.gov
• The Gene Expression Omnibus (GEO), a public repository of gene expression datasets
maintained by NCBI
Datasets (2)
For our study, we will download gene expression data from the TCGA database to train our DL
architecture and, after preparing the model, validate it on independent GEO datasets.
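Validating on a second source requires that each external sample presents its genes in the same order the model was trained on. The sketch below shows one way to do that alignment; the helper name, the example values, and the choice to fill missing genes with 0.0 are assumptions made for illustration.

```python
def align_to_training_genes(training_genes, sample):
    """Reorder a {gene: value} sample to the training gene order.

    Genes absent from the external sample are filled with 0.0 (an assumption;
    restricting the model to genes shared by both sources is another option).
    Genes the model never saw are ignored.
    """
    return [sample.get(gene, 0.0) for gene in training_genes]

training_genes = ["TP53", "BRCA1", "EGFR"]  # gene order used during training
external_sample = {"EGFR": 7.2, "TP53": 3.4, "MYC": 9.9}  # made-up values
aligned = align_to_training_genes(training_genes, external_sample)
print(aligned)
```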
Literature Review: Gene Expression Analysis
To identify a set of genes related to a specific state (normal or disease), or to the difference
between two states (normal and disease, or between disease states), the following
methods are widely used:
• DESeq2 [7]
• WGCNA: Weighted Correlation Network Analysis [8]
• ceRNA: Competing endogenous RNA network [9]
Explore pathways from the set of identified genes using Gene Ontology (GO) database.
Literature Review: DESeq2
Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
Authors: Love, M.I., Huber, W. & Anders, S.
DESeq2 offers a comprehensive and general solution for gene-level analysis of RNA-seq
data.
DESeq2 is a method for differential analysis of count data, using shrinkage estimation
for dispersions and fold changes to improve stability and interpretability of estimates.
This enables a more quantitative analysis focused on the strength rather than the
mere presence of differential expression.
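The sketch below illustrates why shrinking fold changes helps. It uses a plain pseudocount as a much simpler stand-in for the empirical Bayes shrinkage that DESeq2 actually performs, so it is not DESeq2's method; the read counts and the pseudocount value are invented.

```python
import math

def log2_fold_change(mean_a, mean_b, pseudocount=0.0):
    """log2(mean_b / mean_a), with an optional pseudocount added to both means."""
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))

# Made-up mean read counts for two genes in conditions A and B.
low_count_gene = (1.0, 4.0)         # few reads: the ratio is dominated by noise
high_count_gene = (1000.0, 4000.0)  # many reads: the ratio is well supported

for name, (a, b) in [("low", low_count_gene), ("high", high_count_gene)]:
    raw = log2_fold_change(a, b)
    shrunk = log2_fold_change(a, b, pseudocount=8.0)  # illustrative pseudocount
    print(name, round(raw, 2), round(shrunk, 2))
```

Both genes have the same raw fold change, but the pseudocount pulls the low-count estimate strongly toward zero while leaving the high-count estimate almost unchanged. This is the qualitative behaviour that makes shrunken estimates more stable.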
Literature Review: WGCNA
Literature Review: ceRNA
Competing endogenous RNAs (ceRNAs): new entrants to the intricacies of gene regulation
Author: Subbaya Subramanian
miRNA is a short RNA sequence (~22 nt) that can bind to a target mRNA, typically in its 3' untranslated region, and
silence it. A silenced mRNA is degraded or blocked from translation, so it no longer contributes to protein production.
Around 2000 miRNAs exist in the human transcriptome. The relationship between miRNAs and mRNAs is
many-to-many: one type of miRNA can silence multiple types of mRNA, and one type of
mRNA can be silenced by multiple types of miRNA.
Moreover, some other types of RNA (lncRNA, circRNA, etc.) can act as sponges for miRNA,
i.e., they bind miRNAs and keep them from reaching their mRNA targets.
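The many-to-many relationship can be represented as a bipartite mapping, and candidate ceRNA pairs are then the RNAs that share miRNAs. The sketch below is a minimal illustration of that idea; the RNA names are placeholders rather than real interactions, and the `min_shared` threshold is an assumed parameter.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical miRNA -> target lists; names are placeholders, not real interactions.
MIRNA_TARGETS = {
    "miR-A": ["mRNA-1", "mRNA-2", "lncRNA-X"],
    "miR-B": ["mRNA-2", "lncRNA-X", "circRNA-Y"],
    "miR-C": ["mRNA-3"],
}

def invert(mirna_targets):
    """Build the reverse map: target RNA -> set of miRNAs that bind it."""
    target_mirnas = defaultdict(set)
    for mirna, targets in mirna_targets.items():
        for target in targets:
            target_mirnas[target].add(mirna)
    return dict(target_mirnas)

def cerna_pairs(mirna_targets, min_shared=1):
    """Return RNA pairs that compete for at least `min_shared` common miRNAs."""
    target_mirnas = invert(mirna_targets)
    pairs = {}
    for rna_a, rna_b in combinations(sorted(target_mirnas), 2):
        shared = target_mirnas[rna_a] & target_mirnas[rna_b]
        if len(shared) >= min_shared:
            pairs[(rna_a, rna_b)] = shared
    return pairs

pairs = cerna_pairs(MIRNA_TARGETS, min_shared=2)
print(pairs)
```

In this toy network, only lncRNA-X and mRNA-2 share two miRNAs, so lncRNA-X is the candidate sponge that could relieve silencing of mRNA-2.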
Literature Review: Deep Learning on Gene Expression Data
In a review paper [11], Koumakis L. lists several limitations of applying DL to genomic
data:
• Model interpretation (the black box)
• The curse of dimensionality
• Imbalanced classes
• Heterogeneity of data
A review paper [10] titled “Deep Learning in Cancer Diagnosis and Prognosis
Prediction: A Minireview on Challenges, Recent Trends, and Future Directions” by
Yong-Kui Ma et al. suggests that DL has so far been applied to only a few types of
genomic data. Much work remains to be done in applying DL to other types of genomic
data.
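Of the limitations listed above, class imbalance has a common and simple mitigation: weighting each class inversely to its frequency in the loss function. The sketch below computes such weights with the widely used "balanced" heuristic; the cancer type labels are TCGA project codes, but the sample counts are invented for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Invented counts: one dominant cancer type and two rare ones.
labels = ["BRCA"] * 80 + ["LUAD"] * 15 + ["CHOL"] * 5
weights = balanced_class_weights(labels)
for cls, weight in sorted(weights.items()):
    print(cls, round(weight, 3))
```

Rare classes receive larger weights, so misclassifying them costs more during training. Whether this is enough for a given dataset is an empirical question.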
Outline of methodology
Download TCGA datasets for each type of cancer that has a significant number of
samples.
Prepare a DL architecture to classify types and/or stages of cancer.
Process the downloaded TCGA datasets into an intermediary/augmented form to feed into
the prepared DL architecture.
Identify the features (genes, gene products) that contribute most to the classification of the
data.
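As one possible reading of the processing step, the sketch below applies a log2(count + 1) transform and keeps the genes with the highest variance across samples. Both choices are assumptions about what the intermediary form could look like, not the method fixed by this study; the gene names and counts are invented.

```python
import math
from statistics import pvariance

def log_transform(matrix):
    """Apply log2(count + 1) to every value in a samples x genes matrix."""
    return [[math.log2(value + 1.0) for value in row] for row in matrix]

def top_variable_genes(matrix, gene_names, k):
    """Return the names of the k genes with the highest variance across samples."""
    columns = list(zip(*matrix))  # one tuple of values per gene
    variances = [pvariance(column) for column in columns]
    ranked = sorted(range(len(gene_names)), key=lambda i: variances[i], reverse=True)
    return [gene_names[i] for i in ranked[:k]]

genes = ["GENE_A", "GENE_B", "GENE_C"]  # placeholder gene names
counts = [
    [0, 100, 7],
    [1023, 110, 7],
    [15, 90, 7],
]  # rows are samples, columns are genes; values are invented
logged = log_transform(counts)
selected = top_variable_genes(logged, genes, k=2)
print(selected)
```

GENE_C is constant across samples and is dropped, which shows how a variance filter reduces the input dimension before the data reaches the network.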
References
1. WHO, World Health Organization, https://www.who.int/news-room/fact-sheets/detail/cancer
2. Sunil Lakhani et al., The landscape of cancer genes and mutational processes in breast cancer. Nature 486 (7403) 400-U133. https://doi.org/10.1038/nature11017
3. Johann Gregor Mendel, "Versuche über Pflanzen-Hybriden", Verhandlungen des naturforschenden Vereines in Brünn 4, 1865, 3-47
4. Adams, J. (2008) The complexity of gene expression, protein interaction, and cell differentiation. Nature Education 1(1):110
5. Duygu Koca, Characterization of novel histone methyltransferases and their roles in cancer, Doctoral thesis, Advisor: Roland Schüle, January 2019
6. Preetha Anand, Ajaikumar B. Kunnumakara, Chitra Sundaram, Kuzhuvelil B. Harikumar, Sheeja T. Tharakan, Oiki S. Lai, Bokyung Sung & Bharat B. Aggarwal, Cancer is a Preventable Disease that Requires Major Lifestyle Changes, Pharmaceutical Research, Vol. 25, No. 9, September 2008. DOI: 10.1007/s11095-008-9661-9
7. Love, M.I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014). https://doi.org/10.1186/s13059-014-0550-8
8. Langfelder, P., Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008). https://doi.org/10.1186/1471-2105-9-559
9. Subbaya Subramanian, Competing endogenous RNAs (ceRNAs): new entrants to the intricacies of gene regulation, Frontiers in Genetics, vol. 5, 2014. https://doi.org/10.3389/fgene.2014.00008
10. Yong-Kui Ma, Ahsan Bin Tufail, Mohammed K. A. Kaabar, Francisco Martínez, A. R. Junejo, Inam Ullah, Rahim Khan, "Deep Learning in Cancer Diagnosis and Prognosis Prediction: A Minireview on Challenges, Recent Trends, and Future Directions", Computational and Mathematical Methods in Medicine, vol. 2021, Article ID 9025470, 28 pages, 2021. https://doi.org/10.1155/2021/9025470
11. Koumakis L. Deep learning models in genomics; are we there yet? Comput Struct Biotechnol J. 2020 Jun 17;18:1466-1473. doi: 10.1016/j.csbj.2020.06.017. PMID: 32637044; PMCID: PMC7327302.
Thank You All
Questions ?