TCGA gene expression data classification

The document presents a study on transcriptome analysis and its application in understanding genetic diseases, particularly cancer. It outlines the motivation for early cancer diagnosis through deep learning-based classifiers using transcriptome data and discusses the methodology for analyzing gene expression data from major cancer databases. The study aims to identify significant gene products and validate them through pathway analysis to enhance cancer detection and understanding.


Transcriptome analysis and genetic diseases

Presented by
Tareque Mohmud Chowdhury
(144701)

Supervisor
Md Abu Raihan Mostafa Kamal, PhD
Professor, Department of Computer Science and Engineering (CSE)
Islamic University of Technology (IUT)
Contents
 Introduction
 Motivation
 Objectives
 Literature Review
• Gene and gene expression
• Gene products
• Gene regulation
 Outline of methodology
 References

2
Introduction
 Cancer is the most common human genetic disease. The transition from a normal cell
to a cancerous cell is driven by changes to a cell's DNA, known as mutations.

 Genetic mutations can occur randomly during cell division, can occur due to some
extreme environmental stress, or can be inherited from parents.

 Cancer arises from the transformation of normal cells into tumor cells in a multi-stage
process that generally progresses from a pre-cancerous lesion to a malignant tumor.

3
Introduction
 A transcriptome is the full range of messenger RNA, or mRNA, molecules expressed
by an organism in a cell, in a tissue or in the entire body.

 Transcriptome profile analysis can help researchers understand genetic diseases and
the progress of a genetic disease from one state to another over time.

4
Motivation
 Cancer is one of the leading causes of death worldwide, accounting for nearly 10 million
deaths in 2020, or nearly one in every six deaths. [1]

 Early diagnosis gives cancer treatment the best chance of success. Transcriptome
profile analysis may help to diagnose cancer at an early stage.

 Transcriptome profile analysis also helps to understand the root causes behind cancer
development and progression.

 Recent reviews [10,11] have shown that little work has been done on transcriptome
profile analysis with Deep Learning (DL), despite the fact that DL can
discover hidden features in large datasets.
5
Objectives
 Build a deep-learning-based classifier that detects cancer types and cancer stages
from transcriptome data (gene products) and/or DNA methylation data.
 Use the model to identify the set of gene products and/or DNA methylation features that
contribute significantly to each cancer type and to cancer progression.
 Finally, validate the identified gene products and/or DNA methylation features by pathway
analysis.

6
Challenges(1)
 The architecture of a Deep Learning Network (DLN) defines its working parameters,
such as the number, size, and type of layers in the network. The best architecture
varies greatly with the dataset. In practice, the architecture is mostly tuned by
intuitive trial and error until it reaches a higher validation accuracy on a
dataset from a given domain.

7
Challenges(2)
 Biological data related to cancer genomics comes in a wide variety of categories, for
example DNA methylation data, mRNA nucleotide sequences, mRNA
expression data, miRNA nucleotide sequences, miRNA expression data, and other RNA
expression data. Choosing the right category to feed to the prepared DLN is not
easy.

 Moreover, because all categories of data describe the same cancer samples, all or a subset of
them can be fed to the prepared DLN together for a better outcome. In that case, the DLN needs to
be adapted to accept such a diverse array of input data.
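
One simple way to feed several data categories to a single network is to standardize each category and concatenate the features per sample. The sketch below illustrates that idea only; the function names, sample IDs, and values are hypothetical, and the slides do not say which fusion strategy will be used.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    out_cols = []
    for col in zip(*rows):
        mu, sigma = mean(col), pstdev(col)
        # A constant column carries no signal; map it to zeros.
        out_cols.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in col])
    return [list(r) for r in zip(*out_cols)]


def fuse_modalities(modalities):
    """Concatenate per-sample features from several data categories.

    `modalities` maps a category name to {sample_id: feature_list}.
    Only samples present in every category are kept.
    """
    shared = sorted(set.intersection(*(set(m) for m in modalities.values())))
    fused = {s: [] for s in shared}
    for name in sorted(modalities):  # fixed order keeps columns reproducible
        rows = [modalities[name][s] for s in shared]
        for s, row in zip(shared, zscore_columns(rows)):
            fused[s].extend(row)
    return fused


# Toy values, invented for illustration only.
toy = {
    "mrna": {"S1": [10.0, 200.0], "S2": [30.0, 100.0], "S3": [20.0, 150.0]},
    "methylation": {"S1": [0.1], "S2": [0.9], "S4": [0.5]},
}
fused = fuse_modalities(toy)
```

Here only S1 and S2 appear in both categories, so the fused input has two samples with three features each (one methylation feature followed by two mRNA features).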

8
Gene and Gene Expression

 Genes are the coded regions (instructions) of DNA that can be transcribed into
various types of RNA.

 Gene expression is the process by which the instructions (genes) in DNA are
converted into functional products, such as RNAs or proteins.

 mRNA is the only RNA that is translated into protein. Other RNAs
participate in various cellular activities.

9
Gene Products
Transcription + Translation = Central Dogma

 DNA/Gene → RNA (transcription)
• mRNA → Protein (translation)
• ncRNA
− lncRNA
− miRNA
− snRNA
− circRNA
− tRNA
− pseudoGene RNA
− Other RNA

Level 1: DNA/Gene. Level 2: RNA. Level 3: Protein.


10
Hierarchy of Gene Products
[Figure: hierarchy of gene products. Level 1: genes on DNA. Level 2: RNAs (mRNA, lncRNA, circRNA, miRNA of ~22 nt, snRNA, pseudoGene RNA, other RNAs). Level 3: protein.]

11
Gene Regulation
 Gene regulation is the process used to control the timing, location and amount in which
genes are expressed.
 Gene regulation is carried out by a variety of mechanisms, including through
regulatory proteins/RNAs and chemical modification of DNA (methylation).
 A typical animal gene is regulated by its adjacent promoter plus several enhancers that
can be located in 5' and 3' regulatory regions, as well as within introns. Enhancers are,
on average, 500 base pairs in length and contain as many as 10 binding sites for
multiple transcription factors. [2]

12
Gene Regulation

Figure 1. Regulation of gene expression by transcription factors. Transcriptional activators and repressors bind to specific DNA sequences
present in enhancers and promoters. The regulation of gene expression occurs by enhancing or inhibiting recruitment [3]
13
Datasets(1)
 Two major sources of gene and gene expression data related to cancer are:

• The Cancer Genome Atlas (TCGA), a landmark cancer genomics program that molecularly
characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer
types, available at https://portal.gdc.cancer.gov

• Gene Expression Omnibus (GEO), an international repository for high-throughput functional
genomic data, which can be downloaded in several formats using a variety of mechanisms described at
https://www.ncbi.nlm.nih.gov/geo/info/download.html

 A number of other databases are listed in the review paper [10] by Yong-Kui Ma et
al., but most cover specific types of cancer with a limited number and variety of
samples.

14
Datasets (2)
 For our study, we will download gene expression data from the TCGA database to train our DL
architecture. Once the model is prepared, we will cross-validate it against GEO datasets.

15
Gene Expression Analysis: Literature Review
 To identify a set of genes related to a specific state (normal or disease), or genes that differ
between two states (normal and disease, or two disease states), the
following methods are widely used:

• DESeq2 [7]
• WGCNA: Weighted Correlation Network Analysis [8]
• ceRNA: Competing endogenous RNA network [9]

 Pathways are then explored from the set of identified genes using the Gene Ontology (GO) database.

16
Literature Review: DESeq2

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
Authors: Love, M.I., Huber, W. & Anders, S.

 DESeq2 offers a comprehensive and general solution for gene-level analysis of RNA-
seq data.

 DESeq2 is a method for differential analysis of count data, using shrinkage estimation
for dispersions and fold changes to improve stability and interpretability of estimates.
This enables a more quantitative analysis focused on the strength rather than the
mere presence of differential expression.

 DESeq2 reports the genes that are differentially expressed between two or more conditions.
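
To make the idea of comparing normalized counts between conditions concrete, the sketch below computes median-of-ratios size factors (the normalization DESeq2 uses) and a plain log2 fold change per gene. It is a simplified illustration, not DESeq2 itself: the negative binomial model, dispersion estimation, and shrinkage are not reproduced, and the pseudocount, function names, and toy counts are assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors. `counts` is a genes x samples table."""
    n_samples = len(counts[0])
    # Use only genes with no zero count, so the geometric mean is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def log2_fold_changes(counts, groups, pseudocount=1.0):
    """Per-gene log2(mean tumor / mean normal) on depth-normalized counts."""
    sf = size_factors(counts)
    result = []
    for row in counts:
        norm = [c / f for c, f in zip(row, sf)]
        tumor = [v for v, g in zip(norm, groups) if g == "tumor"]
        normal = [v for v, g in zip(norm, groups) if g == "normal"]
        # The pseudocount (an assumption here) avoids division by zero.
        ratio = (sum(tumor) / len(tumor) + pseudocount) / (
            sum(normal) / len(normal) + pseudocount
        )
        result.append(math.log2(ratio))
    return result


# Toy counts, invented for illustration; samples 2 and 4 are sequenced twice as deep.
counts = [
    [100, 200, 100, 200],  # gene A: unchanged once depth is accounted for
    [50, 100, 200, 400],   # gene B: higher in tumor
    [80, 160, 80, 160],    # gene C: unchanged once depth is accounted for
]
groups = ["normal", "normal", "tumor", "tumor"]
sf = size_factors(counts)
lfc = log2_fold_changes(counts, groups)
```

In this toy table the size factors of the deeper samples come out twice as large as the others, so gene A gets a log2 fold change of zero while gene B is close to a fourfold increase.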

17
Literature Review: WGCNA

WGCNA: an R package for weighted correlation network analysis
Authors: Langfelder, P., Horvath, S.

 Weighted gene co-expression network analysis is a systems biology method for
describing the correlation patterns among genes across gene expression data.
 WGCNA can be used for finding clusters (modules) of highly correlated genes.
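
The first step of a weighted co-expression network can be sketched as follows: the unsigned adjacency between two genes is the absolute Pearson correlation of their expression profiles raised to a soft-thresholding power beta. This is plain Python rather than the WGCNA R package; beta = 6, the gene names, and the expression values are illustrative assumptions, and module detection itself is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def soft_threshold_adjacency(expr, beta=6):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta for each gene pair."""
    genes = sorted(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for g in genes
        for h in genes
        if g < h
    }


# Toy expression profiles across five samples, invented for illustration.
expr = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly proportional to G1
    "G3": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related to G1
}
adj = soft_threshold_adjacency(expr)
```

Raising the correlation to a power keeps strongly correlated pairs such as G1 and G2 close to 1 while pushing weak correlations such as G1 and G3 toward 0, which is what makes the network "weighted" rather than hard-thresholded.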

18
Literature Review: ceRNA

Competing endogenous RNAs (ceRNAs): new entrants to the intricacies of gene regulation
Author: Subramanian Subbaya

 miRNA is a short RNA sequence (~22 nt) that can bind to a target mRNA, typically in its 3' untranslated region, and silence it.
A silenced mRNA is degraded or blocked from translation, so it no longer produces protein.

 Around 2000 miRNAs exist in the human transcriptome. The relationship
between miRNA and mRNA is many-to-many: one type of miRNA can silence multiple types of mRNA, and one type of
mRNA can be silenced by multiple types of miRNA.

 Moreover, some other types of RNA (lncRNA, circRNA, etc.) can act as sponges for miRNA:
they bind miRNA molecules and thereby keep them away from their mRNA targets.
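
The many-to-many relationship described above can be represented as a map from each miRNA to the transcripts it binds. Transcripts that share several miRNAs are then candidate ceRNA pairs, since they compete for the same miRNA pool. The sketch below is a minimal illustration; all miRNA and transcript names and the `min_shared` threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical miRNA -> target map; names and links are invented for illustration.
mirna_targets = {
    "miR-A": {"mRNA-1", "mRNA-2", "lncRNA-X"},
    "miR-B": {"mRNA-2", "circRNA-Y"},
    "miR-C": {"mRNA-1", "lncRNA-X"},
}


def invert(targets_by_mirna):
    """Turn miRNA -> targets into target -> miRNAs (the other side of many-to-many)."""
    by_target = {}
    for mirna, targets in targets_by_mirna.items():
        for t in targets:
            by_target.setdefault(t, set()).add(mirna)
    return by_target


def cerna_candidates(targets_by_mirna, min_shared=2):
    """Pairs of transcripts that share at least `min_shared` miRNAs."""
    by_target = invert(targets_by_mirna)
    pairs = {}
    for a, b in combinations(sorted(by_target), 2):
        shared = by_target[a] & by_target[b]
        if len(shared) >= min_shared:
            pairs[(a, b)] = shared
    return pairs


pairs = cerna_candidates(mirna_targets)
```

With this toy map, lncRNA-X and mRNA-1 are both bound by miR-A and miR-C, so lncRNA-X is a candidate sponge that could relieve silencing of mRNA-1.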

19
Literature Review: Deep Learning on Gene
Expression Data
 In a review paper [11], Koumakis L. lists several limitations of applying DL to genomic
data:
• Model interpretation (the black box)
• The curse of dimensionality
• Imbalanced classes
• Heterogeneity of data
 The review paper [10] titled “Deep Learning in Cancer Diagnosis and Prognosis
Prediction: A Minireview on Challenges, Recent Trends, and Future Directions” by
Yong-Kui Ma et al. suggests that DL has so far been applied to only a few types of
genomic data, so much work remains to be done in applying DL to other types of genomic
data.
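
Of the limitations listed above, class imbalance has a common and simple mitigation: weighting each class inversely to its frequency in the training loss. The sketch below shows that weighting only; it is not taken from either review, and the class labels and counts are invented.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


# Toy label distribution, invented for illustration.
labels = ["BRCA"] * 6 + ["LUAD"] * 3 + ["normal"] * 1
weights = balanced_class_weights(labels)
```

Rare classes receive larger weights, so misclassifying the single normal sample costs more than misclassifying one of the six BRCA samples.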

20
Outline of methodology
 Download TCGA datasets for each type of cancer that has a significant number of
samples.
 Prepare a DL architecture to classify types and/or stages of cancer.
 Process the downloaded TCGA datasets into an intermediary/augmented form to feed into
the prepared DL architecture.
 Identify the features (genes, gene products) that contribute most to the classification of the
data.
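
The last two steps (classify, then identify the most influential features) can be illustrated end to end on a toy problem. The sketch below uses binary logistic regression as a linear stand-in for the DL architecture, which the slides do not specify, and ranks genes by absolute weight. Gene names, values, and hyperparameters are hypothetical; a real deep model would need an attribution method suited to nonlinear networks.

```python
import math
import random


def train_logistic(X, y, lr=0.1, epochs=500, seed=0):
    """Binary logistic regression fitted by batch gradient descent."""
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [rng.uniform(-0.01, 0.01) for _ in range(n_features)]
    b = 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return class 1 when the linear score is positive, else class 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


def rank_features(w, names):
    """Order feature names by decreasing absolute weight."""
    return [n for _, n in sorted(zip(w, names), key=lambda p: -abs(p[0]))]


# Toy standardized expression values; only GENE_B separates the two classes.
names = ["GENE_A", "GENE_B", "GENE_C"]
X = [
    [0.2, -1.0, 0.1], [-0.3, -1.2, -0.2], [0.1, -0.8, 0.3],   # class 0
    [-0.1, 1.1, -0.1], [0.3, 0.9, 0.2], [-0.2, 1.0, -0.3],    # class 1
]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
ranking = rank_features(w, names)
```

On this toy data the model separates the classes and GENE_B ranks first, which is the kind of per-class gene list that would then go to pathway analysis.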

21
References
1. WHO, World Health Organization, https://www.who.int/news-room/fact-sheets/detail/cancer

2. Sunil Lakhani et al., The landscape of cancer genes and mutational processes in breast cancer. Nature 486 (7403) 400-U133. https://doi.org/10.1038/nature11017

3. "Versuche über Pflanzen-Hybriden", Johann Gregor Mendel, Verhandlungen des naturforschenden Vereines in Brünn 4 1865, [3]-47

4. Adams, J. (2008) The complexity of gene expression, protein interaction, and cell differentiation. Nature Education 1(1):110

5. Duygu Koca, Characterization of novel histone methyltransferases and their roles in cancer, Thesis for Doctoral, Advisor: Roland Schüle, January 2019

6. Preetha Anand, Ajaikumar B. Kunnumakara, Chitra Sundaram, Kuzhuvelil B. Harikumar, Sheeja T. Tharakan, Oiki S. Lai, Bokyung Sung & Bharat B. Aggarwal, Cancer is a Preventable Disease
that Requires Major Lifestyle Changes, Pharmaceutical Research, Vol. 25, No. 9, September 2008 DOI: 10.1007/s11095-008-9661-9

7. Love, M.I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014).
https://doi.org/10.1186/s13059-014-0550-8

8. Langfelder, P., Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008). https://doi.org/10.1186/1471-2105-9-559

9. Subramanian Subbaya, Competing endogenous RNAs (ceRNAs): new entrants to the intricacies of gene regulation, Frontiers in Genetics VOL 5, 2014,
https://doi.org/10.3389/fgene.2014.00008

10. Yong-Kui Ma, Ahsan Bin Tufail, Mohammed K. A. Kaabar, Francisco Martínez, A. R. Junejo, Inam Ullah, Rahim Khan, "Deep Learning in Cancer Diagnosis and Prognosis Prediction: A Minireview on
Challenges, Recent Trends, and Future Directions", Computational and Mathematical Methods in Medicine, vol. 2021, Article ID 9025470, 28 pages, 2021. https://doi.org/10.1155/2021/9025470

11. Koumakis L. Deep learning models in genomics; are we there yet? Comput Struct Biotechnol J. 2020 Jun 17;18:1466-1473. doi: 10.1016/j.csbj.2020.06.017. PMID: 32637044; PMCID: PMC7327302.

22
Thank You All

23
Questions ?

24
