Clustering for Gene Expression Analysis
Clustering for Gene Expression Analysis
Olga Georgieva
Sofia University „St. Kliment Ohridski”, 5 James Bourchier blvd., 1164 Sofia, Bulgaria
Abstract
Since several decades intensive research for revealing and understanding gene expression and
consequent their role in organisms live are research focus of scientists in different areas. A part
of these investigations is the analysis of gene expression data. The present paper extends and
by that improves an iterative clustering procedure for detection genes that are differentially
expressed. The presented analytical review of the clustering algorithms applied to the gene
expression analysis serves as a good prerequisite for this aim. Additional investigation is
presented and conclusion for appropriate choice of the procedure parameters is summarized.
Keywords 1
Gene expression analysis, Differential gene expression, Clustering analysis
1 Introduction
Since several decades intensive research for revealing and understanding gene expression and
consequent their role in organisms live are research focus of scientists in different areas. A part of these
investigations is the analysis of gene expression data. This is a numerical table of values of expression
levels of thousands of genes observed for (usually large amounts) of samples [1].
The expression data tables are obtained through well-established technologies, which have been
significantly improved in recent years. Thus, the contemporary technologies named Next Generation
Sequencing provide large volumes of DNA-seq and RNA-seq data. The obtained gene expression data
by them are more precise having higher resolution than the older technologies of microarray. It
significantly increases the opportunities for effective research to gather knowledge for existing
dependencies of the genes’ activity. Despite the technology used to obtain gene expression, it results in
gene expression table that contains up to several hundred thousand numbers, where each row
corresponds to one particular gene and each column to a sample. The number of genes in such table
comprises a part or whole genome of a given organism. This huge amount of data needs to be processed
applying specific data mining algorithms to reveal new information about unknown biological
dependences. Typical information that could be revealed by processing these tables is summarized in
the following general tasks [1,5,6]:
To understand how genome sequences specify the forms and functions in the organisms and
how remarkable diversity of life appears on the Earth.
To understand the mechanisms and evolution of the organism’s response to environment
changes.
To find groups of genes that act equally and by that to identify distinct subsets of genes that are
unknown in their biological response.
Specific solutions of the above general tasks could be mentioned as the aim to identify genes with
expression levels that reflect biological processes of certain interest as for instance cancer. One can
identify a specific type of cancer in respect to group of genes responsible for it. Thus, classification
Education and Research in the Information Society, October 13-14, 2022, Plovdiv, Bulgaria
EMAIL: [email protected]
ORCID: 0000-0001-5376-6729
© 2022 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
17
obtained from gene expression analysis can be used for more precise tumor diagnostic and its effective
treatment [1].
The posed general tasks need to discover groups of genes could be solved in a supervised manner if
preliminary knowledge about these groups exists. However, in most cases it is searched for unknown
data partitioning. In this case unsupervised methods of cluster analysis are appropriate tool [3,4]. Due
to the vast amount of data in the gene expression matrix it is helpful to process data by subdividing the
genes into a smaller number of categories and then analyzing the obtained groups [2]. It is more
interesting to cluster genes that show similar expression patterns across a number of samples, rather
than clustering the samples themselves.
A specific task of gene expression research is the aim for defining differentially expressed genes.
These are genes that act differently in equal strains, for which many samples of each strain have been
collected. In this way the genes responsible for certain disease states can be discovered or it is possible
to find differences between two species strains. In searching an appropriate solution several methods
for RNA-seq data analysis that identify different genes according to their expression levels basically on
statistical data analysis have been explored [7,8,9]. However, there is no good agreement among the
applied methods as beside the commonly identified genes each method detects additional genes that are
not identified by the others. Furthermore, the additional genes are large number among the distinct
methods. By taking the advantage of the unsupervised data analysis methods an iterative clustering
procedure to deal with this task was recently introduced and compared with several statistical methods
[10]. The difficulty of the selection problem due to the large number of indistinguishable genes is solved
by iterative solutions. The procedure depends on several parameters, which values are subject of a
preliminary choice.
The present paper aims to extend and by that to improve the iterative clustering procedure for
differential gene expression detection. The analytical review presented in the paper underlines the
powerful ability of the clustering algorithms to deal with gene expression data. The review serves as a
good prerequisite to solve the research aim. Based on it additional investigation is presented and
conclusion for appropriate choice of the procedure parameters is summarized.
18
known according to the implemented linkage function. Due to the existing indistinguishability of genes
in their expression different initial linking and different linkage functions have to be explored in order
to establish good clustering results.
The establishment of hierarchical clustering method as a widely used method for clustering gene
expression data is owed to its visibility features. The clustering results are presented in a dendrogram.
However, they are also given in a colored view way that has become a standard for visualization of the
gene expression data.
19
The difficulty of the task could be summarized as:
• Due to the large number of genes and lack of significance in separation by the estimated p-
values no valuable conclusion about the gene separation could be done.
• Often the number of searched genes is (quite) smaller than the whole genes’ amount. These
genes could be treated more as outliers than a significant group.
• The genes do not have (near) constantly activity levels among different samples of a certain
strain. This imposes to work with average values than to single sample data.
A recently proposed solution takes advantages from the density-based clustering approach dueto its
ability to find outliers with the core clusters. The implemented algorithm separates genes densely
scabbed around the equivalence area from the outliers that are away from this area. In fact, the outliers
are data of the interested differentially expressed genes. Figure 1 illustrates this observation for
DBSCAN clustering applied to the average values of the gene’s activity of two mice strains. However,
only 154 genes were discovered against several hundred found by statistical data analysis.
Figure 1: Results of DBSCAN, r = 0,4, Nmin = 20. Discovered clusters are enumerated in the picture
legends: outliers “-1”, core cluster “1” [10].
20
Again, due to the similarity in the genes' behavior as in case of the whole data set (Fig. 1), it could
be expected that the data of each batch form a large compact group along the equivalence area. This
area is rather oblong than spherical one. It suggests that clusters are not Euclidean. This observation
requires further detail investigation into the selection of an appropriate proximity measure. In searching
of proper distance measure different distances have to be applied and the results to be assessed in terms
of separation abilities.
The minimum number Nmin is difficult to find in advance as it depends on the density of the data.
Here we determine heuristically in terms to it increase the outlier’s separation.
The number of genes that form a batch is other problem to be solved. The larger batch could make
impossible to detect the outliers at all, whereas small batch could embarrass detection of clusters and
thus the right distinguish between equivalent and differentially expressed genes.
Once the parameters are determined they are applied to each data batch according to the accepted
iterative scheme of clustering.
Table 1: Number of differentially expressed genes selected by different cluster distances compared
with results of different statistical methods
Method ttest edgeR limma DESeq2
ML
ttest ML edgeR ML limma ML DESeq2 ML all data
Distance
Euclidean 71 71 915 647 736 537 982 648 2848
Mahalanobis 71 71 915 738 736 611 982 735 1905
Cityblock 71 70 915 417 736 365 982 407 994
Minkowski, p=2 71 68 915 286 736 266 982 283 614
Minkowski, p=0.5 71 71 915 789 736 641 982 807 3475
Chebychev, r=0,1, Nmin=5 71 71 915 667 736 548 982 665 2264
In searching the appropriate batch volume, we explore different volumes for clustering by best found
cluster distance (Table 2). The results for batches consisting of 511 and 1022 genes with appropriate
21
settings of the rest two parameters – r and Nmin, are comparable. Certain preferences can be given to
clustering at batch of 511 genes because of the smaller number of all separated genes (ML all data). On
the other hand, batch of 1022 genes ensures larger number of differentially expressed genes.
Table 2: Number of differentially expressed genes selected by iterative procedure with different
batch volumes size and applied Mahalonobis distance
ttest edgeR limma DESeq2
Number of genes
ML
in a batch,
ttest ML edgeR ML limma ML DESeq2 ML all data
algorithm
parameters
511, r=0,2,
71 71 915 738 736 611 982 735 1905
Nmin=5
511, r=0,3,
71 71 915 527 736 457 982 516 896
Nmin=5
1022, r=0,2,
71 71 915 584 736 508 982 573 1058
Nmin=5
1022, r=0,2,
71 71 915 791 736 642 982 787 2050
Nmin=10
1022, r=0,3,
71 71 915 525 736 455 982 514 891
Nmin=10
The advantage of the procedure is its result visibility abilities. For each batch the clustering result
could be seen and assessed visually. Further assessment and interpretation according the biological
meaning of the compact clusters found in the equivalent area and the respective outliers could be applied
(Figure 2).
Figure 2: Results of iterative DBSCAN clustering by Mahalonobis distance measure, r=0,3, Nmin=10
and bach=1022 genes.
22
4 Conclusion
The paper improves an iterative clustering procedure for differential gene expression detection. For
this aim, first an analytical review that underlines the powerful ability of the clustering algorithms to
deal with gene expression data by revealing their advantages and disadvantages for solving gene
expression data analysis tasks is presented. Second, the specific task for searching genes that are
differentially expressed is considered. The task is solved by procedure taking the advantage both of
density-based clustering and the iterative clustering.
The paper makes conclusions about the appropriate choice of the procedure parameters – proper
distance measure and batch volume. The results obtained are compared with statistical data analysis
applied to the investigated dataset of gene expression of two mice strains. The best distance measures
are found that are Mahalonobis and Minkowski distances. Good results are obtained for batches of 511
and 1022 genes.
The iterative procedure is standalone applicable. In other cases, it could be used in combination with
statistical methods in order to stick their search in a smaller number of genes.
5 Acknowledgements
This research work has been supported by GATE project, funded by the Horizon 2020
WIDESPREAD-2018-2020 TEAMING Phase 2 programme under grant agreement No.857155 and
Operational Programme Science, by Operational Programme Science and Education for Smart Growth
under Grant Agreement no. BG05M2OP001-1.003-0002-C01 and by the Science Fund of Sofia
University under project no. 80-10-169/23.05.2022.
6 References
[1] E. Domany, Cluster Analysis of Gene Expression Data. Journal of Statistical Physics 110 (2003): 1117–
1139.
[2] P. D'haeseleer, How does gene expression clustering work? Nat Biotechnol 23 (2005): 1499–1501.
[3] A.K. Jain, R.C.Dubes, Algorithms for Clustering Data. (Prentice Hall, Englewood Cliffs, New Jersey,
1988.
[4] D. Jiang, C. Tang, A. Zhang, Cluster analysis for gene expression data: a survey. IEEE Trans. Know.
Data Eng. 16 (2004): 1370–1386.
[5] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genome-wide
expression patterns. Proc. Natl. Acad. Sci. USA 95 (1998): 14863–14868.
[6] P.T.Spellman, Cluster analysis and display, In: “DNA Microarrays” edited by D. Bowtell and J.
Sambrook, Cold Spring Harbor Laboratory (2002): 569-581.
[7] D. Spies, P.F. Renz, T.A. Beyer, C. Ciaudo, Comparative analysis of differential gene expression tools
for RNA sequencing time course data, Briefings in Bioinformatics 20(1) (2019): 288–298.
[8] D. Palejev, Comparison of RNA-Seq Differential Expression Methods, Cybernetics and Information
Technologies 17(5) (2017): 60-67.
[9] T. Wang, B. Li, C.E. Nelson, et al. Comparative analysis of differential gene expression analysis tools
for single-cell RNA sequencing data. BMC Bioinformatics 20, 40, 2019.
[10] O. Georgieva, Iterative Clustering for Differential Gene Expression Analysis. In: Rojas, I., Valenzuela,
O., Rojas, F., Herrera, L.J., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO
2022. Lecture Notes in Computer Science 13347. Springer, Cham., 2022.
[11] F.D. Smet, J. Mathys, K. Marchal, G. Thijs, B.D. Moor, and Y. Moreau, “Adaptive Quality-Based
Clustering of Gene Expression Profiles,” Bioinformatics 18 (2002): 735-746.
[12] A. P. Gasch, M.B. Eisen, Exploring the conditional coregulation of yeast gene expression through
fuzzy k-means clustering, Genome Biology, 3(11), (2002): 1-22.
23
[13] O. Georgieva, F. Klawonn, E. Härtig, Fuzzy Clustering of Macroarray Data. In: Reusch, B. (eds)
Computational Intelligence, Theory and Applications. Advances in Soft Computing 33. Springer,
Berlin, Heidelberg, 2005.
[14] D. Kumar, U. Batra, Clustering algorithm for gene expression data. Int J Recent Res Asp. 4, (2017):
122-128.
[15] S. Srivastava, N. Joshi. Clustering techniques analysis for microarray data. Int J Comput Sci Mob
Comput. 3 (2014); 359-364.
[16] M. Ester, H.-P Kriegel, J. Sander, X. Xiaowei, A density-based algorithm for discovering clusters in
large spatial databases with noise. In: Proceedings of the Second International Conference on
Knowledge Discovery in Databases and Data Mining, 226–231. AAAI Press, Portland, 1996.
[17] D. .Bottomly, N. A. R. Walter, J. E. Hunter et al. Evaluating Gene Expression in C57BL/6J and DBA/2J
Mouse Striatum Using RNA-Seq and Microarrays. – PLoS ONE 6(3), e17820, 2011.
24