A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features

PeerJ Computer Science


Introduction

Materials and Methods

Data

Target value

Problem statement

  • objects are genomic bins of 20-kb length that do not intersect,

  • input features are the measurements of chromatin factors’ binding,

  • target value is the transitional gamma, which characterizes the TAD status of the region and, thus, the DNA folding,

  • objective is to predict the value of transitional gamma and to identify which of the chromatin features are most significant in predicting the TAD state.
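The object definition above (non-intersecting 20-kb genomic bins carrying chromatin-factor measurements) can be illustrated with a minimal sketch. The function names `make_bins` and `bin_signal` are hypothetical, and averaging the signal within each bin is an assumption made for illustration; the aggregation actually used in the study is not stated in this section.

```python
import numpy as np

BIN_SIZE = 20_000  # 20-kb bins, as in the problem statement


def make_bins(chrom_length, bin_size=BIN_SIZE):
    """Return (start, end) pairs of non-overlapping bins covering a chromosome."""
    starts = np.arange(0, chrom_length, bin_size)
    ends = np.minimum(starts + bin_size, chrom_length)  # last bin may be shorter
    return np.stack([starts, ends], axis=1)


def bin_signal(positions, values, chrom_length, bin_size=BIN_SIZE):
    """Average a per-position signal within each bin (assumed aggregation).

    Bins that contain no measurements are set to 0.0.
    """
    n_bins = int(np.ceil(chrom_length / bin_size))
    idx = np.asarray(positions, dtype=np.int64) // bin_size  # bin index per position
    vals = np.asarray(values, dtype=float)
    sums = np.bincount(idx, weights=vals, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    return np.divide(sums, counts, out=np.zeros(n_bins), where=counts > 0)
```

Stacking the output of `bin_signal` for each chromatin factor column-wise would give one feature row per bin, with the transitional gamma of the same bin as the regression target.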

Selection of loss function

Machine learning models

Results

Chromatin marks are reliable predictors of the TAD state

The context-aware prediction of TAD state is the most reliable

Reduced set of chromatin marks is sufficient for a reliable prediction of the TAD state in Drosophila

Feature importance analysis reveals factors relevant for chromatin folding into TADs in Drosophila

TAD state prediction models are transferable between cell lines of Drosophila

The all-cell-lines model improves prediction for most cell lines

Discussion

Conclusions

Supplemental Information

Weighted MSE for each dataset when each chromatin mark is used separately as the single input feature, on the train, test and validation datasets

Results of the biLSTM RNN using (A) Schneider-2, (B) Kc167, (C) DmBG3-c2 and (D) all three cell lines together. The green lines show the weighted MSE (wMSE) on the test sets, the blue lines show the wMSE on the train sets and the yellow lines show the wMSE on the validation sets.

DOI: 10.7717/peerj-cs.307/supp-1
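For readers unfamiliar with the metric named in this caption, a weighted MSE can be sketched in its generic form. The weighting scheme used in the study is not given in this section, so the `weights` argument and the normalization by the sum of the weights are assumptions for illustration only.

```python
import numpy as np


def weighted_mse(y_true, y_pred, weights):
    """Generic weighted mean squared error.

    Each squared residual is scaled by its weight, and the total is
    normalized by the sum of the weights (an assumed convention).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights))
```

With equal weights this reduces to the ordinary MSE; larger weights make errors on the corresponding bins count more.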

Histograms of (A) the original and (B) the normalized data on ChIP-chip features for the Schneider-2 cell line

Each histogram shows the distribution of one of the analysed ChIP-chip features. Before normalization (A), the distributions are not centered at zero and have differing variances. After normalization (B), all the features are rescaled to the same mean and variance.

DOI: 10.7717/peerj-cs.307/supp-2
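The effect described in this caption matches per-feature standardization. The sketch below assumes a target of zero mean and unit variance; the caption states only that the features share the same mean and variance after normalization, so the exact target values and the function name `standardize` are assumptions.

```python
import numpy as np


def standardize(features):
    """Rescale each column (feature) to zero mean and unit variance."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by 0
    return (features - mean) / std
```

In a train/test setting, the mean and standard deviation would normally be estimated on the training set only and then reused for the other splits.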

The modENCODE IDs of chromatin factors for three selected Drosophila cell lines

Each number in the table corresponds to the modENCODE ID. The columns identify the Drosophila cell lines. The rows show the chromatin factors.

DOI: 10.7717/peerj-cs.307/supp-3

Additional Information and Declarations

Competing Interests

Mikhail Gelfand is an Academic Editor for PeerJ. Grigory V. Sapunov is employed by Intento, Inc.

Author Contributions

Michal B. Rozenwald conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Aleksandra A. Galitsyna conceived and designed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Grigory V. Sapunov, Ekaterina E. Khrameeva and Mikhail S. Gelfand conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

1. The code and the data are available at GitHub: https://github.com/MichalRozenwald/Hi-ChIP-ML

2. The chromatin marks are available at modENCODE using the following IDs:

#   Name               Schneider-2  Kc167  DmBG3-c2
1   Chriz              279          277    275
2   CTCF               3749         3749   3671
3   Su(Hw)             5147         3801   3717
4   BEAF-32            922          3745   3663
5   CP190              925          3748   3666
6   GAF                3753         3753   2651
7   H3K4me1            3760         5138   2653
8   H3K4me2            965          4935   2654
9   H3K4me3            3761         5141   967
10  H3K9me2            311          938    310
11  H3K9me3            4183         3013   312
12  H3K27ac            3757         3757   295
13  H3K27me1           3943         3942   3941
14  H3K27me3           298          5136   297
15  H3K36me1           3170         3003   299
16  H3K36me3           303          302    301
17  H4K16ac            320          318    316
18  RNA-polymerase-II  329          328    950

3. The Hi-C data is available at NCBI GEO: GSE69013.

Funding

This study was supported by the Russian Science Foundation, grant number 19-74-00112, and Skoltech Fellowship in Systems Biology for Aleksandra A. Galitsyna. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
