Bioinformatics for wet lab
Abstract
Background Genomics data is available to the scientific community after publication of research projects and can be investigated for a multitude of research questions. However, in many cases deposited data is only assessed and used for the initial publication, resulting in valuable resources not being exploited to their full depth.
Main A likely reason for this is that many wet-lab-based researchers are not formally trained to apply bioinformatic tools and may therefore assume that they lack the necessary experience to do so themselves. In this article, we present a series of freely available, predominantly web-based platforms and bioinformatic tools that can be combined in analysis pipelines to interrogate different types of next-generation sequencing data. In addition to the exemplary route presented, we also list a number of alternative tools that can be combined in a mix-and-match fashion. We place special emphasis on tools that can be followed and used correctly without extensive prior knowledge of programming. Such analysis pipelines can be applied to existing data downloaded from the public domain or be compared to the results of one's own experiments.
Conclusion Integrating transcription factor binding to chromatin (ChIP-seq) with transcriptional output (RNA-seq) and chromatin accessibility (ATAC-seq) can not only help to form a deeper understanding of the molecular interactions underlying transcriptional regulation but will also help to establish new hypotheses and pre-test them in silico.
Keywords ChIP-seq, RNA-seq, ATAC-seq, Integrated data analysis, Transcriptional networks
© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/)
applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Laub et al. BMC Genomics (2023) 24:382 Page 2 of 19
and training available to bench scientists. Here, we present an analysis pipeline that makes use of different platforms to retrieve sequencing data from the public domain together with freely available, user-friendly and predominantly web-based bioinformatic tools for the evaluation and visualization of results. As many platforms and tools can be used for several sequencing paradigms, visualized in Fig. 1, and to make the content most accessible to new users, specific aspects of their application will be introduced in different subchapters of this paper. The initial bioinformatic analysis of raw next-generation sequencing (NGS) results, including quality filtering, alignment of the reads to the genome and peak calling, is not the subject of this review article but should follow appropriate guidelines such as those curated by the ENCODE Consortium. We here focus on the type of analysis that builds on already-processed datasets, enabling analysis steps such as comparison between datasets, mapping of individual data points across datasets, searching for gene ontology terms, jointly regulated pathways, or shared upstream regulators, and more. We present one exemplary route and follow it throughout the paper but suggest alternative tools and approaches alongside. This allows users to develop an analysis strategy that fits their needs and matches their preferences. We argue that this approach can serve as a valuable resource to explore new ideas and projects in silico, before moving forward with time-, cost-, and resource-intensive wet-lab experiments.
Data resources are continuously growing and the databases described here are frequently augmented with new datasets. Nevertheless, data on many target genes or cell types are still missing from these repositories. New wet-bench experiments will therefore certainly be needed for the foreseeable future. Making the resulting data openly accessible is thereby a critical and valuable contribution to the scientific community.

Data storage and accession
Most scientific journals require that all sequencing data submitted as evidence in a particular study be made available in a public repository after the manuscript has been accepted for publication. With the publication, an accession code is provided to retrieve the data from the respective platforms. Several options are available for this. For bottom-up approaches, these platforms can be searched for available datasets in order to start a new research project from previously published data. One's own sequencing results can also be included and combined with public datasets. Commonly used platforms described in detail here are listed in Table 1.
Two frequently used platforms for data storage are Gene Expression Omnibus (GEO) for processed data and Sequence Read Archive (SRA) for raw data files. An alternative resource for functional genomics data is ArrayExpress [12]. It includes metadata detailing experimental procedures as well as processed and/or raw data. While ArrayExpress is hosted by the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), which is part of the intergovernmental European organization ELIXIR, GEO and SRA are hosted by the US-based National Center for Biotechnology Information (NCBI). Because these databases have been maintained separately, users might need to search several databases to get a comprehensive overview of public genomics data of interest [17]. Since errors can occur when uploading datasets as well as descriptive metadata into databases, it may also be helpful to cross-check relevant information on individual datasets across platforms and in the original publication. GEO includes data from many different genetic and genomic approaches, including genome methylation, chromatin structure, and genome-protein interactions [18]. Each dataset on GEO, accessed via the GEO Accession Viewer, is provided with contact information of the researcher who generated it as well as a reference to the corresponding publication, if available. Datasets of multiple experiments in a given study (including different sequencing paradigms) are assembled in series that are linked on GEO Accession Viewer and can therefore be found with ease. While the provision of descriptive data regarding the experimental process applied to obtain a certain dataset is standardized on GEO Accession Viewer, the available datasets vary in format.
Another valuable resource is the Encyclopedia of DNA Elements (ENCODE) project. ENCODE collects a wealth of datasets from various sequencing paradigms, metadata as well as protocols and provides various data formats (both raw and processed) in order to systematically map regions of transcription, transcription factor (TF) association, chromatin structure and histone modification [13]. However, as compared to the platforms discussed above, the ENCODE project follows a specific scientific aim rather than providing a mere collection of data, and thus is focused on in-depth assessment of specific factors, rather than a wide range of transcriptional regulators. Therefore, most projects are centered around common cell lines and ubiquitously expressed factors or histone marks.
Of note, most datasets in the public domain arise from cell culture experiments, which are often chosen for their practical advantages in culturing, providing a homogeneous cell population. However, this comes at a cost in that cell lines are only an approximation of the primary cell types which are modeled. When making use of public datasets, potential (epi-)genetic differences that might be introduced should be critically assessed. Where available,
Fig. 1 Tools and platforms presented for NGS data retrieval and analysis. The type of analysis is depicted by the color of circles, input data formats
are given by pinned icons. Lines connect each NGS paradigm with bioinformatic resources applicable to this data type
Category | Platform | Features | References
ChIP-seq data | ChIP-Atlas | Data can be downloaded in .bed and .bigWig format; Data can be readily visualized in IGV; Peak calling differs from original publications | [6, 7]
ChIP-seq data | Cistrome DB | Information on QC; Motifs underlying peaks; Nexus to Galaxy analysis pipeline; Human and mouse data only; Download of data in .bed only | [8]
ChIP-seq data | ChIPBase | Transcriptional regulatory networks of lncRNAs, miRNAs, other ncRNAs and protein-coding genes; Motif information included; No raw data can be downloaded | [9, 10]
ChIP-seq data | ReMap | Collection of manually curated ChIP-seq, ChIP-exo, and DAP-seq data; Data can be directly accessed via UCSC Genome Browser | [11]
Functional genomics | ArrayExpress | Includes metadata detailing experimental procedures as well as processed and/or raw data | [12]
Functional genomics | ENCODE | Raw as well as processed data; Good QC and reproducibility | [13, 14]
General seq repositories | GEO Accession Viewer | Comprehensive study overview; Contact information to data curators; No consistent data formats available | [15]
General seq repositories | SRA Run Browser | Hosts raw data files | [16]
it is further advisable to juxtapose these data with data from primary cells or tissues.

Software repositories and general tools
Many bioinformatic tools as well as databases can be found on bio.tools, accessed via this registry and applied online or after downloading to a local computer. The code base of tools is usually hosted on GitHub or Bioconductor. GitHub is an online platform that allows users to create repositories to store and share both analysis code, in any programming language, and datasets. A great many repositories can be created and openly shared using a free personal account, but contributors also have the option to restrict the use of their repository by a license. The website of the public repository can then be used to share code as well as data easily in publications. As an example, the public repository created for analyzing data with UpSetR discussed in more detail below can be found here. For large datasets, file hosting services are recommended whose file links can be shared from the public folders for use in GitHub. Bioconductor is an initiative for the collaborative creation of bioinformatic software, harboring a multitude of open-source and open-development programs written in the statistical programming language R [19]. Commonly used tools described in detail in this paper are listed in Table 2 (ChIP-/ATAC-seq analysis) and Table 3 (RNA-seq analysis). This list could be extended much further as the demand for and development of sequencing analysis tools continuously grows, but we restrict ourselves to a selection of approaches that can be fitted into one exemplary pipeline.

Tools for dataset conversion
A challenge that researchers often face when retrieving data from public repositories is the variety of file formats and annotations in which the data is stored. For example, ChIP-seq data can be deposited in a variety of formats ranging from raw data in .fasta or .fastq, through processed data in simple tabular, human-readable .bed files, to continuous track formats such as .bigWig. While .bed files contain the coordinates of sequencing peaks, and thus can be viewed as discrete data structures providing information on the presence or absence of peaks [53], .bigWig is a continuous, quantitative data structure that also enables the assessment of peak shapes. Complicating matters further, most downstream applications have precise data structure requirements (visualized in Fig. 1) that do not necessarily match the structure in which the corresponding data is stored in public repositories. However, many of these formats can be translated into one another. An easy-to-use resource for this purpose can be found in the Galaxy platform, a system for the integration of genomic sequences, their alignments, and functional annotation [23, 24]. For example, the UNIX command line application bedtools getfasta available on Galaxy allows the conversion of .bed data into the .fasta file format. In case of annotation differences between datasets, annotation transfer tools such as liftOver can convert genome coordinates of .bed files into the respective assembly [21]. This enables users to integrate data from different annotation generations of the same species (e.g. mm9 and mm10 when working with mouse-derived data) and thus to compare results that were mapped to different
genomic assemblies. In addition, with the help of liftOver, genomic annotations from a wide range of species can be converted into one another, facilitating inter-species comparisons. However, a significant drawback of this approach is that regions which are not evolutionarily conserved between the original and target species are lost. It should be noted that liftOver facilitates the conversion between genomic loci of different species or among different generations of genome assemblies, but not between different gene nomenclatures. Further tools for annotation transfer are discussed elsewhere [22].

Visualization of sequencing data
A helpful step to gain an initial impression of sequencing data or to view specific genomic regions in detail is to visualize genome-wide sequences relative to the reference genome. This can be achieved by tools such as the UCSC Genome Browser or Integrative Genomics Viewer (IGV) [21, 31]. Uploading sequencing data to either website will allow users to graphically visualize genomic data, search them for gene names and genomic coordinates, and compare multiple datasets. Alongside the uploaded data, additional pre-installed genomic information is provided in both tools, such as ChIP-seq data for histone modifications or common transcription factors, SNPs, conservation across species or repeating elements. Yet, while UCSC Genome Browser outperforms IGV in the availability of additional datasets, the graphical display and image quality are superior in IGV, as content can be directly exported as vector graphics there. UCSC Genome Browser, on the other hand, provides .eps graphics, which can be converted into publication-quality figures using appropriate software such as Inkscape.

Software choice
Most tools discussed in this review are available as graphical user interfaces (GUI), such as online or desktop versions, and as command line-run programs. For inexperienced users, GUI versions may be a good choice, as these usually provide intuitive handling and easier navigation. To target our discussion to wet-lab-based researchers who may have little to no prior experience with bioinformatic computing, we will focus on tools that are available online, as these come without installation requirements. However, to make the most of the application possibilities of a given tool, it may be advantageous to resort to desktop or even command line versions, as for many tools these include more customization options and can run faster than online distributions. For most of the tools presented here, online tutorials on their application are provided on the respective websites. Users who are interested in diving deeper into the bioinformatic application of these tools are advised to become familiar with Unix command line navigation, as well as programming in R and Python.
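
For readers taking a first step into Python, the following minimal sketch illustrates the idea behind the .bed-to-.fasta conversion described under "Tools for dataset conversion". It is not bedtools getfasta itself: the toy genome, the peak coordinates and the function names parse_bed and bed_to_fasta are assumptions made purely for illustration, and a real genome would be read from a reference .fasta file rather than typed in as a dictionary. The sketch relies on the fact that .bed coordinates are zero-based and half-open, which matches Python slicing.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # chromosome, start (0-based), end (exclusive)


def parse_bed(text: str) -> List[Interval]:
    """Read the first three .bed columns from tab-separated text."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional header lines found in some exports.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def bed_to_fasta(intervals: List[Interval], genome: Dict[str, str]) -> str:
    """Cut each interval out of the genome and format it as a .fasta record."""
    records = []
    for chrom, start, end in intervals:
        # Zero-based, half-open coordinates map directly onto a Python slice.
        sequence = genome[chrom][start:end]
        records.append(f">{chrom}:{start}-{end}\n{sequence}")
    return "\n".join(records)


# Hypothetical toy data, for illustration only.
toy_genome = {"chr1": "ACGTACGTAC", "chr2": "GGGCCCAAAT"}
toy_bed = "chr1\t2\t6\tpeak1\nchr2\t0\t3\tpeak2\n"

fasta = bed_to_fasta(parse_bed(toy_bed), toy_genome)
print(fasta)
```

Running the sketch prints one .fasta record per interval. For genome-scale data, the bedtools implementation available on Galaxy remains the practical choice.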
Assessment of protein/DNA interaction: ChIP-seq
Epigenetic modification of chromatin together with the temporally and spatially controlled contact of TFs and their transcriptional co-regulators lies at the core of gene expression regulation. A variety of techniques has been developed in recent years to map the occupancy of TFs and histones on DNA and detect the chemical modifications these carry. One of the first and still the most widely used methods to assess the chromatin landscape genome-wide is chromatin immunoprecipitation (ChIP) [54] followed by massively parallel sequencing (ChIP-seq) [3]. Briefly, ChIP uses polymerization of paraformaldehyde (PFA) to crosslink proteins to chromatin. After cell lysis and recovery of the cell nuclei, the chromatin is fragmented by sonication or micrococcal nuclease digestion. The fragmented chromatin is then precipitated with antibodies directed against the TF or histone modification of interest. Protein-DNA complexes are recovered and washed to reduce background signals, and the precipitated DNA is isolated by heat-induced crosslinking reversal. The DNA fragments are then subjected to library preparation and, after indexing and quality control (QC), samples are sequenced using an appropriate next-generation sequencing platform. Following a series of QC steps (which include eliminating contaminating DNA sequences from other commonly used model organisms using FastQC [29], removing remaining adapter sequences, and quality trimming), the reads are mapped against an appropriate reference genome. Mapped reads are then filtered to retain only high-confidence concordant pairs, usually followed by the removal of reads mapping to the mitochondrial genome and unassembled
contigs. Peak calling is performed, and candidate regions are further filtered by fold enrichment score. At this stage of analysis, datasets are most commonly deposited in public repositories. Different peak calling algorithms are in use. While TF binding sites are usually called assuming narrow peaks, broad peak callers are employed for histone modifications. When using data from the public domain, it should always be cross-checked against other publications whether the peak width of a certain dataset is in the appropriate range for the assessed entity. Performing some of these simple but effective quality control methods can be of great help, especially when working with data that originate from the public domain.

Databases
One useful public repository for retrieving datasets is ChIP-Atlas, a fully integrated data-mining suite for ChIP-seq, DNase-seq, ATAC-seq, and Bisulfite-seq data [6, 7]. This database assembles datasets from various sources and organisms, including human and mouse. It shows alignment and peak-call results in several formats including .bed as well as .bigWig for ChIP-seq data. Alongside data retrieval, ChIP-Atlas allows analyzing genome-wide transcriptional regulator interactions with one another or with genes of interest, as well as examining enrichment of protein binding for multiple genomic coordinates or gene names. In addition, ChIP-Atlas offers options to visually assess the quality of different types of sequencing data, a requirement for any meaningful further analysis. The representation of ‘Base call quality data from DBCLS SRA’ in ChIP-Atlas allows users to visually assess data quality in the form of a homogeneous distribution of quality scores spanning the green area of QC plots. Another database harboring human and murine data from ChIP-seq, DNase-seq and ATAC-seq experiments, which can be used to extract further cis-regulatory information, is Cistrome DB [8, 55]. While fewer datasets are available on Cistrome DB than on ChIP-Atlas, additional functions are implemented, such as QC and motif discovery, which is a clear advantage of this database. ChIPBase is a third option for collecting datasets, and it enables motif discovery to be performed directly [9, 10]. While this database focuses on the function of non-coding RNA (ncRNA) entities, ChIP-function can be initially assessed by correlation with expression of TFs as indicated by RNA-seq. A drawback of ChIPBase is that raw peak data cannot be downloaded, but a reference to GEO Accession Viewer is provided, through which access to the original data is possible. Finally, large-scale integrative analysis can also be performed with ReMap, another collection of manually curated ChIP-seq, ChIP-exo, and DAP-seq (DNA Affinity Purification Sequencing) data from public sources (GEO, ENCODE, ENA) [11].

Downloaded ChIP-seq datasets can then be subjected to post-analysis and in silico assessments by a specific workflow that we present below. A schematic of this workflow is summarized in Fig. 2 A, and exemplary outputs are displayed in Fig. 2 B-F. For simplicity, only one possible approach is described below, in which we focus on the identification of regulatory interactions in chromatin. However, many different analysis routes are possible and, depending on the initial data structure, other approaches than the ones detailed below may be suitable. Figure 1 lists several tools and platforms that can be used on ChIP-seq data, together with their respective input data formats and purposes. Table 2 gives an overview of some of the most prominent tools that can be used instead of or in addition to those discussed below.

Visualization
As described above for the visualization of general sequencing information, called peaks and sequencing tracks generated in the course of ChIP-seq experiments are commonly visualized in genome browsers relative to a reference genome and relevant genomic features. The two most common formats for ChIP-seq data are .bed for called peaks and .bigWig for continuous sequencing tracks. There are several genome browsers to choose from, depending on the origin of the data one would like to visualize. ChIP-Atlas can be easily combined with IGV (Fig. 2 B), while Cistrome DB carries direct plugins for UCSC Genome Browser (Fig. 2 B’) [21, 30] and WashU Epigenome Browser [32]. Data curated in ReMap as well as ENCODE can be directly accessed via UCSC Genome Browser and multiple factors can be integrated for parallel visualization.

Functional analysis
A useful next step is to assess ChIP-seq datasets in terms of potential biological functions. The Genomic Regions Enrichment of Annotations Tool (GREAT) is a good choice for predicting functions of cis-regulatory regions [26]. Any set of genomic regions in .bed format can serve as input to this GO term analysis tool. However, it should be noted that the current version of GREAT only supports human (hg19 and hg38) and mouse (mm9 and mm10) assemblies, and data from different species or assemblies need to be converted first using liftOver. Outputs can be visualized either as a bar chart or as an interactive ontological hierarchy (Fig. 2 C). Additionally, ChIP-seq peaks can be subjected to peak annotation and visualization with ChIPseeker (Fig. 2 D) to gain a deeper understanding of where peaks are localized relative to distinct genomic sites such as promoter regions, and intragenic or intergenic genomic sequences [20]. While GREAT needs to be accessed through the respective website,
Fig. 2 Exemplary ChIP-seq analysis pipeline and outputs. (A) Exemplary workflow and suggested tools, (B) overlay of ChIP-seq tracks in IGV
and (B’) UCSC Genome Browser, (C) associated GO terms of ChIP-seq data obtained by analysis with GREAT, (D) genomic annotation of ChIP-seq
peaks with ChIPseeker, (E) motif distribution of two exemplary TFs in whole genome and (E’) exemplary secondary motif spacing derived
from MEME-ChIP analysis, and (F) dataset intersection of two exemplary TFs using bedtools intersect, visualized with simple text editor program
(columns 1-4: peak information TF1 [peak chromosome, start, stop, name], columns 5-7: peak information TF2 [peak chromosome, start, stop],
column 8: overlapping peak width)
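
The column layout described for panel (F) can be imitated with a short Python sketch. It illustrates interval intersection under stated assumptions and is not the bedtools implementation: the peak coordinates, the peak names and the function name intersect_peaks are hypothetical, and the nested loop is only suitable for small examples. The output resembles what bedtools intersect reports with its -wo option, i.e. the entries of both files followed by the number of overlapping base pairs.

```python
from typing import List, Tuple

PeakTF1 = Tuple[str, int, int, str]  # chromosome, start, stop, name
PeakTF2 = Tuple[str, int, int]       # chromosome, start, stop
OverlapRow = Tuple[str, int, int, str, str, int, int, int]


def intersect_peaks(tf1: List[PeakTF1], tf2: List[PeakTF2]) -> List[OverlapRow]:
    """Report every pair of overlapping peaks together with the overlap width."""
    rows = []
    for a_chrom, a_start, a_stop, a_name in tf1:
        for b_chrom, b_start, b_stop in tf2:
            if a_chrom != b_chrom:
                continue
            # Half-open intervals: peaks that merely touch have zero overlap.
            width = min(a_stop, b_stop) - max(a_start, b_start)
            if width > 0:
                rows.append(
                    (a_chrom, a_start, a_stop, a_name,
                     b_chrom, b_start, b_stop, width)
                )
    return rows


# Hypothetical peaks for two TFs, for illustration only.
tf1_peaks = [
    ("chr1", 100, 200, "tf1_peak1"),
    ("chr1", 500, 600, "tf1_peak2"),
    ("chr2", 50, 150, "tf1_peak3"),
]
tf2_peaks = [("chr1", 150, 250), ("chr1", 600, 700), ("chr2", 40, 60)]

overlaps = intersect_peaks(tf1_peaks, tf2_peaks)
for row in overlaps:
    # Columns 1-4: TF1 peak, columns 5-7: TF2 peak, column 8: overlap width.
    print("\t".join(str(field) for field in row))
```

In this toy example, tf1_peak2 and the TF2 peak starting at position 600 are adjacent but do not overlap, so no row is reported for that pair.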
ChIPseeker can be used via the platform Galaxy. To this end, a .gtf file harboring the corresponding genome assembly (e.g. comprehensive gene annotation) needs to be retrieved from GENCODE [56] and uploaded to Galaxy. Galaxy offers many additional functional analysis tools, such as DiffBind for differential binding analysis of ChIP-seq data [57], or Genrich to detect sites of genomic enrichment. Tools available on this platform are easily explored, as they come with a comprehensive overview of their features and of the supported input and output formats.

Network analysis
A frequently used method to better understand the underlying logic of a given transcriptional regulation
scheme is to assess the regulatory network in the form of motif discovery using MEME-ChIP. This tool takes its input in .fasta format. However, .fasta format is not provided by many platforms but can be reconstructed on the basis of the more common .bed format with the help of the getfasta function in the bedtools toolkit (available on Galaxy). MEME-ChIP is designed for the analysis of ChIP-seq ‘peak regions’ [27, 58]. These expected binding regions are defined as short genomic sequences of 6-12 bp in length surrounding the summit of ChIP-seq peaks, i.e. the individual local maxima of aligned reads in a given ChIP-seq experiment (e.g. the TF binding site in case of a ChIP-seq experiment for a TF). Given a set of genomic regions, MEME-ChIP performs a series of ab initio analyses, such as primary and secondary motif discovery, motif distribution, motif enrichment analysis, motif visualization, binding affinity analysis, and motif identification (Fig. 2 E). Moreover, datasets can be subjected to spaced motif analysis (SpaMo), which infers physical interactions between a previously defined TF and TFs bound at neighboring sites at the DNA interface, whereby close proximity of TF motifs indicates potential interaction (Fig. 2 E’) [59, 60]. Another platform to perform de novo motif discovery or motif scanning to predict TF binding sites is RSAT. While RSAT operates similarly to MEME-ChIP, it integrates more database options for motif discovery. Furthermore, RSAT includes original analyses, such as motif quality evaluation, motif comparisons and clustering, detection and analysis of regulatory variants, building of control datasets, and comparative genomics to discover motifs based on cross-species conservation [28].
Public repositories can also be searched for datasets of those factors for which potential interaction functions are indicated by motif analysis. Potential co-binding can be assessed by overlap computation of ChIP-seq peaks in .bed format using the bedtools intersect function [25] available on Galaxy. This tool generally enables genome arithmetic and can be used to merge, count, complement, and shuffle genomic intervals from multiple files in widely used file formats (an exemplary intersection output of two TF ChIP-seq datasets is shown in Fig. 2 F). Alternatively, the intersection tool Intervene available on Galaxy can be applied, which allows users to produce UpSet plots of multiple intersections [61].

Alternatives to ChIP-seq
Despite its experimental power and wide application, ChIP-seq remains challenging with small samples, and binding sites can be mapped only within 100-200 base pairs, limiting the resolution of this method. In ChIP-exo, this problem is alleviated by including a trimming step of the precipitated DNA fragments by lambda exonucleases [62]. In ChIP-chip (ChIP-on-chip), DNA fragments are isolated by ChIP and assessed by hybridization to genomic microarrays [63]. Both methods have found less widespread use than ChIP-seq, but data generated by them can be examined similarly to the analysis pipelines described above for data generated by ChIP-seq if appropriate data formats are available. A further limitation of ChIP is the reliance on highly specific antibodies that recognize their target after formalin fixation of the chromatin. As a solution to this problem, DamID (DNA adenine methyltransferase identification) offers an approach to identify target sites of chromatin-binding proteins on the genome without the need to have suitable antibodies available. Instead, the DNA-binding protein is ectopically expressed as a fusion to E. coli DNA adenine methyltransferase [64]. Sequencing data generated by DamID can be assessed by the tools described above for ChIP-seq, although specialized tools are available for the initial steps of the data processing workflow such as sequence alignment or read extension. A detailed pipeline can be found on GitHub.
Two relatively new technical improvements for chromatin profiling that are becoming increasingly popular are CUT&RUN (Cleavage Under Targets and Release Using Nuclease; [65]) and CUT&Tag (Cleavage Under Targets and TAGmentation; [66]). Both techniques rely on the fusion of protein A, required for the purification of antibody-precipitated DNA, to a DNA-cleaving enzyme, micrococcal nuclease (MNase) in CUT&RUN or Tn5 transposase in CUT&Tag. Both approaches offer an improved signal-to-noise ratio compared to ChIP-seq, making them better suited for low cell numbers. Unlike ChIP or DamID, CUT&RUN and CUT&Tag are performed on unfixed cells and are therefore not affected by possible fixation-induced artefacts. A pipeline for analysis and visualization of CUT&RUN and CUT&Tag data is provided by CUT&RUNTools [67]. However, for its application one has to delve a little deeper into bioinformatics, as currently no web-based analysis tool is available. Navigation through GitHub alongside some previous experience with Python code is therefore required to apply this toolkit. For pre-analyzed CUT&RUN and CUT&Tag data, the GEO Accession Viewer again provides datasets for several biological contexts and transcriptional regulators.

Assessment of chromatin accessibility: ATAC-seq
Condensed chromatin, characterized by packaging with linker histone H1 and tight DNA wrapping around nucleosomes, prevails in transcriptionally inactive regions, while open chromatin regions, i.e. stretches of DNA exhibiting depletion of nucleosomes, are associated with transcriptional activity [68, 69]. Mapping genome-wide changes in chromatin accessibility has thus long served
as a way to identify regulatory elements and study the relationship between chromatin structure and transcriptional activation. Different NGS-based paradigms for epigenetic profiling of open chromatin and nucleosome positions have been developed: DNase-seq (DNase I hypersensitive sites followed by massively parallel sequencing) uses the endonuclease DNase I to cleave DNA within accessible chromatin, followed by library preparation and NGS [70]. MNase-seq uses the endonuclease/exonuclease Micrococcal nuclease (MNase) to eliminate accessible DNA and selectively sequences nucleosome-bound DNA [71]. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) sequencing involves formaldehyde cross-linking of proteins to DNA, shearing of the DNA, recovery of the nucleosome-free DNA fragments by phenol-chloroform extraction, and NGS [72]. In the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), hyperactive Tn5 transposase integrates into open chromatin regions, where it simultaneously cuts the DNA and ligates adapters for library preparation and high-throughput sequencing [2, 73]. This underlying principle allowed ATAC-seq to be developed further to include methods that create chromatin accessibility maps of individual cells [74, 75]. Irrespective of the NGS-based technology that was used to profile chromatin accessibility, open chromatin regions can be annotated bioinformatically, and post-hoc analyses such as DNA footprinting or analysis of motif enrichment (AME) can be performed. For a further discussion, the reader is referred to [4].

Because both ChIP-seq and ATAC-seq yield reads from a subset of the genome that are mapped back to the whole genome, the tools described above for the analysis of ChIP-seq results can also be applied to ATAC-seq analysis. Further, data generated by ATAC-seq and ChIP-seq experiments can be combined in multiple ways, and ATAC-seq datasets can also be retrieved through ChIP-Atlas. For ATAC-seq, some simple forms of quality control can be applied. For example, transcription start sites (TSS) of actively transcribed genes typically lie in open chromatin, so ATAC-seq signal is expected to be enriched at TSS. Starting from ATAC-seq results, and thus from the genomic regions that classify as 'open' in a particular cell population or tissue, motif discovery can be applied to determine which TF binding motifs these sequences harbor. This approach gives a first indication of the types of TFs that can, in principle, bind to these genomic regions. ChIP-seq data for these TFs in the same or related cells and tissues may then be retrieved from the public domain and compared one by one to the initial ATAC-seq results. This can be done with the help of tools like the already described bedtools intersect, to narrow down the list of candidate TFs involved in gene expression regulation through the genomic sequences identified in the initial ATAC-seq experiment. If ChIP-seq data for TFs of interest are not available, in silico analysis of ATAC-seq can precede ChIP-seq experiments. In such cases, promising TF candidates for immunoprecipitation may be identified by motif analysis of open chromatin regions with the help of MEME-ChIP or RSAT, followed by assessment of the corresponding TF-DNA binding by ChIP-seq experiments in the laboratory.

Assessment of gene activity: RNA-seq
The most commonly used high-throughput technique in transcriptomics is bulk RNA-sequencing (RNA-seq). It provides insight into the transcriptome of tissue sections, biopsies, or cell populations. Although further methods, discussed below, have been developed in recent years, and despite the caveat that bulk RNA-seq determines the average expression level of individual genes over a large and frequently inhomogeneous starting cell population, bulk RNA-seq has considerable strengths compared to alternative approaches. The focus of bulk RNA-seq is on global changes in the transcriptional profile. Major advantages of bulk RNA-seq are its easy application and relatively low cost, which make it more accessible than single-cell RNA-seq (scRNA-seq), in which the focus is on assessing heterogeneity. For these reasons, bulk RNA-seq datasets are frequent in the public domain. However, both methods have their limitations. scRNA-seq is more cost-intensive and suffers from cell dropout and reduced coverage of genes, and physiologically occurring fluctuations in expression are often overrepresented. Bulk RNA-seq, on the other hand, measures gene expression in mixtures of cells and, consequently, cannot distinguish between low-abundance transcripts in large cell populations and high-abundance transcripts in small populations. It is the focus of this section to present tools for the in-depth analysis of bulk RNA-seq datasets that non-specialists can make use of. Nonetheless, the involvement of a trained bioinformatician is highly recommended, both to fully evaluate sequencing data quality and as support in learning and applying the tools presented in this paper. In addition, even the best and most sophisticated analysis approaches cannot compensate for low-quality data, and a bioinformatician can point out the limitations of the original data. Which approach is the right one depends on the question at hand and is up to the investigator to determine. In addition, when making use of public datasets or analyzing their own datasets, users are advised to critically assess the study design under which the data were generated, whether homogeneity of the sample was ensured, and whether appropriate control experiments were executed for the reported claims. Specifically, we recommend making sure that the expected outcomes of the dataset, for example a
transgene expression profile, have been satisfied and the data quality metrics are acceptable. Again, we recommend seeking the support of a trained bioinformatician, if needed, for this crucial initial aspect of the data analysis.

RNA-seq allows the analysis of protein-coding mRNAs and ncRNAs such as ribosomal RNA (rRNA) or microRNA (miRNA). For this purpose, high-quality total RNA is extracted from cells or tissues. Different sub-populations of RNA can be enriched or depleted to increase sequencing depth. Ribodepletion, which removes the abundant rRNA but leaves the full diversity of other RNAs intact, is usually carried out as an enrichment step for non-ribosomal RNA. In cases where RNA subpopulations are in focus, other isolation protocols can be applied, e.g. size selection for long ncRNAs or small ncRNAs, or poly-A enrichment to specifically enrich mRNA. Following the choice of RNA subsets, the RNA is converted to complementary DNA (cDNA) by reverse transcription and sequencing adaptors are added to one or both ends of the cDNA fragments. After amplification of the fragments, the RNA-seq library can be sequenced by various paradigms using NGS platforms [76, 77]. When performing RNA-seq, normalization for sequencing depth and gene length, which permits comparison of results between genes and samples, is achieved by one of three measures: Reads Per Kilobase Million (RPKM) for single-end RNA-seq, Fragments Per Kilobase Million (FPKM) for paired-end RNA-seq, or Transcripts Per Million (TPM), which can be used for both sequencing paradigms. As of now, validation of RNA-seq experiments by qPCR is a standard in good experimental practice, and is a useful starting point when building hypotheses on public domain datasets. However, it must be noted that qPCR is a sensitive methodology for detecting relative levels of a particular transcript, whereas RNA-seq datasets are limited by their sequencing depth. This aspect can be appreciated particularly when comparing scRNA-seq with bulk RNA-seq, as mentioned in the section comparing these two types of analysis. Thus, less abundant transcripts may be absent from sequencing datasets, while they are often detected by qPCR. Therefore, validation of less abundant genes may not yield comparable results by the two methods. Whether this experimental practice will be upheld in the future will depend, among other factors, on the abundance of available datasets on a given physiological context in the public domain.

Platforms and databases
RNA-seq datasets can be accessed through various databases, including GEO Accession Viewer, R2 and ARCHS4 [78]. Available data formats on GEO vary greatly among datasets, ranging from spreadsheets and .txt files to graphic formats such as .bedgraph or .bigWig, as no standardized upload criteria are defined for this repository. For direct analysis of GEO RNA-seq data, GEO2R may be used to perform differential expression analysis. The R2 Genomics Analysis and Visualization Platform is another option for exploring and analyzing gene expression data. It contains datasets from large numbers of array-type gene expression profiling studies together with bulk RNA-seq, scRNA-seq and some ChIP-seq datasets. Any public dataset can be added to the R2 platform upon request using the accession ID of the dataset. This aspect is similar to the GREIN platform, which will be discussed below. The R2 platform allows users to explore gene expression data in multiple ways, including the correlation of genes (with other genes and with sample groups) and the analysis of differential expression between groups (by DESeq2 or other tests). Within the framework of R2, data can also be subjected to KEGG pathway analysis between groups or by correlation. R2 further provides the option to parametrically analyze gene set enrichment (PAGE) [79], to perform survival analysis (Kaplan-Meier) and gene ontology analysis for suitable datasets, and to create classic PCA plots, volcano plots and heatmaps, as well as UpSet plots.

For RNA-seq data, some databases provide data quality information. For example, the GREIN platform provides information about sequence alignment scores, duplicate reads and sequence counts for each sample, and indicates whether the sample passed the quality test or not. Likewise, RNA-seq data can be expected to be enriched in exonic sequences, and, hence, the overrepresentation of exonic sequences in RNA-seq data can be regarded as a sign of confidence. We strongly recommend that users make sure the data quality is acceptable before in-depth analysis of a particular dataset. In case no quality information is provided, users are advised to check for expected expression profiles, i.e. whether appropriate housekeeping or marker genes for the given context are present, and how many genes have average counts above a given number, thereby ensuring a good statistical basis for differential gene expression analysis.

A particular challenge when working with data from different sources, especially when the data come from older studies, is the often ambiguous gene nomenclature. In the past, multiple alternative paradigms were developed for gene identification, resulting in many genes having been assigned multiple names. In such cases, it is up to the researchers themselves to identify alternative or redundant gene names. Tools for the conversion of common nomenclatures, such as BioTools.fr and g:Convert of the g:Profiler toolset [50], can be helpful here.

Below we present an analysis pipeline to make use of RNA-seq datasets retrieved from the public domain, following a defined workflow. A schematic of this workflow is shown in Fig. 3 A, and exemplary outputs are displayed in Fig. 3 B-F.
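The normalization measures named earlier in this section (RPKM/FPKM and TPM) differ mainly in the order of the two scaling steps. The following is a minimal sketch in Python, assuming a single sample for which raw read counts and gene lengths in base pairs are available; the function names and the toy numbers are hypothetical and are not taken from any of the tools discussed here.

```python
def rpkm(counts, lengths_bp):
    """RPKM/FPKM for one sample: scale by library size, then by gene length."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no mapped reads")
    per_million = total / 1_000_000
    return [c / per_million / (length / 1_000)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM for one sample: scale by gene length first, then rescale to one million."""
    rates = [c / (length / 1_000) for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    if total_rate == 0:
        raise ValueError("sample contains no mapped reads")
    return [r / total_rate * 1_000_000 for r in rates]


# Hypothetical toy sample: three genes with raw counts and lengths in bp.
counts = [10, 20, 30]
lengths_bp = [1_000, 2_000, 500]

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

# TPM values of a sample sum to one million by construction;
# RPKM values generally do not, which complicates comparison across samples.
print("sum of TPM: ", sum(tpm_values))
print("sum of RPKM:", sum(rpkm_values))
```

Because TPM values add up to the same total in every sample, they are often the more convenient choice when expression values from several public datasets are placed side by side.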
Fig. 3 Exemplary RNA-seq analysis pipeline and outputs. (A) Exemplary workflow and suggested tools, (B) Scatterplot and (B’) heatmap obtained
by WIlsON analysis, (C) visualization of dataset intersections in an UpSet plot and Venn diagram, (D) bar plot and (D’) pie diagram of GO terms obtained
with PANTHER, (E) interaction network obtained by STRING analysis and (F) motif enrichment of differentially expressed genes and (F’) predicted
motif interactions using ISMARA to assess potential transcriptional regulators
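The set intersections visualized in panel C rest on a binary membership table, with 1 for genes that are regulated in a given comparison and 0 otherwise. The sketch below shows one way such a table could be built and written as .csv text in Python. The gene names, comparison names and function names are hypothetical, and the accompanying .json description file required by the UpSet website is not covered here.

```python
import csv
import io


def binary_membership(gene_sets):
    """Return (header, rows) with 1 where a gene belongs to a set, else 0."""
    names = list(gene_sets)
    members = {name: set(genes) for name, genes in gene_sets.items()}
    all_genes = sorted(set().union(*members.values()))
    rows = [[gene] + [1 if gene in members[name] else 0 for name in names]
            for gene in all_genes]
    return ["gene"] + names, rows


def to_csv(header, rows):
    """Serialize the table as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def genes_in_all(rows):
    """Genes that are flagged 1 in every set, i.e. the full intersection."""
    return [row[0] for row in rows if all(row[1:])]


# Hypothetical lists of regulated genes from two comparisons.
gene_sets = {
    "treatA_vs_ctrl": ["Sox2", "Pax6", "Gapdh"],
    "treatB_vs_ctrl": ["Pax6", "Neurod1"],
}

header, rows = binary_membership(gene_sets)
print(to_csv(header, rows))
print("shared by all sets:", genes_in_all(rows))
```

Additional columns, for example fold changes, could be appended to each row before writing the file.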
available here, and can be used as a template for the construction of the .json and the corresponding .csv files [82]. To generate the .csv file, a binary data spreadsheet file first has to be created from the RNA-seq data. A simple binary transformation can be 1 for regulated genes and 0 for non-regulated genes between two conditions. Once the binary sets of interest are created, a .csv spreadsheet file should be set up with user-defined headers for the binary data columns. Additional columns, e.g. for fold changes or reads, can be included in this UpSet file so that these data can be visualized alongside the set intersections. On the UpSet website, one needs to input the .json file created as described above. The genes comprising any particular intersection of interest can be viewed on the site under 'Query Results' after selecting an intersection on the UpSet plot (see Fig. 3 C). Specific genes of interest can also be searched in the 'Query Filters' menu. The list of genes or elements displayed on the platform can be copied and pasted into a spreadsheet file (after selection of .txt). For UpSet plots, Venn diagrams, or any other displayed features, only a screenshot option is available for storing the data.

Functional analysis
Gene Ontology (GO) analysis is commonly executed following differential gene expression analysis to assess functions of genes and gene products. GOs are built as a transdisciplinary endeavor between Molecular Biology, Computer Science and Linguistics/Philosophy and, in line with the ongoing progress of research, are continuously updated with the latest empirical evidence [83]. Different tools for GO term analysis exist, which build upon different logics, sources and gene concepts. Therefore, it is recommended to use multiple options for a deeper and more comprehensive understanding of the biological context under investigation.

One widely used GO analysis tool is PANTHER [40, 41]. This tool builds on a knowledge base curated by the Gene Ontology Consortium [84, 85]. PANTHER takes a list of gene names as input (several ID systems are supported, including Ensembl and UniProt) and can work with a large number of different species. As output, the GO terms Molecular Function, Biological Process, Cellular Component, Protein Class, and Pathway are available (see Fig. 3 D and D’ for an exemplary analysis) and various statistical tests can be performed. A full list of genes in the analyzed gene set that are associated with each pathway in the dropdown menu can be obtained via the hyperlink associated with each GO term. One drawback of PANTHER is the low quality of the plots it produces, but this can be bypassed by downloading the data directly and plotting with R or any other data analysis platform of choice. Other GO term analysis tools include DAVID [44, 45], KEGG PATHWAY Database [46–48] and STRING [42, 43]. However, it should be noted that the main functionality of STRING is to provide information on protein-protein interactions of gene products, as described below. Since the output of most of these tools is a complex hierarchy of GO terms, another useful tool is REVIGO, which can be applied to reduce functional redundancy of GO term lists and visualize the results [49]. One downside of REVIGO is that it requires GO term IDs as input. Depending on the output format of the preceding GO term analysis step, it may become necessary to retrieve these IDs manually. Further discussion of the GO analysis tools described above can be found elsewhere [86].

Enrichr is another interactive and collaborative gene list enrichment analysis tool, which can be applied to various genomics data, including data obtained from ChIP-seq and ATAC-seq experiments [39]. The required input format for Enrichr is Entrez gene symbols. The program allows users to query a given list of input gene symbols for various characteristics, such as consensus TFs, lncRNA, epigenetic roadmaps of histone marks, and various other enrichment paradigms that may be associated with these genes. In contrast to GO term analysis tools like PANTHER or DAVID, which perform population-based statistics and therefore perform more reliably on larger gene sets, Enrichr can be queried for any number of genes, even single genes.

Network analysis
A frequent feature of transcription control is the reciprocal regulation of gene activities, including feedback and feedforward loops, both of which can show highly complex dynamics and often operate in parallel. Such multifactorial regulatory networks can be explored in silico with the help of computational approaches.

STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) can be used with gene lists derived from RNA-seq data to examine whether functional relationships may exist among gene products. STRING requires a list of gene names as input and performs network analysis on them, making use of the STRING database of known and predicted protein-protein interactions [42, 43]. A network of connected genes, displayed as a cloud of spheres and lines, is given as output (see Fig. 3 E). This can be assessed interactively and subjected to clustering analysis. The graphical output may be customized in its visual appearance and downloaded in high image quality.

RNA-seq data can further be subjected to a reverse analysis of gene expression regulatory networks. The aim here is to predict transcriptional regulators that may function as upstream regulators of genes that were identified as differentially expressed in a given RNA-seq
experiment. One such approach is the web-based tool ISMARA (Integrated System for Motif Activity Response Analysis) [51]. It is designed to perform motif discovery and to predict key TFs and miRNAs, which may be critical for the changes in gene expression observed in a given experiment. For motif analysis, the tool only requires raw gene expression data as input (RNA-seq or microarray data) from a set of biological samples, uploaded as .fastq or .bed/.bam/.sam alignment files. These input data can be directly used for automatic processing and modelling, based on pre-calculated annotations of hundreds of regulatory sites of several mammalian genomes. Once the analysis is complete, ISMARA provides a table with the motif activities found in the samples, sorted by the significance score, which ISMARA assigns to each motif. Besides motif names and significance scores expressed as z-values, the output file also includes gene names of TFs associated with the motif, the activity profiles across samples, and the consensus binding sequences of TFs, termed logos (see Fig. 3 F). Each of the listed motifs is further linked to a separate results page containing additional information. This includes the top target genes known to be regulated by the motif, the target gene network according to the STRING database [42, 43], the respective gene ontology analysis of various categories, as well as predicted direct regulatory interactions between this and other motifs (see Fig. 3 F’). All collected information, together with high-resolution images, can be downloaded from the website. Repeating ISMARA analyses with sample averaging emphasizes contrasts between sample groups (e.g. treated vs. non-treated). ISMARA thereby allows the annotation of replicates and calculates motif activity profiles that are averaged over these replicates, and thus enables a simple initial analysis of possible regulatory networks. However, ISMARA predicts the TF motifs based only on proximal promoters. This feature can be a shortcoming of the tool, as many TFs predominantly bind to distal or intragenic control regions of gene expression, like enhancers, rather than to proximal promoters. Finally, ISMARA can also be applied to ChIP-seq data for motif discovery, similar to MEME-ChIP described earlier. Alternative transcription factor binding site analysis tools are oPOSSUM [52] and RSAT network-interactions [28].

Further applications of RNA-seq
Bulk RNA-seq determines the average expression level of individual genes over a large and often inhomogeneous starting cell population. This approach can deliver a wide range of information in various experimental setups but may not be sufficient when cellular and spatial levels need to be considered. Spatial transcriptomics and scRNA-seq are two new, sophisticated methods that fill these gaps. scRNA-seq allows the transcriptome of individual cells to be read in great depth and thus delivers information on gene expression with cellular precision. This technical advance has greatly changed how gene expression is studied in biology and biomedicine. The boom in this technology led to an exponential increase in available scRNA-seq datasets, the navigation through which can be challenging. The Human Cell Atlas project pursues the ambitious goal of mapping every cell type in the human body. A comprehensive, manually curated and searchable list of single-cell transcriptomics studies, indexed by publication and including meta-data such as cell source, type of analysis, and protocol used, can be found in [87]. scRNA-seq datasets can be accessed through GEO Accession Viewer, but numerous other collections exist, with the Single Cell Expression Atlas hosted by EMBL-EBI and the Cell Types RNA-Seq Atlas of the Allen Brain Institute, which contains transcriptomic information from mouse and human cortex, being just two examples of many. Upon publication, pre-analyzed scRNA-seq datasets are often made accessible via interactive web applications, frequently presented as visually appealing Shiny Apps. These can be employed for in-depth assessment of individual genes and cell cohorts but mostly must be accessed through a link given in the respective original publication. Readers specifically interested in bioinformatic analysis of scRNA-seq data are referred to the large number of excellent recent reviews on this topic, such as [88–91]. A web-based, manually curated catalogue of software tools for the analysis of scRNA-seq data is the scRNA-tools database.

While scRNA-seq preserves the cell-to-cell heterogeneity in populations of cells, spatial transcriptomics retains the spatial information of transcripts within tissues [92–94]. The method has quickly expanded in the last few years to include applications to epigenome sequencing via chromatin state profiling [95] and to ATAC-sequencing [96]. The positional information is obtained via arrayed barcoded oligonucleotides that are hybridized to overlaid tissue specimens. These approaches also allow multimodal spatial profiling, for example with antibody-based protein barcoding approaches in parallel to next-generation RNA-seq [97]. The tools and platforms described in the current review are also applicable to spatial -Omics datasets once appropriate transformation of the data and clustering is performed to separate the positional information from the sequencing data. For example, RNA-seq data obtained from spatial transcriptomics can be analyzed by DESeq2 to compare read counts derived from the raw data (.fastq) of two regions of interest and thereby obtain differential gene expression between them. In addition, spatial transcriptomics data can be explored with the help of tools
like the 10x Genomics Loupe Browser or several specialized software packages [98], many of which can be accessed through bio.tools but require more programming experience.

Integrative data analysis
In order to maximize the opportunities for insight that computational analysis tools provide, it is often necessary to triangulate and integrate information from various sources. Several tools and pipelines can be employed to bioinformatically integrate data from various sequencing experiments. One useful tool to annotate ChIP-seq peaks with the two closest genes is RnaChipIntegrator. However, unlike most of the other tools described in this review, RnaChipIntegrator requires some programming and command line experience. Using RNA-seq results as a starting point, the RSAT module retrieve-sequences allows the extraction of upstream, downstream or open reading frame sequences [99], while RSAT retrieve-ensembl-seq retrieves sequences of promoters or other specified features on the fly from Ensembl [28]. These promoter regions can be subjected to downstream motif analysis to discover potential TF binding sites using RSAT network-interactions, as well as overlapped with relevant ChIP-seq data using bedtools intersect. In cases where ATAC-seq data are available, the intersection of promoters and TF occupancy can be further refined by information about chromatin accessibility, thus integrating data from three different sequencing paradigms. Subsequently, after conversion of the peak data to gene sets, e.g. following a Galaxy tutorial, datasets can be subjected to STRING and thus interrogated for an in silico prediction of protein-protein interactions. Finally, once the above strategies have revealed genomic binding of one or more DNA-binding proteins in close proximity, proteomics databases such as PRIDE (PRoteomics IDEntifications Database) or BioGRID can be interrogated to determine whether corresponding protein-protein interactions have already been detected in similar biological systems [100–102].

Conclusions
Traditionally, the epistemic culture in Molecular Biology used to follow a unidirectional path from hypothesis to data acquisition [103, 104]. In the post-genomics era, Biology has been increasingly informed by informatics in order to cope with large-scale datasets produced by whole-genome sequencing approaches. Bioinformatics has since evolved as a subdiscipline of Molecular Biology, but the two research disciplines still need to be more fully integrated.

In this review, we present general resources and an exemplary analysis pipeline that integrates publicly available data types and multiple research methodologies. The use of published genomics data together with multi-layered data integration may constitute a new epistemic practice to uncover biological functions as well as their relationality in space and time. As an added benefit, resources may be used more sustainably, as new hypotheses can first be tested in silico before moving to experiments in the wet lab. Indeed, in recent years an increasing number of researchers have integrated their own results with public datasets and used bioinformatic tools in their analysis, similar to what we propose in this review. This includes such broad applications as cellular senescence [105], carcinogenesis [38, 106], or immunology [107, 108].

Still, this approach necessitates a reciprocal reflection on the object of inquiry and the methodology used, and questions such as the following should therefore be asked: What kind of data are available and which data might be lacking to complement the picture? How were the data produced, and what bias may have been introduced? Can biological contexts be compared (e.g. because of evolutionary relation) or should they be considered separately? What are relevant and ontologically meaningful controls? The latter point is particularly important when multiple datasets are compared and corrections for multiple testing need to be applied. Because bioinformatic tools are continuously developed and new genomics datasets become available, the approach presented here must be considered a procedural activity constantly in flux rather than a fixed pipeline. Experimental validation of bioinformatically derived hypotheses and in silico predictions should be triangulated with in vitro and in vivo approaches to bridge the gap between disciplinary languages, and to gain a deeper insight into the objects of inquiry in both material and informational dimensions.

Abbreviations
AME Analysis of Motif Enrichment
ATAC-seq Assay for Transposase-Accessible Chromatin using sequencing
ChIP-seq Chromatin Immunoprecipitation followed by sequencing
cDNA complementary DNA
CLARION generiC fiLe formAt foR quantItative cOmparsions of high throughput screeNs
CUT&RUN Cleavage Under Targets and Release Using Nuclease
CUT&Tag Cleavage Under Targets and TAGmentation
DamID DNA adenine methyltransferase IDentification
DNase-seq DNase I hypersensitive sites followed by sequencing
EMBL-EBI European Molecular Biology Laboratory-European Bioinformatics Institute
ENCODE Encyclopedia of DNA Elements
FAIRE Formaldehyde-Assisted Isolation of Regulatory Elements
GO Gene Ontology
GEO Gene Expression Omnibus
GREAT Genomic Regions Enrichment of Annotations Tool
24. Galaxy. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Res. 2022;50(W1):W345–51.
25. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2. https://doi.org/10.1093/bioinformatics/btq033.
26. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, et al. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010;28(5):495–501.
27. Machanick P, Bailey TL. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics. 2011;27(12):1696–7.
28. Santana-Garcia W, Castro-Mondragon JA, Padilla-Gálvez M, Nguyen NTT, Elizondo-Salas A, Ksouri N, et al. RSAT 2022: regulatory sequence analysis tools. Nucleic Acids Res. 2022;50(W1):W670–6.
29. Andrews S, et al. FastQC: a quality control tool for high throughput sequence data. Babraham Institute, Cambridge, United Kingdom: Babraham Bioinformatics; 2010.
30. Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu Y, et al. The UCSC genome browser database. Nucleic Acids Res. 2003;31(1):51–4.
31. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24–6.
32. Li D, Purushotham D, Harrison JK, Hsu S, Zhuo X, Fan C, et al. WashU Epigenome Browser update 2022. Nucleic Acids Res. 2022;50(W1):W774–81.
33. Jia A, Xu L, Wang Y. Venn diagrams in bioinformatics. Brief Bioinform. 2021;22(5):bbab108.
34. Lex A, Gehlenborg N. Points of view: Sets and intersections. Nat Methods. 2014;11(8):779.
35. Schultheis H, Kuenne C, Preussner J, Wiegandt R, Fust A, Bentsen M, et al. WIlsON: web-based interactive omics visualization. Bioinformatics. 2018;35(6):1055–7.
36. Mahi NA, Najafabadi MF, Pilarczyk M, Kouril M, Medvedovic M. GREIN: An interactive web platform for re-analyzing GEO RNA-seq data. Sci Rep. 2019;9(1):1–9.
37. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):1–21.
38. Nagel S, Meyer C. Normal and Aberrant TALE-Class Homeobox Gene Activities in Pro-B-Cells and B-Cell Precursor Acute Lymphoblastic Leukemia. Int J Mol Sci. 2022;23(19):11874.
39. Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14(1):1–14.
40. Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, et al. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003;13(9):2129–41.
41. Mi H, Ebert D, Muruganujan A, Mills C, Albou LP, Mushayamaha T, et al. PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Res. 2021;49(D1):D394–403.
42. Snel B, Lehmann G, Bork P, Huynen MA. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 2000;28(18):3442–4.
49. Supek F, Bošnjak M, Škunca N, Šmuc T. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS ONE. 2011;6(7):e21800.
50. Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 2019;47(W1):W191–8.
51. Balwierz PJ, Pachkov M, Arnold P, Gruber AJ, Zavolan M, van Nimwegen E. ISMARA: automated modeling of genomic signals as a democracy of regulatory motifs. Genome Res. 2014;24(5):869–84.
52. Ho Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW. OPOSSUM: integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Res. 2007;35(suppl_2):W245–52.
53. Niu J, Denisko D, Hoffman MM. The Browser Extensible Data (BED) format. File Format Stand. 2022;1:8.
54. Solomon MJ, Varshavsky A. Formaldehyde-mediated DNA-protein crosslinking: a probe for in vivo chromatin structures. Proc Natl Acad Sci. 1985;82(19):6470–4.
55. Zheng R, Wan C, Mei S, Qin Q, Wu Q, Sun H, et al. Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Res. 2019;47(D1):D729–35.
56. Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, et al. GENCODE 2021. Nucleic Acids Res. 2021;49(D1):D916–23.
57. Ross-Innes CS, Stark R, Teschendorff AE, Holmes KA, Ali HR, Dunning MJ, et al. Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature. 2012;481(7381):389–93.
58. Zeitlinger J. Seven myths of how transcription factors read the cis-regulatory code. Curr Opin Syst Biol. 2020;23:22–31.
59. Whitington T, Frith MC, Johnson J, Bailey TL. Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res. 2011;39(15):e98–e98.
60. Guo Y, Mahony S, Gifford DK. High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput Biol. 2012;8(8):e1002638.
61. Khan A, Mathelier A. Intervene: a tool for intersection and visualization of multiple gene or genomic region sets. BMC Bioinformatics. 2017;18(1):1–8.
62. Rhee HS, Pugh BF. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell. 2011;147(6):1408–19.
63. Blat Y, Kleckner N. Cohesins bind to preferential sites along yeast chromosome III, with differential regulation along arms versus the centric region. Cell. 1999;98(2):249–59.
64. Steensel Bv, Henikoff S. Identification of in vivo DNA targets of chromatin proteins using tethered dam methyltransferase. Nat Biotechnol. 2000;18(4):424–8.
65. Skene PJ, Henikoff S. An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. eLife. 2017;6:e21856.
66. Kaya-Okur HS, Wu SJ, Codomo CA, Pledger ES, Bryson TD, Henikoff JG, et al. CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nat Commun. 2019;10(1):1–10.
67. Yu F, Sankaran VG, Yuan GC. CUT&RUNTools 2.0: a pipeline for single-cell and bulk-level CUT&RUN and CUT&Tag data analysis. Bioinformatics. 2022;38(1):252–4.
43. Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, et al. The 68. Bednar JB, Hamiche A, Dimitrov SI. H1-nucleosome interac-
STRING database in 2021: customizable protein-protein networks, and tions and their functional implications. Biochim Biophys Acta.
functional characterization of user-uploaded gene/measurement sets. 2016;1859(3):436–43.
Nucleic Acids Res. 2021;49(D1):D605–12. 69. Buenrostro JD, Wu B, Chang HY, Greenleaf WJ. ATAC-Seq: A method for
44. Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, et al. DAVID: assaying chromatin accessibility genome-wide. Curr Protoc Mol Biol.
database for annotation, visualization, and integrated discovery. 2015;109(1):21.29.1–9. https://doi.org/10.1002/0471142727.mb212
Genome Biol. 2003;4(9):1–11. 9s109.
45. Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, DAVID: a web 70. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, et al. Genome-
server for functional enrichment analysis and functional annotation of wide mapping of DNase hypersensitive sites using massively parallel
gene lists, et al. update). Nucleic Acids Res. 2021;2022:10. signature sequencing (MPSS). Genome Res. 2006;16(1):123–31.
46. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. 71. Mieczkowski J, Cook A, Bowman SK, Mueller B, Alver BH, Kundu S, et al.
Nucleic Acids Res. 2000;28(1):27–30. MNase titration reveals differences between nucleosome occupancy
47. Kanehisa M. Toward understanding the origin and evolution of cellular and chromatin accessibility. Nat Commun. 2016;7(1):1–11.
organisms. Protein Sci. 2019;28(11):1947–51. 72. Giresi PG, Kim J, McDaniell RM, Iyer VR, Lieb JD. FAIRE ((F) under-baror-
48. Kanehisa M, Furumichi M, Sato Y, Kawashima M, Ishiguro-Watanabe M. maldehyde-(A) under-barssisted (I) under-barsolation of (R) under-
KEGG for taxonomy-based analysis of pathways and genomes. Nucleic baregulatory (E) under-barlements) isolates active regulatory elements
Acids Res. 2023;51(D1):D587–92. from human chromatin. Genome Res. 2007;17(6):877–85.
73. Goryshin IY, Reznikoff WS. Tn5 in Vitro Transposition. J Biol Chem. 1998;273(13):7367–74. https://doi.org/10.1074/jbc.273.13.7367.
74. Cusanovich DA, Daza R, Adey A, Pliner HA, Christiansen L, Gunderson KL, et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science. 2015;348(6237):910–4.
75. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP, et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature. 2015;523(7561):486–90.
76. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.
77. Kukurba KR, Montgomery SB. RNA sequencing and analysis. Cold Spring Harb Protoc. 2015;2015(11):pdb.top084970.
78. Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, et al. Massive mining of publicly available RNA-seq data from human and mouse. Nat Commun. 2018;9(1):1–10.
79. Kim SY, Volsky DJ. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics. 2005;6(1):1–12.
80. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75(23):7537–41.
81. Cock PJ, Grüning BA, Paszkiewicz K, Pritchard L. Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology. PeerJ. 2013;1:e167.
82. Spitzer D, Khel MI, Pütz T, Zinke J, Jia X, Sommer K, et al. A flow cytometry-based protocol for syngenic isolation of neurovascular unit cells from mouse and human tissues. Nat Protoc. 2023;18(5):1510–42.
83. Schulze-Kremer S. Ontologies for molecular biology and bioinformatics. In Silico Biol. 2002;2(3):179–93.
84. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–9.
85. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021;49(D1):D325–34.
86. Calderón-González KG, Hernández-Monge J, Herrera-Aguirre ME, Luna-Arias JP. Bioinformatics tools for proteomics data interpretation. Mod Proteomics-Sample Prep Anal Pract Appl. 2016;919:281–341.
87. Svensson V, da Veiga Beltrame E, Pachter L. A curated database reveals trends in single-cell transcriptomics. Database. 2020;2020:baaa073. Available at https://academic.oup.com/database/article/doi/10.1093/database/baaa073/6008692.
88. Chen G, Ning B, Shi T. Single-cell RNA-seq technologies and related computational data analysis. Front Genet. 2019;10:317.
89. Kharchenko PV. The triumphs and limitations of computational methods for scRNA-seq. Nat Methods. 2021;18(7):723–32.
90. Carangelo G, Magi A, Semeraro R. From multitude to singularity: An up-to-date overview of scRNA-seq data generation and analysis. Front Genet. 2022;13:994069.
91. Zappia L, Theis FJ. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol. 2021;22:1–18.
92. Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016;353(6294):78–82.
93. Williams CG, Lee HJ, Asatsuma T, Vento-Tormo R, Haque A. An introduction to spatial transcriptomics for biomedical research. Genome Med. 2022;14(1):1–18.
94. Armingol E, Ghaddar A, Joshi CJ, Baghdassarian H, Shamie I, Chan J, et al. Inferring a spatial code of cell-cell interactions across a whole animal body. PLoS Comput Biol. 2022;18(11):e1010715.
95. Deng Y, Bartosovic M, Kukanja P, Zhang D, Liu Y, Su G, et al. Spatial-CUT&Tag: spatially resolved chromatin modification profiling at the cellular level. Science. 2022;375(6581):681–6.
96. Llorens-Bobadilla E, Zamboni M, Marklund M, Bhalla N, Chen X, Hartman J, et al. Chromatin accessibility profiling in tissue sections by spatial ATAC. bioRxiv. 2022. https://doi.org/10.1101/2022.07.27.500203. https://www.biorxiv.org/content/early/2022/07/29/2022.07.27.500203.
97. Liu Y, Yang M, Deng Y, Su G, Enninful A, Guo CC, et al. High-spatial-resolution multi-omics sequencing via deterministic barcoding in tissue. Cell. 2020;183(6):1665–81.
98. Pardo B, Spangler A, Weber LM, Page SC, Hicks SC, Jaffe AE, et al. spatialLIBD: an R/Bioconductor package to visualize spatially-resolved transcriptomics data. BMC Genomics. 2022;23(1):434.
99. Nguyen NTT, Contreras-Moreira B, Castro-Mondragon JA, Santana-Garcia W, Ossio R, Robles-Espinoza CD, et al. RSAT 2018: regulatory sequence analysis tools 20th anniversary. Nucleic Acids Res. 2018;46(W1):W209–14.
100. Martens L, Hermjakob H, Jones P, Adamski M, Taylor C, States D, et al. PRIDE: the proteomics identifications database. Proteomics. 2005;5(13):3537–45.
101. Perez-Riverol Y, Bai J, Bandla C, García-Seisdedos D, Hewapathirana S, Kamatchinathan S, et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 2022;50(D1):D543–52.
102. Oughtred R, Rust J, Chang C, Breitkreutz BJ, Stark C, Willems A, et al. The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 2021;30(1):187–200.
103. Rheinberger HJ. Toward a history of epistemic things: Synthesizing proteins in the test tube. Stanford University Press; 1997.
104. Cetina KK. Epistemic cultures: How the sciences make knowledge. Harvard University Press; 1999.
105. Song Q, Hou Y, Zhang Y, Liu J, Wang Y, Fu J, et al. Integrated multi-omics approach revealed cellular senescence landscape. Nucleic Acids Res. 2022;50(19):10947–63.
106. Naik A, Dalpatraj N, Thakur N. Global Gene Expression Regulation Mediated by TGFβ Through H3K9me3 Mark. Cancer Informat. 2022;21:11769351221115136.
107. Moorlag SJ, Matzaraki V, van Puffelen JH, van der Heijden C, Keating S, Groh L, et al. An integrative genomics approach identifies KDM4 as a modulator of trained immunity. Eur J Immunol. 2022;52(3):431–46.
108. Jones K, Ramirez-Perez S, Niu S, Gangishetti U, Drissi H, Bhattaram P. SOX4 and RELA Function as Transcriptional Partners to Regulate the Expression of TNF-Responsive Genes in Fibroblast-Like Synoviocytes. Front Immunol. 2022;13:789349.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.