
nature methods

Perspective https://doi.org/10.1038/s41592-023-01832-z

Julia for biologists

Elisabeth Roesch1,2,3, Joe G. Greener4, Adam L. MacLean5, Huda Nassar6, Christopher Rackauckas3,7,8, Timothy E. Holy9 & Michael P. H. Stumpf1,2,10,11

Received: 21 September 2021
Accepted: 27 February 2023
Published online: 6 April 2023

Major computational challenges exist in relation to the collection, curation,
processing and analysis of large genomic and imaging datasets, as well as the
simulation of larger and more realistic models in systems biology. Here we
discuss how a relative newcomer among programming languages—Julia—is
poised to meet the current and emerging demands in the computational
biosciences and beyond. Speed, flexibility, a thriving package ecosystem
and readability are major factors that make high-performance computing
and data analysis available to an unprecedented degree. We highlight how
Julia’s design is already enabling new ways of analyzing biological data and
systems, and we provide a list of resources that can facilitate the transition
into Julian computing.

Computers are tools. Like pipettes or centrifuges, they allow us to perform tasks more quickly or efficiently, and like microscopes, they give us new, more detailed insights into biological systems and data. Computers allow us to develop, simulate and test mathematical models of biology and compare models with complex datasets. As computational power evolved, solving biological problems computationally became possible, then popular and, eventually, necessary1. Entire fields such as computational biology and bioinformatics emerged. Without computers, the reconstruction of structures from X-ray crystallography, NMR or cryogenic electron microscopy methods would be impossible. The same goes for the 1000 Genomes Project2, which used computer programs to assemble and analyze the DNA sequences generated. More recently, vaccine development has benefited from advances in algorithms and computer hardware3.

Programming languages are also tools. They make it possible to instruct computers. Some languages are good at specific tasks (think Perl for string processing tasks or R for statistical analyses), whereas others, including C/C++ and Python, have been used with success across many different domains. In biomedical research, the prevailing languages have arguably been R4 and Python5. Much of the high-performance backbone supporting computationally intensive research that is hidden from most users, however, continues to rely on C/C++ or Fortran. Computationally intensive studies are often initially designed and prototyped in R, Python or MATLAB and subsequently translated into C/C++ or Fortran for increased performance. This is known as the two-language problem6.

This two-language approach has been successful but has limitations (Fig. 1a). When moving an implementation from one language to another, faster, programming language, verbatim translation may not be the optimal route: faster languages often provide the programmer with higher autonomy to choose how memory is accessed or allocated or to employ more flexible data structures7. Exploiting such features may involve a complete rewrite of the algorithm to ensure faster implementation or better scaling as datasets grow in size and complexity. This requires expertise across both languages, but also rigorous testing of the code in both languages.

Julia8 is a relatively new programming language that overcomes the two-language problem. Users do not have to choose between ease of use and high performance. Julia has been designed to be easy to program in and fast to execute (Fig. 1b). This efficiency and the growing ecosystem of state-of-the-art application packages (Table 1 and Fig. 2) and introductions7,9 make it an attractive choice for biologists.

1School of Mathematics and Statistics, University of Melbourne, Melbourne, Victoria, Australia. 2Melbourne Integrative Genomics, University of
Melbourne, Melbourne, Victoria, Australia. 3JuliaHub, Somerville, MA, USA. 4Medical Research Council Laboratory of Molecular Biology, Cambridge,
UK. 5Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA. 6RelationalAI, Berkeley, CA,
USA. 7Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA. 8Pumas-AI, Centreville, VA, USA. 9Departments of
Neuroscience and Biomedical Engineering, Washington University in St. Louis, St. Louis, MO, USA. 10School of BioSciences, The University of Melbourne,
Melbourne, Victoria, Australia. 11ARC Centre of Excellence for the Mathematical Analysis of Cellular Systems, Melbourne, Victoria, Australia.
e-mail: [email protected]

Nature Methods | Volume 20 | May 2023 | 655–664 655



[Figure 1, panels a-c; image not reproduced. Recoverable labels: new biology; base camp R/Python/MATLAB and base camp Julia; abstraction, speed and metaprogramming; two-language option versus Julia option; provided performance against required performance; coding effort; code complexity; add new data types; add new functions.]

Fig. 1 | Julia is a tool enabling biologists to discover new science. a, In the biological sciences, the most obvious alternatives to the programming language Julia are R, Python and MATLAB. Here we contrast the two potential pathways to new biology with a mountaineering analogy. The top of the mountain represents new biology49. There are two potential base camps for the ascent: base camp 1 (left, red) is R/Python/MATLAB. Base camp 2 (right, green) is Julia. To get to the top, the mountaineer, representing a researcher, needs to overcome certain obstacles, such as a glacier and a chasm. These represent research hurdles, such as large and diverse datasets or complex models. Starting at the Julia base camp, the mountaineer has access to efficient and effective tools, such as a bridge over the glacier and a rocket to simply fly over the chasm. These represent Julia’s top three language design features: abstraction, speed and metaprogramming. With these tools, the journey to the top of the mountain becomes much easier for the excursionist. Julia allows biologists to not be held back by the problems discussed in b and c. b, The two-language problem refers to having separate languages for algorithm development and prototyping (such as R or Python) and production runs (such as C/C++ or Fortran), respectively. Julia was designed to be good at both tasks, which can reduce programming efforts and software complexity. c, The expression problem refers to the effort required by users to define new (optimized) data types and functions that can be added to existing external code bases.
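
The expression problem described in the Fig. 1 caption can be made concrete with a toy sketch. The code below is illustrative only: every name in it (Sequence, DNASeq, RNASeq, ProteinSeq, alphabet, gc_content) is hypothetical and is not taken from BioSequences.jl or any other package. It shows the two directions of extension: adding a new function for existing types, and adding a new type that works with an existing function.

```julia
# Toy illustration of the expression problem using multiple dispatch.
# All type and function names are hypothetical, chosen for this sketch.

abstract type Sequence end

struct DNASeq <: Sequence
    bases::String
end

struct RNASeq <: Sequence
    bases::String
end

# An "existing" generic function with one method per existing type.
alphabet(::DNASeq) = "ACGT"
alphabet(::RNASeq) = "ACGU"

# Direction 1: add a new function for the existing types
# without touching their definitions.
function gc_content(s::Union{DNASeq,RNASeq})
    return count(c -> c in ('G', 'C'), s.bases) / length(s.bases)
end

# Direction 2: add a new type and make the existing function work
# with it by defining one extra method.
struct ProteinSeq <: Sequence
    residues::String
end
alphabet(::ProteinSeq) = "ACDEFGHIKLMNPQRSTVWY"

println(gc_content(DNASeq("GGCA")))            # 0.75
println(length(alphabet(ProteinSeq("MKV"))))   # 20
```

Neither extension required editing the earlier definitions, which is the property the caption refers to when code is added to an external code base.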

Biological systems and data are multifaceted by nature, and to describe them or model them mathematically requires a flexible programming language that can connect different types of highly structured data (Fig. 1c). Three hallmarks of the language make Julia particularly suitable for meeting current and emerging demands of biomedical science: speed, abstraction and metaprogramming.

In this article, we discuss each language feature and its relevance in the context of one concrete biological example per feature. An additional example per feature can be found in the Supplementary Information. Furthermore, in Supplementary Table 1, we provide a summary of why we believe Julia is a good programming language for biologists. Supporting online material is provided in a GitHub repository at


Table 1 | Julia provides a rich package ecosystem for biologists

Community | Topic | Example packages
JuliaData | Data manipulation, storage, and input and output | DataFrames.jl, JuliaDB.jl, DataFramesMeta.jl and CSV.jl
JuliaPlots | Data visualization | Plots.jl, Makie.jl, StatsPlots.jl and PlotlyJS.jl
JuliaStats | Statistics and machine learning | Distributions.jl, GLM.jl, StatsBase.jl, Distances.jl, MixedModels.jl, TimeSeries.jl, Clustering.jl, MultivariateStats.jl and HypothesisTests.jl
BioJulia | Bioinformatics and computational biology | BioSequences.jl, BioStructures.jl, BioAlignments.jl, FASTX.jl and Microbiome.jl
JuliaImages | Image processing | Images.jl, ImageSegmentation.jl, ImageTransformations.jl and ImageView.jl
EcoJulia | Ecological research | SpatialEcology.jl, EcologicalNetworks.jl, Phylo.jl and Diversity.jl
SciML | Scientific machine learning | DifferentialEquations.jl, ModelingToolkit.jl, DiffEqFlux.jl and Catalyst.jl
FluxML | Machine learning | Flux.jl, Zygote.jl, MacroTools.jl, GeometricFlux.jl and Metalhead.jl

Related packages are organized in package communities. In this table, we present an overview of the package communities we consider to be most relevant to biologists.

[Figure 2; image not reproduced. Central label: Julia for Biologists. Topic groups: bioinformatics; mathematical modeling; statistical and machine learning; data handling and visualization; integration of non-Julia code. Subgroup labels: data, domain data, visualization, tools, dimensionality reduction, statistics, traditional machine learning, deep learning, advanced models, inferences and optimization. Packages named in the figure: RCall.jl, PyCall.jl, MATLAB.jl, CxxWrap.jl, JavaCall.jl, DataFrames.jl, CSV.jl, Graphs.jl, Images.jl, Plots.jl, StatsPlots.jl, PhyloPlots.jl, PyPlot.jl, Gadfly.jl, BioSequences.jl, CellFishing.jl, FASTX.jl, Microbiome.jl, BioStructures.jl, BioAlignments.jl, PhyloNetworks.jl, MIToS.jl, TSne.jl, UMAP.jl, HypothesisTests.jl, MultivariateStats.jl, MixedModels.jl, NearestNeighbors.jl, DecisionTree.jl, Clustering.jl, MLJ.jl, GLM.jl, Flux.jl, DiffEqFlux.jl, ChainRules.jl, Zygote.jl, ModelingToolkit.jl, DifferentialEquations.jl, DynamicalSystems.jl, Catalyst.jl, Turing.jl, BifurcationKit.jl, InformationMeasures.jl, Optim.jl, GpABC.jl and JuMP.jl.]

Fig. 2 | Overview of Julia’s package ecosystem, presented by topic group. Julia consists of packages related to five main biological topics: bioinformatics,
mathematical modeling, statistical and machine learning, data handling and visualization, and the integration of non-Julia code.

https://github.com/ElisabethRoesch/Perspective_Julia_for_Biologists/tree/main/examples/Abstraction/Example_Structural_bioinformatics_with_composable_packages. First, the online material shows code for the examples discussed here. Code examples have been chosen and designed to be accessible to a wide audience. We group them based on computational focus (high- and low-level user case) and access points (for example, Julia files and interactive notebooks). Second, a summary of helpful resources for starting with Julia and building Julia solutions is provided. The latter include, for example, platform-specific Julia installation guides, links to introductory Julia courses and a selection of pointers to relevant Julia communities.

Speed
The speed of a programming language is not just a matter of convenience that allows us to complete analyses more quickly (Fig. 3). It can enable new and better science. Speed is important for analyzing large datasets10 that are becoming the norm across many areas of modern biomedical research. Slow computations might not hold back scientific discovery when performed a small number of times. However, when performed repeatedly on large datasets, the execution speed of a programming language can become the limiting factor. Similarly, simulation of large and complex computational models is only possible with fast implementations. For example, digital twins11,12 in precision medicine will be useless without fast computation.

The speed of the programming language also determines how extensively we can test statistical analysis or simulation algorithms before using them on real data. Thorough testing of a new statistical algorithm can be expected to be around two to three orders of magnitude more costly in computational terms than a single production run13. Furthermore, the quality of approximations depends on many factors (for example, the number of tested candidates14,15 and grid step sizes16), and faster code enables better analysis. Here and in the Supplementary Information we provide insights into the design features underlying Julia’s speed6. The speed rivals that of statically compiled languages such as Fortran and C/C++. Higher-level language features, hallmarks of R, Python, MATLAB and Julia, typically lead to


[Figure 3, panels a-c; image not reproduced. Recoverable content follows.]

a, Left: time (log[s]) against number of genes (200 to 1,000) for Julia and R. Right: time (log[s]) against log[error] for the ODE solvers Julia: DifferentialEquations.jl DP5, Tsit5 and Vern7; Fortran: Hairer dopri5; C: Sundials CVODE Adams; MATLAB: ode45 and ode113; Python: SciPy RK45, LSODA and odeint; R: deSolve lsoda and ode45.

b, Linear operation on matrices.
In Python: D = A × B + C, which is actually executed in C (2 function calls and 2 allocations):
    # allocate tmp
    for i in 1:n
        tmp[i] = A[i] × B[i]
    # allocate D
    for i in 1:n
        D[i] = tmp[i] + C[i]
In Julia: D .= A .× B .+ C (1 function call and no allocation):
    for i in 1:n
        D[i] = A[i] × B[i] + C[i]

c, Numbers of function calls for calculating the derivative f([x,y]), which involves 8 scalar operations. Julia: fused to 1 function call. Python: 8 function calls. Numba: fused to 1 function call. Function call costs: ~5 ns in Julia, ~150 ns in Python and ~150 ns in Numba.

Theoretically inferred and real-time calculation of f([x,y]):
Language | Time of array allocation | Time of floating point operations | Time of function calls | Inferred time | Real time
Julia | none | 8 × 2 ns | 1 × 5 ns | 21 ns | 20 ns
Python | 300 ns | 8 × 2 ns | 8 × 150 ns | 1,516 ns | 1,510 ns
Numba | 300 ns | 8 × 2 ns | 1 × 150 ns | 466 ns | 425 ns

Fig. 3 | Julia’s speed feature. a, Speed-up examples relevant to biology. Left, comparison of the time required to calculate the mutual information for all possible pairs of genes of a single-cell dataset13. Right, benchmark of ODE solvers implemented in Julia, Fortran, C, MATLAB, Python and R for the Lotka–Volterra model (more systems are described in ref. 50). b, Schematic of the speed up of vectorizable code (as in a). c, Schematic of the speed up of nonvectorizable code (as in b).
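
The contrast drawn in Fig. 3b can be tried directly in Julia. The following is a minimal sketch, assuming small example vectors; the function names unfused, fused! and looped! are made up for this illustration. It writes the same element-wise computation three ways: with an explicit temporary (the NumPy-like evaluation order), as a fused in-place broadcast, and as the explicit loop that the fused broadcast corresponds to.

```julia
# Sketch of Fig. 3b: element-wise D = A*B + C written three ways.
# Function names and input values are illustrative assumptions.

# Unfused: allocates a temporary for A .* B, then a new array for the sum.
function unfused(A, B, C)
    tmp = A .* B
    return tmp .+ C
end

# Fused broadcast: the dotted operations are combined into one loop
# and the result is written into the preallocated array D.
function fused!(D, A, B, C)
    D .= A .* B .+ C
    return D
end

# The explicit loop that the fused expression is equivalent to.
function looped!(D, A, B, C)
    for i in eachindex(D, A, B, C)
        D[i] = A[i] * B[i] + C[i]
    end
    return D
end

A = [1.0, 2.0, 3.0]
B = [4.0, 5.0, 6.0]
C = [0.5, 0.5, 0.5]
D = similar(A)

println(fused!(D, A, B, C))                               # [4.5, 10.5, 18.5]
println(unfused(A, B, C) == looped!(similar(A), A, B, C)) # true
```

All three give the same values; the difference lies in how many intermediate arrays are created, which is what the allocation column of the Fig. 3c timing table accounts for.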

shorter development times. Going from an initial idea to working code can be orders of magnitude faster than, for example, C/C++. This is in no small measure helped by the flexible Jupyter and Pluto.jl notebook user interfaces (which fulfill similar functions to, for example, R’s Shiny) and flexible software editing environments. Julia combines fast development with fast run-time performance and is therefore appropriate for both algorithm/method prototyping and time- and resource-intensive applications.


Example: network inference from single-cell data
In single-cell biology, we can measure expression levels of tens of thousands of genes in tens of thousands of cells17. Increasingly, we are able to do this with spatial resolution. However, searching for patterns in complex and large datasets is computationally expensive. Even apparently simple tasks, such as calculating the mutual information across all pairs of genes in a large dataset, can quickly become impossible.

Gene regulatory network inference from single-cell data is a statistically demanding task and one for which Julia’s speed helps. Chan et al.13 used higher-order information theoretical measures to infer gene regulatory networks from transcriptomic single-cell data of a range of developmental and stem cell systems. The mutual information has to be calculated for gene pairs, but a multivariate information measure, partial information decomposition, is also considered to separate out direct and indirect interactions18, and this requires consideration of all gene triplets.

The run time of algorithms implemented in the Julia package InformationMeasures.jl can be compared with that of the minet R package19 (Fig. 3a, left). For small numbers of genes, differences are considerable but not prohibitive. Inferring a network with 100 genes takes around 0.3 s in Julia compared with 1.5 s in R, but already for 1,000 genes the inference times differ substantially (17 s in Julia and 390 s (>20-fold difference) in R). For datasets with 3,500 genes and 600 cells (by today’s standards, small datasets), R needs over 2.5 h compared with Julia’s 134 s (~64-fold difference) and, in real-world applications, 400-fold speed differences are possible (this corresponds to computing times of hours versus weeks). Here we reach the threshold of what can be tested and evaluated rigorously in many high-level languages. Overall, multivariate information measures would almost certainly be unfeasible in pure R or Python implementations.

The reason for this performance difference is Julia’s ability to optimize vectorizable code6 (Fig. 3b). Users of Python and R are familiar with vectorized functions, such as maps and element-wise operations. Julia improves on these by combining just-in-time compilation, whereby computer code is compiled at run time rather than ahead of execution (so the compiler can be informed by the current state of the program and data), with vectorized functions via a trick known as operator fusion. When writing a chain of vector expressions, such as D = A × B + C (where A, B, C and D are n-dimensional vectors), libraries such as NumPy call optimized code, which is typically written in languages such as C/C++, and these operations are computed sequentially (Fig. 3c). To evaluate A × B, C code is called to produce a temporary array, tmp, then tmp + C is evaluated (using C) to produce D.

Allocating memory for the temporary intermediate tmp and the final result D is O(n) (which means that the time it takes to complete the computation increases approximately linearly with n, the length of the vectors) and scales proportionally to the compute cost; thus, no matter what the size of the vectors, there is a major unavoidable overhead. Julia uses the “.” (“dot”) operator to signify element-wise action of a function, and we write D .= A .× B .+ C. When the Julia compiler sees this so-called broadcast expression, indicated by the “.” operator, it fuses all nearby dot operations into a single function, and just-in-time compilation compiles this function at run time into a loop. Thus, NumPy makes two function calls and spends time generating two arrays, whereas Julia makes a single function call and reuses existing memory. This and similar performance features are now leading package authors of statistical and data science libraries to recommend calling Julia for such operations, such as the recommendation by the principal author of the R lme4 linear mixed effects library to use JuliaCall to access MixedModels.jl in Julia (both written by the same author) for an approximately 200× acceleration20.

The code for this example can be found at https://github.com/ElisabethRoesch/Perspective_Julia_for_Biologists/tree/main/examples/Abstraction/Example_Structural_bioinformatics_with_composable_packages.

[Figure 4; image not reproduced. Diagram: the AbstractArray interface with the implementations OffsetArray, SubArray, ...; analogy: the pipettor interface with pipettors by manufacturer A, manufacturer B, ....]

Fig. 4 | Interfaces in Julia. It is possible for experimental scientists to switch between different pipettors without recreating entire experimental protocols because a common understanding (or interface) exists that specifies tasks that pipettors should be able to perform in a similar manner. In Julia, we can define interfaces, such as the AbstractArray class, in which we specify rules that any array-like computational object has to follow. Interfaces allow us to apply methods developed for abstract types to custom types. By building our algorithms around interfaces, we can make the use, reuse and refinement of code easier.

Abstraction
Julia allows an exceptionally high level of abstraction21. We can illustrate the advantages of abstraction by drawing an analogy to a standard laboratory tool: the pipettor. Pipettors produced by different manufacturers have slightly different designs. Nevertheless, they all perform the same task in a similar way. It thus takes minimum effort to get used to a new pipettor without having to retrain on every aspect of an experimental protocol. Abstraction achieves the same for software. Similar to the described abstract interface pipettor, in Julia we have interfaces such as the AbstractArray interface (Fig. 4 and discussed in detail in the Supplementary Information). All of its implementations are array-like structures that provide the same core functionalities that an array-like structure is expected to have. This allows us to easily and flexibly switch between different implementations of the same interface22.

Abstraction is especially advantageous in the biological sciences where data are frequently heterogeneous and complex23,24. This can pose challenges for software developers22 and data analysis pipelines, as changes to data may require substantial rewriting of code for processing and analysis. We may either end up with separate implementations of algorithms for different types of data or we may remove details and nuance from the data to enable analysis by existing algorithms. With abstraction, we do not have to make such choices. Julia’s abstraction capabilities provide room for both specialization and generalization through features such as abstract interfaces and generic functions that can exploit the advantages of unique data formats with varying internal characteristics without an overall performance penalty. Here we illustrate the effect of Julia’s abstraction via an example of a structural bioinformatics pipeline. Additionally, we provide a second, more technical abstraction example focusing on image analysis in Supplementary Fig. 1.

Example: structural bioinformatics with composable packages
Julia’s flexibility means that packages from different authors can generally be combined with ease into workflows, a feature known as composability (Fig. 5). Users benefit from Julia’s flexibility just as much as package developers. For example, we consider a standard structural bioinformatics workflow (Fig. 5a) in which we want to download


[Figure 5, panels a-c; image not reproduced. Recoverable labels. a: PDB file of monomer; read file; extract Cβ atom; plot distance map; graph of contacting residues; betweenness centrality of residues; legend: input/output, key steps highlighting flexibility. b: existing types of Graphs.jl; domain-specific function finding residues for allosteric communication; reuse types by writing generic pipelines; new operation applies to existing type; graph of contacting residues. c: existing generic function plot of Plots.jl; domain-specific type in BioStructures.jl; define new recipe to customize plot for this specific type; existing operation applies to new type; plot distance map.]

Fig. 5 | The abstraction feature in Julia. a, Abstract Julia code enables a flexible structural bioinformatics pipeline. The flow chart shows a pipeline that combines multiple Julia packages seamlessly together. This gives developers and users flexibility so that the effort and time required to generate new models and complex workflows is substantially reduced and collaboration is made easier. PDB, Protein Data Bank. b, An example pipeline showing the solving of the first part of the expression problem (an illustration of which is provided in Fig. 1) via the easy code base extension to new functions (step highlighted in blue). c, Left, an example pipeline showing the solving of another expression problem: extension to new types. The step highlighted in blue represents the point at which a new plot recipe is defined for a domain-specific type (that is, we demonstrate the extension of an existing code base to new types). Right, Julia code for defining a new type and a new plot recipe. This example is for the structure MyBioStruc, which captures the results of prediction algorithms of amino acid sequences based on data. It is defined with the fields predicted_AA (a vector of characters that represent the predicted AAs), certainty_AA (a vector of numbers quantifying the certainty for each predicted AA), study (a string naming the respective study that the prediction is based on) and alg (a string naming the respective prediction algorithm). With the macro @recipe, we can specify how the function plot(…) should work for our newly specified example type. Here we define that this should create a line plot of the predicted amino acids with the mean of the certainty of the prediction shown by the opacity of the line, specified by the Plots.jl package as α. More details on the selected example code are provided in the Supplementary Information.
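
The type described in the Fig. 5c caption might look roughly as follows. This is a sketch under stated assumptions, not the authors' code: the field names come from the caption, but the concrete field types, the helper mean_certainty and the sample values are assumptions. The plot recipe itself needs Plots.jl, so to stay dependency-free the sketch instead extends Base.show, another existing generic function, to the new type, which follows the same pattern of extending existing code to a new type.

```julia
# Sketch of the example type from the Fig. 5c caption.
# Field types, mean_certainty and the sample values are assumptions.

struct MyBioStruc
    predicted_AA::Vector{Char}      # predicted amino acids (one-letter codes)
    certainty_AA::Vector{Float64}   # certainty of each predicted amino acid
    study::String                   # study the prediction is based on
    alg::String                     # prediction algorithm used
end

# Mean certainty; the caption maps this value to the line opacity.
mean_certainty(s::MyBioStruc) = sum(s.certainty_AA) / length(s.certainty_AA)

# Extend an existing generic function (Base.show) to the new type.
# A Plots.jl recipe plays the analogous role for plot(...).
function Base.show(io::IO, s::MyBioStruc)
    print(io, "MyBioStruc(", join(s.predicted_AA),
          ", study=", s.study, ", alg=", s.alg,
          ", mean certainty=", mean_certainty(s), ")")
end

pred = MyBioStruc(['M', 'K', 'V'], [0.5, 0.75, 1.0],
                  "example study", "example algorithm")
println(pred)
```

Because println, string interpolation and the REPL all go through show, every one of them now knows how to display a MyBioStruc without having been written with that type in mind.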

and read the structure of the protein crambin from the Protein Data Bank. This can be done using the BioStructures.jl package25 from the BioJulia organization, which provides the essential bioinformatics infrastructure. Protein structures can be viewed using Bio3DView.jl, which uses the 3Dmol.js JavaScript library26 as Julia can easily connect to packages from other languages. We can show the distance map of the Cβ atoms using Plots.jl. While Plots.jl is not aware of this custom type, a Plots.jl recipe makes this straightforward. BioSequences.jl provides custom data types of sequences and allows us to represent the protein sequence efficiently. With this, BioAlignments.jl can be used to align our sequences of interest. This suite of packages can be used to carry out single-cell, full-length total RNA sequencing analysis27 quickly and with ease. A few lines of code in BioStructures.jl allow us to define the residue contact graph using Graphs.jl, giving access to optimized graph operations implemented in Graphs.jl for further analysis, such as calculating the betweenness centrality of the nodes. If coding and analysis are performed in Pluto.jl, then updating one section updates the whole workflow, which assists exploratory analysis (Fig. 5b).

Packages can be combined to meet the specific needs of each study; for example, to generate protein ensembles and predict allosteric sites28 or to carry out information theoretical comparisons using the MIToS.jl package29. In this example, we have used at least five different packages together seamlessly. Plots.jl, BioAlignments.jl and Graphs.jl do not depend on or know about BioStructures.jl, but can still be used productively alongside it (Fig. 5c). Abstraction means that the improvements in any of these packages will benefit users of BioStructures.jl, despite the packages not being developed with protein structures in mind.

Package composability is common across the Julia ecosystem and is enabled by abstract interfaces supported by multiple dispatch (that is, the ability to define multiple versions of the same function with different argument types). Programmers can define standard functions such as addition and multiplication for their own types. Abstraction means that functions in unrelated packages often just work despite knowing nothing about the custom types. This is rarely seen in languages such as Python, R and C/C++, where the behavior of an object


is tightly confined and combining classes and functions from different projects requires much more (of what is known as) boilerplate code.

For example, the Biopython project30 has become a powerful package covering much of bioinformatics. However, extensions to Biopython objects are generally added to (an increasingly monolithic) Biopython, rather than to independent packages. This can lead to objects and algorithms that have the difficult task of fitting all use cases, including their dependencies, simultaneously31. In contrast, Julia’s composability facilitates writing generic code that can be used beyond its intended application domain. Tables.jl, for example, provides a common interface for tabular data, allowing generic code for common tasks on tables. Currently, some 131 distinct packages draw on this common core for purposes far beyond the initially conceived application scope. This is an example that showcases how abstraction ensures the interoperability and longevity of code.

The code for this example can be found at https://github.com/ElisabethRoesch/Perspective_Julia_for_Biologists/tree/main/examples/Abstraction/Example_Structural_bioinformatics_with_composable_packages.

Metaprogramming
As our knowledge of the complexity of biological systems increases, so does our need to construct and analyze mathematical models of these systems (Fig. 6). Currently, most modeling studies in biology rely on programming languages that treat source code as static. Once written, it can be processed into loaded and executing code, but it is never changed while running. We can compare this linear control process with the central dogma of biology: source code (DNA) is transformed into loaded code (RNA) and executing code (protein). We now know that this process (DNA⟶RNA⟶protein) is not linear and unidirectional. RNA and proteins can alter how and when DNA is expressed. Programming languages that support metaprogramming break the linear flow of the computer program in an analogous manner (Fig. 6a). With metaprogramming, source code can be written that is processed into loaded and executing code and that can be modified during run time. This shifts

[...]

systems underlying cellular function34,35. However, the specification of mathematical models is challenging and requires us to specify all of our assumptions explicitly. We then have to solve these models based on these assumptions. Analyzing a given reaction network can involve the solution, for example, of ordinary differential equations, delay differential equations, stochastic differential equations (SDEs) or discrete-time stochastic processes. To create instances of each of these models would, in languages such as C/C++ or Python, typically require the writing of different snippets of code for each modeling framework. In Julia, via metaprogramming, different models can be generated automatically from a single block of code. This simplifies workflows and makes them more efficient, but also removes the possibility of errors due to model inconsistencies.

For example, we can consider the ERK phosphorylation process shown in Fig. 6b36. Here ERK is doubly phosphorylated (by its cognate kinase, MEK), upon which it can shuttle into the nucleus and initiate changes in gene expression. Its role and importance have made ERK a target of extensive further analysis, and modeling has helped to shed light on its function and role in cell fate decision-making systems37. This small system, albeit one of great importance and subtlety, forms building blocks for larger, more realistic biochemical reaction and signal transduction38 models.

In Julia, using the package Catalyst.jl39, this model can be written directly in terms of its reactions, with the corresponding rates. Source code is human readable and differs minimally from the conventional chemical reaction systems shown in Fig. 6c.

The science is encapsulated in this little snippet. Solving of the reaction systems then proceeds by calling the appropriate simulation tool from DifferentialEquations.jl. For a deterministic model, the reaction network is directly converted into a system of ordinary differential equations (via ODESystem). The same reaction network can be directly converted into a model that is specified by SDEs (via SDEProblem) or a discrete-time stochastic process model (via DiscreteProblem). Each of these cases leads to the creation of a distinct model that can be simulated or analyzed; yet, all of the models share the underlying structure
our perception from static software to code as a dynamic instance when of the same reaction network. To simulate one of the resulting models,
the program can modify aspects of itself during run time. the user needs to specify only the necessary assumptions required for a
Metaprogramming originated in the LISP programming language simulation (that is, the parameter values and initial conditions), as well
in the early days of artificial intelligence research. It enables a form of as any further assumptions required that are specific to the model type
reflection and learning by the software, but the ability of a program to (for example, the choice of noise model for a system of SDEs). Adapting
modify computer code needs to be channeled very carefully. In Julia, the model to include nuclear shuttling40 of ERK, as in Fig. 6c, or extrinsic
this is done via a feature called hygenic macros32. These are flexible code noise upstream of ERK36 is easily achieved using metaprogramming.
templates, specified in the program, that can be manipulated at execu- The fitting of models to data, or estimation of their parameters, is
tion time. They are called hygenic because they prohibit accidentally also supported by the Julia package ecosystem. Parameter estimation
using variable names (and thus memory locations) that are defined by evaluating the likelihood, the posterior distribution or a cost func-
and used elsewhere. These macros can be used to generate repetitive tion is straightforward using the Optim.jl41 or JuMP.jl42 packages. Also,
code efficiently and effectively. because of Julia’s speed, it has become much easier to deploy Bayesian
However, there are other uses that can enable new research, and inference methods. Here, too, metaprogramming helps tools such as
this includes the development of mathematical models of biological the probabilistic programming tool Turing.jl43. Approximate Bayesian
systems. Unlike in physics, first principles (the conservation of energy, computation approaches44 also benefit from Julia’s speed, abstraction
momentum and so on) offer little guidance as to how we should con- and metaprogramming and are implemented in GpABC.jl14.
struct models of biological processes and systems. For these notori- The code for this example can be found at https://github.com/
ously complicated biological systems, trial and error, coupled with ElisabethRoesch/Perspective_Julia_for_Biologists/tree/main/examples/
biological domain expertise and state-of-the-art statistical model Metaprogramming/Example_Biochemical_reaction_networks.
selection, is required33. Great manual effort is spent on the formulation
of mathematical models, the exploration of their behavior and their Outlook
adaptation in light of comparisons with data. Metaprogramming (or Computer languages, like human languages, are diverse and changing
the abilities of introspection and reflection during run time32) and the to meet new demands. When selecting a programming language, we
ability to automate parts of the modeling process open up enormous have many choices, but often they reduce to essentially two options:
scope for new approaches to modeling biological systems (Fig. 6b), using a widely used language that everybody else is using or using the
including whole cells (see Supplementary Information). best language for the problem. Traditional languages have an enviable
track record of success in biological research. A frightening propor-
Example: biochemical reaction networks tion of the Internet and modern information infrastructure probably
Mathematical models of biochemical reaction networks allow us to ana- depends on legacy software that would not pass modern quality con-
lyze biological processes and make sense of the bewilderingly complex trol. However, it does the job, for the moment. Similarly, scientific


[Figure 6. Panel a: Metaprogramming (source code, loaded code, executed code) with the analogy DNA, RNA, protein and a transcription factor. Panel b: Large-scale, automated model development; models V1, V2, V3, ..., Vn of the M, Mp, Mpp cycle with MAPKK and MKP. Panel c: Mathematical model description and metaprogramming syntax in Julia; a proposed model (ErkModel) is checked with "Does the model describe given data well?"; if no, the model is updated by deleting, adding or manipulating reactions (example syntax of adding reactions) until the final model (ErkModel) is reached.]

Fig. 6 | Julia’s metaprogramming feature. a, Illustration of metaprogramming and an analogy to the central dogma of molecular biology. Similar to how a transcription factor, initially encoded in DNA, can control gene expression and modify RNA levels of an organism, with metaprogramming we can create code with a feedback effect. b, An example application of metaprogramming in biology. Metaprogramming is especially helpful for large-scale, automated model development. We can write code that adapts the model definition automatically (for example, in light of new data or based on how it interacts with other submodels). For example, when constructing models of cellular systems V1, V2, ..., Vn, we can combine structurally similar models for the different MAP kinases present in human cells and build compartmental models by explicitly modeling the kinase dynamics in the nucleus and cytosol⁴⁰. c, Example workflow of model construction. The adaptation process of models could, for example, start with a theoretically inferred mathematical description, captured via the @reaction_network syntax of the Julia package Catalyst.jl. Subsequently, given experimental data, we evaluate an objective function of the current model, capturing the descriptiveness of the model in light of the data. Depending on the outcome of this evaluation, the model will be updated (for example, by adding new reactions to the model via the macro @add_reactions). More details on the selected example code are provided in the Supplementary Information.

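What "hygienic" means for a macro can be shown in a few lines of base Julia. In this minimal sketch the macro name `@stopwatch` and the generated functions `per_minute`, `per_hour` and `per_day` are illustrative names, not part of any library. A Julia macro receives the caller's code as an expression when that code is parsed and returns a new expression to be compiled; variables the macro introduces are renamed automatically, so they cannot collide with the caller's variables. The second half uses `@eval` in a loop, a related code-generation idiom, to write repetitive definitions.

```julia
# A macro receives the caller's code as an expression and returns new code.
macro stopwatch(ex)
    quote
        t0 = time()             # macro-internal variable, renamed during expansion
        value = $(esc(ex))      # the caller's expression, evaluated in the caller's scope
        (value, time() - t0)
    end
end

t0 = "caller's own t0"          # deliberately the same name as the macro's variable
value, seconds = @stopwatch sum(abs2, 1:1_000)

# Hygiene: the macro's internal `t0` did not overwrite the caller's `t0`.
println(t0)

# Generating repetitive code: one loop defines three unit-conversion functions.
for (name, factor) in ((:per_minute, 60.0), (:per_hour, 3600.0), (:per_day, 86_400.0))
    @eval $name(rate_per_second) = rate_per_second * $factor
end

println(per_hour(0.5))          # a rate of 0.5 per second, expressed per hour
```

`esc` is the explicit opt-out from hygiene: it marks the caller's expression so that it is evaluated with the caller's names, while everything else in the `quote` block stays private to the macro.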
progress is possible with legacy software. Python and R are far from legacy and have plenty of life in them, and there are tools that allow us to overcome their intrinsic slowness⁴⁵.

Here we have tried to explain why we consider Julia a language for the next chapter in the quantitative and computational life sciences. Julia was designed to meet the current and future demands of scientific and data-intensive computing⁴⁶. It is an unequivocally modern language and it does not have the ballast of a long track record going all the way back to the pre-big-data days. The deliberate choices made by the developers furthermore make it fast and give developers and users of the language a level of flexibility that is difficult to achieve in other common languages such as R and Python, but also C/C++ and Fortran. On top of all of that is a state-of-the-art package manager. All packages and Julia itself are maintained via Git, which makes installing and updating the Julia language, packages and their dependencies straightforward⁶.

Julia has a smaller user base than R and Python, but it is growing. In some domains, R and Python have truly impressive package ecosystems. R and the associated Bioconductor project, in particular, have been instrumental in bringing sophisticated bioinformatics, data analysis and visualization methods to biologists. For many, they have also served as a gateway into programming. In other application areas (notably, the simulation of dynamical systems), Julia has leapfrogged the competition⁴⁷. Many of the speed advantages of Julia come from


just-in-time compilation, which underlies and enables good run-time performance. This, however, takes time and causes what is known as latency. Latency can be a problem for applications with hard real-time constraints, such as embedded code on a medical device that requires strictly accurate updates at 100-ms intervals.

Julia was designed to meet the current and future demands of scientific and data-intensive computing. The alternative to Julia that arguably has the most traction is Rust. Rust is an emerging language that has syntactic similarity to C++ but is better at managing memory safely. Its compiler rejects many memory-safety and type errors that C/C++ would let through to run time. For this reason, it is being used in, for example, the Linux kernel. In the biological domain, it could become a choice for medical devices (as we can control latency) or bioinformatics servers that would previously have been developed in Java or C/C++.

These advantages of a new language need to be balanced against the convenience of programmers who are able to tap into the collective knowledge of vast user communities. All languages have started small and had to develop user bases. The Julia community is growing, including in the biomedical sciences, and it appears to be acutely aware of the needs of newcomers to Julia (and under-represented minorities in the computational sciences more generally⁴⁸; see, for example, https://julialang.org/diversity/ for details), which makes the switch to Julia easier⁹.

We have described the three main language design features that make Julia interesting for scientific computing: speed, abstraction and metaprogramming. We have provided some intuition that fills these concepts with life, and we have illustrated how they can be exploited in different biological domains, and how speed, abstraction and metaprogramming together enable new ways of performing biological research. Even though we have introduced these features separately, they are deeply intertwined. For example, a lot of the speed-up opportunities of Julia derive from the language’s abstraction powers; abstraction in turn makes metaprogramming easier.

References
1. Tomlin, C. J. & Axelrod, J. D. Biology by numbers: mathematical modelling in developmental biology. Nat. Rev. Genet. 8, 331–340 (2007).
2. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
3. Robson, B. Computers and viral diseases. Preliminary bioinformatics studies on the design of a synthetic vaccine and a preventative peptidomimetic antagonist against the SARS-CoV-2 (2019-nCoV, COVID-19) coronavirus. Comput. Biol. Med. 119, 103670 (2020).
4. Seefeld, K. & Linder, E. Statistics Using R with Biological Examples (K. Seefeld, 2007).
5. Ekmekci, B., McAnany, C. E. & Mura, C. An introduction to programming for bioscientists: a Python-based primer. PLoS Comput. Biol. 12, e1004867 (2016).
6. Sengupta, A. & Edelman, A. Julia High Performance (Packt Publishing, 2019).
7. Nazarathy, Y. & Klok, H. Statistics with Julia: Fundamentals for Data Science, Machine Learning and Artificial Intelligence (Springer, 2021).
8. Bezanson, J., Edelman, A., Karpinski, S. & Shah, V. B. Julia: a fresh approach to numerical computing. SIAM Rev. 59, 65–98 (2017).
9. Lauwens, B. & Downey, A. Think Julia: How to Think like a Computer Scientist (O’Reilly Media, 2021).
10. Marx, V. The big challenges of big data. Nature 498, 255–260 (2013).
11. Björnsson, B. et al. Digital twins to personalize medicine. Genome Med. 12, 4 (2019).
12. Laubenbacher, R., Sluka, J. P. & Glazier, J. A. Using digital twins in viral infection. Science 371, 1105–1106 (2021).
13. Chan, T. E., Stumpf, M. P. & Babtie, A. C. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 5, 251–267.e3 (2017).
14. Tankhilevich, E. et al. GpABC: a Julia package for approximate Bayesian computation with Gaussian process emulation. Bioinformatics 36, 3286–3287 (2020).
15. Innes, M. Flux: elegant machine learning with Julia. J. Open Source Softw. 3, 602 (2018).
16. Rackauckas, C. & Nie, Q. DifferentialEquations.jl – a performant and feature-rich ecosystem for solving differential equations in Julia. J. Open Res. Softw. 5, 15 (2017).
17. Chen, J. et al. Spatial transcriptomic analysis of cryosectioned tissue samples with Geo-seq. Nat. Protoc. 12, 566–580 (2017).
18. Mahon, S. S. M. et al. Information theory and signal transduction systems: from molecular information processing to network inference. Semin. Cell Dev. Biol. 35, 98–108 (2014).
19. Meyer, P. E., Lafitte, F. & Bontempi, G. minet: a R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics 9, 461 (2008).
20. Bates, D. Julia MixedModels from R. https://rpubs.com/dmbates/377897 (2018).
21. Lange, K. Algorithms from the Book (SIAM, 2020).
22. Oliveira, S. & Stewart, D. E. Writing Scientific Software: a Guide to Good Style (Cambridge Univ. Press, 2006).
23. Alyass, A., Turcotte, M. & Meyre, D. From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med. Genom. 8, 33 (2015).
24. Gomez-Cabrero, D. et al. Data integration in the era of omics: current and future challenges. BMC Syst. Biol. 8, I1 (2014).
25. Greener, J. G., Selvaraj, J. & Ward, B. J. BioStructures.jl: read, write and manipulate macromolecular structures in julia. Bioinformatics 36, 4206–4207 (2020).
26. Rego, N. & Koes, D. 3Dmol.js: molecular visualization with WebGL. Bioinformatics 31, 1322–1324 (2014).
27. Hayashi, T. et al. Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs. Nat. Commun. 9, 619 (2018).
28. Greener, J. G., Filippis, I. & Sternberg, M. J. Predicting protein dynamics and allostery using multi-protein atomic distance constraints. Structure 25, 546–558 (2017).
29. Zea, D. J., Anfossi, D., Nielsen, M. & Marino-Buslje, C. MIToS.jl: mutual information tools for protein sequence analysis in the Julia language. Bioinformatics 33, 564–565 (2017).
30. Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
31. Kunzmann, P. & Hamacher, K. Biotite: a unifying open source computational biology framework in Python. BMC Bioinformatics 19, 346 (2018).
32. Perera, R. Programming languages for interactive computing. Electron. Notes Theor. Comput. Sci. 203, 35–52 (2008).
33. Kirk, P. D. W., Babtie, A. C. & Stumpf, M. P. H. Systems biology (un)certainties. Science 350, 386–388 (2015).
34. Kirk, P., Thorne, T. & Stumpf, M. P. Model selection in systems and synthetic biology. Curr. Opin. Biotechnol. 24, 767–774 (2013).
35. Warne, D. J., Baker, R. E. & Simpson, M. J. Simulation and inference algorithms for stochastic biochemical reaction networks: from basic concepts to state-of-the-art. J. R. Soc. Interface 16, 20180943 (2019).
36. Filippi, S. et al. Robustness of MEK-ERK dynamics and origins of cell-to-cell variability in MAPK signaling. Cell Rep. 15, 2524–2535 (2016).


37. Michailovici, I. et al. Nuclear to cytoplasmic shuttling of ERK promotes differentiation of muscle stem/progenitor cells. Development 141, 2611–2620 (2014).
38. MacLean, A. L., Rosen, Z., Byrne, H. M. & Harrington, H. A. Parameter-free methods distinguish Wnt pathway models and guide design of experiments. Proc. Natl Acad. Sci. USA 112, 2652–2657 (2015).
39. Loman, T. E. et al. Catalyst: fast biochemical modeling with Julia. Preprint at bioRxiv https://doi.org/10.1101/2022.07.30.502135 (2022).
40. Harrington, H. A., Feliu, E., Wiuf, C. & Stumpf, M. P. Cellular compartments cause multistability and allow cells to process more information. Biophys. J. 104, 1824–1831 (2013).
41. Mogensen, P. K. & Riseth, A. N. Optim: a mathematical optimization package for Julia. J. Open Source Softw. 3, 615 (2018).
42. Dunning, I., Huchette, J. & Lubin, M. JuMP: a modeling language for mathematical optimization. SIAM Rev. 59, 295–320 (2017).
43. Ge, H., Xu, K. & Ghahramani, Z. Turing: a language for flexible probabilistic inference. In Proc. 21st International Conference on Artificial Intelligence and Statistics 1682–1690 (Proc. Machine Learning Res., 2018).
44. Liepe, J. et al. A framework for parameter estimation and model selection from experimental data in systems biology using approximate bayesian computation. Nat. Protoc. 9, 439–456 (2014).
45. Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
46. Stanitzki, M. & Strube, J. Performance of Julia for high energy physics analyses. Comput. Softw. Big Sci. 5, 10 (2021).
47. Rackauckas, C. et al. Accelerated predictive healthcare analytics with Pumas, a high performance pharmaceutical modeling and simulation platform. Preprint at bioRxiv https://doi.org/10.1101/2020.11.28.402297 (2020).
48. Whitney, T. & Taylor, V. Increasing women and underrepresented minorities in computing: the landscape and what you can do. Computer 51, 24–31 (2018).
49. Sharpe, J. Computer modeling in developmental biology: growing today, essential tomorrow. Development 144, 4214–4225 (2017).
50. Rackauckas, C. Benchmark of ODE solvers in Julia. https://github.com/SciML/MATLABDiffEq.jl (2019).

Acknowledgements
We thank all attendees of the Birds of a Feather session Julia for Biologists at JuliaCon2021; D. F. Gleich for allowing us to run an experiment on his servers; and R. Patro for discussions about Rust. E.R. acknowledges financial support through a University of Melbourne PhD scholarship. A.L.M. acknowledges support from the National Science Foundation (DMS 2045327). T.E.H. acknowledges NIH 1UF1NS108176. The information, data and work presented herein was funded in part by the Advanced Research Projects Agency–Energy under award numbers DE-AR0001222 and DE-AR0001211, as well as National Science Foundation award number IIP-1938400. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the US Government or any agency thereof. M.P.H.S. acknowledges funding from the University of Melbourne Driving Research Momentum initiative and Volkswagen Foundation Life? program grant (grant number 93063), as well as support through an Australian Research Council Laureate Fellowship.

Author contributions
E.R. and M.P.H.S. conceived of the concept of the project and were in charge of the overall direction and planning. All authors contributed to writing the manuscript and have read and approved the final version.

Competing interests
E.R. is a Sales Engineer at JuliaHub. C.R. is the Vice President of Modeling and Simulation at JuliaHub and Director of Scientific Research at Pumas-AI. T.E.H. is a steward of the Julia project. H.N. is a Senior Computer Scientist at RelationalAI. J.G.G., A.L.M. and M.P.H.S. declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41592-023-01832-z.

Correspondence should be addressed to Michael P. H. Stumpf.

Peer review information Nature Methods thanks Nico Stuurman and the other, anonymous, reviewers for their contribution to the peer review of this work. Primary Handling Editor: Rita Strack, in collaboration with the Nature Methods team.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

© Springer Nature America, Inc. 2023, corrected publication 2023
