Elements and Principles of Data Analysis
Abstract
arXiv:1903.07639v1 [stat.AP] 18 Mar 2019
The data revolution has led to an increased interest in the practice of data analysis. As a
result, there has been a proliferation of “data science” training programs. Because data science
has been previously defined as an intersection of already-established fields or a union of emerging
technologies, the following problems arise: (1) There is little agreement about what data
science is; (2) Data science becomes secondary to established fields in a university setting; and
(3) It is difficult to have discussions on what it means to learn about data science, to teach
data science courses and to be a data scientist. To address these problems, we propose to
define the field from first principles based on the activities of people who analyze data with
a language and taxonomy for describing a data analysis in a manner spanning disciplines.
Here, we describe the elements and principles of data analysis. This leads to two insights:
it suggests a formal mechanism to evaluate data analyses based on objective characteristics,
and it provides a framework to teach students how to build data analyses. We argue that the
elements and principles of data analysis lay the foundational framework for a more general
theory of data science.
1 Introduction
The data revolution has led to an increased interest in the practice of data analysis and increased
demand for training and education in this area [Cleveland, 2001, Nolan and Lang, 2010, Workgroup,
2014, Baumer, 2015, PricewaterhouseCoopers, 2019, Hardin et al., 2015, Kaplan, 2018, Hicks and
Irizarry, 2018]. This revolution has also led to the creation of the term data science, which is often
defined only in relation to existing fields of study [Conway, 2010, Tierney, 2012, Matter, 2013,
Harris, 2013], such as the intersection of computer science, statistics, and substantive expertise.
For example, Drew Conway’s data science Venn diagram [Conway, 2010] defines data science as
the intersection of hacking skills, substantive expertise, and statistics. An alternate approach is to
define data science as the union of emerging methods, technologies or languages, including such
tools as R, Python, SQL, Hadoop, or any of a long list of applications [Mason and Wiggins, 2010].
However, defining data science only in relation to other fields or technologies creates an existential
problem for anything that we might consider as the field of data science. Two major problems with
this approach are (1) there is no agreement on what the foundations of data science are and what
the core activities of data scientists are, and (2) because data science is defined only in relation to
existing fields of study or the tools that others use, it makes data science secondary in stature to
those other fields in the university setting. The result is a lack of an identity for what data science
is and what data scientists should be doing, and an inability to communicate and teach a consistent set of concepts.
We propose to use an alternative definition of data science based on what a data scientist does,
as others have taken a similar approach [Mason and Wiggins, 2010, Hochster, 2014], but here we propose a more specific definition:
Data science is the science and design of (1) actively creating a question to inves-
tigate a hypothesis with data, (2) connecting that question with the collection of appro-
priate data and the application of appropriate methods, algorithms, computational tools
or languages in a data analysis, and (3) communicating and making decisions based on
new or already established knowledge derived from the data and data analysis.
These three states do not necessarily occur in a linear order, and data scientists constantly move between them.
Within the core definition of data science, a data scientist builds a data analysis [Tukey, 1962,
Tukey and Wilk, 1966, Box, 1976, Wild, 1994, Chatfield, 1995, Wild and Pfannkuch, 1999, Cook
and Swayne, 2007], which has been defined as “the investigative process used to extract knowledge,
information, and insights about reality by examining data” [Grolemund and Wickham, 2014]. In a
typical data analysis, the data scientist starts with a question in mind and a set of collected data
to investigate a hypothesis, all of which may evolve over time. However, there is a great amount
of variation in defining what a data analysis is and how a data scientist can construct or create
a data analysis. Furthermore, there is little in the way of language available that data scientists
can use to describe what makes analyses different from each other.
Other fields, such as art or music, have overcome similar challenges by defining a language that
can be used to construct or create works of art and to characterize the variation between different
pieces of art. More formally, an artist can create a piece of art using the elements and principles
specific to that area. The elements of art are color, line, shape, form, and texture [of Art, 2019]; and
the principles of art are the means by which the artist composes or organizes the elements
within a work of art [Marder, 2018]. For example, an artist can use the principle of contrast (or
emphasis) to combine elements in a way that stresses the differences between those elements, such
as combining two contrasting colors, black and white. The principles of art, by themselves, are not
used to evaluate a piece of art, but they are meant to be objective characteristics that can describe the variation between different pieces of art.
Here, we define a language for describing data analyses that can be used to construct and
to create a data analysis and to characterize the variation between data analyses. We denote
this language as the elements and principles of data analysis and we use them to describe the
fundamental concepts for the practice and teaching of creating a data analysis. Furthermore, this
language provides the vocabulary and framework to have an informed debate and discussion on
what a data scientist does in the larger context of defining data science. We argue that this is a
more productive way to define data science, where different individuals or disciplines can emphasize
different elements and principles of a data analysis. In addition, having a descriptive language for
data analyses gives us an efficient way to communicate about data analyses and to describe lessons learned.
In scientific fields, a data analysis serves to quantify and characterize evidence in data. Scientists
develop hypotheses about the world and collect data in a manner guided by those hypotheses. A
data analysis is then conducted by a data analyst to formally determine the strength of evidence
in favor of one hypothesis vis-a-vis an alternative hypothesis. There are a variety of methods and
tools available to the data analyst for accomplishing this goal, some of which may provide different
summaries of the data and the evidence. Part of the data analyst’s job is to choose the appropriate
set of methods, algorithms, computational tools or languages to assess the strength of evidence for
a particular hypothesis.
Decisions about how the analysis is conducted are made while simultaneously considering three
aspects, which are all fundamental to the larger field of data science, but are treated as key
inputs to the data analysis here: (i) the hypotheses, or more broadly the scientific questions being
addressed, (ii) the data, and (iii) the audience that will ultimately view or digest the data analysis.
It is important to note that by the word hypothesis, we are not referring to statistical hypothesis
testing, but instead we refer to it as a proposed explanation for a phenomenon, which can be defined broadly.
These decisions by the analyst are critical because, while analysts will choose to analyze the
data in a manner that they believe is valid for the hypothesis, data, and audience, analyses of the
same data set can vary widely depending on the specific choices made by the analyst [Silberzahn
et al., 2018], including variation in the methods, tooling, and workflow. However, we currently
lack a language to describe how to construct and to create a data analysis and to characterize
the variation between data analyses. To address this, we can leverage the ideas from other fields,
such as art and music, to define a language for describing data analyses based on the elements
and principles chosen by the data analyst. Briefly, the elements of an analysis are the individual
basic components of the analysis that, when assembled together by the analyst, make up the entire
analysis. The principles of the analysis are prioritized qualities or characteristics that are relevant
to the analysis, as a whole or individual components, and that can be objectively observed or
measured. In the sections below, we go into further detail on these two important topics, namely the elements and the principles of data analysis.
Finally, an analysis will usually result in up to three basic outputs. The first is the analysis
itself, which we imagine as living in an analytic container which might be a set of files including
a Jupyter notebook or R Markdown document, a dataset, and a set of ancillary code files. The
analytic container is essentially the “source code” of the analysis and it is the basis for making
modifications to the analysis and for reproducing its findings. In addition to the container, there
will usually be an analytic product, which is the executed version of the analysis in the analytic
container, containing the executed code producing the results and output that the analyst chooses
to include, which might be a PDF document or HTML file. Finally, the analyst will often produce
an analytic presentation, which might be a slide deck, PDF document, or other presentation format,
which is the primary means by which the data analysis is communicated to the audience. Elements
included in the analytic presentation may be derived from the analytic container, analytic product,
or elsewhere.
To provide a concrete example, we present here a brief vignette that describes a hypothetical data analysis.
Roger has a long-term collaboration with a pediatrician who has just finished executing a
clinical trial to see if a new asthma treatment can decrease exhaled nitric oxide (eNO, a measure of
pulmonary inflammation) in children with asthma, as compared to the standard of care. She sends
Roger the data upon completion of the data collection and asks for a summary of the findings as soon as possible.
Roger receives the data and immediately notices that the "treatment" variable is coded as "0"
and "1". He calls his collaborator to inquire about the coding scheme and she clarifies that "0"
indicates standard of care and "1" indicates the new drug. Upon hearing that, Roger re-codes
the data to have more informative labels. Roger then makes a histogram of the outcome (which
is continuous) to see if there is anything unusual. He then makes side-by-side boxplots of eNO
by treatment group to see if there is any difference between the groups. As a final step he runs
a two-sample t-test to test for a difference in eNO between groups. Upon seeing those results he
puts together a brief slide deck presentation of the results for his collaborator and schedules a call
to discuss next steps. A representation of the code and output Roger used for this initial analysis
is presented in Figure 1.
The unexecuted code presented above is the analytic container which describes what was done
and provides a mechanism for reproducing or modifying the analysis. The executed code is the
Figure 1: Sample analytic container, analytic product, and analytic presentation.
analytic product and includes model output (the t-test) and two plots. Finally, the slide deck is the
analytic presentation and is a key vehicle by which the collaborator will interact with the analysis.
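A sketch of Roger's workflow in code is below. This is our illustration, not a reproduction of Figure 1: the variable names (`treatment`, `eno`) are assumptions and the data are simulated stand-ins for the trial data.

```python
import numpy as np
from scipy import stats
import matplotlib
matplotlib.use("Agg")  # render plots off-screen
import matplotlib.pyplot as plt

# Simulated stand-in for the trial data (hypothetical variable names)
rng = np.random.default_rng(1)
treatment = rng.integers(0, 2, size=60)   # arrives coded as "0"/"1"
eno = rng.normal(25, 5, size=60)          # exhaled nitric oxide (continuous)

# Re-code the treatment variable with more informative labels
labels = np.where(treatment == 1, "new drug", "standard of care")

# Histogram of the outcome to see if there is anything unusual
plt.hist(eno)
plt.savefig("eno_hist.png")
plt.close()

# Side-by-side boxplots of eNO by treatment group
groups = [eno[labels == "standard of care"], eno[labels == "new drug"]]
plt.boxplot(groups)
plt.xticks([1, 2], ["standard of care", "new drug"])
plt.savefig("eno_boxplots.png")
plt.close()

# Two-sample t-test for a difference in eNO between groups
t_stat, p_value = stats.ttest_ind(groups[1], groups[0])
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

Saved as an unexecuted script, such a file would belong to the analytic container; executed, with its printed output and saved figures, it becomes the analytic product.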
All of the items presented above are elements of a data analysis and will be described in greater detail in the next section.

2 Elements of a data analysis
The elements of a data analysis are the fundamental components of a data analysis used by the data
analyst: code, code comments, data visualization, non-data visualization, narrative text, summary
statistics, tables, and statistical models or computational algorithms [Breiman, 2001] (Table 1).
Code and code comments are two of the most commonly used elements by the data analyst to build a data analysis. The former is a series of programmatic instructions executed in a particular programming or scripting language, and the latter is the non-executable instructions that describe the action or result of the surrounding code.
These can be an entire line, multiple lines or a short snippet. Examples of code include defining
variables or writing functions. Code comments and narrative text are related because they both
can include expository phrases or sentences that describe what is happening in the data analysis
in a human readable format. However, the difference between the two is that a code comment has a symbol in front of the text, which instructs the document container not to execute this element.
There are two types of visualization elements used in a data analysis, data visualization and non-data visualization, where the former can be a plot, figure or graph illustrating a visual representation of the data and the latter is a figure relevant to the data analysis that does not necessarily contain data, such as a diagram or flowchart.
There are two main types of summary elements of a data analysis: summary statistics and
tables, where the former are one (or more than one) dimensional numerical quantities derived
from the data, such as mean or standard deviation, while the latter is an ordered arrangement of
either data or summaries of the data into a row and column format. The last element of a data
analysis is the statistical model or computational algorithm, which an analyst uses to investigate the underlying data phenomena or data-generation process, to assess predictive ability, or to characterize a computational algorithm.
Narrative text: Expository phrases or sentences that describe what is happening in the data analysis in a human readable format.
Code: A series of programmatic instructions executed in a particular programming or scripting language.
Code comment: Non-executable code or text near or inline with code that describes the expected action/result of the surrounding code or provides context.
Data visualization: A plot, figure or graph illustrating a visual representation of the data.
Narrative diagram: A diagram or flowchart without data.
Summary statistics: Numerical quantities derived from the data, such as the mean, standard deviation, etc.
Table: An ordered arrangement of data or summaries of data in rows and columns.
Statistical model or computational algorithm: A mathematical model or algorithm concerning the underlying data phenomena or data-generation process, predictive ability, or a computational algorithm.
Table 1: Elements of a data analysis. This table describes eight elements that are used by
the data analyst to build the data analysis. These elements include code, code comments, data
visualization, non-data visualization, narrative text, summary statistics, tables, and statistical
models or computational algorithms.
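To make the table concrete, the short snippet below (our illustration) labels which element each piece instantiates; the surrounding prose of an analysis would supply the narrative-text element, and a plot of `x` would supply the data-visualization element.

```python
import numpy as np

# -- Code comment: non-executable text describing the code below.
# -- Code: programmatic instructions in a scripting language.
rng = np.random.default_rng(0)
x = rng.normal(size=100)

# -- Summary statistics: numerical quantities derived from the data.
x_mean = x.mean()
x_sd = x.std(ddof=1)

# -- Table: an ordered arrangement of summaries in rows and columns.
summary_table = [("mean", x_mean), ("sd", x_sd)]
for name, value in summary_table:
    print(f"{name}\t{value:.3f}")

# -- Statistical model: a simple description of the data-generation
#    process, here a normal distribution with the estimated parameters.
model = {"family": "normal", "mean": x_mean, "sd": x_sd}
```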
In addition to these elements, there are also contextual inputs to the data analysis, such as
the main scientific question or hypothesis, the data, the choice of programming language to use,
the audience, and the document or container for the analysis, such as Jupyter or R notebooks.
We do not include these as elements of data analysis, because these inputs are not necessarily
decided or fundamentally modified by the analyst. Often an upstream entity such as a manager
provides the framework for these contextual inputs. However, we note that often the data analyst
will be expected to decide or contribute to these contextual inputs. In addition, it may be the
analyst’s job to provide feedback on some of these inputs in order to further refine or modify them.
For example, an analyst may be aware that a specific programming language is more appropriate for a given analysis than the one initially proposed.
3 Principles of data analysis
The principles illustrated by a data analysis are prioritized qualities or characteristics that are
relevant to the analysis, as a whole or individual components, and that can be objectively observed
or measured. Their presence (or absence) in the analysis is not dependent on the characteristics
of the audience viewing the analysis, but rather the relative weight assigned to each principle by
the analyst can be highly dependent on the audience’s needs. In addition, the weighting of the
principles by the analyst can be influenced by outside constraints or resources, such as time, budget, or
access to individuals to ask context-specific questions, that can impose restrictions on the analysis.
The weighting of the principles per se is not meant to convey a value judgment with respect to
the overall quality of the data analysis. Rather, the requirement is that multiple people viewing
an analysis could reasonably agree on the fact that an analysis gives high or low weight to certain
principles. In Section 5 we describe some hypothetical data analyses that demonstrate how the
various principles can be weighted differently. Next, we describe six principles that we believe are relevant to any data analysis.
Data Matching. Data analyses with high data matching have data readily measured or available
to the analyst that directly matches the data needed to investigate a hypothesis or problem with
data analytic elements (Figure 2). In contrast, a scientific question may concern quantities that
cannot be directly measured or are not available to the analyst. In this case, data matched to the
hypothesis may be surrogates or covariates to the underlying data phenomena. While we consider
the main scientific question or hypothesis and the data to be contextual inputs to the data analysis,
we consider this a principle of data analysis because the analyst selects data analytic elements that
are used to investigate the hypothesis, which depends on how well the data are matched. If the
data are poorly matched, the analyst will not only need to investigate the main hypothesis with one
set of data analytic elements, but also will need to use additional elements that describe how well
the surrogate data are related to the underlying data phenomena.
It is important to note that problems and hypotheses can be more or less specific, which will impose strong or weak constraints on the range of data that can be matched to the problem. Highly specific hypotheses or problems tend to induce strong constraints on what can be investigated with data analytic elements. Less specific problems admit a large range of potential data to investigate the hypothesis. Data that can be readily measured or are available to the analyst to directly address a specific hypothesis result in high data matching, but depending on the problem specificity, this match can span a narrow or wide range of candidate data.

Exhaustive. An analysis is exhaustive if specific questions or hypotheses are addressed using multiple, complementary elements (Figure 3). For example, using a 2 × 2 table, a scatter plot,
Figure 2: The data matching principle of data analysis. Data analyses with high data
matching have data readily measured or available to the analyst that directly matches the data
needed to investigate a hypothesis or problem with data analytic elements. In contrast, a hypothesis
may concern quantities that cannot be directly measured or are not available to the analyst. In
this case, data matched to the hypothesis may be surrogates or covariates of the underlying data phenomena, and additional elements may be needed to describe how well the surrogate data relate to the underlying phenomena in order to investigate the main hypothesis.
and a correlation coefficient are three different elements that could be employed to address the
hypothesis that two predictors are correlated. Analysts that are exhaustive in their approach use
complementary tools or methods to address the same hypothesis, knowing that each given tool
reveals some aspects of the data but obscures other aspects. As a result, the combination of
elements used may provide a more complete picture of the evidence in the data than any single
element.
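A minimal sketch (ours, with simulated data) of that exhaustive approach, investigating the correlation hypothesis with all three complementary elements:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render plots off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(scale=0.8, size=200)  # induce a correlation

# Element 1: a 2x2 table of counts, dichotomizing each predictor at its median
above_x = x > np.median(x)
above_y = y > np.median(y)
table_2x2 = np.array([
    [np.sum(~above_x & ~above_y), np.sum(~above_x & above_y)],
    [np.sum(above_x & ~above_y),  np.sum(above_x & above_y)],
])

# Element 2: a scatter plot of the raw data
plt.scatter(x, y, s=8)
plt.savefig("xy_scatter.png")
plt.close()

# Element 3: the Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(table_2x2, r)
```

Each element reveals something the others obscure: the table shows joint frequencies, the scatter plot shows the shape of the relationship, and the coefficient summarizes its strength.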
Skeptical. An analysis is skeptical if multiple, related hypotheses are considered using the same
data (Figure 4). Analyses, to varying extents, consider alternative explanations of observed phe-
nomena and evaluate the consistency of the data with these alternative explanations. Analyses
that do not consider alternate explanations have no skepticism. For example, to examine the rela-
tionship between a predictor X and an outcome Y, an analyst may choose to use different models
containing different sets of predictors that may potentially confound that relationship. Each of these
different models represents a different but related hypothesis about the X-Y relationship. A sep-
arate question that arises is whether the configuration of hypotheses or alternative explanations
are relevant to the problem at hand. However, often that question can only be resolved using information external to the data analysis.
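The confounder-adjustment strategy just described can be sketched as follows (our illustration with simulated data; each model encodes a different but related hypothesis about the X-Y relationship):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
z = rng.normal(size=n)                        # a potential confounder
x = z + rng.normal(size=n)                    # predictor associated with z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)    # outcome driven by both

def coef_on_x(design, outcome):
    """Ordinary least squares; return the coefficient on x (column 1)."""
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[1]

ones = np.ones(n)
b_naive = coef_on_x(np.column_stack([ones, x]), y)        # model: Y ~ X
b_adjusted = coef_on_x(np.column_stack([ones, x, z]), y)  # model: Y ~ X + Z
print(b_naive, b_adjusted)
```

Here the unadjusted model overstates the X-Y relationship; a skeptical analysis reports both fits so the audience can see how the conclusion depends on the alternative explanation.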
Figure 3: The exhaustive principle of data analysis. An analysis is exhaustive if specific
questions or hypotheses are addressed using multiple, complementary elements. Given a hypothesis
or scientific question, the analyst can select an element or set of complementary elements to
investigate the hypothesis. The more complementary elements that are used to investigate the
hypothesis, the more exhaustive the analysis is, which provides a more complete picture of the
evidence in the data than any single element.
Figure 4: The skeptical principle of data analysis. An analysis is skeptical if multiple, related
hypotheses or alternative explanations of observed phenomena are considered using the same data
and the consistency of the data with these alternative explanations is evaluated. In contrast, analyses that do
not consider alternate explanations have no skepticism.
Second-order. An analysis is second-order if it includes ancillary elements that do not directly address the primary hypothesis or question, but give important context or supporting information to the analysis (Figure 5). Any given analysis will contain elements that directly contribute to the results
Figure 5: The second-order principle of data analysis. An analysis is second-order if it
includes ancillary elements that do not directly address the primary hypothesis/question but give
important context to the analysis. Examples of ancillary elements could be background informa-
tion of how the data were collected, and expository explanations or analyses comparing different
statistical methods or software packages. While these details may be of interest and provide useful
background, they likely do not directly influence the analysis itself.
or conclusions, as well as some elements that provide background or context or are needed for other
reasons, such as if the data are less well matched to the investigating the hypothesis (Figure 2).
Second-order analyses contain more of these background/contextual elements in the analysis, for
better or for worse. For example, in presenting an analysis of data collected from a new type of
machine, one may include details of who manufactured the machine, or why it was built, or how
it operates. Often, in studies where data are collected in the field, such as in people's homes, field
workers can relay important information about the circumstances under which the data were collected. In
both examples, these details may be of interest and provide useful background, but they may not
directly influence the analysis itself. Rather, they may play a role in interpreting the results and placing them in context.

Transparent. Transparent analyses present an element or set of elements summarizing or visualizing data that are influential in explaining how the underlying data phenomena or data-generation process connects to any key output, results, or conclusions (Figure 6). While the
totality of an analysis may be complex and involve a long sequence of steps, transparent analyses
extract one or a few elements from the analysis that summarize or visualize key pieces of evidence
in the data that explain the most “variation” or are most influential to understanding the key
results or conclusion. One aspect of being transparent is showing the approximate mechanism by which the data and chosen elements lead to the key results or conclusions.
Figure 6: The transparency principle of data analysis. Transparent analyses present an
element or set of elements summarizing or visualizing data that are influential in explaining how
the underlying data phenomena or data-generation process connects to any key output, results,
or conclusions. While the totality of an analysis may be complex and involve a long sequence of
steps, transparent analyses extract one or a few elements from the analysis that summarize or
visualize key pieces of evidence in the data that explain the most “variation” or are most influential
to understanding the key results or conclusion.
Reproducible. An analysis is reproducible if someone who is not the original analyst can take the
published code and data and compute the same results as the original analyst (Figure 7). In the
terminology of our framework, given the elements of the data analysis, we can produce the exact
same results of the analysis. Critical to reproducibility is the availability of the analytic container
to others who may wish to re-examine the results. Much has been written about reproducibility
and its inherent importance in science, so we do not repeat that here [Peng, 2011]. We simply
add that reproducibility (or lack thereof) is usually easily verified and is not dependent on the
characteristics of the audience viewing the analysis. Reproducibility also speaks to the coherence
of the workflow in the analysis in that the workflow should show how the data are transformed to eventually become results.
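One way to check this property in practice (our sketch, not a prescription from the paper) is to make the analysis a deterministic function of the data and an explicit seed, and then confirm that a second run, standing in for Analyst 2, yields byte-identical results:

```python
import hashlib
import json
import random

def run_analysis(data, seed=0):
    """A toy analysis whose only randomness is controlled by the seed."""
    rng = random.Random(seed)
    sample = rng.sample(data, k=5)              # e.g. a resampling step
    return {"mean_of_sample": sum(sample) / len(sample)}

def fingerprint(result):
    """Hash the serialized results so two runs can be compared exactly."""
    payload = json.dumps(result, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

data = list(range(100))
run1 = fingerprint(run_analysis(data, seed=0))   # original analyst
run2 = fingerprint(run_analysis(data, seed=0))   # a second analyst re-runs
print(run1 == run2)
```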
4 Building a data analysis

The framework of elements and principles described thus far allows us to define a data analysis as
a collection of elements whose choice is guided by the relative weight given to a set of principles.
Given the contextual inputs, the data analysis is built upon atomic units which occur both on the
computer and in the analyst’s head (Figure 8). In each atomic unit, the data analyst first chooses
a principle to investigate a hypothesis or scientific question. Then, the analyst alternates between
Figure 7: The reproducible principle of data analysis. An analysis is reproducible if someone
who is not the original analyst (Analyst 2) can take the same data and the same elements of the
data analysis and produce the exact same results as the original analyst (Analyst 1). In contrast,
analyses that produce different results are not reproducible.
two stages until the analyst exits the atomic unit. The actions in the atomic unit are denoted in
Figure 8 by (1) the analyst selects an element to investigate a hypothesis or scientific question,
(2) the analyst interprets the output, (3) the analyst decides whether or not to place the element
from step (1) into the analytic container or analytic product, and (4) the analyst decides
whether to continue in the atomic unit by selecting another element to investigate the hypothesis
or question or to exit the atomic unit entirely and end this line of investigation.
While action (1) happens on the computer or console, actions (2) through (4) happen in the analyst's head.
In action (3), if an element is not recorded into the analytic container or analytic product, then the
audience does not see it as part of the analysis. At the conclusion of the atomic unit, the analyst
can use the information gained or not gained to guide the choice of what will be the next atomic
unit.
As an example, let us assume our hypothesis is that two features X and Y are associated.
The analyst begins the atomic unit by choosing a principle to investigate this hypothesis, for example the principle of exhaustiveness. The analyst then picks an element (1), such as a 2-dimensional scatter plot, to investigate the hypothesis. The element is executed in the
console, which might reveal evidence for or against an association. The analyst then interprets the result in his or her head (2). At this point, the analyst must make two decisions: if he or she wants
to place the data visualization element in the analytic container (or analytic product) or not (3)
and if he or she would like to continue the investigation with another element or exit the atomic
Figure 8: The atomic unit of data analysis. Given the contextual inputs, the data analysis is
built upon atomic units of data analysis, which occur both on the computer and in the analyst's
head. In each atomic unit, the data analyst first chooses a principle to investigate a hypothesis
or scientific question (Stage A). Then, the analyst alternates between Stages B and C until the
analyst exits the atomic unit, either by choosing to end the line of investigation or choosing to
invoke a new principle. The actions in the atomic unit are denoted by (1) the analyst selects an element to investigate a hypothesis or scientific question, which is executed in the computer or console, (2) the analyst interprets the output in his or her head, (3) the analyst decides whether or not to place the element from (1) in the analytic container or analytic product (if not, the element is never recorded in the container or product and the audience does not see it as part of the analysis), and (4) the analyst decides whether to continue in the atomic unit by selecting another element to investigate the hypothesis or question or to exit the atomic unit entirely and end this line of investigation.
unit (4). Let’s assume the analyst decides to incorporate the element into the analytic container and
also decides to continue the investigation of the hypothesis because the principle of exhaustiveness
has been invoked by the analyst. Then, let’s assume the analyst decides to quantitatively assess
the strength of the association between X and Y using another element. If the analyst selects an
element (4), such as the mean of X, to quantify the strength of association between X and Y,
then the analyst would not learn anything (2) because a univariate summary statistic (or element)
will not provide information on a 2-dimensional relationship between X and Y. Next, the analyst
chooses not to record this element into the analytic container or analytic product (3) and chooses to
select a different element (4), such as the Pearson correlation summary statistic, which would be a
more informative element to use when assessing the strength of the association between X and Y.
The analyst interprets the results (2) and then makes the decision to place the element into the
analytic container or product (3) and to end this investigation of the original hypothesis (4). In this
way, the analyst has completed one atomic unit of data analysis to form a portion of the final data analysis.
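The contrast between the two elements in this walk-through can be sketched as follows (our illustration, with simulated data):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=150)
y = 1.5 * x + rng.normal(scale=0.5, size=150)   # X and Y are associated

# Uninformative element: a univariate summary such as the mean of X
# says nothing about a 2-dimensional relationship between X and Y.
mean_x = x.mean()

# Informative element: the Pearson correlation quantifies the association.
r = np.corrcoef(x, y)[0, 1]
print(mean_x, r)
```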
5 Vignettes
To make these ideas more concrete, we provide four vignettes in this section where we describe
how a data analysis could invoke or not invoke certain principles of data analysis.
5.1 Vignette 1
Background. Roger is interested in understanding the relationship between outdoor air pollution
concentrations and variables predictive of pollution levels (e.g. temperature, wind speed, distance
to road, traffic counts, etc.). However, monitoring of air pollution is expensive and time-consuming,
and so he is interested in developing a prediction model for predicting air pollution levels where monitoring data are not available.
Analysis. Using available data on air pollution concentrations as well as 20 other variables that
he thinks would be predictive of pollution levels, he fits a linear model using measured monitor-
level pollution concentrations as the outcome. Once the analysis is complete, he includes all of the
code and the writeup in a Jupyter Notebook. The document and all the corresponding data are made available in a public repository on GitHub.
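A sketch of such a model fit (ours; the data are simulated and only five of the twenty predictors are shown) illustrates how the same fitted coefficients can serve either prediction at unmonitored sites or coefficient interpretation:

```python
import numpy as np

# Simulated stand-in for the monitoring data; temperature is column 0.
rng = np.random.default_rng(11)
n, p = 300, 5
X = rng.normal(size=(n, p))
true_beta = np.array([2.5, 1.0, 0.0, -0.5, 0.3])
pollution = X @ true_beta + rng.normal(size=n)

# Fit the linear model with measured concentrations as the outcome
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, pollution, rcond=None)

temp_coef = coef[1]                      # inferential use: interpret it
X_new = rng.normal(size=(50, p))         # predictive use: new locations
predicted = np.column_stack([np.ones(50), X_new]) @ coef
print(temp_coef)
```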
Results. In his analysis report, he indicates that temperature had a large coefficient in the model
and was statistically significant. He further reports that a 1 degree increase in temperature was
associated with a 2.5 unit increase in pollution. Other coefficients that were statistically significant were also reported.
The stated goal of this analysis is to build a prediction model for predicting unobserved levels of
air pollution. A linear model is fit and then the coefficients of the model are interpreted.[1]
• Matching to the Data: The data appear highly appropriate for addressing the problem
of building a prediction model for pollution. Observed monitoring data are available for the
outcome, and 20 covariates that are potentially related to pollution are used as predictors.
1 While the goal was to build a predictive analysis, the analyst ultimately built an inferential analysis. The
principles of data analysis do not characterize the validity or success of the analysis, nor the strength or quality of
evidence for the hypothesis of interest. However, we propose a framework for how the elements and principles of
data analysis might be used for these ideas in Section 6.3.
• Exhaustive: There is little evidence of exhaustiveness in this report. There is no attempt
to try alternative approaches to see if improvements can be made. Essentially, one model was fit and its results reported.
• Skeptical: There is little skepticism in the analysis, as only one approach was taken and
only one question/hypothesis was explored. There is also no checking of assumptions, nor any
summary included in the analysis that highlights the evidence in the data or data-generation
process that reveals how the reported results are influenced by features in the data.
• Reproducible: The code and data are made available on GitHub and the code for implementing
the model is organized in a Jupyter Notebook. The analysis therefore appears
reproducible.
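The footnote's distinction between a predictive and an inferential analysis can be made concrete with a short sketch. The data below are simulated stand-ins for the vignette's (hypothetical) pollution data, reduced to a single predictor for brevity; the point is that interpreting a fitted coefficient is an inferential act, while a predictive analysis would instead evaluate error on held-out data.

```python
import random

# Simulated stand-in for the vignette's data: temperature predicting pollution.
# The true slope of 2.5 mirrors the coefficient reported in the vignette.
random.seed(1)
n = 200
temp = [random.gauss(20, 5) for _ in range(n)]
pollution = [2.5 * t + random.gauss(0, 10) for t in temp]

def fit_slope(x, y):
    """Closed-form simple linear regression slope: cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

# Inferential use: fit on all the data and interpret the coefficient.
slope = fit_slope(temp, pollution)

# Predictive use: fit on a training split and evaluate held-out error.
train_x, test_x = temp[:150], temp[150:]
train_y, test_y = pollution[:150], pollution[150:]
b = fit_slope(train_x, train_y)
a = sum(train_y) / len(train_y) - b * (sum(train_x) / len(train_x))
mse = sum((a + b * x - y) ** 2 for x, y in zip(test_x, test_y)) / len(test_y)
```

An exhaustive version of the vignette's analysis would compare this held-out error across several candidate models rather than reporting a single fit.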
5.2 Vignette 2
Background. Stephanie works as a data scientist at a small startup company that sells widgets
over the Internet through an online store on the company’s web site. One day, the CEO comes by
Stephanie’s desk and asks her how many customers have typically shown up at the store’s website
each day for the past month. The CEO waits by Stephanie’s desk for the answer.
Analysis. Stephanie launches her statistical analysis software and, typing directly into the
software’s console, immediately pulls records from the company’s database for the past month.
She then groups the records by day and tabulates the number of customers. From this daily
tabulation she then calculates the mean and the median count.
Results. Stephanie verbally reports the daily mean and median count to the CEO standing over
her shoulder. She also notes that in the past month the web site experienced some unexpected
down time when it was inaccessible to the world for a few hours.
This scenario is a typical “quick analysis” that is often done under severe time constraints and
where perhaps only an approximate answer is required. In such circumstances, there is often a tradeoff between the thoroughness of the analysis and the speed with which it can be delivered.
• Matching to Data: The data are essentially perfectly matched to the problem. The
database tracks all visitors to the web site and the analysis used data directly from the
database.
• Exhaustive: There is some evidence of exhaustiveness here as the analysis presented both
the mean and the median as a summary of the typical number of customers per day.
• Skeptical: The analysis did not address any other hypotheses or questions.
• Reproducible: Given that the results were verbally relayed and that the analysis was
conducted on the fly in the statistical software’s console, the analysis is not reproducible.
• Second-order: Noting that the web site experienced some downtime is a second-order detail
that suggests the result presented may not be typical of a monthly mean. However, the
information does not imply that the summary statistic is incorrect, and it therefore does not
directly contribute evidence to the question asked.
• Transparent: The analysis lacks transparency, but is nevertheless fairly simple. One way
to make this analysis more transparent would be to include a histogram or a time series plot
of daily values.
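Stephanie's on-the-fly tabulation amounts to a group-by and two summary statistics. The records below are invented stand-ins for rows pulled from the company's database (the vignette supplies no data), with one low-count day mimicking the downtime she mentions.

```python
from collections import Counter
import statistics

# Invented (day, customer_id) visit records; the third day mimics the
# partial downtime noted in the vignette.
records = ([("2019-03-01", i) for i in range(120)]
           + [("2019-03-02", i) for i in range(95)]
           + [("2019-03-03", i) for i in range(15)])

daily = Counter(day for day, _ in records)  # group by day and tabulate
counts = list(daily.values())

mean_count = statistics.mean(counts)      # pulled down by the downtime day
median_count = statistics.median(counts)  # robust to the downtime day
```

The gap between the two summaries is itself a crude signal of the second-order downtime detail; a histogram or time series plot of `counts`, as suggested above, would make it transparent.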
5.3 Vignette 3
Background. Stephanie is conducting a data analysis to see if unconventional oil and gas extraction
methods (e.g. hydraulic fracturing, or "fracking") are affecting the health of populations living
near oil and gas deposits in Colorado. Because precise data on the operation of oil and gas wells are not publicly
available, she can only obtain the location of each well and a five-year window in which the well
was operating. This isn’t ideal, but she considers doing the analysis with this coarsened data more
valuable than not doing it. She is able to obtain insurance records from Medicare from everyone
in the state aged 65 years and older, and so she can examine health claims within this population and relate them to nearby well activity.
Analysis. For her primary analysis, Stephanie decides to compare hospitalizations for respira-
tory disease during times when a well was in operation to otherwise similar times before the well
was active. She first makes a time series plot of hospitalizations and annotates the plot with the
windows of time when oil and gas wells were in operation. As part of her report, she provides an
extensive description of how the oil wells operate and the schedule of activity to which they tend
to adhere. She then runs a generalized linear regression model (GLM) with numbers of hospitalizations
for respiratory diseases as the outcome and an indicator of well activity as the key predictor.
Potential confounders are adjusted for directly in the regression model. To test the sensitivity of
her findings to her chosen model, she fits a series of additional models using different functional
forms on the various confounders. Finally, as an alternative modeling approach, Stephanie implements
a propensity score model for the indicator of well operation and then uses a doubly robust
estimator of the association between well operation and hospitalizations.
Results. The time series plot, generalized linear regression model, and doubly robust estimator
all appear to provide similar and consistent results. Adjusting for different sets of potential con-
founders in the GLM does not appear to modify the primary results much. Stephanie decides that
the time series plot effectively tells the story of the relationship between well activity and hospital-
izations and decides to include that in her final analysis. The results from the GLM, doubly robust
estimator, and various sensitivity analyses are all presented in the analysis. Because the topic is
of great interest to many stakeholders, she writes the entire analysis in an R Markdown document
where the code for all of the analyses and models can be easily accessed. The R Markdown docu-
ment and well activity data are placed on a GitHub repository. Because the hospitalization data
are considered protected health information, she cannot post that dataset publicly.
• Matching to Data: The data are not particularly well-matched to the problem because only
the location of each well and a coarse five-year operating window are available, rather than
precise data on well operation.
• Exhaustive: There is some evidence of exhaustiveness, as two modeling approaches (GLM
and DR estimator) were taken to address the primary question and a time series plot was
presented to display the data directly.
sets of confounders, demonstrating some amount of skepticism in the analysis. There did not
appear to be strong evidence that any of the potential confounders played a key role in the association between well activity and hospitalizations.
• Second-order: The information about how wells operate is second-order and does not directly contribute evidence regarding the primary question.
• Transparent: The presentation of a single element, the time series plot, demonstrates how the key relationship can be communicated in a transparent manner.
• Reproducible: The analysis is partially reproducible because the modeling code and well
data are available to others. However, because the hospitalization data are not available, it
is not possible for others to recreate the analysis results in their entirety.
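Stephanie adjusts for confounders inside a GLM; the sketch below illustrates the underlying idea with simple stratification instead, using invented monthly hospitalization counts and season as the only confounder. Crude pooling overstates the difference because the operating period oversamples winter, while the within-season comparison recovers the stratum-level effect.

```python
import statistics

# Invented monthly hospitalization counts, stratified by season.
# The "during" period deliberately oversamples winter months.
before = {"winter": [30, 32], "summer": [20, 22, 21, 19]}
during = {"winter": [36, 38, 37, 35], "summer": [26, 24]}

# Crude comparison: pool all months, ignoring season.
crude_diff = (statistics.mean(sum(during.values(), []))
              - statistics.mean(sum(before.values(), [])))

# Adjusted comparison: average the within-season differences, a simple
# stand-in for adjusting for confounders in a regression model.
adj_diff = statistics.mean(
    statistics.mean(during[s]) - statistics.mean(before[s]) for s in before
)
```

Comparing `crude_diff` to `adj_diff` is also a miniature sensitivity analysis of the kind the vignette performs with different confounder sets.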
5.4 Vignette 4
Background. Roger is trying to determine whether ambient nickel concentrations are associated
with respiratory illness in a population of interest. He is given data on both variables and must
decide what methods to apply to address this question.
Analysis. Roger initially decides to calculate the Pearson correlation coefficient between the
two variables. Upon doing so he sees that the coefficient is positive and concludes that the two
variables are positively correlated. He then decides to consider one more element, a scatter plot of
the two variables, to further investigate this question of correlation. Upon seeing the scatter plot it
is clear that there is an outlier in the nickel variable that is driving the positive correlation. If the
outlier were removed, the correlation would likely be zero (although he does not actually compute
this value). Seeing the scatter plot, Roger decides not to include it in the final analysis.
Results. The final analysis includes the Pearson correlation coefficient along with a conclusion
that ambient nickel and respiratory illness are “likely positively correlated”.
At this point, the analysis consists of the correlation coefficient and its interpretation.
• Exhaustive: Although the analyst used two elements (the correlation coefficient and the
scatter plot) to explore the problem, only one is presented in the final analysis. As a result, the final analysis is not exhaustive.
• Skeptical: There is no skepticism in the final analysis. One way the skepticism could have
been increased is if the correlation coefficient had been calculated with the outlier removed.
• Transparent: This analysis is simple, yet lacks transparency. Including the scatter plot in the final analysis would have made it considerably more transparent.
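The outlier-driven correlation in this vignette is easy to reproduce. The nickel and illness values below are invented to mimic the scenario: a single extreme point makes the Pearson coefficient strongly positive even though the bulk of the data trend negative, which is exactly what the omitted scatter plot would have revealed.

```python
import statistics

# Invented data: nine ordinary observations plus one extreme nickel reading.
nickel = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 25.0]
illness = [5.2, 4.9, 5.5, 5.0, 5.3, 4.8, 5.1, 5.6, 4.9, 12.0]

def pearson(x, y):
    """Pearson correlation using population standard deviations."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    n = len(x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

r_all = pearson(nickel, illness)             # outlier included: positive
r_trim = pearson(nickel[:-1], illness[:-1])  # outlier removed: negative
```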
6 Discussion
In developing the elements and principles of data analysis, our goal is to define a language for de-
scribing data analyses that can be used to characterize the variation between analyses. Ultimately,
we hope that this language can serve as the foundational framework for developing a theory of data
analysis or data science. While the elements are the building blocks for a data analysis, the prin-
ciples can be wielded by the analyst to create data analyses with diversity in content and design.
It is important to reiterate that the inclusion or exclusion of certain elements or the weighting
of different principles in a data analysis do not determine the overall quality or success of a data
analysis.
The framework we have developed here provides a mechanism for describing differences between
analyses in a manner that is not specifically tied to the science or the application underlying the
analysis. Being able to describe differences in this manner allows data analysts, who may be
working in different fields or on disparate applications, to have a set of concepts that they can use to
have meaningful discussions. The elements and principles therefore broaden the landscape of data
analysts and allow people from different areas to converse about their work and have a distinct
shared identity.
When considering the foundations of data analysis, one cannot ignore the work of Tukey [Tukey,
1962], where he suggested that data analysis be thought of as a scientific field, as opposed to more
like mathematics. In thinking of data analysis that way, the field’s identity would be centered
around a set of problems rather than a set of tools. Our work builds on this work and that of
many others in that we similarly believe that data analysis should be thought of in a different
way. However, perhaps different from Tukey’s time, there is a substantial heterogeneity in the
population of people doing data analysis today. As a result, there is a need to develop a language
for discussing data analysis that is targeted at a somewhat higher level of abstraction than Tukey’s.
The work of Grolemund and Wickham [Grolemund and Wickham, 2014] is closely related to
what we propose here, and in particular, the atomic unit of data analysis discussed in Section 4.
They draw on ideas from the field of cognitive science and characterize data analysis as a sense-
making task whereby theories or expectations are set and then compared to reality (data). Any
discrepancies between the two are further examined and then theories are possibly modified.
Our work and the work of Grolemund and Wickham (as well as Tukey and others) share a de-
sire for a more formal theory of data analysis. However, a key distinction of our work is our focus
on observable outputs from a data analysis. Grolemund and Wickham’s cognitive model is useful
for describing the data analysis process and for shedding light on how it might be improved, but
much of that process is typically not observed by outsiders. We attempt to characterize observed
outputs—the analytic container and analytic product—of a data analysis so that individual anal-
yses can be described and compared to other analyses in a concrete manner. Having a language
that can be used to describe specific analyses has benefits in particular for teaching data analysis
because it allows students and teachers to discuss the various aspects of an analysis and debate their relative merits.
The elements and principles we have laid out here do not make for a complete framework for
thinking about data analysis and leave a number of issues unresolved. We touch on some of those
issues here and point to how they could be addressed in future work.
6.1 Honesty and Intention
The principles described above are designed to be descriptive and to allow one to characterize the
wide range of data analyses that might be conducted in different fields and areas of study. In
describing the principles, we do make some assumptions about the intentions of both the analyst
conducting the analysis and any potential audiences that may view the analysis. However, it
is clear from the historical record that some analyses are not done with the best of intentions.
The possibilities range from benign neglect, to misunderstandings about methodology, to outright
fraud or intentional deceit. These analyses, unsavory as their origins may be, are nevertheless data
analyses. Therefore, they should be describable according to the principles outlined here.
Vignette 4 describes a scenario where an element was employed (a scatter plot) which showed
evidence that appeared to contradict a previous element (the correlation coefficient). As part of
the atomic unit of data analysis, the data analyst is constantly making decisions about whether
to include an element in the final analysis or not. In this hypothetical example, the analyst chose
not to include the scatter plot. The final analysis did provide evidence concerning the correlation
between variables X and Y. However, it could be argued that the evidence is misleading because,
unknown to the audience, a scatter plot would reveal that the bulk of the data exhibit a negative
correlation.
The fact that an analyst has produced misleading evidence is troubling, but such an outcome
does not necessarily have a one-to-one relationship with dishonest intention. On the contrary,
misleading evidence can arise from even the most honest of intentions. In particular, when sample
sizes are small, models do not capture hidden relationships, there is significant measurement error,
or for any number of other analytical reasons, evidence can lead us to believe something for which
the opposite is true. However, such situations are not generally a result of fraud or intentional
deceit. They are often a result of the natural iteration and incremental advancement of science.
Can the principles be used to “detect” dishonest data analyses? Perhaps. For example, the
hypothetical posed above is neither exhaustive nor skeptical. If the analyst had been forced in
some way to consider other data analytic elements or alternative hypotheses, it is possible that
the other aspects of the data would have been revealed. However, a truly wily analyst will be
able to make an analysis appear exhaustive and skeptical, while still being misleading. There is
no guarantee that dishonest analyses will always give certain principles specific weights.
Concern about dishonest analyses is of course highly relevant, but we do not envision the set
of principles outlined here as being able to definitively discriminate them from honest analyses.
Our goal here is to provide a method for characterizing data analyses on a prima facie basis, using
nothing but the analysis presented. Furthermore, it is likely not possible to design an instrument,
which takes solely the data analysis as its input, that can detect such phenomena. Rather, it
would make more sense to implement other methods to either deter or prevent such analyses from occurring in the first place.
The characterization of a data analysis is bound to conflate two important, but distinct, entities.
First is the analysis as presented and second is the analyst who conducted that data analysis. For
example, an analysis presented by a newcomer to a field might require more scrutiny than an
analysis presented by a seasoned veteran.
has established a solid track record of analyzing data in a specific area might be more trustworthy
than a data analyst with no track record. When reviewing a data analysis, it is natural to consider both.
While both the analysis presented and the analyst behind it are important in the evaluation of
the conclusions of an analysis, it is important to consider them separately. A key reason is that
an analysis is presented to an audience, and therefore the audience by definition has all of the
information about that analysis before them. Specific information about the data analyst is seldom
available unless the audience has a personal relationship with the analyst or if the audience is very
familiar with their work and has seen past examples. Therefore, requiring any characterization of an analysis to incorporate information about the analyst is generally impractical.
The audience may have partial information about the person doing the analysis, such as who has
provided money to fund the analysis. Disclosures about funding sources are common in academic
publications because there is a shared understanding that conducting research or data analysis in
an area where significant financial incentives are at stake can cause one to be biased in certain
directions. Readers of a data analysis on the effects of smoking on lung cancer may be more
skeptical of the conclusions drawn if they know the analysis was funded by a tobacco company
which has a significant financial stake in selling cigarettes. However, in general, it is important
that the audience does not over-correct and evaluate the entire analysis based solely on information
about the analyst. The audience may of course question other things outside the analysis, such as
the study design, the data collection, or various other aspects that are often more influential than
the analysis itself.
The practice of data analysis is a rich, complicated, challenging topic because it involves not only
the data analysis (analytic container, analytic product, or analytic presentation) built by the
data analyst, but also requires consideration of contextual inputs such as the hypotheses, data,
audience, and any additional constraints on the analysis. Furthermore, the practice of data analysis
also requires the ability to characterize the validity or the success of the analysis, and the strength
or quality of evidence for the hypothesis of interest. Creating a framework for the elements and
principles of a data analysis as defined here points us in a few clear directions towards addressing
these topics.
First, this language and taxonomy for describing data analyses can be used to have a healthy
debate and discussion on the larger definition of data science (1). We argue that describing
data analyses from first principles and defining data science based on what a data scientist does
is a more productive way to describe variation in data analysis and variation in the meaning of
data science in a manner that spans disciplines, where different data analysts or disciplines may weigh the principles of data analysis differently.
Second, this framework can be used to describe how a data analyst can select informative
elements to investigate a given hypothesis, where an analysis is informative if, for a given
hypothesis, there is a collection of elements presented that provides evidence for or against that hypothesis
(Figure 9). In contrast, if a data analyst chooses elements that do not provide evidence for or
against a hypothesis, then he or she does not learn anything from the data, and the analysis is
not informative. For example, in Vignette 1 the goal was to build a predictive analysis, but the
analyst built an inferential analysis, which is not informative for prediction. As another example, if the
analyst wants to learn if two features X and Y are correlated, they could make a histogram of
X and a histogram of Y. However, these two histograms do not provide any information about
the correlation of the two features (although they may provide other information about the data).
Therefore, the use of histograms to address the question of correlation is not informative. Instead,
a 2-dimensional scatter plot would reveal evidence for or against a hypothesis of non-zero corre-
lation. The definition of informative is inherently making a statement about the quality of the
analysis (where more informative analyses are of a higher quality), and therefore, informativeness
is not a principle of data analyses as we have defined them here. Rather, informativeness would be part of a framework for evaluating the quality or success of an analysis.
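The histogram example can be demonstrated directly with a permutation argument: shuffling one variable leaves both one-dimensional (histogram) distributions unchanged while destroying the correlation, so the marginal distributions alone cannot be informative about it. The simulated data below are for illustration only.

```python
import random
import statistics

random.seed(0)
x = [random.gauss(0, 1) for _ in range(1000)]
y = [xi + random.gauss(0, 0.5) for xi in x]  # y is correlated with x

def pearson(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    sa, sb = statistics.pstdev(a), statistics.pstdev(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) * sa * sb)

# Shuffling y preserves its histogram but breaks its pairing with x.
y_perm = y[:]
random.shuffle(y_perm)

r_orig = pearson(x, y)        # strong positive correlation
r_perm = pearson(x, y_perm)   # near zero

# Both versions of y have identical marginal distributions.
assert sorted(y) == sorted(y_perm)
```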
In addition to informative analyses, one could also imagine this framework being used to un-
derstand the robustness of conclusions to the choices made by an analyst in terms of evaluating
the quality of a data analysis. This might involve a combination of principles, such as exhaustiveness
and skepticism, with the goal of critically thinking about the nature and quality
of evidence that a data analyst is able to extract from a particular data source. For example,
this might entail evaluating the strength and quality of evidence for the particular hypothesis of interest.
Third, describing the variation between data analyses as variation in the weighting of different
principles suggests a formal mechanism for evaluating the success of a data analysis. In
Figure 9: Selecting informative elements in a data analysis. An analysis is informative if
for a given hypothesis, there is a collection of elements that provides evidence for or against that
hypothesis. Given a hypothesis or scientific question, the analyst selects an element to investigate
the hypothesis, such as a data visualization. If the analyst selects an uninformative element, such
as a 1-dimensional data visualization, this would lead to not finding any evidence for or against
the hypothesis. In contrast, if the analyst selects an informative element, such as a 2-dimensional
data visualization, this would lead to finding evidence for or against the hypothesis.
particular, every data analysis has an audience that views the analysis and the audience may
have a different idea of how these various principles should be weighted. One audience may value
reproducibility and exhaustiveness while another audience may value interactivity and brevity.
Neither set of weightings is correct or incorrect, but the success of an analysis may depend on
how well-matched the analyst's weightings are to the audience's. Similarly, data analysts may be
constrained by the context in which an analysis is conducted. In Vignette 2 above, speed was of
the essence and so there was little time for extensive skepticism or exhaustiveness. Regardless
of the situation, an analyst who goes against the principle weightings that are demanded by the
constraints may have some explaining to do. That said, audiences may be open to such explanation
if the analyst can make a convincing argument in favor of a different set of weightings.
Fourth, another important area of consideration is the teaching of data science. We do not
discuss this topic at length here, but the elements and principles may provide an efficient framework
to teach students data science at scale, which is a significant problem given the demand for
data science skills in the workforce. Because much data science education involves experiential
learning with a mentor in a kind of apprenticeship model, there is a limit on how quickly students
can learn the relevant skills while they gain experience. Having a formal language for describing
different aspects of data science that does not require mimicking the actions of a teacher or undergoing
a time-consuming mentorship may serve to compress the education of data scientists and to increase the capacity for training them.
Finally, the development of elements and principles for data analysis provides a foundation for
a more general theory of data science. For example, one could imagine defining mathematical
or set operators on the elements of data analysis and consider the ideas of independence and
exchangeability. One could define the formal projection mapping between a given data analysis and
a principle of data analysis. Alternatively, one could combine one or more elements into coherent
activities to define units or sections of a data analysis, such as the “introduction”, “setup”, “data”,
and “export” units. There might not be a formal ordering of the units and the units can appear in
a data analysis once, more than once or not at all. Then, a set of units can be assembled together
into canonical forms of data analyses, which are likely to vary across disciplines.
7 Summary
The demand for data science skills has grown significantly in recent years, leading to a re-examination
of the practice and teaching of data analysis. Having a formal set of elements and principles for
characterizing data analyses allows data analysts to describe their work in a manner that is not
confounded by the specific application or area of study. Having concrete elements and principles
also opens many doors for further exploration and formalization of the data analysis process. The
benefits of developing elements and principles include setting the basis for a distinct identity for
the field of data science and providing a potential mechanism for teaching data science at scale.
8 Back Matter
8.1 Author Contributions
SCH and RDP equally conceptualized, wrote and approved the manuscript.
8.2 Acknowledgements
We would like to thank Elizabeth Ogburn, Kayla Frisoli, Jeff Leek, Brian Caffo, Kasper Hansen,
Rafael Irizarry and Genevera Allen for the discussions and their insightful comments and suggestions.
References
Ben Baumer. A data science course for undergraduates: Thinking with data. The American Statistician, 69(4):334–342, 2015.
G. E. P. Box. Science and statistics. Journal of the American Statistical Association, 71(356):
791–799, 1976.
Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199–215, 2001.
William S. Cleveland. Data science: An action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1):21–26, 2001.
Drew Conway. The Data Science Venn Diagram, 2010. URL http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.
D. Cook and D. F. Swayne. Interactive and Dynamic Graphics for Data Analysis with R and GGobi. Springer, 2007.
Garrett Grolemund and Hadley Wickham. A cognitive interpretation of data analysis. International Statistical Review, 82(2):184–204, 2014.
Johanna Hardin, Roger Hoerl, Nicholas J. Horton, and Deborah Nolan. Data science in statistics
curricula: Preparing students to “think with data”. The American Statistician, 69:343–353, 2015.
Harlan Harris. The Data Products Venn Diagram, 2013. URL http://www.datacommunitydc.
org/blog/2013/09/the-data-products-venn-diagram.
Stephanie C. Hicks and Rafael A. Irizarry. A guide to teaching data science. The American Statistician, 72(4):382–391, 2018. doi: 10.1080/00031305.2017.1356747.
Daniel Kaplan. Teaching stats for data science. The American Statistician, 72(1):89–96, 2018.
Jeffery T. Leek and Roger D. Peng. Statistics: What is the question? Science, 347(6228):1314–1315, 2015.
Lisa Marder. The 7 Principles of Art and Design, 2018. URL https://www.thoughtco.com/
principles-of-art-and-design-2578740.
Hilary Mason and Chris Wiggins. A Taxonomy of Data Science, 2010. URL http://www.
dataists.com/2010/09/a-taxonomy-of-data-science/.
Ulrich Matter. Data Science in Business/Computational Social Science in Academia?, 2013. URL
http://giventhedata.blogspot.com/2013/03/data-science-in-businesscomputational.
html.
Deborah Nolan and Duncan Temple Lang. Computing in the statistics curricula. The American Statistician, 64(2):97–107, 2010. doi: 10.1198/tast.2010.09132.
National Gallery of Art. The Elements of Art. National Gallery of Art, 2019. URL https:
//www.nga.gov/education/teachers/lessons-activities/elements-of-art.html.
PricewaterhouseCoopers. What’s next for the data science and analytics job market?, 2019. URL
https://www.pwc.com/us/en/library/data-science-and-analytics.html.
R. Silberzahn, E. L. Uhlmann, …, M. Witkowiak, S. Yoon, and B. A. Nosek. Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3):337–356, 2018.
Luke Tierney. Data science is multidisciplinary, 2012. URL …/2012/06/data-science-is-multidisciplinary.html.
John W. Tukey. The future of data analysis. The Annals of Mathematical Statistics, 33(1):1–67, 1962.
John W. Tukey and M. B. Wilk. Data analysis and statistics: An expository overview. In Proceedings of the November 7–10, 1966, Fall Joint Computer Conference, pages 695–709, 1966.
C. J. Wild. Embracing the "wider view" of statistics. The American Statistician, 48(2):163–171,
1994.
American Statistical Association Undergraduate Guidelines Workgroup. 2014 Curriculum Guidelines for Undergraduate Programs in Statistical Science. American Statistical Association, 2014. URL http://www.amstat.org/education/curriculumguidelines.cfm.