Thinking With Data - Summary
Max Shron
Preface
The preface of the book discusses the importance of working with data to
produce knowledge that can be used by both humans and machines. The
author, a data strategy consultant, emphasizes the need to ask the right
questions to create meaningful insights. The problem of turning
observations into knowledge is not new and has been tackled by experts in
various fields. The purpose of the book is to help readers understand that
they are not alone in this pursuit and can learn from past experiences. The
author also provides an incomplete list of things that can be improved with
data, including answering questions, telling stories, discovering patterns,
and automating processes.
Max Shron explains that working with data requires both hard skills (e.g.,
data cleaning, mathematical modeling) and soft skills (e.g., problem-solving,
organization, critical thinking) to make data useful. These soft skills are often
overlooked but are essential for turning observations into knowledge. The
author suggests that data scientists can learn from other fields such as
design, critical thinking, and program evaluation to improve their skills. The
author emphasizes that a focus on why rather than how is crucial for
successful data work and that the best data professionals already
understand this.
The author suggests that creating a scope for a data problem is the first
step in finding structure. A scope is an outline of a story that explains why
we are working on a problem and how we expect the story to end. A well-
scoped problem includes four parts: context, need, vision, and outcome,
which can be remembered using the mnemonic CoNVO. The author
recommends that we write down and refine our CoNVO to clarify our
thoughts and communicate with others. Having a clear scope is crucial to
getting a data project started, and the author will explore other aspects of
the process in subsequent chapters.
Context (Co):
The author explains that every data project has a context, which is the
defining frame apart from the particular problems being solved. The
context includes understanding who the project is for, their long-term
goals, and what work is being furthered. The context sets the overall tone
for the project and guides the choices made about what to pursue. The
context can emerge from talking to people and understanding their
mission. It is important to clearly articulate the long-term goals of the
people being helped, even when embedded within an organization. The
context can include relevant details like deadlines that help prioritize the
work. Finally, the author notes that while curiosity and a hunger for
understanding can themselves be a context, treating every situation only as a
chance to satisfy our own interests means passing up opportunities to provide
value to others.
Needs (N):
The author explains that correctly identifying needs in a data project is
tough but crucial. The needs should be presented in terms that are
meaningful to the organization, and they are fundamentally about
understanding some part of how the world works. The author
compares data science needs to needs from other related disciplines
like software engineering, applied machine learning, and risk management. The
author provides some examples of common needs in data projects, such as
identifying the most profitable location for a business, understanding why
customers leave a website, deciding between two competing vendors, and
determining the effectiveness of an email campaign. Finally, the author
gives a famous example of a data project that involves identifying pregnant
women from their shopping habits.
Vision (V):
The importance of having a clear vision before starting data work is
emphasized in this passage. A vision provides a glimpse of what it will
look like to meet the need with data and could consist of a mockup or
argument sketch. Mockups are low-detail idealizations of the final result
and can take the form of a few sentences or a user interface sketch,
while argument sketches provide a loose outline of the statements that
will make the work relevant and correct. Having a good mental library of
examples and experimenting with new ideas helps to refine the vision.
Creating things in quantity is more important than creating things of
quality when starting out, as it helps to learn from mistakes quickly.
Mockups and argument sketches serve different purposes in the
planning and design of a project.
A mockup is a low-detail idealization or representation of what the final
result of a project might look like. It is a simplified version of the end
goal, usually in the form of a graph, report, or user interface for a tool.
The purpose of a mockup is to provide an example of the kind of result
we would expect, or an illustration of the form that results might take.
On the other hand, an argument sketch is a loose outline of the
statements that will make the project relevant and correct. It outlines the
logic behind the solution and emphasizes why the project is important.
The purpose of an argument sketch is to help orient ourselves and
communicate the rationale behind the project to others.
So, while a mockup provides a visual representation of what the end
result might look like, an argument sketch focuses on the reasoning
behind the project and why it matters. Both are important tools in the
planning and design of a project and can be used in conjunction with
each other to create a comprehensive plan.
Example 1
• Vision: The media organization trying to define user engagement will get a
report outlining why a particular user engagement metric is the ideal one,
with supporting examples; models that connect that metric to revenue,
growth, and subscriptions; and a comparison against other metrics.
• Mockup: Users who score highly on engagement metric A are more likely
to be readers at one, three, and six months than users who score highly on
engagement metrics B or C. Engagement metric A is also more correlated
with lifetime value than the other metrics.
• Argument sketch: The media organization should use this particular
engagement metric going forward because it is predictive of other valuable
outcomes.
This text is about a media organization trying to figure out which metric to use
to measure user engagement. They will receive a report that explains why a
specific metric is the best one to use, with examples and models that show
how it connects to revenue, growth, and subscriptions.
There is also a mockup that shows users who score highly on engagement
metric A are more likely to be readers at one, three, and six months than users
who score highly on engagement metrics B or C. Engagement metric A is also
more correlated with lifetime value than the other metrics.
Based on this information, the media organization should use engagement
metric A going forward because it is more predictive of valuable outcomes.
Example 2
• Vision: The marketing department will get a spreadsheet that can be dropped into the existing
workflow. They will fill in some characteristics of a city, and the spreadsheet will indicate what the
estimated value would be.
• Mockup: By inputting gender and age skew and performance results for 20 cities, an estimated return
on investment is placed next to each potential new market. Austin, Texas, is a good place to target based
on age and gender skew, performance in similar cities, and its total market size.
• Argument sketch: The department should focus on city X, because it is most likely to bring in high
value. The definition of high value that we’re planning to use is substantiated for the following reasons….
This text is about a marketing department that wants to identify which city
they should focus on for their marketing campaign. They will receive a
spreadsheet that can be easily integrated into their existing workflow, which
will provide an estimated value for each city based on certain characteristics.
The mockup shows that by inputting data on gender and age skew, as well as
performance results for 20 different cities, an estimated return on investment
is displayed next to each potential new market. Based on this data, Austin,
Texas is identified as a good place to target due to its age and gender skew,
performance in similar cities, and total market size.
The argument sketch suggests that the marketing department should focus on
a specific city (referred to as "city X") because it is most likely to bring in high
value. The definition of "high value" is supported by certain reasons, which are
not provided in this text.
Outcome (O):
The outcome is the final part of the scope: it concerns what will happen once the
work is done, who will use it and how it will reach them, and how we will know
whether it succeeded, as distinct from the vision of what the finished work will
look like.
--------------------------------------------------------------------------------------
Chapter 2: What Next?
The chapter suggests several techniques for figuring out what comes next:
- Talk to experts: Talk to experts in the subject matter, especially people who work on a
task all the time and have built up strong intuition. Their intuition may or may not
match the data, but their perspective is invaluable for building your own.
- Rapid investigation: Start from the finished idea and figure out what is needed
immediately prior in order to achieve the outcome. Then see what comes before that,
and before that, and so on, until you arrive at data or knowledge you already have.
- More mockups: Pretend you are the final consumer or user of the project, and think
out loud about the process of interacting with the finished work.
Deep Dive: Real Estate and Public Transit
The text provides an example of a New York residential real estate company that
wants to use public transit data to improve its understanding of rental
prices. After some discussion, the company's needs are clarified: to
confirm the relationship between rental prices and public transit access,
to identify mispriced apartments, and to predict real estate prices. The
goal is to improve the company's profitability, and while public transit
data is the focus, other data may also be important. The text emphasizes
the importance of subject matter knowledge and intuition in understanding
the problem and envisioning the end result, such as a graph of price
against nearness to transit. Sketching a mockup of the graph can be a
helpful exercise. The text explores different ways to capture the
relationship between public transit access and apartment prices,
including graphs, statistical models, and maps. Each approach has its
strengths and weaknesses. A scatterplot is easy to make but potentially
misleading, while a statistical model can collapse variation in the data but
may miss interesting patterns. A map is limited in accounting for non-
spatial variables but is easier to inspect. The goal is to meet the
company's needs, including verifying the relationship between public
transit and rental prices, identifying mispriced apartments, and predicting real
estate prices. Different approaches can lend themselves to different
arguments and can be used to analyze different aspects of the data.
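As a hedged illustration of the tradeoff described above, the sketch below plots the raw scatterplot and fits a single-variable model; the file and column names (apartments.csv, rent, km_to_transit) are invented for illustration and are not from the book.

```python
# Sketch: two of the approaches discussed above, on hypothetical data.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

apts = pd.read_csv("apartments.csv")  # assumed columns: rent, km_to_transit

# A scatterplot: easy to make, but potentially misleading on its own.
apts.plot.scatter(x="km_to_transit", y="rent", alpha=0.3)
plt.title("Rent vs. distance to nearest transit stop")
plt.savefig("rent_vs_transit.png")

# A single-variable model: collapses the variation into one slope,
# at the risk of hiding interesting patterns in the residuals.
X = sm.add_constant(apts["km_to_transit"])
print(sm.OLS(apts["rent"], X).fit().summary())
```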
The final output of the project will likely be more complex than the initial
ideas presented. The process involves starting with strong assumptions
and gradually relaxing them until a feasible solution is found. Concrete
examples, such as plugging intersections into real estate websites or reading
classifieds, can spur the imagination and provide insights into the data.
Even if the details take up most of the time, they are only epicycles on
top of a larger point, and it is important not to forget the overall goal of
the project. Immersing oneself in particular examples is useful, even if it
only means reading the first few lines of a table before building a model.
Deep Dive Continued: Working Forward
"Deep Dive Continued: Working Forward" discusses the next steps in the
process of working on a data project after imagining the end goal. It
begins by suggesting that it is important to think about what kind of
data is needed to define the variables, and uses the example of
apartment prices to illustrate the process. The article then moves on to
the concept of transit access and how it can be defined and measured.
The article notes that refining the vision of the project is important, and
that it may require acquiring additional information to distinguish the
effect of proximity to transit from other factors. It also highlights the
importance of recognizing the causal nature of the question being
asked and the need for a causal argument pattern.
The article concludes by discussing the limitations that may be faced in
the project, the importance of thinking about the outcome and how the
work will be used, and the need to consider maintenance and cross-
checking of results. Overall, the article emphasizes the importance of
careful planning and consideration of the many factors involved in a
data project.
Deep Dive Continued: Scaffolding
"Deep Dive Continued: Scaffolding," the author discusses the
importance of scaffolding in a data project, which involves breaking
down the project into smaller, manageable tasks to avoid wasting time
and to stay focused on the larger picture.
The article suggests that, especially at the beginning of a project, it is
important to focus on building intuition by using simple tabulations,
visualizations, and reorganized raw data. It also suggests using models
that can be easily fit and interpreted, or models that have
great predictive performance without much work, as a starting point for
a predictive task.
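A minimal sketch of this kind of scaffolding, assuming the same hypothetical apartments.csv and column names used above: cheap tabulations to build intuition, then an easily fit, easily interpreted baseline model before anything more complex.

```python
# Sketch of early scaffolding: simple tabulations first, then a small,
# interpretable baseline. Column names are assumed for illustration.
import pandas as pd
from sklearn.linear_model import LinearRegression

apts = pd.read_csv("apartments.csv")

# Quick summaries and a reorganization of the raw data.
print(apts[["rent", "km_to_transit", "bedrooms"]].describe())
print(apts.groupby("neighborhood")["rent"].median().sort_values())

# A baseline that is easy to fit and easy to interpret; anything
# fancier should have to beat this before it earns more of our time.
X, y = apts[["km_to_transit", "bedrooms"]], apts["rent"]
baseline = LinearRegression().fit(X, y)
print(dict(zip(X.columns, baseline.coef_.round(1))))
```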
To avoid getting too deep into exploratory steps and losing sight of the
larger picture, the article recommends setting time limits for exploratory
projects and writing down a scaffolding plan to lay out the next few
goals and expected shifts.
The article also emphasizes the importance of understanding the
argument being made and using that to prioritize analysis. Building
exploratory scatterplots should precede the building of a model, and
building simple models before tackling more complex ones will save
time and energy.
Overall, the article highlights the importance of prioritizing aims and
proceeding in a way that is as instructive as possible at every step.
Verifying Understanding
"Verifying Understanding" emphasizes the importance of
building explicit checks with partners or decision makers in
any scaffolding plan to ensure that their needs are properly
understood. Overcommunication is better than assuming that both
parties are on the same page.
The article suggests finding a convenient medium to explain the
partners' needs back to them and asking if they have understood things
properly. The goal is to achieve conceptual agreement, not a detailed
critique of the project, unless the partners are data-savvy and
particularly interested.
It is also important to talk to those who will be needed to implement the
final work and make sure that their requirements are being represented.
The article emphasizes the value of partners' intuitive understandings of
the processes they are looking to understand better and suggests
spending time talking to people who deal with a process to get the
intuition needed to build a data-based argument that can create real
knowledge.
The article concludes by stating that the mark of having grasped the
problem well is being able to explain the strategy to partners in terms
that matter to them and receiving enthusiastic understanding in return.
Regular check-ins with partners and decision makers are important to
ensure that the project is on track and meeting their needs.
Getting Our Hands Dirty
In the article, the author emphasizes the importance of spending time
on data gathering and transformation to ensure that the evidence
gathered can serve as a valid argument. The article suggests starting
with easy transformations, such as exploratory graphs or summary
statistics, before moving on to more sophisticated or complicated
models.
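One example of such a transformation step, sketched under assumptions (hypothetical apartments.csv and transit_stops.csv files with latitude/longitude columns): turning two raw tables into a single evidence-ready column, the distance from each apartment to its nearest transit stop.

```python
# Sketch: transform raw data into evidence by computing, for each
# apartment, the distance to its nearest transit stop.
import numpy as np
import pandas as pd

apts = pd.read_csv("apartments.csv")       # assumed: lat, lon, rent, ...
stops = pd.read_csv("transit_stops.csv")   # assumed: lat, lon

def nearest_stop_km(lat, lon):
    """Great-circle distance (km) from one apartment to the nearest stop."""
    dlat = np.radians(stops["lat"] - lat)
    dlon = np.radians(stops["lon"] - lon)
    a = (np.sin(dlat / 2) ** 2
         + np.cos(np.radians(lat)) * np.cos(np.radians(stops["lat"]))
         * np.sin(dlon / 2) ** 2)
    return float((6371 * 2 * np.arcsin(np.sqrt(a))).min())

apts["km_to_transit"] = [nearest_stop_km(r.lat, r.lon) for r in apts.itertuples()]
print(apts[["rent", "km_to_transit"]].describe())
```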
Once the data is transformed and ready to serve as evidence, the article
suggests evaluating the strength of the arguments by updating the
initial paragraphs with new information. The author emphasizes the
need to consider alternative interpretations of the data and to ensure
that the argument makes sense given the data that has been collected.
The article provides an example of how the evaluation process can lead
to further considerations, such as questioning the representativeness of
the sample of apartments and exploring options for addressing
potential biases. The author emphasizes that strong arguments will
rarely consist of a single piece of evidence and that multiple strands of
evidence may be necessary to support an argument.
Overall, the article highlights the importance of careful planning and
evaluation at each step of the data analysis process to ensure that the
evidence gathered can serve as a valid argument.
The passage discusses the importance of presenting data in a way that is
both useful and pleasing to the audience. This involves considering the
audience's needs and preferences, such as whether to present a single
graph or multiple models and summaries, or whether to present claims
in a chain or in parallel. The presentation format is also important, such as
deciding which elements of a tool should be optional or fixed, or which
examples to use in a narrative. The author emphasizes that presentation
matters and that genre conventions and tone should be considered. The
process of working with data is not always linear, and involves an
interplay of clarification and action to arrive at a better outcome.
Chapter 3: Arguments
The passage discusses the importance of arguments in making
knowledge out of observations. While observations alone are not
enough to act on, connecting them to how the world works through
arguments allows us to create different kinds of knowledge, such
as accurate mental models or algorithmic understanding. Understanding how
arguments work is important in working with data, as it helps us better
convey complicated ideas, build projects in stages, find inspiration from
well-known patterns of argument, substitute techniques, make our
results more coherent, present our findings, and convince ourselves and
others that our tools work as expected. Without a good understanding
of arguments, our work with data may be small and disconnected.
The passage highlights that most of the prior beliefs and facts that go
into an argument are not explicitly spelled out. It is important to
consider the audience's existing knowledge and understanding, as this
defines what is reasonable in a given situation. Common sources of prior
or background knowledge include scientific laws, the validity of data sources,
algebraic or mathematical theory, commonly known facts, and "wisdom" or
commonly accepted beliefs. While it may be necessary to provide
additional information or clarification for certain audiences, most details
of techniques can be considered background knowledge. By taking into
account the audience's existing knowledge and understanding, we can
make a more effective argument that is reasonable and convincing.
Building an Argument
The text discusses the use of visual schematics in setting up an argument,
but the author suggests that sentences and sentence fragments are more
flexible and practical. The author notes that prior knowledge and facts
often enter into an argument without being acknowledged. The text
provides examples of how ideas are expressed in written language and tags
concepts such as "claims" to make them more transparent. The author
cautions that the example given is fictional and not to be cited as truth
without proper research.
Claims
In the context of building an argument, a claim is a statement that can
be reasonably doubted but that the presenter believes they can make a
case for. All arguments contain one or more claims, with one main claim
supported by one or more subordinate claims. The claim should be
framed in terms that decision makers care about and not in technical jargon.
The details justifying the claim may involve different techniques, but
they all serve as support rather than the big idea. In presenting an
argument, it is important to use strong arguments and techniques that
can make a case for a causal relationship, rather than relying solely on
statistical tools.
Evidence, Justification, and Rebuttals
This excerpt emphasizes the importance of evidence in constructing an
argument and highlights how raw data needs to be transformed into a
more compact and intelligible form to be incorporated into an
argument. The transformation of data puts an interpretation on it by
highlighting essential things. The excerpt also provides examples of
different types of transformations, such as counting sales or making a
map of taxi pickups in a city.
In the context of a transit example, the excerpt suggests that a high-
resolution map of apartment prices overlaid on a transit map or a two-
dimensional histogram or scatterplot can be reasonable evidence to
show the relationship between transit access and apartment prices.
However, for more robust claims such as forecasting the relationship
into the future, more sophisticated models may be necessary. Overall,
the excerpt highlights the importance of transforming data into
evidence to support an argument and using appropriate types of
evidence for different claims.
Evidence and Transformations
This excerpt highlights the importance of evidence and transformations
in constructing an argument. It provides examples of different types of
evidence, such as a graph of historical price data or a map of home
prices overlaid on a transit map, and how they can demonstrate a
relationship between variables. The excerpt also emphasizes the need
for a justification to connect the evidence to the claim being made. The
justification provides a logical connection between the evidence and the
claim, and can be supported by different methods such as visual
inspection, cross-validation, or single-variable validation. The excerpt
also acknowledges the potential rebuttals to an argument and how they
need to be addressed to make a watertight argument. Rebuttals may
include issues such as improper randomization, small sample size, or
outdated data. Finally, the excerpt introduces the concept of the degree
of qualification of an argument, which provides some degree of
certainty in the conclusions, ranging from possible to definite.
Overall, the excerpt highlights the importance of using appropriate
evidence, transformations, and justifications, and being aware of
potential rebuttals to construct a convincing argument.
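As a hedged sketch of the cross-validation justification mentioned above: holding data out and checking that the fitted relationship still predicts well is one way to connect the evidence (a fitted model) to the claim (the relationship is not an artifact of this particular sample). File and column names are again assumed.

```python
# Sketch: cross-validation as a justification that a fitted model of
# rent vs. transit access generalizes beyond the data it was fit on.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

apts = pd.read_csv("apartments.csv")  # assumed columns
X, y = apts[["km_to_transit", "bedrooms"]], apts["rent"]

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("held-out R^2 per fold:", scores.round(2), "mean:", scores.mean().round(2))
```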
Evidence:
In building an argument, evidence is the information or data that
supports the claim being made. Evidence can come in various forms,
such as statistics, research findings, expert opinions, facts, anecdotes,
and examples. It is important to ensure that the evidence is reliable,
relevant, and sufficient to support the claim being made. Raw data
needs to be transformed into something that is more easily
understandable and incorporated into the argument. Transformations
can include a graph, model, sentence, or map that highlights essential
information.
Justification:
The justification is the reason that connects the evidence to the claim
being made. It provides a logical connection between the evidence and
the claim. Justifications can include visual inspection, cross-validation, or
other methods that support the claim being made. It is important to
ensure that the justification is sound and convincing to the audience.
Rebuttals:
Rebuttals are the reasons why a justification may not hold in a particular
case. They are the yes-but-what-if questions that naturally arise in any
argument. It is important to be aware of these rebuttals and address
them appropriately to make a convincing argument. Rebuttals may
include issues such as outdated data, incorrect error functions, or small
sample sizes.
Degree of qualification:
The degree of qualification of an argument provides some degree of
certainty in the conclusions, ranging from possible to definite. Deductive
logic provides definite certainty in its conclusions, but its use in practice
is limited. In most cases, arguments provide some degree of probability
or likelihood, and it is important to have a sense of how strong the
conclusion is to avoid making false claims.
Overall, building an argument involves presenting a claim, providing
evidence to support the claim, justifying the evidence, addressing
rebuttals, and providing a degree of qualification for the argument. It is
important to ensure that the argument is sound, convincing, and well-
supported to persuade the audience to accept the claim being made.
Adding Justifications, Qualifications, and Rebuttals
This text discusses a correlation between travel time to the rest of the city
and apartment prices. A 5% reduction in travel time results in a 10%
increase in apartment prices, and vice versa. This relationship has been
observed in historical apartment prices and transit access data. A model
trained on 20 years of raw price data shows a high probability of this
relationship. Big changes in transit access have a visible effect on
apartment prices within a year, indicating a strong probability of a
relationship. Closing a train stop results in a 10% decline in apartment
prices within two blocks over the next year in at least 70% of cases.
However, there are three possible confounding factors. First, there may
be nicer apartments being built closer to train lines, but new
construction or renovation takes place over longer time scales than price
changes. Second, there may be large price swings even in the absence
of changing transit access, but the average price change between
successive years for all apartments is only 5%. Third, it is possible
that transit improvements or reductions and changes in apartment price
are both caused by some external change, like a decline in
neighborhood quality. However, the city in question has added transit
options spottily and rent often goes up regardless of these transit
changes.
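One way a claim like the 5%-to-10% figure might be estimated is a log-log regression on the historical panel, where the slope reads as an elasticity (a slope near -2 would be consistent with a 5% reduction in travel time accompanying a 10% price increase). The sketch below assumes a hypothetical yearly panel; it is not the book's actual model.

```python
# Sketch: estimate the price/travel-time elasticity from a yearly panel.
# File and column names (price, travel_minutes, year) are assumed.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("prices_by_year.csv")
panel["log_price"] = np.log(panel["price"])
panel["log_travel"] = np.log(panel["travel_minutes"])

# The coefficient on log_travel is an elasticity; year fixed effects
# soak up market-wide trends that affect all apartments alike.
fit = smf.ols("log_price ~ log_travel + C(year)", data=panel).fit()
print(fit.params["log_travel"])
print(fit.conf_int().loc["log_travel"].values)
```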
Note:
In the context of reasoning and communication, an argument is a set of
statements or premises presented in support of a conclusion or claim. The
goal of an argument is to persuade an audience or interlocutor of the truth or
validity of the conclusion or claim being made. Arguments can take many
forms, including deductive arguments, inductive arguments, and abductive arguments, and
they can be evaluated based on their logical validity, soundness, and
persuasiveness. In the context of data science, arguments may involve presenting
evidence, statistical analysis, and models to support a particular conclusion or
decision.
DEFINITION
Disputes of definition occur when there is a particular way we want to
label something, and we expect that label to be contested. Definitions in
a data context are about trying to make precise relationships in an
imprecise world. There are two activities involving definitions that can
happen in an argument: making a case that a general, imprecise term fits a
particular example, or refining an existing idea into something precise
enough to be counted or measured. A good definition should be useful, consistent, and the
best option among reasonable alternatives. Discussions about how well
a definition fits with prior ideas can take many forms, such as how well it
agrees with previously accepted definitions or how well it captures good
examples. A definition that an audience has no prior understanding of
will not be contested.
VALUE
This text discusses the concept of value in decision-making. To make
judgments, we need to select and defend criteria of goodness that apply
to a given situation. Different values, such as ease of interpretability,
precision, and completeness, may be important in different scenarios.
The text highlights two key issues in disputes of value: how our goals
determine which values are most important, and whether the value has
been properly applied in a given situation. The values that matter
depend on our goals, and we need to make a case for why a particular
model or metric is interpretable, concise, elegant, accurate, or robust.
Ultimately, our values are shaped by our goals, and we need to argue for
why a particular value is important in a given context.
POLICY
The text discusses disputes of policy, which arise when we want to
determine whether a particular course of action is the right one.
Examples of such disputes include whether to reach out to paying
members more often by email or whether a particular implementation
of an encryption standard is good enough to use. The four stock issues of
disputes of policy are: (1) is there a problem, (2) where is credit or blame
due, (3) will the proposal solve it, and (4) will it be better on balance?
These can be distilled into Ill, Blame, Cure, and Cost. To make a case for
a particular policy proposal, we need to show that there is a problem
worth fixing, pinpoint the source of the problem, demonstrate that our
proposed solution will work, and determine whether the benefits
outweigh the costs. Having a framework to think through disputes of
policy can be invaluable in making effective arguments for change.
General Topics
The text discusses general topics, which are patterns of argument
identified by Aristotle that are repeatedly applied across every field.
These classic varieties of arguments include specific-to-general,
comparison, comparing things by degree, comparing sizes, and
considering the possible as opposed to the impossible. While some of
these patterns may seem remote from data work, they occur constantly
in various fields. For example, specific-to-general reasoning and
reasoning by analogy may require more exposition to understand how
they apply to data work.
SPECIFIC-TO-GENERAL
The text explains specific-to-general reasoning, a pattern of argument
that involves reasoning from examples to make a point about a larger
pattern. In data-focused ideas, statistical models often use this
reasoning. This pattern also occurs in user experience testing, where a small
number of users is studied in great detail to draw conclusions about
the larger user base. Another example is an illustration, which uses one
or more examples to build intuition for a topic and provide grounding
for the audience. Concrete examples, sticky graphics, and explanations
of the prototypical can help ground arguments and improve the chance
of a strong takeaway, even when they are incidental to the body of the
argument.
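A minimal sketch of the statistical form of specific-to-general reasoning, with invented numbers: estimate a rate for the whole user base from a small sample, and attach an interval that qualifies the generalization.

```python
# Sketch: specific-to-general reasoning in its statistical form.
# 38 of 200 sampled users (invented numbers) completed a task; what
# can we claim about the larger user base?
from statsmodels.stats.proportion import proportion_confint

successes, n = 38, 200
low, high = proportion_confint(successes, n, alpha=0.05, method="wilson")
print(f"point estimate: {successes / n:.2f}")
print(f"95% interval for the population rate: ({low:.2f}, {high:.2f})")
```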
GENERAL-TO-SPECIFIC
The text discusses general-to-specific arguments, which involve using
beliefs about general patterns to make inferences for specific examples. This
pattern is the opposite of specific-to-general reasoning and is often
used to make tentative conclusions based on general patterns. For example, if
a company is experiencing rapid revenue growth, it seems plausible to infer
that the company will find it easy to raise money. However, this
argument can be rebutted by pointing out that the specific example
may not have the properties of the general pattern and may be an
outlier. This pattern is often used in data work when applying a statistical
model to a new example after using specific-to-general reasoning to
claim that a sample can stand in for the whole.
ARGUMENT BY ANALOGY
The text discusses "Argument by Analogy," which can be either literal or
figurative. Literal analogies involve two similar types of things, while figurative
analogies involve two dissimilar things that are argued to be
alike. Mathematical models are an example of figurative analogies, where
behavior in one domain can help understand behavior in another
domain. However, the rebuttal for argument by analogy is that what may
hold for one thing may not hold for another, and there may be
unaccounted-for factors. Despite this, when the analogy
holds, mathematical modeling is a powerful way to reason.
Special Arguments
The text discusses the special argument strategies that are used in data
science and other mathematically inclined disciplines, such as engineering,
machine learning, and business intelligence. These argument strategies
can be mixed and matched in various ways. The text highlights three
specific special arguments: optimization, bounding cases, and cost/benefit
analysis. However, there are many additional argument strategies that
can be observed through careful analysis of case studies within data
science or related fields.
OPTIMIZATION
The text describes optimization as an argument strategy in data science, which
involves determining the best way to do something given certain
constraints. This can involve creating processes such as recommending
movies, assigning people to advertising buckets, or packing boxes based
on customer preferences. While optimization also occurs when fitting a
model through error minimization, this activity is less argumentative
compared to debating what to optimize. Deciding what to optimize
often involves a value judgment, which can be more interesting and
controversial than the optimization process itself.
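A hedged sketch of the mechanical half of an optimization argument, using an invented cost matrix: assign people to advertising buckets so that total expected cost is minimized. As the text notes, the more contestable step is the value judgment baked into the costs themselves.

```python
# Sketch: assign four people to four advertising buckets so total
# expected cost is minimized. The cost matrix is the value judgment.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[4, 1, 3, 2],   # invented expected costs:
                 [2, 0, 5, 3],   # rows are people, columns are buckets
                 [3, 2, 2, 1],
                 [4, 2, 3, 0]])

rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)), "total cost:", cost[rows, cols].sum())
```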
BOUNDING CASE
The text explains the argument strategy of bounding cases, which involves
determining the highest or lowest reasonable values for something. There are
two major ways to make an argument about bounding cases: sensitivity analysis,
and simulation (statistical sensitivity analysis). Sensitivity analysis involves
varying assumptions to determine the best- or worst-case values, while
simulation involves making assumptions about the plausibility of values and
simulating the likely results. The resulting bounds can be used to calculate
percentiles of revenue or uptime, depending on the relevant risk profile of the
audience. Bounding cases can provide rough estimates for important numbers,
which can be useful in pushing decisions forward or determining if a project is
feasible.
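A minimal sketch of the simulation flavor of a bounding-case argument, under invented assumptions about demand and per-customer revenue: simulate many plausible outcomes and report the percentiles that match the audience's risk profile.

```python
# Sketch: bound next year's revenue by simulation under invented
# assumptions, then report low, middle, and high percentiles.
import numpy as np

rng = np.random.default_rng(0)
n_sims = 10_000
customers = rng.poisson(lam=1_200, size=n_sims)            # assumed demand
revenue_per = rng.normal(loc=90, scale=15, size=n_sims)    # assumed $/customer
revenue = customers * revenue_per

print(np.percentile(revenue, [5, 50, 95]).round(-3))
```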
COST/BENEFIT ANALYSIS
The text describes the argument strategy of cost/benefit analysis, which
involves putting each possible outcome from a decision in terms of a
common unit such as time, money, or lives saved. The goal is to
maximize the benefit or minimize the cost and compare proposed
policies against alternatives. Cost/benefit analyses can be combined with
bounding case analyses to strengthen the argument, such as when the
lowest plausible benefit from one action is greater than the highest
plausible benefit from another. However, the rebuttals to a cost/benefit
analysis include the possibility of miscalculated costs and benefits, the
limitations of using this method to make decisions, and the potential for
other costs or benefits to reorder the answers.
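A small arithmetic sketch of combining the two strategies, with invented dollar figures: if the lowest plausible net benefit of option A still exceeds the highest plausible net benefit of option B, the comparison survives the uncertainty in the estimates.

```python
# Sketch: cost/benefit ranges (invented, in dollars) for two options.
option_a = {"benefit": (150_000, 200_000), "cost": (40_000, 60_000)}
option_b = {"benefit": (70_000, 110_000), "cost": (30_000, 50_000)}

worst_a = option_a["benefit"][0] - option_a["cost"][1]   # lowest plausible net for A
best_b = option_b["benefit"][1] - option_b["cost"][0]    # highest plausible net for B

print("worst-case net for A:", worst_a, "| best-case net for B:", best_b)
print("A beats B even in its worst case:", worst_a > best_b)
```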
Designs
The text discusses the purpose of causal analysis, which is to deal with the
problem of confounders, and the use of designs to structure an analysis
and reduce uncertainty in understanding cause and effect. A design is a
method for grouping units of analysis according to certain criteria. The
text also mentions multiple causation, where more than one thing has to be
true simultaneously for an outcome to come about, illustrated by the
example of a tree falling down due to both wind and lack of trimming.
The goal of investigating causation is to understand how a treatment
would change an outcome in a new scenario, and how well a causal
explanation generalizes to new situations. Examples include
understanding the causal effect of smoking on lung cancer and the causal
effect of a drug on a disease.
Intervention Designs
The text discusses different design strategies for controlling confounders in
experiments. In randomized controlled trials, treatment is randomly assigned to
subjects to ensure the treatment is independent of potential
confounders. However, there are rebuttals to this method, such as
improper randomization or new subjects being different from those in
the experiment. Nonrandom intervention, such as applying an electrical
shock to a subject, can also show a relationship if potential confounders
are eliminated. The causal relationship can be considered generalizable if the
same result is observed in multiple experiments. However, unmeasured
confounders can still exist even in a randomized experiment. A truly randomized
experiment would involve random selection of people from the population and
random selection of treatment executors, in addition to applying the treatment
at random, but this design is rarely used.
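A minimal sketch of the core mechanics of a randomized comparison, using simulated outcomes: assign treatment at random so it is independent of potential confounders, then compare the two groups. The rebuttals above (improper randomization, unrepresentative subjects) attack the setup, not this arithmetic.

```python
# Sketch: random assignment and a simple treatment/control comparison.
# Outcomes are simulated here purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500
treated = rng.permutation(np.repeat([True, False], n // 2))    # random assignment
outcome = rng.normal(loc=10, scale=3, size=n) + 0.8 * treated  # simulated effect

diff = outcome[treated].mean() - outcome[~treated].mean()
t_stat, p_val = stats.ttest_ind(outcome[treated], outcome[~treated])
print(f"estimated effect: {diff:.2f}, p-value: {p_val:.3f}")
```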
Observational Designs
The text explains that observational designs are used when it is impractical,
costly, or unethical to perform interventional experiments. These designs
are necessary for most socially interesting causal estimation problems,
and they are frequently used in business when it is not feasible to take
down a part of the business to see how it affects profit. Instead, natural
differences in exposure or temporary outages are used to determine the
causal effect of critical components. In observational designs, clever
strategies are employed to infer generalizability without the ability to
intervene.
Natural Experiments
The text describes natural experiments as a clever strategy for observational
designs to infer causality. For example, when investigating the effect of
being convicted of a crime in high school on earnings from age 25 to 40,
we can compare the earnings of those who were arrested but not
convicted of a crime with those who were convicted. These groups are
likely to be similar on other variables that serve as a natural control.
Within-subject interrupted time series designs can also be used when
natural between-subject controls are not available. By eliminating other
contenders and looking at the same subject, we can make a strong
argument for causality. However, our conclusion will necessarily be less
definite than if we were able to randomly assign treatments to individual
cases. We need to gather as much information as possible to find highly
similar situations, some of which have experienced a treatment and
some of which have not, to try to make a statement about the effect of
the treatment on the outcome. Detailed data collection, such as demographic
information, behavioral studies, or surveys, can also help tease out confounding
factors and provide stronger tools for reasoning about the world.
Statistical Methods
The text explains that when all else fails, we can turn to statistical methods to
establish causal relationships. Statistical methods can be based on
causal graphs or matching. Causal graphs assume a plausible series of
relationships to identify what kinds of relationships we should not see.
For example, in a randomized clinical trial, we should not see a correlation
between patient age and their treatment group. Matching can be
deterministic or probabilistic. Deterministic matching tries to find similar units
across some number of variables, while probabilistic matching, such
as propensity score matching, builds a model to account for the probability of
being treated. Both methods aim to create artificial natural experiments
to reduce confounding factors. While there are good theoretical reasons
to prefer propensity score matching, it can be worthwhile to skip the
added difficulty of fitting the intermediate model when dealing with a
small number of variables.
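As a hedged sketch of the matching idea described above, assuming a hypothetical observational dataset with a treated flag, an outcome, and a few covariates: fit a model for the probability of treatment, pair each treated unit with the nearest-scoring control, and compare outcomes. This is an illustration only; a real analysis would add calipers, balance checks, and uncertainty estimates.

```python
# Sketch: propensity score matching on an observational dataset.
# File and column names are assumed for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("observational.csv")
covariates = ["age", "income", "prior_usage"]            # assumed covariates

# 1. Model the probability of being treated from the covariates.
ps = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
df["pscore"] = ps.predict_proba(df[covariates])[:, 1]

# 2. Match each treated unit to the control with the nearest score,
#    creating an artificial "natural experiment."
treated = df[df["treated"] == 1]
control = df[df["treated"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched_controls = control.iloc[idx.ravel()]

# 3. Compare outcomes across the matched groups.
effect = treated["outcome"].mean() - matched_controls["outcome"].mean()
print(f"matched estimate of the treatment effect: {effect:.3f}")
```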