Thinking With Data - Summary

This document discusses the importance of scoping a data project before beginning work. It emphasizes that understanding why the project is being done, identifying clear needs, and envisioning the desired outcome are crucial first steps. The author introduces the concept of a "CoNVO" to scope a project, which stands for Context, Need, Vision, and Outcome. Examples are provided for each element of a well-scoped CoNVO, and it is argued that taking the time to thoroughly scope a project will lead to more meaningful insights and results.

Thinking With Data

Max Shron

Preface
The preface of the book discusses the importance of working with data to
produce knowledge that can be used by both humans and machines. The
author, a data strategy consultant, emphasizes the need to ask the right
questions to create meaningful insights. The problem of turning
observations into knowledge is not new and has been tackled by experts in
various fields. The purpose of the book is to help readers understand that
they are not alone in this pursuit and can learn from past experiences. The
author also provides an incomplete list of things that can be improved with
data, including answering questions, telling stories, discovering patterns,
and automating processes.

Max Shron explains that working with data requires both hard skills (e.g.,
data cleaning, mathematical modeling) and soft skills (e.g., problem-solving,
organization, critical thinking) to make data useful. These soft skills are often
overlooked but are essential for turning observations into knowledge. The
author suggests that data scientists can learn from other fields such as
design, critical thinking, and program evaluation to improve their skills. The
author emphasizes that a focus on why rather than how is crucial for
successful data work and that the best data professionals already
understand this.

Scoping: Why Before How


The author argues that starting with a dataset and applying techniques to
it, without first thinking carefully and imposing some structure, is a short
road to simple questions and unsurprising results. As professionals working
with data, our domain of expertise has to be the full problem, not merely
the columns to combine, transformations to apply, and models to fit.
Picking the right techniques has to be secondary to asking the right
questions. To create things of lasting value, we have to understand
elements as diverse as the needs of the people we're working with, the
shape that the work will take, the structure of the arguments we make, and
the process of what happens after we "finish." The author suggests that
having structure and adapting ideas from other disciplines can make
professional data work as valuable as possible.

The author suggests that creating a scope for a data problem is the first
step in finding structure. A scope is an outline of a story that explains why
we are working on a problem and how we expect the story to end. A well-
scoped problem includes four parts: context, need, vision, and outcome,
which can be remembered using the mnemonic CoNVO. The author
recommends that we write down and refine our CoNVO to clarify our
thoughts and communicate with others. Having a clear scope is crucial to
getting a data project started, and the author will explore other aspects of
the process in subsequent chapters.
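Since the author recommends writing the CoNVO down and refining it, one hypothetical way to make that habit concrete is to keep the scope as a small record. The sketch below is my own illustration, not from the book; the field names simply follow the four CoNVO parts, and the example text paraphrases the media-organization case discussed later in the summary.

```python
from dataclasses import dataclass


@dataclass
class Scope:
    """A CoNVO scope: Context, Need, Vision, Outcome."""
    context: str  # who the work is for and their long-term goals
    need: str     # what the organization needs to understand
    vision: str   # a mockup or argument sketch of the finished work
    outcome: str  # what happens after the work is delivered

    def summary(self) -> str:
        """Render the scope as a short reviewable write-up."""
        parts = [
            ("Context", self.context),
            ("Need", self.need),
            ("Vision", self.vision),
            ("Outcome", self.outcome),
        ]
        return "\n".join(f"{label}: {text}" for label, text in parts)


# Paraphrasing the media-organization example used later in the text
scope = Scope(
    context="A media organization wants to grow revenue, growth, and subscriptions.",
    need="Decide which user engagement metric best predicts valuable outcomes.",
    vision="A report comparing candidate metrics, with models linking each to revenue.",
    outcome="The chosen metric is adopted in performance measures across the organization.",
)
print(scope.summary())
```

Writing the scope down this way makes it easy to spot an empty or vague part before work begins, which is the point the author makes about refining the CoNVO.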

Context (Co):
The author explains that every data project has a context, which is the
defining frame apart from the particular problems being solved. The
context includes understanding who the project is for, their long-term
goals, and what work is being furthered. The context sets the overall tone
for the project and guides the choices made about what to pursue. The
context can emerge from talking to people and understanding their
mission. It is important to clearly articulate the long-term goals of the
people being helped, even when embedded within an organization. The
context can include relevant details like deadlines that help prioritize the
work. Finally, the author notes that while curiosity and hunger for
understanding can be a context, treating every situation only as a chance
to satisfy our own interests can pass up opportunities to provide value to
others.

Needs (N):
The author explains that correctly identifying needs in a data project is
tough but crucial. The needs should be presented in terms that are
meaningful to the organization, and they are fundamentally about
understanding some part of how the world works. The author
compares data science needs to needs from other related disciplines
like software engineering, applied machine learning, and risk management. The
author provides some examples of common needs in data projects, such as
identifying the most profitable location for a business, understanding why
customers leave a website, deciding between two competing vendors, and
determining the effectiveness of an email campaign. Finally, the author
gives a famous example of a data project that involves identifying pregnant
women from their shopping habits.

The process of discovering needs requires asking questions, listening, and
brainstorming until needs can be clearly articulated. Writing down
thoughts can help identify flaws in reasoning. The process is similar to that
of a designer, where listening to people, condensing information, and
bringing ideas back to people is important. Sometimes the information
needed to refine a need is a detailed understanding of a process. Needs
should be related to a specific action or a larger strategic question. Needs
should be specified in words that are important to the organization. Needs
are problems that can be solved with knowledge, not a lack of a particular
tool. Tools are used to accomplish things, and a particular tool may not be
the best solution for a need.

Vision (V)
The importance of having a clear vision before starting data work is
emphasized in this passage. A vision provides a glimpse of what it will
look like to meet the need with data and could consist of a mockup or
argument sketch. Mockups are low-detail idealizations of the final result
and can take the form of a few sentences or a user interface sketch,
while argument sketches provide a loose outline of the statements that
will make the work relevant and correct. Having a good mental library of
examples and experimenting with new ideas helps to refine the vision.
Creating things in quantity is more important than creating things of
quality when starting out, as it helps to learn from mistakes quickly.
Mockups and argument sketches serve different purposes in the
planning and design of a project.
A mockup is a low-detail idealization or representation of what the final
result of a project might look like. It is a simplified version of the end
goal, usually in the form of a graph, report, or user interface for a tool.
The purpose of a mockup is to provide an example of the kind of result
we would expect, or an illustration of the form that results might take.
On the other hand, an argument sketch is a loose outline of the
statements that will make the project relevant and correct. It outlines the
logic behind the solution and emphasizes why the project is important.
The purpose of an argument sketch is to help orient ourselves and
communicate the rationale behind the project to others.
So, while a mockup provides a visual representation of what the end
result might look like, an argument sketch focuses on the reasoning
behind the project and why it matters. Both are important tools in the
planning and design of a project and can be used in conjunction with
each other to create a comprehensive plan.
Example 1

• Vision: The media organization trying to define user engagement will get a
report outlining why a particular user engagement metric is the ideal one,
with supporting examples; models that connect that metric to revenue,
growth, and subscriptions; and a comparison against other metrics.
• Mockup: Users who score highly on engagement metric A are more likely
to be readers at one, three, and six months than users who score highly on
engagement metrics B or C. Engagement metric A is also more correlated
with lifetime value than the other metrics.
• Argument sketch: The media organization should use this particular
engagement metric going forward because it is predictive of other valuable
outcomes.
This text is about a media organization trying to figure out which metric to use
to measure user engagement. They will receive a report that explains why a
specific metric is the best one to use, with examples and models that show
how it connects to revenue, growth, and subscriptions.
There is also a mockup that shows users who score highly on engagement
metric A are more likely to be readers at one, three, and six months than users
who score highly on engagement metrics B or C. Engagement metric A is also
more correlated with lifetime value than the other metrics.
Based on this information, the media organization should use engagement
metric A going forward because it is more predictive of valuable outcomes.

Example 2
• Vision: The marketing department will get a spreadsheet that can be dropped into the existing
workflow. They will fill in some characteristics of a city, and the spreadsheet will indicate what the
estimated value would be.

• Mockup: By inputting gender and age skew and performance results for 20 cities, an estimated return
on investment is placed next to each potential new market. Austin, Texas, is a good place to target based
on age and gender skew, performance in similar cities, and its total market size.

• Argument sketch: The department should focus on city X, because it is most likely to bring in high
value. The definition of high value that we’re planning to use is substantiated for the following reasons….

This text is about a marketing department that wants to identify which city
they should focus on for their marketing campaign. They will receive a
spreadsheet that can be easily integrated into their existing workflow, which
will provide an estimated value for each city based on certain characteristics.
The mockup shows that by inputting data on gender and age skew, as well as
performance results for 20 different cities, an estimated return on investment
is displayed next to each potential new market. Based on this data, Austin,
Texas is identified as a good place to target due to its age and gender skew,
performance in similar cities, and total market size.
The argument sketch suggests that the marketing department should focus on
a specific city (referred to as "city X") because it is most likely to bring in high
value. The definition of "high value" is supported by certain reasons, which are
not provided in this text.
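The spreadsheet logic in Example 2 can be sketched as a toy model: fit a simple relationship between a city characteristic and observed performance in known markets, then place an estimated return next to a candidate market. Everything below is invented for illustration (the cities, the figures, and the choice of a single predictor); the book's actual example uses gender and age skew plus performance in 20 cities.

```python
from statistics import mean

# Known markets: share of 18-34 audience vs. observed return on investment.
# All figures are invented for illustration.
known = {
    "Chicago":  (0.35, 1.8),
    "Denver":   (0.42, 2.3),
    "Portland": (0.47, 2.6),
    "Phoenix":  (0.30, 1.5),
}

xs = [x for x, _ in known.values()]
ys = [y for _, y in known.values()]

# Least-squares fit of ROI = intercept + slope * age_skew
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar


def estimated_roi(age_skew: float) -> float:
    """Place an estimated ROI next to a potential new market."""
    return intercept + slope * age_skew


# In this toy data a younger-skewing market gets a higher estimate
print(f"Candidate market estimated ROI: {estimated_roi(0.45):.2f}")
```

A real version would live in the department's spreadsheet and use several characteristics at once, but the shape of the vision is the same: inputs describing a city in, an estimated value out.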
Outcome (O)

The outcome should be distinct from the vision, focusing on what
will happen when the work is completed. The examples given
include the metrics email for a nonprofit, training the marketing
team to use a model, and incorporating engagement metrics into
performance measures across an organization. Follow-up meetings and
studies are necessary to ensure success. The author also describes the
outcomes for the media mention finder project, which include
integrating it with the existing mention database, training staff to
use the dashboard, informing the IT person of the tool's existence
so they can maintain it, and periodically updating the system. The
developers in charge of integration will oversee bug fixes and
updates, and a follow-up will be conducted three months after
delivery to assess the system's effectiveness. The passage also
emphasizes the importance of considering who will handle the
project next, how to keep the work relevant, what changes the
work will bring about for the partners, and how to verify that
those changes have occurred. This approach can improve the
chances of producing work that is useful and used.
Seeing the Big Picture
This passage discusses the importance of understanding the big
picture when working on a data project. It shows an example of a
poorly structured problem statement compared to a well-thought-
out scope. The well-thought-out scope considers the media
organization's goals and challenges and outlines the intended
outcome of the project, which is to identify the ideal user engagement
metric and connect it to revenue, growth, and subscriptions. The
passage emphasizes that focusing solely on the technical aspects of
the project without considering the context, need, vision, and
outcome can lead to wasted time and energy.

--------------------------------------------------------------------------------------
Chapter 2: What Next?

This passage discusses the importance of taking time to clarify the
details of a data project before diving into the data work. It highlights
the four areas of a project scope: context, needs, vision, and outcome.
While there is a natural tension between exploring the data quickly and
spending more time planning upfront, taking the time to think deeply
beforehand maximizes the chances of doing something useful. The
passage describes the process of clarifying the details of the problem,
which includes discussions with decision makers and implementers, defining
key terms, considering arguments, posing open questions, and deciding
on the order to pursue ideas. The passage emphasizes that larger
projects have messier beginnings, and careful note-taking is crucial
throughout the project to reproduce the work and catch errors.
Refining the Vision
This passage emphasizes the importance of refining the vision for a data
project by improving intuition about the problem. This is achieved by
asking pointed questions and generating new ways to frame the
problem using techniques like kitchen sink interrogation. The passage
provides examples of questions that can be asked to illustrate the
breadth of the problem and the core work that needs to be undertaken.
The focus is on generating questions rather than finding answers. The
passage also highlights that how we frame the problem is more
important than the tools and techniques used to answer it. Overall, the
passage emphasizes the importance of spending time upfront to
maximize understanding and save time down the line.

The passage then describes techniques for identifying alternative metrics
and validating them. One approach is to collect questions to understand what is known and
unknown. Another is to work backward from mockups or argument
sketches to identify necessary steps and potential issues. These techniques
remain useful even after a clear vision is established. Roleplaying and
considering the consumer experience can also help catch early
problems. Continually referring back to the early process can ensure that
results align with the envisioned scenario.

Techniques for refining the vision


-Interviews

Talk to experts in the subject matter, especially people who work on a task
all the time and have built up strong intuition. Their intuition may or may
not match the data, but having their perspective is invaluable for building
your own intuition.
-Rapid investigation

Get order-of-magnitude estimates, related quantities, easy graphs, and so
on, to build intuition for the topic.
-Kitchen sink interrogation

Ask every question that comes to mind relating to a need or a data
collection process. Just the act of asking questions will open up new ideas.
Before it was polluted as a concept, this was the original meaning of the
term brainstorming.
-Working backward

Start from the finished idea and figure out what is needed immediately
prior in order to achieve the outcome. Then see what is prior to that, and
prior to that, and so on, until you arrive at data or knowledge you already
have.
-More mockups

Drawing further and more well-defined idealizations of the outcome not
only helps to figure out what the actual needs are, but also reveals more
about what the final result might look like.
-Roleplaying

Pretend you are the final consumer or user of a project, and think out loud
about the process of interacting with the finished work.
Deep Dive: Real Estate and Public Transit
The text provides an example of a New York residential real estate company that
wants to use public transit data to improve its understanding of rental
prices. After some discussion, the company's needs are clarified: to
confirm the relationship between rental prices and public transit access,
to identify mispriced apartments, and to predict real estate prices. The
goal is to improve the company's profitability, and while public transit
data is the focus, other data may also be important. The text emphasizes
the importance of subject matter knowledge and intuition in understanding
the problem and envisioning the end result, such as a graph of price
against nearness to transit. Sketching a mockup of the graph can be a
helpful exercise. The text explores different ways to capture the
relationship between public transit access and apartment prices,
including graphs, statistical models, and maps. Each approach has its
strengths and weaknesses. A scatterplot is easy to make but potentially
misleading, while a statistical model can collapse variation in the data but
may miss interesting patterns. A map is limited in accounting for non-
spatial variables but is easier to inspect. The goal is to meet the
company's needs, including verifying the relationship between public
transit and rental prices, identifying mispriced apartments, and predicting real
estate prices. Different approaches can lend themselves to different
arguments and can be used to analyze different aspects of the data.

The final output of the project will likely be more complex than the initial
ideas presented. The process involves starting with strong assumptions
and gradually relaxing them until a feasible solution is found. Concrete
examples, such as plugging intersections into real estate websites or reading
classifieds, can spur the imagination and provide insights into the data.
Even if the details take up most of the time, they are only epicycles on
top of a larger point, and it is important not to forget the overall goal of
the project. Immersing oneself in particular examples is useful, even if it
only means reading the first few lines of a table before building a model.
Deep Dive Continued: Working Forward
"Deep Dive Continued: Working Forward" discusses the next steps in the
process of working on a data project after imagining the end goal. It
begins by suggesting that it is important to think about what kind of
data is needed to define the variables, and uses the example of
apartment prices to illustrate the process. The article then moves on to
the concept of transit access and how it can be defined and measured.
The article notes that refining the vision of the project is important, and
that it may require acquiring additional information to distinguish the
effect of proximity to transit from other factors. It also highlights the
importance of recognizing the causal nature of the question being
asked and the need for a causal argument pattern.
The article concludes by discussing the limitations that may be faced in
the project, the importance of thinking about the outcome and how the
work will be used, and the need to consider maintenance and cross-
checking of results. Overall, the article emphasizes the importance of
careful planning and consideration of the many factors involved in a
data project.
Deep Dive Continued: Scaffolding
"Deep Dive Continued: Scaffolding," the author discusses the
importance of scaffolding in a data project, which involves breaking
down the project into smaller, manageable tasks to avoid wasting time
and to stay focused on the larger picture.
The article suggests that, especially at the beginning of a project, it is
important to focus on building intuition by using simple tabulations,
visualizations, and reorganized raw data. It also suggests using models
that can be easily fit and interpreted, or models that have
great predictive performance without much work, as a starting point for
a predictive task.
To avoid getting too deep into exploratory steps and losing sight of the
larger picture, the article recommends setting time limits for exploratory
projects and writing down a scaffolding plan to lay out the next few
goals and expected shifts.
The article also emphasizes the importance of understanding the
argument being made and using that to prioritize analysis. Building
exploratory scatterplots should precede the building of a model, and
building simple models before tackling more complex ones will save
time and energy.
Overall, the article highlights the importance of prioritizing aims and
proceeding in a way that is as instructive as possible at every step.

Verifying Understanding
"Verifying Understanding" emphasizes the importance of
building explicit checks with partners or decision makers in
any scaffolding plan to ensure that their needs are properly
understood. Overcommunication is better than assuming that both
parties are on the same page.
The article suggests finding a convenient medium to explain the
partners' needs back to them and asking if they have understood things
properly. The goal is to achieve conceptual agreement, not a detailed
critique of the project, unless the partners are data-savvy and
particularly interested.
It is also important to talk to those who will be needed to implement the
final work and make sure that their requirements are being represented.
The article emphasizes the value of partners' intuitive understandings of
the processes they are looking to understand better and suggests
spending time talking to people who deal with a process to get the
intuition needed to build a data-based argument that can create real
knowledge.
The article concludes by stating that the mark of having grasped the
problem well is being able to explain the strategy to partners in terms
that matter to them and receiving enthusiastic understanding in return.
Regular check-ins with partners and decision makers are important to
ensure that the project is on track and meeting their needs.
Getting Our Hands Dirty
In the article, the author emphasizes the importance of spending time
on data gathering and transformation to ensure that the evidence
gathered can serve as a valid argument. The article suggests starting
with easy transformations, such as exploratory graphs or summary
statistics, before moving on to more sophisticated or complicated
models.
Once the data is transformed and ready to serve as evidence, the article
suggests evaluating the strength of the arguments by updating the
initial paragraphs with new information. The author emphasizes the
need to consider alternative interpretations of the data and to ensure
that the argument makes sense given the data that has been collected.
The article provides an example of how the evaluation process can lead
to further considerations, such as questioning the representativeness of
the sample of apartments and exploring options for addressing
potential biases. The author emphasizes that strong arguments will
rarely consist of a single piece of evidence and that multiple strands of
evidence may be necessary to support an argument.
Overall, the article highlights the importance of careful planning and
evaluation at each step of the data analysis process to ensure that the
evidence gathered can serve as a valid argument.
The passage discusses the importance of presenting data in a way that is
both useful and pleasing to the audience. This involves considering the
audience's needs and preferences, such as whether to present a single
graph or multiple models and summaries, or whether to present claims
in a chain or in parallel. The presentation format is also important, such as
deciding which elements of a tool should be optional or fixed, or which
examples to use in a narrative. The author emphasizes that presentation
matters and that genre conventions and tone should be considered. The
process of working with data is not always linear, and involves an
interplay of clarification and action to arrive at a better outcome.
Chapter 3: Arguments
The passage discusses the importance of arguments in making
knowledge out of observations. While observations alone are not
enough to act on, connecting them to how the world works through
arguments allows us to create different kinds of knowledge, such
as accurate mental models or algorithmic understanding. Understanding how
arguments work is important in working with data, as it helps us better
convey complicated ideas, build projects in stages, find inspiration from
well-known patterns of argument, substitute techniques, make our
results more coherent, present our findings, and convince ourselves and
others that our tools work as expected. Without a good understanding
of arguments, our work with data may be small and disconnected.

Thinking explicitly about arguments is a powerful technique with a long
history in various fields, such as philosophy, law, the humanities, and
academic debate. This technique helps us structure our thinking with
data and can be applied at any point in problem-solving, from gathering
ideas at the beginning to ensuring coherence before releasing a final
product. By thinking about the arguments we are making, we can better
understand the connections between our observations and the
processes that shape the world, leading to more effective problem-
solving and knowledge creation.

An argument is a set of reasons or evidence presented to support a
claim or proposition. It is a way of persuading someone to accept a
certain point of view or to take a particular action. Arguments can be
made in many forms, including written or spoken discourse, visual media, or
even nonverbal communication. The goal of an argument is to convince
the audience of the validity of a claim or position, and to persuade them
to take a certain course of action or adopt a certain belief.
Audience and Prior Beliefs
The passage discusses the importance of considering the audience's
prior beliefs and knowledge when crafting an argument. The ideal
audience is skeptical but friendly, questioning our observations without
being dismissive or swayed by emotional appeals. It is important to make an
argument that moves from statements the audience already believes to
new statements they may not have believed before. While simplifying or
expanding on certain points for a particular audience is acceptable, lying
or preying on gullibility or ignorance is not. The background knowledge
of the audience should also be taken into account, as certain statements
may be taken for granted while others require an argument. By
considering the audience's prior beliefs and knowledge, we can make a
more effective and convincing argument that establishes knowledge in a
defensible way.

The passage highlights that most of the prior beliefs and facts that go
into an argument are not explicitly spelled out. It is important to
consider the audience's existing knowledge and understanding, as this
defines what is reasonable in a given situation. Common sources of prior
or background knowledge include scientific laws, the validity of data sources,
algebraic or mathematical theory, commonly known facts, and "wisdom" or
commonly accepted beliefs. While it may be necessary to provide
additional information or clarification for certain audiences, most details
of techniques can be considered background knowledge. By taking into
account the audience's existing knowledge and understanding, we can
make a more effective argument that is reasonable and convincing.

Building an Argument
The text discusses the use of visual schematics in setting up an argument,
but the author suggests that sentences and sentence fragments are more
flexible and practical. The author notes that prior knowledge and facts
often enter into an argument without being acknowledged. The text
provides examples of how ideas are expressed in written language and tags
concepts such as "claims" to make them more transparent. The author
cautions that the example given is fictional and not to be cited as truth
without proper research.

Claims
In the context of building an argument, a claim is a statement that can
be reasonably doubted but that the presenter believes they can make a
case for. All arguments contain one or more claims, with one main claim
supported by one or more subordinate claims. The claim should be
framed in terms that decision makers care about and not in technical jargon.
The details justifying the claim may involve different techniques, but
they all serve as support rather than the big idea. In presenting an
argument, it is important to use strong arguments and techniques that
can make a case for a causal relationship, rather than relying solely on
statistical tools.
Evidence, Justification, and Rebuttals
This excerpt emphasizes the importance of evidence in constructing an
argument and highlights how raw data needs to be transformed into a
more compact and intelligible form to be incorporated into an
argument. The transformation of data puts an interpretation on it by
highlighting essential things. The excerpt also provides examples of
different types of transformations, such as counting sales or making a
map of taxi pickups in a city.
In the context of a transit example, the excerpt suggests that a high-
resolution map of apartment prices overlaid on a transit map or a two-
dimensional histogram or scatterplot can be reasonable evidence to
show the relationship between transit access and apartment prices.
However, for more robust claims such as forecasting the relationship
into the future, more sophisticated models may be necessary. Overall,
the excerpt highlights the importance of transforming data into
evidence to support an argument and using appropriate types of
evidence for different claims.
Evidence and Transformations
This excerpt highlights the importance of evidence and transformations
in constructing an argument. It provides examples of different types of
evidence, such as a graph of historical price data or a map of home
prices overlaid on a transit map, and how they can demonstrate a
relationship between variables. The excerpt also emphasizes the need
for a justification to connect the evidence to the claim being made. The
justification provides a logical connection between the evidence and the
claim, and can be supported by different methods such as visual
inspection, cross-validation, or single-variable validation. The excerpt
also acknowledges the potential rebuttals to an argument and how they
need to be addressed to make a watertight argument. Rebuttals may
include issues such as improper randomization, small sample size, or
outdated data. Finally, the excerpt introduces the concept of the degree
of qualification of an argument, which provides some degree of
certainty in the conclusions, ranging from possible to definite.
Overall, the excerpt highlights the importance of using appropriate
evidence, transformations, and justifications, and being aware of
potential rebuttals to construct a convincing argument.

Evidence:
In building an argument, evidence is the information or data that
supports the claim being made. Evidence can come in various forms,
such as statistics, research findings, expert opinions, facts, anecdotes,
and examples. It is important to ensure that the evidence is reliable,
relevant, and sufficient to support the claim being made. Raw data
needs to be transformed into something that is more easily
understandable and incorporated into the argument. Transformations
can include a graph, model, sentence, or map that highlights essential
information.

Justification:
The justification is the reason that connects the evidence to the claim
being made. It provides a logical connection between the evidence and
the claim. Justifications can include visual inspection, cross-validation, or
other methods that support the claim being made. It is important to
ensure that the justification is sound and convincing to the audience.

Rebuttals:
Rebuttals are the reasons why a justification may not hold in a particular
case. They are the yes-but-what-if questions that naturally arise in any
argument. It is important to be aware of these rebuttals and address
them appropriately to make a convincing argument. Rebuttals may
include issues such as outdated data, incorrect error functions, or small
sample sizes.

Degree of qualification:
The degree of qualification of an argument provides some degree of
certainty in the conclusions, ranging from possible to definite. Deductive
logic provides definite certainty in its conclusions, but its use in practice
is limited. In most cases, arguments provide some degree of probability
or likelihood, and it is important to have a sense of how strong the
conclusion is to avoid making false claims.
Overall, building an argument involves presenting a claim, providing
evidence to support the claim, justifying the evidence, addressing
rebuttals, and providing a degree of qualification for the argument. It is
important to ensure that the argument is sound, convincing, and well-
supported to persuade the audience to accept the claim being made.
Adding Justifications, Qualifications, and Rebuttals
This text discusses a correlation between travel time to the rest of the city
and apartment prices. A 5% reduction in travel time results in a 10%
increase in apartment prices, and vice versa. This relationship has been
observed in historical apartment prices and transit access data. A model
trained on 20 years of raw price data shows a high probability of this
relationship. Big changes in transit access have a visible effect on
apartment prices within a year, indicating a strong probability of a
relationship. Closing a train stop results in a 10% decline in apartment
prices within two blocks over the next year in at least 70% of cases.
However, there are three possible confounding factors. First, there may
be nicer apartments being built closer to train lines, but new
construction or renovation takes place over longer time scales than price
changes. Second, there may be large price swings even in the absence
of changing transit access, but the average price change between
successive years for all apartments is only 5%. Third, it is possible
that transit improvements or reductions and changes in apartment price
are both caused by some external change, like a decline in
neighborhood quality. However, the city in question has added transit
options spottily and rent often goes up regardless of these transit
changes.
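The claimed relationship — a 5% drop in travel time raising prices by roughly 10% — corresponds to an elasticity of about -2, which can be recovered from a log-log regression. A hedged sketch, on synthetic data generated to match that claim rather than any real price history:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data built to embody an elasticity of -2: doubling
# travel time divides price by four, plus multiplicative noise.
travel_time = rng.uniform(10, 60, size=1000)
price = 5000 * travel_time ** -2.0 * np.exp(rng.normal(0, 0.05, size=1000))

# A log-log fit recovers the elasticity as the slope.
slope, intercept = np.polyfit(np.log(travel_time), np.log(price), 1)
print(f"estimated elasticity: {slope:.2f}")
```

On real data, the fit would be one piece of evidence; the confounding checks described above would still be needed before reading it causally.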

Deep Dive: Improving College Graduation Rates


Suppose a university wants to start a pilot program to offer special
assistance to incoming college students who are at risk of failing out of
school (context). The university needs a way to identify at-risk students
and find effective interventions (need). The solution is to create
a predictive model of failure rates and design an experiment to test several
interventions (vision). The goal is to present findings in terms that
decision-makers understand (outcome), such as well-calibrated failure
probabilities and expected lift in graduation rates. To make the
experiment easy to run, the focus is on students who fail out after the
first year. It is important to interview decision-makers to determine a
reasonable range of expense for raising the graduation rate by 1%, assess
the historical graduation rates, and estimate the costs of interventions.
Finally, not all students who drop out could have been helped enough to
stay enrolled, and some admitted students were false positives to begin with.

To make the project useful, we need to define "failout" and differentiate
it from dropout, which might not be caused by poor academic
performance. We can define failout as a student who drops out either
due to failing grades or on their own accord but was in the bottom
quartile of grades for freshmen across the university. This definition is
justifiable because students who drop out on their own and had poor
grades are likely dropping out in part because they performed poorly.
Next, we focus on modeling and predicting the probability that
someone will fail out in their first year. We identify potentially predictive
facts about each student, collect data, and clean it up. If we get a good
classifier, we can identify students with a high risk of failing out. A good
classifier will be as spread out as possible while having a high
correspondence with reality.
Our argument will be that we have the ability to forecast which students
will fail out accurately enough to try to intervene. Our definition of
failout is consistent with what the administration is interested in. We can
group students we didn't fit the model on and see that the predictions
closely match the reality. The error in our predictions is only a few
percent on average, accurate enough to be acceptable. Students who
dropped out on their own but had poor grades have essentially failed
out, which is consistent with how the term is used in other university
contexts.
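The calibration check described here — grouping held-out students and comparing predictions to reality — can be sketched as follows. The data is simulated (the predicted probabilities and outcomes are invented so that the model is well calibrated), purely to show the shape of the check:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical held-out students: each has a predicted failout
# probability; outcomes are drawn so the classifier is well calibrated,
# which is what the grouping below should confirm.
predicted = rng.uniform(0.05, 0.60, size=5000)
failed = rng.random(5000) < predicted

# Group students by predicted risk and compare the mean prediction in
# each group with the observed failout rate.
bin_ids = np.digitize(predicted, [0.2, 0.4])
for b in range(3):
    mask = bin_ids == b
    print(f"risk bin {b}: predicted {predicted[mask].mean():.2f}, "
          f"observed {failed[mask].mean():.2f}")
```

A model whose per-bin predictions track the observed rates to within a few percent would meet the standard of accuracy the argument requires.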
Let's design an experiment to test the effectiveness of interventions for
at-risk students. We can split students into groups by their risk level and
intervention type. If money is tight, we can split students into high- and
low-risk groups and choose students at random. Using our risk model,
we can set appropriate prior probabilities for each category. We can
estimate the range of effectiveness and cost for each intervention and
derive reasonable categories for high and low risk.
Our argument is that the experiment is worth performing because the
high and low ranges of costs are low compared to a reasonable range of
benefits, and their cost is fairly low overall because we can target our
interventions accurately and design the experiment within the budget.
We should break up students into high- and low-risk groups and choose
students from the population at random to minimize confounding
factors.
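The assignment scheme above — stratify by risk, then randomize interventions within each stratum — can be sketched in a few lines. The risk scores, group cutoff, and intervention names here are all invented for illustration:

```python
import random

random.seed(0)

# Hypothetical experiment design: split students into high- and
# low-risk groups using a made-up risk score, then assign an
# intervention at random within each group to limit confounding.
students = [{"id": i, "risk": random.random()} for i in range(200)]
interventions = ["tutoring", "mentoring", "control"]

assignments = {}
for student in students:
    group = "high" if student["risk"] > 0.5 else "low"
    assignments[student["id"]] = (group, random.choice(interventions))

high = sum(1 for g, _ in assignments.values() if g == "high")
print(f"{high} high-risk students, {200 - high} low-risk students")
```

In practice the risk score would come from the fitted model, and the cutoff would be derived from the cost and effectiveness estimates discussed above.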
These arguments would have actual data and graphs to provide
evidence, and it's important to arrive at them iteratively. Early versions of
arguments may have logical errors, misrepresentations, or unclear
passages. It's best to try to inhabit the mind of a friendly but skeptical
audience and find holes in the argument. Finding a real-life friendly
skeptic to explain the argument to is even better.

Note that
In the context of reasoning and communication, an argument is a set of
statements or premises presented in support of a conclusion or claim. The
goal of an argument is to persuade an audience or interlocutor of the truth or
validity of the conclusion or claim being made. Arguments can take many
forms, including deductive arguments, inductive arguments, and abductive arguments, and
they can be evaluated based on their logical validity, soundness, and
persuasiveness. In the context of data science, arguments may involve presenting
evidence, statistical analysis, and models to support a particular conclusion or
decision.

Chapter Four : Patterns of Reasoning


When studying arguments, we can benefit from exploring patterns that
have been noticed and explored by others. These patterns provide a
map to lead us to well-worn trails that take us where we need to go.
However, we can't simply lift up patterns from other disciplines and
apply them precisely to data science, as there are significant differences
between fields. Instead, we can take insights from these patterns and
mold them to fit our needs.
There are three groups of patterns that we will explore. The first group is
categories of disputes, which provide a framework for understanding
how to make a coherent argument. The second group is general topics,
which give general strategies for making arguments. The last group is
special topics, which are the strategies for making arguments specific to
working with data.
Causal reasoning is a special topic that is so important that it is covered
separately in Chapter 5. By understanding these patterns of reasoning,
we can effectively structure our arguments and make persuasive cases in
the field of data science.
Categories of Disputes
Categories of disputes refer to a powerful way to organize our thoughts
in arguments. A point of dispute is the part of an argument where the
audience pushes back, and we need to make a case to win over the
skeptical audience. Disputes can be classified into four categories: fact,
definition, value, and policy. The classification system was created by
ancient rhetoricians and has been adapted by successive generations to
fit modern needs.
Fact disputes concern whether something is true or false. In data
science, this might involve disagreements over the accuracy or validity
of data, statistical methods, or models.
Definition disputes concern the meaning of a term or concept. In data
science, this might involve disagreements over how to define a
particular variable or metric.
Value disputes concern whether something is good or bad, right or
wrong. In data science, this might involve disagreements over the ethical
implications of using certain data or models.
Policy disputes concern what course of action should be taken. In data
science, this might involve disagreements over the best way to
implement a particular data-driven decision or intervention.
Once we identify the kind of dispute we are dealing with, we can
demonstrate the issues that need to be addressed more effectively.
Understanding the categories of disputes can help us structure our
arguments more persuasively in the field of data science.
FACT
The concept of disputes of fact arises when there are disagreements
about what is true or what has occurred. This can be a small component
of a larger argument and often involves outlining criteria that would
convince an audience to agree to an answer prior to any data being
collected. There are two stock issues for disputes of fact: What is a
reasonable truth condition, and is that truth condition satisfied? An example
of a famous dispute of fact is the claim that a debt-to-GDP ratio of over
90% results in negative real GDP growth. This claim was based on a paper by
Reinhart and Rogoff, but it was later found to be flawed by Herndon,
Ash, and Pollin, who made a counter-claim that there is no sharp drop
in GDP growth at a 90% debt-to-GDP ratio. Their truth condition was
a smoothed graph fit to the original data, without any bucketing, and they
carried out their examination in R, which provided better statistical
tools and easier replication and debugging.

DEFINITION
Disputes of definition occur when there is a particular way we want to
label something, and we expect that label to be contested. Definitions in
a data context are about trying to make precise relationships in an
imprecise world. There are two activities involving definitions that can
happen in an argument: making a case that a general, imprecise term fits a
particular example, or clarifying an existing idea into something that can
be counted with. A good definition should be useful, consistent, and the
best option among reasonable alternatives. Discussions about how well
a definition fits with prior ideas can take many forms, such as how well it
agrees with previously accepted definitions or how well it captures good
examples. A definition that an audience has no prior understanding of
will not be contested.

VALUE
This text discusses the concept of value in decision-making. To make
judgments, we need to select and defend criteria of goodness that apply
to a given situation. Different values, such as ease of interpretability,
precision, and completeness, may be important in different scenarios.
The text highlights two key issues in disputes of value: how our goals
determine which values are most important, and whether the value has
been properly applied in a given situation. The values that matter
depend on our goals, and we need to make a case for why a particular
model or metric is interpretable, concise, elegant, accurate, or robust.
Ultimately, our values are shaped by our goals, and we need to argue for
why a particular value is important in a given context.
POLICY
The text discusses disputes of policy, which arise when we want to
determine whether a particular course of action is the right one.
Examples of such disputes include whether to reach out to paying
members more often by email or whether a particular implementation
of an encryption standard is good enough to use. The four stock issues of
disputes of policy are: (1) is there a problem, (2) where is credit or blame
due, (3) will the proposal solve it, and (4) will it be better on balance?
These can be distilled into Ill, Blame, Cure, and Cost. To make a case for
a particular policy proposal, we need to show that there is a problem
worth fixing, pinpoint the source of the problem, demonstrate that our
proposed solution will work, and determine whether the benefits
outweigh the costs. Having a framework to think through disputes of
policy can be invaluable in making effective arguments for change.

General Topics
The text discusses general topics, which are patterns of argument
identified by Aristotle that are repeatedly applied across every field.
These classic varieties of arguments include specific-to-general,
comparison, comparing things by degree, comparing sizes, and
considering the possible as opposed to the impossible. While some of
these patterns may seem remote from data work, they occur constantly
in various fields. For example, specific-to-general reasoning and
reasoning by analogy may require more exposition to understand how
they apply to data work.

SPECIFIC-TO-GENERAL
The text explains specific-to-general reasoning, a pattern of argument
that involves reasoning from examples to make a point about a larger
pattern. In data-focused ideas, statistical models often use this
reasoning. This pattern also occurs in user experience testing, where a small
number of users are studied in great detail to draw conclusions about
the larger user base. Another example is an illustration, which uses one
or more examples to build intuition for a topic and provide grounding
for the audience. Concrete examples, sticky graphics, and explanations
of the prototypical can help ground arguments and improve the chance
of a strong takeaway, even when they are incidental to the body of the
argument.

GENERAL-TO-SPECIFIC
The text discusses general-to-specific arguments, which involve using
beliefs about general patterns to make inferences for specific examples. This
pattern is the opposite of specific-to-general reasoning and is often
used to make tentative conclusions based on general patterns. For example, if
a company is experiencing rapid revenue growth, it seems plausible to infer
that the company will find it easy to raise money. However, this
argument can be rebutted by pointing out that the specific example
may not have the properties of the general pattern and may be an
outlier. This pattern is often used in data work when applying a statistical
model to a new example after using specific-to-general reasoning to
claim that a sample can stand in for the whole.

ARGUMENT BY ANALOGY
The text discusses "Argument by Analogy," which can be either literal or
figurative. Literal analogies involve two similar types of things, while figurative
analogies involve two dissimilar things that are argued to be
alike. Mathematical models are an example of figurative analogies, where
behavior in one domain can help understand behavior in another
domain. However, the rebuttal for argument by analogy is that what may
hold for one thing may not hold for another, and there may be
unaccounted-for factors. Despite this, when the analogy
holds, mathematical modeling is a powerful way to reason.
Special Arguments
The text discusses the special argument strategies that are used in data
science and other mathematically inclined disciplines, such as engineering,
machine learning, and business intelligence. These argument strategies
can be mixed and matched in various ways. The text highlights three
specific special arguments: optimization, bounding cases, and cost/benefit
analysis. However, there are many additional argument strategies that
can be observed through careful analysis of case studies within data
science or related fields.

OPTIMIZATION
The text describes optimization as an argument strategy in data science, which
involves determining the best way to do something given certain
constraints. This can involve creating processes such as recommending
movies, assigning people to advertising buckets, or packing boxes based
on customer preferences. While optimization also occurs when fitting a
model through error minimization, this activity is less argumentative
compared to debating what to optimize. Deciding what to optimize
often involves a value judgment, which can be more interesting and
controversial than the optimization process itself.

BOUNDING CASES
The text explains the argument strategy of bounding cases, which involves
determining the highest or lowest reasonable values for something. There are
two major ways to make an argument about bounding cases: sensitivity
analysis and simulation. Sensitivity analysis involves
varying assumptions to determine the best or worst-case values, while
simulation involves making assumptions about the plausibility of values and
simulating likely results. The resulting bounds can be used to calculate
percentiles of revenue or uptime, depending on the relevant risk profile of the
audience. Bounding cases can provide rough estimates for important numbers,
which can be useful in pushing decisions forward or determining if a project is
feasible.
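The simulation approach can be sketched as a small Monte Carlo run. Every distribution and figure below is an invented assumption, stated up front, standing in for the plausibility judgments the text describes:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical bounding-case simulation: revenue depends on several
# uncertain assumptions.  Rather than a single best guess, draw
# plausible values for each and report percentiles as bounds.
customers = rng.normal(10_000, 1_000, size=100_000)   # assumed
conversion = rng.uniform(0.01, 0.05, size=100_000)    # assumed
price = rng.uniform(20, 40, size=100_000)             # assumed

revenue = customers * conversion * price
low, high = np.percentile(revenue, [5, 95])
print(f"5th percentile: ${low:,.0f}, 95th percentile: ${high:,.0f}")
```

The resulting 5th and 95th percentiles are the rough lower and upper bounds that can push a decision forward, with the percentiles chosen to match the audience's risk profile.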
COST/BENEFIT ANALYSIS
The text describes the argument strategy of cost/benefit analysis, which
involves putting each possible outcome from a decision in terms of a
common unit such as time, money, or lives saved. The goal is to
maximize the benefit or minimize the cost and compare proposed
policies against alternatives. Cost/benefit analyses can be combined with
bounding case analyses to strengthen the argument, such as when the
lowest plausible benefit from one action is greater than the highest
plausible benefit from another. However, the rebuttals to a cost/benefit
analysis include the possibility of miscalculated costs and benefits, the
limitations of using this method to make decisions, and the potential for
other costs or benefits to reorder the answers.

Chapter Five : Causality


The chapter discusses the importance of causal reasoning in data science, as
people will often interpret claims as having a cause-and-effect nature.
Causal reasoning provides more insight into a problem than just
correlational observations. There are no tools that can automatically
generate defensible causal knowledge; establishing causation requires
making a strong argument. Non-causal relationships are still useful, but ideally,
knowledge allows us to intervene and improve the world. Causal
arguments can range from those that barely hint at causality to those that
establish undeniable causality. Techniques like randomized controlled trials are
at the far end of the causal spectrum. Despite the challenges of causal
analysis, people want analysis with causal power, and it is necessary to
come to terms with this fact. Causal analysis is a deep topic, and there
are multiple canons of thought, each concerned with slightly different
scenarios. Data science, as a field primarily concerned with generalizing
knowledge in highly specific domains, is able to sidestep many of the
issues that snarl causal analysis in more scientific domains.
Defining Causality
The text discusses different definitions of causality, with one simple
interpretation being the alternate universe perspective. This perspective
is suitable for many data problems, such as understanding how things
might have been if the world had been manipulated a bit. For instance,
we can use this perspective to understand the causal relationship
between a drug and a disease. We can imagine two universes, one
where the patient took the drug, and another where the patient did not
take the drug. By comparing the outcomes in these two universes, we
can understand the causal effect of the drug on the disease.
Causal explanations can also be about fixed physical laws, such as the
laws of physics that govern how a chemical reaction happens. For
example, we can use these laws to understand why a certain chemical
reaction takes place in a particular way.
The text uses the example of a tree on a busy street to illustrate how
causality works. If the tree is trimmed, it may be less likely to fall down.
In causal parlance, the tree has either been exposed or not exposed to a
treatment. The next year the tree may have fallen down or not, which is
known as the outcome. We can imagine two universes, one where the
tree was treated with a trimming, and another where it was not. By
comparing the outcomes in these two universes, we can understand the
causal effect of the trimming on the tree falling down.
However, a causal relationship is not the same as a statistical one. For
example, if we look at 20 trees, half of which were trimmed, and we find
that all the untrimmed trees fell down while none of the trimmed trees
fell down, we might conclude that trimming the tree prevents it from
falling down. But what if we were to add another variable, such as wind,
that mostly explains the difference? More of the trees that are exposed
to the wind were left untrimmed. We say that exposure to the wind
confounded our potential measured relationship between trimming and
falling down.
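The tree example can be made concrete with a small simulation. Here trimming is given no effect at all, yet the naive comparison makes it look protective, because wind drives both the treatment and the outcome (all the probabilities are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Hypothetical confounding simulation: wind exposure makes a tree both
# more likely to be left untrimmed and more likely to fall, so trimming
# looks protective even though, by construction, it does nothing.
windy = rng.random(n) < 0.5
trimmed = rng.random(n) < np.where(windy, 0.2, 0.8)
fell = rng.random(n) < np.where(windy, 0.3, 0.05)  # wind is the only cause

naive_effect = fell[~trimmed].mean() - fell[trimmed].mean()
within_windy = fell[windy & ~trimmed].mean() - fell[windy & trimmed].mean()
print(f"naive trimming 'effect': {naive_effect:.3f}")
print(f"effect among windy trees only: {within_windy:.3f}")
```

Conditioning on the confounder (comparing only within windy trees) makes the spurious effect vanish, which is exactly what accounting for confounders is meant to accomplish.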
The goal of a causal analysis is to find and account for as many
confounders as possible, observed and unobserved, to determine which
states always precede others. In an ideal world, we would know
everything we needed to in order to pin down which states always
preceded others. For instance, if we want to understand the causal
relationship between smoking and lung cancer, we might need to
account for variables such as age, gender, and exposure to air pollution
to get a more accurate estimate of the causal effect of smoking on lung
cancer.

Designs
The text discusses the purpose of causal analysis, which is to deal with the
problem of confounders, and the use of designs to structure an analysis
and reduce uncertainty in understanding cause and effect. A design is a
method for grouping units of analysis according to certain criteria. The
text also mentions multiple causation, where more than one thing has to be
true simultaneously for an outcome to come about, illustrated by the
example of a tree falling down due to both wind and lack of trimming.
The goal of investigating causation is to understand how a treatment
would change an outcome in a new scenario, and how well a causal
explanation generalizes to new situations. Examples include
understanding the causal effect of smoking on lung cancer and the causal
effect of a drug on a disease.
Intervention Designs
The text discusses different design strategies for controlling confounders in
experiments. In randomized controlled trials, treatment is randomly assigned to
subjects to ensure the treatment is independent of potential
confounders. However, there are rebuttals to this method, such as
improper randomization or new subjects being different from those in
the experiment. Nonrandom intervention, such as applying an electrical
shock to a subject, can also show a relationship if potential confounders
are eliminated. The causal relationship can be considered generalizable if the
same result is observed in multiple experiments. However, unmeasured
confounders can still exist even in a randomized experiment. A truly randomized
experiment would involve random selection of people from
the population and random selection of treatment executors in
addition to applying the treatment at random, but this design is rarely
used.

Observational Designs
The text explains that observational designs are used when it is impractical,
costly, or unethical to perform interventional experiments. These designs
are necessary for most socially interesting causal estimation problems,
and they are frequently used in business when it is not feasible to take
down a part of the business to see how it affects profit. Instead, natural
differences in exposure or temporary outages are used to determine the
causal effect of critical components. In observational designs, clever
strategies are employed to infer generalizability without the ability to
intervene.

Natural Experiments
The text describes natural experiments as a clever strategy for observational
designs to infer causality. For example, when investigating the effect of
being convicted of a crime in high school on earnings from age 25 to 40,
we can compare the earnings of those who were arrested but not
convicted of a crime with those who were convicted. These groups are
likely to be similar on other variables that serve as a natural control.
Within-subject interrupted time series designs can also be used when
natural between-subject controls are not available. By eliminating other
contenders and looking at the same subject, we can make a strong
argument for causality. However, our conclusion will necessarily be less
definite than if we were able to randomly assign treatments to individual
cases. We need to gather as much information as possible to find highly
similar situations, some of which have experienced a treatment and
some of which have not, to try to make a statement about the effect of
the treatment on the outcome. Detailed data collection, such as demographic
information, behavioral studies, or surveys, can also help tease out confounding
factors and provide stronger tools for reasoning about the world.
Statistical Methods
The text explains that when all else fails, we can turn to statistical methods to
establish causal relationships. Statistical methods can be based on
causal graphs or matching. Causal graphs assume a plausible series of
relationships to identify what kinds of relationships we should not see.
For example, in a randomized clinical trial, we should not see a correlation
between patient age and their treatment group. Matching can be
deterministic or probabilistic. Deterministic matching tries to find similar units
across some number of variables, while probabilistic matching, such
as propensity score matching, builds a model to account for the probability of
being treated. Both methods aim to create artificial natural experiments
to reduce confounding factors. While there are good theoretical reasons
to prefer propensity score matching, it can be worthwhile to skip the
added difficulty of fitting the intermediate model when dealing with a
small number of variables.
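Deterministic matching on a small number of variables can be sketched directly. In this invented example, each treated unit is paired with the untreated unit closest on a single covariate (age), building the artificial natural experiment the text describes:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical observational data: older units are more likely to be
# treated, so a raw treated-vs-untreated comparison would be
# confounded by age.
age = rng.uniform(20, 70, size=1000)
treated = rng.random(1000) < 0.5 + 0.008 * (age - 45)

treated_idx = np.flatnonzero(treated)
control_idx = np.flatnonzero(~treated)

# Deterministic matching: pair each treated unit with the nearest
# untreated unit on age.
pairs = [(i, control_idx[np.argmin(np.abs(age[control_idx] - age[i]))])
         for i in treated_idx]

gaps = [abs(age[i] - age[j]) for i, j in pairs]
print(f"mean age gap within matched pairs: {np.mean(gaps):.2f} years")
```

Propensity score matching replaces the single covariate with a fitted probability of treatment, which scales better as the number of variables grows.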

Chapter Six : Putting It All Together


We should look at some extended examples to see the method of full problem thinking in action. By
looking at the scoping process, the structure of the arguments, and some of the exploratory steps (as
well as the wobbles inevitably encountered), we can bring together the ideas we have discussed into a
coherent whole. The goal of this chapter is not to try to use everything in every case, but instead to use
these techniques to help structure our thoughts and give us room to think through each part of a
problem. These examples are composites, lightly based on real projects.

Deep Dive: Predictive Model for Conversion Probability


The section describes a scenario where a consumer product company offers
a free trial for its service and runs targeted online ads to attract potential
customers. The company's business model is to provide such a useful service that
after 30 days, as many users will sign up for the continued service as possible.
However, determining the effectiveness of an ad takes 30 days, and in the
meantime, the company spends a considerable amount of money on ineffective
ads.
To address this issue, the company seeks the help of a data scientist to create
a predictive model based on user behavior and demographics that can predict
the expected value of each new user. The model's output would be something
like, "This user is 50% less likely than baseline to convert to being a paid user.
This user is 10% more likely to convert to being a paid user. This user… etc. In
aggregate, all thousand users are 5% less likely than baseline to convert.
Therefore, it would make sense to end this advertisement campaign early
because it is not attracting the right people."
The company needs to know the quality of a user based on information gathered
in just the first few days after a user has started using the service. The data
scientist can imagine a black box that takes in user behavior and demographic
information from the first few days and spits out a quality metric. For example, by
clicking around, talking to decision-makers, or already being familiar with the
service, the data scientist may find that there are a dozen or so actions that a user
can take with this service. The data scientist can count those and break them
down by time or platform, providing a reasonable first stab at behavior.
The quality metric should indicate how many users will convert to paid
customers. If possible, the company should go directly for a probability of
conversion. But since the action the company can take is to decide whether to
pull the plug on an advertisement, what the company is interested in is the
expected value of each new user, a combination of the probability of conversion
and the lifetime value of a new conversion. The data scientist needs to build a
predictive model that takes in data about behavior and demographics and puts
out a dollar figure. The company can then compare that dollar figure against the
cost of running the ad in the first place.
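The comparison described above reduces to simple arithmetic once the model's outputs exist: sum each user's conversion probability times an assumed lifetime value, then compare the total to the ad's cost. Every figure below is invented for illustration:

```python
# Hypothetical expected-value calculation for users attracted by one
# ad.  The lifetime value, ad cost, and per-user conversion
# probabilities are all made-up numbers.
lifetime_value = 120.0  # assumed revenue from a single conversion
ad_cost = 500.0         # assumed cost of running the advertisement

conversion_probs = [0.01, 0.05, 0.02, 0.08, 0.03]  # one per new user
expected_value = sum(p * lifetime_value for p in conversion_probs)

print(f"expected value of this cohort: ${expected_value:.2f}")
if expected_value < ad_cost:
    print("expected return does not cover the ad's cost")
```

In the real pipeline the probabilities would come from the predictive model after a few days of behavior, and the decision threshold would reflect the remaining spend on the campaign rather than its full cost.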
The data scientist needs to create a pipeline for calculating the cost of each ad
per person that the ad is shown to. The company will need to evaluate users
either once or periodically between 1 and 30 days to judge the value of each user
and then compare that value information to the cost of running the
advertisement. Typical decisions would be to continue running an advertisement,
to stop running one, or to ramp up spending on one that is performing
exceptionally well.
It is essential to measure the causal efficacy of the model on improving revenue.
The company needs to ensure that the advertisements that are being targeted to
be cut actually deserve it. By selecting some advertisements at random to be
spared from cutting, the company can check back in 30 days or so to see how
accurately it has predicted the conversion to paid users. If the model is accurate,
the conversion probabilities should be roughly similar to what was predicted, and
the short-term or estimated lifetime value should be similar as well.
To demonstrate that the cure is likely to work as claimed, the company needs
to provisionally check the quality of the model against held-out data. In the
longer term, they want to see the quality of the model for advertisements that
are left to run. In this particular case, the normal model quality checks (ROC
curves, precision and recall) are poorly suited for models that have only 1–2%
positive rates. Instead, they have to turn to an empirical/predicted probability
plot.
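An empirical/predicted probability plot (often called a calibration or reliability check) bins users by their predicted conversion probability and compares each bin's average prediction with the fraction that actually converted. A minimal sketch, with invented toy data:

```python
# Sketch of the empirical/predicted probability comparison: bin users by
# predicted conversion probability, then compare each bin's mean prediction
# with the fraction of that bin that actually converted.
def calibration_table(predicted, converted, n_bins=5):
    """Return (mean predicted, empirical rate, count) per probability bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, converted):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for rows in bins:
        if rows:
            mean_pred = sum(p for p, _ in rows) / len(rows)
            empirical = sum(y for _, y in rows) / len(rows)
            table.append((round(mean_pred, 3), round(empirical, 3), len(rows)))
    return table

# Toy data: for a well-calibrated model, each bin's empirical rate should
# sit close to its mean predicted probability.
preds = [0.01, 0.02, 0.01, 0.25, 0.30, 0.28]
outcomes = [0, 0, 0, 0, 1, 0]
print(calibration_table(preds, outcomes))
```

With only 1–2% positives, most bins cluster near zero, which is exactly why this view is more informative here than an ROC curve.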
The audience for this discussion is decision-makers who will need to approve the
project and build-out. The main claim is that the model should be used to predict
the quality of advertisements after only a few days of running them. The "ill" is
that it takes 30 days to get an answer about the quality of an ad. The "blame" is
that installation probability is not a sufficient predictor of conversion probability.
The "cure" is a cost-predictive model and a way of calculating the cost of running
a particular advertisement. The "cost" (or rather, the benefit) is that by cutting out
advertisements at five days, they will not spend 25 days worth of money on
unhelpful advertisements.
In the end, the discussion may center on the quality of the model, but other
aspects need to be considered as well. The reliability of the model relative to
the range of ad costs needs to be assessed, and the company needs to weigh
the time cost of developing, implementing, and running the model. If the model
is good, the project is almost certainly worth the investment, especially if a
high-quality answer is available within the first few days. The more automated
the process is, the more time it takes up front, but the more time it saves in
the long run. With even a reasonable investment, it should save far more than is spent.

Deep Dive: Calculating Access to Microfinance


The challenge in this project is to calculate access to microfinance
in South Africa, where small loans are provided as startup capital for a
business. The project's goal is to create a map that demonstrates where
access is lacking, which could be used to track success and drive policy.
The map will be accompanied by one or more summary statistics that more
concisely demonstrate success.
To achieve this, the project team needs to define what "access" means.
They claim that "has access to microfinance" can be reasonably
calculated by seeing, for each square kilometer, whether that square
kilometer is within 10 kilometers of a microloan office as the crow flies.
This definition is superior to showing degrees of access because there is
not much difference between a day's round-trip travel and a half-day's
round-trip travel. A map of South Africa, colored by population, masked
to only those areas outside of 10 kilometers distance to a microloan
office, is a good visual metric of access.
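The "within 10 kilometers as the crow flies" test can be computed with the haversine (great-circle distance) formula. A minimal sketch; the coordinates below are invented for illustration.

```python
import math

# Sketch of the access definition: a grid cell "has access" if its centre is
# within 10 km, as the crow flies, of at least one microloan office.
def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def has_access(cell, offices, radius_km=10.0):
    """True if the cell is within radius_km of any office."""
    return any(haversine_km(*cell, *office) <= radius_km for office in offices)

offices = [(-33.925, 18.424)]                # an illustrative office location
print(has_access((-33.95, 18.47), offices))  # roughly 5 km away -> True
print(has_access((-34.50, 19.50), offices))  # well over 100 km away -> False
```

Applying this test to every populated square kilometer yields exactly the mask described above.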
The project team needs to geocode the microloan offices into latitude
and longitude pairs. They also need population maps available for South
Africa (using a 1 km scale) to determine where the highest priority
places are in order to place microfinance offices. They will need district-
level administrative boundaries and some way to mash those
boundaries up with the population and office location maps to generate
per-district or per-metropolitan percentage of population that is within
10 kilometers of a microloan office.
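Once each populated grid cell has been joined to a district and flagged for access, the per-district statistic is a population-weighted percentage. A sketch with invented records:

```python
# Sketch of the district summary statistic. Each record is an invented
# (district, population, has_access) triple for one populated grid cell.
def access_by_district(cells):
    """Percentage of each district's population within 10 km of an office."""
    totals, covered = {}, {}
    for district, population, has_access in cells:
        totals[district] = totals.get(district, 0) + population
        if has_access:
            covered[district] = covered.get(district, 0) + population
    return {d: 100.0 * covered.get(d, 0) / totals[d] for d in totals}

cells = [
    ("Western Cape", 1000, True),
    ("Western Cape", 3000, True),
    ("Western Cape", 1000, False),
    ("Limpopo", 2000, False),
]
print(access_by_district(cells))  # -> {'Western Cape': 80.0, 'Limpopo': 0.0}
```

Weighting by population, rather than by area, keeps the statistic honest: a large uncovered desert matters far less than a small uncovered town.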
The project team should generate a more readable map (including
appropriate boundaries and cities to make it interpretable) to be shared
in a draft form with some of the decision makers. If everyone is still on
the same page, then the next priority should be calculating the summary
statistics and checking those again with the substantive experts. A
comprehensive written record of the important decisions that went into
making the map and summary statistics is essential.
In conclusion, calculating access to microfinance in South Africa can be
achieved by defining what "access" means, geocoding the microloan
offices into latitude and longitude pairs, using population maps, and
generating a readable map with appropriate boundaries and cities. The
project team needs to leave behind a comprehensive written record of
the important decisions that went into making the map and summary
statistics.
Context: The South African government is interested in where there are
gaps in microloan coverage. The nonprofit organization that tracks
microfinance around the world is interested in how new data sets can be
brought to bear on answering questions like this.
Need: We need to calculate access to microfinance in South Africa and
create a map that demonstrates where access is lacking, which could be
used to track success and drive policy. We also need one or
more summary statistics that could more concisely demonstrate success.
Vision: Our goal is to define access to microfinance as whether a square
kilometer is within 10 kilometers of a microloan office as the crow flies.
We will create a map of South Africa, colored by population, masked to
only those areas outside of 10 kilometers distance to a microloan office,
which will be a good visual metric of access. We will also generate per-
district or per-metropolitan percentage of population that is within 10
kilometers of a microloan office.
Outcome: We will deliver the maps to the nonprofit organization, which
will take them to the South African government. We can potentially work
with the South African government to receive regularly updated maps
and statistics. We will leave behind a comprehensive written record of
the important decisions that went into making the map and summary
statistics.
Decision Maker: This sounds like a great project. How will you geocode
the microloan offices into latitude and longitude pairs?
Project Team: We will use the addresses of the microloan offices and
a geocoding service to convert them into latitude and longitude pairs.
This will allow us to plot them on a map alongside the population
density map to get a sense of what a reasonable final map will look like.
Decision Maker: That makes sense. What about the summary statistics?
How will you calculate those?
Project Team: We will calculate per-district or per-metropolitan
percentage of population that is within 10 kilometers of a microloan
office by mashing up the district-level administrative boundaries with
the population and office location maps. This will provide a concise and
actionable complement to the maps.
Decision Maker: Okay, I think I understand. Once you have the maps and
summary statistics, how will you share them with us?
Project Team: We will share a draft version of the maps with appropriate
boundaries and cities to make it interpretable. We will also generate
tables separated by district to present the summary statistics. We can
then have a meeting to go over the maps and statistics and make any
necessary adjustments before delivering the final copies.
Wrapping Up
Data science, as a field, is overly concerned with the technical tools for
executing projects and not nearly concerned enough with asking the right
questions. It is very tempting, given how pleasurable it can be to lose
oneself in data science work, to just grab the first or most interesting data
set and go to town. Other disciplines have successfully built up techniques
for asking good questions and ensuring that, once started, work continues
on a productive path. We have much to gain from adapting their techniques
to our field.

We covered a variety of techniques appropriate to working professionally
with data. The two main groups were techniques for telling a good story
about a project, and techniques for making sure that we are making good
points with our data.

The first involved the scoping process. We looked at the context, need,
vision, and outcome (or CoNVO) of a project. We discussed the usefulness of
brief mockups and argument sketches. Next, we looked at additional steps for
refining the questions we are asking, such as planning out the scaffolding
for our project and engaging in rapid exploration in a variety of ways. What
each of these ideas has in common is that they are techniques designed to
keep us focused on two goals that are in constant tension and yet mutually
support each other: diving deep into figuring out what our goals are and
getting lost in the process of working with data.

Next, we looked at techniques for structuring arguments. Arguments are a
powerful theme in working with data, because we make them all the time
whether we are aware of them or not. Data science is the application of math
and computers to solve problems of knowledge creation; and to create
knowledge, we have to show how what is already known and what is already
plausible can be marshaled to make new ideas believable. We looked at the
main components of arguments: the audience, prior beliefs, claims,
justifications, and so on. Each of these helps us to clarify and improve the
process of making arguments. We explored how explicitly writing down
arguments can be a very powerful way to explore ideas, and we looked at how
techniques of transformation turn data into evidence that can serve to make
a point.

We next explored varieties of arguments that are common across data
science. We looked at classifying the nature of a dispute (fact, definition,
value, and policy) and how each of those disputes can be addressed with the
right claims. We also looked at specific argument strategies that are used
across all of the data-focused disciplines, such as optimization,
cost/benefit analysis, and causal reasoning. We looked at causal reasoning
in depth, which is fitting given its prominent place in data science,
examining how causal arguments are made and some of the techniques for
doing so, such as randomization and within-subject studies. Finally, we
explored some more in-depth examples.

Data science is an evolving discipline, but hopefully in several years this
material will seem obvious to every practitioner, and a clear place to start
for every beginner.
