Data-Science Needed
The world we live in is complex, random, and uncertain. At the same time,
it's one big data-generating machine. As we go about our daily lives, we
constantly produce data that can be captured and analyzed to gain
insights about the world around us. This process of turning the real world
into data and then using statistical inference to understand the underlying
processes is the foundation of data science.
Needed Statistical Inference-
• The world we live in is complex, random, and uncertain. At the same time, it’s one big data-generating
machine.
• As we commute to work on subways and in cars, as our blood moves through our bodies, as we’re shopping,
emailing, procrastinating at work by browsing the Internet and watching the stock market, as we’re building
things, eating things, talking to our friends and family about things, while factories are producing products,
this all at least potentially produces data.
• Imagine spending 24 hours looking out the window, and for every minute, counting and recording the number
of people who pass by. Or gathering up everyone who lives within a mile of your house and making them tell
you how many email messages they receive every day for the next year.
• Imagine heading over to your local hospital and rummaging around in the blood samples looking for patterns
in the DNA. That all sounded creepy, but it wasn’t supposed to. The point here is that the processes in our
lives are actually data-generating processes.
• We’d like ways to describe, understand, and make sense of these processes, in part because as scientists we
just want to understand the world better, but many times, understanding these processes is part of the solution
to problems we’re trying to solve.
• Data represents the traces of the real-world processes, and exactly which traces we gather are decided by our
data collection or sampling method. You, the data scientist, the observer, are turning the world into data, and
this is an utterly subjective, not objective, process.
• After separating the process from the data collection, we can see clearly that there are two sources of
randomness and uncertainty. Namely, the randomness and uncertainty underlying the process itself, and the
uncertainty associated with your underlying data collection methods.
• Once you have all this data, you have somehow captured the world, or certain traces of the world. But you
can’t go walking around with a huge Excel spreadsheet or database of millions of transactions and look at it
and, with a snap of a finger, understand the world and process that generated it.
• “This overall process of going from the world to the data, and then from the data back to the world, is the
field of statistical inference.”
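The two sources of randomness described above can be illustrated with a small simulation of the window-counting example. This is only a sketch under stated assumptions: the true rate of 5 people per minute, the pool of 200 potential passers-by, the choice to record only 60 of the 1,440 minutes, and the function names are all hypothetical and are not taken from these notes.

```python
import random
import statistics


def simulate_minute_counts(rate_per_minute, minutes, rng, potential=200):
    """Randomness in the process itself: people passing the window each minute.

    Assumption: each of `potential` people independently walks past in a
    given minute with probability rate_per_minute / potential.
    """
    p = rate_per_minute / potential
    return [
        sum(rng.random() < p for _ in range(potential))
        for _ in range(minutes)
    ]


def observe_sample(counts, n_observed, rng):
    """Randomness from data collection: only a random subset of minutes is recorded."""
    return rng.sample(counts, n_observed)


rng = random.Random(0)   # fixed seed so the run is repeatable
true_rate = 5.0          # hypothetical parameter of the underlying process

full_day = simulate_minute_counts(true_rate, 24 * 60, rng)
sample = observe_sample(full_day, 60, rng)

print("true rate of the process:   ", true_rate)
print("mean over the full day:     ", statistics.mean(full_day))
print("mean over 60 sampled minutes:", statistics.mean(sample))
```

The full-day mean differs from the true rate because the process is random, and the sampled mean differs again because of which minutes happened to be recorded. Statistical inference works backward from the sampled numbers to a statement about the true rate.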
Needed Statistical Inference
1 Data Collection
The processes in our lives are data-generating. We gather traces of these processes through data collection and
sampling methods, which are subjective and introduce uncertainty.
2 Statistical Inference
The overall process of going from the real world to data and then back to understanding the world is the field of
statistical inference, which allows us to draw conclusions about the processes that generated the data.
3 Statistical Modeling
To make sense of the data, we create statistical models that represent our understanding of the underlying
processes. These models use parameters to capture the relationships in the data.
Populations and Samples
Population Sample Sampling
Statistical modeling-
• Before you get too involved with the data and start coding, it’s useful to draw a picture of what you think the
underlying process might be with your model. What comes first? What influences what? What causes what?
What’s a test of that?
• But different people think in different ways. Some prefer to express these kinds of relationships in terms of
math. The mathematical expressions will be general enough that they have to include parameters, but the
values of these parameters are not yet known.
What is a Model?
1 Representing Reality
A model is an attempt to understand and represent the nature of reality through a particular lens, such as
architectural, biological, or mathematical.
2 Expressing Relationships
Models can be expressed using mathematical equations or diagrams that capture the relationships between
variables and parameters.
3 Estimating Parameters
The parameters in a model are unknown values that need to be estimated using the observed data. This process of
fitting the model is a key step in statistical inference.
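As one possible illustration of a model with unknown parameters, the sketch below assumes a straight-line relationship, y = intercept + slope * x + noise, and estimates the two parameters from observed data by ordinary least squares. The linear form, the "true" values 2.0 and 0.5, the noise level, and the function name fit_line are assumptions chosen for the example; the notes do not prescribe a particular model or fitting method.

```python
import random


def fit_line(xs, ys):
    """Estimate (intercept, slope) of y = intercept + slope * x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


rng = random.Random(1)                  # fixed seed so the run is repeatable
true_intercept, true_slope = 2.0, 0.5   # hypothetical values, unknown in practice

# Simulated observations: the assumed relationship plus Gaussian noise.
xs = [float(i) for i in range(100)]
ys = [true_intercept + true_slope * x + rng.gauss(0, 1.0) for x in xs]

est_intercept, est_slope = fit_line(xs, ys)
print("estimated intercept:", est_intercept)
print("estimated slope:    ", est_slope)
```

In real work the true parameter values are never visible. They appear here only so the estimates can be compared against them.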
Probability Distributions-