Data-Science Needed

Uploaded by

Thendral

Data Science

The world we live in is complex, random, and uncertain. At the same time,
it's one big data-generating machine. As we go about our daily lives, we
constantly produce data that can be captured and analyzed to gain
insights about the world around us. This process of turning the real world
into data and then using statistical inference to understand the underlying
processes is the foundation of data science.
Needed Statistical Inference-

• The world we live in is complex, random, and uncertain. At the same time, it’s one big data-generating
machine.
• As we commute to work on subways and in cars, as our blood moves through our bodies, as we’re shopping,
emailing, procrastinating at work by browsing the Internet and watching the stock market, as we’re building
things, eating things, talking to our friends and family about things, while factories are producing products,
this all at least potentially produces data.
• Imagine spending 24 hours looking out the window, and for every minute, counting and recording the number
of people who pass by. Or gathering up everyone who lives within a mile of your house and making them tell
you how many email messages they receive every day for the next year.
• Imagine heading over to your local hospital and rummaging around in the blood samples looking for patterns
in the DNA. That all sounded creepy, but it wasn’t supposed to. The point here is that the processes in our
lives are actually data-generating processes.
• We’d like ways to describe, understand, and make sense of these processes, in part because as scientists we
just want to understand the world better, but many times, understanding these processes is part of the solution
to problems we’re trying to solve.
• Data represents the traces of the real-world processes, and exactly which traces we gather are decided by our
data collection or sampling method. You, the data scientist, the observer, are turning the world into data, and
this is an utterly subjective, not objective, process.
• After separating the process from the data collection, we can see clearly that there are two sources of
randomness and uncertainty. Namely, the randomness and uncertainty underlying the process itself, and the
uncertainty associated with your underlying data collection methods.
• Once you have all this data, you have somehow captured the world, or certain traces of the world. But you
can’t go walking around with a huge Excel spreadsheet or database of millions of transactions and look at it
and, with a snap of a finger, understand the world and process that generated it.

• “This overall process of going from the world to the data, and then from the data back to the world, is the
field of statistical inference.”
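The two sources of randomness described above can be sketched with a small simulation of the window-counting example. This is only an illustration under stated assumptions: the arrival rate of 3 people per minute, the exponential gaps between arrivals, and the choice to record only 60 of the 1,440 minutes are all hypothetical and do not come from the text.

```python
import random
import statistics

def simulate_minute_counts(n_minutes, rate, rng):
    """Hypothetical process: people pass by with exponential gaps (rate per minute)."""
    counts = []
    for _ in range(n_minutes):
        elapsed, passed = rng.expovariate(rate), 0
        while elapsed < 1.0:          # count arrivals that fall inside this minute
            passed += 1
            elapsed += rng.expovariate(rate)
        counts.append(passed)
    return counts

rng = random.Random(0)

# Source 1: randomness in the process itself (counts differ from minute to minute).
full_day = simulate_minute_counts(24 * 60, rate=3.0, rng=rng)

# Source 2: randomness from data collection (assume only 60 minutes get recorded).
observed = rng.sample(full_day, 60)

full_mean = statistics.mean(full_day)
observed_mean = statistics.mean(observed)
print("mean over all minutes:     ", full_mean)
print("mean over recorded minutes:", observed_mean)
```

Rerunning with a different seed changes both numbers: the first because the process is random, the second additionally because a different set of minutes is recorded.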
Needed Statistical Inference

Data Collection
The processes in our lives are data-generating. We gather traces of these processes through data
collection and sampling methods, which are subjective and introduce uncertainty.

Statistical Inference
The overall process of going from the real world to data and then back to understanding the world is
the field of statistical inference, which allows us to draw conclusions about the processes that
generated the data.

Statistical Modeling
To make sense of the data, we create statistical models that represent our understanding of the
underlying processes. These models use parameters to capture the relationships in the data.
Populations and Samples

Population
In statistical inference, the population refers to the entire set of objects or units of interest,
such as all emails sent by employees at a company. The population size is denoted as N.

Sample
A sample is a subset of the population that is observed or measured. Samples are used to estimate the
characteristics of the larger population, as it is often impractical or impossible to measure the
entire population.

Sampling
The process of selecting a sample from the population is called sampling. The way the sample is chosen
can introduce additional uncertainty and bias into the data, which must be accounted for in the
statistical analysis.
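The relationship between a population, a sample, and the sampling method can be sketched as follows. The population here is made up for illustration: N = 10,000 hypothetical employees with a daily email count drawn uniformly between 0 and 100. The sample size n = 200 and the "heaviest senders" selection rule are likewise assumptions, chosen only to show how the way a sample is picked can bias an estimate.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: emails sent per day by each of N employees.
N = 10_000
population = [rng.randint(0, 100) for _ in range(N)]

# Simple random sample: n units drawn without replacement.
n = 200
sample = rng.sample(population, n)

# A biased selection: only the n heaviest senders are observed.
biased = sorted(population)[-n:]

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
biased_mean = statistics.mean(biased)

print("population mean:   ", pop_mean)
print("random-sample mean:", sample_mean)
print("biased-sample mean:", biased_mean)
```

The random sample's mean lands near the population mean, while the biased selection overstates it, which is the kind of sampling effect the analysis has to account for.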
What is a model?
• Humans try to understand the world around them by representing it in different ways. Architects capture
attributes of buildings through blueprints and three-dimensional, scaled-down versions.
• A model is our attempt to understand and represent the nature of reality through a particular lens, be it
architectural, biological, or mathematical.

Statistical modeling-
• Before you get too involved with the data and start coding, it’s useful to draw a picture of what you think the
underlying process might be with your model. What comes first? What influences what? What causes what?
What’s a test of that?
• But different people think in different ways. Some prefer to express these kinds of relationships in terms of
math. The mathematical expressions will be general enough that they have to include parameters, but the
values of these parameters are not yet known.
What is a Model?

1 Representing Reality
A model is an attempt to understand and represent the nature of reality through a particular lens,
such as architectural, biological, or mathematical.

2 Expressing Relationships
Models can be expressed using mathematical equations or diagrams that capture the relationships
between variables and parameters.

3 Estimating Parameters
The parameters in a model are unknown values that need to be estimated using the observed data. This
process of fitting the model is a key step in statistical inference.
Probability Distributions-

• Probability distributions are the foundation of statistical
models.
• When we get to linear regression and Naive Bayes, you will see
how this happens in practice.
• Back in the day, before computers, scientists observed real-
world phenomena, took measurements, and noticed that
certain mathematical shapes kept reappearing. The classical
example is the height of humans, which follows a normal
distribution, a bell-shaped curve also called a Gaussian
distribution, named after Gauss.
• Other common shapes have been named after their observers as
well (e.g., the Poisson distribution and the Weibull
distribution), while other shapes such as Gamma distributions
or exponential distributions are named after associated
mathematical objects.
• Not all processes generate data that looks like a named
distribution, but many do. We can use these functions as
building blocks of our models. The common shapes only have
names because someone observed them enough times to think
they deserved names. There is actually an infinite number of
possible distributions.
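To make the idea of named distributions as building blocks concrete, here is a minimal sketch that draws from four of them using Python's standard `random` module and compares each sample mean with the known theoretical mean. All parameter values (for example, heights with mean 170 and standard deviation 10) are assumptions picked for illustration. The Poisson distribution is left out only because the standard library has no sampler for it.

```python
import math
import random
import statistics

rng = random.Random(1)
n = 20_000

# Each entry: (function that draws one value, theoretical mean of the distribution).
samplers = {
    "normal(mu=170, sigma=10)": (lambda: rng.gauss(170.0, 10.0), 170.0),
    "exponential(rate=2)": (lambda: rng.expovariate(2.0), 0.5),
    "gamma(shape=2, scale=3)": (lambda: rng.gammavariate(2.0, 3.0), 6.0),
    # weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
    "weibull(scale=1, shape=1.5)": (
        lambda: rng.weibullvariate(1.0, 1.5),
        math.gamma(1 + 1 / 1.5),
    ),
}

sample_means = {}
for name, (draw, theoretical_mean) in samplers.items():
    values = [draw() for _ in range(n)]
    sample_means[name] = statistics.mean(values)
    print(f"{name}: sample mean {sample_means[name]:.3f}, "
          f"theoretical mean {theoretical_mean:.3f}")
```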
Probability Distributions

Foundational Concept
Probability distributions are the foundation of statistical models, as they provide a way to assign
probabilities to different outcomes or events.

Named Distributions
Certain mathematical shapes, such as the normal distribution, Poisson distribution, and Weibull
distribution, have been observed to appear in many real-world phenomena and have been given names.

Functional Form
Probability distributions have a specific functional form that includes parameters, which can be
estimated from the observed data to model the underlying process.

Infinite Possibilities
While many real-world processes can be modeled using named distributions, there are an infinite number
of possible probability distributions that can be used to represent different types of data and
processes.
Probability Distributions in Models

Parameters
In mathematical models, Greek letters are used to represent the unknown parameters, such as μ and σ in
the normal distribution, which need to be estimated from the data.

Probability Density Function
Probability distributions are expressed as probability density functions, which map each value of the
random variable to a non-negative real number and must integrate to 1 to be a valid probability
distribution.

Random Variables
The random variables in a model, denoted by x or y, are assumed to follow a corresponding probability
distribution, which allows us to make probabilistic statements about the outcomes.
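As an illustration of a density with parameters μ and σ, the sketch below writes out the normal density, p(x) = exp(-(x - μ)² / (2σ²)) / (σ √(2π)), and checks numerically that it integrates to 1. The helper names `normal_pdf` and `integrate`, the trapezoidal rule, and the ±8σ integration range are choices made for this example, not part of the original material.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

mu, sigma = 0.0, 1.0

# Almost all of the probability mass lies within 8 standard deviations of the mean.
area = integrate(lambda x: normal_pdf(x, mu, sigma), mu - 8 * sigma, mu + 8 * sigma)
print("area under the density:", area)
```

Changing `mu` or `sigma` shifts or stretches the curve, but the area stays at 1, which is what makes it a valid probability distribution.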
Fitting a Model

Specify Model
The first step in fitting a model is to specify the functional form of the relationship between the
variables, based on your understanding of the underlying process.

Estimate Parameters
The model parameters are then estimated using optimization methods applied to the observed data,
resulting in a fitted model with specific parameter values.

Interpret Results
The fitted model can now be used to make predictions or draw conclusions about the underlying process
that generated the data, which is the goal of statistical inference.
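The three steps above can be sketched for the simplest case, a normal model for heights. Everything specific here is an assumption for illustration: the data are simulated with a "true" μ of 170 and σ of 10, and the estimation step uses the closed-form maximum-likelihood estimates (sample mean and population-style standard deviation) in place of a numerical optimizer, which works for this model because the optimum is known exactly.

```python
import random
import statistics

rng = random.Random(7)

# Simulated observations standing in for real measurements (hypothetical values).
true_mu, true_sigma = 170.0, 10.0
data = [rng.gauss(true_mu, true_sigma) for _ in range(5_000)]

# Step 1: specify the model. Here we assume heights follow a normal(mu, sigma).

# Step 2: estimate the parameters. For the normal model the likelihood is
# maximized by the sample mean and the (divide-by-n) standard deviation.
mu_hat = statistics.fmean(data)
sigma_hat = statistics.pstdev(data, mu=mu_hat)

# Step 3: interpret. Use the fitted model to make a probabilistic statement.
fitted = statistics.NormalDist(mu_hat, sigma_hat)
p_taller_than_190 = 1 - fitted.cdf(190)

print(f"estimated mu = {mu_hat:.2f}, estimated sigma = {sigma_hat:.2f}")
print(f"P(height > 190) under the fitted model = {p_taller_than_190:.4f}")
```

For models without a closed-form solution, the same estimates would typically come from a numerical optimizer that maximizes the likelihood or minimizes a loss.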
From Data to Insight

1 Data Collection
The data-generating processes in the real world are captured through subjective data collection
methods, introducing uncertainty into the data.

2 Statistical Modeling
Statistical models are used to represent the relationships in the data, with unknown parameters that
need to be estimated from the observed data.

3 Statistical Inference
The process of fitting the model to the data and drawing conclusions about the underlying processes is
the core of statistical inference in data science.
Conclusion
The field of data science is built upon the foundation of statistical inference, which allows us to turn
the complex, uncertain, and data-generating world into insights and understanding. By carefully
modeling the relationships in the data and fitting those models to the observed information, data
scientists can uncover the patterns and processes that shape the world around us.
