Data Science and Visualization (21CS644) : Text Books
by
Dr. ROOPA H
Associate Professor
Department of Information Science & Engineering,
Bangalore Institute of Technology
TEXT BOOKS:
1. Doing Data Science, Cathy O'Neil and Rachel Schutt, O'Reilly Media,
Inc., 2013.
2. The Data Visualization Workshop, Tim Großmann and Mario Döbler, Packt
Publishing, ISBN 9781800568112.
REFERENCE BOOKS:
Module 1:
Introduction to Data Science:
Introduction: What is Data Science? Big Data and Data Science hype and
getting past the hype, Why now? Datafication, Current landscape of
perspectives, Skill sets needed.
09-Hours
Textbook 1: Chapter 1
Need for Big Data
The advancement in technology has led to an enormous increase in the
production and collection of data.
How Much Data Do We Have?
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
1000 Genomes Project: 200 TB
Big Data
Big Data is any data that is expensive to manage and hard to extract
value from
Volume
The size of the data
Velocity
The latency of data processing relative to the growing demand for
interactivity
Variety and Complexity
The diversity of sources, formats, quality, and structures.
Big Data
Types of Data:
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
What To Do With These Data?
Aggregation and Statistics
Data warehousing and OLAP
Indexing, Searching, and Querying
Keyword-based search
Pattern matching (XML/RDF)
Knowledge discovery
Data Mining
Statistical Modeling
Big Data and Data Science Hype:
Data scientists, "The Sexiest Job of the 21st Century" (Davenport and Patil,
Harvard Business Review, 2012)
Much of the data science explosion is coming from the tech world.
What does Data Science mean?
Is it the science of Big Data?
What is Big Data anyway?
Who does Data Science and where?
What existed before Data Science came along?
Is it simply a rebranding of statistics and machine learning?
“Anything that has to call itself a science isn’t.”
Hype increases noise-to-signal ratio in perceiving reality and makes it
harder to focus on the gems
Datafication:
In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor
Mayer-Schönberger wrote an article called "The Rise of Big Data." They defined
datafication as a process of "taking all aspects of life and turning them into
data," and observed:
"Once we datafy things, we can transform their purpose and turn the
information into new forms of value."
What is Data Science?
Theories and techniques from many fields and disciplines are
used to investigate and analyze a large amount of data to help
decision makers in many industries such as science,
engineering, economics, politics, finance, and education
Computer Science
Pattern recognition, visualization, data warehousing, High performance
computing, Databases, AI
Mathematics
Mathematical Modeling
Statistics
Statistical and Stochastic modeling, Probability.
Data Science: Take 1
Mike Driscoll (CEO of Metamarkets) defines data science as:
"Data science, as it's practiced, is a blend of Red-Bull-fueled
hacking and espresso-inspired statistics."
Data Scientist: The Sexiest Job of the 21st Century
They find stories, extract knowledge. They are not reporters.
Data scientists are the key to realizing the opportunities presented
by big data. They bring structure to it, find compelling patterns in
it, and advise executives on the implications for products, processes,
and decisions.
What do data scientists do?
“define what data science is by what data scientists get paid to do” (O’Neil and Schutt)
In academia, a data scientist is trained in some discipline, works with large
amounts of data, grapples with computational problems posed by the
structure, size, messiness, and the complexity and nature of the data, and
solves real-world problems.
In industry, a data scientist
knows how to extract meaning from and interpret data, which
requires both tools and methods from statistics and machine learning, as
well as being human.
spends a lot of effort in collecting, cleaning, and munging data
utilizing statistics and software engineering skills.
performs exploratory data analysis, finds patterns, and builds models and
algorithms.
communicates the findings in clear language and with data visualizations, so
that colleagues unfamiliar with the data can understand the implications.
Data Science team • Individual data scientist
profiles are merged to make a
Data science team
WHAT IS DATA SCIENCE?
…solving problems with data…
A typical flow: scientific, social, or business problem → collect & understand
data → clean & format data → use data to create a data solution.
…sounds cool!
What makes a good data scientist?
WHAT IS DATA ANALYSIS?
…using data to discover useful information…
Statistical Thinking in the Age of Big Data:
Big Data is a vague term, used loosely, if often, these days. But put simply,
the catchall phrase means three things. First, it is a bundle of technologies.
Second, it is a potential revolution in measurement. And third, it is a point
of view, or philosophy, about how decisions will be—and perhaps should
be—made in the future.
- Steve Lohr, The New York Times
Populations and Samples:
Use N to represent the total number of observations in the population.
Example:
Suppose your population was all emails sent last year by employees at a huge corporation,
BigCorp. Then a single observation could be a list of things: the sender’s name, the list of
recipients, date sent, text of email, number of characters in the email, number of sentences
in the email, number of verbs in the email, and the length of time until first reply.
A sample is a subset of the population, of size n, examined in order to draw
conclusions and make inferences about the population.
In the BigCorp email example, you could make a list of all the employees and select 1/10th
of those people at random and take all the email they ever sent, and that would be your
sample.
Alternatively, sample 1/10th of all email sent each day at random, and that would be your
sample.
Both these methods are reasonable, and both methods yield the same sample size.
But if you took them and counted how many email messages each person sent, and used that
to estimate the underlying distribution of emails sent by all individuals at BigCorp, you might
get entirely different answers.
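To make the contrast concrete, here is a small sketch of the two sampling schemes (not from the textbook; the 1,000-employee population and all volumes are invented for illustration):

    # Two ways to take a 1/10th sample of BigCorp's email, per the example above.
    # Hypothetical population: a few very heavy senders, many light ones.
    import random
    from collections import Counter

    random.seed(0)
    emails = []  # one entry per email, recording the sender's id
    for sender in range(1000):
        volume = random.randint(100, 500) if sender < 50 else random.randint(1, 20)
        emails.extend([sender] * volume)

    # Scheme 1: pick 1/10th of the *people* at random, keep all their email.
    chosen = set(random.sample(range(1000), 100))
    sample1 = [s for s in emails if s in chosen]

    # Scheme 2: pick 1/10th of the *emails* at random.
    sample2 = random.sample(emails, len(emails) // 10)

    # Similar sample sizes...
    print(len(sample1), len(sample2))
    # ...but very different per-person counts: scheme 1 keeps each included
    # sender's full volume, scheme 2 keeps roughly 10% of everyone's volume.
    print(Counter(sample1).most_common(3))
    print(Counter(sample2).most_common(3))

Scheme 1 either keeps a heavy sender entirely or drops them entirely, while scheme 2 shrinks everyone's count by roughly the same factor, which is why estimates of the per-person email distribution can come out entirely different.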
Populations and Samples of Big Data:
• Sampling solves some engineering challenges
• How much data you need at hand really depends on what your goal is: for analysis or
inference purposes, you typically don’t need to store all the data all the time.
Bias
Even if we have access to all of Facebook’s or Google’s or Twitter’s data corpus, any
inferences made from that data should not be extended to draw conclusions about humans
beyond those sets of users, or even those users for any particular day.
Kate Crawford, a principal scientist at Microsoft Research, describes in her Strata talk,
“Hidden Biases of Big Data,” how if you analyzed tweets immediately before and after
Hurricane Sandy, you would think that most people were supermarket shopping pre-Sandy
and partying post-Sandy. However, most of those tweets came from New Yorkers.
First of all, they’re heavier Twitter users than, say, the coastal New Jerseyans, and second of
all, the coastal New Jerseyans were worrying about other stuff like their house falling down
and didn’t have time to tweet.
In other words, you would think that Hurricane Sandy wasn't all that bad if you
used tweet data to understand it. The only conclusion you can actually draw is that this is
what Hurricane Sandy was like for the subset of Twitter users (who themselves are not
representative of the general US population) whose situation was not so bad that they
didn't have time to tweet.
An example from that very article—election night polls—is in itself a great counter-example:
even if we poll absolutely everyone who leaves the polling stations, we still don’t count people
who decided not to vote in the first place. And those might be the very people we'd need to
talk to in order to understand our country's voting problems.
Indeed, we’d argue that the assumption we make that N=ALL is one of the biggest problems
we face in the age of Big Data. It is, above all, a way of excluding the voices of people who
don’t have the time, energy, or access to cast their vote in all sorts of informal, possibly
unannounced, elections.
Modeling:
A model is our attempt to understand and represent the nature of reality through a
particular lens, be it architectural, biological, or mathematical.
A model is an artificial construction where all extraneous detail has been removed or
abstracted. Attention must always be paid to these abstracted details after a model has been
analyzed to see what might have been overlooked.
In the case of proteins, a model of the protein backbone with side chains by itself is
removed from the laws of quantum mechanics that govern the behavior of the electrons,
which ultimately dictate the structure and actions of proteins. In the case of a statistical
model, we may have mistakenly excluded key variables, included irrelevant ones,
or assumed a mathematical structure divorced from reality.
Statistical modeling:
Before you get too involved with the data and start coding, it’s useful to draw
a picture of what you think the underlying process might be with your model.
What comes first? What influences what? What causes what? What’s a test of
that?
But different people think in different ways. Some prefer to express these
kinds of relationships in terms of math.
Other people prefer pictures and will first draw a diagram of data flow,
possibly with arrows, showing how things affect other things or what happens
over time. This gives them an abstract picture of the relationships before
choosing equations to express them.
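For instance (an invented illustration, not the textbook's example), the math-first approach might begin by writing down a simple linear form for how one quantity influences another:

    y = \beta_0 + \beta_1 x + \varepsilon

Here \beta_0 and \beta_1 are parameters to be estimated from data and \varepsilon is a noise term; the picture-first approach would instead draw an arrow from x to y.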
One place to start is exploratory data analysis (EDA), which we will cover in a later section.
This entails making plots and building intuition for your particular dataset. EDA helps out a
lot, as well as trial and error and iteration.
For example, you can (and should) plot histograms and look at scatterplots to start getting a
feel for the data. Then you just try writing something down, even if it’s wrong first (it will
probably be wrong first, but that doesn’t matter).
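A minimal sketch of this first-look EDA, using synthetic stand-in data (all numbers invented; with a real dataset you would load it instead):

    # First-pass EDA: a histogram for one variable's shape, a scatterplot
    # for the relationship between two variables.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    emails_sent = rng.poisson(lam=12, size=2000)               # invented variable
    replies = 0.3 * emails_sent + rng.normal(0, 2, size=2000)  # a noisy companion

    plt.hist(emails_sent, bins=30)
    plt.xlabel("emails sent per day")
    plt.show()

    plt.scatter(emails_sent, replies, s=5)
    plt.xlabel("emails sent")
    plt.ylabel("replies received")
    plt.show()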
Start building up your arsenal of potential models throughout this book. Some of the building
blocks of these models are probability distributions.
Probability distributions:
Probability distributions are the foundation of statistical models.
Back in the day, before computers, scientists observed real-world phenomena, took
measurements, and noticed that certain mathematical shapes kept reappearing. The classical
example is the height of humans, following a normal distribution—a bell-shaped curve, also
called a Gaussian distribution, named after Gauss.
Other common shapes have been named after their observers as well (e.g., the Poisson
distribution and the Weibull distribution), while other shapes such as Gamma distributions or
exponential distributions are named after associated mathematical objects.
Natural processes tend to generate measurements whose empirical shape could be
approximated by mathematical functions with a few parameters that could be estimated from
the data.
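As a quick illustration (parameter values invented), each of these families is pinned down by just a few parameters, which is what makes them convenient building blocks:

    # Draw samples from a few classical distribution families and check that
    # the empirical summaries match the few parameters that define each one.
    import numpy as np

    rng = np.random.default_rng(0)
    heights = rng.normal(loc=170, scale=8, size=10_000)   # Gaussian: mean, std dev
    arrivals = rng.poisson(lam=4, size=10_000)            # Poisson: rate
    waits = rng.exponential(scale=2.0, size=10_000)       # exponential: scale

    print(heights.mean(), heights.std())  # close to 170 and 8
    print(arrivals.mean())                # close to 4
    print(waits.mean())                   # close to 2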
Fitting a model:
Fitting a model means that you estimate the parameters of the model using the observed
data.
Fitting the model often involves optimization methods and algorithms, such as maximum
likelihood estimation, to help get the parameters.
Fitting the model is when you start actually coding: your code will read in the data, and
you’ll specify the functional form that you wrote down on the piece of paper.
Initially you should have an understanding that optimization is taking place and how it
works, but you don’t have to code this part yourself—it underlies the R or Python
functions.
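A minimal sketch of fitting, assuming a Gaussian model and using SciPy's built-in maximum likelihood fitter (the height data here is simulated stand-in data):

    # Fit a normal distribution to observed data by maximum likelihood.
    # The optimization happens inside stats.norm.fit, as noted above.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    data = rng.normal(loc=170, scale=8, size=1000)  # stand-in for real measurements

    mu_hat, sigma_hat = stats.norm.fit(data)  # MLE estimates of the parameters
    print(mu_hat, sigma_hat)                  # close to 170 and 8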
Overfitting:
Overfitting is the term used to mean that you used a dataset to estimate the parameters of
your model, but your model isn't that good at capturing reality beyond your sampled data.