
BANGALORE INSTITUTE OF TECHNOLOGY

Department of Information Science and Engineering


K.R.Road, V.V.Puram, Bangalore – 560 004

Data Science and Visualization (21CS644)

by

Dr. Roopa H
Associate Professor
Department of Information Science & Engineering,
Bangalore Institute of Technology

TEXT BOOKS:

1. Doing Data Science, Cathy O'Neil and Rachel Schutt, O'Reilly Media, Inc., 2013.
2. The Data Visualization Workshop, Tim Grobmann and Mario Dobler, Packt Publishing, ISBN 9781800568112.

REFERENCE BOOKS:

1. Mining of Massive Datasets, Anand Rajaraman and Jeffrey D. Ullman, Cambridge University Press, 2010.
2. Data Science from Scratch, Joel Grus, O'Reilly Media / Shroff Publishers.
3. Data Visualisation: A Handbook for Data Driven Design, Andy Kirk.

Module 1: Introduction to Data Science

Introduction: What is Data Science? Big Data and Data Science hype and getting past the hype, Why now? Datafication, Current landscape of perspectives, Skill sets needed.

Statistical Inference: Populations and samples, Statistical modelling, probability distributions, fitting a model.

09 Hours
Textbook 1: Chapter 1


Data All Around


 Lots of data is being collected and warehoused
 Web data, e-commerce
 Financial transactions, bank/credit transactions
 Online trading and purchasing
 Social Network

Need for Big Data
 The advancement in technology has led to an increase in the production and storage of voluminous amounts of data.

 Earlier, megabytes (10^6 B) were typical.

 Nowadays, petabytes (10^15 B) are used for processing and analysis, discovering new facts and generating new knowledge.

 Conventional systems for storage, processing and analysis face challenges from the large growth in the volume of data, the variety of data in different forms and formats, increasing complexity, the faster generation of data, and the need for quick processing, analysis and use.


Figure: Evolution of Big Data and its characteristics


How Much Data Do We Have?
 Google processes 20 PB a day (2008)
 Facebook has 60 TB of daily logs
 eBay has 6.5 PB of user data + 50 TB/day (5/2009)
 1000 Genomes Project: 200 TB

 Cost of 1 TB of disk: $35
 Time to read 1 TB from disk: about 3 hrs (at 100 MB/s)
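The read-time figure is simple arithmetic; a quick sanity check in Python (a minimal sketch using the slide's assumed 100 MB/s sequential read rate):

    # Sanity check of the "time to read 1 TB" figure above,
    # assuming the slide's sequential read rate of 100 MB/s.
    TB = 10**12               # 1 TB in bytes (decimal)
    rate = 100 * 10**6        # 100 MB/s in bytes per second

    seconds = TB / rate       # 10,000 s
    print(f"{seconds:.0f} s = {seconds / 3600:.1f} hours")   # ~2.8 hours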


Big Data
Big Data is any data that is expensive to manage and hard to extract
value from

 Volume
 The size of the data
 Velocity
 The latency of data processing relative to the growing demand for
interactivity
 Variety and Complexity
 the diversity of sources, formats, quality, structures.


Types of Data:
 Relational Data (Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF), …
 Streaming Data

What To Do With These Data?
 Aggregation and Statistics
 Data warehousing and OLAP
 Indexing, Searching, and Querying
 Keyword based search
 Pattern matching (XML/RDF)
 Knowledge discovery
 Data Mining
 Statistical Modeling


Big Data and Data Science


 Analytics of Big Data could enable discovery of new facts, knowledge and
strategy in a number of fields, such as manufacturing, business, finance,
healthcare, medicine and education.
 “… the sexy job in the next 10 years will be statisticians,” Hal Varian,
Google Chief Economist
 The U.S. will need 140,000–190,000 predictive analysts and 1.5 million managers/analysts by 2018 (McKinsey Global Institute, June 2011)
 New Data Science institutes being created or repurposed – NYU,
Columbia, Washington, UCB,...
 New degree programs, courses, boot-camps:
 e.g., at Berkeley: Stats, I-School, CS, Astronomy…
 One proposal (elsewhere) for an MS in “Big Data Science”

Big Data and Data Science Hype:
 Data scientists, "The Sexiest Job of the 21st Century" (Davenport and Patil,
Harvard Business Review, 2012)
 Much of the data science explosion is coming from the tech-world
 What does Data Science mean?
 Is it the science of Big Data?
 What is Big Data anyway?
 Who does Data Science and where?
 What existed before Data Science came along?
 Is it simply a rebranding of statistics and machine learning?
 “Anything that has to call itself a science isn’t.”
 Hype increases noise-to-signal ratio in perceiving reality and makes it
harder to focus on the gems


Getting Past the Hype:

When students go from college to a real job, they realize there is a gap between what they learned and what they do on the job.
Example:
The difference between academic statistics and industry statistics.
In general, data scientists on the job have to access a larger body of knowledge and methodology, as well as a data science process that has foundations in both statistics and computer science.

Why Now?
We now collect massive amounts of data about many aspects of our lives, and we have an abundance of inexpensive computing power.
What people might not know is that the "datafication" of our offline behavior has started as well, mirroring the online data-collection revolution. When these are put together, there is a lot to learn about our behavior and, by extension, who we are as a species.
It is not only the massiveness that makes the new data interesting (or poses challenges); it is that the data itself, often in real time, becomes the building blocks of data products.
Datafication:
In May/June 2013, Kenneth Neil Cukier and Viktor Mayer-Schoenberger wrote an article called "The Rise of Big Data". They defined

 Datafication = "taking all aspects of life and turning them into data"

Example:
• Google's augmented-reality glasses datafy the gaze
• Twitter datafies stray thoughts
• LinkedIn datafies professional networks

 "Once we datafy things, we can transform their purpose and turn the information into new forms of value."

"We" here means the modelers and entrepreneurs making money from getting people to buy stuff, and the "value" translates into something like increased efficiency through automation.


What is Data Science?


 An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
 Data science (DS) is a multidisciplinary field of study with the goal of addressing the challenges in big data
 Data science principles apply to all data – big and small

What is Data Science?
 Theories and techniques from many fields and disciplines are
used to investigate and analyze a large amount of data to help
decision makers in many industries such as science,
engineering, economics, politics, finance, and education
 Computer Science
 Pattern recognition, visualization, data warehousing, High performance
computing, Databases, AI
 Mathematics
 Mathematical Modeling
 Statistics
 Statistical and Stochastic modeling, Probability.

Data Science take I
Mike Driscoll (CEO of Metamarkets) defines data science as:

"Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.

But data science is not merely hacking, because when hackers finish debugging their Bash one-liners and Pig scripts, few of them care about non-Euclidean distance metrics.

And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a tab-delimited file into R if their job depended on it.

Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what's possible."

Figure: Drew Conway's Venn diagram of data science

On the many posers: "It's not enough to just know how to run a black box algorithm. You actually need to know how and why it works, so that when it doesn't work, you can adjust." – Cathy O'Neil

Figure: Gartner's 2014 Hype Cycle

Data Scientists
 Data Scientist
 The Sexiest Job of the 21st Century
 They find stories, extract knowledge. They are not reporters.
 Data scientists are the key to realizing the opportunities presented
by big data. They bring structure to it, find compelling patterns in
it, and advise executives on the implications for products, processes,
and decisions.


Types of Data Scientists


1) Machine Learning Scientist
2) Statistician
3) Software Programming Analyst
4) Data Engineer
5) Actuarial Scientist
6) Business Analytic Practitioner
7) Quality Analyst
8) Spatial Data Scientist
9) Mathematician
10) Digital Analytic Consultant

What do data scientists do?
 “define what data science is by what data scientists get paid to do” (O’Neil and Schutt)
 In academia, a data scientist is trained in some discipline, works with large
amounts of data, grapples with computational problems posed by the
structure, size, messiness, and the complexity and nature of the data, and
solves real-world problems.
 In industry, a data scientist
 knows how to extract meaning from and interpret data, which
requires both tools and methods from statistics and machine learning, as
well as being human.
 spends a lot of effort collecting, cleaning, and munging data, using statistics and software engineering skills.
 performs exploratory data analysis, finds patterns, and builds models and algorithms.
 communicates the findings in clear language and with data visualizations, so that colleagues unfamiliar with the data can understand the implications.

A Data Science Profile:


Create a profile (on a relative rather than absolute scale) of a data scientist's skill levels in the following domains:
• Computer science
• Math
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization

Data Science Team
• Individual data scientist profiles are merged to make a data science team.
• The team profile should align with the profile of the data problems to tackle.


Data Science: Skills and Actors


Clustering and visualization of data science subfields, based on a survey of data science practitioners (Analyzing the Analyzers by Harlan Harris, Sean Murphy, and Marck Vaisman, 2012):

• Data Businesspeople are the product- and profit-focused data scientists. They're leaders, managers, and entrepreneurs, but with a technical bent. A common educational path is an engineering degree paired with an MBA.

• Data Creatives are eclectic jacks-of-all-trades, able to work


with a broad range of data and tools. They may think of
themselves as artists or hackers, and excel at visualization
and open source technologies.

• Data Developers are focused on writing software to do


analytic, statistical, and machine learning tasks, often in
production environments. They often have computer
science degrees, and often work with so-called “big data”.

• Data Researchers apply their scientific training, and the


tools and techniques they learned in academia, to
organizational data. They may have PhDs, and their creative
applications of mathematical tools yield valuable insights
and products.
WHAT IS DATA SCIENCE?
…solving problems with data…

scientific, social, or business problem → collect & understand data → clean & format data → use data to create solution

…sounds cool!
What makes a good data scientist?


WHAT IS DATA SCIENCE?
…solving problems with data…

scientific, social, or business problem → collect & understand data → clean & format data → use data to create solution

…which step is most challenging?

use data to create solution → data analysis or machine learning (or both)
WHAT IS DATA ANALYSIS?
…using data to discover useful information…

• data: anything you can measure or record

• statistics: summarize (and visualize) main characteristics of the data

• algorithms: apply algorithms to find patterns in the data


WHAT IS MACHINE LEARNING?
…creating and using models that learn from data…

• data: anything you can measure or record

• model: specification of a (mathematical) relationship between different variables

• evaluation: how well does the model work?
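To make the three terms concrete, here is a minimal sketch (the data is made up, and scikit-learn is assumed only as one convenient library choice, not something the slides prescribe):

    # data -> model -> evaluation, in miniature
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # data: anything measured or recorded (here, made-up numbers)
    X = np.arange(50, dtype=float).reshape(-1, 1)            # e.g., years of experience
    y = 2.5 * X.ravel() + np.random.normal(0, 5, size=50)    # e.g., salary (thousands)

    # model: an assumed (linear) relationship between the variables
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)

    # evaluation: how well does the model work on data it has not seen?
    print("R^2 on held-out data:", model.score(X_test, y_test))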

Statistical Thinking in the Age of Big Data:
Big Data is a vague term, used loosely, if often, these days. But put simply,
the catchall phrase means three things. First, it is a bundle of technologies.
Second, it is a potential revolution in measurement. And third, it is a point
of view, or philosophy, about how decisions will be—and perhaps should
be—made in the future.
– Steve Lohr, The New York Times

When you’re developing your skill set as a data scientist, certain


foundational pieces need to be in place first—statistics, linear algebra, some
programming. Even once you have those pieces, part of the challenge is that
you will be developing several skill sets in parallel simultaneously—data
preparation and munging, modeling, coding, visualization, and
communication—that are interdependent. As we progress through the
book, these threads will be intertwined. That said, we need to start
somewhere, and will begin by getting grounded in statistical inference.

Use N to represent the total number of observations in the
population.


Example:

Suppose your population was all emails sent last year by employees at a huge corporation,
BigCorp. Then a single observation could be a list of things: the sender’s name, the list of
recipients, date sent, text of email, number of characters in the email, number of sentences
in the email, number of verbs in the email, and the length of time until first reply.
A sample is a subset of the population, of size n, taken in order to examine the observations and draw conclusions and make inferences about the population.
In the BigCorp email example, you could make a list of all the employees, select 1/10th of those people at random, and take all the email they ever sent; that would be your sample.
Alternatively, you could sample 1/10th of all email sent each day at random, and that would be your sample.
Both these methods are reasonable, and both methods yield the same sample size.
But if you took them and counted how many email messages each person sent, and used that to estimate the underlying distribution of emails sent by all individuals at BigCorp, you might get entirely different answers.
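A small sketch of the two sampling schemes described above (the data and names here are made up purely for illustration):

    # Two ways to sample BigCorp email, and why the per-sender counts differ.
    import random
    from collections import Counter

    random.seed(0)
    employees = [f"emp{i}" for i in range(1000)]
    # Pretend log: the sender of each of 100,000 emails sent last year.
    emails = [random.choice(employees) for _ in range(100_000)]

    # Scheme 1: pick 1/10th of the employees, keep all the email they sent.
    chosen = set(random.sample(employees, len(employees) // 10))
    sample1 = [sender for sender in emails if sender in chosen]

    # Scheme 2: pick 1/10th of all emails at random.
    sample2 = random.sample(emails, len(emails) // 10)

    # Roughly the same sample size, very different per-person counts:
    print(len(sample1), len(sample2))
    print(Counter(sample1).most_common(3))   # ~100 emails per sampled sender
    print(Counter(sample2).most_common(3))   # ~10 emails per sender, across many senders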

Populations and Samples of Big Data:
• Sampling solves some engineering challenges
• How much data you need at hand really depends on what your goal is: for analysis or
inference purposes, you typically don’t need to store all the data all the time.
Bias
Even if we have access to all of Facebook’s or Google’s or Twitter’s data corpus, any
inferences made from that data should not be extended to draw conclusions about humans
beyond those sets of users, or even those users for any particular day.
Kate Crawford, a principal scientist at Microsoft Research, describes in her Strata talk,
“Hidden Biases of Big Data,” how if you analyzed tweets immediately before and after
Hurricane Sandy, you would think that most people were supermarket shopping pre-Sandy
and partying post-Sandy. However, most of those tweets came from New Yorkers.
First of all, they're heavier Twitter users than, say, the coastal New Jerseyans, and second of all, the coastal New Jerseyans were worrying about other stuff like their house falling down and didn't have time to tweet.
In other words, you would think that Hurricane Sandy wasn't all that bad if you used tweet data to understand it. The only conclusion you can actually draw is that this is what Hurricane Sandy was like for the subset of Twitter users (who themselves are not representative of the general US population), whose situation was not so bad that they didn't have time to tweet.

Terminology: Big Data:


• “Big” is a moving target.
• “Big” is when you can’t fit it on one machine.
• Big Data is a cultural phenomenon.
• The 4 Vs: Volume, variety, velocity, and value.
Big Data Can Mean Big Assumptions:
In their article "The Rise of Big Data," Cukier and Mayer-Schoenberger argue that the Big Data revolution consists of three things:
• Collecting and using a lot of data rather than small samples
• Accepting messiness in your data
• Giving up on knowing the causes
On this view, you supposedly don't need to worry about sampling error because you are literally keeping track of the truth. The way the article frames this is by claiming that the new approach of Big Data means letting "N = ALL."
Can N=ALL?
Here’s the thing: it’s pretty much never all. And we are very often missing the very things we
should care about most.
An example from that very article—election night polls—is in itself a great counter-example:
even if we poll absolutely everyone who leaves the polling stations, we still don’t count people
who decided not to vote in the first place. And those might be the very people we'd need to talk to in order to understand our country's voting problems.
Indeed, we’d argue that the assumption we make that N=ALL is one of the biggest problems
we face in the age of Big Data. It is, above all, a way of excluding the voices of people who
don’t have the time, energy, or access to cast their vote in all sorts of informal, possibly
unannounced, elections.

A model is our attempt to understand and represent the nature of reality through a
particular lens, be it architectural, biological, or mathematical.
A model is an artificial construction where all extraneous detail has been removed or
abstracted. Attention must always be paid to these abstracted details after a model has been
analyzed to see what might have been overlooked.
In the case of proteins, a model of the protein backbone with side chains by itself is
removed from the laws of quantum mechanics that govern the behavior of the electrons,
which ultimately dictate the structure and actions of proteins. In the case of a statistical
model, we may have mistakenly excluded key variables, included irrelevant ones,
or assumed a mathematical structure divorced from reality.

Statistical modelling:
Before you get too involved with the data and start coding, it’s useful to draw
a picture of what you think the underlying process might be with your model.
What comes first? What influences what? What causes what? What’s a test of
that?
But different people think in different ways. Some prefer to express these
kinds of relationships in terms of math.
Other people prefer pictures and will first draw a diagram of data flow,
possibly with arrows, showing how things affect other things or what happens
over time. This gives them an abstract picture of the relationships before
choosing equations to express them.

But how do you build a model?


How do you have any clue whatsoever what functional form the data should
take? Truth is, it’s part art and part science.

One place to start is exploratory data analysis (EDA), which we will cover in a later section.
This entails making plots and building intuition for your particular dataset. EDA helps out a
lot, as well as trial and error and iteration.
For example, you can (and should) plot histograms and look at scatterplots to start getting a
feel for the data. Then you just try writing something down, even if it’s wrong first (it will
probably be wrong first, but that doesn’t matter).
Start building up your arsenal of potential models throughout this book. Some of the building
blocks of these models are probability distributions.
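As a concrete starting point, exploratory plots might look like the following sketch (the file and column names are hypothetical; pandas and matplotlib are assumed only as common choices):

    # First-look EDA: summary statistics, a histogram, and a scatterplot.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("bigcorp_emails.csv")       # hypothetical dataset

    print(df.describe())                         # summary statistics per column

    df["num_characters"].hist(bins=50)           # distribution of one variable
    plt.xlabel("characters per email")
    plt.show()

    df.plot.scatter(x="num_characters", y="time_to_first_reply")   # two variables together
    plt.show()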
Probability distributions:
Probability distributions are the foundation of statistical models.
Back in the day, before computers, scientists observed real-world phenomena, took
measurements, and noticed that certain mathematical shapes kept reappearing. The classical
example is the height of humans, following a normal distribution—a bell-shaped curve, also
called a Gaussian distribution, named after Gauss.
Other common shapes have been named after their observers as well (e.g., the Poisson
distribution and the Weibull distribution), while other shapes such as Gamma distributions or
exponential distributions are named after associated mathematical objects.
Natural processes tend to generate measurements whose empirical shape could be
approximated by mathematical functions with a few parameters that could be estimated from
the data.

Distributions are to be interpreted as assigning a probability to a subset of possible outcomes, and they have corresponding functions. For example, the normal distribution is written as:

N(x | μ, σ) = (1 / (σ · √(2π))) · exp( −(x − μ)² / (2σ²) )

The parameter μ is the mean and median and controls where the distribution is centered (because this is a symmetric distribution), and the parameter σ controls how spread out the distribution is. This is the general functional form, but for specific real-world phenomena these parameters have actual numbers as values, which we can estimate from the data.
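As a small illustration (with a made-up sample), μ and σ can be estimated directly from data and plugged into the functional form above:

    # Estimate mu and sigma from data and evaluate the normal density.
    import numpy as np

    data = np.random.normal(loc=170, scale=8, size=1000)   # e.g., heights in cm (made up)

    mu_hat = data.mean()            # estimate of mu, the center
    sigma_hat = data.std(ddof=1)    # estimate of sigma, the spread

    def normal_pdf(x, mu, sigma):
        # The general functional form written above, with concrete parameter values.
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    print(mu_hat, sigma_hat, normal_pdf(175.0, mu_hat, sigma_hat))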

Fitting a model:

Fitting a model means that you estimate the parameters of the model using the observed
data.
Fitting the model often involves optimization methods and algorithms, such as maximum
likelihood estimation, to help get the parameters.

Fitting the model is when you start actually coding: your code will read in the data, and
you’ll specify the functional form that you wrote down on the piece of paper.

Initially you should have an understanding that optimization is taking place and how it
works, but you don’t have to code this part yourself—it underlies the R or Python
functions.
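For example, fitting the normal model described above might look like this sketch (scipy is assumed here as one such library; the maximum likelihood optimization happens inside norm.fit):

    # Fitting a model = estimating its parameters from observed data.
    import numpy as np
    from scipy.stats import norm

    data = np.random.normal(loc=170, scale=8, size=1000)    # observed sample (made up)

    # The library performs maximum likelihood estimation for us.
    mu_hat, sigma_hat = norm.fit(data)
    print(f"estimated mu = {mu_hat:.2f}, estimated sigma = {sigma_hat:.2f}")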

Overfitting:

Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn't that good at capturing reality beyond your sampled data.
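A common way to see overfitting is to compare performance on the data used for fitting against held-out data, as in this sketch (made-up data; scikit-learn assumed only as a convenience):

    # An overly flexible model memorizes the training sample but generalizes poorly.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(40, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.3, size=40)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
    model.fit(X_train, y_train)

    print("R^2 on training data:", model.score(X_train, y_train))   # close to 1.0
    print("R^2 on held-out data:", model.score(X_test, y_test))     # usually much worse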

