Data Science Harvard Lecture 1 PDF
Required Textbooks
Statistics
Database Querying
SQL
Data Warehousing
Regression Analysis
Explanatory versus Predictive Modeling
April 1980 - I.A. Tjomsland gives a talk titled "Where Do We Go From Here?" at
the Fourth IEEE Symposium on Mass Storage Systems, in which he says: "Those
associated with storage devices long ago realized that Parkinson's First Law may
be paraphrased to describe our industry: 'Data expands to fill the space
available.' I believe that large amounts of data are being retained because users
have no way of identifying obsolete data; the penalties for storing obsolete data
are less apparent than are the penalties for discarding potentially useful data."
July 1986 - Hal B. Becker publishes "Can users really absorb data at today's
rates? Tomorrow's?" in Data Communications. Becker estimates that "the
recording density achieved by Gutenberg was approximately 500 symbols
(characters) per cubic inch, 500 times the density of [4,000 B.C. Sumerian] clay
tablets. By the year 2000, semiconductor random access memory should be storing
1.25×10^11 bytes per cubic inch."
1996 - Digital storage becomes more cost-effective for storing data than paper,
according to R.J.T. Morris and B.J. Truskowski in "The Evolution of Storage
Systems," IBM Systems Journal, July 1, 2003.
1997 - Michael Lesk publishes "How much information is there in the world?"
Lesk concludes that "there may be a few thousand petabytes of information all
told; and the production of tape and disk will reach that level by the year 2000. So
in only a few years, (a) we will be able [to] save everything; no information will
have to be thrown out, and (b) the typical piece of information will never be
looked at by a human being."
August 1999 - Steve Bryson, David Kenwright, Michael Cox, David Ellsworth, and
Robert Haimes publish "Visually exploring gigabyte data sets in real time" in
the Communications of the ACM. It is the first CACM article to use the term "Big Data"
(the title of one of the article's sections is "Big Data for Scientific Visualization"). The
article opens with the following statement: "Very powerful computers are a blessing to
many fields of inquiry. They are also a curse; fast computations spew out massive
amounts of data. Where megabyte data sets were once considered large, we now find
data sets from individual simulations in the 300GB range. But understanding the data
resulting from high-end computations is a significant endeavor. As more than one
scientist has put it, it is just plain difficult to look at all the numbers. And as Richard W.
Hamming, mathematician and pioneer computer scientist, pointed out, the purpose of
computing is insight, not numbers."
October 2000 - Peter Lyman and Hal R. Varian at UC Berkeley publish "How Much
Information?" It is the first comprehensive study to quantify, in computer storage terms,
the total amount of new and original information (not counting copies) created in the
world annually and stored in four physical media: paper, film, optical (CDs and DVDs),
and magnetic.
Software Tools
MongoDB is a cross-platform document-oriented database developed by
MongoDB Inc. in 2007. Classified as a NoSQL database, MongoDB eschews the
traditional table-based relational database structure in favor of JSON-like
documents with dynamic schemas (MongoDB calls the format BSON), making
the integration of data in certain types of applications easier and faster.
MongoDB is free and open-source software.
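The "dynamic schemas" idea can be illustrated without a running MongoDB server. Below is a minimal sketch using plain Python dictionaries and the standard json module; the document fields are made-up examples, not MongoDB API calls.

```python
import json

# Two "documents" in the same collection may carry different fields --
# the dynamic-schema idea behind MongoDB's JSON-like (BSON) format.
user_a = {"name": "Ada", "email": "ada@example.com"}
user_b = {"name": "Grace", "languages": ["COBOL", "FORTRAN"], "awards": 2}

collection = [user_a, user_b]

# Each document serializes independently; no shared table schema is required.
for doc in collection:
    print(json.dumps(doc))
```

In a relational table, both rows would need the same columns (with NULLs for missing values); here each document carries only the fields it actually has.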
Hadoop
Open source from Apache [written in Java]
Implementation of the Google White Papers
Now associated with a collection of technologies
Design Choice
Hadoop Example
Very good at storing files
Optimizes use of cheap resources; no RAID needed here
Provides data redundancy
Good at sequential reads
Not so good at high-speed random reads
Cannot update a file; it must be replaced
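The design choices above (replicated blocks, write-once files) can be sketched in a few lines of Python. This is a toy model, not real HDFS; the block size, replication factor, and function names are illustrative assumptions.

```python
# Toy sketch of two Hadoop/HDFS design choices: blocks are replicated
# across nodes for redundancy, and files are write-once -- an "update"
# means writing a replacement file, not modifying in place.

BLOCK_SIZE = 4       # bytes per block (real HDFS defaults to 128 MB)
REPLICATION = 3      # copies kept of each block

def put(namespace, name, data, nodes):
    """Store a file as replicated blocks; refuse in-place updates."""
    if name in namespace:
        raise IOError("file is immutable; write a new file instead")
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = []
    for b, block in enumerate(blocks):
        # Place each block on REPLICATION distinct nodes (round-robin here).
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        placement.append((block, replicas))
    namespace[name] = placement

namespace, nodes = {}, ["node1", "node2", "node3", "node4"]
put(namespace, "log.txt", b"sequential scan data", nodes)
print(len(namespace["log.txt"]))  # 5 blocks of 4 bytes for 20 bytes of data
```

Reading the blocks back in order is cheap (sequential reads), while seeking one arbitrary byte still requires locating and fetching a whole block, which mirrors the "good at sequential, weak at random reads" trade-off.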
Big Data
The processing of massive data sets that facilitates
real-time, data-driven decision-making
Digital data grows by 2.5 quintillion (2.5×10^18)
bytes every day
In the physical sciences, kilo (k) stands for 1000 units. For
example: 1 km = 1000 meters (m)
1 kg = 1000 grams (g)
However, in the computer field, when we refer to a kilobyte
(KB), the K = 1024 bytes.
Larger information units and their names follow the same pattern,
up to the zettabyte (ZB, roughly 10^21 bytes).
Source (https://en.wikipedia.org/wiki/Units_of_information)
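The binary units described above can be generated mechanically, since each step is another factor of 1024. A short sketch:

```python
# Binary information units: each name is one more factor of 1024.
names = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
units = {name: 1024 ** (i + 1) for i, name in enumerate(names)}

print(units["KB"])  # 1024 bytes
print(units["ZB"])  # 1024**7 bytes, roughly 1.18 * 10**21
```

Note that 1024^7 is about 1.18×10^21, which is why the zettabyte is loosely described as 10^21 bytes.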
Disease Outbreak
Facial Recognition
Traffic
Validity
Validity refers to whether the data being stored and
mined is clean and meaningful for the problems or
decisions that need to be made.
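In practice, validity is enforced by filtering records before mining them. The sketch below shows one way to do that; the field names and rules are hypothetical examples, not part of the lecture.

```python
# A minimal validity filter: reject records that are not clean and
# meaningful before they reach any analysis. The fields and thresholds
# below are hypothetical examples.

def is_valid(record):
    """Reject records with missing, malformed, or out-of-range values."""
    age = record.get("age")
    if age is None or not (0 <= age <= 120):
        return False
    if "@" not in record.get("email", ""):
        return False
    return True

records = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "b@example.com"},   # out-of-range age
    {"age": 40, "email": "not-an-email"},    # malformed email
]
clean = [r for r in records if is_valid(r)]
print(len(clean))  # 1 valid record survives
```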
In Summary
Customer Recommendations
Streaming Routing
Online ad targeting
Predictive Analytics
Risk management
Crop Forecasts
Self Quantification
Tools in Data Science
Algorithms
What it really is all about
In Big Data, the common association is with
predictive algorithms:
Classifying
Searching
Predictive learning via rule-fit ensembles
Streaming
Filtering
Deterministic-behavior algorithms
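As a concrete instance of the "classifying" category above, here is a tiny 1-nearest-neighbor classifier in plain Python. The data points and labels are made up for illustration; this is a sketch, not a reference implementation.

```python
import math

# 1-nearest-neighbor: label a point with the label of its closest
# training example (Euclidean distance).

def classify(point, training):
    """Return the label of the closest training example."""
    _, label = min((math.dist(point, x), label) for x, label in training)
    return label

training = [((1.0, 1.0), "small"), ((8.0, 9.0), "large"), ((9.0, 8.0), "large")]
print(classify((2.0, 1.5), training))  # small
print(classify((7.5, 8.5), training))  # large
```

Even this trivial classifier shows the predictive pattern common to Big Data algorithms: learn from labeled examples, then assign labels to unseen data.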
Summary
From scientific discovery to business intelligence,
"Big Data" is changing our world
Big Data permeates most (all?) areas of computer
science
Opening the doors to lots of opportunities in the
computing sciences
It's not just for Data Scientists.