Data Science Harvard Lecture 1 PDF
Required Textbooks
Statistics
Database Querying
SQL
Data Warehousing
Regression Analysis
Explanatory versus Predictive Modeling
April 1980 - I.A. Tjomsland gives a talk titled "Where Do We Go From Here?" at
the Fourth IEEE Symposium on Mass Storage Systems, in which he says: "Those
associated with storage devices long ago realized that Parkinson's First Law may
be paraphrased to describe our industry: 'Data expands to fill the space
available.' I believe that large amounts of data are being retained because users
have no way of identifying obsolete data; the penalties for storing obsolete data
are less apparent than are the penalties for discarding potentially useful data."
July 1986 - Hal B. Becker publishes "Can users really absorb data at today's
rates? Tomorrow's?" in Data Communications. Becker estimates that "the
recording density achieved by Gutenberg was approximately 500 symbols
(characters) per cubic inch, 500 times the density of [4,000 B.C. Sumerian] clay
tablets. By the year 2000, semiconductor random access memory should be storing
1.25×10^11 bytes per cubic inch."
1996 - Digital storage becomes more cost-effective for storing data than paper,
according to R.J.T. Morris and B.J. Truskowski in "The Evolution of Storage
Systems," IBM Systems Journal, July 1, 2003.
1997 - Michael Lesk publishes "How much information is there in the world?"
Lesk concludes that "there may be a few thousand petabytes of information all
told; and the production of tape and disk will reach that level by the year 2000. So
in only a few years, (a) we will be able [to] save everything; no information will
have to be thrown out, and (b) the typical piece of information will never be
looked at by a human being."
August 1999 - Steve Bryson, David Kenwright, Michael Cox, David Ellsworth, and
Robert Haimes publish "Visually exploring gigabyte data sets in real time" in
the Communications of the ACM. It is the first CACM article to use the term "Big Data"
(the title of one of the article's sections is "Big Data for Scientific Visualization"). The
article opens with the following statement: "Very powerful computers are a blessing to
many fields of inquiry. They are also a curse; fast computations spew out massive
amounts of data. Where megabyte data sets were once considered large, we now find
data sets from individual simulations in the 300GB range. But understanding the data
resulting from high-end computations is a significant endeavor. As more than one
scientist has put it, it is just plain difficult to look at all the numbers. And as Richard W.
Hamming, mathematician and pioneer computer scientist, pointed out, the purpose of
computing is insight, not numbers."
October 2000 - Peter Lyman and Hal R. Varian at UC Berkeley publish "How Much
Information?" It is the first comprehensive study to quantify, in computer storage terms,
the total amount of new and original information (not counting copies) created in the
world annually and stored in four physical media: paper, film, optical (CDs and DVDs),
and magnetic.
Software Tools
MongoDB is a cross-platform document-oriented database developed by
MongoDB Inc. in 2007. Classified as a NoSQL database, MongoDB eschews the
traditional table-based relational database structure in favor of JSON-like
documents with dynamic schemas (MongoDB calls the format BSON), making
the integration of data in certain types of applications easier and faster.
MongoDB is free and open-source software.
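The "dynamic schemas" idea can be illustrated without a running MongoDB server. Below is a minimal sketch using plain Python dictionaries and the standard json module; the document fields are made-up examples, not MongoDB API calls.

```python
import json

# Two "documents" in the same collection may carry different fields --
# the dynamic-schema idea behind MongoDB's JSON-like (BSON) format.
user_a = {"name": "Ada", "email": "ada@example.com"}
user_b = {"name": "Grace", "languages": ["COBOL", "FORTRAN"], "awards": 2}

collection = [user_a, user_b]

# Each document serializes independently; no shared table schema is required.
for doc in collection:
    print(json.dumps(doc))
```

In a relational table, both rows would need the same columns (with NULLs for missing values); here each document carries only the fields it actually has.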
Hadoop
Open source from Apache [written in Java]
Implementation of the Google White Papers
Now associated with a collection of technologies
Design Choice
Hadoop Example
Very good at storing files
Optimizes use of cheap resources; no RAID needed here
Provides data redundancy
Good at sequential reads
Not so good at high-speed random reads
Cannot update a file; it must be replaced
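The design choices above (replicated blocks, write-once files) can be sketched in a few lines of Python. This is a toy model, not real HDFS; the block size, replication factor, and function names are illustrative assumptions.

```python
# Toy sketch of two Hadoop/HDFS design choices: blocks are replicated
# across nodes for redundancy, and files are write-once -- an "update"
# means writing a replacement file, not modifying in place.

BLOCK_SIZE = 4       # bytes per block (real HDFS defaults to 128 MB)
REPLICATION = 3      # copies kept of each block

def put(namespace, name, data, nodes):
    """Store a file as replicated blocks; refuse in-place updates."""
    if name in namespace:
        raise IOError("file is immutable; write a new file instead")
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = []
    for b, block in enumerate(blocks):
        # Place each block on REPLICATION distinct nodes (round-robin here).
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        placement.append((block, replicas))
    namespace[name] = placement

namespace, nodes = {}, ["node1", "node2", "node3", "node4"]
put(namespace, "log.txt", b"sequential scan data", nodes)
print(len(namespace["log.txt"]))  # 5 blocks of 4 bytes for 20 bytes of data
```

Reading the blocks back in order is cheap (sequential reads), while seeking one arbitrary byte still requires locating and fetching a whole block, which mirrors the "good at sequential, weak at random reads" trade-off.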
Big Data
The processing of massive data sets that facilitates
real-time, data-driven decision-making
Digital data grows by 2.5 quintillion (2.5×10^18)
bytes every day
In the physical sciences, kilo (k) stands for 1000 units. For
example: 1 km = 1000 meters (m)
1 kg = 1000 grams (g)
However, in the computer field, when we refer to a kilobyte
(KB), the K = 1024 bytes.
Larger information units and their names follow the same pattern,
up to the zettabyte (ZB, roughly 10^21 bytes).
Source (https://en.wikipedia.org/wiki/Units_of_information)
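The binary units described above can be generated mechanically, since each step is another factor of 1024. A short sketch:

```python
# Binary information units: each name is one more factor of 1024.
names = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB"]
units = {name: 1024 ** (i + 1) for i, name in enumerate(names)}

print(units["KB"])  # 1024 bytes
print(units["ZB"])  # 1024**7 bytes, roughly 1.18 * 10**21
```

Note that 1024^7 is about 1.18×10^21, which is why the zettabyte is loosely described as 10^21 bytes.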
Disease Outbreak
Facial Recognition
Traffic
Validity
Validity refers to whether the data being stored and
mined is clean and meaningful for the problems or
decisions that need to be made.
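In practice, validity is enforced by filtering records before mining them. The sketch below shows one way to do that; the field names and rules are hypothetical examples, not part of the lecture.

```python
# A minimal validity filter: reject records that are not clean and
# meaningful before they reach any analysis. The fields and thresholds
# below are hypothetical examples.

def is_valid(record):
    """Reject records with missing, malformed, or out-of-range values."""
    age = record.get("age")
    if age is None or not (0 <= age <= 120):
        return False
    if "@" not in record.get("email", ""):
        return False
    return True

records = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "b@example.com"},   # out-of-range age
    {"age": 40, "email": "not-an-email"},    # malformed email
]
clean = [r for r in records if is_valid(r)]
print(len(clean))  # 1 valid record survives
```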
In Summary
Customer Recommendations
Streaming Routing
Online ad targeting
Predictive Analytics
Risk management
Crop Forecasts
Self Quantification
Tools in Data Science
Algorithms
What it really is all about
In Big Data, the common association is with
predictive algorithms:
Classifying
Searching
Predictive learning via rule-fit ensembles
Streaming
Filtering
Deterministic-behavior algorithms
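As a concrete instance of the "classifying" category above, here is a tiny 1-nearest-neighbor classifier in plain Python. The data points and labels are made up for illustration; this is a sketch, not a reference implementation.

```python
import math

# 1-nearest-neighbor: label a point with the label of its closest
# training example (Euclidean distance).

def classify(point, training):
    """Return the label of the closest training example."""
    _, label = min((math.dist(point, x), label) for x, label in training)
    return label

training = [((1.0, 1.0), "small"), ((8.0, 9.0), "large"), ((9.0, 8.0), "large")]
print(classify((2.0, 1.5), training))  # small
print(classify((7.5, 8.5), training))  # large
```

Even this trivial classifier shows the predictive pattern common to Big Data algorithms: learn from labeled examples, then assign labels to unseen data.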
Summary
From scientific discovery to business intelligence,
"Big Data" is changing our world
Big Data permeates most (all?) areas of computer
science
Opening the doors to lots of opportunities in the
computing sciences
It's not just for Data Scientists.