Unit 1 - Data Science Fundamentals
Definition of Data Science
Sources:
• Experiments
• Surveys
• Interviews
• Questionnaires
Sources (data that already exists / is published):
• Books/newspapers
• Web information
• Government reports/published census
• Research articles
Raw Data
• It is also called source data or atomic data.
• It is unprocessed data which is hard to parse or analyze.
• The processing of raw data may have to be carried out more than once, so a record of the processing steps should be maintained.
Typical Attributes of Raw Data
• Nominal data:
Has no inherent order, e.g. which chocolate do you like (dark or white), gender (female or male), etc.
Quantitative Data
• Refers to numbers; can be measured or ranked.
• Example: (figure omitted; source: mathematics_monster.com)
Types of Quantitative Data
• Discrete data
- Can be counted
- Involves whole numbers
- e.g. number of children in your family
• Continuous data
- Numerical data
- Can take any value within a certain range
- e.g. temperature
(See the short sketch below.)
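The discrete/continuous distinction can be made concrete in code. A minimal Python sketch, with made-up example values:

# Discrete data: countable whole numbers (example values are made up)
children_per_family = [2, 1, 3, 0, 2]
# Continuous data: can take any value within a range (example values are made up)
temperatures_celsius = [21.4, 19.8, 23.05, 22.0]

print(all(isinstance(x, int) for x in children_per_family))     # True: whole numbers only
print(all(isinstance(x, float) for x in temperatures_celsius))  # True: real-valued measurements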
Case Study: Google Transparency Report

Country     Cr_req   Cr_comply   Ud_req   Ud_comply   Hemisphere   Hdi
Argentina   21       100         134      32          Southern     Very high
Australia   10       40          361      73          Southern     Very high
Belgium     6        100         90       67          Northern     Very high
Brazil      224      67          703      82          Southern     High
USA         92       63          5950     93          Northern     Very high
Description of variables
• Country: Identifier variable, indicating the name of the country for which the data are gathered.
• Cr_req: Number of content removal requests made by the respective country.
It is a discrete numerical variable.
• Cr_comply: Percentage of content removal requests that Google has complied with.
It is a continuous numerical variable.
• Ud_req: Number of user data requests made by the country as part of a criminal investigation.
It is a discrete numerical variable.
• Ud_comply: Percentage of user data requests that Google has complied with.
It is a continuous numerical variable.
• Hemisphere: Whether the country is in the southern or northern hemisphere.
It is a nominal categorical variable.
• Hdi: Human Development Index. It combines indicators of life expectancy, educational attainment and income, and is released by the United Nations (UN).
It is an ordinal categorical variable.
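These variable types map directly onto column types. A minimal pandas sketch (re-creating only the small table above, not the official Google dataset) with types matching the descriptions:

import pandas as pd

df = pd.DataFrame({
    "Country":    ["Argentina", "Australia", "Belgium", "Brazil", "USA"],   # identifier
    "Cr_req":     [21, 10, 6, 224, 92],                                     # discrete numerical
    "Cr_comply":  [100.0, 40.0, 100.0, 67.0, 63.0],                         # continuous numerical (%)
    "Ud_req":     [134, 361, 90, 703, 5950],                                # discrete numerical
    "Ud_comply":  [32.0, 73.0, 67.0, 82.0, 93.0],                           # continuous numerical (%)
    "Hemisphere": pd.Categorical(
        ["Southern", "Southern", "Northern", "Southern", "Northern"]),      # nominal categorical
    "Hdi":        pd.Categorical(
        ["Very high", "Very high", "Very high", "High", "Very high"],
        categories=["Low", "Medium", "High", "Very high"], ordered=True),   # ordinal categorical
})
print(df.dtypes)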
Code Book or Metadata
It should contain the following information about the variables:
• Units: whether the unit of a column is Rs in lacs or in thousands.
• Summary choices: whether the mean or the median is used.
• Information about the source of the data: which database was used, whether a structured survey was used, etc.
• A valid link to the database should be given.
• For a structured survey: the population, whether the design was observational or experimental, how the samples were selected, confounding variables, information biases if any, and the mathematical formulation should be mentioned in the code book.
• Example: ..\Codebook-Example.txt
(A small illustrative codebook entry is sketched below.)
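For illustration only, a hypothetical codebook entry for the Cr_comply column of the transparency-report table could be recorded like this in Python; the field names and link are assumptions, not a standard:

codebook_entry = {
    "variable": "Cr_comply",
    "description": "Percentage of content removal requests that Google complied with",
    "units": "percent",
    "summary_choice": "median",                  # whether mean or median is reported
    "source": "Google Transparency Report",      # which database / survey was used
    "source_link": "https://transparencyreport.google.com/",  # link given for illustration
    "design": "observational",                   # observational vs. experimental
}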
Case Study
• “Growth in a Time of Debt”, Reinhart C. and Rogoff K., American Economic Review: Papers and Proceedings, Vol. 100.
• The main finding is that across both advanced countries and emerging markets, high debt/GDP levels (90 percent and above) are associated with notably lower growth outcomes.
• Another economist, Thomas Herndon, got hold of the raw Excel file and metadata and showed that selective exclusion of available data and unconventional weighting of summary statistics led to serious errors.
• Hence the reported relationship between public debt and GDP growth is inaccurate.
• “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff”, Herndon T., Ash M., Pollin R., PERI Working Paper Series, No. 322, April 2013.
• The case study highlights the importance of metadata in data processing and, more importantly, ethics in data science.
Data Science Pipeline
1. Data collection
• All available datasets are gathered from structured/clearly defined data sources (relational databases) and unstructured data sources (emails, social media chats, audio/video mobile data).
• There are various ways of collecting data (see the sketch after this list), such as:
✓ Web scraping (extracting data from websites)
✓ Querying databases (requesting data from databases)
✓ Questionnaires and surveys
✓ Reading from Excel sheets and other documents
• The quality of the collected data determines the quality of the developed solution; the solution is only as good as the data.
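A minimal Python sketch of the collection methods listed above; the database, table, file and URL names are assumptions for illustration:

import sqlite3
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Querying a database (SQLite here; the table name 'transparency' is an assumption)
conn = sqlite3.connect("reports.db")
db_df = pd.read_sql_query("SELECT * FROM transparency", conn)

# Reading from an Excel sheet (file name is an assumption)
xl_df = pd.read_excel("survey_responses.xlsx")

# Web scraping (placeholder URL; always check a site's terms of use before scraping)
html = requests.get("https://example.com/report").text
headings = [h2.get_text(strip=True)
            for h2 in BeautifulSoup(html, "html.parser").find_all("h2")]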
Challenges in extracting the data:
• Real data is rarely in an easy format.
• Sometimes data is available in the form of free text. This data may be interpretable with human intelligence but can be a challenge as an automated task, e.g. a doctor's prescription, which contains the doctor's name, registration number, etc.
• Another challenge is that data may be well organized but in a different format which is difficult to analyze.
• Sometimes data is given in two different formats that have to be combined for joint processing.
Example: MySQL and MongoDB are popular free databases. The data is extracted into a usable format such as CSV, JSON, etc. (see the sketch below). Hence knowledge of the different file formats is also necessary.
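A sketch of pulling the same data out of MySQL and MongoDB and saving it in a common format (CSV/JSON); the connection strings, database, table and collection names are assumptions:

import pandas as pd
from sqlalchemy import create_engine
from pymongo import MongoClient

# MySQL -> CSV (needs a MySQL driver such as pymysql installed)
engine = create_engine("mysql+pymysql://user:password@localhost/reports")
pd.read_sql("SELECT * FROM transparency", engine).to_csv("transparency.csv", index=False)

# MongoDB -> JSON (drop the internal _id field before exporting)
client = MongoClient("mongodb://localhost:27017/")
docs = list(client["reports"]["transparency"].find({}, {"_id": 0}))
pd.DataFrame(docs).to_json("transparency.json", orient="records")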
File formats