M1 - FDS
• Those with hacking skills can conceptualize and program complicated algorithms
using computer languages.
• Having a Math & Statistics Knowledge base allows you to theorize and evaluate
algorithms and tweak the existing procedures to fit specific situations.
• Having Substantive Expertise (domain expertise) allows you to apply concepts and
results in a meaningful and effective way.
• Data Science is the intersection of the three key areas mentioned earlier.
• In order to gain knowledge from data, we must be able to
o utilize computer programming to access the data
o understand the mathematics behind the models we derive
o understand our analyses' place in the domain we are in.
• This includes the presentation of data. If we are creating a model to predict heart
attacks in patients, is it better to create a PDF of information or an app where you
can type in numbers and get a quick prediction?
• All these decisions must be made by the data scientist.
Healthcare
• Medical Image Analysis
o Tasks such as detecting tumors, detecting artery stenosis, and delineating
organs employ a range of methods and frameworks, such as MapReduce, to find
optimal parameters for tasks like lung texture classification.
o These tasks apply machine learning methods such as support vector machines
(SVM) and content-based medical image indexing for solid texture classification.
• Genetics & Genomics
o The goal is to understand the impact of DNA on our health and find individual
biological connections between genetics, diseases, and drug response.
o Data science techniques allow integration of different kinds of data with
genomic data in the disease research, which provides a deeper
understanding of genetic issues in reactions to particular drugs and diseases.
• Drug Development
o Data science applications and machine learning algorithms simplify and
shorten the drug discovery process.
o Such algorithms can forecast how the compound will act in the body using
advanced mathematical modelling and simulations.
Internet Search
• Search engines make use of data science algorithms to deliver the best
result for a searched query in a fraction of a second.
Targeted Advertising
• Starting from the display banners on various websites to the digital billboards at
the airports – almost all of them are decided by using data science algorithms.
• This is the reason why digital ads have been able to get a much higher CTR
(Click-Through Rate) than traditional advertisements.
• They can be targeted based on a user’s past behavior.
Website Recommendations
• Product recommendations not only help you find relevant products among the
billions available but also add a lot to the user experience.
• Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDB and many
more use this system to improve the user experience.
• The recommendations are made based on previous search results for a user.
Advanced Image Recognition
• When you upload an image with friends on Facebook, you start getting
suggestions to tag your friends. This automatic tag suggestion feature uses a
face recognition algorithm.
• In addition, Google provides you with the option to search for images by uploading
them. It uses image recognition and provides related search results.
Airline Route Planning
• Airline companies struggle to maintain their occupancy ratio and operating
profits.
• Airline companies started using data science to identify strategic areas for
improvement. Using data science, airline companies can now:
o Predict flight delay
o Decide which class of airplanes to buy
o Decide whether to land directly at the destination or make a halt in between
o Effectively drive customer loyalty programs
Speech Recognition
• Using a speech-recognition feature, you can simply speak a message and it will
be converted to text.
• Speech recognition products such as Google Voice, Siri, Cortana etc. use Data
Science powered algorithms for this purpose.
Types of Data
Structured vs Unstructured Data
Structured Data | Unstructured Data
This is data that can be thought of as observations and characteristics. | This data exists as a free entity.
It is usually organized using a table method (rows and columns). | It does not follow any standard organization hierarchy.
It takes less storage. | It takes more storage.
It is much easier to work with and analyze. | It is much harder to work with and analyze.
Structured data contributes only around 10-20% of the total data available. | Most data (80-90%) in the world exists in unstructured form.
It is easier to search. | It is harder to search.
Structured data usually consists of only textual data. | Unstructured data exists in various types like images, audio, text etc.
It usually resides in relational DBs and data warehouses. | It usually resides in NoSQL DBs, data lakes and warehouses etc.
Examples: dates, phone numbers, Aadhar numbers, addresses, names etc. | Examples: server logs, Twitter and FB posts, emails, images, audio, video files
Typical applications: reservation systems, inventory management systems, CRM systems, ERP systems | Typical applications: word/text processing software, email clients, audio/video processing software, social media sites
Data pre-processing is not required. | Data pre-processing is required.
Quantitative vs Qualitative Data
Quantitative Data | Qualitative Data
This data can be described using numbers. | This data cannot be described using numbers.
Basic mathematical operations can be performed. | Mathematical operations cannot be performed.
It cannot be described using text. | It is described using text / language.
Easy to perform statistical analysis | Harder to perform statistical analysis
Collected/generated from surveys, scientific observations etc. | Collected/generated from interviews, textual descriptions, documents etc.
Ex: exam scores, weight, shoe size, age (typically any number) | Ex: names, colours, zip-codes, adjectives (typically any word)
Note: Though zip-code is numeric, it is qualitative as no meaningful mathematical
operations can be performed on it.
• For a quantitative column, you may ask questions such as the following:
o What is the average value?
o Does this quantity increase or decrease over time (if time is a factor)?
o Is there a threshold above or below which this number would signal trouble
for the company?
• For a qualitative column, none of the preceding questions can be answered;
however, the following questions only apply to qualitative values:
o Which value occurs the most and the least?
o How many unique values are there?
o What are these unique values?
• Quantitative data can be further categorized into:
o Discrete quantities
o Continuous quantities
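The column-level questions listed above for quantitative and qualitative data can be sketched in a few lines of Python. This is only an illustration using the standard library; the two columns (`exam_scores`, `zip_codes`) and their values are made up and do not come from any real dataset.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy columns, invented purely for illustration.
exam_scores = [72, 85, 90, 85, 60]                      # quantitative
zip_codes = ["560001", "560034", "560001",
             "110001", "560034", "560001"]              # qualitative

# Quantitative column: an average is meaningful.
average_score = mean(exam_scores)

# Qualitative column: count the values instead of averaging them.
counts = Counter(zip_codes)
most_common_value = counts.most_common(1)[0][0]   # value that occurs the most
least_common_value = min(counts, key=counts.get)  # value that occurs the least
unique_values = sorted(counts)                    # what the unique values are

print("average score:", average_score)
print("most common zip-code:", most_common_value)
print("least common zip-code:", least_common_value)
print("number of unique zip-codes:", len(unique_values))
```

Note that the zip-codes are stored as strings: although they look numeric, averaging them would not produce anything meaningful.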
Discrete quantities
• This describes data that is counted. It can only take on certain values.
• Ex: a dice roll, because it can only take on six values, and the number of
customers in a café, because you cannot have a fractional number of people (i.e.,
the number of people cannot be a value like 48.67).
Continuous quantities
• This describes data that is measured. It exists on an infinite range of values.
• Examples:
o A good example of continuous data would be a person's weight because it
can be 150 pounds or 197.66 pounds (note the decimals).
o The height of a person or building is a continuous number because an infinite
scale of decimals is possible.
o Other examples of continuous data would be time and temperature.
The 4 Levels of Data
It is generally understood that a specific characteristic (feature/column) of
structured data can be broken down into one of four levels of data. The levels are:
1. Nominal Level
2. Ordinal Level
3. Interval Level
4. Ratio Level
Measures of Centre: A measure of center is a number that describes what the data
tends toward. It is sometimes referred to as the balance point of the data.
Common examples include the mean, median, and mode.
Nominal Level
• The nominal level (which also sounds like the word "name") consists of data
that is described purely by name or category.
• Basic examples include gender, nationality, species etc.
• They are not described by numbers and are therefore qualitative. Examples:
o A type of animal is on the nominal level of data. If you’re a human, then you
are also a mammal.
o A part of speech is also considered on the nominal level of data. The word
"she" is a pronoun, and it is also a noun.
• Being qualitative, we cannot perform any quantitative mathematical operations,
such as addition or division.
• Data at the nominal level is mostly categorical in nature.
• We cannot perform mathematics on the nominal level of data except the basic
equality and set membership functions, as shown in the following two examples:
o Being a tech entrepreneur is the same as being in the tech industry, but not
vice versa
o A figure described as a square falls under the description of being a rectangle,
but not vice versa
• Measures of Centre used for Nominal Data: In order to find the center of nominal
data, we generally turn to the mode (the most common element) of the dataset.
Ex: In an employee dataset, the most common city was Bangalore, making that a
possible choice for the center of the city column.
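As a small sketch of the operations available at the nominal level (equality, set membership, and the mode), the snippet below uses Python's standard library. The `cities` list is a hypothetical employee column with made-up values, chosen only to mirror the Bangalore example above.

```python
from statistics import mode

# Hypothetical city column from an employee dataset (values are made up).
cities = ["Bangalore", "Mumbai", "Bangalore", "Chennai", "Bangalore", "Mumbai"]

# Equality and set membership are the only comparisons that make sense here.
same_city = cities[0] == cities[2]
has_delhi = "Delhi" in set(cities)

# The mode (most frequent value) needs nothing more than equality checks,
# which is why it can serve as the center of nominal data.
center = mode(cities)

print(same_city, has_delhi, center)
```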
Ordinal Level
• In the Nominal Level, we could not order the observations in any natural way.
• Data in the ordinal level provides us with a rank order, or the means to place one
observation before the other.
• However, it does not provide us with relative differences between observations,
meaning that while we may order the observations from first to last, we cannot
add or subtract them to get any real meaning.
• Example: Satisfaction scale from 1 to 10. Your answer, which must fall between 1
and 10, can be ordered: 8 is better than 7 while 3 is worse than 9.
• However, differences between the numbers do not make much sense. The
difference between a 7 and a 6 might be different than the difference between a 2
and a 1.
• We inherit all mathematics from the nominal level (equality and set membership)
and we can also add the following to the list of operations allowed in the ordinal
level:
o Ordering: refers to the natural order provided to us by the data
o Comparison:
▪ At the nominal level, it would not make sense to say that one country
was naturally better than another or that one part of speech is worse
than another.
▪ At the ordinal level, we can make these comparisons. For example, we
can talk about how putting a "7" on a survey is worse than putting a
"10".
• Measures of Centre used for Ordinal Data: At the ordinal level, the median is
usually an appropriate way of defining the center of the data.
• Ex: Imagine you have conducted a survey among your employees asking "how
happy are you to be working here on a scale from 1-5", and your results are as
follows: 1, 4, 4, 5, 2, 1, 3, 2.
• Although we may think the mean would also work, it is not mathematically
viable because if we subtract/add two scores, say a score of 4 minus a score of 2,
the difference of two does not really mean anything.
• If addition/subtraction among the scores doesn't make sense, the mean won't
make sense either.
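For the survey results above, the median can be worked out with Python's `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import median, median_high, median_low

# Survey answers from the example above (scale of 1-5).
scores = [1, 4, 4, 5, 2, 1, 3, 2]

ordered = sorted(scores)          # [1, 1, 2, 2, 3, 4, 4, 5]
center = median(scores)           # averages the two middle values, 2 and 3

# With an even number of answers, median() averages the two middle values.
# median_low / median_high instead return an actual survey answer.
low_center = median_low(scores)
high_center = median_high(scores)

print(ordered)
print(center, low_center, high_center)  # 2.5 2 3
```

Because the dataset has an even number of answers, `median` returns 2.5, a value nobody actually chose. If only ordering should be relied on, `median_low` or `median_high` may be the safer choice.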
Interval Level
• At the interval level, we are beginning to look at data that can be expressed
through very quantifiable means, and where much more complicated
mathematical formulas are allowed.
• The basic difference between the ordinal level and the interval level is, well, just
that—difference. Data at the interval level allows meaningful subtraction between
data points.
• Ex: Temperature - If it is 100 °F in Texas and 80 °F in Istanbul, Turkey, then Texas is
20 degrees warmer than Istanbul.
• Note: The 1-5 survey example, used for the ordinal level, doesn’t fit the interval
level. The difference between the scores (when you subtract them) does not make
sense; therefore, this data cannot be considered to be at the interval level.
• We can use all the operations allowed on the lower levels (ordering, comparisons,
equality and set membership), along with two other notable operations:
o Addition
o Subtraction
• Measures of Centre used for Interval Data: At this level, we can use the median
and mode to describe this data; however, usually the most accurate description of
the center of data would be the arithmetic mean.
• Measures of variation:
o The measures that describe how ‘spread out’ the dataset is are called measures
of variation.
o Measures of variation give us a very clear picture of how spread out or
dispersed our data is.
o Example: Standard Deviation
• Standard Deviation: It can be thought of as the "average distance a data point is at
from the mean". The steps for calculating SD are:
1. Find the mean of the data.
2. For each number in the dataset, subtract the mean from it and then square the
result.
3. Find the average of the squared differences.
4. Take the square root of the number obtained in step three. This is the standard
deviation.
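The four steps above can be sketched directly in Python. The `temps` list is hypothetical data made up for illustration. Because step 3 takes a plain average (dividing by the number of values), this computes the population standard deviation, so the result is compared with `statistics.pstdev`.

```python
from math import sqrt
from statistics import pstdev


def standard_deviation(values):
    """Population standard deviation, following the four steps literally."""
    # Step 1: find the mean of the data.
    mean = sum(values) / len(values)
    # Step 2: subtract the mean from each value and square the result.
    squared_diffs = [(x - mean) ** 2 for x in values]
    # Step 3: find the average of the squared differences.
    variance = sum(squared_diffs) / len(squared_diffs)
    # Step 4: take the square root.
    return sqrt(variance)


# Hypothetical temperature readings in degrees Fahrenheit (made up).
temps = [98, 100, 102, 96, 104]

sd = standard_deviation(temps)
print(sd)             # square root of 8, roughly 2.83
print(pstdev(temps))  # the standard library's population SD, for comparison
```

When the data is a sample rather than the whole population, the sum in step 3 is usually divided by n - 1 instead of n; `statistics.stdev` does that.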
Ratio Level
• Data at the interval level does not have a "natural starting point or a natural zero".
Ex: being at zero degrees Celsius does not mean that you have "no temperature".
• Not only can we define order and difference, the ratio level allows us to multiply
and divide as well. Examples:
o While Fahrenheit and Celsius are stuck in the interval level, the Kelvin scale of
temperature boasts a natural zero. A measurement of zero Kelvin literally
means the absence of heat. We can actually scientifically say that 200 Kelvin
is twice as much heat as 100 Kelvin.
o Money in the bank is at the ratio level. You can have "no money in the bank"
and it also makes sense that $200,000 is "twice as much as" $100,000.
• Note: Though Celsius and Fahrenheit can be converted to Kelvin and zero Kelvin
has a Celsius and Fahrenheit equivalent, the equivalent is negative and not zero
and hence can’t be considered as data in Ratio Level.
• Measures of Centre used for Ratio Level Data: The arithmetic mean still holds
meaning at this level, as does a new type of mean called the geometric mean. This
measure is generally not used as much even at the ratio level, but is worth
mentioning. It is the n-th root of the product of all n values (the square root
when there are only two values).
• Disadvantages:
o Data at the ratio level is usually non-negative.
o For this reason alone, many data scientists prefer the interval level to the
ratio level.
o The reason for this restrictive property is that, if we allowed negative
values, the ratio might not always make sense.
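The geometric mean mentioned above can be sketched as follows. The helper name `geo_mean` and the sample values are illustrative assumptions. The sketch assumes all values are positive, in line with the non-negativity of ratio-level data noted above, and needs Python 3.8 or later for `math.prod` and `statistics.geometric_mean`.

```python
from math import prod
from statistics import geometric_mean


def geo_mean(values):
    """n-th root of the product of n positive values."""
    return prod(values) ** (1 / len(values))


# Hypothetical bank balances in dollars (made up for illustration).
balances = [100.0, 400.0]

two_values = geo_mean(balances)      # square root of 100 * 400
three_values = geo_mean([2, 4, 8])   # cube root of 2 * 4 * 8

print(two_values, three_values)
print(geometric_mean(balances))      # standard library version, for comparison
```

With two values the n-th root is simply the square root, which gives about 200 for the balances above. With three values it is the cube root, and so on.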
Exercise on 4 Levels of Data
a) The teacher of a class of 3rd graders records the height of each student - Ratio
b) The teacher of a class records the eye color of each student - Nominal
c) The teacher records the letter grade for mathematics for each student - Ordinal
d) The teacher records the percentage that each student got on the last test - Ratio
e) A person compiles a list of temperatures in degree Celsius for a month - Interval
f) A person compiles a list of temperatures in degree Fahrenheit - Interval
g) A person compiles a list of temperatures in Kelvin for a month - Ratio
h) A film critic lists the top 50 greatest movies of all time - Ordinal
i) A car magazine lists the prices of the most expensive cars - Ratio
j) The captain of a team lists the jersey numbers for each of the players - Nominal
k) A local animal shelter keeps track of the breeds of dogs that come in - Nominal
l) A local animal shelter keeps track of the weights of dogs that come in - Ratio
m) Calendar year is an example of what scale of measurement? - Interval
n) Number of calories in a small pack of Yogurt - Ratio
o) Shades of lipstick available in a store, Blood type - Nominal
p) Your age, Hourly wages - Ratio
q) Arranging the shirt sizes as small, medium and large - Ordinal
r) Pain scale in a doctor's office - Ordinal
s) Sequence of math classes: Algebra 1, Geometry, Algebra 2A, Statistics - Ordinal
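One way to check answers like these is to ask which operations make sense on the data. The sketch below tabulates the operations and measures of center described in the sections above; the dictionary layout and the function name `allowed_operations` are illustrative choices, not a standard API.

```python
# Levels in increasing order; each level inherits the operations below it.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Operations that each level adds, as described in the sections above.
NEW_OPERATIONS = {
    "nominal": ["equality", "set membership"],
    "ordinal": ["ordering", "comparison"],
    "interval": ["addition", "subtraction"],
    "ratio": ["multiplication", "division"],
}

# Measure of center usually recommended at each level.
MEASURE_OF_CENTER = {
    "nominal": "mode",
    "ordinal": "median",
    "interval": "arithmetic mean",
    "ratio": "arithmetic mean (or geometric mean)",
}


def allowed_operations(level):
    """Return every operation permitted at `level`, including inherited ones."""
    index = LEVELS.index(level)
    operations = []
    for lower_level in LEVELS[: index + 1]:
        operations.extend(NEW_OPERATIONS[lower_level])
    return operations


for level in LEVELS:
    print(level, allowed_operations(level), MEASURE_OF_CENTER[level])
```

For example, letter grades (item c) can be ordered but not meaningfully subtracted, so they sit at the ordinal level.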
Types of Learning
• Supervised (inductive) learning: Input - training data + desired outputs (labels)
• Unsupervised learning: Input - training data (without desired outputs)
• Semi-supervised learning: Input - training data + a few desired outputs
• Reinforcement learning: Rewards from sequence of actions