
FDS - Module 1

Introduction to Data Science


Basic terminology
• Data: A collection of information in either an organized or unorganized format
o Organized data: This refers to data that is sorted into a row/column
structure, where every row represents a single observation and the columns
represent the characteristics of that observation.
o Unorganized data: This is data that exists in free form, usually text or raw
audio/signals, and must be parsed further to become organized.
• Data Science:
o Data science is the art and science of acquiring knowledge through data.
o Data science is all about how we take data, use it to acquire knowledge, and
then use that knowledge to do the following:
▪ Make decisions
▪ Predict the future
▪ Understand the past/present
▪ Create new industries/products
• Datafication: It is the process of taking all aspects of life and turning them into
data. Examples:
o Google’s augmented-reality glasses datafy the gaze
o Twitter datafies stray thoughts
o LinkedIn datafies professional networks

Why Data Science?


• The sheer volume of data makes it literally impossible for a human to parse it in a
reasonable time.
• Data is collected in various forms and from different sources, and often arrives in a
very unorganized state.
• Data can be missing, incomplete, or just flat out wrong.
• Often, we have data on very different scales and that makes it tough to compare
it.
• Once we clean our data, the relationships between the data become more
obvious, and the knowledge that was once buried deep in millions of rows of data
simply pops out.
• One of the main goals of data science is to make explicit practices and procedures
to discover and apply these relationships in the data.

The Data Science Venn diagram


Understanding data science begins with three basic areas:
• Math/statistics: This is the use of equations and formulas to perform analysis
• Computer programming: This is the ability to use code to create outcomes on the
computer
• Domain knowledge: This refers to understanding the problem domain (medicine,
finance, social science, and so on)
The following Venn diagram provides a visual representation of how the three areas
of data science intersect.

• Those with hacking skills can conceptualize and program complicated algorithms
using computer languages.
• Having a Math & Statistics Knowledge base allows you to theorize and evaluate
algorithms and tweak the existing procedures to fit specific situations.
• Having Substantive Expertise (domain expertise) allows you to apply concepts and
results in a meaningful and effective way.
• Data Science is the intersection of the three key areas mentioned earlier.
• In order to gain knowledge from data, we must be able to
o utilize computer programming to access the data
o understand the mathematics behind the models we derive
o understand our analyses' place in the domain we are in.
• This includes the presentation of data. If we are creating a model to predict heart
attacks in patients, is it better to create a PDF of information or an app where you
can type in numbers and get a quick prediction?
• All these decisions must be made by the data scientist.

Tools for Data Science


• Data science can be done in many languages.
• Python, Julia, and R are some of the many languages available to us.
Why Python?
• Python is an extremely simple language to read and write
• It is one of the most common languages, both in production and academic setting
• The language's online community is vast and friendly. This means that a quick
Google search should yield multiple results of people who have faced and solved
similar (if not exactly the same) situations
• Python has prebuilt data science modules that both the novice and the veteran
data scientist can utilize
Some of these modules are as follows:
• pandas
• numpy/scipy
• scikit-learn
• requests (to mine data from the Web)
• seaborn
• BeautifulSoup (for parsing HTML from the Web)
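As a quick illustration, a typical data science script begins by importing these modules. This is a minimal sketch, assuming the packages have been installed (e.g., via pip); the specific scikit-learn model imported here is only an example:

    import pandas as pd                                   # tabular data handling
    import numpy as np                                    # numerical arrays
    from scipy import stats                               # statistical routines
    from sklearn.linear_model import LinearRegression     # one of scikit-learn's many models
    import seaborn as sns                                 # statistical plotting
    import requests                                       # mining data from the Web
    from bs4 import BeautifulSoup                         # parsing HTML pulled from the Web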

Data Science Life Cycle


This life cycle has five steps:
1. Problem Definition
2. Data Investigation and Cleaning
3. Minimal Viable Model
4. Deployment and Enhancements
5. Data Science Ops
1. Problem Definition
The project lead or product manager manages this phase. This initial phase should:
• State clearly the problem to be solved and why
• Identify the project risks including ethical considerations
• Identify the key stakeholders
• Align the stakeholders with the data science team
• Assess the resources (people and infrastructure) you’ll likely need
• Develop and communicate a high-level, flexible project plan

2. Data Investigation and Cleaning


➢ The team needs to identify what data is needed to solve the underlying problem.
Then determine how to get the data.
➢ Once the data is available, start exploring it. The data scientists or business/data
analysts will lead several activities such as:
• Document the data quality
• Clean the data
• Combine various data sets to create new views
• Load the data into the target location (often to a cloud platform)
• Visualize the data
• Present initial findings to stakeholders and solicit feedback
3. Minimal Viable Model
➢ The minimal viable model is the version of a new model which allows a team to
collect the maximum amount of validated learning about the model’s
effectiveness with the least effort
• Minimal: The model is narrowly focused. It is not the best possible model but is
sufficient to make a measurable impact on a subset of the overall problem.
• Collect the maximum amount of validated learning about the model’s
effectiveness: Develop a hypothesis and test it. This validated learning confirms
or denies your team’s initial hypotheses.
• Least effort: Full-fledged/complete deployments are typically costly and time-
consuming. Therefore, find the simplest way to get the model out.

4. Deployment and Enhancements


• Deployment:
o This step could be as simple as getting your model output in a Tableau
dashboard or as complex as scaling it to the cloud to millions of users.
o Typically, the more “engineering-focused” team members such as data
engineers, cloud engineers, ML engineers, application developers, and
quality assurance engineers execute this phase.
• Enhancements: Use the time that the engineers need to deliver the model to
improve the models. Conceptually, this “Enhancements” phase means to:
o Extend the model to similar use cases (i.e., a new “Problem Definition” phase)
o Add and clean data sets (i.e., a new “Data Investigation and Cleaning” phase)
o Try new modelling techniques (i.e., developing the next “Viable Model”)

5. Data Science Ops


➢ As data science matures into mainstream operations, companies need to take a
stronger product focus that includes plans to maintain the deployed systems over the
long term. There are three major overlapping facets of management to this:
• Software Management: A productized data science solution needs to be
maintained. Common practices include:
o Maintaining the various system environments
o Managing access control
o Triggering alert notifications for serious incidents
o Meeting service level agreements (SLAs)
o Implementing security patches
• Model and Data Management: Data science product operations have additional
considerations beyond standard software product maintenance:
o Monitor the Data
o Monitor Model Performance
o Run Alpha / Beta Tests
o Ensure Proper Model Governance
• On-going Stakeholder Management:
o Continue to educate your stakeholders and set expectations that the
model isn’t magic.
o To drive adoption, communicate realistic benefits and if needed, provide
training to end users.
o Likewise, warn stakeholders of the risks and shortcomings of the models
and how to mitigate these.
Algorithms for Data Science
The three classes of algorithms one should be aware of with respect to data science are:
➢ Data munging (manipulation), preparation, and processing algorithms, such as
sorting, MapReduce, or Pregel.
➢ Optimization algorithms for parameter estimation, including Stochastic Gradient
Descent, Newton’s Method, and Least Squares.
➢ Machine learning algorithms
Machine Learning Algorithms for Data Science
There are some broad generalizations to consider:
• Interpreting parameters:
o Statisticians think of the parameters in their linear regression models as
having real-world interpretations.
o They also typically want to be able to find meaning in behavior or describe
the real-world phenomenon corresponding to those parameters.
o A software engineer or computer scientist, on the other hand, might want to
build their linear regression algorithm into production-level code.
o The predictive model they build is what is known as a black-box algorithm, as
they don’t generally focus on the interpretation of the parameters.
• Confidence intervals:
o Statisticians provide confidence intervals and posterior distributions for
parameters and estimators, and are interested in capturing the variability or
uncertainty of the parameters.
o Many machine learning algorithms, such as k-means or k-nearest neighbours
don’t have a notion of confidence intervals or uncertainty.
• The role of explicit assumptions:
o Statistical models make explicit assumptions about data generating processes
and distributions, and you use the data to estimate parameters.
o Nonparametric solutions don’t make explicit assumptions about probability
distributions, or the assumptions are left implicit.
Linear Regression - Supervised ML Algorithm
• It’s used when you want to express the mathematical relationship between two
variables or attributes.
• When we use it, we are making the assumption that there is a linear relationship
between an outcome variable (aka response variable, dependent variable or label)
and a predictor (aka an independent variable, explanatory variable, or feature)
• Given by the equation, 𝑦 = 𝑓(𝑥) = 𝑚𝑥 + 𝑐
• Linear regression seeks to find the best-fit line that minimizes the sum of the
squares of the vertical distances between the predicted values 𝑦̂ᵢ and the
observed values 𝑦ᵢ.
• This is done to minimize the prediction errors. This method is called least squares
estimation.

Uses of Linear Regression:


• Determining the strength of predictors - Ex: What is the strength of relationship
between sales and marketing expenditure?
• Forecasting an effect - Ex: How much does the dependent variable change with a
change in one or more independent variables?
• Trend Forecasting - Ex: What will be the price of petrol in next 6 months?
When to use Linear Regression?
• When the available data is continuous and purely numeric, e.g., temperature, sales, etc.
• Data quality - best used on datasets with no or very few missing values
Advantages of Linear Regression
• Computationally inexpensive model
• Mathematically simpler; uses simpler equations and is easily understandable
Important Equations
𝑦 = 𝑓(𝑥) = 𝑚𝑥 + 𝑐, where 𝑚 = ∑((𝑥ᵢ − 𝑥̄)(𝑦ᵢ − 𝑦̄)) / ∑(𝑥ᵢ − 𝑥̄)²
Find 𝑐 from the regression line evaluated at the means: 𝑦̄ = 𝑚𝑥̄ + 𝑐.
Then, find each predicted value 𝑦̂ᵢ from the equation 𝑦̂ᵢ = 𝑚𝑥ᵢ + 𝑐.
Lastly, find the RMSE (Root Mean Square Error): 𝑅𝑀𝑆𝐸 = √( ∑(𝑦̂ᵢ − 𝑦ᵢ)² / 𝑛 ),
where 𝑛 is the total number of observations.
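These equations can be checked with a short script. The sketch below is for illustration only; the x and y values are hypothetical:

    import numpy as np

    # Hypothetical data: x = marketing spend, y = sales (illustrative values only).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Slope m = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

    # Intercept c from y_bar = m*x_bar + c
    c = y.mean() - m * x.mean()

    # Predicted values y_hat = m*x + c, and RMSE = sqrt(mean of squared errors)
    y_hat = m * x + c
    rmse = np.sqrt(np.mean((y_hat - y) ** 2))
    print(f"m = {m:.3f}, c = {c:.3f}, RMSE = {rmse:.3f}")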
K-nearest Neighbours (KNN) - Supervised ML Algorithm
• KNN is used in scenarios where a dataset is already classified/labelled in some way
and the aim is to classify/label data that hasn’t been classified/labelled yet.
• In linear regression, the output is a continuous variable. In KNN, the output of your
algorithm is going to be a categorical label.
• The intuition behind k-NN is to consider the most similar other items defined in
terms of their attributes, look at their labels, and give the unassigned item the
majority vote.
• If there’s a tie, randomly select among the labels that have tied for first.
Steps in KNN
1. Initialize 𝑘 with a value, say 3 (generally, an odd number so as to break ties).
2. Let the input to be labelled be (𝑎, 𝑏)
3. The given dataset has input columns (𝑥₁, 𝑥₂) and output column 𝑦.
4. Decide the distance metric (e.g., Euclidean, Manhattan, squared Euclidean), which
will be used to calculate the distance between (𝑎, 𝑏) and each (𝑥₁ᵢ, 𝑥₂ᵢ).
5. Now, rank the distances between (𝑎, 𝑏) and each (𝑥₁ᵢ, 𝑥₂ᵢ).
6. Finally, select the top 𝑘-ranked observations. The majority output is taken as the
output for the given input (𝑎, 𝑏)
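The steps above translate almost directly into code. Below is a minimal from-scratch sketch using Euclidean distance; the dataset, labels, and query point are hypothetical, and ties are simply left to Counter rather than broken randomly:

    import numpy as np
    from collections import Counter

    def knn_predict(X, y, query, k=3):
        # Label the query point (a, b) by majority vote among its k nearest neighbours.
        distances = np.sqrt(((X - query) ** 2).sum(axis=1))   # step 4: Euclidean distances
        nearest = np.argsort(distances)[:k]                   # steps 5-6: rank and take top k
        votes = Counter(y[i] for i in nearest)
        return votes.most_common(1)[0][0]                     # majority label

    # Hypothetical labelled dataset with input columns (x1, x2) and output column y.
    X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
    y = np.array(["A", "A", "A", "B", "B", "B"])
    print(knn_predict(X, y, query=np.array([2, 2]), k=3))     # -> "A"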
Choosing 𝒌
• Usually, the dataset will be divided into two parts: training and testing (as KNN is a
supervised ML algorithm). Train the KNN model using an initial value of 𝑘 and then
validate it using the testing data. Calculate the misclassification rate.
• Run KNN a few times, changing 𝑘, and checking the misclassification rate each
time.
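A minimal sketch of this procedure using scikit-learn (mentioned in the module list earlier); the iris dataset and the candidate values of k are just illustrative stand-ins:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Split a labelled dataset into training and testing parts, then try several
    # values of k and compare the misclassification rate on the test split.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    for k in (1, 3, 5, 7, 9):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        misclassification_rate = 1 - model.score(X_test, y_test)
        print(k, round(misclassification_rate, 3))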
K-Mean Clustering - Unsupervised ML Algorithm
• So far we’ve only seen supervised learning, where we know beforehand what the label
(aka the “right answer”) is, and we’re trying to get our model to be as accurate as
possible, as defined by our chosen evaluation metric.
• k-means is an unsupervised learning technique, where the goal of the algorithm is
to determine the definition of the right answer by finding clusters of data.
• Clustering is the process of dividing the dataset into multiple groups/clusters
consisting of similar data points. Here, k represents the number of clusters.
• The algorithm can be stated as follows:
o First it selects k number of objects at random from the set of n objects. These k
objects are treated as the centroids or center of gravities of k clusters.
o Each of the remaining objects is assigned to its closest centroid; the collection of
objects assigned to a centroid is called a cluster.
o Next, the centroid of each cluster is updated/recalculated (by calculating the
mean values of the attributes of the objects in that cluster).
o The assignment and update steps are repeated until some stopping criterion is
reached (a maximum number of iterations, or the centroids remaining unchanged).
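A minimal from-scratch sketch of these steps, assuming numeric data in a NumPy array; the sample points and the choice of k are purely illustrative:

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: pick k objects at random as the initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: assign every object to its closest centroid.
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Step 3: recalculate each centroid as the mean of its cluster
            # (assumes no cluster ends up empty, which holds for this tiny example).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Step 4: stop when the centroids no longer change.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    # Hypothetical 2-D points forming two loose groups, for illustration only.
    X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
    labels, centroids = k_means(X, k=2)
    print(labels)
    print(centroids)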
Disadvantages with K-Means Clustering
• Choosing k is more an art than a science, although there are bounds: 1 ≤ k ≤ n,
where n is number of data points.
• Convergence issues: The process of finding the clusters may not converge.
Applications of Data Science
• Fraud and Risk Detection
• Healthcare
• Internet Search
• Targeted Advertising
• Website Recommendations
• Advanced Image Recognition
• Speech Recognition
• Airline Route Planning
Fraud and Risk Detection
• The earliest applications of data science were in finance. Companies were fed up
with the bad debts and losses they incurred every year.
• However, they had a lot of data that used to be collected during the initial
paperwork while sanctioning loans.
• Over the years, banking companies learned to divide and conquer data via
customer profiling, past expenditures, and other essential variables to analyze the
probabilities of risk and default.
• Moreover, it also helped them to push their banking products based on customer’s
purchasing power.

Healthcare
• Medical Image Analysis
o Procedures such as detecting tumors, artery stenosis, and organ delineation
employ various methods and frameworks, like MapReduce, to find optimal
parameters for tasks such as lung texture classification.
o They apply machine learning methods such as support vector machines (SVM)
and content-based medical image indexing for solid texture classification.
• Genetics & Genomics
o The goal is to understand the impact of DNA on our health and find individual
biological connections between genetics, diseases, and drug response.
o Data science techniques allow integration of different kinds of data with
genomic data in the disease research, which provides a deeper
understanding of genetic issues in reactions to particular drugs and diseases.
• Drug Development
o Data science applications and machine learning algorithms simplify and
shorten the drug discovery process.
o Such algorithms can forecast how the compound will act in the body using
advanced mathematical modelling and simulations.
Internet Search
• All these search engines make use of data science algorithms to deliver the best
results for our search queries in a fraction of a second.
Targeted Advertising
• Starting from the display banners on various websites to the digital billboards at
the airports – almost all of them are decided by using data science algorithms.
• This is the reason why digital ads have been able to achieve a much higher CTR
(Click-Through Rate) than traditional advertisements.
• They can be targeted based on a user’s past behavior.
Website Recommendations
• Product recommendations not only help you find relevant products among the billions
of products available but also add a lot to the user experience.
• Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDB and many
more use this system to improve the user experience.
• The recommendations are made based on a user’s previous search results.
Advanced Image Recognition
• When you upload a photo with friends on Facebook, you start getting suggestions
to tag your friends. This automatic tag-suggestion feature uses a face recognition
algorithm.
• In addition, Google provides you with the option to search for images by uploading
them. It uses image recognition and provides related search results.
Airline Route Planning
• Airline companies are struggling to maintain their occupancy ratio and operating
profits
• Airline companies started using data science to identify the strategic areas of
improvements. Now using data science, the airline companies can:
o Predict flight delay
o Decide which class of airplanes to buy
o Whether to directly land at the destination or take a halt in between
o Effectively drive customer loyalty programs
Speech Recognition
• Using the speech-recognition feature, you can simply speak out a message and it will
be converted to text.
• Speech recognition products such as Google Voice, Siri, and Cortana use data-science-
powered algorithms for this purpose.
Types of Data
Structured vs Unstructured Data
• Form: Structured data can be thought of as observations and characteristics;
unstructured data exists as a free entity.
• Organization: Structured data is usually organized using a table method (rows and
columns); unstructured data does not follow any standard organizational hierarchy.
• Storage: Structured data takes less storage; unstructured data takes more storage.
• Analysis: Structured data is much easier to work with and analyze; unstructured data
is much harder to work with and analyze.
• Share of data: Structured data contributes only around 10-20% of the total data
available; most data (80-90%) in the world exists in unstructured form.
• Search: Structured data is easier to search; unstructured data is harder to search.
• Content: Structured data usually consists only of textual data; unstructured data
exists in various forms like images, audio, text, etc.
• Where it resides: Structured data usually resides in relational DBs and data
warehouses; unstructured data usually resides in NoSQL DBs, data lakes and
warehouses, etc.
• Examples: Structured - dates, phone numbers, Aadhaar numbers, addresses, names,
etc.; Unstructured - server logs, Twitter and FB posts, emails, images, audio, video files.
• Typical applications: Structured - reservation systems, inventory management
systems, CRM systems, ERP systems; Unstructured - word/text processing software,
email clients, audio/video processing software, social media sites.
• Pre-processing: Structured data does not require pre-processing; unstructured data
requires pre-processing.
Quantitative vs Qualitative Data
• Description: Quantitative data can be described using numbers; qualitative data
cannot be described using numbers.
• Mathematical operations: Basic mathematical operations can be performed on
quantitative data; they cannot be performed on qualitative data.
• Representation: Quantitative data cannot be described using text; qualitative data
is described using text/language.
• Statistical analysis: Statistical analysis is easy to perform on quantitative data;
it is harder to perform on qualitative data.
• Sources: Quantitative data is collected/generated from surveys, scientific
observations, etc.; qualitative data is collected/generated from interviews, textual
descriptions, documents, etc.
• Examples: Quantitative - exam scores, weight, shoe size, age (typically any number);
Qualitative - names, colours, zip codes, adjectives (typically any word).
Note: Though zip-code is numeric, it is qualitative as no meaningful mathematical
operations can be performed on it.
• For a quantitative column, you may ask questions such as the following:
o What is the average value?
o Does this quantity increase or decrease over time (if time is a factor)?
o Is there a threshold above which, or below which, this number would
signal trouble for the company?
• For a qualitative column, none of the preceding questions can be answered;
however, the following questions only apply to qualitative values:
o Which value occurs the most and the least?
o How many unique values are there?
o What are these unique values?
• Quantitative data can be further categorized into:
o Discrete quantities:
o Continuous quantities
Discrete quantities:
• This describes data that is counted. It can only take on certain values.
• Ex: dice roll, because it can only take on six values, and the number of customers
in a café, because you can't have a real range of people (i.e., the number of people
cannot be a value like 48.67).
Continuous quantities
• This describes data that is measured. It exists on an infinite range of values.
• Examples:
o A good example of continuous data would be a person's weight because it
can be 150 pounds or 197.66 pounds (note the decimals).
o The height of a person or building is a continuous number because an infinite
scale of decimals is possible.
o Other examples of continuous data would be time and temperature.
The 4 Levels of Data
It is generally understood that a specific characteristic (feature/column) of
structured data can be broken down into one of four levels of data. The levels are:
1. Nominal Level
2. Ordinal Level
3. Interval Level
4. Ratio Level
Measures of Centre: A measure of center is a number that describes what the data
tends to. It is sometimes referred to as the balance point of the data. Common
examples include the mean, median, and mode.
Nominal Level
• The nominal level (which also sounds like the word name) consists of data that is
described purely by name or category.
• Basic examples include gender, nationality, species etc.
• They are not described by numbers and are therefore qualitative. Examples:
o A type of animal is on the nominal level of data. If you’re a human, then you
are also a mammal.
o A part of speech is also considered on the nominal level of data. The word
she is a pronoun, and it is also a noun.
• Being qualitative, we cannot perform any quantitative mathematical operations,
such as addition or division.
• Data at the nominal level is mostly categorical in nature.
• We cannot perform mathematics on the nominal level of data except the basic
equality and set membership functions, as shown in the following two examples:
o Being a tech entrepreneur is the same as being in the tech industry, but not
vice versa
o A figure described as a square falls under the description of being a rectangle,
but not vice versa
• Measures of Centre used for Nominal Data: In order to find the center of nominal
data, we generally turn to the mode (the most common element) of the dataset.
Ex: In an employee dataset, the most common city was Bangalore, making that a
possible choice for the center of the city column.
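As a small illustration, a hypothetical pandas Series standing in for the employee dataset's city column:

    import pandas as pd

    # Hypothetical nominal data: the city column of an employee dataset.
    city = pd.Series(["Bangalore", "Chennai", "Bangalore", "Mumbai", "Bangalore"])
    print(city.mode()[0])        # -> "Bangalore", the most common category
    print(city.value_counts())   # frequency of each category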
Ordinal Level
• In the Nominal Level, we could not order the observations in any natural way.
• Data in the ordinal level provides us with a rank order, or the means to place one
observation before the other.
• However, it does not provide us with relative differences between observations,
meaning that while we may order the observations from first to last, we cannot
add or subtract them to get any real meaning.
• Example: Satisfaction scale from 1 to 10. Your answer, which must fall between 1
and 10, can be ordered: 8 is better than 7 while 3 is worse than 9.
• However, differences between the numbers do not make much sense. The
difference between a 7 and a 6 might be different than the difference between a 2
and a 1.
• At the ordinal level, we inherit all the mathematics from the nominal level (equality
and set membership), and we can also add the following to the list of allowed
operations:
o Ordering: refers to the natural order provided to us by the data
o Comparison:
▪ At the nominal level, it would not make sense to say that one country
was naturally better than another or that one part of speech is worse
than another.
▪ At the ordinal level, we can make these comparisons. For example, we
can talk about how putting a "7" on a survey is worse than putting a
"10".
• Measures of Centre used for Ordinal Data: At the ordinal level, the median is
usually an appropriate way of defining the center of the data.
• Ex: Imagine you have conducted a survey among your employees asking "how
happy are you to be working here on a scale from 1-5", and your results are as
follows: 1, 4, 4, 5, 2, 1, 3, 2.
• Though we may think the mean would also work, it is not mathematically viable,
because if we subtract/add two scores, say a score of 4 minus a score of 2, the
resulting difference of 2 does not really mean anything.
• If addition/subtraction among the scores doesn't make sense, the mean won't
make sense either.
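A quick check of the survey example above, using Python's statistics module:

    from statistics import median

    scores = [1, 4, 4, 5, 2, 1, 3, 2]   # the survey responses from the example above
    print(sorted(scores))               # -> [1, 1, 2, 2, 3, 4, 4, 5]
    print(median(scores))               # -> 2.5, the middle of the ordered scores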
Interval Level
• At the interval level, we are beginning to look at data that can be expressed
through very quantifiable means, and where much more complicated
mathematical formulas are allowed.
• The basic difference between the ordinal level and the interval level is, well, just
that—difference. Data at the interval level allows meaningful subtraction between
data points.
• Ex: Temperature - If it is 100 °F in Texas and 80 °F in Istanbul, Turkey, then Texas is
20 degrees warmer than Istanbul.
• Note: The 1-5 survey example used for the ordinal level doesn’t fit the interval level.
The difference between the scores (when you subtract them) does not make
sense; therefore, this data cannot be placed at the interval level.
• We can use all the operations allowed on the lower levels (ordering, comparisons,
equality and set membership), along with two other notable operations:
o Addition
o Subtraction
• Measures of Centre used for Interval Data: At this level, we can use the median
and mode to describe this data; however, usually the most accurate description of
the center of data would be the arithmetic mean.
• Measures of variation:
o The measures that describe how ‘spread out’ the dataset is are called measures
of variation.
o Measures of variation give us a very clear picture of how spread out or
dispersed our data is.
o Example: Standard Deviation
• Standard Deviation: It can be thought of as the "average distance of a data point
from the mean". The steps for calculating the SD are:
1. Find the mean of the data.
2. For each number in the dataset, subtract the mean from it and then square the result.
3. Find the average of these squared differences.
4. Take the square root of the number obtained in step three. This is the standard
deviation.
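A minimal sketch of these four steps; the data values are hypothetical:

    import math

    # Hypothetical interval-level data (e.g., a week of temperature readings).
    data = [98, 102, 100, 97, 103]

    mean = sum(data) / len(data)                          # step 1: the mean
    squared_diffs = [(x - mean) ** 2 for x in data]       # step 2: squared differences from the mean
    variance = sum(squared_diffs) / len(squared_diffs)    # step 3: average of the squared differences
    std_dev = math.sqrt(variance)                         # step 4: square root
    print(round(std_dev, 3))                              # -> 2.28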
Ratio Level
• Data at the interval level does not have a "natural starting point or a natural zero".
Ex: being at zero degrees Celsius does not mean that you have "no temperature".
• Not only can we define order and difference, the ratio level allows us to multiply
and divide as well. Examples:
o While Fahrenheit and Celsius are stuck in the interval level, the Kelvin scale of
temperature boasts a natural zero. A measurement of zero Kelvin literally
means the absence of heat. We can actually scientifically say that 200 Kelvin
is twice as much heat as 100 Kelvin.
o Money in the bank is at the ratio level. You can have "no money in the bank"
and it also makes sense that $200,000 is "twice as much as" $100,000.
• Note: Though Celsius and Fahrenheit can be converted to Kelvin and zero Kelvin
has a Celsius and Fahrenheit equivalent, the equivalent is negative and not zero
and hence can’t be considered as data in Ratio Level.
• Measures of Centre used for Ratio Level Data: The arithmetic mean still holds
meaning at this level, as does a new type of mean called the geometric mean. This
measure is generally not used as much, even at the ratio level, but is worth
mentioning. It is the nth root of the product of n values (for two values, the square
root of their product); see the short sketch after this list.
• Disadvantages:
o Data at the ratio level is usually non-negative.
o For this reason alone, many data scientists prefer the interval level to the
ratio level.
o The reason for this restrictive property is that if we allowed negative
values, the ratio might not always make sense.
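A short sketch of the geometric mean, using Python's statistics module (available in Python 3.8+); the values are hypothetical:

    from statistics import geometric_mean

    # Hypothetical ratio-level values (e.g., two yearly growth factors).
    values = [2.0, 8.0]
    print(geometric_mean(values))   # -> 4.0, the square root of 2 * 8 = 16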
Exercise on 4 Levels of Data
a) The teacher of a class of 3rd graders records the height of each student - Ratio
b) The teacher of a class records the eye color of each student - Nominal
c) The teacher records the letter grade for mathematics for each student - Ordinal
d) The teacher records the percentage that each student got on the last test - Ratio
e) A person compiles a list of temperatures in degree Celsius for a month - Interval
f) A person compiles a list of temperatures in degree Fahrenheit - Interval
g) A person compiles a list of temperatures in Kelvin for a month - Ratio
h) A film critic lists the top 50 greatest movies of all time - Ordinal
i) A car magazine lists the prices of the most expensive cars - Ratio
j) The captain of a team lists the jersey numbers for each of the players - Nominal
k) A local animal shelter keeps track of the breeds of dogs that come in - Nominal
l) A local animal shelter keeps track of the weights of dogs that come in - Ratio
m) Calendar year is an example of what scale of measurement? - Interval
n) Number of calories in a small pack of Yogurt - Ratio
o) Shades of lipstick available in a store, Blood type - Nominal
p) Your age, Hourly wages - Ratio
q) Arranging the shirt sizes as small, medium and large - Ordinal
r) Pain scale in a doctor's office - Ordinal
s) Sequence of math classes: Algebra 1, Geometry, Algebra 2A, Statistics - Ordinal

Offers                                          Nominal   Ordinal   Interval   Ratio
Mode                                            Yes       Yes       Yes        Yes
The sequence of variables is established        –         Yes       Yes        Yes
Median                                          –         Yes       Yes        Yes
Mean                                            –         –         Yes        Yes
Difference between variables can be evaluated   –         –         Yes        Yes
Addition and subtraction of variables           –         –         Yes        Yes
Multiplication and division of variables        –         –         –          Yes
Absolute zero                                   –         –         –          Yes


What is Machine Learning?
• Machine Learning is the study of algorithms that
o improve the performance P
o at some task T
o with experience E
• A well-defined learning task is given by <P, T, E>

Types of Learning
• Supervised (inductive) learning: Input - training data + desired outputs (labels)
• Unsupervised learning: Input - training data (without desired outputs)
• Semi-supervised learning: Input - training data + a few desired outputs
• Reinforcement learning: Rewards from sequence of actions

Use Cases of Machine Learning Algorithms


• Recognizing patterns:
o Facial identities or facial expressions
o Handwritten or spoken words
o Medical images
• Generating patterns: Generating images or motion sequences
• Recognizing anomalies:
o Unusual credit card transactions
o Unusual patterns of sensor readings in a nuclear power plant
• Prediction: Future stock prices or currency exchange rates
