
Data Science - AD1102-1

The document provides an introduction to a data science course, covering topics like what data science is, its applications, evolution and basic terminology. It defines data science and discusses how it relates to other fields like artificial intelligence, big data and data engineering.

INTRODUCTION TO DATA SCIENCE
COURSE CODE : AD1102-1
SYLLABUS
EVALUATION SCHEME
DATA SCIENCE/BIG DATA IN THE NEWS…

DATA SCIENCE EVERYWHERE!...

DATA SCIENCE VOCABULARY

WHAT IS DATA SCIENCE?
• “Data science, also known as data-driven science, is an
interdisciplinary field of scientific methods, processes,
algorithms and systems to extract knowledge or insights
from data in various forms, either structured or
unstructured, similar to data mining.”
• “Data science intends to analyze and understand actual
phenomena with ‘data’. In other words, the aim of data science
is to reveal the features or the hidden structure of complicated
natural, human, and social phenomena with data from a
different point of view from the established or traditional theory
and method.”

WHAT IS DATA SCIENCE?
• Fourth paradigm
• “… change of all sciences moving from observational,
to theoretical, to computational and now to the 4th
Paradigm – Data-Intensive Scientific Discovery”

• Technically, data science is defined as the process of
extracting knowledge and insights from complex and large
sets of data by using processes such as data cleaning and
data visualization.
WHAT IS IMPORTANT?

Need to solve a real problem using data…


No applications, no data science.

“Torture the data, and it will confess to anything”

- Ronald Coase, Nobel laureate in Economics
DATA SCIENCE AS A UNIFIER

[Figure: Data Science at the center, unifying the following areas]
• Data Management
• Machine/Statistical Learning
• Visualization
• Mathematical Optimization
• Application Domain Expertise (Humanities, Law, Social Science)
DATA SCIENCE AND BIG DATA
• They are not the “same thing”
• Big data = crude oil
• Big data is about extracting “crude oil”, transporting it in “mega tankers”,
siphoning it through “pipelines”, and storing it in “massive silos”

• Data science is about refining the “crude oil”

Carlos Somohano
Founder, Data Science London
DATA SCIENCE AND ARTIFICIAL INTELLIGENCE

[Figure: diagram relating Data Science, ML/DM/Analytics, and Artificial Intelligence]

“Data science produces insights. Machine learning produces predictions”
DATA SCIENCE APPLICATION EXAMPLES
• Fraud detection
• Investigate fraud patterns in past data
• Early detection is important
• Before damage propagates
• Harder than late detection
• Precision is important
• False positive and false negative are both
bad

• Real-time analytics
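
The point about false positives and false negatives can be made concrete with precision and recall. The sketch below is only an illustration: the labels are hypothetical (not from any real fraud dataset), and `precision_recall` is a helper name introduced here.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means fraud."""
    # True positives: fraud that was flagged
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    # False positives: legitimate transactions that were flagged
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # False negatives: fraud that was missed
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = fraudulent transaction, 0 = legitimate
actual = [1, 0, 0, 1, 1, 0, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

A false positive lowers precision (a legitimate customer is blocked), while a false negative lowers recall (fraud goes undetected), which is why both are bad.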

DATA SCIENCE APPLICATION EXAMPLES
• Recommender systems
• The ability to offer unique
personalized service
• Increase sales, click-through rates,
conversions, …
• Netflix recommender system valued at
$1B per year
• Amazon recommender system drives a
20-35% lift in sales annually

• Collaborative filtering at scale
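
As a rough idea of what collaborative filtering does, the sketch below predicts a rating from the ratings of similar users. It is a toy, in-memory version under stated assumptions: the ratings are hypothetical, similarity is cosine similarity over shared items only, and nothing here addresses the "at scale" part.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: rating}); not data from the slides
ratings = {
    "ana": {"film_a": 5, "film_b": 4, "film_c": 1},
    "ben": {"film_a": 4, "film_b": 5, "film_d": 5},
    "chloe": {"film_a": 1, "film_c": 5, "film_d": 2},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        num += sim * their_ratings[item]
        den += sim
    return num / den if den else None


score = predict("ana", "film_d")
print(score)
```

Because "ana" rates films much more like "ben" than like "chloe", the predicted score for `film_d` ends up closer to ben's rating than to chloe's.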

DATA SCIENCE APPLICATION EXAMPLES
• Predicting why patients are being
readmitted
• Reduce costs
• Improve population health
• Find the “why” behind specific
populations being readmitted
• Data lakes of multiple data sources
• Investigate ties between readmission
and socioeconomic data points, patient
history, genetics, …

DATA SCIENCE APPLICATION EXAMPLES
• “Smart cities”
• Not well-defined
• Generally, refers to using data
and ICT to
• Better plan communities
• Better manage assets
• Reduce costs
• Deploy open data to better
engage with community

DATA SCIENCE APPLICATION EXAMPLES
• Moneyball
• How to build a baseball team on a very
low budget by relying on data
• Sabermetrics: the statistical analysis of
baseball data to objectively evaluate
performance
• Oakland Athletics' 2002 record of 103-59 was
joint best in MLB
• Team salary budget: $40 million
• For comparison: New York Yankees
• Team salary budget: $120 million
EVOLUTION OF DATA SCIENCE
1. Early Roots (1960s - 1990s): Data Science traces its origins to statistics, computer science, and applied
mathematics. The emphasis was on data analysis, but the scale of data was relatively small compared to today's
standards. Techniques like linear regression, hypothesis testing, and data visualization were common.
2. Big Data Era (2000s): The exponential growth of data generated by the internet and technological advancements
in storage and computing power led to the Big Data revolution. Data Science began to focus on processing,
storing, and analyzing large datasets that were beyond the capabilities of traditional tools. Distributed computing
frameworks like Apache Hadoop emerged, followed by data processing engines like Apache Spark.
3. Emergence of Machine Learning (2000s - 2010s): As data continued to grow, traditional statistical methods faced
limitations. Machine Learning, a subfield of Artificial Intelligence, gained prominence. Algorithms like Support
Vector Machines, Decision Trees, and Random Forests became popular. Researchers and practitioners started
exploring deep learning techniques, thanks to the availability of vast datasets and advancements in neural
networks.
4. Interdisciplinary Merging (2010s): The boundaries between fields like data analysis, statistics, and computer
science blurred as Data Science expanded its horizons. Interdisciplinary collaboration became essential to harness
the full potential of data. Domain knowledge became vital for understanding the context and making accurate
predictions.
5. Rise of Data Science Tools and Platforms (2010s): Data Science tools and platforms proliferated, making it easier
for non-experts to work with data. Libraries like Pandas, NumPy, scikit-learn, and TensorFlow provided accessible
APIs for data manipulation, analysis, and machine learning. Cloud-based services also enabled data storage and
processing without significant infrastructure costs.
6. Integration of Data Engineering and DevOps (2010s): The role of Data Engineers became vital in the data pipeline,
working on data collection, storage, and transformation. DevOps principles were also adopted to ensure the
seamless integration and deployment of data solutions into production environments.
7. Ethical Considerations (2010s - Present): As Data Science gained prominence, ethical concerns surrounding data
privacy, bias, and fairness emerged. Data scientists and organizations started focusing on responsible data
practices to mitigate potential harm caused by data-driven decisions.
8. Augmentation with AI and Automation (Present): The integration of AI and automation has allowed Data Science
to become more efficient and productive. AutoML (Automated Machine Learning) tools have emerged, enabling
faster model development and deployment.
9. Future Trends (ongoing): Data Science continues to evolve as technologies like natural language processing,
computer vision, and reinforcement learning progress. Explainable AI, which seeks to make AI models more
transparent and interpretable, is gaining traction to address concerns about AI's "black-box" nature.
Basic Terminology
• Defining "data"
• When we use the word "data", we refer to a collection of information in either an organized or unorganized
format.

• Organized data: This refers to data that is sorted into a row/column structure, where every row
represents a single observation, and the columns represent the characteristics of that observation.
• Unorganized data: This is the type of data that is in the free form, usually text or raw audio/signals
that must be parsed further to become organized.
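
To illustrate the difference, the sketch below parses unorganized free text into organized rows and columns. The order notes and the column names are hypothetical, chosen only for this example.

```python
import re

# Unorganized data: free-form text (hypothetical example)
raw_notes = """
Order 101: Alice bought 2 lattes
Order 102: Bob bought 1 espresso
"""

# Organized data: each row is one observation, each key is a column
pattern = re.compile(r"Order (\d+): (\w+) bought (\d+) (\w+)")
rows = [
    {
        "order_id": int(m.group(1)),
        "customer": m.group(2),
        "quantity": int(m.group(3)),
        "item": m.group(4),
    }
    for m in pattern.finditer(raw_notes)
]

for row in rows:
    print(row)
```

Real unorganized data (audio, long documents) needs far more parsing than one regular expression, but the goal is the same: turn it into rows and columns.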

Data science is the art and science of acquiring knowledge through data.

Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to
do the following:
• Make decisions
• Predict the future
• Understand the past/present
• Create new industries/products
The Data Science Venn Diagram
Understanding data science begins with three basic
areas:
• Math/statistics: This is the use of equations and
formulas to perform analysis.
• Computer programming: This is the ability to use
code to create outcomes on the computer.
• Domain knowledge: This refers to understanding
the problem domain (medicine, finance, social
science, and so on)

❖ Those with hacking skills can conceptualize and
program complicated algorithms using computer
languages.
❖ Having a Math & Statistics Knowledge base
allows you to theorize and evaluate algorithms
and tweak the existing procedures to fit specific
situations.
❖ Having Substantive Expertise (domain expertise)
allows you to apply concepts and results in a
meaningful and effective way.
Types of Data
We will look at the three basic classifications of data:
• Structured vs unstructured (sometimes called organized vs unorganized)
• Quantitative vs Qualitative
• The four levels of data

• Structured (organized) data: This is data that can be thought of as observations and characteristics. It is
usually organized using a table method (rows and columns).
• Unstructured (unorganized) data: This data exists as a free entity and does not follow any standard
organization hierarchy.
• Examples : ?

Quantitative versus Qualitative data


• Quantitative data: This data can be described using numbers, and basic mathematical procedures,
including addition, are possible on the set.
• Qualitative data: This data cannot be described using numbers and basic mathematics. This data is
generally thought of as being described using "natural" categories and language.
Case Study
Name of coffee shop – Qualitative
• The name of a coffee shop is not expressed as a number, and we cannot perform math on the name of the shop.
Revenue – Quantitative
• How much money a cafe brings in can definitely be described using a number. Also, we can do basic operations such
as adding up the revenue for 12 months to get a year's worth of revenue.
Zip code – Qualitative
• A zip code is always represented using numbers, but what makes it qualitative is that it does not fit the second part of
the definition of quantitative—we cannot perform basic mathematical operations on a zip code. If we add together
two zip codes, it is a nonsensical measurement. We don't necessarily get a new zip code and we definitely don't get
"double the zip code".
Average monthly customers – Quantitative
• Again, describing this factor using numbers and addition makes sense. Add up all of your monthly customers and you
get your yearly customers.
Country of coffee origin – Qualitative
• We will assume this is a very small café with coffee from a single origin. This country is described using a name
(Ethiopian, Colombian), and not numbers.
One way to decide whether data is qualitative or quantitative is to ask yourself a few basic
questions about the data characteristics:
• Can you describe it using numbers?
• No? It is qualitative.
• Yes? Move on to next question.
• Does it still make sense after you add them together?
• No? They are qualitative.
• Yes? You probably have quantitative data.
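
The two questions can be written as a small decision function. This is a minimal sketch: `classify` is a name introduced here, and the yes/no answers for the coffee shop case study are entered by hand, since judging whether addition "makes sense" is a human decision.

```python
def classify(described_by_numbers: bool, addition_makes_sense: bool) -> str:
    """Apply the two questions in order and return the data type."""
    if not described_by_numbers:
        return "qualitative"
    if not addition_makes_sense:
        return "qualitative"
    return "quantitative"


# (can it be described using numbers?, does adding values make sense?)
coffee_shop = {
    "name of coffee shop": (False, False),
    "revenue": (True, True),
    "zip code": (True, False),
    "average monthly customers": (True, True),
    "country of coffee origin": (False, False),
}

labels = {column: classify(*answers) for column, answers in coffee_shop.items()}
for column, label in labels.items():
    print(f"{column}: {label}")
```

The zip code row shows why the second question matters: it is written with digits, yet it is still qualitative.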
Quantitative data can be broken down, one step further, into discrete and continuous
quantities.
These can be defined as follows:
• Discrete data: This describes data that is counted. It can only take on certain values.
• Examples of discrete quantitative data include a dice roll, because it can only take on six values,
and the number of customers in a café, because you cannot have a fractional number of people.

• Continuous data: This describes data that is measured. It exists on an infinite range of values.
• A good example of continuous data would be a person's weight because it can be 150 pounds or
197.66 pounds (note the decimals). The height of a person or building is a continuous number
because an infinite scale of decimals is possible. Other examples of continuous data would be
time and temperature.
The four levels of data
It is generally understood that a specific characteristic (feature/column) of
structured data can be broken down into one of four levels of data. The
levels are:
1. The nominal level
2. The ordinal level
3. The interval level
4. The ratio level
The nominal level (which sounds like the word "name") consists of data that is described
purely by name or category.
• Basic examples include gender, nationality, species, or yeast strain in a beer. They are not
described by numbers and are therefore qualitative.

Mathematical operations allowed


We cannot perform mathematics on the nominal level of data except the basic equality and set
membership functions, as shown in the following two examples:
• Being a tech entrepreneur implies being in the tech industry, but not vice versa
• A figure described as a square falls under the description of being a rectangle, but not vice
versa

Measures of center
A measure of center is a number that describes what the data tends to. It is sometimes referred
to as the balance point of the data. Common examples include the mean, median, and mode.
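
For nominal data, the mode is the measure of center that remains meaningful, because categories can be counted but not averaged or ordered. The sketch below uses a hypothetical list of nationalities and also shows the two permitted operations, equality and set membership.

```python
from collections import Counter

# Hypothetical nominal data: categories with no order and no arithmetic
nationalities = ["Indian", "Turkish", "Indian", "Brazilian", "Indian", "Turkish"]

# Mode: the most frequent category
counts = Counter(nationalities)
mode_value, mode_count = counts.most_common(1)[0]

# Equality and set membership are the only operations allowed
same_nationality = nationalities[0] == nationalities[2]
turkish_present = "Turkish" in set(nationalities)

print(mode_value, mode_count, same_nationality, turkish_present)
```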
Data at the ordinal level has a natural order.
• Suppose you visit a restaurant and are recorded under a customer ID. You rate
the service as good or average, and the restaurant keeps similar ratings from
its other customers. Those ratings are ordinal data: any data whose values
follow some sequence or rank order is known as ordinal data.

Mathematical operations allowed
We can do everything allowed at the nominal level, and we add the
following to the list of operations:
• Ordering
• Comparison
Measures of center
• At the ordinal level, the median is usually an appropriate way of defining the
center of the data.
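
A minimal sketch of ordering, comparison, and the median on ordinal data follows. The four-step rating scale and the ratings themselves are hypothetical, and an odd number of ratings is used so that the median is simply the middle label.

```python
# Hypothetical ordered scale, from lowest to highest
scale = ["poor", "average", "good", "excellent"]
rank = {label: position for position, label in enumerate(scale)}

customer_ratings = ["good", "average", "excellent", "good", "poor"]

# Ordering: sort the labels by their position on the scale
ordered = sorted(customer_ratings, key=rank.__getitem__)

# Comparison: "good" ranks above "average"
good_beats_average = rank["good"] > rank["average"]

# Median: the middle label of the ordered list (odd number of ratings)
median_rating = ordered[len(ordered) // 2]

print(ordered, good_beats_average, median_rating)
```

Averaging the labels would not be meaningful here, because the distance between "average" and "good" is not defined.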
Data at the interval level allows meaningful subtraction between data points.
• Example
• Temperature is a great example of data at the interval level. If it is 100 degrees Fahrenheit in Texas
and 80 degrees Fahrenheit in Istanbul, Turkey, then Texas is 20 degrees warmer than Istanbul. Even
this simple subtraction was not possible at the previous levels.

Mathematical operations allowed


• We can use all the operations allowed on the lower levels (ordering, comparisons, and so on), along
with two other notable operations:
• Addition
• Subtraction
Measures of center
• At this level, we can use the median and mode to describe this data; however, usually the most
accurate description of the center of data would be the arithmetic mean, more commonly referred to
as, simply, "the mean".
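
The temperature example can be reproduced with a few lines of code. The Texas and Istanbul values come from the example above; the week of readings used for the mean is hypothetical.

```python
from statistics import mean

# Values from the example: degrees Fahrenheit
temps_f = {"Texas": 100, "Istanbul": 80}

# Subtraction is meaningful at the interval level
difference = temps_f["Texas"] - temps_f["Istanbul"]

# Hypothetical week of readings; the arithmetic mean describes the center
week_f = [98, 100, 103, 99, 100]
average = mean(week_f)

print(f"Texas is {difference} degrees warmer; weekly mean is {average}")
```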
Measures of variation
A measure of variation (like the standard deviation) is a number that attempts to describe how spread out the data
is.
Along with a measure of center, a measure of variation can almost entirely describe a dataset with only two
numbers.
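
As a small illustration of summarizing a dataset with two numbers, the sketch below computes the mean and the standard deviation of a hypothetical sample. It uses the population standard deviation (`statistics.pstdev`); the sample version (`statistics.stdev`) divides by n - 1 instead of n and gives a slightly larger value.

```python
from statistics import mean, pstdev

# Hypothetical interval-level measurements
data = [2, 4, 4, 4, 5, 5, 7, 9]

center = mean(data)    # measure of center
spread = pstdev(data)  # measure of variation (population standard deviation)

print(f"mean={center}, standard deviation={spread}")
```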
DATA SCIENCE PROCESS
HOLISTIC APPROACH TO DATA SCIENCE

[Figure: holistic view of data science]
Core:
• Data Acquisition
• Making Data Trustable & Usable
• Management of Big Data
• Modeling & Analysis
• Dissemination & Visualization
• Data Preservation
Spanning the core: Data Security & Privacy; Ethics, Policy & Social Impact
Surrounding the core: Applications
CORE RESEARCH ISSUES & INTERACTIONS
Making Data Trustable & Usable
• Data cleaning
• Sampling
• Data provenance

Big Data Management
• Data lakes
• Batch & online access
• Platforms

Modelling & Analysis
• Models & methods for data lakes
• Unsupervised classification & AI

Data Visualization & Dissemination
• Visualization for wider audience
• Visualization for data exploration
• Open data technologies

Interactions between the core issues
• DM support for provenance
• Data preparation for big data management
• Cleaning for data analysis
• DM for ML
• ML for DM
• Visual analytics
• …
