Data Science - AD1102-1
DATA SCIENCE
COURSE CODE : AD1102-1
SYLLABUS
EVALUATION SCHEME
DATA SCIENCE/BIG DATA IN THE NEWS…
DATA SCIENCE EVERYWHERE!...
DATA SCIENCE VOCABULARY
WHAT IS DATA SCIENCE?
• “Data science, also known as data-driven science, is an
interdisciplinary field of scientific methods, processes,
algorithms and systems to extract knowledge or insights
from data in various forms, either structured or
unstructured, similar to data mining.”
• “Data science intends to analyze and understand actual
phenomena with ‘data’. In other words, the aim of data science
is to reveal the features or the hidden structure of complicated
natural, human, and social phenomena with data from a
different point of view from the established or traditional theory
and method.”
WHAT IS DATA SCIENCE?
• Fourth paradigm
• “… change of all sciences moving from observational,
to theoretical, to computational and now to the 4th
Paradigm – Data-Intensive Scientific Discovery”
WHAT IS IMPORTANT?
DATA SCIENCE AS A UNIFIER
[Figure: diagram relating Data Science to Data Management, Machine/Statistical Learning, Visualization, Mathematical Optimization, Application Domain Expertise, Humanities, Law, and Social Science]
DATA SCIENCE AND BIG DATA
• They are not the “same thing”
• Big data = crude oil
• Big data is about extracting “crude oil”, transporting it in “mega tankers”,
siphoning it through “pipelines”, and storing it in “massive silos”
Carlos Somohano
Founder, Data Science London
DATA SCIENCE AND ARTIFICIAL INTELLIGENCE
[Figure: diagram relating Data Science, ML/DM/Analytics, and Artificial Intelligence]
• Real-time analytics
DATA SCIENCE APPLICATION EXAMPLES
• Recommender systems
• The ability to offer unique
personalized service
• Increase sales, click-through rates,
conversions, …
• Netflix recommender system valued at
$1B per year
• Amazon recommender system drives a
20-35% lift in sales annually
DATA SCIENCE APPLICATION EXAMPLES
• Predicting why patients are being
readmitted
• Reduce costs
• Improve population health
• Find the “why” behind specific
populations being readmitted
• Data lakes of multiple data sources
• Investigate ties between readmission
and socioeconomic data points, patient
history, genetics, …
DATA SCIENCE APPLICATION EXAMPLES
• “Smart cities”
• Not well-defined
• Generally, refers to using data
and ICT to
• Better plan communities
• Better manage assets
• Reduce costs
• Deploy open data to better
engage with community
DATA SCIENCE APPLICATION EXAMPLES
• Moneyball
• How to build a baseball team on a very
low budget by relying on data
• Sabermetrics: the statistical analysis of
baseball data to objectively evaluate
performance
• 2002 record of 103-59 was joint best in
MLB
• Team salary budget: $40 million
• Other team: Yankees
• Team salary budget: $120 million
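One rough way to read these figures is payroll spent per win. The sketch below is only an illustration: it assumes both teams finished with 103 wins (the slide calls the record "joint best") and uses the rounded budgets listed above. The label "Moneyball team" and the helper `cost_per_win` are hypothetical names introduced here.

```python
# Figures taken from the slide. The win total is assumed to be the same
# (103) for both teams because the slide describes the record as joint best.
WINS = 103
payroll_musd = {"Moneyball team": 40, "Yankees": 120}  # millions of dollars


def cost_per_win(payroll: float, wins: int) -> float:
    """Return payroll (in millions of dollars) spent per win."""
    return payroll / wins


cost = {team: cost_per_win(p, WINS) for team, p in payroll_musd.items()}
ratio = cost["Yankees"] / cost["Moneyball team"]

for team, c in cost.items():
    print(f"{team}: ${c:.2f}M per win")
print(f"Payroll-per-win ratio: {ratio:.1f}x")
```

Under these assumptions the higher-budget team pays about three times as much per win, which is the point of the comparison.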
EVOLUTION OF DATA SCIENCE
1. Early Roots (1960s - 1990s): Data Science traces its origins to statistics, computer science, and applied
mathematics. The emphasis was on data analysis, but the scale of data was relatively small compared to today's
standards. Techniques like linear regression, hypothesis testing, and data visualization were common.
2. Big Data Era (2000s): The exponential growth of data generated by the internet and technological advancements
in storage and computing power led to the Big Data revolution. Data Science began to focus on processing,
storing, and analyzing large datasets that were beyond the capabilities of traditional tools. Distributed computing
frameworks like Apache Hadoop and data processing engines like Apache Spark emerged.
3. Emergence of Machine Learning (2000s - 2010s): As data continued to grow, traditional statistical methods faced
limitations. Machine Learning, a subfield of Artificial Intelligence, gained prominence. Algorithms like Support
Vector Machines, Decision Trees, and Random Forests became popular. Researchers and practitioners started
exploring deep learning techniques, thanks to the availability of vast datasets and advancements in neural
networks.
4. Interdisciplinary Merging (2010s): The boundaries between fields like data analysis, statistics, and computer
science blurred as Data Science expanded its horizons. Interdisciplinary collaboration became essential to harness
the full potential of data. Domain knowledge became vital for understanding the context and making accurate
predictions.
5. Rise of Data Science Tools and Platforms (2010s): Data Science tools and platforms proliferated, making it easier
for non-experts to work with data. Libraries like Pandas, NumPy, scikit-learn, and TensorFlow provided accessible
APIs for data manipulation, analysis, and machine learning. Cloud-based services also enabled data storage and
processing without significant infrastructure costs.
6. Integration of Data Engineering and DevOps (2010s): The role of Data Engineers became vital in the data pipeline,
working on data collection, storage, and transformation. DevOps principles were also adopted to ensure the
seamless integration and deployment of data solutions into production environments.
7. Ethical Considerations (2010s - Present): As Data Science gained prominence, ethical concerns surrounding data
privacy, bias, and fairness emerged. Data scientists and organizations started focusing on responsible data
practices to mitigate potential harm caused by data-driven decisions.
8. Augmentation with AI and Automation (Present): The integration of AI and automation has allowed Data Science
to become more efficient and productive. AutoML (Automated Machine Learning) tools have emerged, enabling
faster model development and deployment.
9. Future Trends (ongoing): Data Science continues to evolve as technologies like natural language processing,
computer vision, and reinforcement learning progress. Explainable AI, which seeks to make AI models more
transparent and interpretable, is gaining traction to address concerns about AI's "black-box" nature.
Basic Terminology
• Define “DATA”
• When we use the word "data", we refer to a collection of information in either an organized or unorganized
format.
• Organized data: This refers to data that is sorted into a row/column structure, where every row
represents a single observation, and the columns represent the characteristics of that observation.
• Unorganized data: This is data in free form, usually text or raw audio/signals,
that must be parsed further to become organized.
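A minimal sketch of the difference, using made-up records: the first structure is already organized into rows (observations) and columns (characteristics), while the free-text sentence has to be parsed before it fits that shape. The field names, the sample sentence, and the `parse_line` helper are all hypothetical and chosen only for illustration.

```python
import re

# Organized data: each row is one observation, each key is a characteristic.
organized = [
    {"name": "Asha", "age": 31, "city": "Pune"},
    {"name": "Ravi", "age": 27, "city": "Delhi"},
]

# Unorganized data: free-form text that must be parsed before use.
unorganized = "Meera is 45 years old and lives in Chennai."


def parse_line(text: str) -> dict:
    """Parse a sentence of the assumed form
    '<name> is <age> years old and lives in <city>.' into a row."""
    match = re.match(r"(\w+) is (\d+) years old and lives in (\w+)\.", text)
    if match is None:
        raise ValueError(f"unrecognized format: {text!r}")
    name, age, city = match.groups()
    return {"name": name, "age": int(age), "city": city}


# After parsing, the free-text record fits the same row/column structure.
organized.append(parse_line(unorganized))
for row in organized:
    print(row)
```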
Data science is the art and science of acquiring knowledge through data.
Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to
do the following:
• Make decisions
• Predict the future
• Understand the past/present
• Create new industries/products
The Data Science Venn Diagram
Understanding data science begins with three basic
areas:
• Math/statistics: This is the use of equations and
formulas to perform analysis.
• Computer programming: This is the ability to use
code to create outcomes on the computer.
• Domain knowledge: This refers to understanding
the problem domain (medicine, finance, social
science, and so on)
• Structured (organized) data: This is data that can be thought of as observations and characteristics. It is
usually organized using a table method (rows and columns).
• Unstructured (unorganized) data: This data exists as a free entity and does not follow any standard
organization hierarchy.
• Examples : ?
• Continuous data: This describes data that is measured. It exists on an infinite range of values.
• A good example of continuous data would be a person's weight because it can be 150 pounds or
197.66 pounds (note the decimals). The height of a person or building is a continuous number
because an infinite scale of decimals is possible. Other examples of continuous data would be
time and temperature.
The four levels of data
It is generally understood that a specific characteristic (feature/column) of
structured data can be broken down into one of four levels of data. The
levels are:
1. The nominal level
2. The ordinal level
3. The interval level
4. The ratio level
The nominal level (which also sounds like the word "name") consists of data that is described
purely by name or category.
• Basic examples include gender, nationality, species, or yeast strain in a beer. They are not
described by numbers and are therefore qualitative.
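Because nominal values are only names, they can be counted and compared for equality, but not ordered or averaged. A small sketch with made-up nationality values, using the most frequent category (the mode) as the summary:

```python
from collections import Counter

# Hypothetical nominal data: categories with no numeric meaning or order.
nationalities = ["Indian", "French", "Indian", "Kenyan", "French", "Indian"]

# Counting categories is a valid operation at the nominal level.
counts = Counter(nationalities)

# The most frequent category (the mode) summarizes nominal data.
mode_value, mode_count = counts.most_common(1)[0]

print(counts)
print(f"Most common: {mode_value} ({mode_count} occurrences)")
```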
Measures of center
A measure of center is a number that describes what the data tends toward. It is sometimes referred
to as the balance point of the data. Common examples include the mean, median, and mode.
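The three measures can be computed with Python's standard `statistics` module. The weights below (in pounds) are illustrative values only; 150 and 197.66 echo the continuous-data example, and the rest are made up.

```python
import statistics

# Illustrative weights in pounds (sample values, not real measurements).
weights = [150, 160, 160, 175, 197.66]

mean_w = statistics.mean(weights)      # arithmetic average
median_w = statistics.median(weights)  # middle value of the sorted data
mode_w = statistics.mode(weights)      # most frequent value

print(f"mean={mean_w:.3f}, median={median_w}, mode={mode_w}")
```

Note that the single large value pulls the mean above the median, which is why the median is often preferred for skewed data.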
Data at the ordinal level is information that has a natural order.
• Let’s say you went to a restaurant where you are identified by a customer ID,
and you rated the service as good or average. That rating is ordinal data.
The restaurant keeps similar records for the other customers who visit, along
with their ratings. So, any data whose values follow some sort of sequence or
rank order is known as ordinal data.
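The restaurant example can be sketched as follows. The customer IDs and ratings are made up, and the three-step scale (including "poor") is an assumption added for illustration. Because the categories have an order, they can be sorted and a median rating can be picked, even though the gaps between categories have no numeric meaning.

```python
# Assumed rating scale, from lowest to highest. The order is what makes
# this ordinal rather than nominal.
SCALE = ["poor", "average", "good"]
RANK = {label: position for position, label in enumerate(SCALE)}

# Hypothetical records: customer ID -> service rating.
ratings = {101: "good", 102: "average", 103: "good", 104: "poor", 105: "average"}

# Sorting by rank is meaningful for ordinal data.
sorted_ratings = sorted(ratings.values(), key=RANK.__getitem__)

# With an odd number of records, the median is the middle sorted value.
median_rating = sorted_ratings[len(sorted_ratings) // 2]

print(sorted_ratings)
print(f"Median rating: {median_rating}")
```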
CORE RESEARCH ISSUES & INTERACTIONS
• Making Data Trustable & Usable: data cleaning, sampling, data provenance
• Data lakes, batch & online access, platforms
• Data Visualization & Dissemination
CORE RESEARCH ISSUES & INTERACTIONS
• Data cleaning
Making Data • Sampling
• Data lakes Trustable & • Data provenance
• Batch & online access Usable
• Platforms