01Intro
— Unit 1 —
Chapter 1. Introduction
◼ Why Data Science?
◼ Summary
Why Data Mining?
Evolution of Sciences
◼ Before 1600, empirical science
◼ 1600-1950s, theoretical science
◼ Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
◼ 1950s-1990s, computational science
◼ Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
◼ Computational Science traditionally meant simulation. It grew out of our inability
to find closed-form solutions for complex mathematical models.
◼ 1990-now, data science
◼ The flood of data from new scientific instruments and simulations
◼ The ability to economically store and manage petabytes of data online
◼ The Internet and computing Grid that makes all these archives universally
accessible
◼ Scientific info. management, acquisition, organization, query, and visualization
tasks scale almost linearly with data volumes. Data mining is a major new
challenge!
Evolution of Database Technology
◼ 1960s:
◼ Data collection, database creation, IMS and network DBMS
◼ 1970s:
◼ Relational data model, relational DBMS implementation
◼ 1980s:
◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
◼ Application-oriented DBMS (spatial, scientific, engineering, etc.)
◼ 1990s:
◼ Data mining, data warehousing, multimedia databases, and Web
databases
◼ 2000s:
◼ Stream data management and mining
◼ Data mining and its applications
◼ Web technology (XML, data integration) and global information systems
Data Science
◼ Definition: Data science is an interdisciplinary field that combines
statistics, mathematics, computer science, and domain knowledge to
extract insights from structured and unstructured data.
◼ Purpose: The primary goal is to analyze data to answer questions,
identify patterns, and support decision-making processes in various
industries.
◼ Key Components:
1. Data Collection: Gathering raw data from various sources such as
databases, sensors, or user interactions.
2. Data Cleaning: Ensuring the data is accurate and ready for analysis
by removing duplicates and filling in missing values.
3. Data Analysis: Applying statistical methods and algorithms to
identify trends and relationships within the data.
4. Data Visualization: Presenting analysis results through charts and
graphs to facilitate understanding and decision-making.
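The cleaning step above (removing duplicates and filling in missing values) can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the record layout, the `clean_records` helper, and the choice of mean imputation are all assumptions made for the example.

```python
from statistics import mean


def clean_records(records, numeric_field):
    """Drop exact duplicate records, then fill missing values with the mean."""
    # Remove duplicates while preserving first-seen order.
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # Mean imputation is one simple choice among many (an assumption here).
    # Note: mean() raises StatisticsError if every value is missing.
    observed = [r[numeric_field] for r in unique if r[numeric_field] is not None]
    fill_value = mean(observed)
    for r in unique:
        if r[numeric_field] is None:
            r[numeric_field] = fill_value
    return unique


# Hypothetical raw data: one exact duplicate and one missing value.
raw = [
    {"id": 1, "age": 30},
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 3, "age": 40},
]
cleaned = clean_records(raw, "age")
```

Whether the mean is a sensible fill value depends on the data; a median or a domain-specific default may be more appropriate in practice.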
◼ Techniques Used: Incorporates machine learning, predictive
analytics, and statistical modeling to derive actionable insights.
◼ Applications: Utilized in various sectors including healthcare,
finance, e-commerce, and marketing to enhance business strategies
and operations.
◼ Importance: As organizations increasingly rely on data for strategic
decisions, data science plays a crucial role in transforming raw data
into meaningful information that drives business success.
What Is Data Mining?
Knowledge Discovery (KDD) Process
◼ This is a view from typical database systems and data warehousing
communities
◼ Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process. Stage labels: Data Cleaning, Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation]
Data Mining in Business Intelligence
[Figure: layers with increasing potential to support business decisions. Surviving labels: Data Exploration (Statistical Summary, Querying, and Reporting); Data Presentation; Decision Making (End User)]
◼ Business intelligence (BI) involves analyzing historical and current data to make strategic decisions.
◼ Warehouses and data cubes are used for structured storage and reporting, but they don’t necessarily involve complex data mining.
◼ Business Objects vs. Data Mining Tools
◼ Business objects (like reports, dashboards, and OLAP cubes) focus on data exploration rather than deep pattern discovery.
◼ Data mining tools, on the other hand, extract hidden patterns, correlations, and predictions from large datasets.
◼ Supply Chain Example
◼ BI tools may provide supply chain dashboards, tracking inventory, shipments, and demand trends.
◼ Data mining, however, could predict demand patterns, detect anomalies, or optimize logistics based on historical data.
Example: Medical Data Mining
Multi-Dimensional View of Data Mining
◼ Data to be mined
◼ Database data (extended-relational, object-oriented, heterogeneous,
◼ Techniques utilized
◼ Data-intensive, data warehouse (OLAP), machine learning, statistics,
Data Mining: On What Kinds of Data?
◼ Database-oriented data sets and applications
◼ Relational database, data warehouse, transactional database
◼ Advanced data sets and advanced applications
◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data (incl. bio-sequences)
◼ Structured data, graphs, social networks and multi-linked data
◼ Object-relational databases
◼ Heterogeneous databases and legacy databases
◼ Spatial data and spatiotemporal data
◼ Multimedia database
◼ Text databases
◼ The World-Wide Web
Different Sources of Data for Data Analysis
1. Primary Data Sources
◼ Surveys: Collecting firsthand information through questionnaires from a
targeted audience.
◼ Observations: Gathering data by observing behaviors or events in their natural
settings.
◼ Experiments: Conducting controlled tests to gather data on specific variables
and their effects.
◼ Interviews: Directly engaging with individuals to obtain detailed information.
◼ Focus Groups: Discussing topics with a group to gather diverse perspectives
and insights.
2. Secondary Data Sources
◼ Government Databases: Publicly available data such as census data, economic
reports, and health statistics.
◼ Academic Journals: Research studies and articles that provide validated data in
various fields.
◼ Corporate Reports: Financial statements and performance reports from
businesses.
◼ Online Repositories: Platforms that aggregate data sets from various sources
for public access.
◼ Historical Records: Archived data that can provide insights into past trends and
events.
3. External Data Sources
◼ Social Media: Data from platforms like Twitter, Facebook, and LinkedIn that can
reveal consumer sentiment and trends.
◼ Market Research Data: Insights from studies conducted by research firms on
consumer behavior and market conditions.
◼ Weather Data: Information about climate conditions that can be relevant for
various analyses, especially in agriculture and logistics.
◼ APIs (Application Programming Interfaces): Tools that allow access to data
from web services, enabling integration with other applications.
4. Big Data Sources
◼ Machine Data: Information generated by machines or sensors, often used in
IoT applications.
◼ File Data: Structured or unstructured data stored in files that can be shared
across platforms.
5. Open Data Sources
◼ Public Health Data: Information related to health trends, disease outbreaks,
and healthcare access.
◼ World Bank Open Data: Global statistics on development indicators across
countries.
Data Mining Function: (1) Generalization
Data Mining Function: (2) Association and
Correlation Analysis
◼ Frequent patterns (or frequent itemsets)
◼ What items are frequently purchased together in your
Walmart?
◼ Association, correlation vs. causality
◼ A typical association rule
◼ Diaper → Beer [0.5%, 75%] (support, confidence)
◼ Are strongly associated items also strongly correlated?
◼ How to mine such patterns and rules efficiently in large
datasets?
◼ How to use such patterns for classification, clustering,
and other applications?
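The two numbers attached to a rule such as Diaper → Beer are its support (the fraction of all transactions containing both items) and its confidence (the fraction of transactions containing the antecedent that also contain the consequent). A minimal sketch of both measures follows. The basket data is invented for illustration, so its support value does not match the 0.5% quoted above; the function names are likewise assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market baskets (toy data, not from the slides).
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"milk", "bread"},
    {"bread"},
    {"milk"},
    {"beer"},
]

rule_support = support(baskets, {"diaper", "beer"})          # 3 of 8 baskets
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})  # 3 of 4 diaper baskets
```

A high confidence alone does not show correlation or causality: if beer appears in most baskets anyway, the rule adds little information, which is the point of the "association, correlation vs. causality" bullet.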
Data Mining Function: (3) Classification
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
◼ Outlier analysis
◼ Outlier: A data object that does not comply with the general
behavior of the data
◼ Noise or exception? ― One person’s garbage could be another
person’s treasure
◼ Methods: by-product of clustering or regression analysis, …
◼ Useful in fraud detection and rare-event analysis
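As one concrete way to flag objects that "do not comply with the general behavior of the data", the sketch below uses a robust score based on the median and the median absolute deviation (MAD). This is a swapped-in technique chosen for brevity, not one of the clustering or regression methods named above; the 0.6745 scaling constant and the 3.5 cutoff are a commonly used convention and are assumptions here, as is the sample data.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return values whose MAD-based score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical transaction amounts with one extreme value.
amounts = [12.0, 15.5, 14.2, 13.8, 16.1, 15.0, 250.0, 14.7]
flagged = mad_outliers(amounts)
```

Whether the flagged value is noise or a meaningful exception (for example, a fraudulent charge) is a judgment the score itself cannot make.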
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
Applications of Data Mining
◼ Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
◼ Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
◼ From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
Major Issues in Data Mining (1)
◼ Mining Methodology
◼ Mining various and new kinds of knowledge
◼ Mining knowledge in multi-dimensional space
◼ Data mining: An interdisciplinary effort
◼ Boosting the power of discovery in a networked environment
◼ Handling noise, uncertainty, and incompleteness of data
◼ Pattern evaluation and pattern- or constraint-guided mining
◼ User Interaction
◼ Interactive mining
◼ Incorporation of background knowledge
◼ Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Challenges in Data Science
Applications of Data Science
Introduction to Data Modeling
◼ Challenges in Data Modeling
◼ Model Complexity: Developing an effective model can be challenging,
requiring careful consideration of data features and relationships.
◼ Algorithm Implementation: Once a model is created, applying it through
algorithms is generally straightforward, but finding the optimal model
parameters can be difficult.
Statistical Modeling
◼ Definition: Statistical data modeling is the process of applying statistical
analysis techniques to datasets to understand relationships, make
predictions, and derive insights.
◼ Purpose: The goal is to create mathematical representations (models)
that describe the underlying structure of the data and can be used for
forecasting future outcomes.
◼ A statistical model is an underlying distribution from which the visible data is
assumed to be drawn.
Computational Approaches to Modeling
◼ Definition: Computational modeling involves using algorithms and
computer programs to analyze and simulate complex systems, often
contrasting with traditional statistical approaches.
◼ Algorithmic Perspective:
◼ In computational modeling, a model is seen as the solution to a complex
query about the data rather than a statistical representation.
◼ For example, calculating the average and standard deviation of a
dataset provides insights without necessarily fitting a Gaussian
distribution.
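The average and standard deviation mentioned above can be computed directly as summary statistics, with no distribution fitted. A minimal sketch using Python's standard library; the sample values are invented for illustration.

```python
from statistics import mean, pstdev

# Hypothetical dataset (toy values).
readings = [2, 4, 4, 4, 5, 5, 7, 9]

avg = mean(readings)       # arithmetic mean of the values
spread = pstdev(readings)  # population standard deviation
```

Here `pstdev` treats the list as the whole population; `statistics.stdev` would instead apply the sample correction. Either way, the two numbers summarize the data without any claim that it is Gaussian.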
◼ Modeling Approaches
◼ Summarization:
◼ This approach focuses on succinctly representing data while capturing
essential features, allowing for easier interpretation and analysis.
◼ Feature Extraction:
◼ Prominent features of the data are identified and retained, while less
significant information is ignored, simplifying the dataset for further
analysis.
◼ Examples of Computational Modeling Techniques
◼ Random Processes: Constructing models based on random processes that
simulate how data could have been generated.
◼ Machine Learning Algorithms: Utilizing algorithms that learn from historical
data to make predictions or identify patterns.
Statistical Limits on Data Mining
Bonferroni’s Principle: Example
◼ Even without any actual evil-doers, there would be approximately
250,000 pairs appearing suspicious.
◼ This highlights the challenge in distinguishing genuine signals from noise
in large datasets.
◼ Implications
◼ False Positives: High numbers of expected occurrences can lead to
significant resources being wasted investigating innocent individuals.
◼ Application in Security: In contexts like terrorism detection, it emphasizes
the need to look for rare events that are less likely to occur randomly to
effectively identify genuine threats.
◼ Bonferroni’s Principle serves as a critical reminder in data analysis and
mining, urging caution against overinterpreting results from large
datasets and highlighting the importance of rigorous statistical methods
to validate findings.
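The setup behind the 250,000 figure is not included in the text above. The sketch below assumes the parameters of the widely used textbook version of this example: one billion people, 1,000 days of observation, 100,000 hotels, each person visiting some hotel on 1% of days, and a pair counted as suspicious if both people are at the same hotel on two different days. All of these numbers are assumptions, not values stated here; under them, purely random behavior yields roughly a quarter of a million suspicious pairs.

```python
from fractions import Fraction
from math import comb

# Assumed parameters (textbook version of the example, not from the slides).
people = 10**9
days = 1000
hotels = 10**5
p_visit = Fraction(1, 100)  # chance a given person is at some hotel on a given day

# Two given people are both at a hotel on a given day, and it is the same hotel.
p_same_hotel_same_day = p_visit * p_visit / hotels  # 10**-9

# The same coincidence happens on two given, different days.
p_two_days = p_same_hotel_same_day ** 2             # 10**-18

# Expected number of (pair of people, pair of days) coincidences under pure chance.
expected_pairs = comb(people, 2) * comb(days, 2) * p_two_days

print(f"expected suspicious pairs: {float(expected_pairs):,.0f}")
```

The calculation uses exact fractions to avoid rounding in the tiny probabilities. The result depends entirely on the assumed parameters; the general lesson is that when the expected count under randomness is this large, the pattern being searched for is too common to be evidence of anything.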
Summary
◼ Data mining: Discovering interesting patterns and knowledge from
massive amounts of data
◼ A natural evolution of database technology, in great demand, with
wide applications
◼ A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
◼ Mining can be performed on a variety of data
◼ Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
◼ Data mining technologies and applications
◼ Major issues in data mining
Recommended Reference Books
◼ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan
Kaufmann, 2002
◼ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
◼ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
◼ U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
◼ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge
Discovery, Morgan Kaufmann, 2001
◼ J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
◼ D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
◼ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer-Verlag, 2009
◼ B. Liu, Web Data Mining, Springer 2006.
◼ T. M. Mitchell, Machine Learning, McGraw Hill, 1997
◼ G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
◼ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
◼ S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
◼ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005