Unit-1 IDS

UNIT I: Introduction to Data science, benefits and uses, facets of data, data science process in brief, big data

ecosystem and data science


Data Science process: Overview, defining goals and creating project charter, retrieving data, cleansing,
integrating and transforming data, exploratory analysis, model building, presenting findings and building
applications on top of them

Introduction to Data science


Big Data
 Definition: Collections of data sets so large or complex that traditional data management techniques (like
RDBMS) are inadequate.
 Challenges: Data capture, curation, storage, search, sharing, transfer, and visualization.
 Characteristics (Four Vs):
o Volume: Quantity of data.
o Variety: Diversity of data types.
o Velocity: Speed of data generation.
o Veracity: Accuracy and trustworthiness of data.
Data Science
 Definition: The discipline of analyzing massive amounts of data to extract knowledge, evolved from
statistics and traditional data management.
 Key Aspects:
o Combines methods from computer science and statistics.
o Deals with big data, machine learning, computing, and algorithm building.
 Tools:
o Hadoop, Pig, Spark, R, Python, Java, and others.
o Python is emphasized for its extensive libraries, support, and rapid prototyping capabilities.
Relationship Between Big Data and Data Science
 Analogy: Big data is like crude oil; data science is like the oil refinery.
 Evolution: Data science has evolved to handle the challenges posed by big data, making it distinct from
traditional statistics and data management roles.
Importance for Data Scientists
 Skills: Ability to work with big data and experience in machine learning, computing, and algorithm
development are crucial.
 Career Impact: Data scientists will inevitably engage with big data projects as data continues to grow in
importance.
Introduction to Data science:
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to
analyze data and to extract knowledge and insights from it.
What is Data Science?
 Data Science is about data gathering, analysis and decision-making.
 Data Science is about finding patterns in data through analysis and making future predictions.
 By using Data Science, companies are able to make:
 Better decisions (should we choose A or B?)
 Predictive analyses (what will happen next?)
 Pattern discoveries (finding patterns, or perhaps hidden information, in the data)
Where is Data Science Needed?
Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and
manufacturing.
Examples of where Data Science is needed:
 For route planning: To discover the best routes to ship
 To foresee delays for flight/ship/train etc. (through predictive analysis)
 To create promotional offers
 To find the best suited time to deliver goods
 To forecast the next year's revenue for a company
 To analyze the health benefits of training
 To predict who will win elections
Data Science can be applied in nearly every part of a business where data is available.
Examples are:
 Consumer goods
 Stock markets
 Industry
 Politics
 Logistic companies
 E-commerce

Benefits and Uses of Data Science and Big Data


Commercial Uses
 Customer Insights: Understanding customer behavior to improve user experience, cross-sell, up-sell, and
personalize offerings.
o Example: Google AdSense uses data to match relevant commercial messages to users.
o Example: MaxPoint provides real-time personalized advertising.
 Human Resources:
o People Analytics: Screening candidates, monitoring employee mood, studying informal networks.
o Example: "Moneyball" illustrates the use of statistics to hire players in baseball, transforming the
traditional scouting process.
 Financial Institutions:
o Predicting stock markets, assessing lending risks, attracting new clients.
o At least 50% of trades worldwide are executed automatically by algorithms developed by quants (data scientists specializing in trading algorithms).
Governmental Uses
 Internal Data Science: Detecting fraud, optimizing project funding, monitoring criminal activity.
o Example: Edward Snowden's leaks revealed how agencies like NSA and GCHQ used data science
to monitor millions of individuals using various data sources.
 Public Data Sharing: Government organizations sharing data with the public to gain insights and build
data-driven applications.
o Example: Data.gov provides open data from the US Government.
Nongovernmental Organizations (NGOs)
 Fundraising and Advocacy:
o NGOs use data to raise funds and promote their causes.
o Example: World Wildlife Fund (WWF) employs data scientists to improve fundraising efforts.
o Example: DataKind is a group of data scientists volunteering for humanitarian causes.
Academic Institutions
 Research and Education Enhancement:
o Using data science in research and to improve student learning experiences.
o Massive open online courses (MOOCs) generate data to study and enhance learning methods.
o Examples of MOOCs: Coursera, Udacity, and edX, which offer courses from top universities and
help practitioners stay current with the rapidly changing fields of data science and big data.
Facets of Data
1. Structured Data
 Definition: Data organized in a fixed format, often stored in tables within databases or spreadsheets.
 Management: Managed using SQL.
 Example: Excel tables, hierarchical data like family trees.

2. Unstructured Data
 Definition: Data that doesn’t fit neatly into a predefined model, often context-specific or varying.
 Example: Emails containing mixed elements like sender, title, and body text.


3. Natural Language Data


 Definition: A type of unstructured data requiring advanced processing and understanding of linguistics.
 Challenges: Ambiguity and domain-specific context.
 Successes: Entity recognition, topic recognition, sentiment analysis.
 Example: Text in emails or documents.
4. Machine-Generated Data
 Definition: Data automatically created by machines without human intervention.
 Volume & Speed: Requires scalable tools for analysis due to high volume and speed.
 Examples: Web server logs, network event logs, telemetry.
 Market Insight: Industrial Internet estimated at $540 billion in 2020.
5. Graph-Based Data
 Definition: Data representing relationships between objects using nodes, edges, and properties.
 Use: Effective for modeling social networks and calculating metrics like influence and shortest paths.
 Examples: Social media connections, LinkedIn networks.
 Storage: Graph databases, queried with languages like SPARQL.
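As an illustration of working with graph-based data, the short sketch below builds a small social network and computes influence (degree centrality) and a shortest path. The networkx library is an assumption here; the text itself only mentions graph databases and SPARQL.
python
# Minimal sketch (assuming the networkx library) of graph-based data:
# nodes are people, edges are connections.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Alice", "Bob"), ("Bob", "Carol"),
    ("Carol", "Dave"), ("Alice", "Dave"), ("Dave", "Eve"),
])

print(nx.degree_centrality(g))              # rough measure of influence per person
print(nx.shortest_path(g, "Alice", "Eve"))  # shortest connection chain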

6. Audio, Image, and Video Data


 Definition: Multimedia data types that present unique challenges for recognition and interpretation.
 Challenges: Recognizing objects and movements in images and videos.
 Example: Video capture in sports for real-time analytics, image recognition algorithms.

7. Streaming Data
 Definition: Data that continuously flows into a system as events happen, requiring real-time processing.
 Examples: Twitter trends, live event data, stock market data
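As a minimal illustration of streaming data, the sketch below consumes events one at a time and keeps a running aggregate instead of waiting for a complete batch; the event source is only a stand-in for a real live feed.
python
# Minimal sketch: process events as they arrive and keep a running average.
def event_stream():
    for price in [101.2, 101.5, 100.9, 102.3]:  # stand-in for a live market feed
        yield price

count, total = 0, 0.0
for price in event_stream():
    count += 1
    total += price
    print(f"running average after {count} events: {total / count:.2f}")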
The Data Science Process

The data science process typically consists of six steps:


1. Setting the Research Goal
o Objective: Define the purpose of the project and how it benefits the organization.
o Project Charter: Includes research objectives, benefits, required data and resources, timetable,
and deliverables.
o Application: Different research goals will be explored through case studies in the book.
2. Retrieving Data
o Data Collection: Ensure availability, quality, and access to the necessary data as outlined in the
project charter.
o Sources: Data can come from internal databases, third-party companies, and various formats such
as Excel spreadsheets and databases.
3. Data Preparation
o Objective: Enhance data quality and prepare it for analysis.
o Subphases:
 Data Cleansing: Remove false values and inconsistencies.
 Data Integration: Combine information from multiple sources.
 Data Transformation: Format the data appropriately for modeling.
4. Data Exploration
o Objective: Gain a deeper understanding of the data.
o Techniques: Use descriptive statistics, visual techniques, and simple modeling.
o EDA: Exploratory Data Analysis focuses on understanding variable interactions, data distribution,
and identifying outliers.
5. Data Modeling or Model Building
o Objective: Use models and domain knowledge to answer the research question.
o Techniques: Select from statistics, machine learning, operations research, etc.
o Process: Iteratively select variables, execute the model, and perform model diagnostics.
6. Presentation and Automation
o Objective: Present results to the business and possibly automate the process for future use.
o Formats: Results can be shared through presentations, reports, or automated systems.
Iterative Nature of the Process
 Non-linear Progression: The process is often iterative, requiring rework based on new insights or errors
discovered.
 Scope: Clearly define and scope the business question at the start to minimize rework and ensure focus.
The Big Data Ecosystem and Data Science
The big data ecosystem consists of various technologies, each serving specific purposes within the data science
workflow. These technologies can be grouped into several main categories:
1. Distributed File Systems
o Definition: A file system running on multiple servers, offering functionalities similar to traditional
file systems but with added advantages.
o Advantages:
 Storage of large files beyond single server limits.
 Automatic replication for redundancy and parallel operations.
 Horizontal scaling for virtually limitless growth.
o Examples: Hadoop File System (HDFS), Red Hat Cluster File System, Ceph File System, Tachyon
File System.
2. Distributed Programming Frameworks
o Definition: Frameworks designed to work with distributed data by moving programs to the data
rather than moving data to the program.
o Benefits: Simplify handling distributed data complexities, such as job failures and subprocess
tracking.
o Popular Frameworks: Apache Hadoop, Apache Spark.
3. Data Integration Frameworks
o Definition: Tools for moving data between sources and performing tasks similar to the traditional
extract, transform, load (ETL) process.
o Examples: Apache Sqoop, Apache Flume.
4. Machine Learning Frameworks
o Definition: Libraries and tools for extracting insights from data through machine learning,
statistics, and applied mathematics.
o Popular Libraries:
 Python Libraries: Scikit-learn (general machine learning), PyBrain (neural networks),
NLTK (natural language processing), Pylearn2 (machine learning toolbox), TensorFlow
(deep learning).
 Other Technologies: Apache Spark (real-time machine learning).
5. NoSQL Databases
o Definition: Databases designed to manage and query large amounts of data, overcoming
limitations of traditional relational databases.
o Types:
 Column Databases: Store data in columns for faster queries.
 Document Stores: Store data in flexible document formats.
 Streaming Data: Handle real-time data collection and aggregation.
 Key-Value Stores: Use keys to access values, with scalable performance.
 SQL on Hadoop: Perform batch queries using SQL-like languages on Hadoop.
 New SQL: Combine scalability of NoSQL with relational database advantages.
 Graph Databases: Use graph theory for problems like social networks.
6. Scheduling Tools
o Definition: Automate repetitive tasks and trigger jobs based on events, similar to CRON but for
big data.
o Example: Automate MapReduce tasks when new data is available.
7. Benchmarking Tools
o Definition: Tools for optimizing big data infrastructure by profiling and benchmarking
performance.
o Usage: Typically managed by IT professionals rather than data scientists.
8. System Deployment
o Definition: Tools for automating the installation and configuration of big data infrastructure.
o Role: Assists engineers rather than data scientists.
9. Service Programming
o Definition: Tools to expose big data applications as services for other applications to use.
o Example: REST services for integrating applications.
10. Security
o Definition: Tools for central and fine-grained control over data access.
o Role: Managed by security experts, with data scientists primarily as data consumers.
Introductory Example of Hadoop
 Objective: Use the Hortonworks Sandbox for a practical Hadoop application.
 Steps:
1. Download and Install: Obtain the Hortonworks Sandbox image and run it in VirtualBox.
2. Access the Sandbox: Log in via the web interface at http://127.0.0.1:8000.
3. Explore Data: Use HCatalog to view and query sample data.
4. Execute Queries: Use HiveQL via the Beeswax HiveQL editor to run MapReduce jobs and view
results.
 Outcome: Analyze job salary data to determine the impact of education on salaries.

Overview of the Data Science Process


A structured approach to data science boosts project success and team efficiency, though flexibility is often
necessary based on the project’s needs. Here’s a breakdown of the typical data science process:
1. Setting a Research Goal
o Objective: Clarify the project’s aims, methods, and reasons with all stakeholders.
o Deliverable: A project charter detailing goals, scope, and expected outcomes.
2. Data Retrieval
o Objective: Gather relevant data for analysis by finding and accessing necessary datasets.
o Deliverable: Raw data that may need cleaning and formatting.
3. Data Preparation
o Objective: Clean and transform raw data to make it suitable for analysis. This involves fixing
errors and combining datasets.
o Deliverable: Clean, usable data ready for exploration and modeling.
4. Data Exploration
o Objective: Understand the data by identifying patterns, correlations, and anomalies through visual
and descriptive methods.
o Deliverable: Insights that inform the modeling phase.
5. Model Building
o Objective: Create and refine models to gain insights or make predictions as per the project goals.
Often, simpler models can be more effective than complex ones.
o Deliverable: Validated and tuned models that meet project objectives.
6. Presentation and Automation
o Objective: Share findings with stakeholders and set up automated systems if needed to improve
decision-making or processes.
o Deliverable: Clear insights and potentially automated analysis systems.
Key Points
 Iterative Nature: The process is cyclical, with frequent revisions based on discoveries and feedback.
 Prototype Mode: Early development often involves experimenting with models, focusing more on value
than on code perfection.
 Team Collaboration: Leveraging a team’s diverse skills enhances project quality over relying on a single
person.
Flexibility and Alternatives
 Agile Methodology: Offers a flexible, iterative approach that adapts to evolving project needs. It's
becoming popular in data science but may be limited by company policies.
 Adapting the Process: Adjustments may be needed based on project specifics and organizational context.
Additional Considerations
 Complex Projects: Large-scale or real-time projects might require different methods than those described.
 Data Accessibility: Getting the necessary data can be challenging, especially in large organizations with
complex approval processes.
This approach serves as a robust foundation for managing and executing data science projects effectively.
Step 1: Defining Research Goals and Creating a Project Charter
Objective: Start the project by clearly understanding and documenting the project's objectives, value, and
approach. This ensures everyone involved knows what to do and agrees on the plan.

1. Understanding Goals and Context


o What: Determine what the company expects from the project. Clarify the specific tasks or
outcomes required.
o Why: Understand the importance of the project. Is it part of a larger strategy or a standalone
initiative?
o How: Decide on the methods or approaches you'll use to achieve the goals.
o Outcome: A well-defined research goal, clear understanding of the business context, and a plan
for action.
Key Points:
o Continuously ask questions to fully grasp business expectations and the impact of your research.
o Avoid rushing through this phase; a deep understanding of the business context is crucial for
success.
2. Creating a Project Charter
o Purpose: Formalize the project details in a document that outlines what the project will deliver
and how.
o Components:
 Research Goal: A clear statement of what the project aims to achieve.
 Mission and Context: Explanation of the project's mission and its context within the
organization.
 Analysis Plan: How you will perform the analysis.
 Resources: What resources (data, tools, people) you'll need.
 Feasibility: Evidence that the project is achievable, such as proof of concept.
 Deliverables and Success Measures: What will be delivered and how success will be
measured.
 Timeline: Key milestones and deadlines for the project.
Client's Use:
o Helps in estimating project costs.
o Determines the data and resources needed for success.
Additional Tips:
 This phase often requires strong communication and business understanding skills, sometimes guided by
senior personnel.
 Ensuring clarity in the project charter helps prevent misunderstandings and sets clear expectations for both
the team and the client.
Step 2: Retrieving Data
Overview
Retrieving data is a crucial step in the data science process. It involves acquiring data from various sources, which
can include internal company repositories, external vendors, or publicly available datasets. The goal is to obtain
all necessary data and ensure it is suitable for analysis.

1. Start with Data Stored Within the Company


 Assess Available Data:
o Internal Repositories: Check for data in databases, data marts, data warehouses, and data lakes.
o Databases: Designed for storage.
o Data Warehouses: Designed for reading and analysis.
o Data Marts: Subsets of data warehouses, specific to business units.
o Data Lakes: Contain raw data; may need significant cleaning.
 Challenges:
o Scattered Data: Data may be dispersed across different departments or stored in Excel files.
o Access Issues: Organizational policies may restrict data access and involve bureaucratic hurdles.
 Data Accessibility:
o Chinese Walls: Physical and digital barriers to ensure data privacy and security.
o Access Permissions: Obtaining necessary permissions can be time-consuming.
2. Don’t Be Afraid to Shop Around
 External Data Sources:
o Data Vendors: Companies like Nielsen and GFK provide valuable data.
o Social Media: Platforms such as Twitter, LinkedIn, and Facebook offer data that can enrich
analysis.
 Open Data:
o Government and Institutional Data: Many organizations share high-quality data for free.
Examples include:
 Data.gov: US Government open data.
 Data.europa.eu: European Commission open data.
 Freebase.org: Information from Wikipedia, MusicBrainz, and SEC archives.
 Data.worldbank.org: World Bank open data initiative.
 Aiddata.org: Data for international development.
 Open.fda.gov: Data from the US Food and Drug Administration.

3. Do Data Quality Checks Now to Prevent Problems Later
 Importance of Quality Checks:
o Data Cleansing: Expect significant time investment in correcting data issues (up to 80% of the
project).
o Error Detection: Initial checks during retrieval ensure data matches source documents and correct
data types.
 Quality Check Phases:
o Data Retrieval:
 Verify that data matches the source.
 Check for correct data types.
o Data Preparation:
 Conduct more detailed checks.
 Correct errors, such as typos and inconsistencies (e.g., standardizing country names).
o Exploratory Data Analysis:
 Analyze data for patterns, distributions, correlations, and outliers.
 Iteratively refine data based on findings (e.g., addressing outliers that indicate data entry
errors).
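The early checks described above can be scripted in a few lines of pandas; the sketch below is only an illustration and the file name is hypothetical.
python
import pandas as pd

df = pd.read_csv("retrieved_data.csv")  # hypothetical file from the retrieval step

print(len(df), "rows retrieved")        # does this match what the source reported?
print(df.dtypes)                        # are the data types what we expect?
print(df.isna().sum())                  # missing values to handle during preparation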
Key Points
 Iterative Process: Expect to revisit and refine data across different phases.

 Data Sources: Utilize both internal and external sources for comprehensive data collection.
 Quality Assurance: Early and continuous quality checks are vital to avoid issues in later stages.
Step 3: Cleansing, Integrating, and Transforming Data
Data received from the retrieval phase often needs significant preparation. Cleansing, integrating, and
transforming data are crucial for ensuring that models work accurately and efficiently.
1 Cleansing Data
Goal: Remove errors and inconsistencies to make data a true and consistent representation.
Types of Errors and Solutions:
1. Data Entry Errors:
o Example: A person’s age recorded as 350 years.
o Solution: Use sanity checks to find and correct these errors.
2. Redundant Whitespace:
o Example: Keys like "FR " vs. "FR" causing mismatches.
o Solution: Use functions like strip() in Python to remove leading and trailing spaces.
3. Capital Letter Mismatches:
o Example: "Brazil" vs. "brazil."
o Solution: Convert all text to lowercase using functions like .lower().
4. Impossible Values:
o Example: A height of 3 meters.
o Solution: Implement rules to flag and correct unrealistic values.
5. Outliers:
o Example: A salary listed as $1,000,000 when most are under $100,000.
o Solution: Use plots or statistical methods to detect and assess outliers.
6. Missing Values:
o Techniques:
 Omit values (lose information).
 Set value to null (may cause issues with some models).
 Impute a static value or estimate (e.g., mean value).
 Model the value (may introduce assumptions).
7. Inconsistencies Between Data Sets:
o Example: Different units (Pounds vs. Dollars).
o Solution: Standardize units and use code books for consistent terminology.
Advanced Methods:
 Use diagnostic plots or regression to identify influential outliers or errors.

Examples:
Data Cleansing
1. Data Entry Errors:
o Mistakes during data entry: Correcting "Godo" to "Good" and "Bade" to "Bad".
python
data = ["Good", "Bad", "Godo", "Bade"]
cleaned_data = ["Good" if x == "Godo" else "Bad" if x == "Bade" else x for x in data]
o Redundant whitespaces: Removing leading and trailing spaces from " Apple ".
python
string = " Apple "
cleaned_string = string.strip()
o Capital letter mismatches: Standardizing "Brazil" and "brazil" to lowercase.
python
string1 = "Brazil"
string2 = "brazil"
standardized_string1 = string1.lower()
standardized_string2 = string2.lower()
o Impossible values: Removing ages over 150.
python
ages = [25, 34, 300, 45]
valid_ages = [age for age in ages if age <= 150]
2. Outliers:
o Identifying outliers using a plot or summary statistics.
python
import matplotlib.pyplot as plt
data = [10, 12, 15, 20, 1000]
plt.boxplot(data)
plt.show()
3. Missing Values:
o Imputing missing values with the mean.
python
import numpy as np
data = [1, 2, np.nan, 4]
mean_value = np.nanmean(data)
imputed_data = [x if not np.isnan(x) else mean_value for x in data]
4. Deviations from Code Book:
o Ensuring "Female" and "F" are consistent.
python
genders = ["Female", "F", "Male"]
standardized_genders = ["Female" if g == "F" else g for g in genders]
5. Different Units of Measurement:
o Converting prices from dollars to euros.
python
prices_in_dollars = [10, 20, 30]
conversion_rate = 0.85 # 1 dollar = 0.85 euros
prices_in_euros = [price * conversion_rate for price in prices_in_dollars]
6. Different Levels of Aggregation:
o Aggregating daily data to weekly data.
python
daily_sales = [100, 200, 150, 300, 250, 400, 500]
weekly_sales = sum(daily_sales)

2 Correct Errors as Early as Possible


Importance:
1. Avoid Mistakes: Early correction prevents costly errors.
2. Efficiency: Reduces need for repeated cleansing.
3. Process Improvement: Errors can highlight issues in data collection or business processes.
4. Equipment and Software: Identify faulty equipment or software issues.
Example: Fixing data entry issues during the collection phase avoids the need to handle them later in modeling.

3 Combining Data from Different Data Sources


Operations:
1. Joining Tables:
o Example: Merge customer purchase data with regional information based on customer ID.
o Result: Enriches observations with additional data.
Examples:
Joining Tables: Combining data based on common keys to enrich observations
combined_data = pd.merge(purchases, regions, on="Client")

2. Appending Tables:
o Example: Combine sales data from January and February into a single table.
o Result: Creates a comprehensive dataset covering both months.
o Example: Appending Tables: Adding observations from one table to another.

appended_data = pd.concat([Table1, Table2], ignore_index=True)

3. Using Views:
o Example: Create a virtual table combining monthly sales without duplicating data.
o Advantage: Saves disk space but may require more processing power.
o Example: Using Views: Creating virtual tables to avoid data duplication.

def create_view():
    # compute the combined sales on demand instead of storing a duplicate copy
    return pd.concat([Table1, Table2], ignore_index=True)

yearly_sales_view = create_view()
4. Enriching Aggregated Measures:
o Example: Calculate growth percentage or rank sales by product class.
o Result: Adds valuable context and insights for modeling.
o Enriching Aggregated Measures: Adding calculated information for better analysis.

sales_data = pd.DataFrame({"ProductClass": ["Sport", "Sport", "Shoes"], "Sales": [95, 120, 10]})


total_sales_by_class = sales_data.groupby("ProductClass").sum()

4 Transforming Data
Objective: Adjust data into a suitable format for modeling.
1. Transforming Variables:
o Example: Apply log transformation to linearize relationships.
o Effect: Simplifies complex relationships for modeling.
Transforming Variables: Simplifying relationships (e.g., using logarithms).

x = [1, 2, 3, 4]
log_x = np.log(x)

2. Reducing Variables:
o Example: Use Principal Component Analysis (PCA) to reduce dimensions.
o Effect: Simplifies models and improves performance by focusing on key variables.

Reducing Variables: Using methods like PCA to reduce dimensionality while retaining information.
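A minimal sketch of variable reduction with PCA, assuming scikit-learn and random data used purely for illustration:
python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 observations, 10 original variables

pca = PCA(n_components=3)               # keep the 3 strongest components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 3)
print(pca.explained_variance_ratio_)    # information retained per component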

3. Turning Variables into Dummies:


o Example: Convert a "Weekdays" variable into binary columns for each day of the week.
o Result: Facilitates modeling with categorical data.

Turning Variables into Dummies: Converting categorical variables into binary dummy variables for modeling.

df_dummies = pd.get_dummies(df, columns=["Gender"])


Exploratory Data Analysis (EDA) (or) exploratory analysis
Overview
Exploratory Data Analysis (EDA) is a critical phase in data science where you delve deeply into your dataset to
understand its structure, patterns, and relationships. This phase leverages visualization techniques to make data
more interpretable and uncover insights that may not be apparent from raw data alone.

Goals of EDA
1. Understand Data Structure: Explore and visualize data to grasp its underlying structure and patterns.
2. Identify Anomalies: Discover anomalies or errors in the data that were missed earlier.
3. Generate Hypotheses: Formulate hypotheses about relationships between variables.
4. Improve Data Quality: Although not the primary goal, EDA often reveals data quality issues that need
to be addressed.
Exploratory Data Analysis (EDA) Techniques
1. Simple Graphs
Purpose: Simple graphs help visualize individual aspects of the data, making it easier to identify patterns,
distributions, and trends.
 Line Graphs
o Plot points connected by lines to show changes over time or trends in continuous variables.
o Example: Tracking monthly sales figures for a year to observe seasonal trends.
 Histograms
o Divide a range of values into discrete bins and count the number of occurrences in each bin.
o Example: A histogram showing the distribution of ages in a dataset, where age ranges (e.g., 0-5,
6-10) are plotted on the x-axis and counts on the y-axis.

 Bar Charts
o Display categorical data with rectangular bars. The length of each bar represents the count or value
of the category.
o Example: A bar chart comparing the total sales of different product categories.
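The simple graphs above can be sketched with matplotlib; the data below is made up purely for illustration.
python
import matplotlib.pyplot as plt

ages = [23, 25, 31, 35, 35, 42, 47, 51, 52, 60]
plt.hist(ages, bins=5)                # histogram: distribution of ages
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

categories = ["Sport", "Shoes", "Clothing"]
sales = [120, 45, 80]
plt.bar(categories, sales)            # bar chart: total sales per category
plt.ylabel("Total sales")
plt.show()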

2. Combined Graphs
Purpose: Combine multiple simple graphs to provide a more comprehensive view of the data, revealing
relationships and patterns across different variables.
 Pareto Diagram (80-20 Rule)
o A combination of bar charts and cumulative distribution lines. It helps identify the most significant
factors contributing to a particular outcome.
o Example: A Pareto diagram showing that 80% of customer complaints come from just 20% of the
complaint types, guiding where to focus improvement efforts (see the sketch after this list).
 Composite Graphs
o Combining different types of plots (e.g., bar charts and line graphs) to visualize multiple aspects
of the data simultaneously.
o Example: A composite graph displaying total sales (bar chart) and profit margin percentage (line
graph) for different months.
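A Pareto diagram like the one described above can be sketched by combining a bar chart with a cumulative-percentage line; the complaint data below is illustrative only.
python
import numpy as np
import matplotlib.pyplot as plt

types = ["Billing", "Delivery", "Quality", "Support", "Other"]
counts = np.array([90, 55, 25, 15, 10])          # sorted from most to least frequent
cumulative = counts.cumsum() / counts.sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(types, counts)
ax1.set_ylabel("Number of complaints")

ax2 = ax1.twinx()                                # second y-axis for the cumulative line
ax2.plot(types, cumulative, color="red", marker="o")
ax2.set_ylabel("Cumulative %")
plt.show()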

3. Link and Brush


Purpose: Facilitate interactive exploration by linking different visualizations so that selecting data in one plot
highlights corresponding data in others.
 Enables selection in one graph to reflect changes across multiple graphs, making it easier to compare
related data points and identify correlations.
 Example: A link-and-brush interface where selecting certain regions in a scatter plot highlights
corresponding data in a histogram and a line chart, showing how data points are distributed across different
categories.
 Interactive Example: You select specific points in a scatter plot of customer satisfaction versus spending.
The selected points are highlighted in a bar chart showing spending categories and a line graph of
satisfaction trends, allowing for in-depth analysis of how selected customers’ spending relates to their
satisfaction.

4. Non-Graphical Techniques
Purpose: Augment visual exploration with other methods to gain additional insights from the data.
 Tabulation
o Summarize data in tables to display aggregated information, such as counts, averages, and other
metrics.
o Example: A table showing average sales per region along with total number of transactions and
standard deviation.
 Clustering
o Group similar data points together based on characteristics to identify patterns or segments.
o Example: Using clustering algorithms to segment customers into different groups based on
purchasing behavior, which can then be analyzed for targeted marketing strategies (a sketch follows this list).
 Building Simple Models
o Develop basic models to explore relationships between variables and predict outcomes.
o Example: Creating a simple linear regression model to understand how changes in advertising
budget impact sales.
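A minimal clustering sketch, assuming scikit-learn's k-means and two made-up behavioural features:
python
import numpy as np
from sklearn.cluster import KMeans

# columns: [annual spend, number of purchases] -- illustrative values only
customers = np.array([
    [200, 5], [220, 6], [250, 7],        # low spenders
    [1200, 25], [1300, 28], [1250, 30],  # high spenders
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # cluster assignment per customer
print(kmeans.cluster_centers_)   # average behaviour of each segment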
Exploratory Data Analysis employs a range of techniques to help understand and visualize data. Simple graphs
like line charts and histograms provide basic insights, while combined graphs like Pareto diagrams offer more
detailed views. Interactive methods like link-and-brush enhance exploration, and non-graphical techniques such
as tabulation and clustering add depth to the analysis. These methods collectively aid in gaining a thorough
understanding of the data before proceeding to more complex modeling.
Data Modeling (or) model building
Data modeling involves applying techniques from machine learning, data mining, and statistics to make
predictions or classify data.
The "Build the Models" phase in data science involves constructing models with the aim of making better
predictions, classifying objects, or gaining a deeper understanding of the system being modeled. By this stage,
the data is already clean, and there is a clear understanding of the content and the goals. This phase is more
focused compared to the exploratory analysis step, as there is a clearer direction on what you’re looking for and
the desired outcomes.

1. Model and Variable Selection


Choose the appropriate model and variables based on exploratory data analysis and project requirements.
 Selection Criteria:
o Model Type: Statistical models (e.g., linear regression) vs. machine learning models (e.g., k-
nearest neighbors).
o Variables: Based on exploratory analysis, select variables that contribute meaningfully to the
model.
o Practical Considerations: Consider model deployment, maintenance, and interpretability.
Example: If you want to predict house prices, you might use variables like size, number of bedrooms, and
location. You could choose a linear regression model for its simplicity and interpretability or a more complex
model if performance justifies it.
2. Model Execution
Implement the chosen model using programming languages and libraries.
 Python Libraries:
o StatsModels: For statistical models, e.g., linear regression.
o Scikit-learn: For machine learning models, e.g., k-nearest neighbors.
Example:
 Linear Regression with StatsModels:
python
import statsmodels.api as sm

X = ...  # predictor variables (e.g., a NumPy array or DataFrame)
y = ...  # target variable
X = sm.add_constant(X)      # adds a constant (intercept) term to the predictors
model = sm.OLS(y, X).fit()  # fit the ordinary least squares model
print(model.summary())      # coefficients, R-squared, p-values, and more
 K-Nearest Neighbors with Scikit-learn:
python
from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train, X_test, y_test come from an earlier train/test split
model = KNeighborsClassifier(n_neighbors=5)  # classify by the 5 nearest neighbours
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)       # accuracy on the holdout data
print(f'Accuracy: {accuracy}')
3. Model Diagnostics and Model Comparison
Evaluate model performance and compare different models using error measures and diagnostic checks.
 Error Measures:
o Mean Squared Error (MSE): Measures the average squared difference between predicted and
actual values.
Formula: MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², the average of the squared differences between the actual values yᵢ and the predicted values ŷᵢ.

Example: Calculate the MSE for different models to determine which one performs better on a holdout sample.
 Holdout Sample: A portion of data reserved for testing the model's performance on unseen data.
Example: Split data into training (80%) and test (20%) sets. Train the model on the training set and evaluate it
on the test set to measure error.
 Model Comparison:
o Compare models based on performance metrics (e.g., MSE) and practical considerations (e.g.,
interpretability, deployment).
Example: Compare a simple linear regression model to a more complex machine learning model (e.g., random
forest) using their performance on a holdout sample. Choose the model with the lowest error or best performance
metric.
 Model Diagnostics: Check if the model assumptions (e.g., independence of inputs) are met and if the
model is suitable for the data.
Example: For linear regression, check residual plots to ensure homoscedasticity (constant variance of errors) and
perform tests for multicollinearity.
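The error measure and diagnostic checks described above can be combined in a short script; the sketch below uses scikit-learn, matplotlib, and synthetic data purely for illustration.
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(scale=2, size=200)   # synthetic linear relationship

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = LinearRegression().fit(X_train, y_train)

y_pred = model.predict(X_test)
print("holdout MSE:", mean_squared_error(y_test, y_pred))

# Residuals vs. predictions: a shapeless cloud around zero suggests homoscedasticity.
plt.scatter(y_pred, y_test - y_pred)
plt.axhline(0, color="red")
plt.xlabel("Predicted")
plt.ylabel("Residual")
plt.show()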
Data modeling involves selecting and executing appropriate techniques based on exploratory analysis. It
requires evaluating models using error measures and diagnostics to ensure they perform well on unseen data.
Practical considerations like deployment and interpretability also play a role in choosing the best model.

Presenting Findings and Building Applications (or) PRESENTING FINDINGS AND BUILDING
APPLICATIONS ON TOP OF THEM
Presentation of Findings
After you’ve successfully analyzed the data and built a well-performing model, the next step is to present your
findings to stakeholders. This is a pivotal and exciting phase, as it allows you to showcase the results of your hard
work.
Importance of Automation
Due to the value stakeholders may place on your models and insights, they might request repeated analyses or
updates. To meet this demand efficiently, automation becomes essential. This could involve automating model
scoring, or creating applications that automatically update reports, Excel spreadsheets, or PowerPoint
presentations.
The Role of Soft Skills
In this final stage, your soft skills are crucial. Communicating your findings effectively is key to ensuring that
stakeholders understand and appreciate the value of your work. Developing these skills is recommended, as they
significantly impact how your work is received and acted upon.
By successfully communicating your results and satisfying stakeholders, you complete the data science
process, bringing your project to a successful conclusion.

(other)
PRESENTING FINDINGS AND BUILDING APPLICATIONS ON TOP OF THEM

1. Presenting Findings
Purpose: Communicate the results of your analysis and model to stakeholders in a clear and impactful way.
 Visualizations: Use charts, graphs, and other visual aids to make data and insights easily understandable.
o Examples:
 Bar Charts and Line Graphs: To show trends and comparisons.
 Heatmaps: To highlight correlations between variables.
 Interactive Dashboards: Tools like Tableau or Power BI can be used to create interactive
presentations where stakeholders can explore the data themselves.
 Reports and Presentations: Create detailed reports or PowerPoint presentations summarizing the key
findings, methodology, and implications.
o Example: A presentation might include an introduction, methodology, results with visual aids, and
a conclusion with recommendations.
 Storytelling: Craft a narrative that explains the journey from data collection to actionable insights.
o Example: "We started by collecting customer data from our CRM, cleaned and processed it,
explored the relationships between variables, built a predictive model, and now we can accurately
forecast customer churn."
2. Automating Models
Purpose: Ensure the model's results can be replicated and updated efficiently without manual intervention.
 Model Scoring: Automate the process of applying the model to new data.
o Example: Set up a script that runs daily to score new customer data and predict churn probability.
 Automated Reports: Develop systems that automatically generate updated reports or dashboards.
o Example: Use tools like Python (with libraries such as Pandas and Matplotlib), R, or dedicated BI
tools to create reports that update automatically with new data inputs.
 Application Development: Build applications that integrate the model and its predictions into business
processes.
o Example: Create a web app that sales teams can use to input customer data and get instant churn
predictions and recommendations on actions to take.
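A minimal sketch of such a daily scoring job is shown below; the model file, input file, and column names are hypothetical, and it assumes a classifier previously saved with joblib whose feature columns match the incoming CSV.
python
import pandas as pd
from joblib import load

model = load("churn_model.joblib")                # hypothetical saved classifier
new_customers = pd.read_csv("new_customers.csv")  # hypothetical daily extract of feature columns

new_customers["churn_probability"] = model.predict_proba(new_customers)[:, 1]
new_customers.to_csv("churn_scores.csv", index=False)
print("Scored", len(new_customers), "customers")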
3. Importance of Soft Skills
Purpose: Effectively communicate and collaborate with stakeholders to ensure your work is understood and
utilized.
 Communication: Clearly explain technical details in a non-technical manner.
o Example: Instead of saying "Our model has an R-squared of 0.85," explain "Our model can
explain 85% of the variation in customer churn."
 Collaboration: Work with different departments to ensure the model meets their needs and they
understand how to use it.
o Example: Conduct training sessions for the marketing team on how to interpret the model's output
and use it for targeted campaigns.
 Feedback and Iteration: Be open to feedback and ready to make adjustments to improve the model or
presentation.
o Example: After presenting the findings, gather feedback from stakeholders to refine the model or
the way results are presented.
The final step in the data science process involves presenting findings to stakeholders and automating the
model for continuous use. Effective communication and visualization are crucial for conveying insights, while
automation ensures that the model's benefits are consistently realized. Soft skills are vital for ensuring that the
work is understood, valued, and acted upon by the stakeholders.

Model Building Overview


With clean data and a solid understanding of the content, the next step in the data science process is model
building. This phase is more focused than exploratory analysis, as you now have clear goals—whether making
predictions, classifying objects, or understanding the system you’re modeling.
Techniques and Approaches
The techniques used in this phase are borrowed from fields like machine learning, data mining, and statistics.
Although this introduction only scratches the surface, it provides enough to get started. A small set of techniques
will be applicable in most cases due to the overlap in their objectives, even though they might achieve these goals
in slightly different ways.
Iterative Nature of Model Building
Model building is an iterative process. The approach depends on whether you use classic statistical methods or
more modern machine learning techniques. Regardless of the approach, most models follow these main steps:
1. Selection of a Modeling Technique and Variables: Choosing the appropriate technique and variables for
the model.
2. Execution of the Model: Running the model based on the selected technique.
3. Diagnosis and Model Comparison: Evaluating and comparing the model’s performance to refine it
further.
This phase marks a crucial step in the data science process, laying the groundwork for more detailed
exploration of modeling techniques in subsequent chapters.
Model and Variable Selection
In the model building phase, selecting the right variables and modeling technique is crucial. The insights gained
from the exploratory analysis should guide you in identifying the variables that will help construct an effective
model.
Considerations for Model Selection
Choosing the appropriate model requires careful judgment, taking into account several factors:
 Model Performance: How well the model performs with the selected variables.
 Production Implementation: If the model needs to be deployed in a production environment, how easy
is it to implement?
 Maintenance: The difficulty of maintaining the model and how long it will remain relevant if left
unchanged.
 Ease of Explanation: Whether the model needs to be easily understandable to stakeholders.
After considering these factors, you can move forward with building and implementing the model, putting
your planning into action.
Model Execution Overview
After selecting a model, the next step is to implement it in code. This phase involves actual coding, often using
Python and its libraries, such as StatsModels or Scikit-learn, which provide powerful tools for model execution.
Setting Up the Environment
Before coding, ensure you have a Python virtual environment set up, as it’s essential for executing the code
correctly. This setup is a prerequisite, and resources like Appendix D in the book can help if it's your first time.
Coding the Model
Python simplifies model implementation with its libraries. For example, executing a linear regression model with
StatsModels involves creating random data, fitting the model, and analyzing the results. The process is relatively
straightforward due to the availability of these libraries, which save significant time compared to manual coding.
Key Concepts in Model Analysis
 Model Fit: Measured by R-squared or adjusted R-squared, which indicates how well the model captures
data variation. A high value (e.g., above 0.85 for business models) is generally considered good.
 Predictor Variables: These have coefficients that show the impact of changes in predictor variables on
the target variable.
 Predictor Significance: Determined by the p-value, which indicates whether the predictor variable
significantly influences the target variable. A p-value below 0.05 is often considered significant.
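Assuming an OLS model fitted with StatsModels as in the earlier snippet (results object named model), these quantities can be read directly from the fitted results:
python
print(model.rsquared)        # model fit
print(model.rsquared_adj)    # fit adjusted for the number of predictors
print(model.params)          # coefficients of the predictor variables
print(model.pvalues)         # predictor significance (compare against 0.05)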
Example: k-Nearest Neighbors Classification
Another example is the k-nearest neighbors (k-NN) classification, which predicts the label of an unlabeled point
based on the labels of nearby points. Using Scikit-learn, this model can classify data with reasonable accuracy,
but care must be taken not to overfit or misinterpret results.
Importance of Validation
The document emphasizes the need for a holdout sample to validate the model against new data, rather than just
the data used to build the model. This step ensures the model’s robustness and reliability in real-world scenarios.
While Python has industry-ready implementations for many techniques, it's also possible to leverage R,
another powerful tool for statistical computing, using the RPy library in Python. Overall, implementing models
in code is a crucial step in the data science process, requiring both technical skill and careful analysis.
Model Diagnostics and Model Comparison
Building Multiple Models
When building models, it's common to create multiple models and then choose the best one based on various
criteria. The goal is to select a model that performs well on unseen data, ensuring it generalizes beyond the training
set.
Holdout Sample
A holdout sample is a portion of the data set aside during model building, used later to evaluate the model’s
performance. By training the model on a fraction of the data and testing it on the holdout sample, you can assess
how well the model works on new, unseen data.
Error Measures
One common way to evaluate models is by using error measures, such as mean square error (MSE). MSE
calculates the average squared difference between the predicted and actual values. The model with the lowest
MSE on the holdout sample is typically considered the best-performing model.
Example of Model Comparison
In a provided example, two models are compared: one predicts order size based on price, and the other uses a
constant prediction. By training on 80% of the data and testing on the remaining 20%, the first model (which
accounts for price) shows a lower error, making it the preferred choice.
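A sketch of the comparison described above, using synthetic data and scikit-learn: order size is predicted once from price and once with a constant (mean) prediction, and both are scored on a 20% holdout.
python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
price = rng.uniform(10, 100, size=(500, 1))
order_size = 50 - 0.3 * price[:, 0] + rng.normal(scale=3, size=500)  # synthetic relationship

X_train, X_test, y_train, y_test = train_test_split(price, order_size, test_size=0.2, random_state=42)

price_model = LinearRegression().fit(X_train, y_train)
mse_price = mean_squared_error(y_test, price_model.predict(X_test))
mse_constant = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))

print("MSE using price:", mse_price)            # expected to be lower
print("MSE constant prediction:", mse_constant)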
Model Diagnostics
Beyond error measures, it's crucial to verify that the models meet their underlying assumptions, such as the
independence of inputs. This verification process is known as model diagnostics and ensures the model's validity.
This section introduces the essential steps for building a valid model, including model diagnostics and
comparison using a holdout sample. Once the best model is selected, the process moves on to presenting findings
and building applications based on the model.
