Unit-1 IDS
(or)
2. Unstructured Data
Definition: Data that doesn’t fit neatly into a predefined model, often context-specific or varying.
Example: Emails containing mixed elements like sender, title, and body text.
(or)
7. Streaming Data
Definition: Data that continuously flows into a system as events happen, requiring real-time processing.
Examples: Twitter trends, live event data, stock market data
The Data Science Process
Data Sources: Utilize both internal and external sources for comprehensive data collection.
Quality Assurance: Early and continuous quality checks are vital to avoid issues in later stages.
Step 3: Cleansing, Integrating, and Transforming Data
Data received from the retrieval phase often needs significant preparation. Cleansing, integrating, and
transforming data are crucial for ensuring that models work accurately and efficiently.
1 Cleansing Data
Goal: Remove errors and inconsistencies to make data a true and consistent representation.
Types of Errors and Solutions:
1. Data Entry Errors:
o Example: A person’s age recorded as 350 years.
o Solution: Use sanity checks to find and correct these errors.
2. Redundant Whitespace:
o Example: Keys like "FR " vs. "FR" causing mismatches.
o Solution: Use functions like strip() in Python to remove leading and trailing spaces.
3. Capital Letter Mismatches:
o Example: "Brazil" vs. "brazil."
o Solution: Convert all text to lowercase using functions like .lower().
4. Impossible Values:
o Example: A height of 3 meters.
o Solution: Implement rules to flag and correct unrealistic values.
5. Outliers:
o Example: A salary listed as $1,000,000 when most are under $100,000.
o Solution: Use plots or statistical methods to detect and assess outliers.
6. Missing Values:
o Techniques:
Omit values (lose information).
Set value to null (may cause issues with some models).
Impute a static value or estimate (e.g., mean value).
Model the value (may introduce assumptions).
7. Inconsistencies Between Data Sets:
o Example: Different units or currencies (prices recorded in Pounds vs. Dollars).
o Solution: Standardize units and use code books for consistent terminology.
Advanced Methods:
Use diagnostic plots or regression to identify influential outliers or errors.
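The diagnostic advice above can be sketched with a modified z-score check, a robust variant that uses the median and MAD instead of the mean and standard deviation (the 0.6745 scaling factor and the 3.5 cutoff are a common convention, not part of these notes):

```python
# Robust outlier check: the modified z-score is based on the median and the
# median absolute deviation (MAD), so it is far less distorted by the
# outlier itself than a mean/stdev-based check would be.
import statistics

data = [10, 12, 15, 20, 1000]
median = statistics.median(data)
mad = statistics.median([abs(x - median) for x in data])
# 0.6745 scales the MAD to be comparable with a standard deviation;
# 3.5 is a commonly used cutoff.
outliers = [x for x in data if 0.6745 * abs(x - median) / mad > 3.5]
```

Here the 1000 is flagged even though, with only five values, a naive mean/stdev z-score would mask it.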
Examples:
Data Cleansing
1. Data Entry Errors:
o Mistakes during data entry: Correcting "Godo" to "Good" and "Bade" to "Bad".
python
data = ["Good", "Bad", "Godo", "Bade"]
cleaned_data = ["Good" if x == "Godo" else "Bad" if x == "Bade" else x for x in data]
o Redundant whitespaces: Removing leading and trailing spaces from " Apple ".
python
string = " Apple "
cleaned_string = string.strip()
o Capital letter mismatches: Standardizing "Brazil" and "brazil" to lowercase.
python
string1 = "Brazil"
string2 = "brazil"
standardized_string1 = string1.lower()
standardized_string2 = string2.lower()
o Impossible values: Removing ages over 150.
python
ages = [25, 34, 300, 45]
valid_ages = [age for age in ages if age <= 150]
2. Outliers:
o Identifying outliers using a plot or summary statistics.
python
import matplotlib.pyplot as plt
data = [10, 12, 15, 20, 1000]
plt.boxplot(data)
plt.show()
3. Missing Values:
o Imputing missing values with the mean.
python
import numpy as np
data = [1, 2, np.nan, 4]
mean_value = np.nanmean(data)
imputed_data = [x if not np.isnan(x) else mean_value for x in data]
4. Deviations from Code Book:
o Ensuring "Female" and "F" are consistent.
python
genders = ["Female", "F", "Male"]
standardized_genders = ["Female" if g == "F" else g for g in genders]
5. Different Units of Measurement:
o Converting prices from dollars to euros.
python
prices_in_dollars = [10, 20, 30]
conversion_rate = 0.85 # 1 dollar = 0.85 euros
prices_in_euros = [price * conversion_rate for price in prices_in_dollars]
6. Different Levels of Aggregation:
o Aggregating daily data to weekly data.
python
daily_sales = [100, 200, 150, 300, 250, 400, 500]
weekly_sales = sum(daily_sales)
3 Integrating Data
Objective: Combine data from different sources into a single, coherent dataset.
1. Joining Tables:
o Example: Enrich each customer record by joining a sales table to a customer table on a shared key.
o Result: One table whose rows carry information from both sources.
2. Appending Tables:
o Example: Combine sales data from January and February into a single table.
o Result: Creates a comprehensive dataset covering both months.
o Example: Appending Tables: Adding observations from one table to another.
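A minimal sketch of appending, using plain lists of row dictionaries (the month/product/sales columns are invented for illustration; with pandas the same step would typically use pd.concat):

```python
# Appending stacks the rows of one table under another table with the
# same columns, producing one longer table.
january = [{"month": "Jan", "product": "A", "sales": 100},
           {"month": "Jan", "product": "B", "sales": 150}]
february = [{"month": "Feb", "product": "A", "sales": 120},
            {"month": "Feb", "product": "B", "sales": 180}]

combined = january + february  # one table covering both months
```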
3. Using Views:
o Example: Create a virtual table combining monthly sales without duplicating data.
o Advantage: Saves disk space but may require more processing power.
o Example: Using Views: Creating virtual tables to avoid data duplication.
def create_view(january_sales, february_sales):
    # A view computes the combined table on demand instead of storing a copy.
    return january_sales + february_sales

yearly_sales_view = create_view([100, 200], [150, 300])
4. Enriching Aggregated Measures:
o Example: Calculate growth percentage or rank sales by product class.
o Result: Adds valuable context and insights for modeling.
o Enriching Aggregated Measures: Adding calculated information for better analysis.
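The growth-percentage and ranking enrichments can be sketched as follows (the monthly figures are invented):

```python
# Enrich a small sales table with two derived columns:
# month-over-month growth and a sales rank.
sales = [{"month": "Jan", "sales": 100},
         {"month": "Feb", "sales": 150},
         {"month": "Mar", "sales": 120}]

# Growth percentage relative to the previous month (None for the first).
for prev, cur in zip(sales, sales[1:]):
    cur["growth_pct"] = round((cur["sales"] - prev["sales"])
                              / prev["sales"] * 100, 1)
sales[0]["growth_pct"] = None

# Rank months by sales (1 = highest).
ranked = sorted(sales, key=lambda row: row["sales"], reverse=True)
for rank, row in enumerate(ranked, start=1):
    row["rank"] = rank
```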
4 Transforming Data
Objective: Adjust data into a suitable format for modeling.
1. Transforming Variables:
o Example: Apply log transformation to linearize relationships.
o Effect: Simplifies complex relationships for modeling.
Transforming Variables: Simplifying relationships (e.g., using logarithms).
python
import numpy as np
x = [1, 2, 3, 4]
log_x = np.log(x)
2. Reducing Variables:
o Example: Use Principal Component Analysis (PCA) to reduce dimensions.
o Effect: Simplifies models and improves performance by focusing on key variables.
Reducing Variables: Using methods like PCA to reduce dimensionality while retaining information.
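A minimal PCA sketch using NumPy's SVD (the six 2-D points are invented sample data; in practice one would typically use scikit-learn's PCA):

```python
# PCA via SVD: project 2-D points onto their first principal component,
# the direction that captures the most variance.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Xc = X - X.mean(axis=0)              # center each column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)      # variance explained per component
reduced = Xc @ Vt[0]                 # 1-D projection keeping most variance
```

For these points the first component explains well over 90% of the variance, so little information is lost by dropping the second dimension.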
Turning Variables into Dummies: Converting categorical variables into binary dummy variables for modeling.
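A pure-Python sketch of dummy coding (the color values are invented; pandas users would normally reach for pd.get_dummies):

```python
# Turn one categorical column into one binary (0/1) column per category.
colors = ["red", "green", "red", "blue"]
categories = sorted(set(colors))           # the distinct category values

dummies = [{f"is_{c}": int(value == c) for c in categories}
           for value in colors]
```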
Goals of EDA
1. Understand Data Structure: Explore and visualize data to grasp its underlying structure and patterns.
2. Identify Anomalies: Discover anomalies or errors in the data that were missed earlier.
3. Generate Hypotheses: Formulate hypotheses about relationships between variables.
4. Improve Data Quality: Although not the primary goal, EDA often reveals data quality issues that need
to be addressed.
Exploratory Data Analysis (EDA) Techniques
1. Simple Graphs
Purpose: Simple graphs help visualize individual aspects of the data, making it easier to identify patterns,
distributions, and trends.
Line Graphs
o Plot points connected by lines to show changes over time or trends in continuous variables.
o Example: Tracking monthly sales figures for a year to observe seasonal trends.
Histograms
o Divide a range of values into discrete bins and count the number of occurrences in each bin.
o Example: A histogram showing the distribution of ages in a dataset, where age ranges (e.g., 0-5,
6-10) are plotted on the x-axis and counts on the y-axis.
Bar Charts
o Display categorical data with rectangular bars. The length of each bar represents the count or value
of the category.
o Example: A bar chart comparing the total sales of different product categories.
2. Combined Graphs
Purpose: Combine multiple simple graphs to provide a more comprehensive view of the data, revealing
relationships and patterns across different variables.
Pareto Diagram (80-20 Rule)
o A combination of bar charts and cumulative distribution lines. It helps identify the most significant
factors contributing to a particular outcome.
o Example: A Pareto diagram showing that 20% of the complaint types account for 80% of the
customer complaints, guiding where to focus improvement efforts.
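The numerical half of a Pareto diagram, the cumulative percentage line, can be sketched as follows (the complaint counts are invented):

```python
# Pareto analysis: sort complaint categories by count, then compute the
# cumulative share of total complaints each category contributes.
complaints = {"billing": 120, "delivery": 45, "quality": 20,
              "support": 10, "other": 5}
total = sum(complaints.values())
ranked = sorted(complaints.items(), key=lambda kv: kv[1], reverse=True)

cumulative, running = [], 0
for category, count in ranked:
    running += count
    cumulative.append((category, round(100 * running / total, 1)))
```

Here the top two of five categories already cover over 80% of all complaints, which is exactly the pattern a Pareto diagram makes visible.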
Composite Graphs
o Combining different types of plots (e.g., bar charts and line graphs) to visualize multiple aspects
of the data simultaneously.
o Example: A composite graph displaying total sales (bar chart) and profit margin percentage (line
graph) for different months.
3. Link and Brush
Purpose: Interactively link multiple graphs so that selecting observations in one graph highlights the
same observations in the others.
o Example: Brushing a cluster of points in a scatter plot and seeing the corresponding bars
highlighted in a linked histogram.
4. Non-Graphical Techniques
Purpose: Augment visual exploration with other methods to gain additional insights from the data.
Tabulation
o Summarize data in tables to display aggregated information, such as counts, averages, and other
metrics.
o Example: A table showing average sales per region along with total number of transactions and
standard deviation.
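Such a tabulation can be sketched with the standard library (the regions and amounts are invented):

```python
# Tabulate average sales, transaction count, and standard deviation
# per region from a flat list of (region, amount) transactions.
import statistics
from collections import defaultdict

transactions = [("North", 100), ("North", 150), ("North", 120),
                ("South", 80), ("South", 90)]

by_region = defaultdict(list)
for region, amount in transactions:
    by_region[region].append(amount)

table = {region: {"avg": statistics.mean(vals),
                  "n": len(vals),
                  "stdev": statistics.stdev(vals)}
         for region, vals in by_region.items()}
```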
Clustering
o Group similar data points together based on characteristics to identify patterns or segments.
o Example: Using clustering algorithms to segment customers into different groups based on
purchasing behavior, which can then be analyzed for targeted marketing strategies.
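A toy 1-D k-means sketch illustrates the idea (the spend values and starting centroids are invented, and the sketch assumes no cluster ever becomes empty):

```python
# Tiny 1-D k-means: assign each point to its nearest centroid, recompute
# centroids as cluster means, and repeat for a fixed number of iterations.
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) for c in clusters]
    return centroids, clusters

spend = [10, 12, 11, 95, 100, 98]          # two obvious customer segments
centroids, clusters = kmeans_1d(spend, centroids=[0, 50])
```

The low-spend and high-spend customers fall into separate clusters, the kind of segmentation that could then drive targeted marketing.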
Building Simple Models
o Develop basic models to explore relationships between variables and predict outcomes.
o Example: Creating a simple linear regression model to understand how changes in advertising
budget impact sales.
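The least-squares fit behind such a regression can be sketched directly (the budget and sales values are invented and chosen to lie exactly on a line):

```python
# Ordinary least squares for one predictor: slope and intercept of the
# best-fit line sales = slope * budget + intercept.
budget = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]                  # exactly sales = 2*budget + 1

n = len(budget)
mean_x = sum(budget) / n
mean_y = sum(sales) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(budget, sales))
         / sum((x - mean_x) ** 2 for x in budget))
intercept = mean_y - slope * mean_x
```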
Exploratory Data Analysis employs a range of techniques to help understand and visualize data. Simple graphs
like line charts and histograms provide basic insights, while combined graphs like Pareto diagrams offer more
detailed views. Interactive methods like link-and-brush enhance exploration, and non-graphical techniques such
as tabulation and clustering add depth to the analysis. These methods collectively aid in gaining a thorough
understanding of the data before proceeding to more complex modeling.
Data Modeling (or) Model Building
Data modeling involves applying techniques from machine learning, data mining, and statistics to make
predictions or classify data.
The "Build the Models" phase in data science involves constructing models with the aim of making better
predictions, classifying objects, or gaining a deeper understanding of the system being modeled. By this stage,
the data is already clean, and there is a clear understanding of the content and the goals. This phase is more
focused compared to the exploratory analysis step, as there is a clearer direction on what you’re looking for and
the desired outcomes.
Example: Calculate the MSE for different models to determine which one performs better on a holdout sample.
Holdout Sample: A portion of data reserved for testing the model's performance on unseen data.
Example: Split data into training (80%) and test (20%) sets. Train the model on the training set and evaluate it
on the test set to measure error.
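The 80/20 holdout evaluation can be sketched as follows (the toy data and the two "models" are invented; real data should be shuffled before splitting):

```python
# Holdout evaluation: fit on the first 80% of rows, score MSE on the rest.
data = [(x, 2 * x + 1) for x in range(10)]        # toy (x, y) pairs
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

# "Model" A predicts y = 2x + 1; "model" B predicts the training mean.
train_mean = sum(y for _, y in train) / len(train)
mse_a = sum((2 * x + 1 - y) ** 2 for x, y in test) / len(test)
mse_b = sum((train_mean - y) ** 2 for _, y in test) / len(test)
# The model with the lower holdout MSE would be preferred.
```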
Model Comparison:
o Compare models based on performance metrics (e.g., MSE) and practical considerations (e.g.,
interpretability, deployment).
Example: Compare a simple linear regression model to a more complex machine learning model (e.g., random
forest) using their performance on a holdout sample. Choose the model with the lowest error or best performance
metric.
Model Diagnostics: Check if the model assumptions (e.g., independence of inputs) are met and if the
model is suitable for the data.
Example: For linear regression, check residual plots to ensure homoscedasticity (constant variance of errors) and
perform tests for multicollinearity.
Data modeling involves selecting and executing appropriate techniques based on exploratory analysis. It
requires evaluating models using error measures and diagnostics to ensure they perform well on unseen data.
Practical considerations like deployment and interpretability also play a role in choosing the best model.
Presenting Findings and Building Applications (or) PRESENTING FINDINGS AND BUILDING
APPLICATIONS ON TOP OF THEM
Presentation of Findings
After you’ve successfully analyzed the data and built a well-performing model, the next step is to present your
findings to stakeholders. This is a pivotal and exciting phase, as it allows you to showcase the results of your hard
work.
Importance of Automation
Due to the value stakeholders may place on your models and insights, they might request repeated analyses or
updates. To meet this demand efficiently, automation becomes essential. This could involve automating model
scoring, or creating applications that automatically update reports, Excel spreadsheets, or PowerPoint
presentations.
The Role of Soft Skills
In this final stage, your soft skills are crucial. Communicating your findings effectively is key to ensuring that
stakeholders understand and appreciate the value of your work. Developing these skills is recommended, as they
significantly impact how your work is received and acted upon.
By successfully communicating your results and satisfying stakeholders, you complete the data science
process, bringing your project to a successful conclusion.
(or)
PRESENTING FINDINGS AND BUILDING APPLICATIONS ON TOP OF THEM
1. Presenting Findings
Purpose: Communicate the results of your analysis and model to stakeholders in a clear and impactful way.
Visualizations: Use charts, graphs, and other visual aids to make data and insights easily understandable.
o Examples:
Bar Charts and Line Graphs: To show trends and comparisons.
Heatmaps: To highlight correlations between variables.
Interactive Dashboards: Tools like Tableau or Power BI can be used to create interactive
presentations where stakeholders can explore the data themselves.
Reports and Presentations: Create detailed reports or PowerPoint presentations summarizing the key
findings, methodology, and implications.
o Example: A presentation might include an introduction, methodology, results with visual aids, and
a conclusion with recommendations.
Storytelling: Craft a narrative that explains the journey from data collection to actionable insights.
o Example: "We started by collecting customer data from our CRM, cleaned and processed it,
explored the relationships between variables, built a predictive model, and now we can accurately
forecast customer churn."
2. Automating Models
Purpose: Ensure the model's results can be replicated and updated efficiently without manual intervention.
Model Scoring: Automate the process of applying the model to new data.
o Example: Set up a script that runs daily to score new customer data and predict churn probability.
Automated Reports: Develop systems that automatically generate updated reports or dashboards.
o Example: Use tools like Python (with libraries such as Pandas and Matplotlib), R, or dedicated BI
tools to create reports that update automatically with new data inputs.
Application Development: Build applications that integrate the model and its predictions into business
processes.
o Example: Create a web app that sales teams can use to input customer data and get instant churn
predictions and recommendations on actions to take.
3. Importance of Soft Skills
Purpose: Effectively communicate and collaborate with stakeholders to ensure your work is understood and
utilized.
Communication: Clearly explain technical details in a non-technical manner.
o Example: Instead of saying "Our model has an R-squared of 0.85," explain "Our model can
explain 85% of the variation in customer churn."
Collaboration: Work with different departments to ensure the model meets their needs and they
understand how to use it.
o Example: Conduct training sessions for the marketing team on how to interpret the model's output
and use it for targeted campaigns.
Feedback and Iteration: Be open to feedback and ready to make adjustments to improve the model or
presentation.
o Example: After presenting the findings, gather feedback from stakeholders to refine the model or
the way results are presented.
The final step in the data science process involves presenting findings to stakeholders and automating the
model for continuous use. Effective communication and visualization are crucial for conveying insights, while
automation ensures that the model's benefits are consistently realized. Soft skills are vital for ensuring that the
work is understood, valued, and acted upon by the stakeholders.