
Unit - 1

1. Introduction to Data Science
2. Exploratory Data Analysis and Data Science Process
3. Motivation for using Python for Data Analysis
4. Introduction to Python Jupyter Notebook
5. Essential Python Libraries:
• NumPy
• Pandas
• Matplotlib
• SciPy
• scikit-learn
• Statsmodels
• seaborn
Introduction to Data Science

Data Science is the study of data. It’s about using techniques and tools to make sense of the huge amounts of information
(data) around us. Data Science helps us understand patterns, make predictions, and solve problems in various fields like
business, healthcare, sports, and more.
Example:
• You have a diary where you write down how much money you spend every day.
• After a month, you look at the diary and notice patterns, like you spend more on weekends.
• Based on this, you decide to save money by cutting down weekend expenses.
This simple idea—collecting data, finding patterns, and making decisions—is the foundation of Data Science.

What Does a Data Scientist Do?


1. Collect Data
• Data comes from various sources like apps, websites, social media, surveys, or even sensors in devices like fitness trackers or smartwatches.
• Example: A fitness app might track your steps, heart rate, and sleep patterns. This data can later be analyzed to suggest health improvements.
• Data can be structured (like a table of sales numbers) or unstructured (like tweets, videos, or images).
2. Clean Data
Raw data is often messy. It may have errors, incomplete information, or duplicates.
Cleaning involves:
• Filling missing values (e.g., guessing a missing age based on similar users).
• Removing irrelevant parts (e.g., junk email addresses in a mailing list).
• Ensuring consistency (e.g., making sure all dates are in the same format).
Example: A customer database might contain duplicate entries for the same person. Cleaning ensures every customer is
listed only once.
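A minimal Pandas sketch of these cleaning steps, using a small made-up customer table (all column names and values here are illustrative):

```python
import pandas as pd

# Illustrative messy data: a duplicate row, a missing age, dates stored as text
raw = pd.DataFrame({
    "name": ["Asha", "Asha", "Ravi", "Meena"],
    "age": [28, 28, None, 34],
    "signup_date": ["2024-01-05", "2024-01-05", "2024-01-09", "2024-02-10"],
})

clean = raw.drop_duplicates().copy()                        # every customer listed only once
clean["age"] = clean["age"].fillna(clean["age"].median())   # fill missing values
clean["signup_date"] = pd.to_datetime(clean["signup_date"]) # one consistent date type

print(clean)
```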
3. EDA
Once cleaned, the data is analyzed to uncover patterns, trends, and insights.
Tools like Excel, Python, SQL, or R can help find answers like:
• What time of day do customers shop the most?
• Which product is most popular in different seasons?
Example: A food delivery app may analyze orders to learn that pizza orders spike on Friday nights.
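A short Pandas sketch of this kind of question, using a hypothetical order log (the columns and values are made up for illustration):

```python
import pandas as pd

# Hypothetical order log for a food delivery app
orders = pd.DataFrame({
    "item": ["pizza", "pizza", "salad", "pizza", "burger"],
    "day": ["Fri", "Fri", "Mon", "Fri", "Tue"],
    "hour": [20, 21, 12, 19, 13],
})

# Which day gets the most pizza orders?
pizza_by_day = orders[orders["item"] == "pizza"].groupby("day").size()
print(pizza_by_day.sort_values(ascending=False))
```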
4. Predict
• Data scientists use machine learning models to make future predictions based on past data.
• Examples:
  • Predict how many customers will shop during the holiday season.
  • Forecast weather patterns based on historical data.
• These predictions help organizations prepare better, like stocking up on popular products before a sale.
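A minimal scikit-learn sketch of this idea, fitting a simple linear trend to made-up monthly customer counts (the numbers are illustrative, not real data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: customers per month for the last six months
months = np.array([[1], [2], [3], [4], [5], [6]])     # feature: month number
customers = np.array([120, 135, 150, 170, 185, 205])  # target: customers that month

model = LinearRegression().fit(months, customers)     # learn the trend from past data

# Predict customer counts for the next two months
print(model.predict(np.array([[7], [8]])))
```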

5. Make Decisions
• The insights and predictions help businesses make smarter choices:
  • Offering discounts or promotions during slow sales periods.
  • Targeting ads to specific groups of people (like showing kids' toys to parents).
  • Improving products or services based on customer feedback trends.
• Example: If analysis shows people are unhappy with late deliveries, a company might decide to hire more delivery staff.
Tools in Data Science:
• Programming: Python or R to process data.
• Math & Statistics: To understand trends.
• Visualization: Graphs and charts to show results clearly.
Exploratory Data Analysis and Data Science Process

Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It is the method of studying and exploring data sets to understand their main characteristics, uncover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Key aspects of EDA include (a short code sketch follows the list):
• Distribution of Data: Examining the distribution of data points to understand their range, central tendencies (mean, median), and dispersion (variance, standard deviation).
• Graphical Representations: Utilizing charts such as histograms, box plots, scatter plots, and bar charts to visualize relationships within the data and distributions of variables.
• Outlier Detection: Identifying unusual values that deviate from other data points. Outliers can influence statistical analyses and might indicate data entry errors or unique cases.
• Correlation Analysis: Checking the relationships between variables to understand how they might affect each other. This includes computing correlation coefficients and creating correlation matrices.
• Handling Missing Values: Detecting and deciding how to address missing data points, whether by imputation or removal, depending on their impact and the amount of missing data.
• Summary Statistics: Calculating key statistics that provide insight into data trends and nuances.
• Testing Assumptions: Many statistical tests and models assume the data meet certain conditions (like normality or homoscedasticity). EDA helps verify these assumptions.
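A minimal Pandas sketch illustrating several of these aspects (summary statistics, a correlation matrix, and a simple IQR rule for outliers) on randomly generated data:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric dataset, generated at random for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=100),
    "income": rng.normal(50_000, 12_000, size=100),
})

print(df.describe())   # summary statistics: mean, median (50%), std, range
print(df.corr())       # correlation matrix between variables

# A simple interquartile-range (IQR) rule for outlier detection on one column
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers")
```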
Data Science Process Life Cycle

Some steps are necessary for any task in the field of data science to derive fruitful results from the data at hand.

• Data Collection – After formulating the problem statement, the main task is to collect data that can help us in our analysis and manipulation. Sometimes data is collected by performing surveys, and at other times by performing web scraping.

• Data Cleaning – Most real-world data is not structured and requires cleaning and conversion into structured data before it can be used for any analysis or modeling.

• Exploratory Data Analysis – This is the step in which we try to find the hidden patterns in the data at hand. We also analyze the different factors that affect the target variable and the extent to which they do so, how the independent features are related to each other, and what can be done to achieve the desired results. This also gives us a direction for getting started with the modeling process.

• Model Building – Different types of machine learning algorithms and techniques have been developed that can easily identify complex patterns in the data, a task that would be very tedious for a human.

• Model Deployment – After a model is developed and performs well on the holdout or real-world dataset, we deploy it and monitor its performance. This is the main part where we apply our learning from the data to real-world applications and use cases.
Steps in the Data Science Process:

Step 1: Define the Problem

Clearly defining the research goals is the first step in the Data Science Process. A project charter outlines the objectives,
resources, deliverables, and timeline, ensuring that all stakeholders are aligned.

Step 2: Retrieve Data

Data can be stored in databases, data warehouses, or data lakes within an organization. Accessing this data often involves
navigating company policies and requesting permissions.

Step 3: Data Cleansing, Integration, and Transformation

Data cleaning ensures that errors, inconsistencies, and outliers are removed. Data integration combines datasets from different
sources, while data transformation prepares the data for modeling by reshaping variables or creating new features.
Step 4: Exploratory Data Analysis (EDA)

During EDA, various graphical techniques like scatter plots, histograms, and box plots are used to visualize data and
identify trends. This phase helps in selecting the right modeling techniques.

Step 5: Build Models

In this step, machine learning or deep learning models are built to make predictions or classifications based on the data.
The choice of algorithm depends on the complexity of the problem and the type of data.

Step 6: Present Findings and Deploy Models

Once the analysis is complete, results are presented to stakeholders. Models are deployed into production systems to
automate decision-making or support ongoing analysis.
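One common way to deploy a model is to save the fitted object to disk so a production service can load and reuse it. A minimal sketch using joblib, one popular choice for persisting scikit-learn models (the file name and toy data here are illustrative):

```python
import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Train a toy model, then persist it so another process can reuse it
X, y = np.array([[1], [2], [3]]), np.array([10, 20, 30])
model = LinearRegression().fit(X, y)

dump(model, "model.joblib")      # save the fitted model to disk

reloaded = load("model.joblib")  # later, inside the deployed application
print(reloaded.predict([[4]]))
```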
Motivation for using Python for Data Analysis

Python has become one of the most popular programming languages for data analysis due to its versatility, simplicity, and the
vast ecosystem of libraries and tools.

Reasons to Use Python for Data Analysis


1. User-Friendly Syntax
Python has a simple and intuitive syntax, making it easy for beginners and efficient for professionals. This reduces
development time and allows analysts to focus on solving problems rather than understanding complex language rules.
2. Powerful Libraries for Data Handling
Python offers libraries like Pandas (for data manipulation), NumPy (for numerical computations), and SciPy (for
scientific calculations). These libraries simplify working with data and enable analysts to process large datasets with ease.
3. Advanced Data Visualization
Visualization tools like Matplotlib, Seaborn, and Plotly allow for creating detailed and interactive charts, graphs, and plots.
These make it easier to communicate insights and understand data patterns.

4. Support for Machine Learning and AI


Python is widely used in machine learning and AI applications. Libraries like Scikit-learn, TensorFlow, and PyTorch integrate
seamlessly into data analysis workflows, allowing analysts to create predictive models and advanced algorithms.

5. Cross-Platform Compatibility
Python works on all major operating systems (Windows, macOS, Linux) and can be integrated with various tools, databases,
and systems, ensuring flexibility in data analysis environments.
6. Community Support and Resources
Python has a large, active community that continuously develops libraries, answers questions, and shares knowledge. This
support makes it easier to find solutions and learn best practices for data analysis.

7. Open Source and Cost-Effective


Python is free and open source, making it an accessible option for individuals, startups, and enterprises. The availability of
free tools and libraries minimizes costs while maintaining high performance.
Introduction to Python Jupyter Notebook

Jupyter Notebook is an open-source, interactive web application that allows users to create and share documents containing
live code, equations, visualizations, and explanatory text. It is widely used in data science, machine learning, and scientific
research due to its versatility and ease of use.
Features of Jupyter Notebook
1. Interactive Coding:
  • Write and execute Python code in small, manageable cells.
  • See results immediately after running a cell, making it ideal for exploration and debugging.
2. Rich Text Support:
  • Combine code with markdown cells to add headings, descriptions, and formatted text.
  • Supports LaTeX for mathematical equations, making it great for academic and research documentation.
3. Data Visualization:
  • Easily generate and display plots using libraries like Matplotlib, Seaborn, and Plotly.
  • Interactive visualizations can be integrated seamlessly.
4. Language Support:
  • Although widely used with Python, Jupyter supports over 40 programming languages, including R, Julia, and Scala.
5. Notebook Sharing:
  • Share notebooks in various formats, including HTML and PDF.
  • Platforms like GitHub and JupyterHub enable collaboration and version control.
Installing Jupyter Notebook

Step 1: Download the latest Python version from python.org.

Step 2: Install and set up Python on your computer.

Step 3: Open CMD and install Jupyter Notebook (typically with the command pip install notebook).

Step 4: Open Jupyter Notebook (typically by running jupyter notebook, which opens the interface in your browser).
5. Essential Python Libraries

NumPy
NumPy (Numerical Python) is a powerful library for numerical computing. It provides support for multidimensional arrays,
mathematical operations, and linear algebra, making it the foundation for scientific computing in Python.
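A small sketch of what NumPy arrays look like in practice:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])  # a 2-D (multidimensional) array

print(a * 10)                   # elementwise math on the whole array
print(a.mean(), a.T)            # aggregate statistics and transpose
print(np.linalg.inv(a))         # linear algebra: matrix inverse
```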

Pandas
Used for data manipulation and analysis, especially with tabular data (DataFrames).
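A small sketch of a Pandas DataFrame with a derived column and a group-by aggregation (the sales table is made up for illustration):

```python
import pandas as pd

# A DataFrame: tabular data with labeled columns
sales = pd.DataFrame({
    "product": ["pen", "book", "pen", "bag"],
    "units": [10, 3, 7, 2],
    "price": [1.5, 12.0, 1.5, 25.0],
})

sales["revenue"] = sales["units"] * sales["price"]  # derive a new column
print(sales.groupby("product")["revenue"].sum())    # aggregate by product
```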
Matplotlib
Matplotlib is a plotting library used to create static, interactive, and animated visualizations in Python. It offers control over
every element of a plot, from axis labels to colors, enabling users to produce publication-quality visualizations.
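A minimal Matplotlib sketch producing a labeled line plot (the data points are illustrative):

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 170]

plt.plot(months, sales, marker="o")  # a simple line plot
plt.xlabel("Month")                  # control over every element: labels,
plt.ylabel("Sales")                  # titles, colors, markers, ...
plt.title("Monthly Sales")
plt.show()
```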

SciPy
SciPy (Scientific Python) builds on NumPy, offering advanced mathematical functions for optimization, integration,
interpolation, and more. It is widely used for scientific research and engineering applications.
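A short sketch of SciPy's optimization and integration routines on simple functions:

```python
from scipy import integrate, optimize

# Optimization: find the minimum of f(x) = (x - 3)^2
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)  # approximately 3.0

# Integration: integrate x^2 from 0 to 1
area, _ = integrate.quad(lambda x: x ** 2, 0, 1)
print(area)      # approximately 0.333
```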
scikit-learn
Scikit-learn is a machine learning library that provides simple and efficient tools for tasks like classification, regression,
clustering, and dimensionality reduction. It is built on NumPy, SciPy, and Matplotlib.
Example:
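A minimal classification sketch using scikit-learn's built-in Iris dataset (one of many possible examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # a built-in sample dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)  # train a simple classifier
print(clf.score(X_test, y_test))                      # accuracy on unseen data
```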

Statsmodels
Statsmodels is a library for statistical modeling, hypothesis testing, and data exploration. It provides tools for fitting statistical
models like linear regression, time series analysis, and more.
Example:
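A minimal ordinary least squares (OLS) regression sketch with Statsmodels (the x and y values are made up):

```python
import numpy as np
import statsmodels.api as sm

# Fit an OLS regression: y is approximately b0 + b1 * x
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

X = sm.add_constant(x)   # add the intercept term b0
model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients, p-values, R-squared, ...
```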
Seaborn
Seaborn is a data visualization library built on top of Matplotlib. It simplifies the process of creating attractive and informative
statistical graphics. Seaborn is particularly useful for visualizing relationships in data.
Example:
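A minimal Seaborn sketch using its built-in "tips" sample dataset (note that load_dataset fetches the data over the internet):

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")  # a built-in sample dataset

# Relationship between bill size and tip, split by time of day
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()
```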
