Unit - 1
Data Science is the study of data. It’s about using techniques and tools to make sense of the huge amounts of information
(data) around us. Data Science helps us understand patterns, make predictions, and solve problems in various fields like
business, healthcare, sports, and more.
Example:
• You have a diary where you write down how much money you spend every day.
• After a month, you look at the diary and notice patterns, like you spend more on weekends.
• Based on this, you decide to save money by cutting down weekend expenses.
This simple idea—collecting data, finding patterns, and making decisions—is the foundation of Data Science.
5. Make Decisions
•The insights and predictions help businesses make smarter choices:
• Offering discounts or promotions during slow sales periods.
• Targeting ads to specific groups of people (like showing kids' toys to parents).
• Improving products or services based on customer feedback trends.
•Example: If analysis shows people are unhappy with late deliveries, a company might decide to hire more delivery staff.
Tools in Data Science:
•Programming: Python or R to process data.
•Math & Statistics: To understand trends.
•Visualization: Graphs and charts to show results clearly.
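A tiny illustration of these tools working together, continuing the diary example above (all numbers are made up):
import statistics

# Hypothetical daily spending recorded in the diary.
weekday_spend = [120, 90, 110, 100, 95]
weekend_spend = [250, 300]

# Math & statistics: compare average spending to find the pattern.
print("weekday mean:", statistics.mean(weekday_spend))   # 103
print("weekend mean:", statistics.mean(weekend_spend))   # 275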
Exploratory Data Analysis and the Data Science Process
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It is the method of studying and exploring
data sets to understand their main characteristics, uncover patterns, locate outliers, and identify relationships between
variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Key aspects of EDA include:
•Distribution of Data: Examining the distribution of data points to understand their range, central tendencies (mean, median),
and dispersion (variance, standard deviation).
•Graphical Representations: Utilizing charts such as histograms, box plots, scatter plots, and bar charts to visualize
relationships within the data and distributions of variables.
•Outlier Detection: Identifying unusual values that deviate from other data points. Outliers can influence statistical analyses
and might indicate data entry errors or unique cases.
•Correlation Analysis: Checking the relationships between variables to understand how they might affect each other. This
includes computing correlation coefficients and creating correlation matrices.
•Handling Missing Values: Detecting and deciding how to address missing data points, whether by imputation or removal,
depending on their impact and the amount of missing data.
•Summary Statistics: Calculating key statistics such as counts, means, quartiles, and ranges that summarize the data's trends and nuances.
•Testing Assumptions: Many statistical tests and models assume the data meet certain conditions (like normality or
homoscedasticity). EDA helps verify these assumptions.
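A minimal sketch covering several of these aspects with pandas; the file name sales.csv and the column revenue are hypothetical placeholders:
import pandas as pd

# Load a hypothetical dataset (file name is a placeholder).
df = pd.read_csv("sales.csv")

# Summary statistics and distribution: mean, quartiles, dispersion.
print(df.describe())

# Missing values: count per column, then impute with column medians.
print(df.isna().sum())
df = df.fillna(df.median(numeric_only=True))

# Correlation analysis: pairwise correlation matrix of numeric columns.
print(df.corr(numeric_only=True))

# Outlier detection: flag rows more than 3 standard deviations from the mean.
z = (df["revenue"] - df["revenue"].mean()) / df["revenue"].std()
print(df[z.abs() > 3])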
Data Science Process Life Cycle
Certain steps are necessary in any data science task to derive fruitful results from the data at hand.
•Data Collection – After formulating the problem statement, the main task is to collect data that can support our analysis
and manipulation. Sometimes data is gathered through surveys, and at other times through web scraping.
•Data Cleaning – Most real-world data is unstructured and requires cleaning and conversion into a structured form before it
can be used for any analysis or modeling.
•Exploratory Data Analysis – In this step we try to find the hidden patterns in the data at hand. We analyze the factors that
affect the target variable and the extent to which they do so, how the independent features relate to each other, and what can
be done to achieve the desired results. This also gives us a direction for getting started with the modeling process.
•Model Building – Machine learning algorithms and techniques have been developed that can identify complex patterns in
data, a task that would be very tedious for a human to do by hand.
•Model Deployment – Once a model performs well on the holdout or real-world dataset, we deploy it and monitor its
performance. This is the main part where our learning from the data is applied to real-world applications and use cases.
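A minimal end-to-end sketch of the collection-to-evaluation portion of this life cycle, using scikit-learn's bundled iris dataset in place of collected data (the choice of model is illustrative):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: a bundled dataset stands in for a survey or scrape.
X, y = load_iris(return_X_y=True)

# Hold out a test set to act as the "real-world" data mentioned above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Model building: fit a simple classifier.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Before deployment: check performance on the holdout set.
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))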
Steps of the Data Science Process:
Step 1: Defining Research Goals
Clearly defining the research goals is the first step in the Data Science Process. A project charter outlines the objectives,
resources, deliverables, and timeline, ensuring that all stakeholders are aligned.
Step 2: Retrieving Data
Data can be stored in databases, data warehouses, or data lakes within an organization. Accessing this data often involves
navigating company policies and requesting permissions.
Step 3: Data Preparation
Data cleaning ensures that errors, inconsistencies, and outliers are removed. Data integration combines datasets from different
sources, while data transformation prepares the data for modeling by reshaping variables or creating new features.
Step 4: Exploratory Data Analysis (EDA)
During EDA, various graphical techniques like scatter plots, histograms, and box plots are used to visualize data and
identify trends. This phase helps in selecting the right modeling techniques.
Step 5: Model Building
In this step, machine learning or deep learning models are built to make predictions or classifications based on the data.
The choice of algorithm depends on the complexity of the problem and the type of data.
Step 6: Presentation and Deployment
Once the analysis is complete, results are presented to stakeholders. Models are deployed into production systems to
automate decision-making or support ongoing analysis.
Motivation for using Python for Data Analysis
Python has become one of the most popular programming languages for data analysis due to its versatility, simplicity, and the
vast ecosystem of libraries and tools.
5. Cross-Platform Compatibility
Python works on all major operating systems (Windows, macOS, Linux) and can be integrated with various tools, databases,
and systems, ensuring flexibility in data analysis environments.
6. Community Support and Resources
Python has a large, active community that continuously develops libraries, answers questions, and shares knowledge. This
support makes it easier to find solutions and learn best practices for data analysis.
Jupyter Notebook
Jupyter Notebook is an open-source, interactive web application that allows users to create and share documents containing
live code, equations, visualizations, and explanatory text. It is widely used in data science, machine learning, and scientific
research due to its versatility and ease of use.
Features of Jupyter Notebook
1. Interactive Coding:
• Write and execute Python code in small, manageable cells.
• See results immediately after running a cell, making it ideal for exploration and debugging.
2. Rich Text Support:
• Combine code with markdown cells to add headings, descriptions, and formatted text.
• Supports LaTeX for mathematical equations, making it great for academic and research documentation.
3. Data Visualization:
• Easily generate and display plots using libraries like Matplotlib, Seaborn, and Plotly.
• Interactive visualizations can be integrated seamlessly.
4. Language Support:
• Although widely used with Python, Jupyter supports over 40 programming languages, including R, Julia, and Scala.
5. Notebook Sharing:
• Share notebooks in various formats, including HTML and PDF.
• Platforms like GitHub and JupyterHub enable collaboration and version control.
Installing Jupyter Notebook
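Jupyter Notebook can be installed with pip (assuming Python and pip are already set up) and launched from the terminal:

pip install notebook
jupyter notebook

The second command opens the notebook interface in the default web browser.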
NumPy
NumPy (Numerical Python) is a powerful library for numerical computing. It provides support for multidimensional arrays,
mathematical operations, and linear algebra, making it the foundation for scientific computing in Python.
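Example (a minimal sketch; the array values are illustrative):
import numpy as np

# Create a 2-D array and apply vectorized operations.
a = np.array([[1, 2], [3, 4]])
print(a.mean())           # 2.5 — mean of all elements
print(a @ a)              # matrix multiplication
print(np.linalg.inv(a))   # inverse, from NumPy's linear algebra module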
Pandas
Pandas is used for data manipulation and analysis, especially with tabular data (DataFrames).
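Example (a minimal sketch; the data is made up):
import pandas as pd

# Build a small DataFrame and aggregate it by group.
df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune"],
                   "sales": [100, 150, 200]})
print(df.groupby("city")["sales"].sum())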
Matplotlib
Matplotlib is a plotting library used to create static, interactive, and animated visualizations in Python. It offers control over
every element of a plot, from axis labels to colors, enabling users to produce publication-quality visualizations.
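Example (a minimal sketch; the data points are illustrative):
import matplotlib.pyplot as plt

# A simple labeled line plot.
days = [1, 2, 3, 4]
sales = [10, 20, 15, 25]
plt.plot(days, sales, marker="o")
plt.xlabel("Day")
plt.ylabel("Sales")
plt.title("Daily Sales")
plt.show()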
SciPy
SciPy (Scientific Python) builds on NumPy, offering advanced mathematical functions for optimization, integration,
interpolation, and more. It is widely used for scientific research and engineering applications.
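Example (a minimal sketch; the function being minimized is illustrative):
from scipy import optimize

# Find the minimum of a simple quadratic, (x - 3)^2.
result = optimize.minimize(lambda x: (x - 3) ** 2, x0=0.0)
print(result.x)  # approximately [3.]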
scikit-learn
Scikit-learn is a machine learning library that provides simple and efficient tools for tasks like classification, regression,
clustering, and dimensionality reduction. It is built on NumPy, SciPy, and Matplotlib.
Example:
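A minimal sketch using the bundled iris dataset (the choice of clustering here is ours, for illustration only):
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Group the iris measurements into three clusters.
X, _ = load_iris(return_X_y=True)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])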
Statsmodels
Statsmodels is a library for statistical modeling, hypothesis testing, and data exploration. It provides tools for fitting statistical
models like linear regression, time series analysis, and more.
Example:
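A minimal sketch fitting an ordinary least squares regression on synthetic data:
import numpy as np
import statsmodels.api as sm

# Synthetic data: y is roughly a line with slope 2 and intercept 1.
rng = np.random.default_rng(0)
x = np.arange(10)
y = 2 * x + 1 + rng.normal(scale=0.5, size=10)

X = sm.add_constant(x)      # add an intercept term
model = sm.OLS(y, X).fit()  # fit the linear regression
print(model.summary())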
Seaborn
Seaborn is a data visualization library built on top of Matplotlib. It simplifies the process of creating attractive and informative
statistical graphics. Seaborn is particularly useful for visualizing relationships in data.
Example:
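A minimal sketch using seaborn's bundled tips dataset (downloaded on first use):
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of tip against total bill, colored by meal time.
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()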