UNIT 1,2
Data analysis is an iterative process that involves several key steps: data collection,
preparation (or wrangling), exploratory data analysis (EDA), and drawing conclusions.
This workflow is often heavily weighted towards data preparation, which can consume
up to 80% of a data scientist's time, despite being the least enjoyable aspect of their
work.
1. Data Collection
The first step in data analysis is data collection. This can begin even before the actual
data is obtained, as it involves determining what to investigate and what data will be
useful. Common sources of data include:
• Web Scraping: Extracting data from websites using tools like Selenium,
Requests, Scrapy, and BeautifulSoup.
• APIs: Collecting data from web services using the Requests package.
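For instance, here is a minimal sketch of pulling JSON data from a web service with Requests; the URL and query parameters are hypothetical placeholders, not a real endpoint:

```python
import requests

# Hypothetical endpoint and parameters, purely for illustration.
URL = "https://api.example.com/daily-sales"

response = requests.get(URL, params={"start": "2024-01-01", "end": "2024-01-31"}, timeout=10)
response.raise_for_status()        # stop early if the request failed
records = response.json()         # parse the JSON payload into Python objects
print(len(records), "records collected")
```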
It's crucial to collect relevant data that will help answer the questions posed in the
analysis. For instance, if analyzing the relationship between temperature and hot
chocolate sales, one should focus on sales data and temperature records, rather than
unrelated metrics.
2. Data Wrangling
Data wrangling is the process of cleaning and preparing data for analysis. Data is often
"dirty," meaning it may contain errors or inconsistencies. Common issues include:
• Human Errors: Incorrect data entry or multiple versions of the same entry (e.g.,
"New York City," "NYC," "nyc").
• Relevance: Data collected for other purposes may not be suitable for the
current analysis.
Addressing these issues is essential to ensure the integrity of the analysis. Chapters 3
and 4 of the book will delve deeper into data wrangling techniques.
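As a small illustration, here is a sketch of standardizing inconsistent city labels with pandas; the data is made up:

```python
import pandas as pd

# Toy data illustrating multiple versions of the same entry.
df = pd.DataFrame({
    "city": ["New York City", "NYC", "nyc", "Boston"],
    "sales": [120, 95, 88, 60],
})

# Standardize the inconsistent city labels before analysis.
df["city"] = (
    df["city"]
    .str.strip()
    .str.lower()
    .replace({"nyc": "new york city"})
)
print(df.groupby("city")["sales"].sum())
```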
3. Exploratory Data Analysis (EDA)
EDA involves using visualizations and summary statistics to understand the data better.
Visualizations are crucial as they can reveal patterns and insights that may not be
apparent from raw data alone. A common EDA task is identifying outliers.
However, care must be taken to avoid misleading visualizations, such as those caused
by inappropriate scaling of axes. EDA and data wrangling are closely linked, as data
often needs to be cleaned before effective analysis can occur.
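A minimal sketch of basic EDA with pandas and matplotlib, using made-up temperature and hot chocolate sales figures:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data for illustration: the colder the day, the more hot chocolate sold.
df = pd.DataFrame({
    "temperature": [30, 35, 40, 50, 60, 70, 80],
    "hot_chocolate_sales": [95, 88, 80, 60, 40, 22, 10],
})

print(df.describe())   # summary statistics for each column

df.plot(kind="scatter", x="temperature", y="hot_chocolate_sales",
        title="Hot chocolate sales vs. temperature")
plt.show()
```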
4. Drawing Conclusions
After data collection, cleaning, and EDA, the next step is to draw conclusions. This
involves summarizing findings and deciding on next steps, such as whether to build a
model. If modeling is pursued, it typically falls under the realm of machine learning and
statistics, which will be covered in later chapters.
Statistical Foundations
Statistics play a vital role in data analysis, with two main categories: descriptive and
inferential statistics.
• Descriptive Statistics: These summarize the sample data, providing insights
into its characteristics.
• Inferential Statistics: These use sample data to make inferences about the
larger population.
Sampling
A key principle in statistics is that samples must be random and representative of the
population to avoid bias. Various sampling methods exist, including simple random
sampling and stratified random sampling.
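A rough sketch of both sampling approaches with pandas, using a made-up population:

```python
import pandas as pd

# Made-up population with an imbalanced group structure.
population = pd.DataFrame({
    "group": ["A"] * 80 + ["B"] * 20,
    "value": range(100),
})

# Simple random sampling: every row has the same chance of being chosen.
simple = population.sample(n=10, random_state=0)

# Stratified random sampling: take the same fraction from each group.
stratified = (
    population.groupby("group", group_keys=False)
    .apply(lambda g: g.sample(frac=0.1, random_state=0))
)

print(simple["group"].value_counts())
print(stratified["group"].value_counts())
```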
Descriptive Statistics
• Measures of Spread: the interquartile range (IQR) is the range between the first and
third quartiles (Q3 − Q1), indicating the spread of the middle 50% of the data.
Summarizing Data
The 5-number summary (minimum, Q1, median, Q3, maximum) provides a quick
overview of the data distribution. Box plots visually represent this summary, highlighting
the median, quartiles, and potential outliers.
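A minimal sketch computing the 5-number summary and IQR with NumPy and drawing a box plot; the values are made up:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.array([2, 3, 5, 7, 8, 9, 11, 13, 15, 40])   # 40 is a likely outlier

q1, median, q3 = np.percentile(data, [25, 50, 75])
print("5-number summary:", data.min(), q1, median, q3, data.max())
print("IQR:", q3 - q1)          # spread of the middle 50% of the data

plt.boxplot(data)               # points beyond 1.5 * IQR are drawn as outliers
plt.title("Box plot of the sample")
plt.show()
```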
Inferential Statistics
Inferential statistics is all about making inferences or deductions from sample data in
order to make statements about the population as a whole. Here's a brief overview:
• Confidence Intervals: At the 95% confidence level, 95% of the confidence intervals
calculated from random samples contain the true population parameter.
• Hypothesis Tests: Test whether the true population parameter is less than, greater
than, or not equal to a certain value at a specific significance level (alpha). This involves
stating a null hypothesis (e.g., the true population mean is 0), picking a significance
level (usually 5%), calculating the critical value for the test statistic, and comparing it to
the test statistic computed from the data. A result is statistically significant if the null
hypothesis value is not in the confidence interval (see the sketch after this list).
• Further Reading: For more information on confidence intervals and hypothesis tests,
check out the link in the Further reading section at the end of the chapter.
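A rough sketch of both ideas on simulated data, using SciPy (not introduced in these notes) alongside NumPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.5, scale=2.0, size=50)   # simulated sample data

# 95% confidence interval for the population mean (t-distribution based).
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% confidence interval:", ci)

# One-sample t-test of the null hypothesis that the true population mean is 0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)
print("t statistic:", t_stat, "p-value:", p_value)
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")   # alpha = 5%
```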
Essential Python Libraries
Several Python libraries are essential for data analysis. Here's a brief overview of the
key ones:
NumPy
NumPy (Numerical Python) provides the fast multidimensional ndarray object (covered
in detail in Unit 2), along with:
• A mature C API for Python extensions and native C or C++ code to access NumPy's
data structures.
pandas
pandas offers high-level data structures and functions for working with structured or
tabular data. Key features include:
• DataFrame: A tabular, column-oriented data structure with both row and column
labels.
• Merge and other relational operations found in popular databases (e.g., SQL).
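A minimal sketch of both features on toy tables:

```python
import pandas as pd

# Toy tables used only to illustrate the API.
sales = pd.DataFrame({"city": ["NYC", "Boston"], "sales": [100, 60]})
temps = pd.DataFrame({"city": ["NYC", "Boston"], "avg_temp": [40, 35]})

# DataFrame: tabular data with labeled rows and columns.
print(sales)

# merge: a relational (SQL-style) join on a shared key column.
combined = sales.merge(temps, on="city", how="inner")
print(combined)
```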
matplotlib
matplotlib is the most popular Python library for producing plots and other two-
dimensional data visualizations. It is designed for creating publication-quality plots and
integrates well with the rest of the Python data ecosystem.
These libraries form the backbone of data analysis in Python, making it a powerful and
productive environment for data scientists and analysts.
UNIT 2
The NumPy ndarray: A Multidimensional Array Object
NumPy's core feature is its N-dimensional array object, known as ndarray,
which serves as a fast and flexible container for large datasets in Python.
This allows for efficient mathematical operations on entire blocks of data
using syntax similar to that of scalar operations.
Creating and Using ndarrays
To create an ndarray, you can use the np.array() function, which accepts
any sequence-like object (like lists) and produces a new NumPy array:
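For example, a minimal sketch:

```python
import numpy as np

data = [[1, 2, 3], [4, 5, 6]]   # a nested Python list (any sequence-like object works)
arr = np.array(data)            # becomes a 2 x 3 ndarray

print(arr.shape)   # (2, 3)
print(arr.dtype)   # e.g. int64 (platform dependent)
print(arr * 2)     # vectorized arithmetic applied to the whole block of data
```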
Common operations on ndarrays include:
• Transposing: Use .T or transpose() to switch the rows and columns of an array.
• Matrix Multiplication: Use numpy.dot() or the @ operator to perform matrix
multiplication, for example between an array and its transpose.
• Swapping Axes: Use swapaxes(axis1, axis2) to rearrange the dimensions of an array;
it returns a view without copying the data.
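A short sketch of all three operations:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

print(arr.T)                # transpose: rows and columns switched, shape (3, 2)
print(np.dot(arr, arr.T))   # matrix multiplication of the array with its transpose
print(arr @ arr.T)          # the @ operator gives the same result

arr3d = np.arange(24).reshape(2, 3, 4)
print(arr3d.swapaxes(1, 2).shape)   # (2, 4, 3); returns a view, no data copied
```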
These operations are fundamental in data analysis and scientific
computing, allowing for efficient manipulation of multidimensional data
structures.