
UNIT 1

Fundamentals of Data Analysis

Data analysis is an iterative process that involves several key steps: data collection,
preparation (or wrangling), exploratory data analysis (EDA), and drawing conclusions.
This workflow is often heavily weighted towards data preparation, which can consume
up to 80% of a data scientist's time, despite being the least enjoyable aspect of their
work.

1. Data Collection

The first step in data analysis is data collection. This can begin even before the actual
data is obtained, as it involves determining what to investigate and what data will be
useful. Common sources of data include:

• Web Scraping: Extracting data from websites using tools like Selenium,
Requests, Scrapy, and BeautifulSoup.

• APIs: Collecting data from web services using the Requests package.

• Databases: Extracting data using SQL or other querying languages.

• Downloadable Resources: Accessing data from government websites or financial platforms like Yahoo! Finance.

• Log Files: Analyzing data from system logs.

It's crucial to collect relevant data that will help answer the questions posed in the
analysis. For instance, if analyzing the relationship between temperature and hot
chocolate sales, one should focus on sales data and temperature records, rather than
unrelated metrics.
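As a small illustration of API-based collection, the sketch below pulls JSON from a hypothetical endpoint with the Requests package; the URL, parameters, and field names are placeholders, not part of any real service.

```python
import requests

# Hypothetical endpoint and parameters, shown only to illustrate the workflow.
URL = "https://api.example.com/v1/weather"

response = requests.get(URL, params={"city": "Delhi", "units": "metric"}, timeout=10)
response.raise_for_status()        # raise an error early if the request failed

records = response.json()          # most web APIs return JSON
print(f"collected {len(records)} records")
```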

2. Data Wrangling

Data wrangling is the process of cleaning and preparing data for analysis. Data is often
"dirty," meaning it may contain errors or inconsistencies. Common issues include:

• Human Errors: Incorrect data entry or multiple versions of the same entry (e.g.,
"New York City," "NYC," "nyc").

• Computer Errors: Missing data due to recording issues.

• Unexpected Values: Non-standard representations of missing values (e.g., using "?" for missing numeric data).

• Incomplete Information: Missing responses in surveys.


• Resolution Issues: Data collected at a different frequency than required (e.g.,
daily vs. hourly).

• Relevance: Data collected for other purposes may not be suitable for the
current analysis.

• Format Issues: Data may need reshaping to be usable.

Addressing these issues is essential to ensure the integrity of the analysis. Chapters 3
and 4 of the book will delve deeper into data wrangling techniques.
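As a minimal sketch of what this looks like in practice, the pandas snippet below cleans a small made-up dataset exhibiting two of the issues above (inconsistent city names and a "?" standing in for a missing value); the column names and values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy data containing the kinds of problems described above.
df = pd.DataFrame({
    "city": ["New York City", "NYC", "nyc", "Boston"],
    "sales": ["10", "?", "7", "12"],   # "?" is a non-standard marker for a missing value
})

df["city"] = df["city"].str.lower().replace({"nyc": "new york city"})   # unify spellings
df["sales"] = pd.to_numeric(df["sales"].replace("?", np.nan))           # treat "?" as missing
df["sales"] = df["sales"].fillna(df["sales"].median())                  # simple imputation

print(df)
```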

3. Exploratory Data Analysis (EDA)

EDA involves using visualizations and summary statistics to understand the data better.
Visualizations are crucial as they can reveal patterns and insights that may not be
apparent from raw data alone. Common EDA tasks include:

• Analyzing trends over time.

• Comparing categorical observations.

• Identifying outliers.

• Examining distributions of variables.

However, care must be taken to avoid misleading visualizations, such as those caused
by inappropriate scaling of axes. EDA and data wrangling are closely linked, as data
often needs to be cleaned before effective analysis can occur.
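A quick EDA pass with pandas and matplotlib might look like the sketch below; the DataFrame recreates the cleaned toy data from the wrangling example, so the column names and values are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Reusing the cleaned toy data from the wrangling sketch above.
df = pd.DataFrame({
    "city": ["new york city", "new york city", "new york city", "boston"],
    "sales": [10.0, 10.0, 7.0, 12.0],
})

print(df.describe())                                            # summary statistics

df["sales"].plot(kind="hist", title="Distribution of sales")    # examine a distribution
plt.show()

df.boxplot(column="sales", by="city")                           # compare groups, spot outliers
plt.show()
```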

4. Drawing Conclusions

After data collection, cleaning, and EDA, the next step is to draw conclusions. This
involves summarizing findings and determining the next steps, such as:

• Identifying patterns or relationships in the data.

• Assessing the potential for predictive modeling.

• Evaluating the need for additional data collection.

• Understanding the distribution of the data.

If modeling is pursued, it typically falls under the realm of machine learning and
statistics, which will be covered in later chapters.

Statistical Foundations

Statistics play a vital role in data analysis, with two main categories: descriptive and
inferential statistics.
• Descriptive Statistics: These summarize the sample data, providing insights
into its characteristics.

• Inferential Statistics: These use sample data to make inferences about the
larger population.

Sampling

A key principle in statistics is that samples must be random and representative of the
population to avoid bias. Various sampling methods exist, including simple random
sampling and stratified random sampling.
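The sketch below shows both methods with pandas on a tiny made-up population; the column names and the 50% sampling fraction are arbitrary choices for illustration.

```python
import pandas as pd

# Toy population with two strata ("north" and "south") of different sizes.
population = pd.DataFrame({
    "region": ["north"] * 6 + ["south"] * 4,
    "value": range(10),
})

# Simple random sampling: every row has the same chance of being selected.
simple = population.sample(frac=0.5, random_state=42)

# Stratified random sampling: sample within each stratum so both regions remain represented.
stratified = population.groupby("region").sample(frac=0.5, random_state=42)

print(simple)
print(stratified)
```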

Descriptive Statistics

Descriptive statistics can be categorized into measures of central tendency and measures of spread.

• Measures of Central Tendency:

• Mean: The average of the data, sensitive to outliers.

• Median: The middle value, robust to outliers.

• Mode: The most frequently occurring value.

• Measures of Spread:

• Range: The difference between the maximum and minimum values.

• Variance: The average squared deviation from the mean.

• Standard Deviation: The square root of the variance, providing a measure of spread in the same units as the data.

• Interquartile Range (IQR): The range between the first and third quartiles,
indicating the spread of the middle 50% of the data.
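The sketch below computes each of these measures with NumPy and the standard library on a small invented sample (the value 41 is included deliberately as an outlier):

```python
import numpy as np
from statistics import mode

data = np.array([4, 7, 7, 9, 12, 15, 41])         # toy sample; 41 is an outlier

mean = data.mean()                                 # sensitive to the outlier
median = np.median(data)                           # robust to the outlier
most_frequent = mode(data.tolist())                # mode: most frequently occurring value

data_range = data.max() - data.min()
variance = data.var(ddof=1)                        # sample variance
std_dev = data.std(ddof=1)                         # sample standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                      # spread of the middle 50% of the data

print(mean, median, most_frequent, data_range, variance, std_dev, iqr)
```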

Summarizing Data

The 5-number summary (minimum, Q1, median, Q3, maximum) provides a quick overview of the data distribution. Box plots visually represent this summary, highlighting the median, quartiles, and potential outliers.
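For instance, the 5-number summary and a box plot can be produced directly with NumPy and matplotlib (reusing the same toy sample as above):

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.array([4, 7, 7, 9, 12, 15, 41])                      # same toy sample as above

minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)                         # the 5-number summary

plt.boxplot(data)                                               # whiskers and fliers flag potential outliers
plt.title("Box plot of the toy sample")
plt.show()
```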
Inferential Statistics

Inferential statistics is all about making inferences or deductions from sample data to make statements about the population as a whole. Here's a brief overview:

1. **Observational Study vs. Experiment:**

- **Observational Study:** The independent variable is not controlled by researchers. For example, studies on smoking where researchers observe participants without influencing their smoking habits. This means causation cannot be concluded.

- **Experiment:** Researchers can directly influence the independent variable and randomly assign subjects to control and test groups. For example, A/B tests for website redesigns. The ideal setup is double-blind, where neither the researchers nor the subjects know who receives the treatment or placebo.

2. **Bayesian vs. Frequentist Inference:**

- **Frequentist Statistics:** Focuses on the long-run frequency of events.

- **Bayesian Statistics:** Uses a degree of belief to determine the probability of an event. You can read more about these approaches [here](https://www.probabilisticworld.com/frequentist-bayesian-approaches-inferential-statistics/).

3. **Confidence Intervals:**

- Provide a point estimate and a margin of error around it.

- At the 95% confidence level, 95% of the confidence intervals calculated from
random samples contain the true population parameter.

- Common confidence levels are 90%, 95%, and 99% (see the sketch after this list).

4. **Hypothesis Tests:**

- Test whether the true population parameter is less than, greater than, or not equal to
a certain value at a specific significance level (alpha).

- Involves stating a null hypothesis (e.g., the true population mean is 0), picking a
significance level (usually 5%), calculating the critical value for the test statistic, and
comparing it to the test statistic from the data.

- A result is statistically significant if the null hypothesis value is not in the confidence interval (a worked example follows this list).

5. **Further Reading:**

- For more information on confidence intervals and hypothesis tests, check out the
link in the Further reading section at the end of the chapter.

- Learn about p-values and p-hacking [here](https://en.wikipedia.org/wiki/Misunderstandings_of_p-values).
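As a worked illustration of items 3 and 4 above, the sketch below computes a 95% confidence interval for a mean and runs a one-sample t-test with SciPy; the sample values and the hypothesised mean of 5.0 are invented for demonstration.

```python
import numpy as np
from scipy import stats

# Invented sample; suppose the null hypothesis is that the true population mean is 5.0.
sample = np.array([4.8, 5.1, 5.4, 4.9, 5.2, 5.0, 5.3, 4.7])

# 95% confidence interval for the mean (t distribution, n - 1 degrees of freedom).
mean = sample.mean()
sem = stats.sem(sample)                                    # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"point estimate = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")

# One-sample t-test of H0: population mean == 5.0 at the 5% significance level.
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject the null hypothesis")
```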

Essential Python Libraries for Data Analysis

Here is a brief overview of the key libraries used for data analysis in Python:
### NumPy

NumPy, short for Numerical Python, is a cornerstone of numerical computing in Python. It provides:

- A fast and efficient multidimensional array object (`ndarray`).

- Functions for element-wise computations and mathematical operations between arrays.

- Tools for reading and writing array-based datasets to disk.

- Linear algebra operations, Fourier transform, and random number generation.

- A mature C API for Python extensions and native C or C++ code to access NumPy’s
data structures.
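A quick sketch of a few of these capabilities (the array values are arbitrary):

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])      # a 2 x 3 ndarray

print(a * 10)                                          # element-wise computation on the whole block
print(a + a)                                           # mathematical operation between arrays
print(np.linalg.norm(a))                               # a linear algebra routine
print(np.random.default_rng(0).standard_normal(3))     # random number generation
```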

### pandas

pandas offers high-level data structures and functions for working with structured or
tabular data. Key features include:

- **DataFrame:** A tabular, column-oriented data structure with both row and column
labels.

- **Series:** A one-dimensional labeled array object.

- Convenient indexing functionality for reshaping, slicing, aggregating, and selecting subsets of data.

- Integrated time series functionality and flexible handling of missing data.

- Merge and other relational operations found in popular databases (e.g., SQL).
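A short sketch of a DataFrame and a Series in action; the column names, dates, and values are invented for illustration.

```python
import pandas as pd

# DataFrame: tabular, column-oriented data with row and column labels.
df = pd.DataFrame(
    {"temperature": [30, 32, 19, 21], "sales": [10, 8, 25, 22]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),    # integrated time series support
)

sales = df["sales"]                    # Series: a one-dimensional labeled array

print(df.loc["2024-01-02"])            # label-based selection of a row
print(df.resample("2D").mean())        # time series aggregation
print(sales.describe())                # quick summary of a single column
```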

### matplotlib

matplotlib is the most popular Python library for producing plots and other two-dimensional data visualizations. It is designed for creating publication-quality plots and integrates well with the rest of the Python data ecosystem.
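A minimal plotting sketch (the values are placeholders, echoing the hot chocolate example from Unit 1):

```python
import matplotlib.pyplot as plt

temperatures = [30, 32, 19, 21, 15]     # made-up temperatures
sales = [10, 8, 25, 22, 30]             # made-up hot chocolate sales

plt.scatter(temperatures, sales)        # a simple two-dimensional visualization
plt.xlabel("Temperature")
plt.ylabel("Hot chocolate sales")
plt.title("Sales vs. temperature")
plt.show()
```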

These libraries form the backbone of data analysis in Python, making it a powerful and productive environment for data scientists and analysts.
UNIT 2
The NumPy ndarray: A Multidimensional Array Object
NumPy's core feature is its N-dimensional array object, known as ndarray,
which serves as a fast and flexible container for large datasets in Python.
This allows for efficient mathematical operations on entire blocks of data
using syntax similar to that of scalar operations.
Creating and Using ndarrays
To create an ndarray, you can use the np.array() function, which accepts
any sequence-like object (like lists) and produces a new NumPy array:
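For example (the input list below is arbitrary):

```python
import numpy as np

data = [[1, 2, 3], [4, 5, 6]]         # any sequence-like object, here a nested list
arr = np.array(data)                  # produces a new two-dimensional ndarray

print(arr.shape)    # (2, 3)
print(arr.dtype)    # an integer dtype inferred from the input
print(arr * 2)      # fast element-wise operation on the whole block of data
```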

• Transposing: Use .T or transpose() to switch rows and columns in an array.
• Matrix Multiplication: Use numpy.dot() or the @ operator to perform
matrix multiplication with transposed arrays.
• Swapping Axes: Use swapaxes(axis1, axis2) to rearrange the
dimensions of an array without copying the data.
These operations are fundamental in data analysis and scientific
computing, allowing for efficient manipulation of multidimensional data
structures.
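A brief sketch of these three operations (the array contents are arbitrary):

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)          # a 2 x 3 array: [[0, 1, 2], [3, 4, 5]]

print(arr.T)                              # transpose: shape becomes (3, 2)
print(arr.transpose())                    # equivalent to .T

print(np.dot(arr, arr.T))                 # matrix multiplication with the transposed array
print(arr @ arr.T)                        # the same result using the @ operator

arr3d = np.arange(24).reshape(2, 3, 4)
print(arr3d.swapaxes(1, 2).shape)         # (2, 4, 3): axes rearranged without copying the data
```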
