The document provides an overview of data analysis tools, focusing on libraries such as NumPy and Pandas for numerical operations and data manipulation. It covers key concepts including array creation, indexing, operations, and data cleaning techniques like handling missing values and removing duplicates. Additionally, it demonstrates DataFrame operations including creation, import/export, and data type conversions.

================================================================================

WEEK 3 --> Data Analysis Tools


================================================================================

Importing Libraries
NumPy: For numerical and array operations.
Pandas: For data manipulation and analysis.
Matplotlib and Seaborn: For creating visualizations.
Scipy.stats: For statistical tests and probability calculations.
Datetime: For working with dates and times.

# NumPy Array Operations

# Import required libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import datetime

# %matplotlib inline # For Jupyter notebook display

import warnings
warnings.filterwarnings("ignore")

Numpy

1. Array Creation
Creating Arrays from Lists: Convert lists into NumPy arrays.
Zeros and Ones Arrays: np.zeros and np.ones create arrays filled with zeros or ones,
respectively.
Range and Linear Space Arrays: np.arange generates a sequence of numbers; np.linspace generates
linearly spaced values between two numbers.
Random Arrays: np.random.rand generates arrays with random values between 0 and 1.

## Array Creation
print("1. Array Creation")
# Create array from list
arr1 = np.array([1, 2, 3, 4, 5])
print("From list:", arr1)

# Create array with zeros and ones


zeros_arr = np.zeros((3, 3))
ones_arr = np.ones((2, 4))
print("\nZeros array:\n", zeros_arr)
print("\nOnes array:\n", ones_arr)

# Create array with range


range_arr = np.arange(0, 10, 2)
print("\nRange array:", range_arr)

# Create array with linear space


linear_arr = np.linspace(0, 1, 5)
print("\nLinear space array:", linear_arr)

# Random arrays
random_arr = np.random.rand(3, 3)
print("\nRandom array:\n", random_arr)
1. Array Creation
From list: [1 2 3 4 5]

Zeros array:
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]

Ones array:
[[1. 1. 1. 1.]
[1. 1. 1. 1.]]

Range array: [0 2 4 6 8]

Linear space array: [0. 0.25 0.5 0.75 1. ]

Random array:
[[0.4322104 0.39502232 0.19561678]
[0.47345545 0.37266552 0.80169765]
[0.83859935 0.42368236 0.27333106]]

2. Array Indexing and Slicing


Accessing specific array elements or subsets of elements.
Single Element Access: Retrieve a specific element with arr2d[1,1].
Row and Column Access: Use indexing to access entire rows or columns.
Subarrays: Use slicing (arr2d[0:2, 1:3]) to retrieve a subsection of the array.

## Array Indexing and Slicing


print("\n2. Array Indexing and Slicing")
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("2D array:\n", arr2d)
print("\nElement at position (1,1):", arr2d[1,1])
print("First row:", arr2d[0])
print("First column:", arr2d[:,0])
print("Subarray:", arr2d[0:2, 1:3])

2. Array Indexing and Slicing


2D array:
[[1 2 3]
[4 5 6]
[7 8 9]]

Element at position (1,1): 5


First row: [1 2 3]
First column: [1 4 7]
Subarray: [[2 3]
[5 6]]

3. Array Operations
Element-wise Operations: Array addition and multiplication are applied element by element.
Mathematical Functions: np.sqrt, np.sum, and np.mean perform various mathematical operations on
arrays.

## Array Operations
print("\n3. Array Operations")
arr_a = np.array([1, 2, 3])
arr_b = np.array([4, 5, 6])

print("Addition:", arr_a + arr_b)


print("Multiplication:", arr_a * arr_b)
print("Square root:", np.sqrt(arr_a))
print("Sum:", np.sum(arr_a))
print("Mean:", np.mean(arr_a))

3. Array Operations
Addition: [5 7 9]
Multiplication: [ 4 10 18]
Square root: [1. 1.41421356 1.73205081]
Sum: 6
Mean: 2.0

4. Broadcasting
Broadcasting:
Allows performing operations between arrays of different shapes by automatically expanding
smaller arrays to match the dimensions of the larger array.

## Broadcasting
print("\n4. Broadcasting")
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
scalar = 2
print("Original array:\n", arr_2d)
print("\nArray * scalar:\n", arr_2d * scalar)

4. Broadcasting
Original array:
[[1 2 3]
[4 5 6]]

Array * scalar:
[[ 2 4 6]
[ 8 10 12]]
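Broadcasting also works between two non-scalar arrays whose shapes are compatible. A minimal sketch (not part of the original code) combining a column vector with a row vector:

# Hypothetical example: shapes (3, 1) and (3,) broadcast to a (3, 3) result
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])           # shape (3,)
# Each row of the result is col[i] + row: [[1 2 3], [11 12 13], [21 22 23]]
print("Column + row:\n", col + row)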

5. Universal Functions
Universal Functions: Pre-built mathematical functions in NumPy, such as
np.exp (exponential),
np.sqrt (square root),
np.sin (sine),
which apply an operation to each element of an array.

## Universal Functions
print("\n5. Universal Functions")
arr = np.array([1, 2, 3, 4])
print("Original array:", arr)
print("Exponential:", np.exp(arr))
print("Square root:", np.sqrt(arr))
print("Sine:", np.sin(arr))

5. Universal Functions
Original array: [1 2 3 4]
Exponential: [ 2.71828183 7.3890561 20.08553692 54.59815003]
Square root: [1. 1.41421356 1.73205081 2. ]
Sine: [ 0.84147098 0.90929743 0.14112001 -0.7568025 ]

Pandas DataFrame Operations

1. DataFrame Creation
From Dictionary or List: Create DataFrames from dictionaries or lists for easy data organization.

## DataFrame Creation
print("\n1. DataFrame Creation")
# From dictionary
data_dict = {
'name': ['John', 'Anna', 'Peter'],
'age': [28, 22, 35],
'city': ['New York', 'Paris', 'London']
}
df_dict = pd.DataFrame(data_dict)
print("DataFrame from dictionary:\n", df_dict)

# From list of lists


data_list = [[1, 'John', 28], [2, 'Anna', 22], [3, 'Peter', 35]]
df_list = pd.DataFrame(data_list, columns=['id', 'name', 'age'])
print("\nDataFrame from list:\n", df_list)

1. DataFrame Creation
DataFrame from dictionary:
name age city
0 John 28 New York
1 Anna 22 Paris
2 Peter 35 London

DataFrame from list:


id name age
0 1 John 28
1 2 Anna 22
2 3 Peter 35

2. Data Import/Export
Export to CSV: Save DataFrames as CSV files using to_csv.
Read CSV: Load CSV data back into a DataFrame using pd.read_csv.

## Data Import/Export
# Create sample CSV file
df_dict.to_csv('sample_data.csv', index=False)
print("\nReading CSV file:")
df_csv = pd.read_csv('sample_data.csv')
print(df_csv)

Reading CSV file:


name age city
0 John 28 New York
1 Anna 22 Paris
2 Peter 35 London

3. Column and Row Operations


Adding Columns: Create new columns with calculated values, like a “bonus” column based on
“salary.”
Adding Rows: df.append() was the traditional way to add a new row, but it is deprecated (removed
in pandas 2.0); pd.concat is the recommended approach, as used below.

print("\n2. Column Operations")


# Add new column
df_csv['salary'] = [50000, 60000, 75000]
print("Added new column:\n", df_csv)

# Column calculations
df_csv['bonus'] = df_csv['salary'] * 0.1
print("\nCalculated bonus column:\n", df_csv)

## Row Operations
print("\n3. Row Operations")
# Add new row
new_row = {'name': 'Sarah', 'age': 30, 'city': 'Berlin', 'salary': 65000, 'bonus': 6500}
df_csv = pd.concat([df_csv, pd.DataFrame([new_row])], ignore_index=True)
print("Added new row:\n", df_csv)

2. Column Operations
Added new column:
name age city salary
0 John 28 New York 50000
1 Anna 22 Paris 60000
2 Peter 35 London 75000

Calculated bonus column:


name age city salary bonus
0 John 28 New York 50000 5000.0
1 Anna 22 Paris 60000 6000.0
2 Peter 35 London 75000 7500.0

3. Row Operations
Added new row:
name age city salary bonus
0 John 28 New York 50000 5000.0
1 Anna 22 Paris 60000 6000.0
2 Peter 35 London 75000 7500.0
3 Sarah 30 Berlin 65000 6500.0

4. Indexing and Selection


loc and iloc: loc selects by label, while iloc selects by position.
Boolean Indexing: Retrieve rows based on conditions, like age > 30.

## Indexing and Selection


print("\n4. Indexing and Selection")
# loc
print("Using loc to select by label:\n", df_csv.loc[0])
print("\nSelecting multiple columns with loc:\n", df_csv.loc[:, ['name', 'age']])

# iloc
print("\nUsing iloc to select by position:\n", df_csv.iloc[0])
print("\nSelecting using iloc slicing:\n", df_csv.iloc[0:2, 0:2])

# Boolean indexing
print("\nBoolean indexing (age > 30):\n", df_csv[df_csv['age'] > 30])

4. Indexing and Selection


Using loc to select by label:
name John
age 28
city New York
salary 50000
bonus 5000.0
Name: 0, dtype: object

Selecting multiple columns with loc:


name age
0 John 28
1 Anna 22
2 Peter 35
3 Sarah 30

Using iloc to select by position:


name John
age 28
city New York
salary 50000
bonus 5000.0
Name: 0, dtype: object

Selecting using iloc slicing:


name age
0 John 28
1 Anna 22

Boolean indexing (age > 30):


name age city salary bonus
2 Peter 35 London 75000 7500.0

Data Cleaning

1. Handling Missing Values


Handling Missing Values: Use fillna, ffill, and bfill to fill missing values.

# Data Cleaning

## Handling Missing Values


print("\n1. Handling Missing Values")
# Create DataFrame with missing values
df_missing = pd.DataFrame({
'A': [1, np.nan, 3],
'B': [np.nan, 5, 6],
'C': [7, 8, np.nan]
})
print("DataFrame with missing values:\n", df_missing)

1. Handling Missing Values


DataFrame with missing values:
A B C
0 1.0 NaN 7.0
1 NaN 5.0 8.0
2 3.0 6.0 NaN

2. Removing Duplicates
Removing Duplicates: drop_duplicates removes duplicate rows based on column values.

# Fill missing values


print("\nFilled with 0:\n", df_missing.fillna(0))
print("\nForward fill:\n", df_missing.ffill())
print("\nBackward fill:\n", df_missing.bfill())

## Removing Duplicates
print("\n2. Removing Duplicates")
df_dupes = pd.DataFrame({
'A': [1, 1, 2, 3],
'B': [4, 4, 5, 6]
})
print("DataFrame with duplicates:\n", df_dupes)
print("\nAfter removing duplicates:\n", df_dupes.drop_duplicates())
Filled with 0:
A B C
0 1.0 0.0 7.0
1 0.0 5.0 8.0
2 3.0 6.0 0.0

Forward fill:
A B C
0 1.0 NaN 7.0
1 1.0 5.0 8.0
2 3.0 6.0 8.0

Backward fill:
A B C
0 1.0 5.0 7.0
1 3.0 5.0 8.0
2 3.0 6.0 NaN
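Rows containing missing values can also be dropped rather than filled; dropna is the standard pandas call for this (not shown above), as in this brief sketch:

# Drop any row that still contains a NaN
print("\nDrop rows with NaN:\n", df_missing.dropna())
# Drop a row only when every value in it is NaN
print("\nDrop all-NaN rows:\n", df_missing.dropna(how='all'))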

2. Removing Duplicates
DataFrame with duplicates:
A B
0 1 4
1 1 4
2 2 5
3 3 6

After removing duplicates:


A B
0 1 4
2 2 5
3 3 6
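drop_duplicates can also look at a subset of columns and choose which occurrence to keep; a short sketch using standard pandas parameters (the column choice here is just illustrative):

# Treat rows as duplicates based on column 'A' only, keeping the last occurrence
print("\nDeduplicated on 'A' (keep last):\n", df_dupes.drop_duplicates(subset=['A'], keep='last'))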

3. Data Type Conversion


Data Type Conversion: Convert columns to different types, such as converting a string column to
integers.

## Data Type Conversion


print("\n3. Data Type Conversion")
df_types = pd.DataFrame({
'string_col': ['1', '2', '3'],
'float_col': [1.1, 2.2, 3.3]
})
print("Original types:\n", df_types.dtypes)
df_types['string_col'] = df_types['string_col'].astype(int)
print("\nAfter conversion:\n", df_types.dtypes)

3. Data Type Conversion


Original types:
string_col object
float_col float64
dtype: object

After conversion:
string_col int64
float_col float64
dtype: object

4. String Operations
String Operations: Use str.lower and str.contains to manipulate or filter based on text data.

## String Operations
print("\n4. String Operations")
df_str = pd.DataFrame({
'text': ['HELLO', 'World', 'Python']
})
print("Original text:\n", df_str)
print("\nLowercase:\n", df_str['text'].str.lower())
print("Contains 'o':\n", df_str['text'].str.contains('o'))
4. String Operations
Original text:
text
0 HELLO
1 World
2 Python

Lowercase:
0 hello
1 world
2 python
Name: text, dtype: object
Contains 'o':
0 False
1 True
2 True
Name: text, dtype: bool

5. Date/Time Operations
Date/Time Operations: Extract year, month, etc., from datetime columns using .dt attributes.

## Date/Time Operations
print("\n5. Date/Time Operations")
df_dates = pd.DataFrame({
'dates': pd.date_range(start='2024-01-01', periods=3)
})
print("Dates DataFrame:\n", df_dates)
print("\nExtract year:\n", df_dates['dates'].dt.year)
print("Extract month:\n", df_dates['dates'].dt.month)

5. Date/Time Operations
Dates DataFrame:
dates
0 2024-01-01
1 2024-01-02
2 2024-01-03

Extract year:
0 2024
1 2024
2 2024
Name: dates, dtype: int32
Extract month:
0 1
1 1
2 1
Name: dates, dtype: int32
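In real datasets the date column often arrives as strings, so it usually has to be converted with pd.to_datetime before the .dt accessors work. A minimal sketch with made-up string dates:

# Hypothetical string dates converted to datetime64 before extraction
df_raw = pd.DataFrame({'dates': ['2024-01-01', '2024-02-15', '2024-03-30']})
df_raw['dates'] = pd.to_datetime(df_raw['dates'])
print("Parsed months:\n", df_raw['dates'].dt.month)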

Matplotlib Visualizations
Basic Plots
Line Plot: Plot data on a line graph, such as a sine wave.
Scatter Plot: Plot discrete data points to identify trends or patterns.
Bar Plot: Visualize data in different categories.
Histogram: Plot frequency distributions to observe the spread of data.

# Matplotlib Visualizations

## Basic Plots
print("\n1. Basic Plots")
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Line plot
plt.figure(figsize=(10, 6))
plt.plot(x, y)
plt.title('Line Plot')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.grid(True)
plt.show()

1. Basic Plots
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(x[::10], y[::10])
plt.title('Scatter Plot')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.grid(True)
plt.show()

# Bar plot
categories = ['A', 'B', 'C', 'D']
values = [4, 3, 2, 1]
plt.figure(figsize=(10, 6))
plt.bar(categories, values)
plt.title('Bar Plot')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()

# Histogram
data = np.random.randn(1000)
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30)
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Subplots
Subplots: plt.subplots lets you place multiple plots in a single figure.

## Subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
ax1.plot(x, np.sin(x))
ax1.set_title('Sine')
ax2.plot(x, np.cos(x))
ax2.set_title('Cosine')
plt.show()

Seaborn Visualizations
Statistical Plots
Distribution Plot: Visualize distributions of data with KDE (Kernel Density Estimation).
Categorical Plots: Use sns.boxplot to show distributions across categories.
Regression Plot: Shows linear relationship with sns.regplot.
Heatmap: Shows correlation between variables with color-coded cells.

# Seaborn Visualizations

## Statistical Plots
# Distribution plot
plt.figure(figsize=(10, 6))
sns.histplot(data=np.random.randn(1000), kde=True)
plt.title('Distribution Plot')
plt.show()

## Categorical Plots
data = pd.DataFrame({
'category': ['A', 'A', 'B', 'B', 'C', 'C'] * 10,
'value': np.random.randn(60)
})

plt.figure(figsize=(10, 6))
sns.boxplot(x='category', y='value', data=data)
plt.title('Box Plot')
plt.show()

## Regression Plot
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.4
plt.figure(figsize=(10, 6))
sns.regplot(x=x, y=y)
plt.title('Regression Plot')
plt.show()

## Heatmap
correlation_matrix = np.corrcoef(np.random.randn(5, 100))
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

Statistical Analysis

Descriptive Statistics
Descriptive Statistics: Summarize data properties like mean, median, and percentiles.

# Statistical Analysis

## Descriptive Statistics
data = np.random.randn(1000)
print("\nDescriptive Statistics:")
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Standard Deviation:", np.std(data))
print("Variance:", np.var(data))
print("Minimum:", np.min(data))
print("Maximum:", np.max(data))
print("25th percentile:", np.percentile(data, 25))
print("75th percentile:", np.percentile(data, 75))

Descriptive Statistics:
Mean: 0.030350822870012565
Median: 0.04352972676337927
Standard Deviation: 0.9608360092179253
Variance: 0.923205836609829
Minimum: -2.86059084616054
Maximum: 2.671077577964687
25th percentile: -0.5802432556961682
75th percentile: 0.6828388868730023
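The same kind of summary is available in a single call on a pandas object via describe(); a quick sketch wrapping the array above:

# count, mean, std, min, quartiles and max in one call
print(pd.Series(data).describe())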

Probability and Distributions


Normal Distribution:
Generate data with a Gaussian distribution and visualize.

## Probability and Distributions
# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)
plt.figure(figsize=(10, 6))
sns.histplot(normal_data, kde=True)
plt.title('Normal Distribution')
plt.show()
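Beyond sampling and plotting, scipy.stats can evaluate probabilities directly from a distribution object; a small sketch for the standard normal (values chosen only for illustration):

# Probability that a standard normal variable falls below 1.96
print("P(X < 1.96):", stats.norm.cdf(1.96))
# Probability of landing within one standard deviation of the mean
print("P(-1 < X < 1):", stats.norm.cdf(1) - stats.norm.cdf(-1))
# Inverse CDF (quantile): the value below which 97.5% of the mass lies
print("97.5% quantile:", stats.norm.ppf(0.975))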

Statistical Testing
t-test: Compare the means of two samples using a t-test.
Chi-square Test: Compare observed and expected frequencies.
Confidence Intervals: Calculate a confidence interval for the mean.

## Statistical Testing
# t-test
sample1 = np.random.normal(loc=0, scale=1, size=100)
sample2 = np.random.normal(loc=0.5, scale=1, size=100)
t_stat, p_value = stats.ttest_ind(sample1, sample2)
print("\nt-test results:")
print("t-statistic:", t_stat)
print("p-value:", p_value)

t-test results:
t-statistic: -5.398015890847371
p-value: 1.9157658589928207e-07

import numpy as np
from scipy.stats import chi2_contingency

# Create a contingency table with categorical data (example data)


data = np.array([[50, 30, 20], [30, 50, 20]])

# Perform the Chi-Square test


chi2_statistic, p_value, degrees_of_freedom, expected_frequencies = chi2_contingency(data)

# Display results
print("Chi-Square Test Results")
print(f"Chi-Square Statistic: {chi2_statistic:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of Freedom: {degrees_of_freedom}")
print("Expected Frequencies Table:\n", expected_frequencies)

# Interpretation
alpha = 0.05 # Significance level
if p_value < alpha:
print("\nConclusion: Reject the null hypothesis")
print("The variables are likely dependent (there is a significant association).")
else:
print("\nConclusion: Fail to reject the null hypothesis")
print("The variables are likely independent (no significant association).")

Chi-Square Test Results


Chi-Square Statistic: 10.0000
p-value: 0.0067
Degrees of Freedom: 2
Expected Frequencies Table:
[[40. 40. 20.]
[40. 40. 20.]]

Conclusion: Reject the null hypothesis


The variables are likely dependent (there is a significant association).

import numpy as np
from scipy.stats import norm

# Sample data
data = np.array([23, 21, 25, 27, 22, 26, 24, 28, 22, 30])

# Calculate mean and standard deviation


mean = np.mean(data)
std_dev = np.std(data, ddof=1) # ddof=1 for sample standard deviation
n = len(data)

# Confidence level and Z-score (for 95% confidence level)


confidence_level = 0.95
z_score = norm.ppf((1 + confidence_level) / 2)

# Calculate the margin of error


margin_of_error = z_score * (std_dev / np.sqrt(n))

# Confidence interval
ci_lower = mean - margin_of_error
ci_upper = mean + margin_of_error

print(f"Mean: {mean}")
print(f"95% Confidence Interval: ({ci_lower}, {ci_upper})")

Mean: 24.8
95% Confidence Interval: (22.98005737426862, 26.619942625731383)
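With only 10 observations, a t-based interval is usually more appropriate than the z-approximation above; a hedged sketch using scipy's t distribution with the same data and confidence level:

from scipy import stats

# t interval with n-1 degrees of freedom and the sample standard error
ci_t = stats.t.interval(confidence_level, n - 1, loc=mean, scale=std_dev / np.sqrt(n))
print(f"95% t-based Confidence Interval: {ci_t}")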

ADVANCED DATA ANALYSIS


# Advanced Data Science Examples
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import datetime

Advanced NumPy Operations


## Advanced Array Manipulation
# Reshaping arrays
arr = np.arange(12)
print("Original array:", arr)
reshaped = arr.reshape(3, 4)
print("\nReshaped to 3x4:\n", reshaped)
print("\nTransposed:\n", reshaped.T)

# Advanced indexing
arr_3d = np.arange(24).reshape(2, 3, 4)
print("\n3D array:\n", arr_3d)
print("\nComplex slice:\n", arr_3d[0, 1:, 2:])

# Fancy indexing
arr = np.arange(10)
indices = [2, 5, -1]
print("\nFancy indexing:", arr[indices])

# Boolean masking
mask = arr > 5
print("\nBoolean masking:", arr[mask])
Original array: [ 0 1 2 3 4 5 6 7 8 9 10 11]

Reshaped to 3x4:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]

Transposed:
[[ 0 4 8]
[ 1 5 9]
[ 2 6 10]
[ 3 7 11]]

3D array:
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]

[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]

Complex slice:
[[ 6 7]
[10 11]]

Fancy indexing: [2 5 9]

Boolean masking: [6 7 8 9]

Advanced Broadcasting
## Advanced Broadcasting
# Broadcasting with different dimensions
a = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
b = np.array([10, 20, 30])  # shape (3,), broadcast across each row
print("\nBroadcasting result:\n", a * b)

Broadcasting result:
[[ 10 40 90]
[ 40 100 180]]

Advanced Pandas Operations

Complex DataFrame Operations


## Complex DataFrame Operations
# Create sample data
dates = pd.date_range('2024-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6, 4),
index=dates,
columns=list('ABCD'))
print("Sample DataFrame:\n", df)

# Rolling statistics
print("\nRolling mean (window=3):\n", df.rolling(window=3).mean())

# Groupby operations with multiple functions


df['category'] = ['X', 'X', 'Y', 'Y', 'Z', 'Z']
grouped_stats = df.groupby('category').agg({
'A': ['mean', 'std'],
'B': ['min', 'max'],
'C': 'sum'
})
print("\nGrouped statistics:\n", grouped_stats)
Sample DataFrame:
A B C D
2024-01-01 0.528341 0.108282 0.099562 -1.131907
2024-01-02 0.487417 2.346762 0.431917 -0.594799
2024-01-03 -0.982138 0.205079 0.399308 0.550459
2024-01-04 0.217459 0.980465 -0.636395 1.532399
2024-01-05 1.442697 -1.031132 -1.181728 2.547159
2024-01-06 -1.246989 0.456673 -0.791499 -2.226223

Rolling mean (window=3):


A B C D
2024-01-01 NaN NaN NaN NaN
2024-01-02 NaN NaN NaN NaN
2024-01-03 0.011206 0.886708 0.310262 -0.392082
2024-01-04 -0.092421 1.177435 0.064943 0.496020
2024-01-05 0.226006 0.051471 -0.472938 1.543339
2024-01-06 0.137722 0.135335 -0.869874 0.617778

Grouped statistics:
A B C
mean std min max sum
category
X 0.507879 0.028938 0.108282 2.346762 0.531478
Y -0.382340 0.848243 0.205079 0.980465 -0.237087
Z 0.097854 1.901895 -1.031132 0.456673 -1.973227

Advanced Time Series


import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tools.sm_exceptions import InterpolationWarning
import warnings

# Suppress specific warnings


warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

def safe_kpss_test(timeseries, **kw):


"""
Safely perform KPSS test with proper error handling
"""
try:
# Suppress the specific interpolation warning
with warnings.catch_warnings():
warnings.filterwarnings('ignore', category=InterpolationWarning)
statistic, p_value, n_lags, critical_values = kpss(timeseries, **kw)

kpss_result = {
'statistic': statistic,
'p_value': p_value,
'n_lags': n_lags,
'critical_values': critical_values,
'status': 'success'
}
except Exception as e:
kpss_result = {
'statistic': None,
'p_value': None,
'n_lags': None,
'critical_values': None,
'status': f'failed: {str(e)}'
}

return kpss_result

def check_stationarity(ts):
"""
Perform comprehensive stationarity analysis
"""
results = {}

# 1. ADF Test
try:
adf_test = adfuller(ts.dropna())
results['adf_test'] = {
'statistic': adf_test[0],
'p_value': adf_test[1],
'critical_values': adf_test[4],
'status': 'success'
}
except Exception as e:
results['adf_test'] = {
'status': f'failed: {str(e)}'
}

# 2. KPSS Test with proper handling


results['kpss_test'] = safe_kpss_test(ts.dropna())

# 3. Additional stationarity checks


results['rolling_stats'] = {
'mean_stationarity': abs(ts.rolling(window=30).mean().std()) < 1,
'var_stationarity': abs(ts.rolling(window=30).std().std()) < 1
}

return results

def analyze_timeseries_stability(ts):
"""
Analyze time series stability and characteristics
"""
# Calculate basic stability metrics
stability_metrics = {
'volatility': ts.std(),
'trend_strength': abs(np.corrcoef(ts.values, np.arange(len(ts)))[0, 1]),
'seasonality_test': seasonal_strength(ts)
}

return stability_metrics

def seasonal_strength(ts, period=30):


"""
Calculate the strength of seasonality in the time series
"""
try:
# Perform seasonal decomposition
decomposition = seasonal_decompose(ts, period=period)

# Calculate strength of seasonality


seasonal_strength = (
abs(decomposition.seasonal).mean() /
(abs(decomposition.seasonal).mean() + abs(decomposition.resid).mean())
)

return seasonal_strength
except:
return None

# Example usage
if __name__ == "__main__":
# Create sample time series
np.random.seed(42)
ts = pd.Series(
np.random.randn(1000),
index=pd.date_range('2024-01-01', periods=1000)
)

# Add some seasonality and trend for demonstration


t = np.arange(len(ts))
ts += 0.01 * t # Add trend
ts += 2 * np.sin(2 * np.pi * t / 30) # Add seasonality

# Perform stationarity analysis


stationarity_results = check_stationarity(ts)
stability_metrics = analyze_timeseries_stability(ts)

# Print results in a formatted way


print("\nTime Series Analysis Results")
print("-" * 50)

print("\n1. Stationarity Tests:")


if stationarity_results['adf_test']['status'] == 'success':
print(f"\nADF Test:")
print(f"Statistic: {stationarity_results['adf_test']['statistic']:.4f}")
print(f"P-value: {stationarity_results['adf_test']['p_value']:.4f}")
print("\nCritical Values:")
for key, value in stationarity_results['adf_test']['critical_values'].items():
print(f"{key}: {value:.4f}")

if stationarity_results['kpss_test']['status'] == 'success':
print(f"\nKPSS Test:")
print(f"Statistic: {stationarity_results['kpss_test']['statistic']:.4f}")
print(f"P-value: {stationarity_results['kpss_test']['p_value']:.4f}")
print("\n2. Stability Metrics:")
print(f"Volatility: {stability_metrics['volatility']:.4f}")
print(f"Trend Strength: {stability_metrics['trend_strength']:.4f}")
print(f"Seasonality Strength: {stability_metrics['seasonality_test']:.4f}")

print("\n3. Rolling Statistics Assessment:")


print(f"Mean Stationarity: {stationarity_results['rolling_stats']['mean_stationarity']}")
print(f"Variance Stationarity: {stationarity_results['rolling_stats']['var_stationarity']}")

Time Series Analysis Results


--------------------------------------------------

1. Stationarity Tests:

ADF Test:
Statistic: -0.6882
P-value: 0.8498

Critical Values:
1%: -3.4371
5%: -2.8645
10%: -2.5683

2. Stability Metrics:
Volatility: 3.3868
Trend Strength: 0.8602
Seasonality Strength: 0.6268

3. Rolling Statistics Assessment:


Mean Stationarity: False
Variance Stationarity: True

Complex String Operations


## Complex String Operations
# Create sample text data
text_df = pd.DataFrame({
'text': ['John Doe, 30', 'Jane_Smith-25', 'Bob Wilson:35'],
'email': ['[email protected]', '[email protected]', '[email protected]']
})

# Extract names using regex


text_df['name'] = text_df['text'].str.extract(r'([A-Za-z\s]+)')
print("Extracted names:\n", text_df['name'])

# Extract age
text_df['age'] = text_df['text'].str.extract(r'(\d+)').astype(int)
print("\nExtracted ages:\n", text_df['age'])

Extracted names:
0 John Doe
1 Jane
2 Bob Wilson
Name: name, dtype: object

Extracted ages:
0 30
1 25
2 35
Name: age, dtype: int64
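The character class above stops at underscores, which is why 'Jane_Smith' is captured only as 'Jane'. A slightly wider pattern (illustrative, not from the original) also accepts names joined by underscores or hyphens:

# Allow spaces, underscores, and hyphens between the two name parts
text_df['full_name'] = text_df['text'].str.extract(r'([A-Za-z]+[\s_-][A-Za-z]+)')
print("\nFull names:\n", text_df['full_name'])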

Advanced Missing Value Handling


## Advanced Missing Value Handling
# Create DataFrame with missing values in patterns
df_missing = pd.DataFrame({
'A': [1, np.nan, 3, np.nan, 5],
'B': [np.nan, 2, np.nan, 4, 5],
'C': [1, 2, 3, 4, 5]
})

# Interpolation
print("\nLinear interpolation:\n", df_missing.interpolate(method='linear'))
print("\nPolynomial interpolation:\n", df_missing.interpolate(method='polynomial', order=2))
Linear interpolation:
A B C
0 1.0 NaN 1
1 2.0 2.0 2
2 3.0 3.0 3
3 4.0 4.0 4
4 5.0 5.0 5

Polynomial interpolation:
A B C
0 1.0 NaN 1
1 2.0 2.0 2
2 3.0 3.0 3
3 4.0 4.0 4
4 5.0 5.0 5

Advanced Visualization
## Complex Matplotlib Plots
# Create sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Multiple plots with shared axis


fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8), sharex=True)
ax1.plot(x, y1, 'b-', label='sin(x)')
ax1.fill_between(x, y1, alpha=0.3)
ax1.legend()
ax1.grid(True)

ax2.plot(x, y2, 'r-', label='cos(x)')


ax2.fill_between(x, y2, alpha=0.3)
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.show()

Advanced Seaborn Visualizations
## Advanced Seaborn Visualizations
# Create complex dataset
n_points = 1000
data = pd.DataFrame({
'x': np.random.normal(0, 1, n_points),
'y': np.random.normal(0, 1, n_points),
'category': np.random.choice(['A', 'B', 'C'], n_points),
'size': np.random.uniform(10, 100, n_points)
})

# Advanced scatter plot with multiple variables


plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='x', y='y',
hue='category', size='size',
palette='deep')
plt.title('Multi-variable Scatter Plot')
plt.show()

Advanced Statistical Analysis


## Advanced Hypothesis Testing
# One-way ANOVA
np.random.seed(42)
# Create sample data
n = 10
a = np.random.normal(0, 1, n)
b = np.random.normal(0.5, 1, n)
c = np.random.normal(1, 1, n)

# Perform one-way ANOVA


from scipy import stats
f_stat, p_val = stats.f_oneway(a, b, c)
print("One-way ANOVA results:")
print(f"F-statistic: {f_stat}")
print(f"p-value: {p_val}")

One-way ANOVA results:


F-statistic: 5.11775405930974
p-value: 0.013046303383976988
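A two-way ANOVA extends this to two factors and their interaction; it is easier to run through statsmodels than scipy. A hedged sketch, with factor names and data made up purely for illustration:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical two-factor design: treatment (A/B) crossed with dose (low/high)
rng = np.random.default_rng(42)
df_anova = pd.DataFrame({
    'treatment': np.repeat(['A', 'B'], 20),
    'dose': np.tile(np.repeat(['low', 'high'], 10), 2),
    'response': rng.normal(0, 1, 40)
})

# Model with both main effects and their interaction, then the ANOVA table
model = ols('response ~ C(treatment) + C(dose) + C(treatment):C(dose)', data=df_anova).fit()
print(sm.stats.anova_lm(model, typ=2))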

Advanced Distribution Analysis


## Advanced Distribution Analysis
# Create mixture of distributions
data1 = np.random.normal(0, 1, 1000)
data2 = np.random.normal(4, 1.5, 1000)
mixture = np.concatenate([data1, data2])

# Plot histogram with KDE


plt.figure(figsize=(10, 6))
sns.histplot(mixture, kde=True, bins=50)
plt.title('Mixture of Normal Distributions')
plt.show()

# Kernel Density Estimation


kde = stats.gaussian_kde(mixture)
x_range = np.linspace(mixture.min(), mixture.max(), 200)
plt.figure(figsize=(10, 6))
plt.plot(x_range, kde(x_range))
plt.title('Kernel Density Estimation')
plt.show()

Advanced Time Series Analysis


## Advanced Time Series Analysis
# Create seasonal time series
t = np.linspace(0, 365, 365)
seasonal = 10 * np.sin(2 * np.pi * t / 365) + np.random.normal(0, 1, 365)
trend = 0.05 * t
time_series = seasonal + trend

# Plot components
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 10))
ax1.plot(t, time_series, label='Original')
ax1.legend()
ax2.plot(t, trend, label='Trend', color='red')
ax2.legend()
ax3.plot(t, seasonal, label='Seasonal + Noise', color='green')
ax3.legend()
plt.tight_layout()
plt.show()

Advanced Time Series Analysis - Machine Learning Based


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss
import warnings
warnings.filterwarnings('ignore')

class TimeSeriesAnalyzer:
def __init__(self, data, date_column=None, target_column=None, freq='D'):
"""
Initialize with time series data
"""
self.raw_data = data
self.freq = freq
# Handle datetime index
if date_column:
self.data = data.set_index(date_column)
else:
self.data = data

if target_column:
self.target = self.data[target_column]
else:
self.target = self.data

self.scaler = StandardScaler()

def preprocess_timeseries(self, test_size=0.2, n_splits=5):


"""
Preprocess time series data with proper splitting and scaling
"""
# Create time series split
tscv = TimeSeriesSplit(n_splits=n_splits)

# Get the last split for final training/testing


train_index = None
test_index = None

for train_idx, test_idx in tscv.split(self.target):


train_index = train_idx
test_index = test_idx

# Split the data


self.X_train = self.target.iloc[train_index]
self.X_test = self.target.iloc[test_index]

# Scale the data


self.X_train_scaled = pd.Series(
self.scaler.fit_transform(self.X_train.values.reshape(-1, 1)).flatten(),
index=self.X_train.index
)
self.X_test_scaled = pd.Series(
self.scaler.transform(self.X_test.values.reshape(-1, 1)).flatten(),
index=self.X_test.index
)

return self.X_train_scaled, self.X_test_scaled

def plot_seasonal_decomposition(self, ax):


"""
Plot seasonal decomposition components
"""
decomposition = seasonal_decompose(self.target, period=min(len(self.target)//2, 30))

# Plot trend
ax.plot(decomposition.trend, label='Trend', color='blue')
# Plot seasonal
ax.plot(decomposition.seasonal, label='Seasonal', color='green', alpha=0.6)
# Plot residual
ax.plot(decomposition.resid, label='Residual', color='red', alpha=0.4)
ax.legend()
ax.set_title('Seasonal Decomposition')

def visualize_preprocessing(self):
"""
Create comprehensive time series visualizations
"""
fig = plt.figure(figsize=(20, 15))

# 1. Original Time Series Plot


ax1 = plt.subplot(321)
self.target.plot(ax=ax1, title='Original Time Series')
ax1.set_xlabel('Date')
ax1.set_ylabel('Value')

# 2. Seasonal Decomposition
ax2 = plt.subplot(322)
self.plot_seasonal_decomposition(ax2)

# 3. Training vs Testing Split


ax3 = plt.subplot(323)
self.X_train.plot(ax=ax3, label='Training', title='Train/Test Split')
self.X_test.plot(ax=ax3, label='Testing')
ax3.legend()

# 4. Scaled Data Comparison


ax4 = plt.subplot(324)
self.X_train_scaled.plot(ax=ax4, label='Scaled Training', title='Scaled Train/Test Data')
self.X_test_scaled.plot(ax=ax4, label='Scaled Testing')
ax4.legend()

# 5. Distribution Plots
ax5 = plt.subplot(325)
sns.kdeplot(data=self.target, ax=ax5, label='Original')
sns.kdeplot(data=self.X_train_scaled, ax=ax5, label='Scaled')
ax5.set_title('Distribution Comparison')
ax5.legend()

# 6. Rolling Statistics
ax6 = plt.subplot(326)
window_size = min(30, len(self.target) // 10)
rolling_mean = self.target.rolling(window=window_size).mean()
rolling_std = self.target.rolling(window=window_size).std()

self.target.plot(ax=ax6, alpha=0.5, label='Original', color='gray')


rolling_mean.plot(ax=ax6, label=f'Rolling Mean ({window_size}d)', color='red')
rolling_std.plot(ax=ax6, label=f'Rolling Std ({window_size}d)', color='blue')
ax6.set_title('Rolling Statistics')
ax6.legend()

plt.tight_layout()
return fig

def analyze_timeseries(self):
"""
Perform statistical analysis of the time series
"""
# Compute rolling statistics
window_size = min(30, len(self.target) // 10)
rolling_mean = self.target.rolling(window=window_size).mean()
rolling_std = self.target.rolling(window=window_size).std()

analysis = {
'stationarity': {
'adf_test': adfuller(self.target.dropna()),
'kpss_test': kpss(self.target.dropna(), regression='ct')
},
'statistics': {
'original': {
'mean': self.target.mean(),
'std': self.target.std(),
'skew': self.target.skew(),
'kurtosis': self.target.kurtosis(),
'missing_values': self.target.isnull().sum()
},
'scaled': {
'mean': self.X_train_scaled.mean(),
'std': self.X_train_scaled.std(),
'skew': self.X_train_scaled.skew(),
'kurtosis': self.X_train_scaled.kurtosis()
}
},
'rolling_statistics': {
'mean_range': (rolling_mean.min(), rolling_mean.max()),
'std_range': (rolling_std.min(), rolling_std.max())
},
'seasonality': {
'period': pd.infer_freq(self.target.index),
'autocorr': self.target.autocorr()
}
}
return analysis

# Example usage
if __name__ == "__main__":
# Create sample time series data with trend and seasonality
dates = pd.date_range(start='2024-01-01', periods=1000, freq='D')
trend = np.linspace(0, 100, 1000)
seasonal = 10 * np.sin(2 * np.pi * np.arange(1000) / 365.25) # Yearly seasonality
noise = np.random.randn(1000) * 2
values = trend + seasonal + noise

ts_data = pd.Series(values, index=dates)

# Initialize analyzer
ts_analyzer = TimeSeriesAnalyzer(ts_data)

# Preprocess data
X_train_scaled, X_test_scaled = ts_analyzer.preprocess_timeseries()
# Create visualizations
fig = ts_analyzer.visualize_preprocessing()

# Get analysis results


analysis = ts_analyzer.analyze_timeseries()

# Print analysis results


print("\nTime Series Analysis Results:")
print("-" * 50)

print("\nStationarity Tests:")
print(f"ADF Test p-value: {analysis['stationarity']['adf_test'][1]:.4f}")
print(f"KPSS Test p-value: {analysis['stationarity']['kpss_test'][1]:.4f}")

print("\nOriginal Data Statistics:")


for stat, value in analysis['statistics']['original'].items():
print(f"{stat}: {value:.4f}")

print("\nRolling Statistics Ranges:")


print(f"Mean Range: {analysis['rolling_statistics']['mean_range']}")
print(f"Std Range: {analysis['rolling_statistics']['std_range']}")

print("\nSeasonality Analysis:")
print("Inferred Frequency:", analysis['seasonality']['period'])
print(f"Lag-1 Autocorrelation: {analysis['seasonality']['autocorr']:.4f}")

plt.show()

Time Series Analysis Results:


--------------------------------------------------

Stationarity Tests:
ADF Test p-value: 0.8380
KPSS Test p-value: 0.0100

Original Data Statistics:


mean: 50.6421
std: 28.7555
skew: 0.0959
kurtosis: -1.2890
missing_values: 0.0000

Rolling Statistics Ranges:


Mean Range: (3.874087232854882, 94.48724792189482)
Std Range: (1.4204589719136187, 3.400214128293137)

Seasonality Analysis:
Inferred Frequency: D
Lag-1 Autocorrelation: 0.9949
