Data Analysis Tools
Importing Libraries
NumPy: For numerical and array operations.
Pandas: For data manipulation and analysis.
Matplotlib and Seaborn: For creating visualizations.
Scipy.stats: For statistical tests and probability calculations.
Datetime: For working with dates and times.
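import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from datetime import datetime
import warnings
warnings.filterwarnings("ignore")  # silence library warnings to keep the output readable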
NumPy
1. Array Creation
Creating Arrays from Lists: Convert lists into NumPy arrays.
Zeros and Ones Arrays: np.zeros and np.ones create arrays filled with zeros or ones,
respectively.
Range and Linear Space Arrays: np.arange generates a sequence of numbers with a fixed step; np.linspace generates
a given number of evenly spaced values between two endpoints.
Random Arrays: np.random.rand generates arrays with random values between 0 and 1.
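## Array Creation
print("1. Array Creation")
# Create array from list
arr1 = np.array([1, 2, 3, 4, 5])
print("From list:", arr1)
# Zeros and ones arrays (shapes match the output below)
zeros_arr = np.zeros((3, 3))
ones_arr = np.ones((2, 4))
print("\nZeros array:\n", zeros_arr)
print("\nOnes array:\n", ones_arr)
# Range array: start at 0, stop before 10, step 2
range_arr = np.arange(0, 10, 2)
print("\nRange array:", range_arr)
# Random arrays
random_arr = np.random.rand(3, 3)
print("\nRandom array:\n", random_arr)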
1. Array Creation
From list: [1 2 3 4 5]
Zeros array:
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
Ones array:
[[1. 1. 1. 1.]
[1. 1. 1. 1.]]
Range array: [0 2 4 6 8]
Random array:
[[0.4322104 0.39502232 0.19561678]
[0.47345545 0.37266552 0.80169765]
[0.83859935 0.42368236 0.27333106]]
3. Array Operations
Element-wise Operations: Array addition and multiplication are applied element by element.
Mathematical Functions: np.sqrt, np.sum, and np.mean perform various mathematical operations on
arrays.
## Array Operations
print("\n3. Array Operations")
arr_a = np.array([1, 2, 3])
arr_b = np.array([4, 5, 6])
3. Array Operations
Addition: [5 7 9]
Multiplication: [ 4 10 18]
Square root: [1. 1.41421356 1.73205081]
Sum: 6
Mean: 2.0
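The print calls that produced the output above are not shown. A minimal sketch that would reproduce it, assuming the square root, sum, and mean are all taken over arr_a (which is consistent with the printed values):

```python
import numpy as np

arr_a = np.array([1, 2, 3])
arr_b = np.array([4, 5, 6])

# Element-wise arithmetic: elements at the same position are combined
added = arr_a + arr_b
multiplied = arr_a * arr_b

print("Addition:", added)
print("Multiplication:", multiplied)
# Mathematical functions applied to arr_a (assumed operand)
print("Square root:", np.sqrt(arr_a))
print("Sum:", np.sum(arr_a))
print("Mean:", np.mean(arr_a))
```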
4. Broadcasting
Broadcasting: Lets NumPy combine arrays of different shapes in a single operation by virtually stretching
the smaller array (or a scalar) to the shape of the larger one, without copying its data.
## Broadcasting
print("\n4. Broadcasting")
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
scalar = 2
print("Original array:\n", arr_2d)
print("\nArray * scalar:\n", arr_2d * scalar)
4. Broadcasting
Original array:
[[1 2 3]
[4 5 6]]
Array * scalar:
[[ 2 4 6]
[ 8 10 12]]
5. Universal Functions
Universal Functions: Pre-built NumPy functions, such as np.exp (exponential), np.sqrt (square root), and
np.sin (sine), that apply an operation to each element of an array.
## Universal Functions
print("\n5. Universal Functions")
arr = np.array([1, 2, 3, 4])
print("Original array:", arr)
print("Exponential:", np.exp(arr))
print("Square root:", np.sqrt(arr))
print("Sine:", np.sin(arr))
5. Universal Functions
Original array: [1 2 3 4]
Exponential: [ 2.71828183 7.3890561 20.08553692 54.59815003]
Square root: [1. 1.41421356 1.73205081 2. ]
Sine: [ 0.84147098 0.90929743 0.14112001 -0.7568025 ]
Pandas
1. DataFrame Creation
From Dictionary or List: Create DataFrames from dictionaries or lists for easy data organization.
## DataFrame Creation
print("\n1. DataFrame Creation")
# From dictionary
data_dict = {
'name': ['John', 'Anna', 'Peter'],
'age': [28, 22, 35],
'city': ['New York', 'Paris', 'London']
}
df_dict = pd.DataFrame(data_dict)
print("DataFrame from dictionary:\n", df_dict)
1. DataFrame Creation
DataFrame from dictionary:
name age city
0 John 28 New York
1 Anna 22 Paris
2 Peter 35 London
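## Data Import/Export
# Create sample CSV file
df_dict.to_csv('sample_data.csv', index=False)
print("\nReading CSV file:")
df_csv = pd.read_csv('sample_data.csv')
print(df_csv)
## Column Operations
print("\n2. Column Operations")
# Add new column (the CSV has no salary column; values taken from the output below)
df_csv['salary'] = [50000, 60000, 75000]
print("Added new column:\n", df_csv)
# Column calculations
df_csv['bonus'] = df_csv['salary'] * 0.1
print("\nCalculated bonus column:\n", df_csv)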
## Row Operations
print("\n3. Row Operations")
# Add new row
new_row = {'name': 'Sarah', 'age': 30, 'city': 'Berlin', 'salary': 65000, 'bonus': 6500}
df_csv = pd.concat([df_csv, pd.DataFrame([new_row])], ignore_index=True)
print("Added new row:\n", df_csv)
2. Column Operations
Added new column:
name age city salary
0 John 28 New York 50000
1 Anna 22 Paris 60000
2 Peter 35 London 75000
3. Row Operations
Added new row:
name age city salary bonus
0 John 28 New York 50000 5000.0
1 Anna 22 Paris 60000 6000.0
2 Peter 35 London 75000 7500.0
3 Sarah 30 Berlin 65000 6500.0
# iloc
print("\nUsing iloc to select by position:\n", df_csv.iloc[0])
print("\nSelecting using iloc slicing:\n", df_csv.iloc[0:2, 0:2])
# Boolean indexing
print("\nBoolean indexing (age > 30):\n", df_csv[df_csv['age'] > 30])
Data Cleaning
# Data Cleaning
2. Removing Duplicates
Removing Duplicates: drop_duplicates removes duplicate rows based on column values.
## Removing Duplicates
print("\n2. Removing Duplicates")
df_dupes = pd.DataFrame({
'A': [1, 1, 2, 3],
'B': [4, 4, 5, 6]
})
print("DataFrame with duplicates:\n", df_dupes)
print("\nAfter removing duplicates:\n", df_dupes.drop_duplicates())
Filled with 0:
A B C
0 1.0 0.0 7.0
1 0.0 5.0 8.0
2 3.0 6.0 0.0
Forward fill:
A B C
0 1.0 NaN 7.0
1 1.0 5.0 8.0
2 3.0 6.0 8.0
Backward fill:
A B C
0 1.0 5.0 7.0
1 3.0 5.0 8.0
2 3.0 6.0 NaN
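The three tables above look like the output of a missing-value handling cell that is not shown. A minimal sketch that would produce them, assuming an input DataFrame (here called df_nan, a hypothetical name) inferred from the filled results:

```python
import numpy as np
import pandas as pd

# Assumed input, reconstructed from the filled tables above
df_nan = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, 5.0, 6.0],
    'C': [7.0, 8.0, np.nan],
})

filled_zero = df_nan.fillna(0)  # replace every NaN with 0
forward = df_nan.ffill()        # carry the last valid value downward
backward = df_nan.bfill()       # pull the next valid value upward

print("Filled with 0:\n", filled_zero)
print("\nForward fill:\n", forward)
print("\nBackward fill:\n", backward)
```

Forward fill leaves a NaN in the first row of B and backward fill leaves one in the last row of C, because there is no earlier or later value to copy from.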
2. Removing Duplicates
DataFrame with duplicates:
A B
0 1 4
1 1 4
2 2 5
3 3 6
After conversion:
string_col int64
float_col float64
dtype: object
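The "After conversion" output above comes from a type-conversion cell that is not shown. A minimal sketch of that kind of conversion with astype, assuming both columns start out as text (the input values are hypothetical; only the column names come from the output):

```python
import pandas as pd

# Hypothetical input: numbers stored as strings
df_types = pd.DataFrame({
    'string_col': ['1', '2', '3'],
    'float_col': ['1.1', '2.2', '3.3'],
})

# Convert each column to an explicit numeric dtype
df_types['string_col'] = df_types['string_col'].astype('int64')
df_types['float_col'] = df_types['float_col'].astype('float64')

print("After conversion:\n", df_types.dtypes)
```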
4. String Operations
String Operations: Use str.lower and str.contains to manipulate or filter based on text data.
## String Operations
print("\n4. String Operations")
df_str = pd.DataFrame({
'text': ['HELLO', 'World', 'Python']
})
print("Original text:\n", df_str)
print("\nLowercase:\n", df_str['text'].str.lower())
print("Contains 'o':\n", df_str['text'].str.contains('o'))
4. String Operations
Original text:
text
0 HELLO
1 World
2 Python
Lowercase:
0 hello
1 world
2 python
Name: text, dtype: object
Contains 'o':
0 False
1 True
2 True
Name: text, dtype: bool
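'HELLO' reports False above because str.contains is case-sensitive by default and the string holds only an uppercase 'O'. A small sketch of the case-insensitive variant, using the same three strings:

```python
import pandas as pd

text = pd.Series(['HELLO', 'World', 'Python'])

case_sensitive = text.str.contains('o')                # matches lowercase 'o' only
case_insensitive = text.str.contains('o', case=False)  # matches 'o' or 'O'

print("Case-sensitive:\n", case_sensitive)
print("\nCase-insensitive:\n", case_insensitive)
```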
5. Date/Time Operations
Date/Time Operations: Extract year, month, etc., from datetime columns using .dt attributes.
## Date/Time Operations
print("\n5. Date/Time Operations")
df_dates = pd.DataFrame({
'dates': pd.date_range(start='2024-01-01', periods=3)
})
print("Dates DataFrame:\n", df_dates)
print("\nExtract year:\n", df_dates['dates'].dt.year)
print("Extract month:\n", df_dates['dates'].dt.month)
5. Date/Time Operations
Dates DataFrame:
dates
0 2024-01-01
1 2024-01-02
2 2024-01-03
Extract year:
0 2024
1 2024
2 2024
Name: dates, dtype: int32
Extract month:
0 1
1 1
2 1
Name: dates, dtype: int32
Matplotlib Visualizations
Basic Plots
Line Plot: Plot data on a line graph, such as a sine wave.
Scatter Plot: Plot discrete data points to identify trends or patterns.
Bar Plot: Visualize data in different categories.
Histogram: Plot frequency distributions to observe the spread of data.
# Matplotlib Visualizations
## Basic Plots
print("\n1. Basic Plots")
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Line plot
plt.figure(figsize=(10, 6))
plt.plot(x, y)
plt.title('Line Plot')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.grid(True)
plt.show()
1. Basic Plots
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(x[::10], y[::10])
plt.title('Scatter Plot')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.grid(True)
plt.show()
# Bar plot
categories = ['A', 'B', 'C', 'D']
values = [4, 3, 2, 1]
plt.figure(figsize=(10, 6))
plt.bar(categories, values)
plt.title('Bar Plot')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
# Histogram
data = np.random.randn(1000)
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30)
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Subplots
Subplots: plt.subplots lets you plot multiple plots in a single figure.
## Subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
ax1.plot(x, np.sin(x))
ax1.set_title('Sine')
ax2.plot(x, np.cos(x))
ax2.set_title('Cosine')
plt.show()
Seaborn Visualizations
Statistical Plots
Distribution Plot: Visualize distributions of data with KDE (Kernel Density Estimation).
Categorical Plots: Use sns.boxplot to show distributions across categories.
Regression Plot: Shows linear relationship with sns.regplot.
Heatmap: Shows correlation between variables with color-coded cells.
# Seaborn Visualizations
## Statistical Plots
# Distribution plot
plt.figure(figsize=(10, 6))
sns.histplot(data=np.random.randn(1000), kde=True)
plt.title('Distribution Plot')
plt.show()
## Categorical Plots
data = pd.DataFrame({
'category': ['A', 'A', 'B', 'B', 'C', 'C'] * 10,
'value': np.random.randn(60)
})
plt.figure(figsize=(10, 6))
sns.boxplot(x='category', y='value', data=data)
plt.title('Box Plot')
plt.show()
## Regression Plot
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.4
plt.figure(figsize=(10, 6))
sns.regplot(x=x, y=y)
plt.title('Regression Plot')
plt.show()
## Heatmap
correlation_matrix = np.corrcoef(np.random.randn(5, 100))
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
Statistical Analysis
Descriptive Statistics
Descriptive Statistics: Summarize data properties like mean, median, and percentiles.
# Statistical Analysis
## Descriptive Statistics
data = np.random.randn(1000)
print("\nDescriptive Statistics:")
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Standard Deviation:", np.std(data))
print("Variance:", np.var(data))
print("Minimum:", np.min(data))
print("Maximum:", np.max(data))
print("25th percentile:", np.percentile(data, 25))
print("75th percentile:", np.percentile(data, 75))
Descriptive Statistics:
Mean: 0.030350822870012565
Median: 0.04352972676337927
Standard Deviation: 0.9608360092179253
Variance: 0.923205836609829
Minimum: -2.86059084616054
Maximum: 2.671077577964687
25th percentile: -0.5802432556961682
75th percentile: 0.6828388868730023
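Note that np.std and np.var divide by N by default (ddof=0, the population formulas), while pandas' Series.std divides by N - 1. A small sketch of the difference, using a made-up sample:

```python
import numpy as np
import pandas as pd

# Made-up sample with mean 5 and squared deviations summing to 32
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = np.std(sample)              # ddof=0: divide by N
sample_std = np.std(sample, ddof=1)   # ddof=1: divide by N - 1
pandas_std = pd.Series(sample).std()  # pandas defaults to ddof=1

print("Population std:", pop_std)
print("Sample std (NumPy, ddof=1):", sample_std)
print("Sample std (pandas):", pandas_std)
```

For 1000 points the two estimates are nearly identical, but on small samples the choice of ddof is visible.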
Statistical Testing
t-test: Compare the means of two samples using a t-test.
Chi-square Test: Compare observed and expected frequencies.
Confidence Intervals: Calculate a confidence interval for the mean.
## Statistical Testing
# t-test
sample1 = np.random.normal(loc=0, scale=1, size=100)
sample2 = np.random.normal(loc=0.5, scale=1, size=100)
t_stat, p_value = stats.ttest_ind(sample1, sample2)
print("\nt-test results:")
print("t-statistic:", t_stat)
print("p-value:", p_value)
t-test results:
t-statistic: -5.398015890847371
p-value: 1.9157658589928207e-07
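stats.ttest_ind assumes equal variances in both samples by default. When that assumption is doubtful, passing equal_var=False runs Welch's t-test instead. A minimal sketch, assuming two samples with different spreads (the seed and parameters are illustrative, not from the original):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seeded so the run is repeatable
sample1 = rng.normal(loc=0.0, scale=1.0, size=100)
sample2 = rng.normal(loc=0.5, scale=2.0, size=100)  # wider spread than sample1

student = stats.ttest_ind(sample1, sample2)                 # pooled-variance t-test
welch = stats.ttest_ind(sample1, sample2, equal_var=False)  # Welch's t-test

print("Student t-test p-value:", student.pvalue)
print("Welch t-test p-value:", welch.pvalue)
```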
import numpy as np
from scipy.stats import chi2_contingency
# Hypothetical 2x2 contingency table (the original observed data is not shown):
# rows are groups, columns are outcome categories
observed = np.array([[30, 10],
                     [20, 40]])
# Chi-square test of independence
chi2_statistic, p_value, degrees_of_freedom, expected_frequencies = chi2_contingency(observed)
# Display results
print("Chi-Square Test Results")
print(f"Chi-Square Statistic: {chi2_statistic:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of Freedom: {degrees_of_freedom}")
print("Expected Frequencies Table:\n", expected_frequencies)
# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("\nConclusion: Reject the null hypothesis")
    print("The variables are likely dependent (there is a significant association).")
else:
    print("\nConclusion: Fail to reject the null hypothesis")
    print("The variables are likely independent (no significant association).")
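import numpy as np
from scipy.stats import norm
# Sample data
data = np.array([23, 21, 25, 27, 22, 26, 24, 28, 22, 30])
# Sample mean and standard error (ddof=1 gives the sample standard deviation)
mean = np.mean(data)
std_error = np.std(data, ddof=1) / np.sqrt(len(data))
# Margin of error from the normal critical value for 95% confidence (about 1.96)
margin_of_error = norm.ppf(0.975) * std_error
# Confidence interval
ci_lower = mean - margin_of_error
ci_upper = mean + margin_of_error
print(f"Mean: {mean}")
print(f"95% Confidence Interval: ({ci_lower}, {ci_upper})")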
Mean: 24.8
95% Confidence Interval: (22.98005737426862, 26.619942625731383)
# Advanced indexing
arr_3d = np.arange(24).reshape(2, 3, 4)
print("\n3D array:\n", arr_3d)
print("\nComplex slice:\n", arr_3d[0, 1:, 2:])
# Fancy indexing
arr = np.arange(10)
indices = [2, 5, -1]
print("\nFancy indexing:", arr[indices])
# Boolean masking
mask = arr > 5
print("\nBoolean masking:", arr[mask])
Original array: [ 0 1 2 3 4 5 6 7 8 9 10 11]
Reshaped to 3x4:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
Transposed:
[[ 0 4 8]
[ 1 5 9]
[ 2 6 10]
[ 3 7 11]]
3D array:
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
Complex slice:
[[ 6 7]
[10 11]]
Fancy indexing: [2 5 9]
Boolean masking: [6 7 8 9]
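The first three outputs above (original, reshaped, and transposed arrays) come from code that is not shown. A minimal sketch that would produce them, assuming a 12-element range as the starting array:

```python
import numpy as np

arr_12 = np.arange(12)
reshaped = arr_12.reshape(3, 4)  # same 12 values laid out as 3 rows x 4 columns
transposed = reshaped.T          # swap the axes: shape becomes (4, 3)

print("Original array:", arr_12)
print("\nReshaped to 3x4:\n", reshaped)
print("\nTransposed:\n", transposed)
```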
Advanced Broadcasting
## Advanced Broadcasting
# Broadcasting with different dimensions
a = np.array([[1, 2, 3], [4, 5, 6]]) # shape (2, 3)
b = np.array([10, 20, 30]) # shape (3,), reused for each of the two rows
print("\nBroadcasting result:\n", a * b)
Broadcasting result:
[[ 10 40 90]
[ 40 100 180]]
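NumPy compares shapes from the trailing dimension backward: two dimensions are compatible when they are equal or when one of them is 1. A small sketch of a row vector, a column vector, and an incompatible shape (the col array and the mismatch case are illustrative additions):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,), treated as (1, 3)
col = np.array([[100], [200]])        # shape (2, 1)

by_row = a * row  # row is reused for each of the 2 rows
by_col = a + col  # col is reused for each of the 3 columns

# Shapes that cannot be aligned raise a ValueError
try:
    a * np.array([1, 2])  # (2, 3) vs (2,): trailing dimensions 3 and 2 differ
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print("Row broadcast:\n", by_row)
print("\nColumn broadcast:\n", by_col)
print("\nMismatch raised ValueError:", mismatch_raised)
```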
# Rolling statistics
print("\nRolling mean (window=3):\n", df.rolling(window=3).mean())
Grouped statistics:
                 A                   B                   C
              mean       std       min       max       sum
category
X         0.507879  0.028938  0.108282  2.346762  0.531478
Y        -0.382340  0.848243  0.205079  0.980465 -0.237087
Z         0.097854  1.901895 -1.031132  0.456673 -1.973227
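The code that built the DataFrame and the grouped table above is not shown. A minimal sketch of the same pattern, assuming a frame with a category column and numeric columns A, B, and C (the values and the name df_groups are made up; only the column names follow the output):

```python
import pandas as pd

# Made-up data; only the column names follow the output above
df_groups = pd.DataFrame({
    'category': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
    'A': [1.0, 3.0, 2.0, 4.0, 5.0, 7.0],
    'B': [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
    'C': [10, 20, 30, 40, 50, 60],
})

# A different set of aggregations per column gives two-level column headers
grouped = df_groups.groupby('category').agg({
    'A': ['mean', 'std'],
    'B': ['min', 'max'],
    'C': ['sum'],
})
print("Grouped statistics:\n", grouped)

# Rolling mean over a 3-row window; the first two rows have no full window yet
rolling_mean = df_groups['A'].rolling(window=3).mean()
print("\nRolling mean (window=3):\n", rolling_mean)
```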
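from statsmodels.tsa.stattools import adfuller, kpss

def run_kpss_test(ts):
    """
    Run the KPSS test and collect the results in a dictionary.
    Note: the function name and the kpss() call are assumptions;
    the start of this function is missing from the source.
    """
    try:
        statistic, p_value, n_lags, critical_values = kpss(ts.dropna())
        kpss_result = {
            'statistic': statistic,
            'p_value': p_value,
            'n_lags': n_lags,
            'critical_values': critical_values,
            'status': 'success'
        }
    except Exception as e:
        kpss_result = {
            'statistic': None,
            'p_value': None,
            'n_lags': None,
            'critical_values': None,
            'status': f'failed: {str(e)}'
        }
    return kpss_result

def check_stationarity(ts):
    """
    Perform comprehensive stationarity analysis
    """
    results = {}
    # 1. ADF Test
    try:
        adf_test = adfuller(ts.dropna())
        results['adf_test'] = {
            'statistic': adf_test[0],
            'p_value': adf_test[1],
            'critical_values': adf_test[4],
            'status': 'success'
        }
    except Exception as e:
        results['adf_test'] = {
            'status': f'failed: {str(e)}'
        }
    # 2. KPSS Test (the example usage reads results['kpss_test'])
    results['kpss_test'] = run_kpss_test(ts)
    return results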
def analyze_timeseries_stability(ts):
    """
    Analyze time series stability and characteristics
    """
    # Calculate basic stability metrics
    stability_metrics = {
        'volatility': ts.std(),
        'trend_strength': abs(np.corrcoef(ts.values, np.arange(len(ts)))[0, 1]),
        'seasonality_test': seasonal_strength(ts)
    }
    return stability_metrics
return seasonal_strength
except:
return None
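# Example usage
if __name__ == "__main__":
    # Create sample time series
    np.random.seed(42)
    ts = pd.Series(
        np.random.randn(1000),
        index=pd.date_range('2024-01-01', periods=1000)
    )
    # Run the analyses defined above
    stationarity_results = check_stationarity(ts)
    stability_metrics = analyze_timeseries_stability(ts)
    print("\n1. Stationarity Tests:")
    if stationarity_results['adf_test']['status'] == 'success':
        print("ADF Test:")
        print(f"Statistic: {stationarity_results['adf_test']['statistic']:.4f}")
        print(f"P-value: {stationarity_results['adf_test']['p_value']:.4f}")
        print("Critical Values:")
        for level, value in stationarity_results['adf_test']['critical_values'].items():
            print(f"{level}: {value:.4f}")
    if stationarity_results['kpss_test']['status'] == 'success':
        print("\nKPSS Test:")
        print(f"Statistic: {stationarity_results['kpss_test']['statistic']:.4f}")
        print(f"P-value: {stationarity_results['kpss_test']['p_value']:.4f}")
    print("\n2. Stability Metrics:")
    print(f"Volatility: {stability_metrics['volatility']:.4f}")
    print(f"Trend Strength: {stability_metrics['trend_strength']:.4f}")
    print(f"Seasonality Strength: {stability_metrics['seasonality_test']:.4f}")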
1. Stationarity Tests:
ADF Test:
Statistic: -0.6882
P-value: 0.8498
Critical Values:
1%: -3.4371
5%: -2.8645
10%: -2.5683
2. Stability Metrics:
Volatility: 3.3868
Trend Strength: 0.8602
Seasonality Strength: 0.6268
# Extract age
text_df['age'] = text_df['text'].str.extract(r'(\d+)', expand=False).astype(int)  # expand=False returns a Series
print("\nExtracted ages:\n", text_df['age'])
Extracted names:
0 John Doe
1 Jane
2 Bob Wilson
Name: name, dtype: object
Extracted ages:
0 30
1 25
2 35
Name: age, dtype: int64
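The definition of text_df and the name extraction are not shown. A minimal sketch of regex-based extraction that yields the same names and ages, assuming hypothetical input sentences of the form "<name> is <age> years old" (the sentences and the name pattern are assumptions):

```python
import pandas as pd

# Hypothetical input strings; the original text_df was not preserved
text_df = pd.DataFrame({
    'text': [
        'John Doe is 30 years old',
        'Jane is 25 years old',
        'Bob Wilson is 35 years old',
    ]
})

# Capture everything before " is" as the name
text_df['name'] = text_df['text'].str.extract(r'^(.+?) is', expand=False)
# Capture the first run of digits as the age
text_df['age'] = text_df['text'].str.extract(r'(\d+)', expand=False).astype(int)

print("Extracted names:\n", text_df['name'])
print("\nExtracted ages:\n", text_df['age'])
```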
# Interpolation
print("\nLinear interpolation:\n", df_missing.interpolate(method='linear'))
print("\nPolynomial interpolation:\n", df_missing.interpolate(method='polynomial', order=2))
Linear interpolation:
A B C
0 1.0 NaN 1
1 2.0 2.0 2
2 3.0 3.0 3
3 4.0 4.0 4
4 5.0 5.0 5
Polynomial interpolation:
A B C
0 1.0 NaN 1
1 2.0 2.0 2
2 3.0 3.0 3
3 4.0 4.0 4
4 5.0 5.0 5
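The NaN in the first row of column B survives both methods because interpolate, by default, only fills gaps that have a valid value before them. A small sketch of that behaviour on a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up series with a leading gap and two interior gaps
s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan, np.nan, 7.0])

# Linear interpolation draws a straight line between the neighbouring valid values
linear = s.interpolate(method='linear')

print("Original:\n", s)
print("\nLinear interpolation:\n", linear)
```

The interior gaps are filled with 3.0, 5.0, and 6.0, while the leading NaN stays in place.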
Advanced Visualization
## Complex Matplotlib Plots
# Create sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
plt.tight_layout()
plt.show()
Advanced Seaborn Visualizations
## Advanced Seaborn Visualizations
# Create complex dataset
n_points = 1000
data = pd.DataFrame({
'x': np.random.normal(0, 1, n_points),
'y': np.random.normal(0, 1, n_points),
'category': np.random.choice(['A', 'B', 'C'], n_points),
'size': np.random.uniform(10, 100, n_points)
})
# Plot components
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 10))
ax1.plot(t, time_series, label='Original')
ax1.legend()
ax2.plot(t, trend, label='Trend', color='red')
ax2.legend()
ax3.plot(t, seasonal, label='Seasonal + Noise', color='green')
ax3.legend()
plt.tight_layout()
plt.show()
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss

class TimeSeriesAnalyzer:
    def __init__(self, data, date_column=None, target_column=None, freq='D'):
        """
        Initialize with time series data
        """
        self.raw_data = data
        self.freq = freq
        # Handle datetime index
        if date_column:
            self.data = data.set_index(date_column)
        else:
            self.data = data
        # Select the series to analyze
        if target_column:
            self.target = self.data[target_column]
        else:
            self.target = self.data
        self.scaler = StandardScaler()
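    def plot_seasonal_decomposition(self, ax):
        """
        Plot trend, seasonal, and residual components on one axis
        """
        # Assumed call: the original decomposition step is missing from the source
        decomposition = seasonal_decompose(self.target.dropna())
        # Plot trend
        ax.plot(decomposition.trend, label='Trend', color='blue')
        # Plot seasonal
        ax.plot(decomposition.seasonal, label='Seasonal', color='green', alpha=0.6)
        # Plot residual
        ax.plot(decomposition.resid, label='Residual', color='red', alpha=0.4)
        ax.legend()
        ax.set_title('Seasonal Decomposition')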
    def visualize_preprocessing(self):
        """
        Create comprehensive time series visualizations
        """
        fig = plt.figure(figsize=(20, 15))
        # 2. Seasonal Decomposition
        ax2 = plt.subplot(322)
        self.plot_seasonal_decomposition(ax2)
        # 5. Distribution Plots
        ax5 = plt.subplot(325)
        sns.kdeplot(data=self.target, ax=ax5, label='Original')
        sns.kdeplot(data=self.X_train_scaled, ax=ax5, label='Scaled')
        ax5.set_title('Distribution Comparison')
        ax5.legend()
        # 6. Rolling Statistics
        ax6 = plt.subplot(326)
        window_size = min(30, len(self.target) // 10)
        rolling_mean = self.target.rolling(window=window_size).mean()
        rolling_std = self.target.rolling(window=window_size).std()
        plt.tight_layout()
        return fig
    def analyze_timeseries(self):
        """
        Perform statistical analysis of the time series
        """
        # Compute rolling statistics
        window_size = min(30, len(self.target) // 10)
        rolling_mean = self.target.rolling(window=window_size).mean()
        rolling_std = self.target.rolling(window=window_size).std()
        analysis = {
            'stationarity': {
                'adf_test': adfuller(self.target.dropna()),
                'kpss_test': kpss(self.target.dropna(), regression='ct')
            },
            'statistics': {
                'original': {
                    'mean': self.target.mean(),
                    'std': self.target.std(),
                    'skew': self.target.skew(),
                    'kurtosis': self.target.kurtosis(),
                    'missing_values': self.target.isnull().sum()
                },
                'scaled': {
                    'mean': self.X_train_scaled.mean(),
                    'std': self.X_train_scaled.std(),
                    'skew': self.X_train_scaled.skew(),
                    'kurtosis': self.X_train_scaled.kurtosis()
                }
            },
            'rolling_statistics': {
                'mean_range': (rolling_mean.min(), rolling_mean.max()),
                'std_range': (rolling_std.min(), rolling_std.max())
            },
            'seasonality': {
                'period': pd.infer_freq(self.target.index),
                'autocorr': self.target.autocorr()
            }
        }
        return analysis
# Example usage
if __name__ == "__main__":
    # Create sample time series data with trend and seasonality
    dates = pd.date_range(start='2024-01-01', periods=1000, freq='D')
    trend = np.linspace(0, 100, 1000)
    seasonal = 10 * np.sin(2 * np.pi * np.arange(1000) / 365.25)  # Yearly seasonality
    noise = np.random.randn(1000) * 2
    values = trend + seasonal + noise
    # Assumed construction (the original line is missing): a Series indexed by date
    ts_data = pd.Series(values, index=dates)
    # Initialize analyzer
    ts_analyzer = TimeSeriesAnalyzer(ts_data)
    # Preprocess data
    X_train_scaled, X_test_scaled = ts_analyzer.preprocess_timeseries()
    # Create visualizations
    fig = ts_analyzer.visualize_preprocessing()
    # Run the statistical analysis
    analysis = ts_analyzer.analyze_timeseries()
    print("\nStationarity Tests:")
    print(f"ADF Test p-value: {analysis['stationarity']['adf_test'][1]:.4f}")
    print(f"KPSS Test p-value: {analysis['stationarity']['kpss_test'][1]:.4f}")
    print("\nSeasonality Analysis:")
    print("Inferred Frequency:", analysis['seasonality']['period'])
    print(f"Lag-1 Autocorrelation: {analysis['seasonality']['autocorr']:.4f}")
    plt.show()
Stationarity Tests:
ADF Test p-value: 0.8380
KPSS Test p-value: 0.0100
Seasonality Analysis:
Inferred Frequency: D
Lag-1 Autocorrelation: 0.9949