Shell or Notebook? Launching the IPython Shell - Launching the Jupyter Notebook - Help and Documentation in IPython - Exploring Modules with Tab Completion - Keyboard Shortcuts in the IPython Shell - IPython Magic Commands.
Providing "full notes" equivalent to a textbook is beyond the scope of a single response, but the following gives a detailed and extensive explanation of each topic in the syllabus, providing a strong foundation for your learning.
● IPython: An enhanced interactive Python shell that Jupyter Notebooks run on top of. It
provides many features that make interactive computing more powerful.
● ? (Introspection):
○ Append ? to a variable, function, method, or object to get quick access to its
documentation (docstring).
○ Example:
Python
import numpy as np
np.array?
○ This will open a pager at the bottom of the screen with information like signature,
docstring, type, and file location.
● ?? (Source Code):
○ Append ?? to a function or method to view its full source code (if available, not
for compiled C extensions).
○ Example:
Python
def my_function(x):
    """This is a docstring."""
    return x * 2

my_function??
● Tab Completion:
○ Object Methods/Attributes: Type object_name. and press Tab to see
available methods and attributes.
○ Module Contents: Type module_name. and press Tab to see functions,
classes, and variables within that module.
○ File Path Completion: In string literals, press Tab to complete file paths.
○ Function Signature: After typing function_name(, pressing Shift + Tab
(once, twice, thrice) can bring up parameter information/docstring (especially
useful in notebooks).
● help() function:
○ Standard Python built-in function. help(object_name) provides a more
verbose, pager-based help.
○ Example: help(np.mean)
○ Example (tab completion on an object):
Python
import pandas as pd
pd.DataFrame.  # Press Tab here to see all methods like .head, .describe, .iloc, etc.
● This significantly reduces the need to constantly look up documentation manually,
making coding faster and more efficient.
● IPython Magic Commands: Special commands that start with % (line magics) or %% (cell magics). They extend the functionality of IPython and Jupyter, offering convenient shortcuts for common tasks.
● Line Magics (%): Apply to a single line.
○ %run script.py: Run a Python script.
○ %timeit expression: Time the execution of a single line of Python code (runs multiple times for accuracy).
Python
%timeit [i**2 for i in range(1000)]
○ %debug: Enter the interactive debugger after an exception.
○ %who / %whos: List variables defined in the current namespace (with details).
○ %lsmagic: List all available magic commands.
○ %pwd: Print current working directory.
○ %cd path/to/directory: Change current working directory.
○ %env: List environment variables.
○ %matplotlib inline (or %matplotlib notebook): Render Matplotlib plots
directly within the notebook output. Essential for visualization.
○ %load_ext autoreload: Load the autoreload extension.
○ %autoreload 2: Automatically reload modules before executing code (useful
during development).
● Cell Magics (%%): Apply to the entire cell. Must be the first line of the cell.
○ %%timeit: Time the execution of the entire cell (runs it multiple times for accuracy).
○ %%time: Report the wall clock time and CPU time for the cell.
○ %%bash / %%sh: Execute the cell content as a bash/shell command.
○ %%html: Render the cell content as HTML.
○ %%writefile filename.py: Write the cell content to a file.
○ %%file filename.txt: (Deprecated, %%writefile is preferred).
○ %%latex: Render the cell content as LaTeX.
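A short illustration of two common cell magics; this is a hedged sketch assuming each magic is the first line of its own notebook cell (the filename hello.py is arbitrary):
Python
%%writefile hello.py
# Everything below the magic line is written to hello.py instead of being executed
print("Hello from a generated script")

Python
%%time
# Reports wall time and CPU time for this whole cell
total = sum(i**2 for i in range(100_000))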
Unit – II: NumPy (15 Hrs)
NumPy (Numerical Python) is the fundamental package for numerical computation in Python,
providing powerful N-dimensional array objects and tools for integrating C/C++ and Fortran
code.
1. Introduction to NumPy
● Why NumPy?
○ Python lists are general-purpose, but for numerical operations, they are slow and
inefficient for large datasets.
○ NumPy arrays (ndarray) are designed for efficient numerical operations on
large amounts of data.
○ They are homogeneous (all elements are of the same data type), which allows
for highly optimized, vectorized operations.
○ Under the hood, NumPy operations are often implemented in C or Fortran,
making them much faster than pure Python loops.
● Key Features:
○ ndarray: A fast and efficient multi-dimensional array object.
○ Mathematical functions for operating on arrays (linear algebra, Fourier
transforms, random number generation).
○ Tools for integrating C/C++ and Fortran code.
● Installation: Usually comes with Anaconda. Otherwise: pip install numpy
● Import Convention: import numpy as np
● Creating Arrays:
○ Fixed-Size Arrays:
Python
np.zeros(5, dtype=int) # array([0, 0, 0, 0, 0])
np.ones((3, 5), dtype=float) # 3x5 array of ones
np.full((2, 2), 7) # 2x2 array with all 7s
np.empty(3) # Uninitialized values
○ Sequences:
Python
np.arange(0, 10, 2) # Like range(), but returns an array: array([0, 2, 4, 6, 8])
np.linspace(0, 1, 5) # 5 evenly spaced numbers between 0 and 1: array([0., 0.25, 0.5, 0.75, 1.])
○ Random Arrays:
Python
np.random.rand(3, 3) # Uniform distribution [0, 1)
np.random.randn(3, 3) # Standard normal distribution
np.random.randint(0, 10, size=(3, 3)) # Random integers
○ Identity Matrix:
Python
np.eye(3) # 3x3 identity matrix
● Array Attributes:
○ ndim: Number of dimensions.
○ shape: Tuple indicating the size of each dimension.
○ size: Total number of elements.
○ dtype: Data type of the elements (e.g., int64, float64).
○ itemsize: Size of each element in bytes.
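A quick illustrative sketch of these attributes on a small (hypothetical) array:
Python
import numpy as np
x = np.random.randint(10, size=(3, 4))
x.ndim      # 2
x.shape     # (3, 4)
x.size      # 12
x.dtype     # e.g. dtype('int64'); platform dependent
x.itemsize  # 8 bytes per element for int64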
● Array Indexing and Slicing:
○ Multi-dimensional Arrays:
Python
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[0, 0] # 1
arr2d[2, 1] # 8
arr2d[:2, :2] # Slice rows and columns
# array([[1, 2],
# [4, 5]])
arr2d[1, :] # Second row: array([4, 5, 6])
arr2d[:, 0] # First column: array([1, 4, 7])
○ Fancy Indexing: Using arrays of integers or booleans to select arbitrary subsets of data.
Python
arr = np.array([10, 20, 30, 40, 50])
indices = [0, 2, 4]
arr[indices] # array([10, 30, 50])
● Reshaping Arrays:
○ reshape(): Returns a new array with a different shape, without changing the data.
Python
arr = np.arange(1, 10)
arr.reshape((3, 3))
# array([[1, 2, 3],
#        [4, 5, 6],
#        [7, 8, 9]])
○ ravel() / flatten(): Convert multi-dimensional array to 1D. flatten()
returns a copy, ravel() returns a view (if possible).
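A minimal sketch of the copy-vs-view distinction (assuming a small contiguous array, where ravel() can return a view):
Python
arr2d = np.arange(6).reshape(2, 3)
flat = arr2d.flatten()  # always a copy
rav = arr2d.ravel()     # a view when possible
rav[0] = 99             # also modifies arr2d, because rav is a view here
arr2d[0, 0]             # 99
flat[1] = -1            # does not affect arr2d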
● Concatenation and Splitting:
○ np.concatenate(): Join a sequence of arrays along an existing axis.
○ np.vstack(): Stack arrays vertically (row-wise).
○ np.hstack(): Stack arrays horizontally (column-wise).
○ np.dstack(): Stack arrays depth-wise (3D).
○ np.split(), np.vsplit(), np.hsplit(), np.dsplit(): Split arrays into
multiple sub-arrays.
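A compact sketch of stacking and splitting (small illustrative arrays):
Python
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7], [6, 5, 4]])
np.vstack([x, grid])                 # stack x as a new top row -> shape (3, 3)
np.hstack([grid, [[99], [99]]])      # append a column -> shape (2, 4)
upper, lower = np.vsplit(np.vstack([x, grid]), [1])  # split after the first row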
● Universal Functions (UFuncs): Vectorized functions that operate element-wise on arrays.
○ Arithmetic Operators: +, -, *, /, //, **, %. Applied element-wise.
○ Comparison Operators: >, <, ==, !=, >=, <=. Return boolean arrays.
○ Trigonometric Functions: np.sin(), np.cos(), np.tan().
○ Exponentials and Logarithms: np.exp(), np.log(), np.log2(),
np.log10().
○ Other UFuncs: np.abs(), np.sqrt(), np.ceil(), np.floor(),
np.round().
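A short sketch of a few ufuncs applied element-wise:
Python
x = np.array([1, 2, 3, 4])
x + 5        # array([6, 7, 8, 9])
x > 2        # array([False, False,  True,  True])
np.exp(x)    # e**x for each element
np.sqrt(x)   # element-wise square root
np.round(np.sin(x), 2)  # element-wise sine, rounded to 2 decimals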
● Broadcasting:
○ A powerful mechanism that allows NumPy to perform operations on arrays of
different shapes.
○ It effectively "stretches" the smaller array across the larger array so that they
have compatible shapes.
○ Rules:
1. If the arrays have different numbers of dimensions, prepend 1s to the
shape of the smaller array until both shapes have the same length.
2. Two dimensions are compatible when they are equal, or one of them is 1.
3. If the dimensions are incompatible, an error is raised.
Example:
Python
a = np.array([0, 10, 20, 30]) # shape (4,)
b = np.array([0, 1, 2]) # shape (3,)
# Cannot directly add.
# But:
a = np.arange(3)[:, np.newaxis] # shape (3, 1)
b = np.arange(3) # shape (3,)
a+b
# array([[0, 1, 2],
# [1, 2, 3],
# [2, 3, 4]])
○ The b array (shape (3,)) is broadcast across the columns of a (shape (3,1)).
● Aggregations: Operations that collapse an array (or parts of it) into a single value.
● Common Aggregations:
○ np.sum(), np.min(), np.max(), np.mean(), np.median(), np.std()
(standard deviation), np.var() (variance).
● Axis Argument: Most aggregation functions accept an axis argument to specify along
which dimension the aggregation should occur.
○ axis=0: Aggregate down the columns (collapses the rows, giving one value per column).
○ axis=1: Aggregate across the rows (collapses the columns, giving one value per row).
○ If axis is not specified, the aggregation is performed over the entire array.
Python
M = np.random.randint(0, 10, (3, 4))
# array([[8, 2, 5, 0],
# [7, 6, 8, 8],
# [0, 2, 4, 7]])
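# Illustrative continuation: aggregating M along each axis
M.sum(axis=0)  # sum down each column -> shape (4,); array([15, 10, 17, 15]) for the example above
M.sum(axis=1)  # sum across each row  -> shape (3,); array([15, 29, 13]) for the example above
M.min()        # minimum over the entire array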
● np.nansum(), np.nanmean(), etc.: Versions of aggregation functions that ignore NaN
(Not a Number) values.
● Sorting Arrays:
○ np.sort(arr): Returns a sorted copy of the array.
○ arr.sort(): Sorts the array in-place.
○ np.argsort(arr): Returns the indices that would sort the array.
Python
x = np.array([2, 1, 4, 3, 5])
np.sort(x) # array([1, 2, 3, 4, 5])
i = np.argsort(x) # array([1, 0, 3, 2, 4])
x[i] # array([1, 2, 3, 4, 5])
○ Sorting along an axis for multi-dimensional arrays.
● Partial Sorts:
○ np.partition(arr, K): Returns a copy in which the value that would sit at index K in a fully sorted array is in that position, with all smaller values (in arbitrary order) to its left and all larger values to its right.
○ np.argpartition(): Returns the indices.
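A minimal partial-sort sketch:
Python
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)     # e.g. array([2, 1, 3, 4, 6, 5, 7]); 4 is in its sorted spot, each side is unordered
np.argpartition(x, 3)  # indices that produce the same partition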
● Structured Arrays:
○ Arrays with compound data types, allowing elements to have multiple fields (like
a C struct or database row).
Python
data = np.zeros(3, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data[0] = ('Alice', 25, 55.5)
data['name']  # Access a whole field by name
● Masking and Filtering:
○ Using boolean arrays created by comparison operators to select elements.
Python
x = np.arange(10)
x[x % 2 == 0] # Select even numbers: array([0, 2, 4, 6, 8])
● Linear Algebra:
○ np.dot(a, b) or a @ b (Python 3.5+): Dot product of two arrays (matrix
multiplication).
○ np.linalg.inv(matrix): Inverse of a matrix.
○ np.linalg.det(matrix): Determinant of a matrix.
○ np.linalg.eig(matrix): Eigenvalues and eigenvectors.
● Broadcasting in more complex scenarios (e.g., adding a 1D array to a 2D array,
column-wise).
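A short linear-algebra sketch illustrating the calls listed above (small example matrix):
Python
A = np.array([[1., 2.], [3., 4.]])
b = np.array([5., 6.])
A @ b                          # matrix-vector product: array([17., 39.])
np.linalg.inv(A)               # inverse of A
np.linalg.det(A)               # determinant: -2.0
vals, vecs = np.linalg.eig(A)  # eigenvalues and eigenvectors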
1. Introduction to Pandas
● Why Pandas?
○ NumPy is great for numerical arrays, but lacks labels for rows/columns and
handles heterogeneous data poorly.
○ Pandas introduces two primary data structures: Series (1D labeled array) and
DataFrame (2D labeled table).
○ Makes data cleaning, transformation, analysis, and visualization much easier and
more intuitive.
○ Excellent for handling tabular data, time series data, and heterogeneous data.
● Key Features:
○ DataFrame objects for data manipulation with integrated indexing.
○ Tools for reading and writing data between in-memory data structures and
different formats: CSV, text files, SQL databases, HDF5 format.
○ Intelligent data alignment and integrated handling of missing data.
○ Flexible groupby functionality for performing split-apply-combine operations.
○ High performance merging and joining of datasets.
○ Time series functionality.
● Installation: Usually comes with Anaconda. Otherwise: pip install pandas
● Import Convention: import pandas as pd
● Definition: A one-dimensional labeled array capable of holding any data type (integers,
strings, floats, Python objects, etc.).
● Creation:
○ From List/Array:
Python
s = pd.Series([0.25, 0.5, 0.75, 1.0])
● Attributes: s.values (NumPy array of values), s.index (Pandas Index object).
● Indexing and Slicing:
○ Explicit Index: s['a']
○ Implicit (Integer) Index: s[0]
○ Slicing by Explicit Index: s['a':'c'] (inclusive of end)
○ Slicing by Implicit Index: s[0:2] (exclusive of end)
○ Fancy Indexing: s[['a', 'c']]
○ Boolean Masking: s[s > 0.5]
○ loc (label-location based indexer): s.loc['a'], s.loc['a':'c']
○ iloc (integer-location based indexer): s.iloc[0], s.iloc[0:2]
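A small sketch tying these indexers together (assuming a Series with an explicit string index):
Python
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
s['a']             # 0.25 (explicit label)
s['a':'c']         # includes 'c' (label slicing is inclusive of the end)
s.iloc[0:2]        # positions 0 and 1 (integer slicing excludes the end)
s.loc[['a', 'c']]  # fancy indexing by label
s[s > 0.5]         # boolean masking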
● DataFrame:
○ Definition: A two-dimensional labeled data structure (rows and columns), like a table or spreadsheet, where each column is a Series.
● Creation:
○ From a dictionary of lists/Series: pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})
○ From CSV, Excel, etc.: pd.read_csv('file.csv'),
pd.read_excel('file.xlsx')
● Attributes: df.values (NumPy array), df.index, df.columns.
● Indexing and Selection:
○ Column Selection: df['col1'] (returns Series), df[['col1', 'col2']]
(returns DataFrame).
○ Row Selection:
■ df.iloc[0] (first row by integer position)
■ df.loc['row_label'] (row by label)
■ Slicing rows: df[1:3] (by integer position, iloc preferred)
■ df.loc['label1':'label3'] (by label, loc preferred)
○ Combined Selection (loc, iloc):
■ df.loc[row_label, column_label]
■ df.iloc[row_pos, column_pos]
■ df.loc[df['col1'] > 1, ['col2']] (Boolean indexing rows, label
indexing columns)
● Adding/Modifying Columns:
Python
df['new_col'] = df['col1'] * 2
df['another_col'] = ['X', 'Y', 'Z']
● Dropping Columns/Rows:
○ df.drop('column_name', axis=1, inplace=True) (inplace modifies
DataFrame)
○ df.drop(['row_label1', 'row_label2'], axis=0)
● rename(): Renaming columns or index labels.
● set_index() / reset_index(): Changing the DataFrame index.
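A brief sketch of these manipulation methods (hypothetical DataFrame and column names):
Python
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
df = df.rename(columns={'col2': 'renamed'})  # rename a column
df = df.drop('renamed', axis=1)              # drop a column (returns a new DataFrame)
df = df.set_index('col1')                    # use 'col1' values as the row index
df = df.reset_index()                        # move the index back into a regular column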
● Representation: Pandas uses NaN (Not a Number) for floating-point missing values and
None for object-type missing values (Python None). NumPy's NaN is used internally.
● Detection:
○ df.isnull(): Returns a boolean DataFrame of the same shape, True where
null.
○ df.notnull(): Opposite of isnull().
○ df.isna() and df.notna() are aliases for isnull() and notnull().
● Counting Nulls:
○ df.isnull().sum(): Counts nulls per column.
○ df.isnull().sum().sum(): Total nulls in DataFrame.
● Handling Missing Data:
○ dropna() (Dropping):
■ df.dropna(): Drops any row containing any NaN value.
■ df.dropna(axis='columns') or df.dropna(axis=1): Drops any
column containing any NaN value.
■ df.dropna(how='all'): Drops rows/columns only if all values are
NaN.
■ df.dropna(thresh=N): Requires at least N non-null values for a
row/column to be kept.
○ fillna() (Filling):
■ df.fillna(0): Fill all NaN values with 0.
■ df.fillna(method='ffill') or df.fillna(method='pad'):
Forward-fill (propagate last valid observation forward).
■ df.fillna(method='bfill') or
df.fillna(method='backfill'): Backward-fill (propagate next valid
observation backward).
■ df.fillna(df.mean()): Fill NaN in each column with that column's
mean.
■ df['column'].fillna(df['column'].median()): Fill specific
column's NaN with its median.
■ df.fillna(value=dictionary_of_column_values): Fill with
different values per column.
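A compact sketch of detecting and handling missing values (hypothetical data):
Python
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 5.0, 6.0]})
df.isnull().sum()     # nulls per column: A -> 1, B -> 1
df.dropna()           # keeps only the last row (the only one with no NaN)
df.fillna(df.mean())  # fill each column's NaN with that column's mean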
● Concept: Allows you to have multiple index levels on an axis (row or column), enabling
data to be stored and manipulated in higher dimensional space.
● Creation:
○ Using pd.MultiIndex.from_arrays, from_product, or from_tuples.
● Indexing and Slicing MultiIndex:
○ Accessing by Outer Level: multi_index_series['California']
○ Accessing by Inner Level: Requires slicing with loc or iloc.
○ xs() (Cross-section): For selecting data at a particular level.
○ Partial Indexing:
Python
multi_index_series['California', 2000] # Single element
multi_index_series.loc['California': 'New York'] # Slice on outer
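A minimal sketch of building and indexing a MultiIndex Series (hypothetical state/year data):
Python
import pandas as pd
index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('New York', 2000), ('New York', 2010)])
pop = pd.Series([33.9, 37.3, 19.0, 19.4], index=index)  # populations in millions (illustrative)
pop['California']        # all rows for the outer level
pop['California', 2000]  # a single element
pop.xs(2000, level=1)    # cross-section: the value for every state in 2000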
5. Combining Datasets
● pd.concat() (Concatenation):
○ Joins DataFrame or Series objects along a particular axis (rows or columns).
○ pd.concat([df1, df2]): By default, stacks vertically (rows). Aligns columns.
○ pd.concat([df1, df2], axis=1): Stacks horizontally (columns). Aligns
rows.
○ join argument: 'outer' (default) keeps the union of the other-axis labels; 'inner' keeps only their intersection.
○ ignore_index=True: Resets the resulting index.
○ keys argument: Creates a hierarchical index to identify the origin of each chunk.
● pd.merge() (Merging/Joining):
○ Combines DataFrames based on common columns or indices (like SQL JOINs).
○ on: Column name(s) to join on.
○ left_on, right_on: Column names if different in left/right DataFrames.
○ left_index, right_index: Join on index.
○ how argument (Join Types):
■ 'inner' (default): Intersection of keys.
■ 'outer': Union of keys (includes all rows, fills with NaN).
■ 'left': Includes all rows from left DataFrame, matching from right.
■ 'right': Includes all rows from right DataFrame, matching from left.
● df.join(): A convenience method for joining DataFrames on their indexes (or a key
column to index). Simpler for index-based joins than merge.
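A small sketch of concat and merge (hypothetical employee data):
Python
import pandas as pd
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'], 'group': ['Accounting', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'], 'hire_date': [2004, 2008, 2012]})
pd.merge(df1, df2, on='employee')         # inner join on the shared 'employee' column
pd.concat([df1, df1], ignore_index=True)  # stack rows and renumber the index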
● GroupBy (split-apply-combine): df.groupby('key') splits the data into groups by the key column, applies a function to each group, and combines the results.
○ Common Applications:
Python
df.groupby('category_col')['value_col'].mean() # Mean of 'value_col' for each 'category_col'
df.groupby('category_col').size() # Count of items in each category
df.groupby('category_col').describe() # Full statistics for each category
○ transform(): Return a Series/DataFrame with the same shape as the original, where values are group-wise transformations.
Python
df['normalized_val'] = df.groupby('category_col')['value_col'].transform(
    lambda x: (x - x.mean()) / x.std())
○ apply(): Apply an arbitrary function to each group. Very flexible but can be
slower.
1. Introduction to Matplotlib
● Why Matplotlib?
○ Provides a highly flexible and powerful way to create a wide variety of static,
animated, and interactive plots.
○ Can produce publication-quality figures in a variety of hardcopy formats and
interactive environments.
○ Often integrated with NumPy and Pandas.
● Key Concepts:
○ Figures: The top-level container for all plot elements. You can have multiple
figures.
○ Axes: The actual plotting area where the data is drawn. A figure can contain
multiple axes (subplots). Each axes has x-axis, y-axis, titles, labels, etc.
○ pyplot module: matplotlib.pyplot is a collection of functions that make
Matplotlib work like MATLAB. It's the most common way to use Matplotlib.
● Installation: Usually comes with Anaconda. Otherwise: pip install matplotlib
● Import Convention: import matplotlib.pyplot as plt
● Displaying Plots in Jupyter:
○ %matplotlib inline: Displays plots statically embedded within the notebook
output.
○ %matplotlib notebook: Displays interactive plots within the notebook (zoom,
pan).
● MATLAB-style Interface: Stateful plt.* functions (plt.plot(), plt.title(), plt.xlabel()) act on the "current" figure and axes; convenient for quick, simple plots.
● Object-Oriented Interface (Recommended for complex plots):
○ Explicitly create Figure and Axes objects.
○ Allows for more fine-grained control.
Python
x = np.linspace(0, 10, 100)
fig = plt.figure() # Create a new figure
ax = fig.add_subplot(1, 1, 1) # Create axes (1 row, 1 col, 1st subplot)
ax.plot(x, np.cos(x))
ax.set_title("Cosine Wave")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
plt.show()
● Saving Plots: plt.savefig('my_plot.png') or fig.savefig('my_plot.pdf')
4. Scatter Plots
● plt.scatter(): Draws points whose color and size can vary point by point.
Python
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
plt.colorbar() # show the color scale
● vs. plt.plot(x, y, 'o'):
○ plt.plot() is optimized for plotting points along a line (even if markers are
used). It's faster for large datasets of uniform properties.
○ plt.scatter() offers more flexibility in controlling individual point properties
(color, size) from data arrays.
5. Visualizing Errors
● Error Bars (plt.errorbar()): Display the uncertainty of each measurement as a bar around the plotted point.
Python
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='o', color='black', ecolor='lightgray', elinewidth=3, capsize=0);
6. Histograms and Density Plots
● Histograms (plt.hist()):
○ Shows the distribution of a single numerical variable by dividing data into bins
and counting observations in each bin.
○ bins: Number of bins or sequence of bin edges.
○ density=True: Normalize to form a probability density (area under histogram
sums to 1).
○ alpha: Transparency.
○ histtype: 'bar' (default), 'barstacked', 'step', 'stepfilled'.
Python
data = np.random.randn(1000)
plt.hist(data, bins=30, alpha=0.5, histtype='stepfilled', color='steelblue', edgecolor='none');
● Two-Dimensional Histograms (plt.hist2d(), plt.hexbin()):
○ For visualizing the joint distribution of two variables.
○ plt.hist2d(): Rectangular bins, color intensity indicates density.
○ plt.hexbin(): Hexagonal bins, often more visually appealing.
● Kernel Density Estimation (KDE):
○ A non-parametric way to estimate the probability density function of a random
variable. Smoothed version of a histogram.
○ Often done using seaborn.kdeplot or scipy.stats.gaussian_kde.
○ plt.plot(density_x, density_y) after calculating KDE.
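A minimal KDE sketch using scipy.stats.gaussian_kde (as mentioned above; the data is illustrative):
Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
data = np.random.randn(1000)
kde = gaussian_kde(data)
xs = np.linspace(-4, 4, 200)
plt.plot(xs, kde(xs))                             # smooth estimate of the density
plt.hist(data, bins=30, density=True, alpha=0.3)  # histogram for comparison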
7. Customizing Plots
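A short sketch of the customizations typically covered under this topic (axis limits, labels, line styles, legends; all values are illustrative):
Python
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), linestyle='--', color='green', label='sin(x)')
plt.plot(x, np.cos(x), linestyle='-', color='blue', label='cos(x)')
plt.xlim(0, 10)
plt.ylim(-1.5, 1.5)
plt.title("Customized Plot")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(loc='upper right')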
8. Multiple Subplots
● plt.subplots(): A convenient helper function that creates a figure and a grid of subplots in a single call.
Python
fig, axes = plt.subplots(2, 2, figsize=(8, 6)) # 2 rows, 2 columns
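# Illustrative continuation: axes is a 2x2 NumPy array of Axes objects
axes[0, 0].plot(np.random.randn(50))
axes[1, 1].hist(np.random.randn(100), bins=20)
fig.tight_layout()  # avoid overlapping labels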
● plt.subplot(): MATLAB-style interface for creating subplots (less flexible for
complex layouts).
9. Text Annotation
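A minimal annotation sketch (ax.text and ax.annotate are the usual tools; the coordinates are illustrative):
Python
x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.text(5, 0.5, "local text", fontsize=10)       # place text at data coordinates
ax.annotate("peak", xy=(np.pi / 2, 1), xytext=(4, 1.3),
            arrowprops=dict(arrowstyle="->"))    # arrow from the label to the point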
1. Introduction to Scikit-Learn
● Why Scikit-Learn?
1. Provides a consistent API for a wide range of machine learning algorithms.
2. Built on NumPy, SciPy, and Matplotlib.
3. Emphasis on ease of use, robust implementation, and good documentation.
4. Does not handle data loading, data manipulation (use Pandas), or highly
customized deep learning (use TensorFlow/PyTorch).
● Key Principles:
1. Estimators: All objects in scikit-learn that learn from data are called estimators.
They typically have fit() and predict() methods.
2. Consistency: All estimators share a common API.
3. Inspection: All parameters and learned attributes are public attributes.
4. Sensible Defaults: Most parameters have reasonable default values.
● Installation: Usually comes with Anaconda. Otherwise: pip install scikit-learn
● Common Workflow:
1. Choose a Model Class: Import the appropriate estimator (e.g., from
sklearn.linear_model import LinearRegression).
2. Choose Model Hyperparameters: Instantiate the model with desired settings
(e.g., model = LinearRegression(fit_intercept=True)).
3. Arrange Data: Prepare data into feature matrix X (2D NumPy array or Pandas
DataFrame) and target vector y (1D NumPy array or Pandas Series).
4. Fit the Model: Use the fit() method to train the model on your data
(model.fit(X, y)).
5. Predict/Transform: Use the predict() method to make predictions on new
data (model.predict(X_new)) or transform() for feature transformation.
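A sketch of the five-step workflow using a simple linear regression (hypothetical data):
Python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. choose a model class
model = LinearRegression(fit_intercept=True)        # 2. choose hyperparameters
X = np.arange(10).reshape(-1, 1)                     # 3. arrange data: X is 2D, y is 1D
y = 3 * X.ravel() + np.random.randn(10)
model.fit(X, y)                                      # 4. fit the model
model.predict(np.array([[10], [11]]))                # 5. predict on new data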
2. Data Representation
Example:
Python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data # Feature matrix: sepal length, sepal width, petal length, petal width
y = iris.target # Target vector: species (0, 1, 2)
X.shape # (150, 4) -> 150 samples, 4 features
y.shape # (150,) -> 150 target values
● Feature Engineering: The process of creating new features from existing raw data to
improve model performance.
3. Hyperparameters and Model Validation
● Hyperparameters:
○ Parameters of the learning algorithm that are set prior to training (e.g.,
n_estimators in a Random Forest, alpha in Ridge regression, C in SVM).
○ They are not learned from the data during training.
○ Crucial for controlling model complexity and preventing overfitting/underfitting.
● Model Validation:
○ The process of evaluating how well a trained model generalizes to unseen data.
○ Why not just use training error? Training error is usually overly optimistic; a
complex model might memorize the training data but perform poorly on new data
(overfitting).
○ Common Validation Approaches:
■ Train/Test Split: Divide data into training set (e.g., 70-80%) and test set
(remaining). Train on training, evaluate on test.
■ from sklearn.model_selection import train_test_split
■ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
■ Cross-Validation: A more robust method for estimating model
performance and selecting hyperparameters. Divides the data into
multiple "folds."
■ K-Fold Cross-Validation: Data is split into K equal-sized folds. In
K iterations, one fold is used for testing, and the remaining K-1
folds for training. The performance is averaged across K
iterations.
■ Stratified K-Fold: Ensures that each fold has a similar proportion
of class labels as the original dataset (important for imbalanced
datasets).
■ from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
■ scores = cross_val_score(model, X, y, cv=5)
● Hyperparameter Tuning:
○ Grid Search: Systematically explores a predefined set of hyperparameter values
for a given model.
■ from sklearn.model_selection import GridSearchCV
■ from sklearn.neighbors import KNeighborsClassifier
■ param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
■ grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
■ grid.fit(X_train, y_train)
■ grid.best_params_, grid.best_score_, grid.best_estimator_
○ Randomized Search: Randomly samples a fixed number of hyperparameter
combinations from a specified distribution. Often more efficient than Grid Search
for large search spaces.
■ from sklearn.model_selection import RandomizedSearchCV
4. Learning Curves
● Concept: Plots that show how a model's performance (e.g., score) changes as the
amount of training data increases.
● Interpretation:
○ High Bias (Underfitting): Both training and validation scores converge to a low
value. The model is too simple and cannot learn the underlying patterns. Getting
more data won't help much.
○ High Variance (Overfitting): Training score is high, but validation score is low,
and there's a significant gap between them. The model is too complex and has
memorized noise in the training data. Getting more data might help.
○ Good Fit: Training and validation scores converge to a high value, with a small
gap.
● Usage: Helps diagnose whether adding more training data, reducing model complexity,
or increasing model complexity would be beneficial.
● from sklearn.model_selection import learning_curve
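A hedged sketch of generating learning-curve data (the plotting step is omitted; the iris data is just a convenient built-in example):
Python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))
train_scores.mean(axis=1), valid_scores.mean(axis=1)  # average score per training-set size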
5. Correlation
● Definition: A statistical measure that expresses the extent to which two variables are
linearly related.
● Pearson Correlation Coefficient (r):
○ Ranges from -1 to +1.
○ +1: Perfect positive linear relationship.
○ -1: Perfect negative linear relationship.
○ 0: No linear relationship.
● Usage in Data Science:
○ Feature Selection: Identify highly correlated features (multicollinearity can be a
problem for some models like linear regression).
○ Exploratory Data Analysis (EDA): Understand relationships between variables.
Calculation (Pandas):
Python
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [2,4,5,4,5], 'C': [5,4,3,2,1]})
df.corr() # Returns a correlation matrix
df['A'].corr(df['B']) # Correlation between two specific columns
● Correlation vs. Causation: Correlation does not imply causation.
6. Linear Regression
● Goal: To model the relationship between a dependent variable (target y) and one or
more independent variables (features X) by fitting a linear equation to observed data.
● Assumptions of Linear Regression: Linearity, independence of errors,
homoscedasticity, normality of residuals.
● Model: y = β₀ + β₁x₁ + ε
○ y: Dependent variable.
○ x₁: Independent variable (single feature).
○ β₀: Y-intercept.
○ β₁: Coefficient (slope).
○ ε: Error term.
● Minimizing Residual Sum of Squares (RSS): The model finds the β₀ and β₁ that minimize the sum of the squared differences between the observed values and the values predicted by the linear model.
Implementation (Scikit-Learn):
Python
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])   # feature matrix: shape (n_samples, n_features)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # target vector (illustrative values)
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
y_pred = model.predict(X)
● Evaluation Metrics (for Regression):
○ Mean Squared Error (MSE): Average of the squared differences between
predicted and actual values.
○ Root Mean Squared Error (RMSE): Square root of MSE (in the same units as
y).
○ Mean Absolute Error (MAE): Average of the absolute differences. Less
sensitive to outliers than MSE.
○ R-squared (R²): Proportion of the variance in the dependent variable that is predictable from the independent variables. Higher is better (max 1.0).
■ model.score(X, y) gives R².
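A quick sketch computing these metrics with sklearn.metrics (continuing the fitted model above):
Python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)  # same value as model.score(X, y)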
6.3. Regularization
● Problem: In complex linear models (e.g., with many features or high-degree polynomial
basis functions), the model can overfit the training data by assigning very large
coefficients to specific features.
● Solution: Regularization adds a penalty term to the loss function (the function being
minimized during training) that discourages large coefficients. This effectively shrinks
coefficients towards zero.
● Types of Regularization:
○ Ridge Regression (L2): Adds a penalty proportional to the sum of squared coefficients; strength controlled by alpha.
○ Lasso Regression (L1): Adds a penalty proportional to the sum of absolute coefficient values; can shrink some coefficients exactly to zero.
● Benefits:
○ Reduces overfitting.
○ Improves generalization performance on unseen data.
○ Lasso can perform automatic feature selection.
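A hedged sketch of Ridge and Lasso in scikit-learn (assuming a feature matrix X and target y as in the regression example above):
Python
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=1.0)  # alpha controls the regularization strength
ridge.fit(X, y)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
lasso.coef_               # some coefficients may be driven exactly to zero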
This is a comprehensive set of notes covering the topics outlined in your syllabus. To truly
master these concepts and tools, remember to:
● Practice extensively: The best way to learn is by coding. Use Jupyter Notebooks to
follow along with examples and experiment.
● Refer to documentation: The official documentation for NumPy, Pandas, Matplotlib,
and Scikit-learn is excellent.
● Work on projects: Apply these skills to real-world datasets.
● Read relevant books/tutorials: "Python for Data Analysis" by Wes McKinney (Pandas
creator) and "Python Data Science Handbook" by Jake VanderPlas are highly
recommended resources.