
Unit – I (15 Hrs)

Shell or Notebook: Launching - Launching the Jupyter Notebook* - Help and Documentation
in IPython - Exploring Modules with Tab Completion - Keyboard Shortcuts in the IPython Shell -
IPython Magic Commands.

Unit – II (15 Hrs)


NumPy: Introduction to NumPy – The Basics of NumPy Arrays* – Computation on NumPy
Arrays – Aggregations: Min, Max, and Everything in Between – Computation on Arrays.
Unit – III (15 Hrs)
Pandas: Introduction to Pandas – Data Manipulation with Pandas – Operating on Null Values*,
Hierarchical Indexing – Combining Datasets – Aggregation and Grouping.
Unit – IV (15 Hrs)
Matplotlib: Introduction to Matplotlib – Visualization with Matplotlib – Simple Line Plots* –
Scatter Plots – Visualizing Errors – Histograms, Binnings and Density – Customizing Plots –
Multiple Subplots – Text Annotation.
Unit – V (15 Hrs)
Scikit-Learn: Introduction to Scikit-Learn: Data Representation – Hyperparameters &
Validation: Selecting the Best Model – Learning Curves – Correlation – Linear Regression:
Simple Linear Regression – Basis Function Regression – Regularization.
This syllabus covers the core Python libraries for data science: IPython/Jupyter, NumPy,
Pandas, Matplotlib, and Scikit-Learn. It is an excellent curriculum for anyone getting into
data analysis and machine learning.

Full notes equivalent to a textbook are beyond the scope of a single document, but the
sections below give a detailed and extensive explanation of each topic from the syllabus,
providing a strong foundation for further study.

Let's break down each unit:

Unit – I: Shell or Notebook (15 Hrs)


This unit focuses on the interactive computing environment, primarily IPython and Jupyter
Notebooks, which are indispensable tools for data science in Python.

1. Launching Jupyter Notebook

● What is Jupyter Notebook?


○ An open-source web application that allows you to create and share documents
containing live code, equations, visualizations, and narrative text.
○ It's widely used for data cleaning and transformation, numerical simulation,
statistical modeling, data visualization, and machine learning.
○ The "Jupyter" name comes from the core languages it supports: Julia, Python,
and R.
● Installation:
○ Most commonly installed as part of the Anaconda distribution (highly
recommended for data science). Anaconda includes Python, Jupyter, NumPy,
Pandas, Matplotlib, SciPy, and many other essential libraries.
○ Alternatively, you can install it via pip: pip install jupyter
● Launching from Command Line:
○ Open your terminal or command prompt.
○ Navigate to the directory where you want to save your notebooks (e.g., cd
Documents/MyProjects).
○ Type jupyter notebook and press Enter.
○ This will typically open a new tab in your default web browser, showing the
Jupyter Notebook dashboard, which is a file explorer.
● Dashboard Features:
○ Files: Lists files and directories, allows you to open existing notebooks or create
new ones.
○ Running: Shows all currently running notebooks and terminals.
○ Clusters: (Less common now, related to older IPython parallel computing).
● Creating a New Notebook: From the dashboard, click "New" -> "Python 3" (or your
desired kernel). This opens a new, empty notebook.
● Notebook Interface:
○ Cells: The fundamental building blocks. Can be Code cells (for Python code) or
Markdown cells (for text, headings, images, equations).
○ Toolbar: Icons for saving, adding cells, cutting/copying/pasting cells, running
cells, interrupting kernel, restarting kernel.
○ Kernel: The computational engine that executes the code in your notebook. The
"Python 3" kernel means your code will be executed by a Python 3 interpreter.
○ Modes:
■ Command Mode (Blue border): For manipulating cells (add, delete,
move). Press Esc to enter.
■ Edit Mode (Green border): For typing inside a cell. Press Enter to
enter.
● Running Cells:
○ Shift + Enter: Run the current cell and select the next cell below.
○ Ctrl + Enter: Run the current cell in place.
○ Alt + Enter: Run the current cell and insert a new cell below.

2. Help and Documentation in IPython

● IPython: An enhanced interactive Python shell that Jupyter Notebooks run on top of. It
provides many features that make interactive computing more powerful.
● ? (Introspection):
○ Append ? to a variable, function, method, or object to get quick access to its
documentation (docstring).

Example:
Python
import numpy as np
np.array?


○ This will open a pager at the bottom of the screen with information like signature,
docstring, type, and file location.
● ?? (Source Code):
○ Append ?? to a function or method to view its full source code (if available, not
for compiled C extensions).

Example:
Python
def my_function(x):
    """This is a docstring."""
    return x * 2

my_function??


● Tab Completion:
○ Object Methods/Attributes: Type object_name. and press Tab to see
available methods and attributes.
○ Module Contents: Type module_name. and press Tab to see functions,
classes, and variables within that module.
○ File Path Completion: In string literals, press Tab to complete file paths.
○ Function Signature: After typing function_name(, pressing Shift + Tab (once, twice, or
three times) brings up parameter information and the docstring (especially useful in
notebooks).
● help() function:
○ Standard Python built-in function. help(object_name) provides a more
verbose, pager-based help.
○ Example: help(np.mean)

3. Exploring Modules with Tab Completion

● As mentioned above, tab completion is crucial for exploring libraries.

Example:
Python
import pandas as pd
pd.DataFrame.  # Press Tab here to see all methods like .head, .describe, .iloc, etc.

data = {'A': [1, 2], 'B': [3, 4]}
df = pd.DataFrame(data)
df.loc[0, 'A']  # Inside the brackets, press Tab to complete index labels and column names.


● This significantly reduces the need to constantly look up documentation manually,
making coding faster and more efficient.

4. Keyboard Shortcuts in the IPython Shell (and Jupyter Notebook)

● In Command Mode (Esc to activate, blue cell border):


○ A: Insert cell Above.
○ B: Insert cell Below.
○ DD: Delete selected cell(s).
○ M: Change cell to Markdown.
○ Y: Change cell to Code.
○ Z: Undo last cell deletion.
○ X: Cut selected cell(s).
○ C: Copy selected cell(s).
○ V: Paste cell(s) below.
○ Shift + V: Paste cell above.
○ L: Toggle line numbers in the selected cell.
○ Up/Down arrows: Navigate between cells.
○ Shift + Up/Down: Select multiple cells.
○ Enter: Enter Edit Mode.
● In Edit Mode (Enter to activate, green cell border):
○ Ctrl + Enter: Run current cell.
○ Shift + Enter: Run current cell and select next.
○ Alt + Enter: Run current cell and insert new cell below.
○ Tab: Code completion.
○ Shift + Tab: Tooltip/docstring for function arguments (repeat for more detail).
○ Ctrl + /: Comment/uncomment selected lines.

5. IPython Magic Commands

● Concept: Special commands that start with % (line magics) or %% (cell magics). They
extend the functionality of IPython and Jupyter, offering convenient shortcuts for
common tasks.
● Line Magics (%): Apply to a single line.
○ %run script.py: Run a Python script.

○ %timeit expression: Time the execution of a single line of Python code (runs it multiple
times for accuracy).
Python
%timeit [i**2 for i in range(1000)]


○ %debug: Enter the interactive debugger after an exception.
○ %who / %whos: List variables defined in the current namespace (with details).
○ %lsmagic: List all available magic commands.
○ %pwd: Print current working directory.
○ %cd path/to/directory: Change current working directory.
○ %env: List environment variables.
○ %matplotlib inline (or %matplotlib notebook): Render Matplotlib plots
directly within the notebook output. Essential for visualization.
○ %load_ext autoreload: Load the autoreload extension.
○ %autoreload 2: Automatically reload modules before executing code (useful
during development).
● Cell Magics (%%): Apply to the entire cell. Must be the first line of the cell.

○ %%timeit: Time the execution of the entire cell.

Python
%%timeit
L = []
for n in range(1000):
    L.append(n**2)


○ %%time: Report the wall clock time and CPU time for the cell.
○ %%bash / %%sh: Execute the cell content as a bash/shell command.
○ %%html: Render the cell content as HTML.
○ %%writefile filename.py: Write the cell content to a file.
○ %%file filename.txt: (Deprecated, %%writefile is preferred).
○ %%latex: Render the cell content as LaTeX.
Unit – II: NumPy (15 Hrs)
NumPy (Numerical Python) is the fundamental package for numerical computation in Python,
providing powerful N-dimensional array objects and tools for integrating C/C++ and Fortran
code.

1. Introduction to NumPy

● Why NumPy?
○ Python lists are general-purpose, but for numerical operations, they are slow and
inefficient for large datasets.
○ NumPy arrays (ndarray) are designed for efficient numerical operations on
large amounts of data.
○ They are homogeneous (all elements are of the same data type), which allows for highly
optimized, vectorized operations.
○ Under the hood, NumPy operations are often implemented in C or Fortran,
making them much faster than pure Python loops.
● Key Features:
○ ndarray: A fast and efficient multi-dimensional array object.
○ Mathematical functions for operating on arrays (linear algebra, Fourier
transforms, random number generation).
○ Tools for integrating C/C++ and Fortran code.
● Installation: Usually comes with Anaconda. Otherwise: pip install numpy
● Import Convention: import numpy as np

2. The Basics of NumPy Arrays (ndarray)

● Creating Arrays:

From Python Lists:


Python
import numpy as np
arr1 = np.array([1, 2, 3, 4]) # 1D array (vector)
arr2 = np.array([[1, 2], [3, 4]]) # 2D array (matrix)

Fixed-Size Arrays:
Python
np.zeros(5, dtype=int) # array([0, 0, 0, 0, 0])
np.ones((3, 5), dtype=float) # 3x5 array of ones
np.full((2, 2), 7) # 2x2 array with all 7s
np.empty(3) # Uninitialized values


Sequences:
Python
np.arange(0, 10, 2)  # Like range(), but returns an array: array([0, 2, 4, 6, 8])
np.linspace(0, 1, 5)  # 5 evenly spaced numbers between 0 and 1: array([0., 0.25, 0.5, 0.75, 1.])

Random Arrays:
Python
np.random.rand(3, 3) # Uniform distribution [0, 1)
np.random.randn(3, 3) # Standard normal distribution
np.random.randint(0, 10, size=(3, 3)) # Random integers

Identity Matrix:
Python
np.eye(3) # 3x3 identity matrix


● Array Attributes:
○ ndim: Number of dimensions.
○ shape: Tuple indicating the size of each dimension.
○ size: Total number of elements.
○ dtype: Data type of the elements (e.g., int64, float64).
○ itemsize: Size of each element in bytes.
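
Example (a minimal sketch of these attributes on a small, made-up array):
Python
import numpy as np
arr = np.random.randint(10, size=(3, 4))  # 2D array of random integers
arr.ndim      # 2
arr.shape     # (3, 4)
arr.size      # 12
arr.dtype     # dtype('int64') on most platforms
arr.itemsize  # 8 (bytes per element for int64)
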
● Array Indexing and Slicing:

1D Arrays (like Python lists):


Python
arr = np.array([0, 1, 2, 3, 4, 5])
arr[0] #0
arr[-1] # 5
arr[1:4] # array([1, 2, 3])
arr[:3] # array([0, 1, 2])

Multi-dimensional Arrays:
Python
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[0, 0] # 1
arr2d[2, 1] # 8
arr2d[:2, :2] # Slice rows and columns
# array([[1, 2],
# [4, 5]])
arr2d[1, :] # Second row: array([4, 5, 6])
arr2d[:, 0] # First column: array([1, 4, 7])

Fancy Indexing: Using arrays of integers or booleans to select arbitrary subsets of data.
Python
arr = np.array([10, 20, 30, 40, 50])
indices = [0, 2, 4]
arr[indices] # array([10, 30, 50])

mask = arr > 30


arr[mask] # array([40, 50])


● Reshaping Arrays:

reshape(): Returns a new array with a different shape, without changing the data.
Python
arr = np.arange(1, 10)
arr.reshape((3, 3))
# array([[1, 2, 3],
# [4, 5, 6],
# [7, 8, 9]])

-1 in reshape: NumPy can infer one dimension.


Python
arr.reshape((3, -1)) # same as (3, 3)


○ ravel() / flatten(): Convert multi-dimensional array to 1D. flatten()
returns a copy, ravel() returns a view (if possible).
● Concatenation and Splitting:

np.concatenate(): Join arrays along an existing axis.


Python
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
np.concatenate([x, y]) # array([1, 2, 3, 4, 5, 6])

grid = np.array([[1, 2], [3, 4]])


np.concatenate([grid, grid]) # along axis=0 (rows)
# array([[1, 2],
# [3, 4],
# [1, 2],
# [3, 4]])
np.concatenate([grid, grid], axis=1) # along axis=1 (columns)
# array([[1, 2, 1, 2],
# [3, 4, 3, 4]])


○ np.vstack(): Stack arrays vertically (row-wise).
○ np.hstack(): Stack arrays horizontally (column-wise).
○ np.dstack(): Stack arrays depth-wise (3D).
○ np.split(), np.vsplit(), np.hsplit(), np.dsplit(): Split arrays into
multiple sub-arrays.

3. Computation on NumPy Arrays

● Universal Functions (UFuncs):


○ NumPy provides "vectorized" operations via ufuncs, which perform element-wise
operations on arrays. These are much faster than explicit Python loops.

Arithmetic Operations: +, -, *, /, // (floor division), %, ** (exponentiation).


Python
arr = np.array([1, 2, 3])
arr + 5 # array([6, 7, 8])
arr * arr # array([1, 4, 9])


○ Comparison Operators: >, <, ==, !=, >=, <=. Return boolean arrays.
○ Trigonometric Functions: np.sin(), np.cos(), np.tan().
○ Exponentials and Logarithms: np.exp(), np.log(), np.log2(),
np.log10().
○ Other UFuncs: np.abs(), np.sqrt(), np.ceil(), np.floor(),
np.round().
● Broadcasting:
○ A powerful mechanism that allows NumPy to perform operations on arrays of
different shapes.
○ It effectively "stretches" the smaller array across the larger array so that they
have compatible shapes.
○ Rules:
1. If the arrays have different numbers of dimensions, prepend 1s to the
shape of the smaller array until both shapes have the same length.
2. Two dimensions are compatible when they are equal, or one of them is 1.
3. If the dimensions are incompatible, an error is raised.

Example:
Python
a = np.array([0, 10, 20, 30]) # shape (4,)
b = np.array([0, 1, 2]) # shape (3,)
# Cannot directly add.
# But:
a = np.arange(3)[:, np.newaxis] # shape (3, 1)
b = np.arange(3) # shape (3,)
a+b
# array([[0, 1, 2],
# [1, 2, 3],
# [2, 3, 4]])

○ Here a (shape (3, 1)) is stretched across the columns and b (shape (3,)) is stretched down the rows, so both become (3, 3) before the addition.

4. Aggregations: Min, Max, and Everything In Between

● Definition: Operations that collapse an array (or parts of it) into a single value.
● Common Aggregations:
○ np.sum(), np.min(), np.max(), np.mean(), np.median(), np.std()
(standard deviation), np.var() (variance).
● Axis Argument: Most aggregation functions accept an axis argument to specify along
which dimension the aggregation should occur.
○ axis=0: Collapse down the rows, aggregating each column (one result per column).
○ axis=1: Collapse across the columns, aggregating each row (one result per row).
○ If axis is not specified, the aggregation is performed over the entire array.

Python
M = np.random.randint(0, 10, (3, 4))
# array([[8, 2, 5, 0],
# [7, 6, 8, 8],
# [0, 2, 4, 7]])

M.sum() # Sum of all elements (scalar)


M.min(axis=0) # Min along columns (returns array with 4 elements)
# array([0, 2, 4, 0])
M.max(axis=1) # Max along rows (returns array with 3 elements)
# array([8, 8, 7])


● np.nansum(), np.nanmean(), etc.: Versions of aggregation functions that ignore NaN
(Not a Number) values.
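
A short sketch contrasting the NaN-aware versions with the plain aggregations (the array values are made up for illustration):
Python
import numpy as np
vals = np.array([1.0, np.nan, 3.0, 4.0])
np.sum(vals)      # nan, because any NaN propagates
np.nansum(vals)   # 8.0, NaN entries are ignored
np.nanmean(vals)  # 2.666..., mean of the non-NaN values
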

5. Computation on Arrays (More Advanced Topics)

● Sorting Arrays:
○ np.sort(arr): Returns a sorted copy of the array.
○ arr.sort(): Sorts the array in-place.
○ np.argsort(arr): Returns the indices that would sort the array.
Python
x = np.array([2, 1, 4, 3, 5])
np.sort(x) # array([1, 2, 3, 4, 5])
i = np.argsort(x) # array([1, 0, 3, 2, 4])
x[i] # array([1, 2, 3, 4, 5])


○ Sorting along an axis for multi-dimensional arrays.
● Partial Sorts:
○ np.partition(arr, K): Returns a copy in which the element at index K sits in its final
sorted position, with all smaller values to its left and all larger values to its right
(each side in arbitrary order).
○ np.argpartition(): Returns the indices.
● Structured Arrays:
○ Arrays with compound data types, allowing elements to have multiple fields (like
a C struct or database row).

Python
data = np.zeros(3, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data[0] = ('Alice', 25, 55.5)
data['name']  # Access by field name


● Masking and Filtering:
○ Using boolean arrays created by comparison operators to select elements.

Python
x = np.arange(10)
x[x % 2 == 0] # Select even numbers: array([0, 2, 4, 6, 8])


● Linear Algebra:
○ np.dot(a, b) or a @ b (Python 3.5+): Dot product of two arrays (matrix
multiplication).
○ np.linalg.inv(matrix): Inverse of a matrix.
○ np.linalg.det(matrix): Determinant of a matrix.
○ np.linalg.eig(matrix): Eigenvalues and eigenvectors.
● Broadcasting in more complex scenarios (e.g., adding a 1D array to a 2D array,
column-wise).
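
A short sketch of the linear-algebra helpers above plus a column-wise broadcast (the small matrices are made up for illustration):
Python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
b = np.array([5., 6.])

A @ b                    # matrix-vector product: array([17., 39.])
np.linalg.det(A)         # determinant: -2.0
np.linalg.inv(A) @ A     # approximately the 2x2 identity matrix

# Column-wise broadcasting: subtract a per-row offset from every column
offsets = np.array([[10.], [20.]])   # shape (2, 1)
A - offsets                          # array([[ -9.,  -8.], [-17., -16.]])
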

Unit – III: Pandas (15 Hrs)


Pandas is a powerful, flexible, and easy-to-use open-source data analysis and manipulation
library for Python. It builds on NumPy and provides high-performance data structures.
1. Introduction to Pandas

● Why Pandas?
○ NumPy is great for numerical arrays, but lacks labels for rows/columns and
handles heterogeneous data poorly.
○ Pandas introduces two primary data structures: Series (1D labeled array) and
DataFrame (2D labeled table).
○ Makes data cleaning, transformation, analysis, and visualization much easier and
more intuitive.
○ Excellent for handling tabular data, time series data, and heterogeneous data.
● Key Features:
○ DataFrame objects for data manipulation with integrated indexing.
○ Tools for reading and writing data between in-memory data structures and
different formats: CSV, text files, SQL databases, HDF5 format.
○ Intelligent data alignment and integrated handling of missing data.
○ Flexible groupby functionality for performing split-apply-combine operations.
○ High performance merging and joining of datasets.
○ Time series functionality.
● Installation: Usually comes with Anaconda. Otherwise: pip install pandas
● Import Convention: import pandas as pd

2. Data Manipulation with Pandas

2.1. Pandas Series

● Definition: A one-dimensional labeled array capable of holding any data type (integers,
strings, floats, Python objects, etc.).
● Creation:

From List/Array:
Python
s = pd.Series([0.25, 0.5, 0.75, 1.0])

With Custom Index:


Python
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

From Dictionary: Keys become index, values become data.


Python
population = {'California': 38332521, 'Texas': 26448193, ...}
s = pd.Series(population)


● Attributes: s.values (NumPy array of values), s.index (Pandas Index object).
● Indexing and Slicing:
○ Explicit Index: s['a']
○ Implicit (Integer) Index: s[0]
○ Slicing by Explicit Index: s['a':'c'] (inclusive of end)
○ Slicing by Implicit Index: s[0:2] (exclusive of end)
○ Fancy Indexing: s[['a', 'c']]
○ Boolean Masking: s[s > 0.5]
○ loc (label-location based indexer): s.loc['a'], s.loc['a':'c']
○ iloc (integer-location based indexer): s.iloc[0], s.iloc[0:2]
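
A short sketch illustrating the explicit/implicit indexing distinction above (the data is made up):
Python
import pandas as pd
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

s.loc['a':'c']   # label-based slice, inclusive of 'c'
s.iloc[0:2]      # position-based slice, exclusive of position 2
s[s > 0.5]       # boolean masking: keeps 'c' and 'd'
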

2.2. Pandas DataFrame

● Definition: A two-dimensional labeled data structure with columns of potentially different


types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series
objects.
● Creation:

From Dictionary of Series:


Python
data = {'col1': pd.Series([1, 2, 3]), 'col2': pd.Series(['A', 'B', 'C'])}
df = pd.DataFrame(data)

From Dictionary of Lists/Arrays:


Python
data = {'col1': [1, 2, 3], 'col2': ['A', 'B', 'C']}
df = pd.DataFrame(data)

From NumPy Array (with explicit index/columns):


Python
df = pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])


○ From CSV, Excel, etc.: pd.read_csv('file.csv'),
pd.read_excel('file.xlsx')
● Attributes: df.values (NumPy array), df.index, df.columns.
● Indexing and Selection:
○ Column Selection: df['col1'] (returns Series), df[['col1', 'col2']]
(returns DataFrame).
○ Row Selection:
■ df.iloc[0] (first row by integer position)
■ df.loc['row_label'] (row by label)
■ Slicing rows: df[1:3] (by integer position, iloc preferred)
■ df.loc['label1':'label3'] (by label, loc preferred)
○ Combined Selection (loc, iloc):
■ df.loc[row_label, column_label]
■ df.iloc[row_pos, column_pos]
■ df.loc[df['col1'] > 1, ['col2']] (Boolean indexing rows, label
indexing columns)

Adding/Modifying Columns:
Python
df['new_col'] = df['col1'] * 2
df['another_col'] = ['X', 'Y', 'Z']


● Dropping Columns/Rows:
○ df.drop('column_name', axis=1, inplace=True) (inplace modifies
DataFrame)
○ df.drop(['row_label1', 'row_label2'], axis=0)
● rename(): Renaming columns or index labels.
● set_index() / reset_index(): Changing the DataFrame index.

3. Operating on Null Values (Missing Data)

● Representation: Pandas uses NaN (Not a Number) for floating-point missing values and
None for object-type missing values (Python None). NumPy's NaN is used internally.
● Detection:
○ df.isnull(): Returns a boolean DataFrame of the same shape, True where
null.
○ df.notnull(): Opposite of isnull().
○ df.isna() and df.notna() are aliases for isnull() and notnull().
● Counting Nulls:
○ df.isnull().sum(): Counts nulls per column.
○ df.isnull().sum().sum(): Total nulls in DataFrame.
● Handling Missing Data:
○ dropna() (Dropping):
■ df.dropna(): Drops any row containing any NaN value.
■ df.dropna(axis='columns') or df.dropna(axis=1): Drops any
column containing any NaN value.
■ df.dropna(how='all'): Drops rows/columns only if all values are
NaN.
■ df.dropna(thresh=N): Requires at least N non-null values for a
row/column to be kept.
○ fillna() (Filling):
■ df.fillna(0): Fill all NaN values with 0.
■ df.fillna(method='ffill') or df.fillna(method='pad'):
Forward-fill (propagate last valid observation forward).
■ df.fillna(method='bfill') or
df.fillna(method='backfill'): Backward-fill (propagate next valid
observation backward).
■ df.fillna(df.mean()): Fill NaN in each column with that column's
mean.
■ df['column'].fillna(df['column'].median()): Fill specific
column's NaN with its median.
■ df.fillna(value=dictionary_of_column_values): Fill with
different values per column.
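
A minimal sketch of detecting and handling missing values (the DataFrame is a made-up example):
Python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, 6.0]})

df.isnull().sum()     # nulls per column: A -> 1, B -> 1
df.dropna()           # keeps only the last row (the only one with no NaN)
df.fillna(df.mean())  # fills each column's NaN with that column's mean
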

4. Hierarchical Indexing (MultiIndex)

● Concept: Allows you to have multiple index levels on an axis (row or column), enabling
data to be stored and manipulated in higher dimensional space.
● Creation:

From list of tuples:


Python
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
multi_index_series = pd.Series(populations, index=pd.MultiIndex.from_tuples(index))


○ Using pd.MultiIndex.from_arrays, from_product, from_tuples:

From DataFrame columns using set_index():


Python
df.set_index(['State', 'Year'], inplace=True)


● Indexing and Slicing MultiIndex:
○ Accessing by Outer Level: multi_index_series['California']
○ Accessing by Inner Level: Requires slicing with loc or iloc.
○ xs() (Cross-section): For selecting data at a particular level.

Partial Indexing:
Python
multi_index_series['California', 2000] # Single element
multi_index_series.loc['California': 'New York'] # Slice on outer

loc with tuple indexing:


Python
df.loc[('California', 2000):('New York', 2010)]
df.loc[(slice(None), 2000), :] # Select all states for year 2000

● Rearranging MultiIndex:
○ unstack(): Converts a MultiIndex level from rows to columns.
○ stack(): Converts a MultiIndex level from columns to rows.
● Sorting MultiIndex: df.sort_index(level=0) or
df.sort_index(level='State')

5. Combining Datasets

● pd.concat() (Concatenation):
○ Joins DataFrame or Series objects along a particular axis (rows or columns).
○ pd.concat([df1, df2]): By default, stacks vertically (rows). Aligns columns.
○ pd.concat([df1, df2], axis=1): Stacks horizontally (columns). Aligns
rows.
○ join argument: 'inner' (default is 'outer') for intersection of indices.
○ ignore_index=True: Resets the resulting index.
○ keys argument: Creates a hierarchical index to identify the origin of each chunk.
● pd.merge() (Merging/Joining):
○ Combines DataFrames based on common columns or indices (like SQL JOINs).
○ on: Column name(s) to join on.
○ left_on, right_on: Column names if different in left/right DataFrames.
○ left_index, right_index: Join on index.
○ how argument (Join Types):
■ 'inner' (default): Intersection of keys.
■ 'outer': Union of keys (includes all rows, fills with NaN).
■ 'left': Includes all rows from left DataFrame, matching from right.
■ 'right': Includes all rows from right DataFrame, matching from left.
● df.join(): A convenience method for joining DataFrames on their indexes (or a key
column to index). Simpler for index-based joins than merge.
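
A short sketch of pd.merge() with two made-up DataFrames sharing an 'employee' key:
Python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                    'group': ['Accounting', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Sue'],
                    'hire_date': [2004, 2008, 2014]})

pd.merge(df1, df2, on='employee')               # inner join: Bob and Lisa only
pd.merge(df1, df2, on='employee', how='outer')  # outer join: all names, NaN where missing
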

6. Aggregation and Grouping

● Simple Aggregations: df.sum(), df.mean(), df.min(), df.max(), df.count(),


df.median(), df.std(), df.var(), df.describe().
○ By default, these aggregate per column. axis=1 to aggregate per row.
● groupby() (Split-Apply-Combine):
○ A powerful method for grouping rows together based on one or more column
values and then applying an aggregation or transformation.
○ Split: The data is divided into groups based on some criterion.
○ Apply: A function is applied to each group independently.
○ Combine: The results are combined into a new data structure.
○ Syntax: df.groupby('column_name') or df.groupby(['col1',
'col2'])

Common Applications:
Python
df.groupby('category_col')['value_col'].mean() # Mean of 'value_col' for each 'category_col'
df.groupby('category_col').size() # Count of items in each category
df.groupby('category_col').describe() # Full statistics for each category

aggregate() (or agg()): Apply multiple aggregation functions.


Python
df.groupby('category_col').agg(['min', np.median, 'max'])
df.groupby('category_col').agg(mean_val=('value_col', 'mean'), sum_val=('another_col', 'sum'))

filter(): Return a subset of the DataFrame based on a group property.


Python
df.groupby('category_col').filter(lambda x: x['value_col'].mean() > 50)

transform(): Return a Series/DataFrame with the same shape as the original, where values
are group-wise transformations.
Python
df['normalized_val'] = df.groupby('category_col')['value_col'].transform(
    lambda x: (x - x.mean()) / x.std())


○ apply(): Apply an arbitrary function to each group. Very flexible but can be
slower.

Unit – IV: Matplotlib (15 Hrs)


Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations
in Python. It's the foundation for many other plotting libraries.

1. Introduction to Matplotlib

● Why Matplotlib?
○ Provides a highly flexible and powerful way to create a wide variety of static,
animated, and interactive plots.
○ Can produce publication-quality figures in a variety of hardcopy formats and
interactive environments.
○ Often integrated with NumPy and Pandas.
● Key Concepts:
○ Figures: The top-level container for all plot elements. You can have multiple
figures.
○ Axes: The actual plotting area where the data is drawn. A figure can contain
multiple axes (subplots). Each axes has x-axis, y-axis, titles, labels, etc.
○ pyplot module: matplotlib.pyplot is a collection of functions that make
Matplotlib work like MATLAB. It's the most common way to use Matplotlib.
● Installation: Usually comes with Anaconda. Otherwise: pip install matplotlib
● Import Convention: import matplotlib.pyplot as plt
● Displaying Plots in Jupyter:
○ %matplotlib inline: Displays plots statically embedded within the notebook
output.
○ %matplotlib notebook: Displays interactive plots within the notebook (zoom,
pan).

2. Visualization with Matplotlib

Simple Plotting Interface (plt.plot()):


Python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)


plt.plot(x, np.sin(x))
plt.show() # In scripts, this displays the plot


● Object-Oriented Interface (Recommended for complex plots):
○ Explicitly create Figure and Axes objects.
○ Allows for more fine-grained control.

Python
fig = plt.figure() # Create a new figure
ax = fig.add_subplot(1, 1, 1) # Create axes (1 row, 1 col, 1st subplot)
ax.plot(x, np.cos(x))
ax.set_title("Cosine Wave")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
plt.show()


● Saving Plots: plt.savefig('my_plot.png') or fig.savefig('my_plot.pdf')

3. Simple Line Plots

plt.plot(x, y): Plots y versus x as lines and/or markers.


Python
plt.plot([1, 2, 3, 4], [1, 4, 2, 3]) # Simple line plot
plt.plot(x, np.sin(x), '-') # Solid line
plt.plot(x, np.cos(x), '--') # Dashed line
plt.plot(x, np.tan(x), 'o') # Markers only
plt.plot(x, np.sin(x), 'o-') # Line with markers

● Customizing Line Styles and Colors:
○ Color: 'r' (red), 'g' (green), 'b' (blue), 'k' (black), 'c' (cyan), etc.
○ Linestyle: '-', '--', '-.', ':'
○ Marker: 'o', '^', 's', '+', 'x'
○ Combined format string: plt.plot(x, y, 'r--o')
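
A brief sketch combining the color, linestyle, and marker options above, with a legend (the data is arbitrary):
Python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
plt.plot(x, np.sin(x), 'b-', label='sin(x)')    # blue solid line
plt.plot(x, np.cos(x), 'r--o', label='cos(x)')  # red dashed line with circle markers
plt.legend()
plt.show()
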

4. Scatter Plots

● plt.scatter(x, y): Plots individual data points as markers.


○ Generally used when there's no direct connection between points (e.g.,
relationship between two variables).
● Customizing Markers:
○ s (size of markers), c (color of markers, can be an array for varying colors),
alpha (transparency), marker (marker style).

Python
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')


plt.colorbar(); # show color scale


● vs. plt.plot(x, y, 'o'):
○ plt.plot() is optimized for plotting points along a line (even if markers are
used). It's faster for large datasets of uniform properties.
○ plt.scatter() offers more flexibility in controlling individual point properties
(color, size) from data arrays.

5. Visualizing Errors

● plt.errorbar(x, y, yerr=..., xerr=..., fmt='o'):


○ Plots data points with error bars, useful for representing uncertainty in
measurements.
○ yerr: Vertical error. Can be a scalar, array, or 2D array for asymmetric errors.
○ xerr: Horizontal error.
○ fmt: Format string for the main plot (e.g., 'o' for markers).
○ ecolor, capsize, capthick: Customize error bar appearance.

Python
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='o', color='black', ecolor='lightgray', elinewidth=3, capsize=0);

6. Histograms, Binnings, and Density

● Histograms (plt.hist()):
○ Shows the distribution of a single numerical variable by dividing data into bins
and counting observations in each bin.
○ bins: Number of bins or sequence of bin edges.
○ density=True: Normalize to form a probability density (area under histogram
sums to 1).
○ alpha: Transparency.
○ histtype: 'bar' (default), 'barstacked', 'step', 'stepfilled'.

Python
data = np.random.randn(1000)
plt.hist(data, bins=30, alpha=0.5, histtype='stepfilled', color='steelblue', edgecolor='none');


● Two-Dimensional Histograms (plt.hist2d(), plt.hexbin()):
○ For visualizing the joint distribution of two variables.
○ plt.hist2d(): Rectangular bins, color intensity indicates density.
○ plt.hexbin(): Hexagonal bins, often more visually appealing.
● Kernel Density Estimation (KDE):
○ A non-parametric way to estimate the probability density function of a random
variable. Smoothed version of a histogram.
○ Often done using seaborn.kdeplot or scipy.stats.gaussian_kde.
○ plt.plot(density_x, density_y) after calculating KDE.

7. Customizing Plots

● Titles and Labels:


○ plt.title("Plot Title") or ax.set_title()
○ plt.xlabel("X-axis Label") or ax.set_xlabel()
○ plt.ylabel("Y-axis Label") or ax.set_ylabel()
● Legends (plt.legend()):
○ Displays labels for different lines/markers.
○ Needs label='...' argument in plot/scatter calls.
○ loc: Location of the legend (e.g., 'upper left', 'best').
● Limits and Ticks:
○ plt.xlim(xmin, xmax), plt.ylim(ymin, ymax) or ax.set_xlim(),
ax.set_ylim()
○ plt.xticks(ticks, labels), plt.yticks() or ax.set_xticks(),
ax.set_yticks()
○ plt.axis('tight'): Auto-tight limits. plt.axis('equal'): Equal aspect
ratio.
● Stylesheets:
○ plt.style.use('ggplot'), plt.style.use('seaborn-v0_8'),
plt.style.use('fivethirtyeight') etc.
○ Provides a quick way to change the overall aesthetic. plt.style.available
to see options.
● Grid: plt.grid(True)
● Colors and Colormaps:
○ color='red', cmap='viridis', cmap='plasma'.
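
A compact sketch pulling several of these customizations together with the object-oriented interface (values chosen only for illustration):
Python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label='sin(x)')
ax.set_title("A Customized Plot")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_xlim(0, 10)
ax.set_ylim(-1.5, 1.5)
ax.legend(loc='upper right')
ax.grid(True)
plt.show()
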

8. Multiple Subplots

plt.figure() and fig.add_subplot(): Object-oriented way (more flexible).


Python
fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1, 2, 1) # 1 row, 2 columns, 1st subplot
ax2 = fig.add_subplot(1, 2, 2) # 1 row, 2 columns, 2nd subplot
ax1.plot(x, np.sin(x))
ax2.plot(x, np.cos(x))
plt.tight_layout() # Adjust subplot params for a tight layout

plt.subplots(): A convenient helper function that creates a figure and a grid of subplots in a
single call.
Python
fig, axes = plt.subplots(2, 2, figsize=(8, 6)) # 2 rows, 2 columns

axes[0, 0].plot(x, np.sin(x)) # Access individual axes using array-like indexing


axes[1, 1].hist(np.random.randn(100), bins=20)
plt.tight_layout()


● plt.subplot(): MATLAB-style interface for creating subplots (less flexible for
complex layouts).

9. Text Annotation

● plt.text(x, y, 'Text') or ax.text(x, y, 'Text'): Add arbitrary text at a


specific data coordinate.
○ ha (horizontal alignment), va (vertical alignment), fontsize, color.
● plt.annotate('Text', xy=(x, y), xytext=(x_arrow, y_arrow),
arrowprops=dict(facecolor='black', shrink=0.05)) or ax.annotate():
○ Adds text with an arrow pointing from xytext to xy (the annotated point).
○ xy: Point to annotate.
○ xytext: Position of the text.
○ arrowprops: Dictionary to customize the arrow.
Mathematical Text ($ for LaTeX-like syntax):
Python
plt.title(r'$\sin(x)$ vs $\cos(x)$')
plt.text(5, 0.5, r'$\mu = 100, \sigma = 15$', fontsize=12)

Unit – V: Scikit-Learn (15 Hrs)


Scikit-learn is a free software machine learning library for the Python programming language. It
features various classification, regression, and clustering algorithms and is designed to
interoperate with NumPy and Pandas.

1. Introduction to Scikit-Learn

● Why Scikit-Learn?
1. Provides a consistent API for a wide range of machine learning algorithms.
2. Built on NumPy, SciPy, and Matplotlib.
3. Emphasis on ease of use, robust implementation, and good documentation.
4. Does not handle data loading, data manipulation (use Pandas), or highly
customized deep learning (use TensorFlow/PyTorch).
● Key Principles:
1. Estimators: All objects in scikit-learn that learn from data are called estimators.
They typically have fit() and predict() methods.
2. Consistency: All estimators share a common API.
3. Inspection: All parameters and learned attributes are public attributes.
4. Sensible Defaults: Most parameters have reasonable default values.
● Installation: Usually comes with Anaconda. Otherwise: pip install scikit-learn
● Common Workflow:
1. Choose a Model Class: Import the appropriate estimator (e.g., from
sklearn.linear_model import LinearRegression).
2. Choose Model Hyperparameters: Instantiate the model with desired settings
(e.g., model = LinearRegression(fit_intercept=True)).
3. Arrange Data: Prepare data into feature matrix X (2D NumPy array or Pandas
DataFrame) and target vector y (1D NumPy array or Pandas Series).
4. Fit the Model: Use the fit() method to train the model on your data
(model.fit(X, y)).
5. Predict/Transform: Use the predict() method to make predictions on new
data (model.predict(X_new)) or transform() for feature transformation.
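
A minimal end-to-end sketch of this workflow, using linear regression on synthetic data (the numbers are assumptions for illustration):
Python
import numpy as np
from sklearn.linear_model import LinearRegression   # 1. choose a model class

model = LinearRegression(fit_intercept=True)        # 2. choose hyperparameters

rng = np.random.RandomState(0)
X = 10 * rng.rand(50, 1)                             # 3. arrange data: X is (n_samples, n_features)
y = 3 * X.ravel() + rng.randn(50)                    #    y is (n_samples,)

model.fit(X, y)                                      # 4. fit the model

X_new = np.linspace(0, 10, 5)[:, np.newaxis]
model.predict(X_new)                                 # 5. predict on new data
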

2. Data Representation

● Feature Matrix (X):


○ Typically a 2D NumPy array or Pandas DataFrame of shape [n_samples,
n_features].
○ n_samples: The number of observations or data points (rows).
○ n_features: The number of features or attributes per observation (columns).
○ Each row is a single data point/sample.
○ Each column is a specific feature.
● Target Vector (y):
○ Typically a 1D NumPy array or Pandas Series of shape [n_samples] or
[n_samples, 1].
○ Contains the labels or target values corresponding to each sample in X.
○ For supervised learning tasks (classification, regression).

Example:
Python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data # Feature matrix: sepal length, sepal width, petal length, petal width
y = iris.target # Target vector: species (0, 1, 2)
X.shape # (150, 4) -> 150 samples, 4 features
y.shape # (150,) -> 150 target values


● Feature Engineering: The process of creating new features from existing raw data to
improve model performance.

3. Hyperparameters & Validation: Selecting the Best Model

● Hyperparameters:
○ Parameters of the learning algorithm that are set prior to training (e.g.,
n_estimators in a Random Forest, alpha in Ridge regression, C in SVM).
○ They are not learned from the data during training.
○ Crucial for controlling model complexity and preventing overfitting/underfitting.
● Model Validation:
○ The process of evaluating how well a trained model generalizes to unseen data.
○ Why not just use training error? Training error is usually overly optimistic; a
complex model might memorize the training data but perform poorly on new data
(overfitting).
○ Common Validation Approaches:
■ Train/Test Split: Divide data into training set (e.g., 70-80%) and test set
(remaining). Train on training, evaluate on test.
■ from sklearn.model_selection import train_test_split
■ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
■ Cross-Validation: A more robust method for estimating model
performance and selecting hyperparameters. Divides the data into
multiple "folds."
■ K-Fold Cross-Validation: Data is split into K equal-sized folds. In
K iterations, one fold is used for testing, and the remaining K-1
folds for training. The performance is averaged across K
iterations.
■ Stratified K-Fold: Ensures that each fold has a similar proportion
of class labels as the original dataset (important for imbalanced
datasets).
■ from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
■ scores = cross_val_score(model, X, y, cv=5)
● Hyperparameter Tuning:
○ Grid Search: Systematically explores a predefined set of hyperparameter values
for a given model.
■ from sklearn.model_selection import GridSearchCV
■ param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
■ grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
■ grid.fit(X_train, y_train)
■ grid.best_params_, grid.best_score_, grid.best_estimator_
○ Randomized Search: Randomly samples a fixed number of hyperparameter
combinations from a specified distribution. Often more efficient than Grid Search
for large search spaces.
■ from sklearn.model_selection import RandomizedSearchCV
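
A consolidated sketch of cross-validation and grid search on the iris data (the parameter values are chosen only for illustration):
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation with fixed hyperparameters
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores.mean())

# Grid search over a small hyperparameter grid
param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)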

4. Learning Curves

● Concept: Plots that show how a model's performance (e.g., score) changes as the
amount of training data increases.
● Interpretation:
○ High Bias (Underfitting): Both training and validation scores converge to a low
value. The model is too simple and cannot learn the underlying patterns. Getting
more data won't help much.
○ High Variance (Overfitting): Training score is high, but validation score is low,
and there's a significant gap between them. The model is too complex and has
memorized noise in the training data. Getting more data might help.
○ Good Fit: Training and validation scores converge to a high value, with a small
gap.
● Usage: Helps diagnose whether adding more training data, reducing model complexity,
or increasing model complexity would be beneficial.
● from sklearn.model_selection import learning_curve
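
A brief sketch of computing a learning curve (the model and training-set sizes are assumptions for illustration):
Python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
train_sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5))

print(train_scores.mean(axis=1))  # average training score per training-set size
print(val_scores.mean(axis=1))    # average validation score per training-set size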

5. Correlation

● Definition: A statistical measure that expresses the extent to which two variables are
linearly related.
● Pearson Correlation Coefficient (r):
○ Ranges from -1 to +1.
○ +1: Perfect positive linear relationship.
○ -1: Perfect negative linear relationship.
○ 0: No linear relationship.
● Usage in Data Science:
○ Feature Selection: Identify highly correlated features (multicollinearity can be a
problem for some models like linear regression).
○ Exploratory Data Analysis (EDA): Understand relationships between variables.

Calculation (Pandas):
Python
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [2,4,5,4,5], 'C': [5,4,3,2,1]})
df.corr() # Returns a correlation matrix
df['A'].corr(df['B']) # Correlation between two specific columns


● Correlation vs. Causation: Correlation does not imply causation.

6. Linear Regression

● Goal: To model the relationship between a dependent variable (target y) and one or
more independent variables (features X) by fitting a linear equation to observed data.
● Assumptions of Linear Regression: Linearity, independence of errors,
homoscedasticity, normality of residuals.

6.1. Simple Linear Regression

● Model: y = β₀ + β₁x₁ + ε
○ y: Dependent variable.
○ x₁: Independent variable (single feature).
○ β₀: Y-intercept.
○ β₁: Coefficient (slope).
○ ε: Error term.
● Minimizing Residual Sum of Squares (RSS): The model finds the β0 and β1 that
minimize the sum of the squared differences between the observed values and the
values predicted by the linear model.

Implementation (Scikit-Learn):
Python
from sklearn.linear_model import LinearRegression
import numpy as np

# Generate some sample data


rng = np.random.RandomState(42)
X = 10 * rng.rand(50, 1) # Feature (must be 2D for scikit-learn)
y = 2 * X - 1 + rng.randn(50, 1) # Target with some noise

model = LinearRegression(fit_intercept=True)  # fit_intercept=False forces the fit through the origin
model.fit(X, y)
print("Slope (coefficient):", model.coef_) # Beta_1
print("Intercept:", model.intercept_) # Beta_0

y_pred = model.predict(X)

# You can then plot X, y and X, y_pred to visualize the fit


● Evaluation Metrics (for Regression):
○ Mean Squared Error (MSE): Average of the squared differences between
predicted and actual values.
○ Root Mean Squared Error (RMSE): Square root of MSE (in the same units as
y).
○ Mean Absolute Error (MAE): Average of the absolute differences. Less
sensitive to outliers than MSE.
○ R-squared (R²): Proportion of the variance in the dependent variable that is
predictable from the independent variables. Higher is better (max 1.0).
■ model.score(X, y) gives R².
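
A short sketch computing these metrics with scikit-learn (synthetic data, similar to the fit above):
Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.RandomState(42)
X = 10 * rng.rand(50, 1)
y = 2 * X.ravel() - 1 + rng.randn(50)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)                      # same units as y
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)                 # equals model.score(X, y)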

6.2. Basis Function Regression

● Concept: Extends linear regression to model non-linear relationships by transforming


the input features into a higher-dimensional space using basis functions.
● The model is still linear in the coefficients but non-linear in the original features.
● Common Basis Functions:
○ Polynomial Basis: Adds polynomial terms (e.g., x², x³).
■ from sklearn.preprocessing import PolynomialFeatures
■ poly = PolynomialFeatures(degree=3, include_bias=False)
■ X_poly = poly.fit_transform(X)
○ Gaussian Basis: Uses Gaussian (radial basis) functions centered at chosen locations;
not built into scikit-learn, so typically implemented with a small custom transformer.
● Workflow:
○ Transform X using PolynomialFeatures (or other basis functions).
○ Fit a LinearRegression model on the transformed X_poly.
○ This allows fitting complex curves using the simple linear regression framework.
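
A compact sketch of this workflow using a pipeline (degree 3 and the synthetic data are chosen just for illustration):
Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)           # a non-linear relationship

poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x[:, np.newaxis], y)           # transform x, then fit a linear model

x_fit = np.linspace(0, 10, 100)
y_fit = poly_model.predict(x_fit[:, np.newaxis])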

6.3. Regularization

● Problem: In complex linear models (e.g., with many features or high-degree polynomial
basis functions), the model can overfit the training data by assigning very large
coefficients to specific features.

● Solution: Regularization adds a penalty term to the loss function (the function being
minimized during training) that discourages large coefficients. This effectively shrinks
coefficients towards zero.
● Types of Regularization:

○ L2 Regularization (Ridge Regression):


■ Adds a penalty proportional to the sum of the squares of the coefficients (∑βᵢ²).
■ Shrinks coefficients towards zero, but rarely to exactly zero.
■ from sklearn.linear_model import Ridge
■ alpha parameter controls the strength of the penalty. Larger alpha
means stronger regularization.
○ L1 Regularization (Lasso Regression):
■ Adds a penalty proportional to the sum of the absolute values of the coefficients (∑|βᵢ|).
■ Can drive some coefficients exactly to zero, effectively performing feature
selection.
■ from sklearn.linear_model import Lasso
■ alpha parameter controls the strength of the penalty.
○ Elastic Net: Combines L1 and L2 regularization.
■ from sklearn.linear_model import ElasticNet
■ Has both alpha and l1_ratio parameters.
● Benefits of Regularization:

○ Reduces overfitting.
○ Improves generalization performance on unseen data.
○ Lasso can perform automatic feature selection.
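
A brief sketch of Ridge and Lasso on polynomial features (the alpha values are illustrative, not tuned):
Python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
X = x[:, np.newaxis]

ridge_model = make_pipeline(PolynomialFeatures(7), Ridge(alpha=0.1))
ridge_model.fit(X, y)                       # L2 penalty shrinks coefficients

lasso_model = make_pipeline(PolynomialFeatures(7), Lasso(alpha=0.001, max_iter=10000))
lasso_model.fit(X, y)                       # L1 penalty can zero out some coefficients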

This is a comprehensive set of notes covering the topics outlined in your syllabus. To truly
master these concepts and tools, remember to:

● Practice extensively: The best way to learn is by coding. Use Jupyter Notebooks to
follow along with examples and experiment.
● Refer to documentation: The official documentation for NumPy, Pandas, Matplotlib,
and Scikit-learn is excellent.
● Work on projects: Apply these skills to real-world datasets.
● Read relevant books/tutorials: "Python for Data Analysis" by Wes McKinney (Pandas
creator) and "Python Data Science Handbook" by Jake VanderPlas are highly
recommended resources.
