lab 1 ML lab
Week1:
Task: Data Wrangling
Required Python Libraries: numpy, pandas
Pandas is a powerful data manipulation and analysis library for
Python. It provides data structures such as Series and DataFrames to
clean, transform, and analyze large datasets efficiently, and it
integrates seamlessly with other Python libraries, such as NumPy and
Matplotlib. It offers powerful functions for data transformation,
aggregation, and visualization.
1. Pandas – Data Wrangling Made Easy
Pandas is an open-source library specifically designed for data manipulation
and analysis. It provides two essential data structures: Series
(1-dimensional) and DataFrame (2-dimensional), which make it easy
to work with structured data, such as tables or CSV files.
Creating a Series
Code:
import pandas as pd
data = [10, 20, 30, 40]
v = pd.Series(data)
print(v)
Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
➢ Keys: Become column names (Name, Age, City).
➢ Values: Each list supplies the entries of its column, one per row.
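To check how the dictionary maps onto the table, it can help to inspect the resulting DataFrame. The sketch below rebuilds the same example and looks at its column labels and shape.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
}
df = pd.DataFrame(data)

# Dictionary keys become the column labels, in insertion order.
column_names = list(df.columns)
print(column_names)

# Each list supplies one column, so the table has 3 rows and 3 columns.
print(df.shape)
```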
Reading Data from Files
Code:
df = pd.read_csv('example.csv')
or
df = pd.read_csv("transactions.csv")
Download the transactions.csv dataset from this link:
https://github.com/ben519/DataWrangling/blob/master/Data/transactions.csv
• pd.read_csv('filename.csv'): Reads a CSV file and converts it
into a DataFrame.
• Ensure that example.csv is in the same directory as your script
or provide the full path.
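To try read_csv without downloading anything, the sketch below parses a small CSV held in memory. The text in csv_text and its values are made up for illustration; only the column names echo the transactions dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk (values are made up).
csv_text = """TransactionID,UserID,ProductID,Quantity
1,7,2,1
2,3,4,1
3,3,2,3
"""

# read_csv accepts a path or a file-like object; StringIO plays the file here.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # number of rows and columns parsed from the text
print(df.head(2))  # first two rows
```

With a real file, the only change is passing the path, for example pd.read_csv('transactions.csv').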
Ex:
printing the first 10 rows of the DataFrame:
code:
import pandas as pd
df = pd.read_csv('transactions.csv')
print(df.head(10))
Ex:4
Code:
print(df.describe())
• describe(): Summarizes data with metrics like mean, count,
min, max, etc.
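As a small illustration of what describe() returns, the sketch below runs it on a three-row DataFrame with one text column and one numeric column. By default only the numeric column is summarized, and individual statistics can be read from the result by label.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# describe() returns a DataFrame whose rows are the statistics
# (count, mean, std, min, quartiles, max).
summary = df.describe()
print(summary)

# Look up a single statistic by row label and column name.
mean_age = summary.loc['mean', 'Age']
print(mean_age)
```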
ex:5
code:
print(df['Name'])
➢ Accessing a single column
Ex:5
Code:
print(df[['Name', 'City']])
➢ Accessing multiple columns
ex:5
code:
print(df.iloc[1])
➢ Accessing rows by index (this accesses the second row)
❖ df['column']: Retrieves a specific column.
❖ df[['col1', 'col2']]: Retrieves multiple columns.
❖ df.iloc[row_index]: Accesses a row by its position.
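The three access patterns return different kinds of objects, which matters when chaining further operations. A minimal sketch using the Name/Age/City example:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

names = df['Name']                # one column label -> Series
name_city = df[['Name', 'City']]  # list of labels -> DataFrame
second_row = df.iloc[1]           # position 1 -> second row, as a Series

print(type(names).__name__)
print(type(name_city).__name__)
print(second_row['Name'])
```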
Filtering Data
Code:
filtered_df = df[df['Age'] > 30]
print(filtered_df)
• Condition (df['Age'] > 30): Filters rows where the Age
column has values greater than 30.
• Result: Returns a new DataFrame with only matching rows.
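The filter works in two steps: the comparison produces a boolean Series, and indexing with that Series keeps the rows marked True. The sketch below separates the two steps on the small Name/Age example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Step 1: the condition yields one True/False value per row.
mask = df['Age'] > 30
print(mask.tolist())

# Step 2: indexing with the mask keeps only the True rows.
filtered_df = df[mask]
print(filtered_df)

# The original DataFrame is left unchanged.
print(len(df))
```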
Renaming Columns
Code:
df.rename(columns={'Name': 'Full Name', 'Age': 'Years'},
          inplace=True)
print(df)
➢ rename(columns={old: new}): Renames columns in the
DataFrame.
➢ inplace=True: Updates the DataFrame directly without
needing to assign it to a new variable.
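For comparison, here is a sketch of the same rename without inplace=True. In that case rename() returns a renamed copy and leaves the original untouched, so the result has to be assigned to a variable.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Without inplace=True, rename() returns a new DataFrame.
renamed = df.rename(columns={'Name': 'Full Name', 'Age': 'Years'})

print(list(df.columns))       # original labels are unchanged
print(list(renamed.columns))  # new labels live on the copy
```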
Ex:
Code:
import pandas as pd
df = pd.read_csv("transactions.csv")
df.rename(columns={"Quantity": "Quant"}, inplace=True)
print(df.head())
df.rename(columns={...}):
• This method is used to rename the columns of a DataFrame.
• The columns parameter accepts a dictionary where the keys are
the current column names, and the values are the new names.
❖ Drop rows with missing values
df.dropna(inplace=True)
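Before dropping rows, it is often useful to count the missing values per column. The sketch below uses a small made-up DataFrame with one missing Age and the non-inplace form of dropna(), which returns a cleaned copy.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in the Age column.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
})

# Count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Drop every row that contains at least one missing value.
cleaned = df.dropna()
print(cleaned)
```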
Sorting Data
Code:
import pandas as pd
df = pd.read_csv("transactions.csv")
df_sorted = df.sort_values(by=["Quantity", "TransactionDate"],
ascending=[True, False])
print(df_sorted.head())
sort_values(by=["Quantity", "TransactionDate"]):
➢ The by parameter specifies the columns to sort by.
➢ Here, the DataFrame is sorted by Quantity first and then by
TransactionDate.
ascending=[True, False]:
➢ This specifies the sorting order for each column.
➢ True for Quantity means ascending order.
➢ False for TransactionDate means descending order.
Assign the result:
➢ The sort_values() method returns a new DataFrame with
the specified sorting applied.
➢ Assign it to a new variable (df_sorted) or overwrite the
original DataFrame (df).
Print the sorted DataFrame:
• Use df_sorted.head() to view the first few rows of the sorted
data.
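The two-key sort can be checked on a tiny table where the expected order is easy to see by eye. The values below are made up for illustration; the dates are ISO-formatted strings, which sort in chronological order.

```python
import pandas as pd

# Made-up rows: two with Quantity 1 and two with Quantity 2.
df = pd.DataFrame({
    'Quantity': [2, 1, 2, 1],
    'TransactionDate': ['2011-01-03', '2011-01-01', '2011-01-05', '2011-01-02'],
})

# Quantity ascending first; ties are broken by TransactionDate descending.
df_sorted = df.sort_values(by=['Quantity', 'TransactionDate'],
                           ascending=[True, False])
print(df_sorted)

# The original row labels travel with the rows, so they show the new order.
print(df_sorted.index.tolist())
```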
Subsetting data
• Example: Subset rows where Quantity > 1:
```python
subset = df[df['Quantity'] > 1]
print(subset)
```
You can combine conditions using `&` (and) and `|` (or); wrap each condition in parentheses.
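A short sketch of combined conditions on made-up data follows. Each condition sits in its own parentheses because `&` and `|` bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    'UserID': [1, 2, 1, 3],
    'Quantity': [1, 3, 2, 1],
})

# AND: rows where Quantity > 1 and UserID == 1.
both = df[(df['Quantity'] > 1) & (df['UserID'] == 1)]

# OR: rows where Quantity > 1 or UserID == 3.
either = df[(df['Quantity'] > 1) | (df['UserID'] == 3)]

print(both)
print(either)
```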
Ex:1
Code:
# Subset rows 1, 3, and 6 (positions 0, 2, and 5)
subset_rows = df.iloc[[0, 2, 5]]
print(subset_rows)
iloc Method:
• The iloc method is used for integer-location-based indexing.
• The rows are specified as a list of indices: [0, 2, 5].
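Because iloc counts positions rather than reading index labels, it behaves the same whatever the index looks like. The sketch below uses a hypothetical non-default index (10 to 15) to make the difference visible.

```python
import pandas as pd

# Hypothetical index labels 10..15 so positions and labels differ.
df = pd.DataFrame({'Quantity': [1, 3, 2, 1, 5, 4]},
                  index=[10, 11, 12, 13, 14, 15])

# Positions 0, 2, and 5 are the 1st, 3rd, and 6th rows.
subset_rows = df.iloc[[0, 2, 5]]
print(subset_rows)

# The selected rows keep their original labels.
print(subset_rows.index.tolist())
```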
Subset rows where an external array, foo, is True
Code:
foo = [True, False, True, False]  # Example boolean array; it needs one entry per row of df
subset_rows = df[foo]
print(subset_rows)
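An external boolean list only works as a mask when it has exactly one entry per row. The sketch below uses a made-up four-row DataFrame so that foo lines up with it.

```python
import pandas as pd

# Made-up DataFrame with exactly four rows, matching the length of foo.
df = pd.DataFrame({'Quantity': [1, 3, 2, 1]})

foo = [True, False, True, False]

# Guard against a length mismatch, which pandas would reject.
if len(foo) != len(df):
    raise ValueError("foo must have one True/False entry per row of df")

# Rows whose entry in foo is True are kept.
subset_rows = df[foo]
print(subset_rows)
```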
➢ The list [new_user_id, new_product_id, new_quantity,
new_transaction_date] contains the values for the new row in
the same order as the columns.
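One common way to append such a list as a new row is to assign it to df.loc at the next free index label. This is a sketch under assumptions rather than the lab's own code: it assumes a default integer index, and the column order and all values below are made up for illustration.

```python
import pandas as pd

# Made-up two-row table; the column order is an assumption for this sketch.
df = pd.DataFrame({
    'UserID': [7, 3],
    'ProductID': [2, 4],
    'Quantity': [1, 1],
    'TransactionDate': ['2010-08-21', '2011-05-26'],
})

# Hypothetical values for the new row.
new_user_id = 5
new_product_id = 2
new_quantity = 3
new_transaction_date = '2011-06-01'

# With a default 0..n-1 index, len(df) is the next unused label.
df.loc[len(df)] = [new_user_id, new_product_id, new_quantity,
                   new_transaction_date]
print(df)
```

The list must have one value per column, in column order; otherwise pandas raises an error.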
• Convert the TransactionDate column to datetime:
```python
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])
```
• This allows easy manipulation of dates.
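As an illustration of what the conversion enables, the sketch below extracts the year through the .dt accessor and filters by date. The two dates and the Year column are made up for this example.

```python
import pandas as pd

# Made-up dates stored as plain strings.
df = pd.DataFrame({'TransactionDate': ['2010-08-21', '2011-05-26']})

# Convert the strings to datetime values.
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])

# Date parts are available through the .dt accessor.
df['Year'] = df['TransactionDate'].dt.year

# Dates can be compared directly against a timestamp.
later = df[df['TransactionDate'] > pd.Timestamp('2011-01-01')]

print(df)
print(later)
```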
Exercise questions:
to PID and UID respectively
9. Order the rows of transactions by TransactionID
descending (ascending=False); for ascending order, use ascending=True
10. Order the rows of transactions by Quantity
ascending, TransactionDate descending
11. Set the column order of transactions as ProductID,
Quantity, TransactionDate, TransactionID, UserID
12. Make UserID the first column of transactions
13. Extracting arrays from a DataFrame. Get the 2nd
column
14. Get the ProductID array
15. Get the ProductID array using the following variable