Prac3 AAM
Introduction:
In data analysis and machine learning, the quality of a model depends on how well its input data is prepared. In
this practical, we go through the steps of reading data, identifying attribute types, handling missing values,
rescaling and encoding, and selecting relevant features. These steps prepare the data for analysis or model building.
Dataset Overview:
The dataset comprises information about cars, encompassing both numeric and categorical attributes. A
thorough understanding of these attributes is crucial before proceeding with any analysis.
Data Analysis:
1. Reading Data:
Text Format: The dataset in text format was read line by line, and each line was processed to extract relevant
information about the cars.
Code:
# Example code to read text file
with open("cars_dataset.txt", "r") as file:
    for line in file:
        # Process each line to extract information
        pass
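The per-line processing left as a placeholder above could look like the following minimal sketch. It assumes the text file has a header row and comma-separated fields; the delimiter, the helper names `parse_line` and `read_records`, and the sample rows are all hypothetical, since the report does not describe the layout of cars_dataset.txt.

```python
def parse_line(line, delimiter=","):
    """Split one raw line into a list of stripped field values."""
    return [field.strip() for field in line.rstrip("\n").split(delimiter)]


def read_records(lines, delimiter=","):
    """Turn an iterable of lines into dicts keyed by the header row.

    Blank lines are skipped; the first non-blank line is treated as the header.
    """
    records = []
    header = None
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines
        fields = parse_line(line, delimiter)
        if header is None:
            header = fields
        else:
            records.append(dict(zip(header, fields)))
    return records


# Hypothetical sample standing in for the contents of cars_dataset.txt
sample_lines = ["make, mpg, cylinders\n", "toyota, 31.0, 4\n", "\n", "ford, 18.5, 8\n"]
records = read_records(sample_lines)
```

Because an open file object is itself an iterable of lines, `read_records(file)` could be called inside the `with open(...)` block in the same way. All values come back as strings and would still need converting to numbers.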
CSV Format: The dataset in CSV format was read into a Python environment using the Pandas library.
Code:
# Example code to read CSV file
import pandas as pd
df = pd.read_csv("cars_dataset.csv")
JSON Format: The dataset in JSON format was loaded into memory using Python's built-in json module.
Code:
# Example code to read JSON file
import json
with open("cars_dataset.json", "r") as file:
    data = json.load(file)
XML Format: The dataset in XML format was parsed using Python's lxml library.
Code:
# Example code to read XML file
from lxml import etree
tree = etree.parse("cars_dataset.xml")
root = tree.getroot()
# Traverse XML structure to extract information
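The traversal step mentioned in the last comment might be sketched as follows. The element names (`cars`, `car`, `make`, `mpg`) are assumptions made for illustration, not the actual structure of cars_dataset.xml. The sketch uses the standard library's `xml.etree.ElementTree` so that it runs without extra installs; `lxml.etree` offers the same `findall` and child-iteration calls.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real cars_dataset.xml may be structured differently
xml_text = """
<cars>
  <car><make>toyota</make><mpg>31.0</mpg></car>
  <car><make>ford</make><mpg>18.5</mpg></car>
</cars>
""".strip()

root = ET.fromstring(xml_text)

# Collect each <car> element's children into a dict of tag -> text
cars = [{child.tag: child.text for child in car} for car in root.findall("car")]
```

Each entry in `cars` is a plain dictionary of strings, so the list can be passed to `pd.DataFrame(cars)` if a tabular view is needed.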
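2. Attribute Types: The attributes in the dataset were categorized into two types:
Numeric Attributes: attributes whose values are numbers (integers or floats) and can be compared arithmetically.
Categorical Attributes: attributes whose values are labels drawn from a limited set of categories.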
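One way to separate the two attribute types in Pandas is by column dtype. This is a minimal sketch; the column names and values below are hypothetical stand-ins, as the report does not list the dataset's actual attributes.

```python
import pandas as pd

# Hypothetical sample; the real dataset's columns are not listed in this report
df = pd.DataFrame({
    "make": ["toyota", "ford", "bmw"],
    "mpg": [31.0, 18.5, 24.0],
    "cylinders": [4, 8, 6],
    "fuel_type": ["gas", "gas", "diesel"],
})

# Columns with integer or float dtypes are treated as numeric
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else (here, string columns) is treated as categorical
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```

A dtype-based split is only a first pass: a numeric code such as a model-year or ID column may still be better handled as categorical.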
Data Preprocessing:
Handling Missing Data: Missing data can hinder analysis and modeling. To address this issue, rows containing
null values were removed from the dataset.
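Row-wise removal of null values can be sketched with Pandas' `dropna`. The small DataFrame here is a hypothetical stand-in for the cars data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in each of two rows
df = pd.DataFrame({
    "make": ["toyota", None, "bmw"],
    "mpg": [31.0, 18.5, np.nan],
})

# Count missing values per column before deciding how to handle them
missing_per_column = df.isna().sum()

# Drop every row that contains at least one null value, then renumber the index
cleaned = df.dropna().reset_index(drop=True)
```

Dropping rows is the simplest option but discards the remaining values in those rows, so checking `missing_per_column` first shows how much data would be lost.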
Rescaling and Encoding: To prepare the data for analysis and modeling, rescaling and encoding were
performed:
Rescaling Data: Numeric attributes were rescaled using min-max scaling to bring them within a
common range, ensuring fair comparison between attributes.
Encoding Data: Categorical attributes were encoded using one-hot encoding to convert them into a
numerical format suitable for machine learning algorithms.
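Both steps can be sketched directly in Pandas. Min-max scaling maps each value x to (x - min) / (max - min), and one-hot encoding replaces a categorical column with one indicator column per category. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample; column names are placeholders for the real attributes
df = pd.DataFrame({
    "mpg": [10.0, 20.0, 30.0],
    "horsepower": [200.0, 150.0, 100.0],
    "fuel_type": ["gas", "diesel", "gas"],
})

numeric_cols = ["mpg", "horsepower"]
categorical_cols = ["fuel_type"]

# Min-max scaling: x' = (x - min) / (max - min), mapping each column to [0, 1]
col_min = df[numeric_cols].min()
col_max = df[numeric_cols].max()
scaled = df.copy()
scaled[numeric_cols] = (df[numeric_cols] - col_min) / (col_max - col_min)

# One-hot encoding: one 0/1 indicator column per category value
encoded = pd.get_dummies(scaled, columns=categorical_cols, dtype=int)
```

A column whose values are all identical has max equal to min, so this formula would divide by zero and produce NaN; such columns need separate handling.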
Feature Selection:
Feature selection narrows the dataset to the attributes most useful for prediction, which helps in building
accurate models. In this report, feature selection was performed based on correlation analysis:
Correlation Analysis: The correlation matrix was computed to identify features highly correlated with
the target variable. Features with correlation coefficients above a threshold (e.g., 0.5) were selected for
further analysis.
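The selection rule can be sketched as follows. The target column `price`, the feature names, and the values are hypothetical. Taking the absolute value of the correlation is an assumption added here so that strongly negative correlations are kept as well; the report only states "above a threshold".

```python
import pandas as pd

# Hypothetical numeric data; "price" stands in for the target variable
df = pd.DataFrame({
    "horsepower": [100, 150, 200, 250, 300],
    "weight": [2000, 2500, 2600, 3400, 3600],
    "doors": [4, 2, 4, 2, 4],
    "price": [10, 15, 21, 24, 31],
})
target = "price"
threshold = 0.5

# Pearson correlation of every feature with the target (target itself removed)
corr_with_target = df.corr()[target].drop(target)

# Keep features whose absolute correlation exceeds the threshold
# (using abs() is an assumption, see the note above)
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()
```

Pearson correlation only captures linear relationships, so a feature that falls below the threshold may still be informative in a non-linear model.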
Conclusion:
In this practical, we explored various steps involved in preprocessing a dataset for analysis and modeling tasks.
By reading data in different formats such as text, CSV, JSON, and XML, we gained insights into the dataset's
structure. We identified numeric and categorical attributes, which provided a foundation for subsequent data
cleaning and preprocessing steps. By handling missing data and applying techniques like rescaling and
encoding, we ensured the dataset was ready for analysis. Additionally, feature selection based on correlation
analysis allowed us to focus on relevant attributes for predictive modeling. Overall, these preprocessing steps
are essential for ensuring data accuracy and readiness for advanced analysis techniques in data science and
machine learning.