ISAT 600 Progress Report 2
Data preprocessing is a crucial step in data analysis, transforming raw data into a
clean, structured dataset for subsequent processing. It involves manipulating,
filtering, or augmenting raw data to enhance its quality and ensure consistency,
which is essential for accurate analysis. In statistical and machine learning
applications, data preprocessing addresses challenges like missing data, outliers,
and variability in feature scales through techniques such as normalization and
standardization.
Effective data preprocessing significantly improves the accuracy of analysis by
eliminating inconsistencies and irrelevant data. It also reduces the risk of overfitting,
which occurs when a model learns noise or anomalies in the dataset, thereby limiting
its ability to generalize to new data. Techniques like normalization and feature
scaling help improve the adaptability of machine learning models, enhancing their
performance across varying datasets. Additionally, preprocessing contributes to
increased learning efficiency by ensuring that the data is in an optimal form for model
training.
In this project, data preprocessing focuses on three main areas: handling missing
data, addressing outliers, and normalizing and standardizing the dataset before
analysis. These steps ensure that the data is clean, consistent, and ready for robust
and efficient analysis.
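As a brief illustrative sketch of the normalization and standardization steps named above (the library choice is an assumption; the report does not specify tooling), scikit-learn provides both transformations in Python:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical sensor readings (rows = observations, columns = features).
X = np.array([[12.0, 300.0],
              [15.5, 280.0],
              [11.2, 410.0]])

# Normalization: rescale each feature to the [0, 1] range.
X_normalized = MinMaxScaler().fit_transform(X)

# Standardization: center each feature to mean 0 and unit variance.
X_standardized = StandardScaler().fit_transform(X)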
A. Missing data
Missing data points are a common challenge in both basic statistical analysis and
machine learning, where incomplete information can negatively affect the accuracy
and reliability of models. When variables lack data points, it is essential to address
these gaps efficiently to ensure that the analysis yields robust, unbiased results. Failing
to handle missing data can shrink the sample size, reducing the precision and
dependability of the analysis. Additionally, missing data can introduce bias, skewing
the results and leading to erroneous conclusions. In some cases, entire analyses
may be impossible if missing data affects crucial variables.
Data can be missing for various reasons, including technical issues, human
errors, privacy concerns, data processing failures, or the inherent characteristics of
the variable itself. Understanding the cause of missing data is vital, as it helps
determine the most appropriate handling strategy to maintain the integrity of the
analysis.
There are three main categories of missing data: missing completely at random (MCAR), where the missingness is unrelated to any variable in the study; missing at random (MAR), where the missingness depends only on observed variables; and missing not at random (MNAR), where the missingness depends on the unobserved value itself.
By identifying the type of missing data and applying appropriate methods such as
imputation, deletion, or model-based techniques, you can mitigate the impact of
missing values and enhance the quality and reliability of your analysis. Imputation and deletion are the most commonly used remedies for the problem of missing data.
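As a minimal sketch of locating missing values before choosing a remedy (pandas is an assumed tool, and the variable names are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical pollutant readings with gaps.
df = pd.DataFrame({
    "temp_c":    [21.3, np.nan, 19.8, 20.5],
    "ph":        [7.1, 7.3, np.nan, 7.0],
    "turbidity": [4.2, 4.5, 4.1, np.nan],
})

# Count missing values per variable to guide the handling strategy.
print(df.isna().sum())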
Deletion:
Three primary methods of deletion can be employed when missing data problems
are encountered: listwise deletion, pairwise deletion, and dropping variables. Each method has its own pros and cons, and the appropriate choice depends on the nature of the data and the amount of missing information.
I. Listwise Deletion
In listwise deletion, any observation with one or more missing values is completely
eliminated from the analysis. Only observations with a full set of data are retained for
analysis. This method is straightforward and may be suitable when the dataset is
large and the proportion of missing data is minimal. However, it assumes that the
data are missing completely at random (MCAR). If this assumption is not met,
removing incomplete cases can introduce bias, resulting in inaccurate parameter
estimates and reduced statistical power due to a smaller sample size.
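A minimal sketch of listwise deletion, reusing the hypothetical df from the earlier sketch:

# Listwise deletion: drop every observation with any missing value,
# keeping only fully observed cases.
complete_cases = df.dropna()

With the example above, only one of the four rows survives, illustrating how quickly listwise deletion can shrink a dataset.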
II. Pairwise Deletion
In pairwise deletion, each calculation uses all observations that have complete data for the specific variables involved, rather than discarding an entire record over a single missing value. This retains more information than listwise deletion, but the effective sample size can vary across different analyses, as they are based on different subsets of the data. This makes it challenging to replicate results, and the inconsistency can undermine the reliability of conclusions.
III. Dropping Variables
When a variable is missing for a large share of observations and is not essential to the analysis, it can be removed from the dataset entirely. This preserves all observations but sacrifices whatever information the variable carried.
Each deletion method comes with trade-offs between data retention, bias, and
accuracy, so the choice should be guided by the type of missing data and the overall
goals of the analysis.
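As an illustrative sketch of pairwise deletion (again assuming pandas), a correlation matrix computes each coefficient from only the rows where both variables are observed:

# Pairwise deletion: each correlation is computed from the rows where
# both variables are present, so different entries of the matrix may
# rest on different subsets of the data.
print(df.corr())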
Imputation:
Imputation fills in missing entries with estimated values instead of discarding records. A widely used approach, and the one examined here, is k-nearest neighbors (KNN) imputation, which estimates each missing value from the most similar complete observations in the dataset.
Advantages:
Preserving feature relationships: One of the major advantages of KNN
imputation is its ability to maintain the relationships between different features
in the dataset. Since imputed values are drawn from the closest neighboring
data points in feature space, the imputed values are more likely to reflect the
underlying patterns and correlations in the data. This can lead to improved
model performance compared to simpler imputation techniques, such as
mean imputation, which ignores feature interactions.
Flexibility in handling different data types: KNN imputation can be applied to
both numerical and categorical data. For numerical data, the imputed value is
typically the mean or median of the neighboring points. For categorical
variables, the most frequent category (mode) among the nearest neighbors is
used.
Handling multivariate data: KNN imputation can efficiently handle multivariate
datasets, where missing values may occur across multiple variables. Unlike
simpler imputation methods, KNN takes into account the values of other
features in the dataset to inform the imputation process, which can provide
more accurate estimates of missing values.
Adaptability to different distance metrics: The choice of distance metric plays
a significant role in determining which neighbors are considered closest to the
missing data point. Common distance metrics include Euclidean, Manhattan,
and Minkowski distances. The adaptability of KNN imputation to different
distance metrics makes it a versatile tool for various datasets and problem
domains.
Disadvantages: KNN imputation can be computationally expensive on large datasets, since distances must be computed against many observations, and its estimates degrade when the proportion of missing data is high or the nearest neighbors are not truly similar. In such cases, more sophisticated imputation techniques or domain-specific knowledge may be required.
Method:
The first step is to compute the distance between the data point with the
missing value and all other points in the dataset.
Let $x = (x_1, x_2, \ldots, x_n)$ represent a data point in the dataset with some missing features, and let $y = (y_1, y_2, \ldots, y_n)$ represent a data point from the dataset with complete features. The distance between $x$ and $y$ is calculated based on the available features.
Euclidean Distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
This is the most common metric for continuous numerical data. If a feature is
missing, it is excluded from the summation.
Manhattan Distance:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Minkowski Distance:
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
The parameter $p$ allows for flexibility: $p = 1$ gives the Manhattan distance, and $p = 2$ gives the Euclidean distance.
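A minimal Python sketch of these metrics (NumPy assumed); the mask implements the convention above that missing features are excluded from the summation:

import numpy as np

def nan_minkowski(x, y, p=2):
    """Minkowski distance between x and y, skipping features missing
    (NaN) in either point; p=1 is Manhattan, p=2 is Euclidean."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mask = ~(np.isnan(x) | np.isnan(y))  # jointly observed features only
    return float((np.abs(x[mask] - y[mask]) ** p).sum() ** (1.0 / p))

# The second feature of x is missing and is ignored in the distance.
print(nan_minkowski([1.0, np.nan, 3.0], [2.0, 5.0, 1.0], p=2))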
Once the distances are calculated between the incomplete data point and the
rest of the data, the k smallest distances are identified. These k data points are
called the k-nearest neighbors. The value of k is a hyperparameter chosen based on
cross-validation or domain knowledge. After identifying the k-nearest neighbors, the
missing value is imputed using an aggregation function (mean, median, or mode) of
the corresponding values from the nearest neighbors.
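Putting the steps together, a minimal sketch using scikit-learn's KNNImputer (an assumed tool choice; it computes a NaN-aware Euclidean distance and fills each gap with the mean of the corresponding feature over the k nearest neighbors):

import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical readings with a missing entry in the first row.
X = np.array([[1.0, np.nan, 3.0],
              [2.0, 5.0, 1.0],
              [1.5, 4.0, 2.5],
              [8.0, 9.0, 7.0]])

# k is a hyperparameter; here each missing value is replaced by the
# mean of that feature over the 2 nearest neighbors.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)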
WEEKLY PROGRESS
The tasks carried out this week were as follows:
- Several past studies were reviewed to identify potential techniques for applying ML to the analysis, modelling, and prediction of air and water pollutants.
- Data is being collected from the sensor network established near the Lake.
- Several data preprocessing methods, including approaches for replacing missing data, were studied for cleaning and preprocessing the collected data.