
Intelligent Environmental Modeling: Machine Learning to Tackle
Air and Water Pollution


ISAT 600: Progress Report 2
09/23/2024

1.2 Data preprocessing

Data preprocessing is a crucial step in data analysis, transforming raw data into a
clean, structured dataset for subsequent processing. It involves manipulating,
filtering, or augmenting raw data to enhance its quality and ensure consistency,
which is essential for accurate analysis. In statistical and machine learning
applications, data preprocessing addresses challenges like missing data, outliers,
and variability in feature scales through techniques such as normalization and
standardization.
Effective data preprocessing significantly improves the accuracy of analysis by
eliminating inconsistencies and irrelevant data. It also reduces the risk of overfitting,
which occurs when a model learns noise or anomalies in the dataset, thereby limiting
its ability to generalize to new data. Techniques like normalization and feature
scaling help improve the adaptability of machine learning models, enhancing their
performance across varying datasets. Additionally, preprocessing contributes to
increased learning efficiency by ensuring that the data is in an optimal form for model
training.
In this project, data preprocessing focuses on three main areas: handling missing
data, addressing outliers, and normalizing and standardizing the dataset before
analysis. These steps ensure that the data is clean, consistent, and ready for robust
and efficient analysis.
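
To make the scaling step concrete, the snippet below is a minimal sketch of normalization and standardization using scikit-learn; the column names and values are hypothetical placeholders rather than the project's actual sensor variables.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical pollutant readings; real column names come from the sensor data.
df = pd.DataFrame({"pm25": [12.0, 35.5, 8.2, 60.1],
                   "turbidity": [1.3, 4.8, 0.9, 7.2]})

# Normalization: rescale each feature to the [0, 1] range.
normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standardization: center each feature at 0 with unit variance.
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
```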

A. Missing data

Missing data points are a common challenge in both basic statistical analysis and
machine learning, where incomplete information can negatively affect the accuracy
and reliability of models. When variables lack data points, it is essential to address
these gaps efficiently to ensure that analysis yields robust, unbiased results. Failing
to handle missing data can shrink the sample size, reducing the precision and
dependability of the analysis. Additionally, missing data can introduce bias, skewing
the results and leading to erroneous conclusions. In some cases, entire analyses
may be impossible if missing data affects crucial variables.
Data can be missing for various reasons, including technical issues, human
errors, privacy concerns, data processing failures, or the inherent characteristics of
the variable itself. Understanding the cause of missing data is vital, as it helps
determine the most appropriate handling strategy to maintain the integrity of the
analysis.

There are three main categories of missing data:

 Missing Completely at Random (MCAR): In this case, the missing data is
unrelated to any observed or unobserved data, so the absence of data points
does not introduce bias. Although this can reduce the number of data points
available for the analysis, it will not skew the results toward an erroneous
conclusion.
 Missing at Random (MAR): Here, the likelihood of missing data depends on
other observed variables, but not on the value of the missing data itself. This
type of missing data can be accounted for using imputation methods that
incorporate information from other variables.
 Missing Not at Random (MNAR): In this category, the probability of missing
data is directly related to the value of the missing variable. Handling MNAR is
the most challenging, as it requires more sophisticated approaches to avoid
bias and distortion in the analysis.

By identifying the type of missing data and applying appropriate methods such as
imputation, deletion, or model-based techniques, you can mitigate the impact of
missing values and enhance the quality and reliability of your analysis. Imputation
and deletion are the most commonly used remedies for missing data.
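
Before choosing between deletion and imputation, it helps to quantify how much data is actually missing per variable. The snippet below is a small illustrative check with pandas; the columns and values are made up for demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps (NaN marks a missing measurement).
df = pd.DataFrame({"pm25": [12.0, np.nan, 8.2, 60.1],
                   "turbidity": [1.3, 4.8, np.nan, np.nan]})

print(df.isna().sum())    # count of missing values per variable
print(df.isna().mean())   # fraction of missing values per variable
```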

Deletion:
Three primary methods of deletion can be employed when missing data problems
are encountered: listwise deletion, pairwise deletion, and dropping variables. Each
method has its own pros and cons and depends on the nature of the data and the
amount of missing information.

I. Listwise Deletion
In listwise deletion, any observation with one or more missing values is completely
eliminated from the analysis. Only observations with a full set of data are retained for
analysis. This method is straightforward and may be suitable when the dataset is
large and the proportion of missing data is minimal. However, it assumes that the
data are missing completely at random (MCAR). If this assumption is not met,
removing incomplete cases can introduce bias, resulting in inaccurate parameter
estimates and reduced statistical power due to a smaller sample size.
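
A minimal sketch of listwise deletion with pandas, assuming the data sit in a DataFrame (the columns here are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"pm25": [12.0, np.nan, 8.2, 60.1],
                   "turbidity": [1.3, 4.8, np.nan, 7.2]})

# Listwise deletion: keep only rows with a complete set of observations.
complete_cases = df.dropna()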

II. Pairwise Deletion


Pairwise deletion, also known as available-case analysis, excludes a case only when
a variable required by a given analysis is missing. This method uses all available data for
each pair of variables, allowing more data to be retained compared to listwise
deletion. However, it assumes that data are MCAR, which may not always be the
case. One major drawback is that the results from pairwise deletion can vary across

different analyses, as they are based on different subsets of the data. This makes it
challenging to replicate results, and the inconsistency can undermine the reliability of
conclusions.
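
As an illustration of pairwise deletion, pandas computes correlations on a pairwise-complete basis by default, so each coefficient uses all rows where both variables of that pair are observed. The data below are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"pm25":      [12.0, np.nan, 8.2, 60.1, 22.4],
                   "turbidity": [1.3, 4.8, np.nan, 7.2, 2.1],
                   "temp":      [18.0, 21.5, 19.2, np.nan, 20.3]})

# Pairwise deletion: each correlation uses all rows where *that pair* of
# variables is observed, so different cells may rest on different subsets.
pairwise_corr = df.corr()
```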

III. Dropping Variables


When a significant proportion of data is missing for a specific variable, it may be
practical to discard the variable entirely, particularly if it is deemed insignificant to the
analysis. This method prevents incomplete variables from distorting the analysis, but
care must be taken to ensure that the variable is not critical to the research
objectives. Dropping variables should be avoided if they carry valuable information,
as it could lead to a loss of important insights.
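
A small sketch of dropping variables based on their missing-data fraction; the 50% cutoff is purely illustrative and would depend on the goals of the analysis:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"pm25":      [12.0, np.nan, 8.2, 60.1],
                   "turbidity": [np.nan, np.nan, np.nan, 7.2]})

# Drop any variable missing more than a chosen share of its values
# (the 0.5 threshold here is an illustrative assumption, not a recommendation).
threshold = 0.5
reduced = df.loc[:, df.isna().mean() <= threshold]
```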

Each deletion method comes with trade-offs between data retention, bias, and
accuracy, so the choice should be guided by the type of missing data and the overall
goals of the analysis.

Imputation:

In some instances, deleting data is not a viable option, especially when it
results in the loss of too much information, compromising the reliability of the
analysis. Discarding excessive data may render the dataset insufficient for accurate
analysis or prediction, making the results unreliable. In such cases, imputation offers
a robust alternative for handling missing data. Imputation involves generating new
data points using various statistical methods to replace the missing values, thereby
maintaining the integrity of the dataset. This approach allows analysts to use the
complete dataset without sacrificing valuable information. Depending on the cause of
the missing data and the imputation technique applied, the resulting dataset can be a
reliable approximation of the true data, enabling accurate analysis and predictions.
By leveraging imputation, researchers can avoid the risks associated with data
deletion, such as bias or reduced statistical power, and ensure that their models are
trained on datasets that are as complete as possible.

k-Nearest Neighbor Imputation

K-Nearest Neighbors (KNN) imputation is a popular method for handling
missing data by using the relationships between data points. The technique works by
identifying the k-nearest neighbors (based on a specified distance metric, such as
Euclidean distance) for the data points that contain missing values. Once the nearest
neighbors are found, the missing values are imputed using the mean, median, or
mode of the neighboring data points, depending on the nature of the variable and the
chosen strategy.
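
One readily available implementation of this technique is scikit-learn's KNNImputer, shown below on a toy numeric array; the neighbor count and data are illustrative only.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is replaced by the mean of that feature over the
# k nearest rows, with distances computed on the observed features.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)
```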

Advantages:

 Preserving feature relationships: One of the major advantages of KNN
imputation is its ability to maintain the relationships between different features
in the dataset. Since imputed values are drawn from the closest neighboring
data points in feature space, the imputed values are more likely to reflect the
underlying patterns and correlations in the data. This can lead to improved
model performance compared to simpler imputation techniques, such as
mean imputation, which ignores feature interactions.
 Flexibility in handling different data types: KNN imputation can be applied to
both numerical and categorical data. For numerical data, the imputed value is
typically the mean or median of the neighboring points. For categorical
variables, the most frequent category (mode) among the nearest neighbors is
used.
 Handling multivariate data: KNN imputation can efficiently handle multivariate
datasets, where missing values may occur across multiple variables. Unlike
simpler imputation methods, KNN takes into account the values of other
features in the dataset to inform the imputation process, which can provide
more accurate estimates of missing values.
 Adaptability to different distance metrics: The choice of distance metric plays
a significant role in determining which neighbors are considered closest to the
missing data point. Common distance metrics include Euclidean, Manhattan,
and Minkowski distances. The adaptability of KNN imputation to different
distance metrics makes it a versatile tool for various datasets and problem
domains.

Challenges and Considerations:


 Computational cost: KNN imputation can be computationally expensive,
especially for large datasets. For each missing value, the algorithm needs to
calculate distances to all other data points in the dataset, which increases the
computational load as the size of the dataset grows.
 Sensitivity to the choice of k: The performance of KNN imputation depends on
the selection of the parameter k (the number of neighbors to consider). A
small k may lead to unreliable imputed values due to insufficient information,
while a large k may dilute the influence of the most relevant neighbors,
leading to less accurate imputations. Cross-validation, or a simple held-out
masking check, can be used to select the optimal k for a given dataset (a
sketch follows this list).
 Handling outliers: KNN imputation may be sensitive to outliers in the data. If
outliers are present among the nearest neighbors, they can influence the
imputed values and reduce the accuracy of the method. Preprocessing steps
such as outlier detection and removal may be necessary to ensure reliable
imputations.
 MNAR data: While KNN imputation works well for data that are MCAR or MAR, it
may not perform as effectively when data are MNAR, where the probability of
missingness is related to the value of the missing data itself. In such cases,

more sophisticated imputation techniques or domain-specific knowledge may
be required.
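
One simple way to carry out the validation mentioned above is to hide a fraction of known entries, impute them with different values of k, and compare the reconstruction error. The sketch below uses synthetic data standing in for the sensor measurements; it is an assumption-laden illustration, not the project's actual tuning procedure.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Fully observed toy matrix standing in for the sensor data (illustrative only).
X_true = rng.normal(size=(200, 4))
X_true[:, 1] = 0.7 * X_true[:, 0] + 0.3 * X_true[:, 1]   # add some correlation

# Hide a random 10% of the entries to simulate missingness we can score against.
mask = rng.random(X_true.shape) < 0.10
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Evaluate candidate k values by the error on the artificially hidden entries.
for k in (1, 3, 5, 10):
    X_hat = KNNImputer(n_neighbors=k).fit_transform(X_missing)
    rmse = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))
    print(f"k={k:2d}  RMSE on held-out entries: {rmse:.3f}")
```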

Method:

The first step is to compute the distance between the data point with the
missing value and all other points in the dataset.

Let x = (x_1, x_2, …, x_n) represent a data point in the dataset with some missing
features, and let y = (y_1, y_2, …, y_n) represent a data point from the dataset with
complete features. The distance between x and y is calculated based on the
available features.

Common distance metrics include:

Euclidean Distance:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

This is the most common metric for continuous numerical data. If a feature is
missing, it is excluded from the summation.

Manhattan Distance:
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

This metric is often used for categorical or ordinal data.

Minkowski Distance (Generalized Distance Metric):

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

The parameter p allows for flexibility: p=1 gives Manhattan distance, and p=2 gives
Euclidean distance.
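
Below is a from-scratch sketch of the distance step, restricting the sum to features observed in both points as described above; setting p = 1 or p = 2 recovers the Manhattan and Euclidean cases. (Library implementations such as scikit-learn's nan_euclidean_distances additionally rescale for the number of observed coordinates; this sketch keeps the plain exclusion.)

```python
import numpy as np

def partial_minkowski(x, y, p=2):
    """Minkowski distance between two points, computed only over the
    features observed in both (p=1: Manhattan, p=2: Euclidean)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    observed = ~np.isnan(x) & ~np.isnan(y)      # skip missing coordinates
    diffs = np.abs(x[observed] - y[observed])
    return np.sum(diffs ** p) ** (1.0 / p)

# Example: the third feature of x is missing, so only the first two are used.
print(partial_minkowski([1.0, 2.0, np.nan], [2.0, 4.0, 6.0], p=2))
```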

Once the distances are calculated between the incomplete data point and the
rest of the data, the k smallest distances are identified. These k data points are
called the k-nearest neighbors. The value of k is a hyperparameter chosen based on
cross-validation or domain knowledge. After identifying the k-nearest neighbors, the
missing value is imputed using an aggregation function (mean, median, or mode) of
the corresponding values from the nearest neighbors.
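
Putting the steps together, the following sketch imputes a single missing entry by computing distances to all candidate rows, taking the k smallest, and aggregating with the mean (one of the options named above). It is illustrative only; in practice a library imputer would typically be used.

```python
import numpy as np

def knn_impute_entry(X, row, col, k=3, p=2):
    """Impute X[row, col] with the mean of that column over the k rows
    closest to X[row] (distance over mutually observed features only)."""
    target = X[row]
    candidates = []
    for i, other in enumerate(X):
        if i == row or np.isnan(other[col]):
            continue                            # a neighbor must supply the value we need
        observed = ~np.isnan(target) & ~np.isnan(other)
        if not observed.any():
            continue
        d = np.sum(np.abs(target[observed] - other[observed]) ** p) ** (1 / p)
        candidates.append((d, other[col]))
    if not candidates:
        raise ValueError("no usable neighbors for this entry")
    candidates.sort(key=lambda t: t[0])         # keep the k smallest distances
    neighbors = [value for _, value in candidates[:k]]
    return float(np.mean(neighbors))

# Toy example: impute the missing third feature of row 0 from its 2 nearest rows.
X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.0],
              [0.9, 1.8, 2.8],
              [8.0, 9.0, 10.0]])
X[0, 2] = knn_impute_entry(X, row=0, col=2, k=2)
```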

WEEKLY PROGRESS

The tasks carried out this week are as follows:
 Several past studies were reviewed to identify potential techniques for using
ML in the analysis, modeling, and prediction of air and water pollutants.
 Data is being collected from the sensor network established near the Lake.
 Several data preprocessing methods were studied to clean and prepare the
data, including the replacement of missing values.
