
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Week 2 Characterization of
Learning Problems

Video 2.3 Feature related issues


The Concept Learning task in terms of
Features, Feature vectors and the Object
(Feature) Space
The classical way of viewing a scenario for a learning task is to:
- Define an appropriate set of Features
- View each Data-item as a Feature vector
- Consider the Feature (Object) Space spanned by the Features
- Populate the Feature space with the Feature vectors (Data-items)
- Find optimal multi-dimensional surfaces (Hyperplanes) in the Object Space that circumscribe the extensions of all concepts involved.

The engineering of Features is crucial for the complexity of the Object Space and, as a consequence, also crucial for the complexity of the learning problem.
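
To make this view concrete, here is a minimal sketch (not part of the lecture, assuming numpy and scikit-learn are available) of data-items as feature vectors and a learned separating hyperplane; the features, values and labels are purely illustrative:

```python
# Minimal sketch: data-items as feature vectors in a feature space and a
# learned hyperplane separating one concept from the rest.
# The feature names, values and labels below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two illustrative features per data-item: body length [m], weight [kg]
X = np.array([[0.3, 4.0], [0.4, 6.0], [2.1, 300.0], [1.8, 250.0]])
y = np.array([0, 0, 1, 1])            # concept membership label per item

clf = LogisticRegression().fit(X, y)  # fits a separating hyperplane
print(clf.coef_, clf.intercept_)      # coefficients define the surface
print(clf.predict([[0.5, 5.0]]))      # classify a new feature vector
```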
The three basic cases for Feature Engineering

Case 1: A reasonably well-composed set of Features is given, based on domain-theoretic considerations.

Case 2: A huge set of possible features is available that needs to be reduced to a manageable size.

Case 3: Data-items are of a non-digital nature and relevant features need to be extracted from the primary form of the Data-items as a separate process.

We will discuss the first and third cases briefly and then focus mostly on the second case.
Case 1: A reasonably well-composed set of Features is given, based on domain-theoretic considerations.

The problem in this case is
- neither a volume problem due to a large, ungraspable set of possible features,
- nor a representation problem caused by data-items in non-digital form.

Still, a domain-based sanity check is relevant. Terminological consistency and clear feature definitions are of key importance.

The ZOO dataset feature list:
hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, catsize

The features reflected in the Buffalo example of conventional taxonomy:
bilateral symmetry, two body openings, embryonic spinal cord, vertebra, jaws, # of legs, mammary glands, fur, neocortex, # of middle ear bones, give birth to live kids, lack of epi-pubic bones, hoofed, middle to large, chewing, cloven or hoofed
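
As a concrete illustration (not part of the lecture material), one ZOO data-item could be encoded as a feature vector as follows; the values entered for the buffalo are illustrative assumptions, not taken from the actual dataset:

```python
# Hypothetical encoding of one ZOO data-item as a feature vector.
# Feature names follow the list above; the 0/1 values and the leg count
# are illustrative, not copied from the real dataset.
features = ["hair", "feathers", "eggs", "milk", "airborne", "aquatic",
            "predator", "toothed", "backbone", "breathes", "venomous",
            "fins", "legs", "tail", "domestic", "catsize"]

buffalo = dict(zip(features,
                   [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 1, 0, 1]))

vector = [buffalo[f] for f in features]   # the feature vector for this item
print(vector)
```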
Case 3: Data-items are of a non-digital nature and relevant features need to be extracted from the data-items as a separate process.

Features can be derived in a variety of manners, ranging from totally manual, via manual/automatic hybrids, to totally automated.

Every non-digital form of representation demands its own specialized techniques in the automated case.

Each image is a Data-item
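
A minimal sketch of the fully automated end of this spectrum, assuming each image has already been digitized into a pixel array; the particular features computed are illustrative choices, not prescribed by the lecture:

```python
# Sketch: map a non-digital-origin data-item (an image, once scanned into
# a grayscale pixel array) to a small feature vector. The chosen features
# are illustrative assumptions.
import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    """Map a 2-D grayscale image to a three-component feature vector."""
    mean_intensity = image.mean()                           # overall brightness
    intensity_std = image.std()                             # contrast
    edge_strength = np.abs(np.diff(image, axis=1)).mean()   # crude edge measure
    return np.array([mean_intensity, intensity_std, edge_strength])

image = np.random.rand(64, 64)     # stand-in for a scanned data-item
print(extract_features(image))     # its representation in feature space
```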


Case 2: Dimensionality or Feature Reduction

In most realistic cases the number of possibly available features which can be used to characterize data-items is overwhelmingly large. In general we want to reduce the number of features considered.

The ground for removing a specific feature is that it may be either redundant or irrelevant, and can therefore be removed without causing loss of information. The goal is to obtain an adequate set of informative, relevant and non-redundant features that is still able to describe the available data-set.

The underlying motivations for dimensionality reduction are as follows:
• making models easier to interpret by human users
• avoiding the curse of dimensionality
• reducing the risk of overfitting
• shortening the computation times for learning processes.
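
A minimal sketch of the redundancy criterion discussed above, using pandas; the synthetic features and the 0.95 correlation threshold are assumptions made for illustration:

```python
# Sketch: a feature that is (almost) a linear copy of another carries no
# extra information and can be dropped without loss. Data and threshold
# are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"length_m": rng.normal(1.0, 0.2, 100)})
df["length_cm"] = df["length_m"] * 100        # redundant: same information
df["noise"] = rng.normal(0, 1, 100)           # likely irrelevant feature

corr = df.corr().abs()
# Drop any feature that is highly correlated with an earlier one.
to_drop = [c for i, c in enumerate(corr.columns)
           if any(corr.iloc[:i][c] > 0.95)]
print(to_drop)                                # ['length_cm']
reduced = df.drop(columns=to_drop)
```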
The curse of dimensionality
The curse of dimensionality refers to various phenomena that arise when
analyzing and organizing data in high-dimensional spaces (typically with
hundreds or thousands of dimensions) that do not occur in low-dimensional
settings such as the three-dimensional physical space. The expression was coined by Richard E. Bellman when considering problems in dynamic optimization.

The common theme of the problematic phenomena is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse.

This sparsity is problematic for any method that requires statistical significance, as the amount of data needed to support the result often grows exponentially with the dimensionality.

Also, organizing and searching data often relies on detecting areas where
objects form groups with similar properties; in high dimensional data,
however, all objects appear to be sparse and dissimilar in many ways,
which prevents common data organization strategies from being efficient.
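
A small illustrative experiment (not from the lecture) of this effect: for points drawn uniformly in the unit cube, the spread of pairwise distances relative to their mean shrinks as the number of dimensions grows, so "near" and "far" lose their discriminating power. The sample sizes and dimensionalities below are arbitrary choices:

```python
# Sketch: distance concentration in high dimensions.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))            # 500 points uniform in the unit cube
    dists = pdist(X)                    # all pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:4d}  relative spread of pairwise distances = {spread:.2f}")
```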
Over-fitting vs Under-fitting
Over-fitting is the production of a model that corresponds too closely or
exactly to a particular data-set, and may therefore fail to fit additional data or
predict future observations reliably. An over-fitted model is a model that
contains more features than can be justified by the data-set.

Under-fitting occurs when a set of features cannot adequately capture the available data-set. An under-fitted model is a model where some features that would appear in a correctly specified model are missing. Such a model will tend to have poor predictive performance.
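
A minimal sketch (on an assumed synthetic data-set) contrasting an under-fitted, a reasonable, and an over-fitted model by varying the size of a polynomial feature set:

```python
# Sketch: too few features under-fit, too many over-fit. Data, degrees and
# noise level are illustrative assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))[:, None]
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 30)
x_test = np.linspace(0, 1, 100)[:, None]
y_test = np.sin(2 * np.pi * x_test).ravel()

for degree in (1, 4, 15):               # under-fit, reasonable, over-fit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    err = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree:2d}  test MSE = {err:.3f}")
```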
Feature selection vs Feature extraction
Feature selection is the process of selecting a subset of relevant features from the original set. The three main criteria for selecting a feature are:
- informativeness
- relevance
- non-redundancy

Feature extraction is the process of deriving new features, either as simple combinations of original features or as a more complex mapping from the original set to the new set.

In both cases, the learning task is supposed to be more tractable in the resulting feature space than in the original one.
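
A minimal sketch contrasting the two routes in scikit-learn; the particular estimators (mutual-information scoring, PCA) and the iris data are illustrative assumptions, not prescribed by the lecture:

```python
# Sketch: feature selection keeps a subset of the original features;
# feature extraction derives new features as combinations of them.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)          # 4 original features

# Selection: keep the 2 original features most informative about the label.
X_sel = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)

# Extraction: derive 2 new features as linear combinations of the original 4.
X_ext = PCA(n_components=2).fit_transform(X)

print(X.shape, X_sel.shape, X_ext.shape)   # (150, 4) (150, 2) (150, 2)
```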
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Thanks for your attention!

The next lecture, 2.4, will be on the topic:

Scenarios for Concept Learning
