WRAPPER METHOD – FORWARD AND
BACKWARD SELECTION
WHY DIMENSIONALITY REDUCTION?
◼ It is so easy and convenient to collect data
◼ Data is not collected only for data mining
◼ Data accumulates at an unprecedented speed
◼ Data pre-processing is an important part of effective
  machine learning and data mining
◼ Dimensionality reduction is an effective approach to
  downsizing data
WHY DIMENSIONALITY REDUCTION?
◼ Most machine learning and data mining techniques
  may not be effective for high-dimensional data
  ◼ Curse of Dimensionality
  ◼ The intrinsic dimension may be small.
WHY DIMENSIONALITY REDUCTION?
◼ Visualization: projection of high-dimensional data
  onto 2D or 3D.
◼ Data compression: efficient storage and retrieval.
◼ Noise removal: positive effect on query accuracy.
MOTIVATION
◼ Especially when dealing with a large number of variables,
  there is a need for dimensionality reduction!
◼ Dimensionality reduction can significantly improve a learning
  algorithm’s performance!
MAJOR TECHNIQUES OF
DIMENSIONALITY REDUCTION
◼ Feature Selection
◼ Feature Extraction (Reduction)
FEATURE EXTRACTION VS
SELECTION
◼ Feature extraction
  ◼ All original features are used, but they are transformed
  ◼ The transformed features are linear/nonlinear
    combinations of the original features
◼ Feature selection
  ◼ Only a subset of the original features are selected
FEATURE SELECTION
◼ Feature selection:
  Problem of selecting some subset from a set of input
  features upon which it should focus attention, while
  ignoring the rest
◼ Humans/animals do that constantly!
FEATURE SELECTION (DEF.)
◼ Given a set of N features, the role of feature selection
  is to select a subset of size M (M < N) that leads to the
  smallest classification/clustering error.
 WHY FEATURE SELECTION? WHY NOT
 FEATURE EXTRACTION?
❖ You may want to extract meaningful rules from your
  classifier
  ◼ When you transform or project, the measurement units (length,
     weight, etc.) of your features are lost
❖ Features may not be numeric
  ◼ A typical situation in the machine learning domain
MOTIVATIONAL EXAMPLE FROM BIOLOGY
Monkeys performing classification task
   • Eye separation, Eye height, Mouth height, Nose length
MOTIVATIONAL EXAMPLE FROM BIOLOGY
Monkeys performing classification task
                                  Diagnostic features:
                                  - Eye separation
                                  - Eye height
                                  Non-Diagnostic features:
                                  - Mouth height
                                  - Nose length
FEATURE SELECTION METHODS
❖ Feature selection is an
  optimization problem.
  ◼ Search the space of possible
    feature subsets.
  ◼ Pick the subset that is optimal
    or near-optimal with respect
    to an objective function.
     WRAPPER, FILTER AND EMBEDDED METHODS
◼ The value of a feature is related to a model-construction method. Three
     classes of methods:
1.     Wrapper methods are built “around” a specific predictive model (they score
       feature subsets by its error rate)
2.     Filter methods use a proxy measure instead of the error rate to score a
       feature subset
3.     Embedded methods perform feature selection as an integral part of the
       model construction process.
TOP-DOWN AND BOTTOM-UP METHODS
◼ In a bottom-up method one gradually adds the ranked features in the order of
   their individual discrimination power and stops when the error rate stops
   decreasing
◼ In a top-down truncation method one starts with the complete set of
   features and progressively eliminates features while searching for the optimal
   performance point
FEATURE SELECTION METHODS
❖ Feature selection is an optimization problem.
  ◼ Search the space of possible feature subsets.
  ◼ Pick the subset that is optimal or near-optimal with respect to a
     certain criterion.
   Search strategies:
  ◼ Optimum
  ◼ Heuristic
  ◼ Randomized
   Evaluation strategies:
  ◼ Filter methods
  ◼ Wrapper methods
EVALUATION STRATEGIES
❖ Filter Methods
   ◼ Evaluation is independent of the
     classification algorithm.
   ◼ The objective function evaluates
     feature subsets by their information
     content, typically interclass distance,
     statistical dependence or
     information-theoretic measures.
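The information-content idea can be made concrete with a small per-feature score. The Fisher-like ratio and the toy data below are illustrative assumptions, not part of the slides: a feature scores high when its class means lie far apart relative to the within-class spread.

```python
# Filter-style scoring: rank each feature by a Fisher-like ratio of
# between-class separation to within-class spread (two-class case).

def fisher_score(x, y):
    a = [v for v, c in zip(x, y) if c == 0]
    b = [v for v, c in zip(x, y) if c == 1]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / len(a)
    vb = sum((v - mb) ** 2 for v in b) / len(b)
    return (ma - mb) ** 2 / (va + vb + 1e-12)

# Hypothetical data: one informative feature, one noisy feature.
x_informative = [0.0, 0.1, 1.0, 1.1]
x_noise = [5.0, 1.0, 4.9, 1.2]
y = [0, 0, 1, 1]
print(fisher_score(x_informative, y) > fisher_score(x_noise, y))  # -> True
```

No classifier is trained at any point, which is exactly what makes filter scores fast and classifier-agnostic.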
EVALUATION STRATEGIES
❖ Wrapper Methods
  ◼ Evaluation uses criteria related to the
     classification algorithm.
  ◼ The objective function is a pattern
     classifier, which evaluates feature
     subsets by their predictive accuracy
     (recognition rate on test data) by
     statistical resampling or
     cross-validation.
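As a concrete (hypothetical) wrapper objective, the sketch below scores a feature subset by the leave-one-out accuracy of a 1-nearest-neighbour classifier restricted to those features; the dataset and feature roles are made up for illustration.

```python
# Wrapper-style objective: J(subset) = leave-one-out accuracy of a 1-NN
# classifier that measures distance only along the chosen features.

def loo_accuracy_1nn(X, y, subset):
    correct = 0
    for i in range(len(X)):
        # Nearest other point, using only the dimensions in `subset`.
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: sum((X[i][d] - X[k][d]) ** 2 for d in subset))
        correct += y[j] == y[i]
    return correct / len(X)

# Hypothetical data: feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.9], [1.1, 1.2]]
y = [0, 0, 1, 1]
print(loo_accuracy_1nn(X, y, [0]), loo_accuracy_1nn(X, y, [1]))  # -> 1.0 0.0
```

Because a classifier is trained and resampled for every candidate subset, this is far slower than a filter score, but it captures exactly the classifier/feature interactions the slides describe.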
FILTER VS WRAPPER
APPROACHES
Wrapper Approach
❖ Advantages
   ◼ Accuracy: wrappers generally have better recognition rates than filters, since
      they are tuned to the specific interactions between the classifier and the
      features.
   ◼ Ability to generalize: wrappers have a mechanism to avoid overfitting, since
      they typically use cross-validation measures of predictive accuracy.
❖ Disadvantages
   ◼ Slow execution
FILTER VS WRAPPER
APPROACHES (CONT’D)
Filter Approach
◼ Advantages
  ◼ Fast execution: Filters generally involve a non-iterative computation on the
     dataset, which can execute much faster than a classifier training session
  ◼ Generality: Since filters evaluate the intrinsic properties of the data, rather
     than their interactions with a particular classifier, their results exhibit more
     generality; the solution will be “good” for a large family of classifiers
◼ Disadvantages
  ◼ Tendency to select large subsets: Filter objective functions are generally
     monotonic, so adding a feature never lowers the score and the full feature
     set tends to look optimal
SEARCH STRATEGIES
[Subset lattice for four features x1, x2, x3, x4: all 16 subsets from
0,0,0,0 (the empty set) to 1,1,1,1 (the full set);
1 - xi is selected; 0 - xi is not selected]
NAÏVE SEARCH
❖ Sort the given N features in order of their probability of
   correct recognition.
❖ Select the top M features from this sorted list.
❖ Disadvantages
   ◼ Feature correlation is not considered.
   ◼ Best pair of features may not even contain the best individual
     feature.
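The naive procedure amounts to sorting features by an individual score and truncating. The per-feature scores below are hypothetical stand-ins for "probability of correct recognition":

```python
# Naive (individual-ranking) selection: score each feature on its own,
# then keep the M individually best ones. Ignores feature correlation.

def naive_select(scores, M):
    """Return indices of the M individually best features."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:M])

# Hypothetical individual recognition rates for four features.
scores = [0.70, 0.85, 0.60, 0.80]
print(naive_select(scores, 2))  # -> [1, 3]
```

Note that two highly correlated features can both rank near the top and be selected together, even though the second one adds almost nothing: this is the disadvantage the slide points out.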
    SEQUENTIAL FORWARD SELECTION (SFS)
    (HEURISTIC SEARCH)
❖   First, the best single feature is selected (i.e., using some
    criterion function).
❖   Then, pairs of features are formed using one of the
    remaining features and this best feature, and the best
    pair is selected.
❖   Next, triplets of features are formed using one of the
    remaining features and these two best features, and the
    best triplet is selected.
❖   This procedure continues until a predefined number of
    features are selected.
    Note: SFS performs best when the optimal subset is small.
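The procedure above can be sketched as a greedy loop. The criterion J is a hypothetical stand-in for a real filter or wrapper objective; here it simply rewards two "useful" features and mildly penalizes subset size.

```python
# Sequential forward selection: grow the subset one best feature at a time.

def J(subset):
    # Hypothetical criterion: features 1 and 3 carry the signal.
    useful = {1: 0.5, 3: 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.05 * len(subset)

def sfs(n_features, n_select, J):
    selected, remaining = [], set(range(n_features))
    while len(selected) < n_select:
        # Form all one-feature extensions of the current subset; keep the best.
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

print(sfs(4, 2, J))  # -> [1, 3]
```

Each round evaluates at most N candidate subsets, so SFS checks O(N·M) subsets instead of all 2^N, at the price of never revisiting an early choice.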
SEQUENTIAL FORWARD SELECTION (SFS)
(HEURISTIC SEARCH)
[Example with four features:
 • {x2} is selected, since J(x2) >= J(xi), i = 1, 3, 4
 • {x2, x3} is selected, since J(x2, x3) >= J(x2, xi), i = 1, 4
 • {x2, x3, x1} is selected, since J(x2, x3, x1) >= J(x2, x3, x4)
 • The search ends at the full set {x1, x2, x3, x4}]
ILLUSTRATION (SFS)
[Subset lattice for four features x1, x2, x3, x4 (1 - xi is selected;
0 - xi is not selected): SFS climbs the lattice from the empty set,
selecting {x3}, then {x2, x3}, then {x1, x2, x3}]
    SEQUENTIAL BACKWARD SELECTION
    (SBS) (HEURISTIC SEARCH)
❖   First, the criterion function is computed for all n features.
❖   Then, each feature is deleted one at a time, the criterion
    function is computed for all subsets with n-1 features, and
    the worst feature is discarded.
❖   Next, each feature among the remaining n-1 is deleted
    one at a time, and the worst feature is discarded to form a
    subset with n-2 features.
❖   This procedure continues until a predefined number of
    features are left.
    Note: SBS performs best when the optimal subset is large.
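The mirror image of SFS can be sketched the same way, with the same hypothetical criterion J (features 1 and 3 carry the signal; the criterion is not from the slides):

```python
# Sequential backward selection: start from all features and repeatedly
# discard the "worst" one, i.e. the feature whose removal keeps J highest.

def J(subset):
    # Hypothetical criterion: features 1 and 3 carry the signal.
    useful = {1: 0.5, 3: 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.05 * len(subset)

def sbs(n_features, n_keep, J):
    selected = list(range(n_features))
    while len(selected) > n_keep:
        # Delete each feature in turn; discard the one whose removal hurts least.
        worst = max(selected, key=lambda f: J([g for g in selected if g != f]))
        selected.remove(worst)
    return selected

print(sbs(4, 2, J))  # -> [1, 3]
```

Because every evaluation involves large subsets, each SBS step is typically more expensive than the corresponding SFS step, which is why SBS pays off mainly when the optimal subset is large.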
SEQUENTIAL BACKWARD SELECTION
(SBS) (HEURISTIC SEARCH)
[Example with four features, starting from {x1, x2, x3, x4}:
 • Among {x2, x3, x4}, {x1, x3, x4} and {x1, x2, x3}, J(x1, x2, x3) is
   maximum, so x4 is the worst feature and is discarded
 • Among {x2, x3}, {x1, x3} and {x1, x2}, J(x2, x3) is maximum, so x1 is
   the worst feature and is discarded
 • Among {x2} and {x3}, J(x2) is maximum, so x3 is the worst feature and
   is discarded]
ILLUSTRATION (SBS)
[Subset lattice for four features x1, x2, x3, x4 (1 - xi is selected;
0 - xi is not selected): SBS descends the lattice from the full set,
keeping {x1, x2, x3}, then {x2, x3}, then {x2}]
 BIDIRECTIONAL SEARCH (BDS)
 (HEURISTIC SEARCH)
◼ BDS applies SFS and SBS simultaneously:
  ◼ SFS is performed from the empty set
  ◼ SBS is performed from the full set
◼ To guarantee that SFS and SBS converge
  to the same solution
  ◼ Features already selected by SFS are not
     removed by SBS
  ◼ Features already removed by SBS are not
     selected by SFS
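The two interleaved searches can be sketched as follows, with the same hypothetical criterion J as in the earlier sketches; the constraints above (each search respects the other's committed decisions) are what force the two sets to meet.

```python
# Bidirectional search: SFS grows a set from below while SBS shrinks a set
# from above; SFS only adds features SBS still holds, and SBS never removes
# a feature SFS has already selected.

def J(subset):
    # Hypothetical criterion: features 1 and 3 carry the signal.
    useful = {1: 0.5, 3: 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.05 * len(subset)

def bds(n_features, J):
    sfs_set, sbs_set = [], list(range(n_features))
    while len(sfs_set) < len(sbs_set):
        # Forward step: add the best feature that SBS has not yet removed.
        addable = [f for f in sbs_set if f not in sfs_set]
        sfs_set.append(max(addable, key=lambda f: J(sfs_set + [f])))
        if len(sfs_set) == len(sbs_set):
            break
        # Backward step: remove the worst feature that SFS has not selected.
        removable = [f for f in sbs_set if f not in sfs_set]
        worst = max(removable, key=lambda f: J([g for g in sbs_set if g != f]))
        sbs_set.remove(worst)
    return sorted(sfs_set)

print(bds(4, J))  # -> [1, 3]
```

Since the SFS set is always a subset of the SBS set, the loop necessarily terminates with both searches agreeing on one subset.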
      BIDIRECTIONAL SEARCH (BDS)
[Example with four features, SFS growing from the empty set while SBS
shrinks the full set {x1, x2, x3, x4}:
 • SFS step: among {x1}, {x2}, {x3}, {x4}, J(x2) is maximum, so x2 is
   selected
 • SBS step: among {x2, x3, x4}, {x2, x1, x4} and {x1, x2, x3} (x2 cannot
   be removed), J(x2, x1, x4) is maximum, so x3 is removed
 • SFS step: among {x2, x1} and {x2, x4} (x3 cannot be added), J(x2, x4)
   is maximum, so x4 is selected
 • SBS step: only x1 remains removable, so both searches converge
   to {x2, x4}]
ILLUSTRATION (BDS)
[Subset lattice for four features x1, x2, x3, x4 (1 - xi is selected;
0 - xi is not selected): SFS selects {x2}, then {x2, x4}, from below,
while SBS keeps {x2, x1, x4} from above; the searches meet at {x2, x4}]
 FEATURE SELECTION METHODS
Filter:
  All Features → Filter → Selected Features → Supervised Learning
  Algorithm → Classifier
Wrapper:
  All Features → Search → Feature Subset → Supervised Learning
  Algorithm → Feature Evaluation Criterion → Criterion Value (fed back
  to the Search) → Selected Features → Classifier
INTRODUCTION TO MACHINE LEARNING AND DATA MINING,
CARLA BRODLEY
    Search Method: sequential forward search
    Step 1 (singletons): A, B, C, D → B is best
    Step 2 (pairs containing B): A,B; B,C; B,D → B,C is best
    Step 3 (triplets containing B,C): A,B,C; B,C,D
      Search Method: sequential backward elimination
      Step 1 (remove one from A,B,C,D): ABC, ABD, ACD, BCD → ABD is best
      Step 2 (remove one from ABD): AB, AD, BD → AD is best
      Step 3 (remove one from AD): A, D
 “PLUS-L, MINUS-R” SELECTION
 (LRS) (HEURISTIC SEARCH)
❖ A generalization of SFS and SBS
   ◼ If L > R, LRS starts from the empty set and:
     ◼ Repeatedly adds L features
     ◼ Then removes R features
   ◼ If L < R, LRS starts from the full set and:
     ◼ Repeatedly removes R features
     ◼ Then adds L features
❖ LRS attempts to compensate for the weaknesses of
   SFS and SBS with some backtracking capabilities.
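A sketch for the L > R case (start from the empty set, alternate L greedy forward steps with R greedy backward steps), again with a hypothetical criterion J not taken from the slides:

```python
# "Plus-L, minus-R" selection for L > R: the backward steps give the search
# a limited ability to undo earlier greedy additions.

def J(subset):
    # Hypothetical criterion: features 1 and 3 carry the signal.
    useful = {1: 0.5, 3: 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.05 * len(subset)

def lrs(n_features, n_select, L, R, J):
    assert L > R, "for L < R, start from the full set instead"
    selected = []
    while len(selected) < n_select:
        for _ in range(L):  # plus-L: L greedy forward steps
            remaining = [f for f in range(n_features) if f not in selected]
            if not remaining:
                break
            selected.append(max(remaining, key=lambda f: J(selected + [f])))
        if len(selected) >= n_select:
            break
        for _ in range(R):  # minus-R: R greedy backward steps
            worst = max(selected, key=lambda f: J([g for g in selected if g != f]))
            selected.remove(worst)
    return selected[:n_select]

print(lrs(4, 2, L=2, R=1, J=J))  # -> [1, 3]
```

Because L > R, each full round grows the subset by L - R features, so the loop always terminates; the symmetric L < R variant would shrink the full set instead.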
“PLUS-L, MINUS-R” SELECTION
(LRS) (HEURISTIC SEARCH)