Introduction to Computational Analysis

Notebook Creator: Roy Hyunjin Han

Select informative features

Datasets often contain features that are irrelevant to the problem at hand. Feature selection reduces the number of features in your dataset by keeping only the informative ones. A smaller feature set shrinks the dataset, which shortens both training and prediction time and often improves accuracy by removing noise that the model might otherwise fit.

Use recursive feature elimination

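Recursive feature elimination repeatedly fits an estimator, ranks the features by the estimator's weights, and discards the weakest features until the desired number remain. The scikit-learn package contains one implementation (RFE) that requires you to specify the number of features to select and another implementation (RFECV) that tunes the number of features automatically through cross-validation.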
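
As a minimal sketch of the fixed-size variant, the selector can be exercised on a small synthetic dataset. This assumes Python 3 and a recent scikit-learn release in which the class is available as sklearn.feature_selection.RFE with an n_features_to_select argument. The dataset sizes and random_state are illustrative choices and are not taken from the notebook cells below.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Small synthetic dataset (sizes are assumptions chosen to keep the sketch fast)
X_small, y_small = make_classification(n_samples=200, n_features=10,
    n_informative=3, n_redundant=2, n_repeated=0, n_classes=2,
    n_clusters_per_class=1, random_state=0)

# Ask for exactly three features; with step=1 one feature is dropped per round
selector = RFE(estimator=SVC(kernel='linear'), n_features_to_select=3, step=1)
selector.fit(X_small, y_small)

print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask of kept features
print(selector.ranking_)      # 1 marks a kept feature

# Restrict the dataset to the kept features
X_reduced = selector.transform(X_small)
print(X_reduced.shape)
```

The cross-validated variant used in the rest of this notebook removes the need to guess the target number of features in advance, at the cost of refitting the estimator once per fold.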

The following example is based on

In [1]:
# Synthesize a classification dataset with 25 total features,
# 3 informative features, 2 redundant features
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
    n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1)
In [2]:
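# Select features using recursive feature elimination with cross-validation;
# the imports below follow the early scikit-learn API used throughout this
# notebook (later releases moved StratifiedKFold to sklearn.model_selection
# and replaced loss_func with a scoring argument)
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import zero_one
featureSelector = RFECV(estimator=SVC(kernel='linear'), step=1,
    cv=StratifiedKFold(y, 2), loss_func=zero_one).fit(X, y)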
In [3]:
# Look at fitted parameters of featureSelector
[x for x in dir(featureSelector) if not x.startswith('_') and x.endswith('_')]
In [4]:
# Check the number of features
In [5]:
# Look at a specific sample of features
In [6]:
# Look at how the features have been ranked
featureSelector.ranking_
In [7]:
# Get a boolean index array marking which features are informative
featureSelector.support_
In [8]:
# Count the number of features that have been ranked as informative
print sum(featureSelector.ranking_ == 1)
print sum(featureSelector.support_)
print featureSelector.n_features_
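
The three counts above agree because a feature is kept exactly when its rank is 1. The sketch below, written for Python 3 with NumPy, uses a made-up ranking array rather than the fitted selector to show how the ranking, the boolean mask and the column selection relate. The array values and the names ranking, support and data are hypothetical.

```python
import numpy as np

# Hypothetical ranking for six features (1 = kept, larger = eliminated earlier)
ranking = np.array([1, 3, 1, 2, 4, 1])

# This boolean mask plays the role of support_
support = ranking == 1
n_selected = int(support.sum())

# A toy data matrix with two samples and six features
data = np.arange(12).reshape(2, 6)

# Boolean indexing along the column axis keeps only the selected features
data_selected = data[:, support]

print(support)
print(n_selected)
print(data_selected)
```

The same boolean indexing applies to the real dataset once the selector has been fitted.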
In [9]:
# Look at how the performance of the classifier changes as
# features are included in the dataset in order of informative rank;
# note that the cross-validation score is the number of
# misclassifications because we chose the zero_one loss function
print featureSelector.cv_scores_
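
For reference, a count of misclassifications is simply the number of positions where the predicted label differs from the true label. The helper below is a hypothetical stand-in written for illustration in plain Python. It is not scikit-learn's own function, and the label lists are invented.

```python
def count_misclassifications(y_true, y_pred):
    """Return how many predictions differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    return sum(1 for t, p in zip(y_true, y_pred) if t != p)

# Invented labels for six samples
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 0, 0]

errors = count_misclassifications(y_true, y_pred)
error_rate = errors / float(len(y_true))

print(errors)       # lower is better
print(error_rate)   # the same quantity expressed as a fraction
```

Because this score counts errors, a lower value indicates a better feature subset, which is the opposite of an accuracy score.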
In [10]:
# Plot the above information;
# note that after including the third feature,
# the performance of the classifier does not improve
import pylab as pl
pl.title('Cross-validation scores after recursive feature elimination')
pl.xlabel('Number of features selected')
pl.ylabel('Number of misclassifications')
pl.plot(xrange(1, len(featureSelector.cv_scores_) + 1),