Datasets often contain features that are irrelevant to the problem at hand. Feature selection is the process of reducing the number of features in a dataset to those that matter. With fewer features, a model needs less data to train on, training and prediction are faster, and accuracy often improves because there is less noise to overfit.
The scikit-learn package contains two implementations of recursive feature elimination: one (RFE) requires you to specify the number of features to select, and the other (RFECV) tunes the number of features automatically through cross-validation.
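As a minimal sketch of the first variant, the snippet below uses scikit-learn's `RFE` class with a fixed number of features to keep. The choice of `n_features_to_select=3` and the names `selector` and `X_reduced` are assumptions made for illustration; the dataset parameters mirror the synthetic dataset used in the rest of this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Same synthetic dataset as in the example that follows
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

# Keep exactly 3 features (an assumed value), dropping one feature per step
selector = RFE(estimator=SVC(kernel='linear'), n_features_to_select=3, step=1)
selector.fit(X, y)

# Reduce the dataset to the selected columns
X_reduced = selector.transform(X)
print(X_reduced.shape)
```

Here the number of features is a guess that you have to supply yourself, which is the main drawback compared with the cross-validated variant.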
The following example is based on http://scikit-learn.org/dev/auto_examples/plot_rfe_with_cross_validation.html
# Synthesize a classification dataset with 25 total features,
# 3 informative features, 2 redundant features
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
    n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1,
    random_state=0)
# Select features
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import zero_one
featureSelector = RFECV(estimator=SVC(kernel='linear'), step=1,
    cv=StratifiedKFold(y, 2), loss_func=zero_one)
featureSelector.fit(X, y)
# Look at fitted parameters of featureSelector
[x for x in dir(featureSelector) if not x.startswith('_') and x.endswith('_')]
# Check the number of features (columns); len(X) would give the number of samples
X.shape[1]
# Look at the feature values of a specific sample
X[0]
# Look at how the features have been ranked
featureSelector.ranking_
# Get a boolean index array marking which features are informative
featureSelector.support_
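A boolean array such as `support_` can be used directly to index the columns of a NumPy array. The sketch below illustrates this on a tiny hand-made matrix; `X_demo` and `support_demo` are hypothetical stand-ins for the real dataset and the fitted mask, so that the snippet runs on its own.

```python
import numpy as np

# Hypothetical stand-ins: 3 samples x 4 features, and a mask keeping features 0 and 2
X_demo = np.arange(12).reshape(3, 4)
support_demo = np.array([True, False, True, False])

# Keep only the columns marked True in the mask
X_selected = X_demo[:, support_demo]

# Recover the integer indices of the selected features
selected_indices = np.flatnonzero(support_demo)

print(X_selected)
print(selected_indices)
```

With a fitted selector, the same reduction should also be available through its `transform` method, which applies the mask for you.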
# Count the number of features that have been ranked as informative;
# all three expressions give the same count
print(sum(featureSelector.ranking_ == 1))
print(sum(featureSelector.support_))
print(featureSelector.n_features_)
# Look at how the performance of the classifier changes as
# features are included in the dataset in order of informative rank;
# note that the cross-validation score is the number of
# misclassifications because we chose the zero_one loss function
print(featureSelector.cv_scores_)
# Plot the above information;
# note that after including the third feature,
# the performance of the classifier does not improve
import pylab as pl
pl.figure()
pl.title('Cross-validation scores after recursive feature elimination')
pl.xlabel('Number of features selected')
pl.ylabel('Number of misclassifications')
pl.plot(range(1, len(featureSelector.cv_scores_) + 1), featureSelector.cv_scores_)
pl.show()