Logistic regression for feature selection

In this article we show how to retrieve a set of good features via logistic regression. Logistic regression is a linear classifier whose parameters are weights, usually expressed as a weight vector w, and lambda, the regularization parameter. After training, w is estimated, and we show that the magnitude of each weight reflects how important the corresponding feature is to classifying the training set. Here we compare the weights with the mutual information (MI) of each feature. The following is our experiment:
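As a reminder, one common form of the L2-regularized (binary) logistic regression objective minimized during training is shown below; the exact formulation in the downloadable package may differ, for example in how the bias term or the scaling of lambda is handled:

    J(w) = \sum_{i=1}^{N} \log\left(1 + \exp(-y_i \, w^\top x_i)\right) + \lambda \lVert w \rVert^2, \qquad y_i \in \{-1, +1\}

The multi-class problem below is handled in a one-vs-all fashion (see the discussion), so one such weight vector is learned per class.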

0) Dataset

We generate our own dataset consisting of 3 classes, each of which contains 100 examples generated by adding noise to the class template.

The templates of the 3 classes, in which each pixel's value is either 0 or 1, are shown below:

Each example of a class is generated by adding noise to the corresponding template. Examples for each class with zero-mean Gaussian noise with standard deviation 0.5 and 2 are shown below:

noise level = 0.5

noise level = 2

Notice that when standard deviation = 2, the SNR is very low and, in fact, the noise level is even higher than the signal level.
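A minimal sketch of this data generation step is given below; the variable names (template1, template2, template3 as binary images of equal size) are placeholders and not necessarily those used in the downloadable code:

    % Sketch of the data generation: 3 classes, 100 examples each,
    % noisy copies of a binary template (not the exact code from the download).
    nPerClass = 100;                          % examples per class
    sigma     = 0.5;                          % noise level (use 2 for the low-SNR setting)
    templates = {template1, template2, template3};
    X = zeros(3 * nPerClass, numel(template1));
    y = zeros(3 * nPerClass, 1);
    row = 0;
    for c = 1:3
        T = templates{c};
        for i = 1:nPerClass
            row = row + 1;
            noisy = T + sigma * randn(size(T));   % add zero-mean Gaussian noise
            X(row, :) = noisy(:)';                % each row = one vectorized image
            y(row)    = c;                        % class label
        end
    end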

1) Train the logistic regression classifier

We train linear-kernel logistic regression using lambda = 0.1. The dataset is separated randomly into a train set and a test set with a 6:4 ratio. Each feature is z-scored before applying the classifier in order to help with the convergence speed. After the logistic regression is trained, the weight vector is used to predict the class label of each example in the test set.
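The training setup can be sketched as below. The split and z-scoring use standard Statistics Toolbox functions; the call to trainLogisticRegression is a hypothetical placeholder for the trainer in the downloadable logistic regression package, assumed to return one weight row per class (one-vs-all):

    % Sketch of the training setup: 6:4 split, per-feature z-scoring, lambda = 0.1.
    lambda = 0.1;
    cv     = cvpartition(y, 'HoldOut', 0.4);         % random 6:4 train/test split
    Xtrain = X(training(cv), :);  ytrain = y(training(cv));
    Xtest  = X(test(cv), :);      ytest  = y(test(cv));

    % z-score each feature using the training-set statistics
    % (assumes no feature is constant on the training set)
    [Xtrain, mu, sd] = zscore(Xtrain);
    Xtest = (Xtest - repmat(mu, size(Xtest, 1), 1)) ./ repmat(sd, size(Xtest, 1), 1);

    % Hypothetical trainer standing in for the downloadable package:
    % returns W of size nClasses-by-nFeatures (one-vs-all weight vectors).
    W = trainLogisticRegression(Xtrain, ytrain, lambda);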

2) Prediction accuracy

Using the weight vectors obtained from the training step, the classification accuracies on the test set are 100% and 80% for noise level = 0.5 and 2, respectively.
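With the same placeholder conventions as above (W holding one weight row per class, bias term omitted for simplicity), the test accuracy can be computed as:

    % One-vs-all prediction: the class with the largest linear score wins.
    scores     = Xtest * W';               % assumes W is nClasses-by-nFeatures
    [~, ypred] = max(scores, [], 2);
    accuracy   = mean(ypred == ytest);     % fraction of correctly labeled test examples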

3) Calculate the mutual information of each feature

In order to compare with the logistic regression weight for each feature, we also calculate the mutual information (MI) between each feature and the class label, independently for each feature.
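One simple way to estimate this per-feature MI is to bin each feature and form a joint histogram with the class label, as sketched below; the actual MI routine in the miscellaneous package may differ:

    % Histogram-based estimate of I(feature; class) for each feature
    % (a rough sketch; assumes no feature is constant).
    nBins   = 10;
    nFeat   = size(X, 2);
    classes = unique(y);
    py      = histc(y, classes) / numel(y);          % class prior p(c)
    mi      = zeros(1, nFeat);
    for f = 1:nFeat
        edges = linspace(min(X(:, f)), max(X(:, f)), nBins + 1);
        [~, bin] = histc(X(:, f), edges);
        bin(bin > nBins) = nBins;                    % put the max value in the last bin
        pxy = zeros(nBins, numel(classes));
        for c = 1:numel(classes)
            counts = histc(bin(y == classes(c)), 1:nBins);
            pxy(:, c) = counts(:);
        end
        pxy = pxy / sum(pxy(:));                     % joint p(bin, c)
        px  = sum(pxy, 2);                           % marginal p(bin)
        for b = 1:nBins
            for c = 1:numel(classes)
                if pxy(b, c) > 0
                    mi(f) = mi(f) + pxy(b, c) * log2(pxy(b, c) / (px(b) * py(c)));
                end
            end
        end
    end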

4) Compare the weights vs the MI values

noise level = 0.5

noise level = 2

5) Simple feature selection by eyeballing a threshold

In order to select a set of good features, we can simply pick a weight threshold and keep only the features whose weights exceed it. We first plot the cumulative distribution of the class-1 weights.

Observe the abrupt change from the plateau at the right end of the cumulative curve. This jump implies that a group of weights stands out from the rest, which also means the corresponding area might contain good features. So we pick threshold = -8.5, and we get the region below from the class-1 weight map.

Note that the region selected with threshold >= -8.5 does not include the small rectangles, even though they also contribute to the classification boundary. Two things are worth noting here: 1) the big top-right rectangle is sufficient to distinguish class 1 from the rest, so the small rectangles receive small weights, which are not picked up from the histogram; 2) eyeballing a threshold might not be sensitive enough to pick up those small weights. Now we decrease the threshold to -10 in the hope that the small rectangles will be retrieved.

When we decrease the threshold further to -11, we fully recover the small rectangles, but noisy features are included as well.
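The thresholding itself is only a couple of lines; a sketch, reusing the placeholder W from the training step and the template size from the data generation step, is:

    % Empirical cumulative distribution of the class-1 weights.
    sortedW = sort(W(1, :));
    figure; plot(sortedW, (1:numel(sortedW)) / numel(sortedW));

    % Threshold the class-1 weight map to obtain a feature (pixel) mask.
    w1map  = reshape(W(1, :), size(template1));   % weights back in image shape
    thresh = -8.5;                                % try -10 or -11 to recover the small rectangles
    mask   = w1map >= thresh;                     % selected features
    figure; imagesc(mask); axis image; colormap(gray);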

Discussion

  1. The weights obtained from logistic regression show how well each feature distinguishes class 1 (or 2, or 3) from the rest. For instance, the bottom-left rectangle does not help distinguish any class from the others, so the weights in that area are pretty low, that is, not important. On the other hand, the top-right rectangle helps distinguish {class 1 vs 2} and {class 1 vs 3}, but not {class 2 vs 3}, so the weights in that area are pretty high. Plus, the weights in the area of the two small rectangles around the middle of the weight map are pretty high too, because that area helps separate class 3 from the rest.
  2. Also note that the weights of an important area are not the same for every class. In particular, the weights for each class indicate how important the features are for that class.
  3. The weights obtained from logistic regression are similar to the MI values; however, the MI values do not show the different levels of feature importance as clearly.
  4. Since we use a one-vs-all strategy to handle multi-class classification, each weight vector represents the class-specific features that distinguish that particular class from the others. So this approach can be used for class-specific feature selection.

Summary

  1. Weights obtained from training logistic regression can be used for feature selection.
  2. This methodology can be used for class-specific feature selection.
  3. The trend of the weights is consistent with the mutual information computed for each feature.

Download MATLAB code

MATLAB code is made available here. The code requires the following packages: 1) the logistic regression package and 2) the miscellaneous package, which can be downloaded from here. Note that this approach can be used with other linear classifiers such as SVM too.