
Study notes for feature engineering

1. Definitions
Feature selection is the process of selecting a subset of relevant features for use in model construction.
Feature extraction creates new features from functions of the original features. It is domain-specific.
A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.
2. Benefits of Feature Selection
The central assumption is that the data contains many redundant or irrelevant features. Redundant features are those which provide no more information than the currently selected features (they are highly correlated with them). Irrelevant features provide no useful information in any context. Noisy features have a signal-to-noise ratio too low to be useful for discriminating the outcome.

Why do feature selection? Consider two cases:

Case 1: We are interested in the features themselves; we want to know which are relevant, and we don't necessarily want to do prediction. If we fit a model, it should be interpretable, e.g., which features (reasons) are associated with lung cancer? In this case feature selection is needed.
Case 2: We are interested in prediction; the features are not interesting in themselves, and we just want to build a good classifier (or some other kind of predictor). If the only concern is accuracy and the whole data set can be processed, feature selection is not needed (as long as there is regularization). If computational complexity is critical (e.g., an embedded device, web-scale data, or an expensive learning algorithm), consider using feature selection.
What are the benefits of feature selection?

Alleviate the curse of dimensionality. As the number of features increases, the volume of the feature space grows exponentially, and the data becomes increasingly sparse in that space.
Improved model interpretability. One motivation is to find a simple, "parsimonious" model. Occam's razor: the simplest explanation that accounts for the data is best.
Shorter training times. Training with all features may be too expensive. In practice, it is not unusual to end up with more than 10^6 features.
Enhanced generalization by reducing overfitting when there are many features (and few examples). The presence of irrelevant features hurts generalization. Two morals:
Moral 1: In the presence of many irrelevant features, we might just fit noise.
Moral 2: Training error can lead us astray.
3. Feature Subset - Search and Evaluation
Filter methods use a proxy measure instead of the error rate to score a feature subset. Features are selected before the machine learning algorithm is run. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut-off point in the ranking is chosen via cross-validation. A few common filter metrics are listed below:

Regression: Mutual information, Correlation
Classification with categorical data: Chi-squared, information gain, document frequency
Inter-class distance
Error probability
Probabilistic distance
Entropy
Minimum-redundancy-maximum-relevance (mRMR)
The advantage of these methods is that they are fast and simple to apply. The disadvantage is that they do not take interactions between features into account: for example, apparently useless features can become useful when grouped with others.
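As an illustration, here is a minimal filter-style sketch using scikit-learn's SelectKBest with mutual information; the toy dataset and the choice of k = 5 are assumptions made for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy data: 20 features, only 5 of which are informative (an assumption).
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Score each feature independently with mutual information and keep the top 5.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("kept feature indices:", np.flatnonzero(selector.get_support()))

Note that each feature is scored on its own, which is exactly why filters miss interactions between features.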
Wrapper methods use a predictive model to score feature subsets, treating the machine learning algorithm as a black box for selecting the best subset of features. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. Because wrapper methods train a new model for each subset, they are very computationally intensive, but they usually provide the best-performing feature set for that particular type of model. In short, wrappers can be computationally expensive and carry a risk of overfitting to the model.
Exhaustive search over subsets: too expensive to be used in practice.
Greedy search is common and effective, including random selection, forward selection and backward elimination. Backward elimination tends to find better feature subsets, but it is often too expensive to fit the large subsets at the beginning of the search. However, greedy methods ignore relationships between features. Pseudocode for backward elimination could be (see the Python sketch after this list):
Initialize S = {1, 2, ..., n}
do
    remove from S the feature whose removal improves cross-validation performance most
while performance can be improved
Best-first search
Stochastic search
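To make the backward elimination pseudocode above concrete, here is a rough Python sketch; the logistic regression model, the toy data, and 5-fold cross-validation are assumptions chosen for illustration, not part of the algorithm itself.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data (an assumption for the example).
X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=5, random_state=0)

def cv_score(features):
    """Cross-validated accuracy using only the given feature indices."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, features], y, cv=5).mean()

# Initialize S = {0, ..., n-1}; repeatedly remove the feature whose removal
# improves the cross-validation score most, until no removal helps.
S = list(range(X.shape[1]))
best = cv_score(S)
while len(S) > 1:
    scores = {f: cv_score([i for i in S if i != f]) for f in S}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best:
        break  # no single removal improves performance; stop
    best = scores[f_best]
    S.remove(f_best)

print("selected features:", sorted(S), "cv accuracy:", round(best, 3))

Each pass retrains and cross-validates one model per remaining feature, which is why wrappers are so much more expensive than filters.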
Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. Feature selection occurs naturally as part of the machine learning algorithm. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Another popular approach is the Recursive Feature Elimination algorithm, commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights. These approaches tend to fall between filters and wrappers in terms of computational complexity.
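As a rough illustration of Recursive Feature Elimination with an SVM, the following scikit-learn sketch assumes a toy dataset and a target of 5 features.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Toy data (an assumption for the example).
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# RFE repeatedly fits the SVM and drops the features with the smallest weights.
rfe = RFE(estimator=LinearSVC(dual=False, max_iter=5000),
          n_features_to_select=5)
rfe.fit(X, y)

print("kept feature mask:", rfe.support_)
print("feature ranking (1 = kept):", rfe.ranking_)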
L1-regularization: Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero, so we can keep the features with non-zero coefficients. In particular, for regression we can use sparse regression (LASSO); for classification we can use L1-penalized logistic regression or LinearSVC (a sketch follows this list).
Decision tree
Regularized trees, e.g., the regularized random forest
Many other machine learning methods that apply a pruning step
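The sketch below illustrates embedded selection via L1 regularization (LASSO for regression); the synthetic data and the regularization strength alpha are assumptions for the example.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression data (an assumption for the example).
X, y = make_regression(n_samples=300, n_features=20,
                       n_informative=5, noise=1.0, random_state=0)

# The L1 penalty drives many coefficients exactly to zero; the features with
# non-zero coefficients form the selected subset.
lasso = Lasso(alpha=0.5).fit(X, y)
print("selected features:", np.flatnonzero(lasso.coef_))

In practice alpha controls how aggressively features are dropped and is usually chosen by cross-validation.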
