Feature engineering
Feature engineering is the process of creating new features from the existing ones, in order to capture relationships with the target column that the original set of features couldn't express on its own. This process is very important to improve the performance of machine learning algorithms. Although feature engineering works best when the data scientist applies use-case specific transformations, there are ways to do it in an automated manner, without prior domain knowledge. One of the problems of creating new features without human expert intervention is that many of the newly created features can be useless, i.e. they do not help the algorithm make better predictions. Even worse, useless features can hurt the model's performance. To avoid this, we perform feature selection, a process in which we select only the relevant features in the dataset. See the Feature engineering example.
Note
All of atom's feature engineering methods automatically adopt the relevant transformer attributes (n_jobs, verbose, logger, random_state) from atom. A different choice can be added as a parameter to the method call, e.g. atom.feature_selection("PCA", n_features=10, random_state=2).
Note
Like the add method, the feature engineering methods accept the columns parameter to only transform a subset of the dataset's features, e.g. atom.feature_selection("PCA", n_features=10, columns=slice(5, 15)).
Extracting datetime features
Features that contain dates or timestamps can not be directly ingested
by models since they are not strictly numerical. Encoding them as
categorical features is not an option since the encoding does not
capture the relationship between the different moments in time. The
FeatureExtractor
class creates new features extracting datetime elements (e.g. day,
month, year, hour...) from the columns. It can be accessed from atom
through the feature_extraction
method. The new features are named equally to the column from which
they are extracted, followed by an underscore and the datetime element
they create, e.g. Feature 1_day
for the day element of Feature 1
.
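A minimal sketch of calling it from atom is shown below. The toy dataset is purely illustrative, and the features parameter name for the elements to extract is an assumption.

```python
import pandas as pd
from atom import ATOMClassifier

# Toy dataset with a datetime column (purely illustrative)
X = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=100, freq="D"),
    "value": range(100),
})
y = [i % 2 for i in range(100)]

atom = ATOMClassifier(X, y, verbose=2)

# Extract the day, month and weekday elements from every datetime column
# (the parameter name "features" is an assumption for illustration)
atom.feature_extraction(features=["day", "month", "weekday"])
```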
Note that many time features have a cyclic pattern, e.g. after Sunday comes Monday. This means that if we encoded the days of the week from 0 to 6, we would lose that relation. A common method to encode cyclical features is to transform the data into two dimensions using a sine and cosine transformation.
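For example, a minimal sketch of this encoding for the day of the week (the period of 7 is the number of values in the cycle; the exact formula FeatureExtractor uses internally may differ):

```python
import numpy as np

# Day of the week encoded ordinally: 0 = Monday, ..., 6 = Sunday
weekday = np.array([0, 1, 2, 3, 4, 5, 6])
period = 7  # number of distinct values in the cycle

# Map the cyclic feature onto a circle so that 6 (Sunday) and 0 (Monday)
# end up close together instead of far apart
weekday_sin = np.sin(2 * np.pi * weekday / period)
weekday_cos = np.cos(2 * np.pi * weekday / period)
```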
The resulting features have their names followed by sin or cos, e.g. Feature 1_day_sin and Feature 1_day_cos. The datetime elements that can be encoded in a cyclic fashion are: microsecond, second, minute, hour, weekday, day, day_of_year, month and quarter. Note that decision tree-based algorithms build their split rules according to one feature at a time. This means that they will fail to correctly process cyclic features, since the sin/cos values need to be considered as one single coordinate system.
Use the fmt parameter to specify your feature's format in case the column is categorical. The FeatureExtractor class will convert the column to the datetime dtype before extracting the specified features. Click here for an overview of the available formats.
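For example, a sketch for a categorical column holding dates as strings (the column name and format are hypothetical, and it is assumed that fmt is passed through the feature_extraction method just like the class parameter):

```python
# "date" is a hypothetical string column formatted like "31/12/2022"
atom.feature_extraction(fmt="%d/%m/%Y", columns="date")
```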
Generating new features
The FeatureGenerator class creates new non-linear features based on the original feature set. It can be accessed from atom through the feature_generation method. You can choose between two strategies: Deep Feature Synthesis and Genetic Feature Generation.
Deep Feature Synthesis
Deep feature synthesis (DFS) applies the selected operators to the features in the dataset. For example, if the operator is "log", it will create the new feature LOG(old_feature), and if the operator is "mul", it will create the new feature old_feature_1 x old_feature_2. The operators can be chosen through the operators parameter. Available options are:
- add: Sum two features together.
- sub: Subtract two features from each other.
- mul: Multiply two features with each other.
- div: Divide two features by each other.
- sqrt: Take the square root of a feature.
- log: Take the logarithm of a feature.
- sin: Calculate the sine of a feature.
- cos: Calculate the cosine of a feature.
- tan: Calculate the tangent of a feature.
ATOM's implementation of DFS uses the featuretools package.
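As an illustration, a call from atom might look like this (a sketch continuing from an existing atom instance; the strategy string "dfs" follows the abbreviation used above but is an assumption):

```python
# Combine existing features with the add, mul and log operators
atom.feature_generation("dfs", operators=["add", "mul", "log"])
```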
Tip
DFS can create many new features and not all of them will be useful. Use FeatureSelector to reduce the number of features!
Warning
Using the div, log or sqrt operators can return new features with inf or NaN values. Check the warnings that may pop up or use atom's missing property.
Warning
When using DFS with n_jobs>1, make sure to protect your code with if __name__ == "__main__". Featuretools uses dask, which uses Python multiprocessing for parallelization. The spawn method on multiprocessing starts a new Python process, which requires it to import the __main__ module before it can do its task.
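A minimal sketch of such a guard (the toy dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from atom import ATOMClassifier

if __name__ == "__main__":
    # Load any dataset; this toy set is just for illustration
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)

    # Creating the features inside the main guard keeps multiprocessing's
    # spawn start method from re-executing this code in every subprocess
    atom = ATOMClassifier(X, y, n_jobs=2)
    atom.feature_generation("dfs", operators=["add", "mul"])
```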
Genetic Feature Generation
Genetic feature generation (GFG) uses genetic programming, a branch of evolutionary programming, to determine which features are successful and create new ones based on them. Where DFS can be seen as some kind of "brute force" for feature engineering, GFG tries to improve its features with every generation of the algorithm. GFG uses the same operators as DFS, but instead of only applying the transformations once, it evolves them further, creating complicated non-linear combinations of features with many transformations. The new features are named Feature N for the N-th feature. You can access the genetic features' fitness and description (how they are calculated) through the genetic_features attribute.
ATOM uses the SymbolicTransformer class from the gplearn package for the genetic algorithm. Read more about this implementation here.
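As an illustration (a sketch continuing from an existing atom instance; the strategy string "gfg", the n_features parameter, and accessing genetic_features directly on atom are assumptions based on the naming above):

```python
# Evolve non-linear feature combinations and keep the 5 fittest ones
atom.feature_generation("gfg", n_features=5)

# Inspect the new features' fitness and how they are calculated
print(atom.genetic_features)
```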
Warning
GFG can be slow for very large populations!
Selecting useful features
The FeatureSelector class provides tooling to select the relevant features from a dataset. It can be accessed from atom through the feature_selection method. The following strategies are implemented: univariate, PCA, SFM, RFE, RFECV and SFS.
Tip
Use the plot_feature_importance method to examine how much a specific feature contributes to the final predictions. If the model doesn't have a feature_importances_ attribute, use plot_permutation_importance instead.
Warning
The RFE and RFECV strategies don't work when the solver is a CatBoost model due to incompatibility of the APIs.
Univariate
Univariate feature selection works by selecting the best features based on univariate statistical F-tests. The test is provided via the solver parameter. It accepts any function that takes two arrays (X, y) and returns arrays (scores, p-values). Read more in sklearn's documentation.
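For example, using sklearn's f_classif as the solver (a sketch in the same call style as the notes at the top of this page):

```python
from sklearn.feature_selection import f_classif

# Keep the 10 features with the best F-scores with respect to the target
atom.feature_selection("univariate", solver=f_classif, n_features=10)
```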
Principal Components Analysis
Applying PCA reduces the dimensionality of the dataset by projecting the data onto the components that explain the most variance. The new features are called Component 1, Component 2, etc... The data is scaled to mean=0 and std=1 before fitting the transformer (if it wasn't already). Read more in sklearn's documentation.
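For example (a sketch mirroring the call shown in the notes above):

```python
# Project the dataset onto its first 5 principal components
atom.feature_selection("PCA", n_features=5)
```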
Selection from model
SFM uses an estimator with feature_importances_ or coef_ attributes to select the best features in a dataset based on importance weights. The estimator is provided through the solver parameter and can be already fitted. ATOM allows you to use one of its predefined models, e.g. solver="RF". If you didn't call the FeatureSelector through atom, don't forget to indicate the estimator's task by adding _class or _reg after the name, e.g. RF_class to use a random forest classifier. Read more in sklearn's documentation.
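For example, using ATOM's predefined random forest as the importance estimator (a sketch):

```python
# Keep the 12 features the random forest considers most important
atom.feature_selection("SFM", solver="RF", n_features=12)
```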
Recursive feature elimination
Select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained either through a coef_ or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached. Note that, since RFE needs to fit the model again every iteration, this method can be fairly slow.
RFECV applies the same algorithm as RFE but uses a cross-validated metric (under the scoring parameter, see RFECV) to assess every step's performance. Also, where RFE selects the number of features specified by n_features, RFECV keeps the number of features that achieved the optimal score on the specified metric. Note that this is not always equal to the amount specified by n_features. Read more in sklearn's documentation.
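A sketch of both strategies, shown as alternatives (continuing from an atom instance; the scoring value is illustrative):

```python
# RFE: recursively drop features until exactly 10 remain
atom.feature_selection("RFE", solver="RF", n_features=10)

# RFECV: let the cross-validated metric decide how many features to keep
atom.feature_selection("RFECV", solver="RF", scoring="accuracy")
```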
Sequential Feature Selection
Sequential feature selection adds (forward selection) or removes (backward
selection) features to form a feature subset in a greedy fashion. At each
stage, this estimator chooses the best feature to add or remove based on
the cross-validation score of an estimator. Read more in sklearn's documentation.
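For example (a sketch; forward selection is assumed as the default direction):

```python
# Greedily add features until 8 are selected, judged by cross-validation
atom.feature_selection("SFS", solver="RF", n_features=8)
```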
Removing features with low variance
Variance is the expectation of the squared deviation of a random variable from its mean. Features with low variance have many repeated values, which means the model will not learn much from them. FeatureSelector removes all features where the same value is repeated in at least max_frac_repeated fraction of the rows. The default option is to remove a feature if all values in it are the same. Read more in sklearn's documentation.
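For example (a sketch, assuming the method can be called without a strategy to apply only this filter):

```python
# Drop features where a single value covers 98% or more of the rows
atom.feature_selection(max_frac_repeated=0.98)
```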
Removing features with multi-collinearity
Two features that are highly correlated are redundant, i.e. the two together will not contribute more to the model than only one of them. FeatureSelector will drop a feature that has a Pearson correlation coefficient larger than max_correlation with another feature. A correlation of 1 means the two columns are equal. A dataframe of the removed features and their correlation values can be accessed through the collinear attribute.