
FeatureExtractor


class atom.feature_engineering.FeatureExtractor(features=["day", "month", "year"], fmt=None, encoding_type="ordinal", drop_columns=True, verbose=0, logger=None) [source]

Create new features by extracting datetime elements (day, month, year, etc.) from the provided columns. Columns of dtype datetime64 are used as is. Categorical columns that can be successfully converted to a datetime format (less than 30% NaT values after conversion) are also used. This class can be accessed from atom through the feature_extraction method. Read more in the user guide.

Parameters:

features: str or sequence, optional (default=["day", "month", "year"])
Features to create from the datetime columns. Note that created features with zero variance (e.g. the feature hour in a column that only contains dates) are ignored. Allowed values are datetime attributes from `pandas.Series.dt`.
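
The selection of `pandas.Series.dt` attributes and the removal of zero-variance features can be illustrated with plain pandas. This is only a sketch of the idea, not the class's actual implementation: the sample data and the `<column>_<feature>` naming are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical date-only column: the hour attribute is 0 for every row.
dates = pd.Series(
    pd.to_datetime(["2019-01-15", "2020-02-20", "2021-03-25"]),
    name="date",
)

requested = ["day", "month", "year", "hour"]

# Each requested feature is read from the pandas .dt accessor.
# The "<column>_<feature>" naming is an assumption for this sketch.
extracted = pd.DataFrame(
    {f"{dates.name}_{feat}": getattr(dates.dt, feat) for feat in requested}
)

# Keep only features that take more than one distinct value.
kept = extracted.loc[:, extracted.nunique() > 1]
print(list(kept.columns))
```

Here the hour feature is dropped because it is constant, while day, month and year are kept.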

fmt: str, sequence or None, optional (default=None)
Format (strptime) of the categorical columns that need to be converted to datetime. If a sequence, the n-th format corresponds to the n-th categorical column that can be successfully converted. If None, the format is inferred automatically from the first non-NaN value. Values that cannot be converted are returned as NaT.
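
The conversion described above can be approximated with `pandas.to_datetime`. This is a minimal sketch with made-up data. The 30% NaT threshold comes from the class description, but how the class applies it internally is an assumption here.

```python
import pandas as pd

# Hypothetical categorical column with one value that is not a date.
raw = pd.Series(["15/01/2021", "20/02/2021", "not a date", "25/03/2021"])

# errors="coerce" turns values that do not match the format into NaT.
converted = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# Fraction of values that could not be converted.
nat_fraction = converted.isna().mean()

# Assumed check: the column counts as datetime if fewer than 30% are NaT.
usable = nat_fraction < 0.3
print(nat_fraction, usable)
```

With one unparseable value out of four, the NaT fraction is 0.25, so this column would still qualify under the 30% rule.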

encoding_type: str, optional (default="ordinal")
Type of encoding to use. Choose from:
  • "ordinal": Encode features in increasing order.
  • "cyclic": Encode features using sine and cosine to capture their cyclic nature. Note that this creates two columns for every feature. Non-cyclic features still use ordinal encoding.
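
As an illustration of the cyclic option, the sketch below maps month numbers onto a circle with sine and cosine. The period of 12 and the column names are assumptions for this example. The exact convention used by the class may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical month values: Jan, Apr, Jul, Oct, Dec.
months = pd.Series([1, 4, 7, 10, 12], name="month")
period = 12  # assumed length of the cycle

# Map every value to an angle on the unit circle.
angle = 2 * np.pi * months / period
encoded = pd.DataFrame(
    {"month_sin": np.sin(angle), "month_cos": np.cos(angle)}
)

# Euclidean distance between two encoded rows.
def distance(i, j):
    return np.linalg.norm(encoded.iloc[i].to_numpy() - encoded.iloc[j].to_numpy())

jan_dec = distance(0, 4)  # adjacent months across the year boundary
jan_jul = distance(0, 2)  # months half a year apart
print(jan_dec < jan_jul)
```

With ordinal encoding, January (1) and December (12) are the two values furthest apart. In the sine/cosine representation they are neighbors, which is the property the cyclic encoding is meant to capture.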

drop_columns: bool, optional (default=True)
Whether to drop the original columns after extracting the features from them.

verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:
  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

logger: str, Logger or None, optional (default=None)
  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic naming.
  • Else: Python logging.Logger instance.

Warning

Decision tree-based algorithms build their split rules on one feature at a time. As a result, they will fail to process cyclic features correctly, since the sin/cos pair should be treated as a single coordinate system.


Methods

fit_transform Same as transform.
get_params Get parameters for this estimator.
log Write information to the logger and print to stdout.
save Save the instance to a pickle file.
set_params Set the parameters of this estimator.
transform Transform the data.


method fit_transform(X, y=None) [source]

Extract the new features.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns: X: pd.DataFrame
Dataframe containing the new features.


method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns: params: dict
Dictionary of the parameter names mapped to their values.


method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.


method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.


method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns: self: FeatureExtractor
Estimator instance.


method transform(X, y=None) [source]

Extract the new features.

Parameters:

X: dict, list, tuple, np.ndarray or pd.DataFrame
Feature set with shape=(n_samples, n_features).

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns: X: pd.DataFrame
Dataframe containing the new features.


Example

from atom import ATOMClassifier

# X is the feature set and y the target, both loaded beforehand.
atom = ATOMClassifier(X, y)
atom.feature_extraction(features=["day", "month"], fmt="%d/%m/%Y")

or

from atom.feature_engineering import FeatureExtractor

feature_extractor = FeatureExtractor(features=["day", "month"], fmt="%d/%m/%Y")
X = feature_extractor.transform(X)
