Nomenclature
This documentation uses a consistent set of terms for concepts related to this package. The most frequent ones are described below.
ATOM
Refers to this package.
Instance of the ATOMClassifier, ATOMForecaster or ATOMRegressor classes (note that the examples use it as the default variable name).
Refers to all columns of type object, category, string or boolean.
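The definition of categorical columns above can be illustrated with a small pandas sketch. The helper name `categorical_columns` is hypothetical and is not part of ATOM; the package's own detection logic may differ in detail.

```python
import pandas as pd
from pandas.api import types as pdt


def categorical_columns(df):
    """Return the names of columns of type object, category, string or boolean.

    Hypothetical helper for illustration only, not part of ATOM.
    """
    return [
        name
        for name, col in df.items()
        if pdt.is_object_dtype(col.dtype)
        or isinstance(col.dtype, pd.CategoricalDtype)
        or pdt.is_string_dtype(col.dtype)
        or pdt.is_bool_dtype(col.dtype)
    ]


df = pd.DataFrame(
    {
        "n": [1, 2],                                # integer: not categorical
        "f": [0.5, 1.5],                            # float: not categorical
        "o": ["x", "y"],                            # object (or string) column
        "c": pd.Categorical(["a", "b"]),            # category
        "s": pd.array(["p", "q"], dtype="string"),  # string extension dtype
        "b": [True, False],                         # boolean
    }
)
cat_cols = categorical_columns(df)
```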
Unique value in a column, e.g., a binary classifier has two classes in the target column.
Two-dimensional, size-mutable, potentially heterogeneous tabular data. The type is usually pd.DataFrame, but could potentially be any of the dataframe types backed by the selected data engine.
Any type object from which a pd.DataFrame can be created. This includes an iterable, a dict whose values are 1d-arrays, a two-dimensional list, tuple, np.ndarray or sps.csr_matrix, or any object that follows the dataframe interchange protocol. This is the standard input format for any dataset.
Additionally, you can provide a callable whose output is any of the aforementioned types. This is useful when the dataset is very large and you are performing parallel operations, since it can avoid broadcasting a large dataset from the driver to the workers.
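To make the accepted input formats concrete, the sketch below builds a pd.DataFrame from a few of the objects listed above, including a callable. The helper `to_dataframe` is hypothetical and only mimics the idea of resolving a callable before conversion; it is not ATOM's actual conversion routine.

```python
import numpy as np
import pandas as pd


def to_dataframe(data):
    """Hypothetical helper: resolve a callable, then build a pd.DataFrame."""
    if callable(data):
        # The callable is only evaluated here, e.g., on a worker,
        # so the dataset itself does not have to be shipped around.
        data = data()
    return pd.DataFrame(data)


# A dict whose values are 1d-arrays
from_dict = to_dataframe({"a": np.array([1, 2, 3]), "b": np.array([4.0, 5.0, 6.0])})

# A two-dimensional list
from_list = to_dataframe([[1, 4.0], [2, 5.0], [3, 6.0]])

# A two-dimensional np.ndarray
from_array = to_dataframe(np.arange(6).reshape(3, 2))

# A callable whose output is one of the types above
from_callable = to_dataframe(lambda: {"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
```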
An object which manages the estimation and decoding of an algorithm. The algorithm is estimated as a deterministic function of a set of parameters, a dataset and a random state. Should implement a fit method. Often used interchangeably with predictor because of user preference.
All values in the missing attribute, as well as None, NaN, +inf and -inf.
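A minimal sketch of how such a definition of missing values could be turned into a boolean mask with pandas. The function `missing_mask` and its `extra_missing` parameter are hypothetical; `extra_missing` stands in for the values stored in the missing attribute, whose actual contents are not listed here.

```python
import numpy as np
import pandas as pd


def missing_mask(df, extra_missing=()):
    """Flag None, NaN, +inf, -inf and any extra placeholder values.

    Hypothetical helper for illustration only, not part of ATOM.
    """
    mask = df.isna()                          # None and NaN
    mask |= df.isin([np.inf, -np.inf])        # +inf and -inf
    mask |= df.isin(list(extra_missing))      # user-defined placeholders
    return mask


df = pd.DataFrame(
    {
        "x": [1.0, np.nan, np.inf, -np.inf],
        "y": ["a", None, "?", "b"],
    }
)
mask = missing_mask(df, extra_missing=["?"])
```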
Sample that contains one or more outlier values. Note that the Pruner class can use a different definition for outliers depending on the chosen strategy.
Value that lies more than 3 standard deviations away from the mean of its column, i.e., |z-score| > 3.
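The two outlier definitions above can be sketched with NumPy as follows. This assumes the population standard deviation (ddof=0) and does not guard against constant columns; the function names are hypothetical, and the Pruner class may apply a different rule depending on its strategy.

```python
import numpy as np


def outlier_value_mask(X):
    """True where a value has |z-score| > 3 within its column."""
    X = np.asarray(X, dtype=float)
    z = (X - X.mean(axis=0)) / X.std(axis=0)  # column-wise z-scores
    return np.abs(z) > 3


def outlier_sample_mask(X):
    """True for samples (rows) that contain one or more outlier values."""
    return outlier_value_mask(X).any(axis=1)


# First column: twenty identical values and one extreme value.
# Second column: evenly spread values without outliers.
X = np.column_stack([[10.0] * 20 + [100.0], np.arange(21, dtype=float)])

value_mask = outlier_value_mask(X)
sample_mask = outlier_sample_mask(X)
```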
An estimator implementing a predict method.
A non-estimator callable object which evaluates an estimator on given test data, returning a number. Unlike evaluation metrics, a greater returned number must correspond with a better score. See sklearn's documentation.
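The "greater is better" convention can be illustrated with scikit-learn's make_scorer. The sketch below wraps mean squared error, a metric where lower is better, so the resulting scorer returns its negated value; the toy data and the choice of DummyRegressor are assumptions made only for this example.

```python
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

X = [[0], [1], [2], [3]]
y = [0.0, 1.0, 2.0, 3.0]

# A trivial estimator that always predicts the mean of y (1.5)
model = DummyRegressor(strategy="mean").fit(X, y)

# greater_is_better=False makes the scorer negate the metric,
# so that a greater score still corresponds to a better model.
neg_mse = make_scorer(mean_squared_error, greater_is_better=False)

# A scorer is called with the estimator and the test data
score = neg_mse(model, X, y)
```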
Subset (segment) of a sequence, whether through slicing or generating a range of values. When given as a parameter type, it includes both range and slice.
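As a short illustration, both a slice and a range can describe the same subset of a sequence. The list of column names and the helper `is_segment` are hypothetical and serve only to show the two types side by side.

```python
columns = ["age", "bmi", "bp", "s1", "s2"]  # hypothetical column names

# Selecting positions 1 up to (not including) 4 through slicing
by_slice = columns[slice(1, 4)]

# Selecting the same positions through a generated range of values
by_range = [columns[i] for i in range(1, 4)]


def is_segment(obj):
    """Check whether an object is a range or a slice."""
    return isinstance(obj, (range, slice))
```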
A one-dimensional, indexable array of type sequence (except string), np.ndarray, pd.Index or series. This is the standard input format for a dataset's target column.
One-dimensional ndarray with axis labels. The type is usually pd.Series, but could potentially be any of the series types backed by the selected data engine.
The dependent variable in a supervised learning task. Passed as y to an estimator's fit method.
One of the supervised machine learning approaches that ATOM supports:
An estimator implementing a transform method. This encompasses all data cleaning and feature engineering classes.
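The relation between the estimator, predictor and transformer terms used in this glossary comes down to which methods an object implements. The sketch below uses two toy classes and a hypothetical `roles` helper to show this; none of these names belong to ATOM or scikit-learn.

```python
class MeanPredictor:
    """Toy predictor: learns the mean of y and always predicts it."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class Centerer:
    """Toy transformer: subtracts the mean learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


def roles(obj):
    """List the glossary terms that apply to an object (duck typing)."""
    found = []
    if callable(getattr(obj, "fit", None)):
        found.append("estimator")
        if callable(getattr(obj, "predict", None)):
            found.append("predictor")
        if callable(getattr(obj, "transform", None)):
            found.append("transformer")
    return found


predictions = MeanPredictor().fit([1, 2, 3], [2.0, 4.0, 6.0]).predict([0, 0])
centered = Centerer().fit([1.0, 2.0, 3.0]).transform([1.0, 2.0, 3.0])
```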