Vectorizer

class atom.nlp.Vectorizer(strategy="bow", return_sparse=True, device="cpu", engine=None, verbose=0, **kwargs)[source]

Vectorize text data.

Transform the corpus into numerical vectors. The transformation is applied to the column named corpus. If there is no column with that name, an exception is raised.

If strategy="bow" or "tfidf", each transformed column is named after the word it represents, with the prefix corpus_. If strategy="hashing", the columns are named hash[N], where N stands for the n-th hashed column.
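The column naming for the "bow" strategy can be illustrated with a small standard-library sketch. This is not ATOM's implementation: the `bag_of_words` helper is hypothetical, and the tokenizer (lowercase runs of two or more word characters) is an assumption chosen to mirror sklearn's default token pattern.

```python
import re
from collections import Counter

# Assumption: a token is a lowercase run of two or more word characters,
# which mirrors sklearn's default token pattern.
TOKEN = re.compile(r"\b\w\w+\b")


def bag_of_words(docs):
    """Return (column_names, rows) of token counts, one row per document."""
    counts = [Counter(TOKEN.findall(doc.lower())) for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in the documents.
    vocab = sorted(set().union(*counts))
    # Each output column is named after its word, prefixed with "corpus_".
    columns = [f"corpus_{word}" for word in vocab]
    rows = [[c[word] for word in vocab] for c in counts]
    return columns, rows


columns, rows = bag_of_words(["new york", "New york is nice"])
print(columns)
print(rows)
```

Each row holds the count of every vocabulary word in the corresponding document, so words that a document does not contain get a zero.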

This class can be accessed from atom through the vectorize method. Read more in the user guide.

Parameters

strategy: str, default="bow"

Strategy with which to vectorize the text. Choose from:

"bow": Bag of Words.
"tfidf": Term Frequency - Inverse Document Frequency.
"hashing": Vectorize to a matrix of token occurrences.
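The "hashing" strategy differs from the other two in that it keeps no vocabulary: every token is mapped to one of a fixed number of columns by a hash function. The sketch below only illustrates that idea under stated assumptions. The `hash_vectorize` helper and its `n_features` argument are hypothetical, `zlib.crc32` stands in for the MurmurHash3 function used by sklearn's HashingVectorizer, and the sign alternation and normalization that sklearn applies by default are left out.

```python
import re
import zlib

# Assumption: same tokenization as sklearn's default token pattern.
TOKEN = re.compile(r"\b\w\w+\b")


def hash_vectorize(doc, n_features=8):
    """Count tokens into a fixed number of buckets chosen by a hash."""
    row = [0] * n_features
    for token in TOKEN.findall(doc.lower()):
        # crc32 is a stand-in hash; the bucket index is the hash modulo
        # the number of output columns.
        bucket = zlib.crc32(token.encode("utf-8")) % n_features
        row[bucket] += 1
    return row


row = hash_vectorize("new york is larger than washington")
print(row)
```

Because different tokens can land in the same bucket, the output columns cannot be mapped back to individual words, which is why they carry generic hash names instead of word names.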

return_sparse: bool, default=True

Whether to return the transformation output as a dataframe of sparse arrays. Must be False when there are other columns in X (besides corpus) that are non-sparse.

device: str, default="cpu"

Device on which to run the estimators. Use any string that follows the SYCL_DEVICE_FILTER filter selector, e.g. device="gpu" to use the GPU. Read more in the user guide.

engine: str or None, default=None

Execution engine to use for estimators. If None, the default value is used. Choose from:

"sklearn" (default)
"cuml"

verbose: int, default=0

Verbosity level of the class. Choose from:

0 to not print anything.
1 to print basic information.
2 to print detailed information.

**kwargs

Additional keyword arguments for the strategy estimator.

Attributes

[strategy]_: sklearn transformer

Estimator instance (lowercase strategy) used to vectorize the corpus, e.g., vectorizer.tfidf_ for the tfidf strategy.

feature_names_in_: np.ndarray

Names of features seen during fit.

n_features_in_: int

Number of features seen during fit.

Example

The first example uses the class through atom; the second uses it stand-alone.

>>> from atom import ATOMClassifier

>>> X = [
...    ["I àm in ne'w york"],
...    ["New york is nice"],
...    ["new york"],
...    ["hi there this is a test!"],
...    ["another line..."],
...    ["new york is larger than washington"],
...    ["running the test"],
...    ["this is a test"],
... ]
>>> y = [1, 0, 0, 1, 1, 1, 0, 0]

>>> atom = ATOMClassifier(X, y, test_size=2, random_state=1)
>>> print(atom.dataset)

                               corpus  target
0                            new york       0
1                     another line...       1
2                    New york is nice       0
3  new york is larger than washington       1
4                    running the test       0
5                   I àm in ne'w york       1
6                      this is a test       0
7            hi there this is a test!       1


>>> atom.vectorize(strategy="tfidf", verbose=2)

Fitting Vectorizer...
Vectorizing the corpus...


>>> print(atom.dataset)

   corpus_another  corpus_in  corpus_is  corpus_larger  corpus_line  corpus_ne  corpus_new  corpus_nice  corpus_running  corpus_test  corpus_than  corpus_the  corpus_washington  corpus_york  corpus_àm  target
0        0.000000   0.000000   0.000000       0.000000     0.000000   0.000000    0.759339     0.000000         0.00000     0.000000     0.000000     0.00000           0.000000     0.650696   0.000000       0
1        0.707107   0.000000   0.000000       0.000000     0.707107   0.000000    0.000000     0.000000         0.00000     0.000000     0.000000     0.00000           0.000000     0.000000   0.000000       1
2        0.000000   0.000000   0.518242       0.000000     0.000000   0.000000    0.437535     0.631991         0.00000     0.000000     0.000000     0.00000           0.000000     0.374934   0.000000       0
3        0.000000   0.000000   0.386401       0.471212     0.000000   0.000000    0.326226     0.000000         0.00000     0.000000     0.471212     0.00000           0.471212     0.279551   0.000000       1
4        0.000000   0.000000   0.000000       0.000000     0.000000   0.000000    0.000000     0.000000         0.57735     0.577350     0.000000     0.57735           0.000000     0.000000   0.000000       0
5        0.000000   0.546199   0.000000       0.000000     0.000000   0.546199    0.000000     0.000000         0.00000     0.000000     0.000000     0.00000           0.000000     0.324037   0.546199       1
6        0.000000   0.000000   0.634086       0.000000     0.000000   0.000000    0.000000     0.000000         0.00000     0.773262     0.000000     0.00000           0.000000     0.000000   0.000000       0
7        0.000000   0.000000   0.634086       0.000000     0.000000   0.000000    0.000000     0.000000         0.00000     0.773262     0.000000     0.00000           0.000000     0.000000   0.000000       1
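The output above has no corpus_hi, corpus_there or corpus_this columns, and rows 6 and 7 are identical. This is consistent with the vocabulary having been learned from the six training rows only (test_size=2 holds out the last two rows), so words that occur only in the test rows are dropped. The standard-library sketch below checks this reading; the tokenizer is an assumption that mirrors sklearn's default token pattern.

```python
import re

# Assumption: a token is a lowercase run of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    return TOKEN.findall(doc.lower())


# Rows 0-5 of atom.dataset (training set) and rows 6-7 (test set).
train = [
    "new york",
    "another line...",
    "New york is nice",
    "new york is larger than washington",
    "running the test",
    "I àm in ne'w york",
]
test = ["this is a test", "hi there this is a test!"]

# Vocabulary learned from the training rows only.
vocabulary = sorted({tok for doc in train for tok in tokenize(doc)})
# Test-set tokens that are not in that vocabulary get no column.
unseen = sorted({tok for doc in test for tok in tokenize(doc)} - set(vocabulary))

print(len(vocabulary))
print(unseen)
```

The vocabulary size equals the number of corpus_ columns in the output, and the unseen tokens are exactly the words whose columns are missing compared with the stand-alone example.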

>>> from atom.nlp import Vectorizer

>>> X = [
...    ["I àm in ne'w york"],
...    ["New york is nice"],
...    ["new york"],
...    ["hi there this is a test!"],
...    ["another line..."],
...    ["new york is larger than washington"],
...    ["running the test"],
...    ["this is a test"],
... ]

>>> vectorizer = Vectorizer(strategy="tfidf", verbose=2)
>>> X = vectorizer.fit_transform(X)

Fitting Vectorizer...
Vectorizing the corpus...


>>> print(X)

   corpus_another  corpus_hi  corpus_in  corpus_is  corpus_larger  corpus_line  corpus_ne  corpus_new  corpus_nice  corpus_running  corpus_test  corpus_than  corpus_the  corpus_there  corpus_this  corpus_washington  corpus_york  corpus_àm
0        0.000000   0.000000   0.542162   0.000000       0.000000     0.000000   0.542162    0.000000     0.000000        0.000000     0.000000     0.000000    0.000000      0.000000     0.000000           0.000000     0.343774   0.542162
1        0.000000   0.000000   0.000000   0.415657       0.000000     0.000000   0.000000    0.474072     0.655527        0.000000     0.000000     0.000000    0.000000      0.000000     0.000000           0.000000     0.415657   0.000000
2        0.000000   0.000000   0.000000   0.000000       0.000000     0.000000   0.000000    0.751913     0.000000        0.000000     0.000000     0.000000    0.000000      0.000000     0.000000           0.000000     0.659262   0.000000
3        0.000000   0.525049   0.000000   0.332923       0.000000     0.000000   0.000000    0.000000     0.000000        0.000000     0.379712     0.000000    0.000000      0.525049     0.440032           0.000000     0.000000   0.000000
4        0.707107   0.000000   0.000000   0.000000       0.000000     0.707107   0.000000    0.000000     0.000000        0.000000     0.000000     0.000000    0.000000      0.000000     0.000000           0.000000     0.000000   0.000000
5        0.000000   0.000000   0.000000   0.304821       0.480729     0.000000   0.000000    0.347660     0.000000        0.000000     0.000000     0.480729    0.000000      0.000000     0.000000           0.480729     0.304821   0.000000
6        0.000000   0.000000   0.000000   0.000000       0.000000     0.000000   0.000000    0.000000     0.000000        0.629565     0.455297     0.000000    0.629565      0.000000     0.000000           0.000000     0.000000   0.000000
7        0.000000   0.000000   0.000000   0.497041       0.000000     0.000000   0.000000    0.000000     0.000000        0.000000     0.566893     0.000000    0.000000      0.000000     0.656949           0.000000     0.000000   0.000000
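The weights printed above can be recomputed by hand. The sketch below assumes the underlying estimator behaves like sklearn's TfidfVectorizer with default settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of every row. The `tfidf` helper is hypothetical and uses only the standard library.

```python
import math
import re
from collections import Counter

# Assumption: a token is a lowercase run of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

docs = [
    "I àm in ne'w york",
    "New york is nice",
    "new york",
    "hi there this is a test!",
    "another line...",
    "new york is larger than washington",
    "running the test",
    "this is a test",
]


def tfidf(docs):
    """Return one {token: weight} dict per document."""
    counts = [Counter(TOKEN.findall(d.lower())) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for c in counts for tok in c)
    # Smoothed idf (assumed default of sklearn's TfidfVectorizer).
    idf = {tok: math.log((1 + n) / (1 + k)) + 1 for tok, k in df.items()}
    rows = []
    for c in counts:
        raw = {tok: tf * idf[tok] for tok, tf in c.items()}
        # L2-normalize the row; guard against documents without tokens.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({tok: v / norm for tok, v in raw.items()})
    return rows


weights = tfidf(docs)
print({tok: round(v, 6) for tok, v in weights[2].items()})
```

For the document "new york" (row 2), this gives the same nonzero values as the table: a higher weight for new, which occurs in three documents, than for york, which occurs in four.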

Methods

fit	Fit to data.
fit_transform	Fit to data, then transform it.
get_feature_names_out	Get output feature names for transformation.
get_params	Get parameters for this estimator.
inverse_transform	Do nothing.
set_output	Set output container.
set_params	Set the parameters of this estimator.
transform	Vectorize the text.

method fit(X, y=None)[source]

Fit to data.

Parameters

X: dataframe-like

Feature set with shape=(n_samples, n_features). If X is not a dataframe, it should be composed of a single feature containing the text documents.

y: sequence, dataframe-like or None, default=None

Do nothing. Implemented for continuity of the API.

Returns

Self

Estimator instance.

method fit_transform(X=None, y=None, **fit_params)[source]

Fit to data, then transform it.

Parameters

X: dataframe-like or None, default=None

Feature set with shape=(n_samples, n_features). If None, `X` is ignored.

y: sequence, dataframe-like or None, default=None

Target column(s) corresponding to `X`. If None, `y` is ignored.

**fit_params

Additional keyword arguments for the fit method.

Returns

dataframe

Transformed feature set. Only returned if provided.

series or dataframe

Transformed target column. Only returned if provided.

method get_feature_names_out(input_features=None)[source]

Get output feature names for transformation.

Parameters

input_features: sequence or None, default=None

Only used to validate feature names with the names seen in `fit`.

Returns

np.ndarray

Transformed feature names.

method get_params(deep=True)[source]

Get parameters for this estimator.

Parameters

deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict

Parameter names mapped to their values.

method inverse_transform(X=None, y=None, **fit_params)[source]

Do nothing.

Returns the input unchanged. Implemented for continuity of the API.

Parameters

X: dataframe-like or None, default=None

Feature set with shape=(n_samples, n_features). If None, `X` is ignored.

y: sequence, dataframe-like or None, default=None

Target column(s) corresponding to `X`. If None, `y` is ignored.

Returns

dataframe

Feature set. Only returned if provided.

series or dataframe

Target column(s). Only returned if provided.

method set_output(transform=None)[source]

Set output container.

See sklearn's user guide on how to use the set_output API. The available choices are listed under the transform parameter.

Parameters

transform: str or None, default=None

Configure the output of the `transform`, `fit_transform`, and `inverse_transform` methods. If None, the configuration is not changed. Choose from:

"numpy"
"pandas" (default)
"pandas-pyarrow"
"polars"
"polars-lazy"
"pyarrow"
"modin"
"dask"
"pyspark"
"pyspark-pandas"

Returns

Self

Estimator instance.

method set_params(**params)[source]

Set the parameters of this estimator.

Parameters

**params: dict

Estimator parameters.

Returns

self: estimator instance

Estimator instance.

method transform(X, y=None)[source]

Vectorize the text.

Parameters

X: dataframe-like

Feature set with shape=(n_samples, n_features). If X is not a dataframe, it should be composed of a single feature containing the text documents.

y: sequence, dataframe-like or None, default=None

Do nothing. Implemented for continuity of the API.

Returns

dataframe

Transformed corpus.