Vectorizer
class atom.nlp.Vectorizer(strategy="bow", return_sparse=True, device="cpu", engine="sklearn", verbose=0, logger=None, **kwargs)
Vectorize text data.
Transform the corpus into meaningful vectors of numbers. The
transformation is applied on the column named corpus. If
there is no column with that name, an exception is raised.
If strategy="bow" or "tfidf", the transformed columns are named
after the word they are embedding with the prefix corpus_. If
strategy="hashing", the columns are named hash[N], where N stands
for the n-th hashed column.
This class can be accessed from atom through the vectorize method. Read more in the user guide.
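To make the hashing strategy more concrete, the following is a minimal sketch of the hashing trick in plain Python. It is not ATOM's implementation: the helper `hash_vectorize`, the use of CRC32 as the hash function, the value of `n_features`, and the exact `hash0`, `hash1`, ... spelling of the column names are all assumptions made for illustration.

```python
import re
import zlib


def hash_vectorize(text, n_features=8):
    """Count tokens into a fixed number of hashed columns (illustrative only)."""
    counts = [0] * n_features
    # Lowercase and keep tokens of two or more word characters
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        # CRC32 is a stand-in hash chosen for this sketch
        bucket = zlib.crc32(token.encode("utf-8")) % n_features
        counts[bucket] += 1
    # Hypothetical column names: one per hashed bucket
    return {f"hash{i}": c for i, c in enumerate(counts)}


row = hash_vectorize("new york is larger than washington")
print(row)
```

Unlike the bow and tfidf strategies, the number of output columns is fixed in advance and does not depend on the vocabulary, so different words can end up in the same column.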
Parameters | strategy: str, default="bow"
Strategy with which to vectorize the text. Choose from "bow", "tfidf" or "hashing".
return_sparse: bool, default=True
Whether to return the transformation output as a dataframe
of sparse arrays. Must be False when there are other columns
in X (besides corpus) that are non-sparse.
device: str, default="cpu"
Device on which to train the estimators. Use any string
that follows the SYCL_DEVICE_FILTER filter selector,
e.g. device="gpu" to use the GPU. Read more in the
user guide.
engine: str, default="sklearn"
Execution engine to use for the estimators. Refer to the
user guide for an explanation
regarding every choice. Choose from:
verbose: int, default=0
Verbosity level of the class. Choose from:
logger: str, Logger or None, default=None
**kwargs
Additional keyword arguments for the strategy estimator.
|
Attributes | [strategy]: sklearn transformer
Estimator instance (lowercase strategy) used to vectorize the
corpus, e.g. vectorizer.tfidf for the tfidf strategy.
feature_names_in_: np.array
Names of features seen during fit.
n_features_in_: int
Number of features seen during fit.
|
See Also
TextCleaner | Applies standard text cleaning to the corpus. |
TextNormalizer | Normalize the corpus. |
Tokenizer | Tokenize the corpus. |
Example
>>> from atom import ATOMClassifier
>>> X = [
... ["I àm in ne'w york"],
... ["New york is nice"],
... ["new york"],
... ["hi there this is a test!"],
... ["another line..."],
... ["new york is larger than washington"],
... ["running the test"],
... ["this is a test"],
... ]
>>> y = [1, 0, 0, 1, 1, 1, 0, 0]
>>> atom = ATOMClassifier(X, y)
>>> print(atom.dataset)
corpus target
0 new york 0
1 I àm in ne'w york 1
2 this is a test 0
3 running the test 0
4 another line... 1
5 hi there this is a test! 1
6 New york is nice 0
7 new york is larger than washington 1
>>> atom.vectorize(strategy="tfidf", verbose=2)
Fitting Vectorizer...
Vectorizing the corpus...
>>> print(atom.dataset)
corpus_another corpus_in corpus_is ... corpus_york corpus_àm target
0 0.000000 0.000000 0.000000 ... 0.627914 0.000000 0
1 0.000000 0.523358 0.000000 ... 0.422242 0.523358 1
2 0.000000 0.000000 0.614189 ... 0.000000 0.000000 0
3 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0
4 0.707107 0.000000 0.000000 ... 0.000000 0.000000 1
5 0.000000 0.000000 0.614189 ... 0.000000 0.000000 1
6 0.000000 0.000000 0.614189 ... 0.495524 0.000000 0
7 0.000000 0.000000 0.614189 ... 0.495524 0.000000 1
[8 rows x 13 columns]
>>> from atom.nlp import Vectorizer
>>> X = [
... ["I àm in ne'w york"],
... ["New york is nice"],
... ["new york"],
... ["hi there this is a test!"],
... ["another line..."],
... ["new york is larger than washington"],
... ["running the test"],
... ["this is a test"],
... ]
>>> y = [1, 0, 0, 1, 1, 1, 0, 0]
>>> vectorizer = Vectorizer(strategy="tfidf", verbose=2)
>>> X = vectorizer.fit_transform(X)
Fitting Vectorizer...
Vectorizing the corpus...
>>> print(X)
corpus_another corpus_hi ... corpus_york corpus_àm
0 0.000000 0.000000 ... 0.343774 0.542162
1 0.000000 0.000000 ... 0.415657 0.000000
2 0.000000 0.000000 ... 0.659262 0.000000
3 0.000000 0.525049 ... 0.000000 0.000000
4 0.707107 0.000000 ... 0.000000 0.000000
5 0.000000 0.000000 ... 0.304821 0.000000
6 0.000000 0.000000 ... 0.000000 0.000000
7 0.000000 0.000000 ... 0.000000 0.000000
[8 rows x 18 columns]
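The numbers in the stand-alone output above can be reproduced by hand. The sketch below is not ATOM's code; it assumes the defaults of scikit-learn's TfidfVectorizer (lowercasing, tokens of two or more word characters, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per row). The helper names `tokenize` and `tfidf_row` are hypothetical.

```python
import math
import re

corpus = [
    "I àm in ne'w york",
    "New york is nice",
    "new york",
    "hi there this is a test!",
    "another line...",
    "new york is larger than washington",
    "running the test",
    "this is a test",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(text) for text in corpus]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency and smoothed inverse document frequency per word
df = {word: sum(word in doc for doc in docs) for word in vocab}
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}


def tfidf_row(tokens):
    # Term frequency times idf, then scale the row to unit L2 norm
    raw = {word: tokens.count(word) * idf[word] for word in set(tokens)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {f"corpus_{word}": value / norm for word, value in raw.items()}


rows = [tfidf_row(doc) for doc in docs]
print(len(vocab))
print(round(rows[2]["corpus_york"], 6))
print(round(rows[4]["corpus_another"], 6))
```

Under these assumptions the vocabulary has 18 words, matching the 18 columns above, and the values for corpus_york in row 2 and corpus_another in row 4 agree with the printed 0.659262 and 0.707107.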
Methods
fit | Fit to data. |
fit_transform | Fit to data, then transform it. |
get_params | Get parameters for this estimator. |
inverse_transform | Does nothing. |
log | Print message and save to log file. |
save | Save the instance to a pickle file. |
set_params | Set the parameters of this estimator. |
transform | Vectorize the text. |
method fit(X, y=None)
Fit to data.
method fit_transform(X=None, y=None, **fit_params)
Fit to data, then transform it.
method get_params(deep=True)
Get parameters for this estimator.
Parameters | deep : bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
|
Returns | params : dict
Parameter names mapped to their values.
|
method inverse_transform(X=None, y=None)
Does nothing.
method log(msg, level=0, severity="info")
Print message and save to log file.
method save(filename="auto", save_data=True)
Save the instance to a pickle file.
Parameters | filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
save_data: bool, default=True
Whether to save the dataset with the instance. This parameter
is ignored if the method is not called from atom. If False,
provide the data to the load method when loading the instance.
|
method set_params(**params)
Set the parameters of this estimator.
Parameters | **params : dict
Estimator parameters.
|
Returns | self : estimator instance
Estimator instance.
|
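The get_params and set_params pair follows the usual scikit-learn estimator convention. The sketch below shows that convention on a hypothetical stand-in class, `TinyVectorizer`; it is not ATOM's Vectorizer and only mirrors a few of its constructor arguments.

```python
import inspect


class TinyVectorizer:
    """Hypothetical stand-in used to illustrate the parameter API."""

    def __init__(self, strategy="bow", return_sparse=True, verbose=0):
        self.strategy = strategy
        self.return_sparse = return_sparse
        self.verbose = verbose

    def get_params(self, deep=True):
        # Read the constructor argument names and return their current values.
        # `deep` is ignored here because this sketch has no nested estimators.
        names = list(inspect.signature(self.__init__).parameters)
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        # Returning self allows calls to be chained
        return self


vec = TinyVectorizer().set_params(strategy="tfidf", verbose=2)
print(vec.get_params())
```

Because set_params returns the estimator itself, it can be chained directly with other calls such as fit.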
method transform(X, y=None)
Vectorize the text.
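The split between fit and transform can be sketched with a tiny bag-of-words counter: fit learns the vocabulary, and transform counts only those learned words in any text it is given. The class `TinyBow` and its `vocabulary_` attribute are hypothetical and written for this illustration; they are not part of ATOM.

```python
import re


class TinyBow:
    """Hypothetical bag-of-words counter illustrating fit versus transform."""

    @staticmethod
    def _tokenize(text):
        # Lowercase and keep tokens of two or more word characters
        return re.findall(r"\b\w\w+\b", text.lower())

    def fit(self, texts):
        # Learn the sorted vocabulary from the training texts
        self.vocabulary_ = sorted({w for t in texts for w in self._tokenize(t)})
        return self

    def transform(self, texts):
        # Count each learned word per text; words not seen in fit are dropped
        rows = []
        for text in texts:
            words = self._tokenize(text)
            rows.append({f"corpus_{w}": words.count(w) for w in self.vocabulary_})
        return rows


bow = TinyBow().fit(["new york", "this is a test"])
result = bow.transform(["new test in new york"])
print(result)
```

The word "in" does not appear in the result because it was not part of the texts passed to fit.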