Vectorizer
Vectorize text data.
Transform the corpus into meaningful vectors of numbers. The
transformation is applied on the column named corpus. If
there is no column with that name, an exception is raised.
If strategy="bow" or "tfidf", the transformed columns are named
after the word they are embedding with the prefix corpus_. If
strategy="hashing", the columns are named hash[N], where N stands
for the n-th hashed column.
This class can be accessed from atom through the vectorize method. Read more in the user guide.
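The column naming described above can be illustrated with a small, standard-library-only sketch of a bag-of-words count. It is not the class's implementation: the helper name bow_columns is hypothetical, and the tokenization rule (lowercased tokens of two or more word characters) is an assumption modeled on scikit-learn's default token pattern.

```python
import re
from collections import Counter


def bow_columns(docs, prefix="corpus_"):
    """Count tokens per document and name each column prefix + token.

    Hypothetical helper for illustration only.
    """
    # Assumption: lowercase the text and keep tokens of 2+ word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # One column per distinct token, in sorted order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    columns = [prefix + tok for tok in vocabulary]
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[tok] for tok in vocabulary])
    return columns, rows


columns, rows = bow_columns(["new york", "New york is nice"])
print(columns)
print(rows)
```

Each output column corresponds to one word of the vocabulary, which is why the transformed dataset in the examples further down has columns such as corpus_new and corpus_york.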
Parameters |
strategy: str, default="bow"
Strategy with which to vectorize the text. Choose from
"bow", "tfidf" or "hashing".
return_sparse: bool, default=True
Whether to return the transformation output as a dataframe
of sparse arrays. Must be False when there are other columns
in X (besides corpus) that are non-sparse.
device: str, default="cpu"
Device on which to run the estimators. Use any string that
follows the SYCL_DEVICE_FILTER filter selector, e.g.
device="gpu" to use the GPU. Read more in the user guide.
engine: str or None, default=None
Execution engine to use for estimators.
If None, the default value is used. Choose from:
verbose: int, default=0
Verbosity level of the class. Choose from:
**kwargs
Additional keyword arguments for the strategy estimator.
|
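To give an idea of what the "hashing" strategy does, here is a minimal sketch of the hashing trick using only the standard library. It rests on several assumptions: the helper hashing_vectorize is hypothetical, SHA-256 stands in for whatever hash function the underlying estimator uses, the tokenization rule is borrowed from scikit-learn's defaults, and the column names hash0, hash1, ... are just one reading of the hash[N] naming described above.

```python
import hashlib
import re


def hashing_vectorize(docs, n_features=8):
    """Map every token to one of n_features buckets and count the hits.

    Hypothetical helper; not the estimator used by the class.
    """
    # Assumed naming scheme, starting the index at 0.
    columns = [f"hash{i}" for i in range(n_features)]
    rows = []
    for doc in docs:
        row = [0] * n_features
        for token in re.findall(r"\b\w\w+\b", doc.lower()):
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            # Use the first 4 bytes of the digest as the bucket index.
            bucket = int.from_bytes(digest[:4], "big") % n_features
            row[bucket] += 1
        rows.append(row)
    return columns, rows


columns, rows = hashing_vectorize(["new york is nice", "this is a test"])
print(columns)
print(rows)
```

Unlike the "bow" and "tfidf" strategies, no vocabulary is stored, so the number of output columns is fixed in advance and different words may share a column.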
Attributes |
[strategy]: sklearn transformer
Estimator instance (lowercase strategy) used to vectorize the
corpus, e.g., vectorizer.tfidf for the tfidf strategy.
feature_names_in_: np.ndarray
Names of features seen during fit.
n_features_in_: int
Number of features seen during fit.
|
Example
>>> from atom import ATOMClassifier
>>> X = [
... ["I àm in ne'w york"],
... ["New york is nice"],
... ["new york"],
... ["hi there this is a test!"],
... ["another line..."],
... ["new york is larger than washington"],
... ["running the test"],
... ["this is a test"],
... ]
>>> y = [1, 0, 0, 1, 1, 1, 0, 0]
>>> atom = ATOMClassifier(X, y, test_size=2, random_state=1)
>>> print(atom.dataset)
corpus target
0 new york 0
1 another line... 1
2 New york is nice 0
3 new york is larger than washington 1
4 running the test 0
5 I àm in ne'w york 1
6 this is a test 0
7 hi there this is a test! 1
>>> atom.vectorize(strategy="tfidf", verbose=2)
Fitting Vectorizer...
Vectorizing the corpus...
>>> print(atom.dataset)
corpus_another corpus_in corpus_is corpus_larger corpus_line corpus_ne corpus_new corpus_nice corpus_running corpus_test corpus_than corpus_the corpus_washington corpus_york corpus_àm target
0 0 0 0 0 0 0 0.759339 0 0 0 0 0 0 0.650696 0 0
1 0.707107 0 0 0 0.707107 0 0 0 0 0 0 0 0 0 0 1
2 0 0 0.518242 0 0 0 0.437535 0.631991 0 0 0 0 0 0.374934 0 0
3 0 0 0.386401 0.471212 0 0 0.326226 0 0 0 0.471212 0 0.471212 0.279551 0 1
4 0 0 0 0 0 0 0 0 0.57735 0.57735 0 0.57735 0 0 0 0
5 0 0.546199 0 0 0 0.546199 0 0 0 0 0 0 0 0.324037 0.546199 1
6 0 0 0.634086 0 0 0 0 0 0 0.773262 0 0 0 0 0 0
7 0 0 0.634086 0 0 0 0 0 0 0.773262 0 0 0 0 0 1
>>> from atom.nlp import Vectorizer
>>> X = [
... ["I àm in ne'w york"],
... ["New york is nice"],
... ["new york"],
... ["hi there this is a test!"],
... ["another line..."],
... ["new york is larger than washington"],
... ["running the test"],
... ["this is a test"],
... ]
>>> vectorizer = Vectorizer(strategy="tfidf", verbose=2)
>>> X = vectorizer.fit_transform(X)
Fitting Vectorizer...
Vectorizing the corpus...
>>> print(X)
corpus_another corpus_hi corpus_in corpus_is corpus_larger corpus_line corpus_ne corpus_new corpus_nice corpus_running corpus_test corpus_than corpus_the corpus_there corpus_this corpus_washington corpus_york corpus_àm
0 0 0 0.542162 0 0 0 0.542162 0 0 0 0 0 0 0 0 0 0.343774 0.542162
1 0 0 0 0.415657 0 0 0 0.474072 0.655527 0 0 0 0 0 0 0 0.415657 0
2 0 0 0 0 0 0 0 0.751913 0 0 0 0 0 0 0 0 0.659262 0
3 0 0.525049 0 0.332923 0 0 0 0 0 0 0.379712 0 0 0.525049 0.440032 0 0 0
4 0.707107 0 0 0 0 0.707107 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0.304821 0.480729 0 0 0.34766 0 0 0 0.480729 0 0 0 0.480729 0.304821 0
6 0 0 0 0 0 0 0 0 0 0.629565 0.455297 0 0.629565 0 0 0 0 0
7 0 0 0 0.497041 0 0 0 0 0 0 0.566893 0 0 0 0.656949 0 0 0
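The values in the stand-alone output above can be checked by hand. The sketch below recomputes them with the standard library only, assuming the weighting scheme of scikit-learn's default TF-IDF settings: a smoothed idf of ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, with lowercased tokens of two or more word characters. The function name tfidf is hypothetical, and the code is not the estimator the class uses internally.

```python
import math
import re


def tfidf(docs):
    """Return the sorted vocabulary and one {token: weight} dict per document."""
    # Assumption: lowercase the text and keep tokens of 2+ word characters.
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents containing the token.
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        weights = {t: toks.count(t) * idf[t] for t in set(toks)}
        # Scale each row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows


docs = [
    "I àm in ne'w york",
    "New york is nice",
    "new york",
    "hi there this is a test!",
    "another line...",
    "new york is larger than washington",
    "running the test",
    "this is a test",
]
vocab, rows = tfidf(docs)
print(len(vocab))
print(round(rows[2]["new"], 6), round(rows[2]["york"], 6))
```

Under these assumptions, the weights for the document "new york" agree with row 2 of the stand-alone output (0.751913 for corpus_new and 0.659262 for corpus_york), and the vocabulary has the same 18 words as the output columns.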
Methods
fit | Fit to data. |
fit_transform | Fit to data, then transform it. |
get_feature_names_out | Get output feature names for transformation. |
get_params | Get parameters for this estimator. |
inverse_transform | Do nothing. |
set_output | Set output container. |
set_params | Set the parameters of this estimator. |
transform | Vectorize the text. |
Fit to data.
Fit to data, then transform it.
Get output feature names for transformation.
Parameters |
input_features: sequence or None, default=None
Only used to validate feature names with the names seen in fit.
|
Returns |
np.ndarray
Transformed feature names.
|
Get parameters for this estimator.
Parameters |
deep : bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
|
Returns |
params : dict
Parameter names mapped to their values.
|
Do nothing.
Returns the input unchanged. Implemented for continuity of the API.
Set output container.
See sklearn's user guide on how to use the set_output API. See
here a description of the choices.
Set the parameters of this estimator.
Parameters |
**params : dict
Estimator parameters.
|
Returns |
self : estimator instance
Estimator instance.
|
Vectorize the text.