Tokenizer
Convert documents into sequences of words. Additionally, create n-grams
(represented by words joined with underscores, e.g. "New_York") based
on their frequency in the corpus. The transformations are applied to
the column named Corpus. If there is no column with that name, an
exception is raised. This class can be accessed from atom through the
tokenize method. Read more in the user guide.
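The idea of frequency-based n-gram creation can be illustrated with a minimal, library-independent sketch. It is not the implementation used by this class: the function name merge_bigrams, the whitespace tokenization and the greedy left-to-right merge are all assumptions made for illustration.

```python
from collections import Counter


def merge_bigrams(docs, min_count):
    """Join adjacent word pairs that occur at least `min_count` times.

    Hypothetical helper for illustration; not part of the atom API.
    """
    # Count every adjacent pair of words across the whole corpus.
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            # Greedy left-to-right merge of a frequent pair.
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(f"{doc[i]}_{doc[i + 1]}")
                i += 2
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


corpus = ["I love New York", "New York is big", "a new day"]
tokens = [text.split() for text in corpus]  # naive whitespace tokenization
result = merge_bigrams(tokens, min_count=2)
print(result)
```

Only the pair ("New", "York") appears twice in this toy corpus, so it is the only one joined with an underscore; all other words stay separate tokens.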
Parameters: |
bigram_freq: int, float or None, optional (default=None) Frequency threshold for bigram creation.
trigram_freq: Frequency threshold for trigram creation.
quadgram_freq: Frequency threshold for quadgram creation.
verbose: Verbosity level of the class. Possible values are:
|
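Since the frequency thresholds accept int, float or None, one plausible reading is that an int is an absolute number of occurrences, a float is a fraction of the candidate n-grams, and None disables creation. The sketch below encodes that reading. It is an assumption for illustration, and resolve_min_count is a hypothetical helper, not part of the library.

```python
def resolve_min_count(freq, n_candidates):
    """Turn a frequency threshold into a minimum occurrence count.

    Assumption for illustration only: an int is an absolute count and
    a float is a fraction of `n_candidates`.
    """
    if freq is None:
        return None  # n-gram creation is disabled
    if isinstance(freq, bool) or freq <= 0:
        raise ValueError("freq must be a positive number or None")
    if isinstance(freq, int):
        return freq  # already an absolute count
    # Fractional threshold: scale by the number of candidates, at least 1.
    return max(1, round(freq * n_candidates))


print(resolve_min_count(None, 1000))
print(resolve_min_count(5, 1000))
print(resolve_min_count(0.01, 1000))
```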
Attributes
Attributes: |
bigrams: pd.DataFrame
trigrams: pd.DataFrame
quadgrams: pd.DataFrame |
Methods
fit_transform | Same as transform. |
get_params | Get parameters for this estimator. |
log | Write information to the logger and print to stdout. |
save | Save the instance to a pickle file. |
set_params | Set the parameters of this estimator. |
transform | Transform the text. |
fit_transform
Tokenize the text.
Parameters: |
X: dict, list, tuple, np.ndarray or pd.DataFrame Text documents to tokenize.
y: Does nothing. Implemented for continuity of the API. |
Returns: |
X: pd.DataFrame |
get_params
Get parameters for this estimator.
Parameters: |
deep: bool, optional (default=True) |
Returns: |
params: dict Dictionary of the parameter names mapped to their values. |
log
Write a message to the logger and print it to stdout.
Parameters: |
msg: str
level: int, optional (default=0) |
save
Save the instance to a pickle file.
Parameters: |
filename: str, optional (default="auto") Name of the file. Use "auto" for automatic naming. |
set_params
Set the parameters of this estimator.
Parameters: |
**params: dict Estimator parameters. |
Returns: |
self: Tokenizer Estimator instance. |
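The get_params and set_params pair follows the familiar estimator convention: parameters are read from the constructor signature, and set_params returns the instance so calls can be chained. The following is a minimal sketch of that convention using only the standard library. ParamsMixin and ToyTokenizer are hypothetical stand-ins, not the classes shipped with atom.

```python
import inspect


class ParamsMixin:
    """Hypothetical mixin sketching the get_params/set_params convention."""

    def get_params(self, deep=True):
        # `deep` is accepted for API compatibility; this sketch has no
        # nested estimators, so it has no effect here.
        signature = inspect.signature(type(self).__init__)
        names = [name for name in signature.parameters if name != "self"]
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self  # returning self allows chaining


class ToyTokenizer(ParamsMixin):
    """Stand-in estimator that only stores its constructor arguments."""

    def __init__(self, bigram_freq=None, trigram_freq=None):
        self.bigram_freq = bigram_freq
        self.trigram_freq = trigram_freq


tok = ToyTokenizer(bigram_freq=0.01)
params = tok.get_params()
same = tok.set_params(trigram_freq=5)
print(params, same is tok, tok.trigram_freq)
```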
transform
Tokenize the text.
Parameters: |
X: dict, list, tuple, np.ndarray or pd.DataFrame Text documents to tokenize.
y: Does nothing. Implemented for continuity of the API. |
Returns: |
X: pd.DataFrame |
Example
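# Option 1: tokenize through atom (X is assumed to contain a column named Corpus)
from atom import ATOMClassifier
atom = ATOMClassifier(X, y)
atom.tokenize(bigram_freq=0.01)

# Option 2: use the Tokenizer class directly
from atom.nlp import Tokenizer
tokenizer = Tokenizer(bigram_freq=0.01)
X = tokenizer.transform(X)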