Tokenizer
Convert documents into sequences of words. Additionally, create n-grams
(represented by words united with underscores, e.g. "New_York") based
on their frequency in the corpus. The transformations are applied on
the column named Corpus. If there is no column with that name, an
exception is raised. This class can be accessed from atom through the
tokenize method. Read more
in the user guide.
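The frequency-based n-gram step described above can be illustrated with a minimal sketch. This is not ATOM's implementation: the function `merge_frequent_bigrams`, its `min_count` parameter, and the whitespace tokenization are assumptions made for illustration. The threshold is treated here as a minimum number of occurrences in the corpus.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count):
    """Join adjacent word pairs that occur at least `min_count` times in the corpus."""
    # Count every adjacent pair of words across all documents.
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            # Greedy left-to-right merge of a frequent pair into a single token.
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(f"{doc[i]}_{doc[i + 1]}")
                i += 2
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Hypothetical toy corpus, already split into words.
docs = [
    "I moved to New York".split(),
    "New York is busy".split(),
    "a new day in York".split(),
]
result = merge_frequent_bigrams(docs, min_count=2)
```

Only the pair ("New", "York") reaches the threshold in this toy corpus, so it is the only one replaced by a single underscore-joined token. Trigrams and quadgrams could be built the same way with longer windows.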
| Parameters: |
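bigram_freq: int, float or None, optional (default=None) Frequency threshold for bigram creation.
trigram_freq: int, float or None, optional (default=None) Frequency threshold for trigram creation.
quadgram_freq: int, float or None, optional (default=None) Frequency threshold for quadgram creation.
verbose: Verbosity level of the class. Possible values are: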
|
Attributes
| Attributes: |
bigrams: pd.DataFrame
trigrams: pd.DataFrame
quadgrams: pd.DataFrame |
Methods
| fit_transform | Same as transform. |
| get_params | Get parameters for this estimator. |
| log | Write information to the logger and print to stdout. |
| save | Save the instance to a pickle file. |
| set_params | Set the parameters of this estimator. |
| transform | Transform the text. |
fit_transform
Tokenize the text.
| Parameters: |
X: dict, list, tuple, np.ndarray or pd.DataFrame
y: Does nothing. Implemented for continuity of the API. |
| Returns: |
X: pd.DataFrame |
get_params
Get parameters for this estimator.
| Parameters: |
deep: bool, optional (default=True) |
| Returns: |
params: dict Dictionary of the parameter names mapped to their values. |
log
Write a message to the logger and print it to stdout.
| Parameters: |
msg: str
level: int, optional (default=0) |
save
Save the instance to a pickle file.
| Parameters: |
filename: str, optional (default="auto") Name of the file. Use "auto" for automatic naming. |
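As a rough illustration of what saving to a pickle file allows, the sketch below writes an object to disk and restores it with Python's standard `pickle` module. It assumes the saved file is a plain pickle; the `SimpleNamespace` object and the filename `tokenizer.pkl` are hypothetical stand-ins, not ATOM objects or ATOM's naming scheme.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace

# Hypothetical stand-in for a configured transformer instance.
instance = SimpleNamespace(bigram_freq=0.01)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tokenizer.pkl"  # hypothetical filename

    # Roughly what saving an instance to a pickle file amounts to.
    with open(path, "wb") as f:
        pickle.dump(instance, f)

    # Restore the instance, for example in a later session.
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.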
set_params
Set the parameters of this estimator.
| Parameters: |
**params: dict Estimator parameters. |
| Returns: |
self: Tokenizer Estimator instance. |
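The parameter accessors follow the usual estimator convention: `get_params` returns a dictionary and `set_params` returns the instance itself, so calls can be chained. The sketch below mimics that behavior with a hypothetical class, `SketchEstimator`; it is an illustration under that assumption, not the Tokenizer's actual code.

```python
class SketchEstimator:
    """Hypothetical stand-in that follows the get_params/set_params convention."""

    def __init__(self, bigram_freq=None, verbose=0):
        self.bigram_freq = bigram_freq
        self.verbose = verbose

    def get_params(self, deep=True):
        # `deep` is accepted for API compatibility; this sketch has no nested estimators.
        return {"bigram_freq": self.bigram_freq, "verbose": self.verbose}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter: {name!r}")
            setattr(self, name, value)
        return self  # returning self allows chained calls


est = SketchEstimator()
params = est.set_params(bigram_freq=0.01).get_params()
```

After the chained call, `params` maps `bigram_freq` to the new value and leaves `verbose` at its default.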
transform
Tokenize the text.
| Parameters: |
X: dict, list, tuple, np.ndarray or pd.DataFrame
y: Does nothing. Implemented for continuity of the API. |
| Returns: |
X: pd.DataFrame |
Example
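# Through atom: tokenize the Corpus column of the dataset.
# X and y are assumed to be already loaded.
from atom import ATOMClassifier
atom = ATOMClassifier(X, y)
atom.tokenize(bigram_freq=0.01)

# Stand-alone: apply the transformer directly to X.
from atom.nlp import Tokenizer
tokenizer = Tokenizer(bigram_freq=0.01)
X = tokenizer.transform(X)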