TextCleaner
Applies standard text cleaning to the corpus. Transformations include
normalizing characters and dropping noise from the text (emails, HTML
tags, URLs, etc.). The transformations are applied on the column
named Corpus, in the same order the parameters are presented. If
there is no column with that name, an exception is raised. This class
can be accessed from atom through the textclean method. Read more in
the user guide.
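The ordering behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the `clean_corpus` function, the `STEPS` table, and every regex pattern below are hypothetical, and the data is modelled as a plain dict of columns rather than a DataFrame.

```python
import re
import string
import unicodedata

CORPUS_COLUMN = "Corpus"  # column name as given in the description above


def _decode(text):
    # Map unicode characters to their closest ASCII representation.
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


def _drop_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


# Hypothetical patterns for illustration only; not the library's defaults.
# The list order mirrors the order of the parameters.
STEPS = [
    ("decode", _decode),
    ("lower_case", str.lower),
    ("drop_email", lambda s: re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "", s)),
    ("drop_url", lambda s: re.sub(r"https?://\S+|www\.\S+", "", s)),
    ("drop_html", lambda s: re.sub(r"<[^>]*>", "", s)),
    ("drop_punctuation", _drop_punctuation),
]


def clean_corpus(data, **flags):
    """Apply each enabled step, in order, to the corpus column of `data`."""
    if CORPUS_COLUMN not in data:
        raise ValueError(f"No column named {CORPUS_COLUMN!r} found.")
    docs = list(data[CORPUS_COLUMN])
    for name, step in STEPS:
        if flags.get(name, True):  # every step is enabled unless switched off
            docs = [step(doc) for doc in docs]
    return {**data, CORPUS_COLUMN: docs}


cleaned = clean_corpus({"Corpus": ["Caf\u00e9 <b>MAIL</b> me@example.com now!"]})
print(cleaned["Corpus"][0].split())
```

The order matters in this sketch: if punctuation were dropped before emails, the `@` and `.` characters would already be gone and the email pattern would no longer match.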
Parameters: |
decode: bool, optional (default=True)
lower_case: bool, optional (default=True)
drop_email: bool, optional (default=True)
regex_email: str, optional (default=None)
drop_url: bool, optional (default=True)
regex_url: str, optional (default=None)
drop_html: bool, optional (default=True)
regex_html: str, optional (default=None)
drop_emoji: bool, optional (default=True)
regex_emoji: str, optional (default=None)
drop_number: bool, optional (default=False)
regex_number: str, optional (default=None)
drop_punctuation: bool, optional (default=True)
verbose: Verbosity level of the class.
|
Attributes
Attributes: |
drops: pd.DataFrame Encountered regex matches. The row indices correspond to the document index from which the occurrence was dropped. |
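As a rough picture of what such a record of dropped matches could look like, the sketch below removes regex matches from each document and stores them under the index of the document they came from. The `drop_and_record` helper is hypothetical and uses a plain dict instead of a pd.DataFrame; it only illustrates the idea of indexing drops by document.

```python
import re


def drop_and_record(docs, pattern):
    """Remove matches of `pattern` from each document and record them."""
    regex = re.compile(pattern)
    drops = {}  # document index -> list of matched strings
    cleaned = []
    for idx, doc in enumerate(docs):
        matches = [m.group(0) for m in regex.finditer(doc)]
        if matches:
            drops[idx] = matches
        cleaned.append(regex.sub("", doc))
    return cleaned, drops


docs = ["call 911 now", "no digits here", "room 42, floor 7"]
cleaned, drops = drop_and_record(docs, r"\d+")
print(drops)
```

Documents without any match do not appear in the record, so the keys identify exactly the documents that were modified.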
Methods
fit_transform | Same as transform. |
get_params | Get parameters for this estimator. |
log | Write information to the logger and print to stdout. |
save | Save the instance to a pickle file. |
set_params | Set the parameters of this estimator. |
transform | Transform the text. |
fit_transform
Apply text cleaning.
Parameters: |
X: dict, list, tuple, np.ndarray or pd.DataFrame
y: Does nothing. Implemented for continuity of the API. |
Returns: |
X: pd.DataFrame |
get_params
Get parameters for this estimator.
Parameters: |
deep: bool, optional (default=True) |
Returns: |
params: dict Dictionary of the parameter names mapped to their values. |
log
Write a message to the logger and print it to stdout.
Parameters: |
msg: str
level: int, optional (default=0) |
save
Save the instance to a pickle file.
Parameters: |
filename: str, optional (default="auto") Name of the file. Use "auto" for automatic naming. |
set_params
Set the parameters of this estimator.
Parameters: |
**params: dict Estimator parameters. |
Returns: |
self: TextCleaner Estimator instance. |
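The get_params and set_params pair follows the familiar estimator convention: parameters are read back as a dict and updated by keyword, with the instance returned so calls can be chained. The sketch below shows that pattern on a hypothetical `MiniCleaner` class; it is a stand-in written for illustration and does not reproduce the library's code.

```python
import inspect


class MiniCleaner:
    """Hypothetical stand-in with two of the parameters listed above."""

    def __init__(self, lower_case=True, drop_number=False):
        self.lower_case = lower_case
        self.drop_number = drop_number

    def get_params(self, deep=True):
        # `deep` is accepted for signature compatibility only; this sketch
        # has no nested estimators to recurse into.
        names = inspect.signature(type(self).__init__).parameters
        return {name: getattr(self, name) for name in names if name != "self"}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}.")
            setattr(self, name, value)
        return self  # returning the instance allows chaining


cleaner = MiniCleaner().set_params(drop_number=True)
print(cleaner.get_params())
```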
transform
Apply text cleaning.
Parameters: |
X: dict, list, tuple, np.ndarray or pd.DataFrame
y: Does nothing. Implemented for continuity of the API. |
Returns: |
X: pd.DataFrame |
Example
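# Through an atom instance. X and y are the user's own data and target.
from atom import ATOMClassifier
atom = ATOMClassifier(X, y)
atom.textclean()

# As a stand-alone transformer.
from atom.nlp import TextCleaner
cleaner = TextCleaner()
X = cleaner.transform(X)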