
TextCleaner


class atom.nlp.TextCleaner(decode=True, lower_case=True, drop_email=True, regex_email=None, drop_url=True, regex_url=None, drop_html=True, regex_html=None, drop_emoji=True, regex_emoji=None, drop_number=True, regex_number=None, drop_punctuation=True, verbose=0, logger=None) [source]

Applies standard text cleaning to the corpus. Transformations include normalizing characters and dropping noise (emails, HTML tags, URLs, etc.) from the text. The transformations are applied to the column named corpus, in the same order the parameters are presented. If there is no column with that name, an exception is raised. This class can be accessed from atom through the textclean method. Read more in the user guide.
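
As an illustration only (not the library's implementation), the cleaning steps can be pictured with plain regular expressions, using the default patterns listed under Parameters below:

import re
import string

document = "Mail john.doe@example.com, see https://example.com <b>now</b> :smile: 42!"

# Default patterns taken from the parameter descriptions below
patterns = [
    r"[\w.-]+@[\w-]+\.[\w.-]+",  # drop_email
    r"https?://\S+|www\.\S+",    # drop_url
    r"<.*?>",                    # drop_html
    r":[a-z_]+:",                # drop_emoji
    r"\b\d+\b",                  # drop_number
]

cleaned = document.lower()  # lower_case=True
for pattern in patterns:
    cleaned = re.sub(pattern, "", cleaned)

# drop_punctuation=True removes the characters in string.punctuation
cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
print(cleaned)  # roughly: "mail  see  now"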

Parameters:

decode: bool, optional (default=True)
Whether to decode Unicode characters to their ASCII representations.

lower_case: bool, optional (default=True)
Whether to convert all characters to lower case.

drop_email: bool, optional (default=True)
Whether to drop email addresses from the text.

regex_email: str, optional (default=None)
Regex used to search for email addresses. If None, it uses r"[\w.-]+@[\w-]+\.[\w.-]+".

drop_url: bool, optional (default=True)
Whether to drop URL links from the text.

regex_url: str, optional (default=None)
Regex used to search for URLs. If None, it uses r"https?://\S+|www\.\S+".

drop_html: bool, optional (default=True)
Whether to drop HTML tags from the text. This option is particularly useful if the data was scraped from a website.

regex_html: str, optional (default=None)
Regex used to search for HTML tags. If None, it uses r"<.*?>".

drop_emoji: bool, optional (default=True)
Whether to drop emojis from the text.

regex_emoji: str, optional (default=None)
Regex used to search for emojis. If None, it uses r":[a-z_]+:".

drop_number: bool, optional (default=True)
Whether to drop numbers from the text.

regex_number: str, optional (default=None)
Regex used to search for numbers. If None, it uses r"\b\d+\b". Note that numbers adjacent to letters are not removed.

drop_punctuation: bool, optional (default=True)
Whether to drop punctuation from the text. Characters considered punctuation are !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.

verbose: int, optional (default=0)
Verbosity level of the class. Possible values are:
  • 0 to not print anything.
  • 1 to print basic information.
  • 2 to print detailed information.

logger: str, Logger or None, optional (default=None)
  • If None: Doesn't save a logging file.
  • If str: Name of the log file. Use "auto" for automatic naming.
  • Else: Python logging.Logger instance.


Attributes

drops: pd.DataFrame
Encountered regex matches. The row indices correspond to the document index from which the occurrence was dropped.
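
A minimal usage sketch; the exact columns of drops are not specified here, only that its row indices follow the document indices:

import pandas as pd
from atom.nlp import TextCleaner

X = pd.DataFrame({"corpus": [
    "Mail john.doe@example.com for details",
    "Visit https://example.com today!",
]})

cleaner = TextCleaner()
X = cleaner.transform(X)

# Row indices correspond to the documents the matches were dropped from
print(cleaner.drops)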


Methods

fit_transform: Same as transform.
get_params: Get parameters for this estimator.
log: Write information to the logger and print to stdout.
save: Save the instance to a pickle file.
set_params: Set the parameters of this estimator.
transform: Transform the text.


method fit_transform(X, y=None) [source]

Apply text cleaning.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features). If X is not a pd.DataFrame, it should be composed of a single feature containing the text documents.

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns:

X: pd.DataFrame
Transformed corpus.
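
A short sketch of the non-DataFrame path described above; passing the documents as a plain single-feature sequence is assumed here to count as "dataframe-like":

from atom.nlp import TextCleaner

docs = ["First document, see https://example.com", "Second document with 42 numbers!"]

cleaner = TextCleaner()
corpus = cleaner.fit_transform(docs)  # y is ignored
print(corpus)  # pd.DataFrame holding the cleaned documents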


method get_params(deep=True) [source]

Get parameters for this estimator.

Parameters:

deep: bool, optional (default=True)
If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns: dict
Parameter names mapped to their values.
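
A short usage sketch of the scikit-learn-style parameter access:

from atom.nlp import TextCleaner

cleaner = TextCleaner(drop_punctuation=False)
print(cleaner.get_params()["drop_punctuation"])  # False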


method log(msg, level=0) [source]

Write a message to the logger and print it to stdout.

Parameters:

msg: str
Message to write to the logger and print to stdout.

level: int, optional (default=0)
Minimum verbosity level to print the message.
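
A minimal sketch; the message is printed when the instance's verbosity is at least the given level, and written to the log file when a logger is configured:

from atom.nlp import TextCleaner

cleaner = TextCleaner(verbose=1, logger="auto")
cleaner.log("Cleaning the corpus...", level=1)  # printed since verbose >= level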


method save(filename="auto") [source]

Save the instance to a pickle file.

Parameters:

filename: str, optional (default="auto")
Name of the file. Use "auto" for automatic naming.
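
A minimal sketch of saving and reloading the transformer; reloading with the standard pickle module is an assumption, as only the save method is documented here:

import pickle

from atom.nlp import TextCleaner

cleaner = TextCleaner()
cleaner.save("text_cleaner.pkl")  # "auto" would derive the name from the class instead

# Assumption: the saved file is a plain pickle of the instance
with open("text_cleaner.pkl", "rb") as f:
    cleaner = pickle.load(f)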


method set_params(**params) [source]

Set the parameters of this estimator.

Parameters:

**params: dict
Estimator parameters.

Returns: TextCleaner
Estimator instance.
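
Since set_params returns the estimator instance, calls can be chained; a short sketch:

from atom.nlp import TextCleaner

cleaner = TextCleaner().set_params(lower_case=False, drop_number=False)
print(cleaner.get_params()["lower_case"])  # False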


method transform(X, y=None) [source]

Apply text cleaning.

Parameters:

X: dataframe-like
Feature set with shape=(n_samples, n_features). If X is not a pd.DataFrame, it should be composed of a single feature containing the text documents.

y: int, str, sequence or None, optional (default=None)
Does nothing. Implemented for continuity of the API.

Returns:

X: pd.DataFrame
Transformed corpus.


Example

from atom import ATOMClassifier

# X is the feature set (with a text column named "corpus") and y the target
atom = ATOMClassifier(X, y)
atom.textclean()

# Or use the class directly, outside of atom
from atom.nlp import TextCleaner

cleaner = TextCleaner()
X = cleaner.transform(X)