TextCleaner
Applies standard text cleaning to the corpus.
Transformations include normalizing characters and dropping
noise from the text (emails, HTML tags, URLs, etc.). The
transformations are applied to the column named corpus, in
the same order the parameters are presented. If there is no
column with that name, an exception is raised.
This class can be accessed from atom through the textclean method. Read more in the user guide.
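The idea of applying the cleaning steps one after another, in a fixed order, can be sketched with the standard library alone. This is a rough illustration, not the library's implementation: the function name clean_document and all regular expressions below are assumptions made for this example, and the patterns TextCleaner actually uses may differ.

```python
import re
import unicodedata

# Illustrative patterns only (assumptions for this sketch, not ATOM's own regexes).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"https?://\S+|www\.\S+")
HTML = re.compile(r"<[^>]*>")
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\b")
PUNCT = re.compile(r"[^\w\s]")


def clean_document(text: str) -> str:
    """Hypothetical helper: apply each cleaning step in a fixed order."""
    # Decode unicode characters to their closest ascii form ("é" becomes "e").
    # In this sketch, this step also discards unicode emojis, since they have
    # no ascii equivalent.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Convert to lower case.
    text = text.lower()
    # Drop emails, URLs and HTML tags while their punctuation is still intact.
    text = EMAIL.sub("", text)
    text = URL.sub("", text)
    text = HTML.sub("", text)
    # Drop numbers, then whatever punctuation is left.
    text = NUMBER.sub("", text)
    text = PUNCT.sub("", text)
    return text


sample = (
    "From: Café <b>Owner</b> jane.doe@example.com, "
    "see https://example.com/menu for 25 items!"
)
# Split on whitespace because the removed tokens leave extra spaces behind.
print(clean_document(sample).split())
# ['from', 'cafe', 'owner', 'see', 'for', 'items']
```

The order matters in this sketch: if punctuation were stripped first, the "@" and "://" markers would be gone and the email and URL patterns would no longer match.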
Example
>>> import numpy as np
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import fetch_20newsgroups
>>> X, y = fetch_20newsgroups(
...     return_X_y=True,
...     categories=["alt.atheism", "sci.med", "comp.windows.x"],
...     shuffle=True,
...     random_state=1,
... )
>>> X = np.array(X).reshape(-1, 1)
>>> atom = ATOMClassifier(X, y, random_state=1)
>>> print(atom.dataset)
corpus target
0 From: fabian@vivian.w.open.de (Fabian Hoppe)\n... 1
1 From: nyeda@cnsvax.uwec.edu (David Nye)\nSubje... 0
2 From: urathi@net4.ICS.UCI.EDU (Unmesh Rathi)\n... 1
3 From: inoue@crd.yokogawa.co.jp (Inoue Takeshi)... 1
4 From: sandvik@newton.apple.com (Kent Sandvik)\... 0
... ... ...
1662 From: kutluk@ccl.umist.ac.uk (Kutluk Ozguven)\... 0
1663 From: dmp1@ukc.ac.uk (D.M.Procida)\nSubject: R... 2
1664 From: tdunbar@vtaix.cc.vt.edu (Thomas Dunbar)\... 1
1665 From: dmp@fig.citib.com (Donna M. Paino)\nSubj... 2
1666 From: cdm@pmafire.inel.gov (Dale Cook)\nSubjec... 2
[1667 rows x 2 columns]
>>> atom.textclean(verbose=2)
Fitting TextCleaner...
Cleaning the corpus...
--> Decoding unicode characters to ascii.
--> Converting text to lower case.
--> Dropping emails from documents.
--> Dropping URL links from documents.
--> Dropping HTML tags from documents.
--> Dropping emojis from documents.
--> Dropping numbers from documents.
--> Dropping punctuation from the text.
>>> print(atom.dataset)
corpus target
0 from fabian hoppe\nsubject searching cadsoftw... 1
1 from david nye\nsubject re after years can w... 0
2 from unmesh rathi\nsubject motif and intervie... 1
3 from inoue takeshi\nsubject how to see charac... 1
4 from kent sandvik\nsubject re slavery was re ... 0
... ... ...
1662 from kutluk ozguven\nsubject re jewish settle... 0
1663 from dmprocida\nsubject re homeopathy a respe... 2
1664 from thomas dunbar\nsubject re x toolkits\nsu... 1
1665 from donna m paino\nsubject psoriatic arthrit... 2
1666 from dale cook\nsubject re morbus meniere is... 2
[1667 rows x 2 columns]
Stand-alone, without atom:
>>> import numpy as np
>>> from atom.nlp import TextCleaner
>>> from sklearn.datasets import fetch_20newsgroups
>>> X, y = fetch_20newsgroups(
...     return_X_y=True,
...     categories=["alt.atheism", "sci.med", "comp.windows.x"],
...     shuffle=True,
...     random_state=1,
... )
>>> X = np.array(X).reshape(-1, 1)
>>> textcleaner = TextCleaner(verbose=2)
>>> X = textcleaner.transform(X)
Cleaning the corpus...
--> Decoding unicode characters to ascii.
--> Converting text to lower case.
--> Dropping emails from documents.
--> Dropping URL links from documents.
--> Dropping HTML tags from documents.
--> Dropping emojis from documents.
--> Dropping numbers from documents.
--> Dropping punctuation from the text.
>>> print(X)
corpus
0 from mark a deloura\nsubject looking for x wi...
1 from der mouse\nsubject re creating bit wind...
2 from keith m ryan\nsubject re where are they ...
3 from steven grimm\nsubject re opinions on all...
4 from peter kaminski\nsubject re krillean phot...
... ...
1662 from donald mackie \nsubject re seeking advice...
1663 from gordon banks\nsubject re update help was...
1664 from keith m ryan\nsubject re political athei...
1665 from benedikt rosenau\nsubject re biblical ra...
1666 from derrick j brashear \nsubject mouseless op...
[1667 rows x 1 columns]
Methods
fit | Do nothing. |
fit_transform | Fit to data, then transform it. |
get_feature_names_out | Get output feature names for transformation. |
get_params | Get parameters for this estimator. |
inverse_transform | Do nothing. |
set_output | Set output container. |
set_params | Set the parameters of this estimator. |
transform | Apply the transformations to the data. |
fit
Do nothing.
Implemented for continuity of the API.
fit_transform
Fit to data, then transform it.
get_feature_names_out
Get output feature names for transformation.
get_params
Get parameters for this estimator.
Parameters
deep : bool, default=True
    If True, will return the parameters for this estimator and
    contained subobjects that are estimators.
Returns
params : dict
    Parameter names mapped to their values.
inverse_transform
Do nothing.
Returns the input unchanged. Implemented for continuity of the API.
set_output
Set output container.
See sklearn's user guide on how to use the set_output API,
including a description of the choices.
set_params
Set the parameters of this estimator.
Parameters
**params : dict
    Estimator parameters.
Returns
self : estimator instance
    Estimator instance.
transform
Apply the transformations to the data.
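The method contracts described above (a fit that learns nothing, an inverse_transform that returns its input unchanged, and get_params / set_params for inspecting and updating parameters) can be illustrated with a toy class. This is a minimal sketch under stated assumptions: ToyCleaner and its lower_case parameter are hypothetical names invented for this example, it is not ATOM's TextCleaner, and having fit return self follows the usual scikit-learn convention rather than anything stated on this page.

```python
class ToyCleaner:
    """Hypothetical stateless transformer (not ATOM's TextCleaner)."""

    def __init__(self, lower_case=True, verbose=0):
        self.lower_case = lower_case  # hypothetical parameter for this sketch
        self.verbose = verbose

    def get_params(self, deep=True):
        # Parameter names mapped to their values. `deep` is accepted for
        # signature parity only; this toy class holds no nested estimators.
        return {"lower_case": self.lower_case, "verbose": self.verbose}

    def set_params(self, **params):
        # Update known parameters and return the instance itself.
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}.")
            setattr(self, name, value)
        return self

    def fit(self, X=None, y=None):
        # Nothing is learned from the data; present for API continuity.
        return self

    def transform(self, X):
        # Apply the (single) transformation to every document.
        return [doc.lower() if self.lower_case else doc for doc in X]

    def fit_transform(self, X, y=None):
        # Fit to data, then transform it.
        return self.fit(X, y).transform(X)

    def inverse_transform(self, X):
        # Cleaning cannot be undone, so the input is returned unchanged.
        return X


docs = ["Hello World"]
cleaner = ToyCleaner().set_params(verbose=1)
print(cleaner.get_params())                  # {'lower_case': True, 'verbose': 1}
print(cleaner.fit_transform(docs))           # ['hello world']
print(cleaner.inverse_transform(docs) is docs)  # True
```

Because no state is learned in fit, calling transform directly (as the stand-alone example does) gives the same result as fit_transform in this sketch.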