TextCleaner
class atom.nlp.TextCleaner(decode=True, lower_case=True, drop_email=True, regex_email=None, drop_url=True, regex_url=None, drop_html=True, regex_html=None, drop_emoji=True, regex_emoji=None, drop_number=True, regex_number=None, drop_punctuation=True, verbose=0, logger=None)[source]
Applies standard text cleaning to the corpus.
Transformations include normalizing characters and dropping
noise from the text (emails, HTML tags, URLs, etc.). The
transformations are applied on the column named corpus, in
the same order the parameters are presented. If there is no
column with that name, an exception is raised.
This class can be accessed from atom through the textclean method. Read more in the user guide.
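For example, individual cleaning steps can be switched off or given a custom regex through the corresponding parameters. A minimal sketch; the pattern shown is only an illustration, not the class' default:
>>> from atom.nlp import TextCleaner

>>> # Keep punctuation and numbers, and use a custom pattern for URL removal
>>> cleaner = TextCleaner(
...     drop_punctuation=False,
...     drop_number=False,
...     regex_url=r"https?://\S+",  # illustrative pattern, not the built-in default
... )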
See Also
TextNormalizer
Normalize the corpus.
Tokenizer
Tokenize the corpus.
Vectorizer
Vectorize text data.
Example
>>> import numpy as np
>>> from atom import ATOMClassifier
>>> from sklearn.datasets import fetch_20newsgroups
>>> X, y = fetch_20newsgroups(
... return_X_y=True,
... categories=[
... 'alt.atheism',
... 'sci.med',
... 'comp.windows.x',
... ],
... shuffle=True,
... random_state=1,
... )
>>> X = np.array(X).reshape(-1, 1)
>>> atom = ATOMClassifier(X, y)
>>> print(atom.dataset)
corpus target
0 From: thssjxy@iitmax.iit.edu (Smile) Subject:... 2
1 From: nancyo@fraser.sfu.ca (Nancy Patricia O'C... 0
2 From: beck@irzr17.inf.tu-dresden.de (Andre Bec... 1
3 From: keith@cco.caltech.edu (Keith Allan Schne... 0
4 From: strom@Watson.Ibm.Com (Rob Strom) Subjec... 0
... ...
2841 From: dreitman@oregon.uoregon.edu (Daniel R. R... 3
2842 From: ethan@cs.columbia.edu (Ethan Solomita) ... 1
2843 From: r0506048@cml3 (Chun-Hung Lin) Subject: ... 1
2844 From: eshneken@ux4.cso.uiuc.edu (Edward A Shne... 2
2845 From: ibeshir@nyx.cs.du.edu (Ibrahim) Subject... 2
[2846 rows x 2 columns]
>>> atom.textclean(verbose=2)
Fitting TextCleaner...
Cleaning the corpus...
--> Decoding unicode characters to ascii.
--> Converting text to lower case.
--> Dropping 10012 emails from 2830 documents.
--> Dropping 0 URL links from 0 documents.
--> Dropping 2214 HTML tags from 1304 documents.
--> Dropping 2 emojis from 1 documents.
--> Dropping 31222 numbers from 2843 documents.
--> Dropping punctuation from the text.
>>> print(atom.dataset)
corpus target
0 from smile subject forsale used guitar amp... 2
1 from nancy patricia oconnor subject re amusi... 0
2 from andre beck subject re animation with xp... 1
3 from keith allan schneider subject re moralt... 0
4 from rob strom subject re socmotss et al pri... 0
... ...
2841 from daniel r reitman attorney to be subject... 3
2842 from ethan solomita subject forcing a window... 1
2843 from r0506048cml3 chunhung lin subject re xma... 1
2844 from edward a shnekendorf subject airline ti... 2
2845 from ibrahim subject terminal for sale orga... 2
[2846 rows x 2 columns]
>>> import numpy as np
>>> from atom.nlp import TextCleaner
>>> from sklearn.datasets import fetch_20newsgroups
>>> X, y = fetch_20newsgroups(
... return_X_y=True,
... categories=[
... 'alt.atheism',
... 'sci.med',
... 'comp.windows.x',
... ],
... shuffle=True,
... random_state=1,
... )
>>> X = np.array(X).reshape(-1, 1)
>>> textcleaner = TextCleaner(verbose=2)
>>> X = textcleaner.transform(X)
Cleaning the corpus...
--> Decoding unicode characters to ascii.
--> Converting text to lower case.
--> Dropping 10012 emails from 2830 documents.
--> Dropping 0 URL links from 0 documents.
--> Dropping 2214 HTML tags from 1304 documents.
--> Dropping 2 emojis from 1 documents.
--> Dropping 31222 numbers from 2843 documents.
--> Dropping punctuation from the text.
>>> print(X)
corpus
0 from donald mackie subject re barbecued food...
1 from david stockton subject re krillean phot...
2 from julia miller subject posix message cata...
3 from subject re yet more rushdie re islamic...
4 from joseph a muller subject jfk autograph f...
...
2841 from joel reymont subject motif maling list\...
2842 from daniel paul checkman subject re is msg ...
2843 from ad absurdum per aspera subject re its a...
2844 from ralf subject items for sale organizati...
2845 from walter g seefeld subject klipsch kg1 sp...
[2846 rows x 1 columns]
Methods
fit | Does nothing.
fit_transform | Fit to data, then transform it.
get_params | Get parameters for this estimator.
inverse_transform | Does nothing.
log | Print message and save to log file.
save | Save the instance to a pickle file.
set_params | Set the parameters of this estimator.
transform | Apply the transformations to the data.
method fit(X=None, y=None, **fit_params)[source]
Does nothing.
Implemented for continuity of the API.
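Since fit is a no-op, fitting beforehand is optional. A minimal sketch, assuming X is a dataframe or array holding the corpus as in the examples above:
>>> from atom.nlp import TextCleaner

>>> cleaner = TextCleaner()
>>> X_clean = cleaner.fit_transform(X)  # identical to cleaner.transform(X), since fit does nothing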
method fit_transform(X=None, y=None, **fit_params)[source]
Fit to data, then transform it.
method get_params(deep=True)[source]
Get parameters for this estimator.
Parameters
deep: bool, default=True
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
Returns
params: dict
Parameter names mapped to their values.
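A quick sketch of the expected usage; the returned keys mirror the constructor parameters shown at the top of this page:
>>> from atom.nlp import TextCleaner

>>> cleaner = TextCleaner(drop_url=False)
>>> cleaner.get_params()["drop_url"]
False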
method inverse_transform(X=None, y=None)[source]
Does nothing.
method log(msg, level=0, severity="info")[source]
Print message and save to log file.
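A minimal sketch, assuming the logger parameter of the constructor accepts the name of a log file:
>>> from atom.nlp import TextCleaner

>>> cleaner = TextCleaner(logger="textcleaner_log")  # assumed: a string creates a log file with that name
>>> cleaner.log("Starting text cleaning...", level=0)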
method save(filename="auto", save_data=True)[source]
Save the instance to a pickle file.
Parameters
filename: str, default="auto"
Name of the file. Use "auto" for automatic naming.
save_data: bool, default=True
Whether to save the dataset with the instance. This
parameter is ignored if the method is not called from
atom. If False, remember to add the data to ATOMLoader
when loading the file.
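A minimal sketch; file-naming details such as the extension are assumptions, not taken from this page:
>>> from atom.nlp import TextCleaner

>>> cleaner = TextCleaner()
>>> cleaner.save(filename="text_cleaner")  # pickles the instance to disk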
method set_params(**params)[source]
Set the parameters of this estimator.
Parameters
**params: dict
Estimator parameters.
Returns
self: estimator instance
Estimator instance.
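A quick sketch; any of the constructor parameters can be updated this way:
>>> from atom.nlp import TextCleaner

>>> cleaner = TextCleaner()
>>> _ = cleaner.set_params(drop_number=False, drop_emoji=False)
>>> cleaner.get_params()["drop_number"]
False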
method transform(X, y=None)[source]
Apply the transformations to the data.
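A minimal sketch of calling transform directly on a small, made-up corpus (the example data is illustrative only):
>>> import pandas as pd
>>> from atom.nlp import TextCleaner

>>> X = pd.DataFrame({"corpus": [
...     "Contact me at john@example.com!",
...     "Visit https://example.com <b>now</b> :)",
... ]})
>>> X_clean = TextCleaner(verbose=0).transform(X)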