Example: NLP
This example shows how to use ATOM to quickly go from raw text data to model predictions.
Import the 20 newsgroups text dataset from sklearn.datasets. The dataset comprises around 18,000 newsgroup posts on 20 topics. The goal is to predict the topic of each article.
Load the data
In [1]:
import numpy as np
from atom import ATOMClassifier
from sklearn.datasets import fetch_20newsgroups
UserWarning: The pandas version installed (1.5.3) does not match the supported pandas version in Modin (1.5.2). This may cause undesired side effects!
In [2]:
# Use only a subset of the available topics for faster processing
X_text, y_text = fetch_20newsgroups(
    return_X_y=True,
    categories=[
        'sci.med',
        'comp.windows.x',
        'misc.forsale',
        'rec.autos',
    ],
    shuffle=True,
    random_state=1,
)
X_text = np.array(X_text).reshape(-1, 1)
Run the pipeline
In [3]:
atom = ATOMClassifier(X_text, y_text, index=True, test_size=0.3, verbose=2, random_state=1)
<< ================== ATOM ================== >>
Algorithm task: multiclass classification.

Dataset stats ==================== >>
Shape: (2366, 2)
Train set size: 1657
Test set size: 709
-------------------------------------
Memory: 4.14 MB
Scaled: False
Categorical features: 1 (100.0%)
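Under the hood, test_size=0.3 holds out 30% of the rows for testing. A minimal sketch of the equivalent manual step with scikit-learn (a plain shuffled holdout; ATOM's internal splitting may differ in details):

from sklearn.model_selection import train_test_split

# Rough equivalent of test_size=0.3 (assumption: a simple shuffled holdout)
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y_text, test_size=0.3, random_state=1
)
print(len(X_train), len(X_test))  # roughly 1657 / 709, as in the stats above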
In [4]:
atom.dataset # Note that the feature is automatically named 'corpus'
Out[4]:
| | corpus | target |
|---|---|---|
| 1731 | From: rlm@helen.surfcty.com (Robert L. McMilli... | 0 |
| 1496 | From: carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick... | 3 |
| 1290 | From: thssjxy@iitmax.iit.edu (Smile)\nSubject:... | 1 |
| 2021 | From: c23st@kocrsv01.delcoelect.com (Spiros Tr... | 2 |
| 142 | From: ginkgo@ecsvax.uncecs.edu (J. Geary Morto... | 1 |
| ... | ... | ... |
| 510 | From: mary@uicsl.csl.uiuc.edu (Mary E. Allison... | 3 |
| 1948 | From: ndd@sunbar.mc.duke.edu (Ned Danieley)\nS... | 0 |
| 798 | From: kk@unisql.UUCP (Kerry Kimbrough)\nSubjec... | 0 |
| 2222 | From: hamachi@adobe.com (Gordon Hamachi)\nSubj... | 2 |
| 2215 | From: mobasser@vu-vlsi.ee.vill.edu (Bijan Moba... | 2 |

2366 rows × 2 columns
In [5]:
# Let's have a look at the first document
atom.corpus[0]
Out[5]:
'From: caf@omen.UUCP (Chuck Forsberg WA7KGX)\nSubject: Re: My New Diet --> IT WORKS GREAT !!!!\nOrganization: Omen Technology INC, Portland Rain Forest\nLines: 32\n\nIn article <1qk6v3INNrm6@lynx.unm.edu> bhjelle@carina.unm.edu () writes:\n>\n>Gordon Banks:\n>\n>>a lot to keep from going back to morbid obesity. I think all\n>>of us cycle. One\'s success depends on how large the fluctuations\n>>in the cycle are. Some people can cycle only 5 pounds. Unfortunately,\n>>I\'m not one of them.\n>>\n>>\n>This certainly describes my situation perfectly. For me there is\n>a constant dynamic between my tendency to eat, which appears to\n>be totally limitless, and the purely conscious desire to not\n>put on too much weight. When I get too fat, I just diet/exercise\n>more (with varying degrees of success) to take off the\n>extra weight. Usually I cycle within a 15 lb range, but\n>smaller and larger cycles occur as well. I\'m always afraid\n>that this method will stop working someday, but usually\n>I seem to be able to hold the weight gain in check.\n>This is one reason I have a hard time accepting the notion\n>of some metabolic derangement associated with cycle dieting\n>(that results in long-term weight gain). I have been cycle-\n>dieting for at least 20 years without seeing such a change.\n\nAs mentioned in Adiposity 101, only some experience weight\nrebound. The fact that you don\'t doesn\'t prove it doesn\'t\nhappen to others.\n-- \nChuck Forsberg WA7KGX ...!tektronix!reed!omen!caf \nAuthor of YMODEM, ZMODEM, Professional-YAM, ZCOMM, and DSZ\n Omen Technology Inc "The High Reliability Software"\n17505-V NW Sauvie IS RD Portland OR 97231 503-621-3406\n'
In [6]:
# Remove noise from the documents (emails, numbers, etc.)
atom.textclean()
Fitting TextCleaner...
Cleaning the corpus...
 --> Decoding unicode characters to ascii.
 --> Converting text to lower case.
 --> Dropping 8115 emails from 2352 documents.
 --> Dropping 0 URL links from 0 documents.
 --> Dropping 1619 HTML tags from 964 documents.
 --> Dropping 2 emojis from 1 documents.
 --> Dropping 29292 numbers from 2363 documents.
 --> Dropping punctuation from the text.
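For intuition, the cleaning steps above map to simple string operations. Below is a minimal sketch using the standard library, with simplified regex patterns of our own; it is not ATOM's actual implementation:

import re
import string

def clean(doc):
    """Simplified version of the cleaning steps logged above."""
    doc = doc.encode("ascii", "ignore").decode()  # decode unicode to ascii
    doc = doc.lower()                             # convert to lower case
    doc = re.sub(r"\S+@\S+", "", doc)             # drop emails
    doc = re.sub(r"https?://\S+", "", doc)        # drop URL links
    doc = re.sub(r"<.*?>", "", doc)               # drop HTML tags
    doc = re.sub(r"\d+", "", doc)                 # drop numbers
    return doc.translate(str.maketrans("", "", string.punctuation))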
In [7]:
# Have a look at the removed items
atom.drops
Out[7]:
| | email | url | html | emoji | number |
|---|---|---|---|---|---|
| 1731 | [rlm@helen.surfcty.com, rlm@helen.surfcty.com] | NaN | [<std.disclaimer.h>] | NaN | [8] |
| 1496 | [carl@sol1.gps.caltech.edu, carl@sol1.gps.calt... | NaN | [<>] | NaN | [28] |
| 1290 | [thssjxy@iitmax.iit.edu, thssjxy@iitmax.acc.ii... | NaN | NaN | NaN | [223158, 15645, 14, 80, 150] |
| 2021 | [c23st@kocrsv01.delcoelect.com, c4wjgq.a40@con... | NaN | [<>] | NaN | [10, 21, 6, 317, 451, 0815, 46904] |
| 142 | [ginkgo@ecsvax.uncecs.edu, ginkgo@uncecs.edu] | NaN | [<>] | NaN | [95, 17, 95, 95, 95, 100, 00, 919, 851, 6565, ... |
| ... | ... | ... | ... | ... | ... |
| 403 | NaN | NaN | NaN | NaN | [223, 250, 10, 8, 8, 2002, 1600] |
| 1634 | NaN | NaN | NaN | NaN | [15, 1, 1] |
| 1262 | NaN | NaN | NaN | NaN | [38, 84] |
| 1360 | NaN | NaN | NaN | NaN | [27, 15, 27, 225, 250, 412, 624, 6115, 371, 0154] |
| 211 | NaN | NaN | NaN | NaN | [13, 93, 212, 274, 0646, 1097, 08836, 908, 563... |

2366 rows × 5 columns
In [8]:
# Check how the first document changed
atom.corpus[0]
Out[8]:
'from chuck forsberg wa7kgx\nsubject re my new diet it works great \norganization omen technology inc portland rain forest\nlines \n\nin article writes\n\ngordon banks\n\na lot to keep from going back to morbid obesity i think all\nof us cycle ones success depends on how large the fluctuations\nin the cycle are some people can cycle only pounds unfortunately\nim not one of them\n\n\nthis certainly describes my situation perfectly for me there is\na constant dynamic between my tendency to eat which appears to\nbe totally limitless and the purely conscious desire to not\nput on too much weight when i get too fat i just dietexercise\nmore with varying degrees of success to take off the\nextra weight usually i cycle within a lb range but\nsmaller and larger cycles occur as well im always afraid\nthat this method will stop working someday but usually\ni seem to be able to hold the weight gain in check\nthis is one reason i have a hard time accepting the notion\nof some metabolic derangement associated with cycle dieting\nthat results in longterm weight gain i have been cycle\ndieting for at least years without seeing such a change\n\nas mentioned in adiposity only some experience weight\nrebound the fact that you dont doesnt prove it doesnt\nhappen to others\n \nchuck forsberg wa7kgx tektronixreedomencaf \nauthor of ymodem zmodem professionalyam zcomm and dsz\n omen technology inc the high reliability software\nv nw sauvie is rd portland or \n'
In [9]:
# Convert the strings to a sequence of words
atom.tokenize()
Fitting Tokenizer...
Tokenizing the corpus...
In [10]:
# Print the first few words of the first document
atom.corpus[0][:7]
Out[10]:
['from', 'chuck', 'forsberg', 'wa7kgx', 'subject', 're', 'my']
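Tokenization itself is a standard step. A minimal sketch with nltk that produces the same kind of word list (using nltk here is our assumption; ATOM's backend may differ):

from nltk.tokenize import word_tokenize

# nltk.download("punkt")  # one-time download of the tokenizer models
word_tokenize("from chuck forsberg wa7kgx subject re my")
# ['from', 'chuck', 'forsberg', 'wa7kgx', 'subject', 're', 'my']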
In [11]:
# Normalize the text to a predefined standard
atom.textnormalize(stopwords="english", lemmatize=True)
Fitting TextNormalizer...
Normalizing the corpus...
 --> Dropping stopwords.
 --> Applying lemmatization.
In [12]:
atom.corpus[0][:7] # Check changes...
Out[12]:
['chuck', 'forsberg', 'wa7kgx', 'subject', 'new', 'diet', 'work']
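The same two steps can be sketched with nltk's English stopword list and WordNet lemmatizer (again an assumption about the backend, shown only to illustrate the transformation):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# nltk.download("stopwords") and nltk.download("wordnet") are required once
stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

tokens = ['from', 'chuck', 'forsberg', 'wa7kgx', 'subject', 're', 'my', 'new', 'diet']
[lemmatizer.lemmatize(w) for w in tokens if w not in stop]
# ['chuck', 'forsberg', 'wa7kgx', 'subject', 'new', 'diet']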
In [13]:
# Visualize the most common words with a wordcloud
atom.plot_wordcloud(figsize=(700, 500))
In [14]:
# Have a look at the most frequent bigrams
atom.plot_ngrams(2)
In [15]:
# Create the bigrams using the tokenizer
atom.tokenize(bigram_freq=215)
Fitting Tokenizer...
Tokenizing the corpus...
 --> Creating 7 bigrams on 3125 locations.
In [16]:
atom.bigrams
Out[16]:
| | bigram | frequency |
|---|---|---|
| 0 | x_x | 1169 |
| 1 | line_article | 531 |
| 2 | line_nntppostinghost | 386 |
| 3 | organization_university | 331 |
| 4 | gordon_bank | 266 |
| 5 | distribution_usa | 227 |
| 6 | line_distribution | 215 |
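With bigram_freq=215, every adjacent word pair occurring at least 215 times in the corpus is merged into a single underscore-joined token. A minimal pure-Python sketch of that idea (our own simplification, not ATOM's implementation):

from collections import Counter

def merge_bigrams(docs, min_freq):
    """Merge adjacent word pairs occurring at least min_freq times."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_freq}
    merged = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                out.append(doc[i] + "_" + doc[i + 1])  # join the frequent pair
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged.append(out)
    return merged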
In [17]:
# As a last step before modelling, convert the words to vectors
atom.vectorize(strategy="tfidf")
Fitting Vectorizer...
Vectorizing the corpus...
In [18]:
# The dimensionality of the dataset has increased a lot!
atom.shape
Out[18]:
(2366, 24344)
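The jump to 24,344 columns happens because TF-IDF creates one feature per unique word in the corpus. A minimal scikit-learn sketch of the same step (assuming strategy="tfidf" behaves like a standard TfidfVectorizer; the toy documents are our own):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["gordon banks new diet", "xterm window subject motif"]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # scipy sparse matrix, one column per word
print(tfidf.shape)               # (2, 8): 2 documents, 8 unique words
print(vec.get_feature_names_out())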
In [19]:
# Note that the data is sparse and the columns are named
# after the words they are embedding
atom.dtypes
Out[19]:
corpus_00          Sparse[float64, 0]
corpus_000         Sparse[float64, 0]
corpus_000000e5    Sparse[float64, 0]
corpus_00000ee5    Sparse[float64, 0]
corpus_000010af    Sparse[float64, 0]
                          ...
corpus_zurich      Sparse[float64, 0]
corpus_zvi         Sparse[float64, 0]
corpus_zx          Sparse[float64, 0]
corpus_zz          Sparse[float64, 0]
target                          int64
Length: 24344, dtype: object
In [20]:
# When the dataset is sparse, stats() shows the density
atom.stats()
Dataset stats ==================== >>
Shape: (2366, 24344)
Train set size: 1657
Test set size: 709
-------------------------------------
Memory: 2.54 MB
Sparse: True
Density: 0.35%
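Density is simply the fraction of non-zero cells in the matrix. A small sketch of the computation on an arbitrary scipy sparse matrix (the matrix here is randomly generated for illustration):

from scipy import sparse

X = sparse.random(2366, 24343, density=0.0035, format="csr", random_state=1)
density = X.nnz / (X.shape[0] * X.shape[1])  # non-zero cells / total cells
print(f"{density:.2%}")                      # 0.35%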
In [21]:
# Check which models have support for sparse matrices
atom.available_models()[["acronym", "model", "accepts_sparse"]]
Out[21]:
| | acronym | model | accepts_sparse |
|---|---|---|---|
| 0 | AdaB | AdaBoost | True |
| 1 | Bag | Bagging | True |
| 2 | BNB | BernoulliNB | True |
| 3 | CatB | CatBoost | True |
| 4 | CatNB | CategoricalNB | True |
| 5 | CNB | ComplementNB | True |
| 6 | Tree | DecisionTree | True |
| 7 | Dummy | Dummy | False |
| 8 | ETree | ExtraTree | True |
| 9 | ET | ExtraTrees | True |
| 10 | GNB | GaussianNB | False |
| 11 | GP | GaussianProcess | False |
| 12 | GBM | GradientBoosting | True |
| 13 | hGBM | HistGradientBoosting | False |
| 14 | KNN | KNearestNeighbors | True |
| 15 | LGB | LightGBM | True |
| 16 | LDA | LinearDiscriminantAnalysis | False |
| 17 | lSVM | LinearSVM | True |
| 18 | LR | LogisticRegression | True |
| 19 | MLP | MultiLayerPerceptron | True |
| 20 | MNB | MultinomialNB | True |
| 21 | PA | PassiveAggressive | True |
| 22 | Perc | Perceptron | False |
| 23 | QDA | QuadraticDiscriminantAnalysis | False |
| 24 | RNN | RadiusNearestNeighbors | True |
| 25 | RF | RandomForest | True |
| 26 | Ridge | Ridge | True |
| 27 | SGD | StochasticGradientDescent | True |
| 28 | SVM | SupportVectorMachine | True |
| 29 | XGB | XGBoost | True |
In [22]:
# Train the model
atom.run(models="RF", metric="f1_weighted")
Training ========================= >>
Models: RF
Metric: f1_weighted

Results for RandomForest:
Fit ---------------------------------------------
Train evaluation --> f1_weighted: 1.0
Test evaluation --> f1_weighted: 0.9237
Time elapsed: 41.038s
-------------------------------------------------
Total time: 41.038s

Final results ==================== >>
Total time: 41.039s
-------------------------------------
RandomForest --> f1_weighted: 0.9237
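For reference, a minimal plain-scikit-learn sketch of what this step amounts to (assuming X_train, X_test, y_train, y_test hold the vectorized train and test splits; ATOM manages these internally):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Fit a random forest and score it with the weighted F1 metric
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)
f1_score(y_test, rf.predict(X_test), average="weighted")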
Analyze the results
In [23]:
atom.evaluate()
Out[23]:
| | balanced_accuracy | f1_weighted | jaccard_weighted | matthews_corrcoef | precision_weighted | recall_weighted |
|---|---|---|---|---|---|---|
| RF | 0.9239 | 0.9237 | 0.8583 | 0.8994 | 0.9266 | 0.9238 |
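The columns of evaluate() correspond to standard scikit-learn metrics and can be reproduced directly from the test predictions; continuing the hypothetical sketch above:

from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

y_pred = rf.predict(X_test)
balanced_accuracy_score(y_test, y_pred)  # 'balanced_accuracy' column
matthews_corrcoef(y_test, y_pred)        # 'matthews_corrcoef' column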
In [24]:
atom.plot_confusion_matrix(figsize=(700, 600))
In [25]:
atom.plot_shap_decision(index=0, show=15)
In [26]:
atom.plot_shap_beeswarm(target=0, show=15)
100%|===================| 2822/2836 [02:39<00:00]