Example: NLP
This example shows how to use ATOM to quickly go from raw text data to model predictions.
Import the 20 newsgroups text dataset from sklearn.datasets. The dataset comprises around 18,000 newsgroup posts on 20 topics. The goal is to predict the topic of each article.
Load the data
In [1]:
import numpy as np
from atom import ATOMClassifier
from sklearn.datasets import fetch_20newsgroups
UserWarning: The pandas version installed (1.5.3) does not match the supported pandas version in Modin (1.5.2). This may cause undesired side effects!
In [2]:
# Use only a subset of the available topics for faster processing
X_text, y_text = fetch_20newsgroups(
    return_X_y=True,
    categories=[
        'sci.med',
        'comp.windows.x',
        'misc.forsale',
        'rec.autos',
    ],
    shuffle=True,
    random_state=1,
)
X_text = np.array(X_text).reshape(-1, 1)
Run the pipeline
In [3]:
atom = ATOMClassifier(X_text, y_text, index=True, test_size=0.3, verbose=2, random_state=1)
<< ================== ATOM ================== >>
Algorithm task: multiclass classification.

Dataset stats ==================== >>
Shape: (2366, 2)
Train set size: 1657
Test set size: 709
-------------------------------------
Memory: 4.14 MB
Scaled: False
Categorical features: 1 (100.0%)
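Under the hood, test_size=0.3 holds out 30% of the rows for testing. A minimal sketch of the equivalent manual step with scikit-learn (a plain shuffled holdout; ATOM's internal splitting may differ in details):

from sklearn.model_selection import train_test_split

# Rough equivalent of test_size=0.3 (assumption: a simple shuffled holdout)
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y_text, test_size=0.3, random_state=1
)
print(len(X_train), len(X_test))  # roughly 1657 / 709, as in the stats above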
In [4]:
atom.dataset # Note that the feature is automatically named 'corpus'
Out[4]:
| | corpus | target |
|---|---|---|
| 1731 | From: rlm@helen.surfcty.com (Robert L. McMilli... | 0 |
| 1496 | From: carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick... | 3 |
| 1290 | From: thssjxy@iitmax.iit.edu (Smile)\nSubject:... | 1 |
| 2021 | From: c23st@kocrsv01.delcoelect.com (Spiros Tr... | 2 |
| 142 | From: ginkgo@ecsvax.uncecs.edu (J. Geary Morto... | 1 |
| ... | ... | ... |
| 510 | From: mary@uicsl.csl.uiuc.edu (Mary E. Allison... | 3 |
| 1948 | From: ndd@sunbar.mc.duke.edu (Ned Danieley)\nS... | 0 |
| 798 | From: kk@unisql.UUCP (Kerry Kimbrough)\nSubjec... | 0 |
| 2222 | From: hamachi@adobe.com (Gordon Hamachi)\nSubj... | 2 |
| 2215 | From: mobasser@vu-vlsi.ee.vill.edu (Bijan Moba... | 2 |

2366 rows × 2 columns
In [5]:
# Let's have a look at the first document
atom.corpus[0]
Out[5]:
'From: caf@omen.UUCP (Chuck Forsberg WA7KGX)\nSubject: Re: My New Diet --> IT WORKS GREAT !!!!\nOrganization: Omen Technology INC, Portland Rain Forest\nLines: 32\n\nIn article <1qk6v3INNrm6@lynx.unm.edu> bhjelle@carina.unm.edu () writes:\n>\n>Gordon Banks:\n>\n>>a lot to keep from going back to morbid obesity. I think all\n>>of us cycle. One\'s success depends on how large the fluctuations\n>>in the cycle are. Some people can cycle only 5 pounds. Unfortunately,\n>>I\'m not one of them.\n>>\n>>\n>This certainly describes my situation perfectly. For me there is\n>a constant dynamic between my tendency to eat, which appears to\n>be totally limitless, and the purely conscious desire to not\n>put on too much weight. When I get too fat, I just diet/exercise\n>more (with varying degrees of success) to take off the\n>extra weight. Usually I cycle within a 15 lb range, but\n>smaller and larger cycles occur as well. I\'m always afraid\n>that this method will stop working someday, but usually\n>I seem to be able to hold the weight gain in check.\n>This is one reason I have a hard time accepting the notion\n>of some metabolic derangement associated with cycle dieting\n>(that results in long-term weight gain). I have been cycle-\n>dieting for at least 20 years without seeing such a change.\n\nAs mentioned in Adiposity 101, only some experience weight\nrebound. The fact that you don\'t doesn\'t prove it doesn\'t\nhappen to others.\n-- \nChuck Forsberg WA7KGX ...!tektronix!reed!omen!caf \nAuthor of YMODEM, ZMODEM, Professional-YAM, ZCOMM, and DSZ\n Omen Technology Inc "The High Reliability Software"\n17505-V NW Sauvie IS RD Portland OR 97231 503-621-3406\n'
In [6]:
# Remove noise from the documents (emails, numbers, etc.)
atom.textclean()
Fitting TextCleaner...
Cleaning the corpus...
 --> Decoding unicode characters to ascii.
 --> Converting text to lower case.
 --> Dropping 8115 emails from 2352 documents.
 --> Dropping 0 URL links from 0 documents.
 --> Dropping 1619 HTML tags from 964 documents.
 --> Dropping 2 emojis from 1 documents.
 --> Dropping 29292 numbers from 2363 documents.
 --> Dropping punctuation from the text.
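For intuition, the cleaning steps above map to simple string operations. Below is a minimal sketch using the standard library, with simplified regex patterns of our own; it is not ATOM's actual implementation:

import re
import string

def clean(doc):
    """Simplified version of the cleaning steps logged above."""
    doc = doc.encode("ascii", "ignore").decode()  # decode unicode to ascii
    doc = doc.lower()                             # convert to lower case
    doc = re.sub(r"\S+@\S+", "", doc)             # drop emails
    doc = re.sub(r"https?://\S+", "", doc)        # drop URL links
    doc = re.sub(r"<.*?>", "", doc)               # drop HTML tags
    doc = re.sub(r"\d+", "", doc)                 # drop numbers
    return doc.translate(str.maketrans("", "", string.punctuation))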
In [7]:
# Have a look at the removed items
atom.drops
Out[7]:
| | email | url | html | emoji | number |
|---|---|---|---|---|---|
| 1731 | [rlm@helen.surfcty.com, rlm@helen.surfcty.com] | NaN | [<std.disclaimer.h>] | NaN | [8] |
| 1496 | [carl@sol1.gps.caltech.edu, carl@sol1.gps.calt... | NaN | [<>] | NaN | [28] |
| 1290 | [thssjxy@iitmax.iit.edu, thssjxy@iitmax.acc.ii... | NaN | NaN | NaN | [223158, 15645, 14, 80, 150] |
| 2021 | [c23st@kocrsv01.delcoelect.com, c4wjgq.a40@con... | NaN | [<>] | NaN | [10, 21, 6, 317, 451, 0815, 46904] |
| 142 | [ginkgo@ecsvax.uncecs.edu, ginkgo@uncecs.edu] | NaN | [<>] | NaN | [95, 17, 95, 95, 95, 100, 00, 919, 851, 6565, ... |
| ... | ... | ... | ... | ... | ... |
| 403 | NaN | NaN | NaN | NaN | [223, 250, 10, 8, 8, 2002, 1600] |
| 1634 | NaN | NaN | NaN | NaN | [15, 1, 1] |
| 1262 | NaN | NaN | NaN | NaN | [38, 84] |
| 1360 | NaN | NaN | NaN | NaN | [27, 15, 27, 225, 250, 412, 624, 6115, 371, 0154] |
| 211 | NaN | NaN | NaN | NaN | [13, 93, 212, 274, 0646, 1097, 08836, 908, 563... |

2366 rows × 5 columns
In [8]:
# Check how the first document changed
atom.corpus[0]
Out[8]:
'from chuck forsberg wa7kgx\nsubject re my new diet it works great \norganization omen technology inc portland rain forest\nlines \n\nin article writes\n\ngordon banks\n\na lot to keep from going back to morbid obesity i think all\nof us cycle ones success depends on how large the fluctuations\nin the cycle are some people can cycle only pounds unfortunately\nim not one of them\n\n\nthis certainly describes my situation perfectly for me there is\na constant dynamic between my tendency to eat which appears to\nbe totally limitless and the purely conscious desire to not\nput on too much weight when i get too fat i just dietexercise\nmore with varying degrees of success to take off the\nextra weight usually i cycle within a lb range but\nsmaller and larger cycles occur as well im always afraid\nthat this method will stop working someday but usually\ni seem to be able to hold the weight gain in check\nthis is one reason i have a hard time accepting the notion\nof some metabolic derangement associated with cycle dieting\nthat results in longterm weight gain i have been cycle\ndieting for at least years without seeing such a change\n\nas mentioned in adiposity only some experience weight\nrebound the fact that you dont doesnt prove it doesnt\nhappen to others\n \nchuck forsberg wa7kgx tektronixreedomencaf \nauthor of ymodem zmodem professionalyam zcomm and dsz\n omen technology inc the high reliability software\nv nw sauvie is rd portland or \n'
In [9]:
# Convert the strings to a sequence of words
atom.tokenize()
Fitting Tokenizer...
Tokenizing the corpus...
In [10]:
# Print the first few words of the first document
atom.corpus[0][:7]
Out[10]:
['from', 'chuck', 'forsberg', 'wa7kgx', 'subject', 're', 'my']
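Tokenization itself is a standard step. A minimal sketch with nltk that produces the same kind of word list (using nltk here is our assumption; ATOM's backend may differ):

from nltk.tokenize import word_tokenize

# nltk.download("punkt")  # one-time download of the tokenizer models
word_tokenize("from chuck forsberg wa7kgx subject re my")
# ['from', 'chuck', 'forsberg', 'wa7kgx', 'subject', 're', 'my']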
In [11]:
# Normalize the text to a predefined standard
atom.textnormalize(stopwords="english", lemmatize=True)
Fitting TextNormalizer...
Normalizing the corpus...
 --> Dropping stopwords.
 --> Applying lemmatization.
In [12]:
atom.corpus[0][:7] # Check changes...
Out[12]:
['chuck', 'forsberg', 'wa7kgx', 'subject', 'new', 'diet', 'work']
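The same two steps can be sketched with nltk's English stopword list and WordNet lemmatizer (again an assumption about the backend, shown only to illustrate the transformation):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# nltk.download("stopwords") and nltk.download("wordnet") are required once
stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

tokens = ['from', 'chuck', 'forsberg', 'wa7kgx', 'subject', 're', 'my', 'new', 'diet']
[lemmatizer.lemmatize(w) for w in tokens if w not in stop]
# ['chuck', 'forsberg', 'wa7kgx', 'subject', 'new', 'diet']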
In [13]:
# Visualize the most common words with a wordcloud
atom.plot_wordcloud(figsize=(700, 500))
In [14]:
# Have a look at the most frequent bigrams
atom.plot_ngrams(2)
In [15]:
# Create the bigrams using the tokenizer
atom.tokenize(bigram_freq=215)
Fitting Tokenizer...
Tokenizing the corpus...
 --> Creating 7 bigrams on 3125 locations.
In [16]:
atom.bigrams
Out[16]:
| | bigram | frequency |
|---|---|---|
| 0 | x_x | 1169 |
| 1 | line_article | 531 |
| 2 | line_nntppostinghost | 386 |
| 3 | organization_university | 331 |
| 4 | gordon_bank | 266 |
| 5 | distribution_usa | 227 |
| 6 | line_distribution | 215 |
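With bigram_freq=215, every adjacent word pair occurring at least 215 times in the corpus is merged into a single underscore-joined token. A minimal pure-Python sketch of that idea (our own simplification, not ATOM's implementation):

from collections import Counter

def merge_bigrams(docs, min_freq):
    """Merge adjacent word pairs occurring at least min_freq times."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_freq}
    merged = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                out.append(doc[i] + "_" + doc[i + 1])  # join the frequent pair
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged.append(out)
    return merged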
In [17]:
# As a last step before modelling, convert the words to vectors
atom.vectorize(strategy="tfidf")
Fitting Vectorizer...
Vectorizing the corpus...
In [18]:
# The dimensionality of the dataset has increased a lot!
atom.shape
Out[18]:
(2366, 24344)
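The jump to 24,344 columns happens because TF-IDF creates one feature per unique word in the corpus. A minimal scikit-learn sketch of the same step (assuming strategy="tfidf" behaves like a standard TfidfVectorizer; the toy documents are our own):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["gordon banks new diet", "xterm window subject motif"]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # scipy sparse matrix, one column per word
print(tfidf.shape)               # (2, 8): 2 documents, 8 unique words
print(vec.get_feature_names_out())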
In [19]:
# Note that the data is sparse and the columns are named
# after the words they are embedding
atom.dtypes
Out[19]:
corpus_00          Sparse[float64, 0]
corpus_000         Sparse[float64, 0]
corpus_000000e5    Sparse[float64, 0]
corpus_00000ee5    Sparse[float64, 0]
corpus_000010af    Sparse[float64, 0]
                          ...
corpus_zurich      Sparse[float64, 0]
corpus_zvi         Sparse[float64, 0]
corpus_zx          Sparse[float64, 0]
corpus_zz          Sparse[float64, 0]
target                          int64
Length: 24344, dtype: object
In [20]:
# When the dataset is sparse, stats() shows the density
atom.stats()
Dataset stats ==================== >>
Shape: (2366, 24344)
Train set size: 1657
Test set size: 709
-------------------------------------
Memory: 2.54 MB
Sparse: True
Density: 0.35%
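Density is simply the fraction of non-zero cells in the matrix. A small sketch of the computation on an arbitrary scipy sparse matrix (the matrix here is randomly generated for illustration):

from scipy import sparse

X = sparse.random(2366, 24343, density=0.0035, format="csr", random_state=1)
density = X.nnz / (X.shape[0] * X.shape[1])  # non-zero cells / total cells
print(f"{density:.2%}")                      # 0.35%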
In [21]:
# Check which models have support for sparse matrices
atom.available_models()[["acronym", "model", "accepts_sparse"]]
Out[21]:
| | acronym | model | accepts_sparse |
|---|---|---|---|
| 0 | AdaB | AdaBoost | True |
| 1 | Bag | Bagging | True |
| 2 | BNB | BernoulliNB | True |
| 3 | CatB | CatBoost | True |
| 4 | CatNB | CategoricalNB | True |
| 5 | CNB | ComplementNB | True |
| 6 | Tree | DecisionTree | True |
| 7 | Dummy | Dummy | False |
| 8 | ETree | ExtraTree | True |
| 9 | ET | ExtraTrees | True |
| 10 | GNB | GaussianNB | False |
| 11 | GP | GaussianProcess | False |
| 12 | GBM | GradientBoosting | True |
| 13 | hGBM | HistGradientBoosting | False |
| 14 | KNN | KNearestNeighbors | True |
| 15 | LGB | LightGBM | True |
| 16 | LDA | LinearDiscriminantAnalysis | False |
| 17 | lSVM | LinearSVM | True |
| 18 | LR | LogisticRegression | True |
| 19 | MLP | MultiLayerPerceptron | True |
| 20 | MNB | MultinomialNB | True |
| 21 | PA | PassiveAggressive | True |
| 22 | Perc | Perceptron | False |
| 23 | QDA | QuadraticDiscriminantAnalysis | False |
| 24 | RNN | RadiusNearestNeighbors | True |
| 25 | RF | RandomForest | True |
| 26 | Ridge | Ridge | True |
| 27 | SGD | StochasticGradientDescent | True |
| 28 | SVM | SupportVectorMachine | True |
| 29 | XGB | XGBoost | True |
In [22]:
# Train the model
atom.run(models="RF", metric="f1_weighted")
Training ========================= >>
Models: RF
Metric: f1_weighted

Results for RandomForest:
Fit ---------------------------------------------
Train evaluation --> f1_weighted: 1.0
Test evaluation --> f1_weighted: 0.9237
Time elapsed: 41.038s
-------------------------------------------------
Total time: 41.038s

Final results ==================== >>
Total time: 41.039s
-------------------------------------
RandomForest --> f1_weighted: 0.9237
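For reference, a minimal plain-scikit-learn sketch of what this step amounts to (assuming X_train, X_test, y_train, y_test hold the vectorized train and test splits; ATOM manages these internally):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Fit a random forest and score it with the weighted F1 metric
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)
f1_score(y_test, rf.predict(X_test), average="weighted")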
Analyze the results
In [23]:
atom.evaluate()
Out[23]:
| | balanced_accuracy | f1_weighted | jaccard_weighted | matthews_corrcoef | precision_weighted | recall_weighted |
|---|---|---|---|---|---|---|
| RF | 0.9239 | 0.9237 | 0.8583 | 0.8994 | 0.9266 | 0.9238 |
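The columns of evaluate() correspond to standard scikit-learn metrics and can be reproduced directly from the test predictions; continuing the hypothetical sketch above:

from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

y_pred = rf.predict(X_test)
balanced_accuracy_score(y_test, y_pred)  # 'balanced_accuracy' column
matthews_corrcoef(y_test, y_pred)        # 'matthews_corrcoef' column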
In [24]:
atom.plot_confusion_matrix(figsize=(700, 600))
In [25]:
atom.plot_shap_decision(index=0, show=15)
In [26]:
atom.plot_shap_beeswarm(target=0, show=15)
100%|===================| 2822/2836 [02:39<00:00]