Natural Language Processing

This example shows how to use ATOM to quickly go from raw text data to model predictions.

Import the 20 newsgroups text dataset from sklearn.datasets. The dataset comprises around 18,000 newsgroup posts on 20 topics. The goal is to predict the topic of each post.
Load the data
In [1]:
import numpy as np
from atom import ATOMClassifier
from sklearn.datasets import fetch_20newsgroups
In [2]:
# Use only a subset of the available topics for faster processing
X_text, y_text = fetch_20newsgroups(
    return_X_y=True,
    categories=[
        'alt.atheism',
        'sci.med',
        'comp.windows.x',
        'misc.forsale',
        'rec.autos',
    ],
    shuffle=True,
    random_state=1,
)
X_text = np.array(X_text).reshape(-1, 1)
Run the pipeline
In [3]:
atom = ATOMClassifier(X_text, y_text, index=True, test_size=0.3, verbose=2, random_state=1)
<< ================== ATOM ================== >>
Algorithm task: multiclass classification.

Dataset stats ==================== >>
Shape: (2846, 2)
Memory: 5.15 MB
Scaled: False
Categorical features: 1 (100.0%)
-------------------------------------
Train set size: 1993
Test set size: 853
-------------------------------------
|   |     dataset |       train |        test |
| - | ----------- | ----------- | ----------- |
| 0 |   480 (1.0) |   336 (1.0) |   144 (1.0) |
| 1 |   593 (1.2) |   415 (1.2) |   178 (1.2) |
| 2 |   585 (1.2) |   410 (1.2) |   175 (1.2) |
| 3 |   594 (1.2) |   416 (1.2) |   178 (1.2) |
| 4 |   594 (1.2) |   416 (1.2) |   178 (1.2) |
In [4]:
atom.dataset  # Note that the feature is automatically named 'corpus'
Out[4]:
| | corpus | target |
|---|---|---|
| 2283 | From: geb@cs.pitt.edu (Gordon Banks)\nSubject:... | 4 |
| 961 | From: keith@cco.caltech.edu (Keith Allan Schne... | 0 |
| 796 | From: joes@telxon.mis.telxon.com (Joe Staudt)\... | 3 |
| 832 | From: rog@cdc.hp.com (Roger Haaheim)\nSubject:... | 4 |
| 978 | From: forman@ide.com (Bonnie Forman)\nSubject:... | 2 |
| ... | ... | ... |
| 2021 | From: pmoloney@maths.tcd.ie (Paul Moloney)\nSu... | 0 |
| 2606 | From: jmiller@network.com (Jeff J. Miller)\nSu... | 3 |
| 2320 | From: maher@kong.gsfc.nasa.gov (552)\nSubject:... | 1 |
| 513 | From: beck@irzr17.inf.tu-dresden.de (Andre Bec... | 1 |
| 733 | From: yozzo@watson.ibm.com (Ralph Yozzo)\nSubj... | 4 |
2846 rows × 2 columns
In [5]:
# Let's have a look at the first document
atom.corpus[0]
Out[5]:
'From: Donald Mackie <Donald_Mackie@med.umich.edu>\nSubject: Re: Barbecued foods and health risk\nOrganization: UM Anesthesiology\nLines: 13\nDistribution: world\nNNTP-Posting-Host: 141.214.86.38\nX-UserAgent: Nuntius v1.1.1d9\nX-XXDate: Mon, 19 Apr 93 20:12:06 GMT\n\nIn article <1993Apr18.175802.28548@clpd.kodak.com> Rich Young,\nyoung@serum.kodak.com writes:\n\nStuff deleted\n\n>\t ... have to\n>\t consume unrealistically large quantities of barbecued meat at a\n>\t time."\n\nI have to confess that this is one of my few unfulfilled ambitions.\nNo matter how much I eat, it still seems realistic.\n\nDon Mackie - his opinion\n'
In [6]:
# Clean the documents from noise (emails, numbers, etc...)
atom.textclean()
Cleaning the corpus...
 --> Decoding unicode characters to ascii.
 --> Converting text to lower case.
 --> Dropping 10012 emails from 2830 documents.
 --> Dropping 0 URL links from 0 documents.
 --> Dropping 2214 HTML tags from 1304 documents.
 --> Dropping 2 emojis from 1 documents.
 --> Dropping 31222 numbers from 2843 documents.
 --> Dropping punctuation from the text.
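For intuition, most of these cleaning steps can be approximated with plain regex substitutions. The sketch below is a simplified stand-in, not ATOM's actual implementation; the patterns are loose assumptions.

```python
import re
import string

# Loose approximations of the cleaning rules; ATOM's internal
# patterns are more thorough.
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<.*?>")
NUMBER_RE = re.compile(r"\b\d+\b")

def clean_document(text: str) -> str:
    """Roughly mimic atom.textclean() on one document."""
    text = text.encode("ascii", "ignore").decode()  # decode unicode to ascii
    text = text.lower()                             # convert to lower case
    for pattern in (EMAIL_RE, URL_RE, HTML_RE, NUMBER_RE):
        text = pattern.sub("", text)                # drop emails, URLs, tags, numbers
    return text.translate(str.maketrans("", "", string.punctuation))

print(clean_document("Mail John@doe.com about <b>the 2 cars</b>!"))
```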
In [7]:
# Have a look at the removed items
atom.drops
Out[7]:
| | email | url | html | emoji | number |
|---|---|---|---|---|---|
| 2283 | [geb@cs.pitt.edu, geb@cs.pitt.edu, mo2@access.... | NaN | [<1q6rie$>] | NaN | [15, 1, 5, 400] |
| 961 | [keith@cco.caltech.edu, livesey@solntze.wpd.sg... | NaN | NaN | NaN | [16] |
| 796 | [joes@telxon.mis.telxon.com, 1993apr20.142818.... | NaN | [<>] | NaN | [45, 225, 54, 225, 2, 5582, 44334, 0582, 216, ... |
| 832 | [rog@cdc.hp.com, ls8139@albnyvms.bitnet] | NaN | NaN | NaN | [15, 1, 1] |
| 978 | [forman@ide.com, forman@ide.com] | NaN | NaN | NaN | [13, 2, 4, 40, 1, 800, 00, 510, 947, 6987] |
| ... | ... | ... | ... | ... | ... |
| 194 | NaN | NaN | NaN | NaN | [15, 1, 1, 1097, 08836, 908, 563, 9033, 908, 5... |
| 1035 | NaN | NaN | NaN | NaN | [47, 252, 4, 179, 34, 12] |
| 604 | NaN | NaN | NaN | NaN | [15, 1, 1] |
| 711 | NaN | NaN | NaN | NaN | [13, 93, 212, 274, 0646, 1097, 08836, 908, 563... |
| 1000 | NaN | NaN | NaN | NaN | [38, 84] |
2846 rows × 5 columns
In [8]:
# Check how the first document changed
atom.corpus[0]
Out[8]:
'from donald mackie \nsubject re barbecued foods and health risk\norganization um anesthesiology\nlines \ndistribution world\nnntppostinghost \nxuseragent nuntius v11d9\nxxxdate mon apr gmt\n\nin article rich young\n writes\n\nstuff deleted\n\n\t have to\n\t consume unrealistically large quantities of barbecued meat at a\n\t time\n\ni have to confess that this is one of my few unfulfilled ambitions\nno matter how much i eat it still seems realistic\n\ndon mackie his opinion\n'
In [9]:
# Convert the strings to a sequence of words
atom.tokenize()
Tokenizing the corpus...
In [10]:
# Print the first few words of the first document
atom.corpus[0][:7]
Out[10]:
['from', 'donald', 'mackie', 'subject', 're', 'barbecued', 'foods']
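Tokenization simply splits each string into a list of word tokens. A minimal standalone equivalent with nltk (an assumption; ATOM may use a different tokenizer, and this one requires nltk's 'punkt' data):

```python
from nltk.tokenize import word_tokenize  # pip install nltk; needs 'punkt' data

doc = "from donald mackie subject re barbecued foods"
print(word_tokenize(doc)[:7])
# ['from', 'donald', 'mackie', 'subject', 're', 'barbecued', 'foods']
```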
In [11]:
# Normalize the text to a predefined standard
atom.textnormalize(stopwords="english", lemmatize=True)
Normalizing the corpus...
 --> Dropping stopwords.
 --> Applying lemmatization.
In [12]:
atom.corpus[0][:7]  # Check changes...
Out[12]:
['donald', 'mackie', 'subject', 'barbecue', 'food', 'health', 'risk']
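A rough standalone equivalent of this normalization step, sketched with nltk's English stopword list and WordNet lemmatizer. This is an assumption, not ATOM's code: the output above ('barbecued' becoming 'barbecue') suggests ATOM lemmatizes with POS information, which this sketch omits.

```python
from nltk.corpus import stopwords        # needs nltk's 'stopwords' data
from nltk.stem import WordNetLemmatizer  # needs nltk's 'wordnet' data

stops = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

tokens = ['from', 'donald', 'mackie', 'subject', 're', 'barbecued', 'foods']

# Drop stopwords, then lemmatize the remaining tokens. Without a POS
# tag the lemmatizer treats every word as a noun, e.g. 'foods' -> 'food'.
normalized = [lemmatizer.lemmatize(t) for t in tokens if t not in stops]
print(normalized)
```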
In [13]:
# Visualize the most common words with a wordcloud
atom.plot_wordcloud()
In [14]:
# Have a look at the most frequent bigrams
atom.plot_ngrams(2)
In [15]:
# Create the bigrams using the tokenizer
atom.tokenize(bigram_freq=215)
Tokenizing the corpus...
 --> Creating 10 bigrams on 4178 locations.
In [16]:
atom.bigrams
Out[16]:
| | bigram | frequency |
|---|---|---|
| 0 | x_x | 1169 |
| 1 | line_article | 714 |
| 2 | line_nntppostinghost | 493 |
| 3 | organization_university | 367 |
| 4 | gordon_bank | 266 |
| 5 | line_distribution | 258 |
| 6 | distribution_world | 249 |
| 7 | distribution_usa | 229 |
| 8 | usa_line | 217 |
| 9 | computer_science | 216 |
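The bigram_freq threshold works on pair counts: adjacent token pairs that occur at least that many times across the corpus are merged into a single underscore-joined token. A simplified illustration of the idea (not ATOM's implementation; the toy corpus and threshold are made up):

```python
from collections import Counter

docs = [
    ["distribution", "usa", "line", "article"],
    ["distribution", "usa", "line", "distribution"],
]

# Count adjacent token pairs across the whole corpus
pairs = Counter((a, b) for doc in docs for a, b in zip(doc, doc[1:]))

# Pairs at or above the frequency threshold become bigrams
threshold = 2
bigrams = {pair for pair, n in pairs.items() if n >= threshold}

merged = []
for doc in docs:
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
            out.append(f"{doc[i]}_{doc[i + 1]}")  # merge the pair
            i += 2
        else:
            out.append(doc[i])
            i += 1
    merged.append(out)

print(merged)  # 'distribution_usa' is now a single token
```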
In [17]:
# As a last step before modelling, convert the words to vectors
atom.vectorize(strategy="tfidf")
Fitting Vectorizer...
Vectorizing the corpus...
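TF-IDF scores each word by its frequency within a document, discounted by how common the word is across all documents, so words that appear everywhere get low weights. A minimal sketch with scikit-learn's TfidfVectorizer on a made-up corpus (ATOM wraps the same idea):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "barbecue food health risk",
    "car engine for sale",
    "barbecue car",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse (3, n_words) matrix

print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(2))
```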
In [18]:
# The dimensionality of the dataset has increased a lot!
atom.shape
Out[18]:
(2846, 27589)
In [19]:
# Note that the data is sparse and the columns are
# named after the words they represent
atom.dtypes
Out[19]:
corpus_00            Sparse[float64, 0]
corpus_000           Sparse[float64, 0]
corpus_000000e5      Sparse[float64, 0]
corpus_00000ee5      Sparse[float64, 0]
corpus_000010af      Sparse[float64, 0]
                            ...
corpus_zvi           Sparse[float64, 0]
corpus_zx            Sparse[float64, 0]
corpus_zyklonb       Sparse[float64, 0]
corpus_zzzs          Sparse[float64, 0]
target                            int64
Length: 27589, dtype: object
In [20]:
# When the dataset is sparse, stats() shows the density
atom.stats()
Dataset stats ==================== >>
Shape: (2846, 27589)
Memory: 3.18 MB
Sparse: True
Density: 0.33%
-------------------------------------
Train set size: 1993
Test set size: 853
-------------------------------------
|   |     dataset |       train |        test |
| - | ----------- | ----------- | ----------- |
| 0 |   480 (1.0) |   336 (1.0) |   144 (1.0) |
| 1 |   593 (1.2) |   415 (1.2) |   178 (1.2) |
| 2 |   585 (1.2) |   410 (1.2) |   175 (1.2) |
| 3 |   594 (1.2) |   416 (1.2) |   178 (1.2) |
| 4 |   594 (1.2) |   416 (1.2) |   178 (1.2) |
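Density is simply the share of non-zero cells in the matrix. A quick check with scipy on a random stand-in matrix of the same shape (not atom's actual data):

```python
import scipy.sparse as sp

# Random sparse matrix with roughly the density reported above
matrix = sp.random(2846, 27589, density=0.0033, format="csr", random_state=1)

density = matrix.nnz / (matrix.shape[0] * matrix.shape[1])
print(f"{density:.2%}")  # ~0.33%
```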
In [21]:
# Check which models have support for sparse matrices
atom.available_models()[["acronym", "fullname", "accepts_sparse"]]
Out[21]:
| | acronym | fullname | accepts_sparse |
|---|---|---|---|
| 0 | Dummy | Dummy Estimator | False |
| 1 | GP | Gaussian Process | False |
| 2 | GNB | Gaussian Naive Bayes | False |
| 3 | MNB | Multinomial Naive Bayes | True |
| 4 | BNB | Bernoulli Naive Bayes | True |
| 5 | CatNB | Categorical Naive Bayes | True |
| 6 | CNB | Complement Naive Bayes | True |
| 7 | Ridge | Ridge Estimator | True |
| 8 | Perc | Perceptron | False |
| 9 | LR | Logistic Regression | True |
| 10 | LDA | Linear Discriminant Analysis | False |
| 11 | QDA | Quadratic Discriminant Analysis | False |
| 12 | KNN | K-Nearest Neighbors | True |
| 13 | RNN | Radius Nearest Neighbors | True |
| 14 | Tree | Decision Tree | True |
| 15 | Bag | Bagging | True |
| 16 | ET | Extra-Trees | True |
| 17 | RF | Random Forest | True |
| 18 | AdaB | AdaBoost | True |
| 19 | GBM | Gradient Boosting Machine | True |
| 20 | hGBM | HistGBM | False |
| 21 | XGB | XGBoost | True |
| 22 | LGB | LightGBM | True |
| 23 | CatB | CatBoost | True |
| 24 | lSVM | Linear SVM | True |
| 25 | kSVM | Kernel SVM | True |
| 26 | PA | Passive Aggressive | True |
| 27 | SGD | Stochastic Gradient Descent | True |
| 28 | MLP | Multi-layer Perceptron | True |
In [22]:
# Train the model
atom.run(models="RF", metric="f1_weighted")
Training ========================= >>
Models: RF
Metric: f1_weighted


Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> f1_weighted: 1.0
Test evaluation --> f1_weighted: 0.9295
Time elapsed: 23.832s
-------------------------------------------------
Total time: 23.832s


Final results ==================== >>
Duration: 23.832s
-------------------------------------
Random Forest --> f1_weighted: 0.9295
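The f1_weighted metric averages the per-class F1 scores, weighting each class by its number of true instances, which suits this mildly imbalanced dataset. The same score computed with scikit-learn directly, on toy labels for illustration:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

# Per-class F1 scores averaged with class support as weights
print(f1_score(y_true, y_pred, average="weighted"))
```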
Analyze results
In [23]:
atom.evaluate()
Out[23]:
| | balanced_accuracy | f1_weighted | jaccard_weighted | matthews_corrcoef | precision_weighted | recall_weighted |
|---|---|---|---|---|---|---|
| RF | 0.931065 | 0.929523 | 0.869258 | 0.912317 | 0.930918 | 0.92966 |
In [24]:
atom.plot_confusion_matrix(figsize=(10, 10))
In [25]:
atom.decision_plot(index=0, target=atom.predict(0), show=15)
In [26]:
atom.beeswarm_plot(target=0, show=15)
100%|===================| 4261/4265 [04:22<00:00]
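Finally, the fitted atom instance can take new raw documents through the whole pipeline (cleaning, tokenizing, normalizing, vectorizing) before the trained Random Forest scores them. A hedged sketch; the two posts below are made up, and the exact predict behavior on fresh data may differ between ATOM versions:

```python
import numpy as np

# Hypothetical unseen posts, shaped like the original input:
# one 'corpus' column per document.
new_docs = np.array([
    "For sale: 1992 sedan, low mileage, new tires.",
    "The X server crashes whenever I open a second window.",
]).reshape(-1, 1)

# atom.predict transforms the raw text with the fitted pipeline
# before calling the trained model.
print(atom.predict(new_docs))
```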