Natural Language Processing
This example shows how to use ATOM to quickly go from raw text data to model predictions.
Import the 20 newsgroups text dataset from sklearn.datasets. The dataset comprises around 18000 articles on 20 topics. The goal is to predict the topic of every article.
Load the data
In [18]:
import numpy as np
from atom import ATOMClassifier
from sklearn.datasets import fetch_20newsgroups
In [19]:
# Use only a subset of the available topics for faster processing
X_text, y_text = fetch_20newsgroups(
    return_X_y=True,
    categories=[
        'alt.atheism',
        'sci.med',
        'comp.windows.x',
        'misc.forsale',
        'rec.autos',
    ],
    shuffle=True,
    random_state=1,
)
X_text = np.array(X_text).reshape(-1, 1)
Run the pipeline
In [20]:
atom = ATOMClassifier(X_text, y_text, test_size=0.3, verbose=2, warnings=False)
<< ================== ATOM ================== >>
Algorithm task: multiclass classification.

Dataset stats ==================== >>
Shape: (2846, 2)
Scaled: False
Categorical features: 1 (100.0%)
-------------------------------------
Train set size: 1993
Test set size: 853
-------------------------------------
|    |   dataset |     train |      test |
| -- | --------- | --------- | --------- |
|  0 | 480 (1.0) | 341 (1.0) | 139 (1.0) |
|  1 | 593 (1.2) | 418 (1.2) | 175 (1.3) |
|  2 | 585 (1.2) | 416 (1.2) | 169 (1.2) |
|  3 | 594 (1.2) | 413 (1.2) | 181 (1.3) |
|  4 | 594 (1.2) | 405 (1.2) | 189 (1.4) |
In [21]:
atom.dataset # Note that the feature is automatically named 'Corpus'
Out[21]:
|      | Corpus                                             | target |
| ---- | -------------------------------------------------- | ------ |
| 0    | From: mcdonald@aries.scs.uiuc.edu (J. D. McDon...  | 4      |
| 1    | From: victor@hpfrcu03.FRance.hp.COM (Victor GA...  | 1      |
| 2    | From: uabdpo.dpo.uab.edu!gila005 (Stephen Holl...  | 4      |
| 3    | From: jbrown@batman.bmd.trw.com\nSubject: Re: ...  | 0      |
| 4    | Organization: University of Maine System\nFrom...  | 4      |
| ...  | ...                                                | ...    |
| 2841 | From: jim.zisfein@factory.com (Jim Zisfein) \n...  | 4      |
| 2842 | From: davewood@bruno.cs.colorado.edu (David Re...  | 1      |
| 2843 | From: marc@comp.lancs.ac.uk (Marc Goldman)\nSu...  | 2      |
| 2844 | From: mcovingt@aisun3.ai.uga.edu (Michael Covi...  | 4      |
| 2845 | From: reznik@robios.me.wisc.edu (Dan S Reznik)...  | 1      |
2846 rows × 2 columns
In [22]:
# Let's have a look at the first document
atom.Corpus[0]
Out[22]:
"From: mcdonald@aries.scs.uiuc.edu (J. D. McDonald)\nSubject: Re: jiggers\nArticle-I.D.: aries.mcdonald.895.734049502\nOrganization: UIUC SCS\nLines: 13\n\nIn article <78846@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:\n\n>This wouldn't happen to be the same thing as chiggers, would it?\n>A truly awful parasitic affliction, as I understand it. Tiny bugs\n>dig deeply into the skin, burying themselves. Yuck! They have these\n>things in Oklahoma.\n\nClose. My mother comes from Gainesville Tex, right across the border.\nThey claim to be the chigger capitol of the world, and I believe them.\nWhen I grew up in Fort Worth it was bad enough, but in Gainesville\nin the summer an attack was guaranteed.\n\nDoug McDonald\n"
In [23]:
# Clean the documents from noise (emails, numbers, etc...)
atom.textclean()
Filtering the corpus...
 --> Decoding unicode characters to ascii.
 --> Converting text to lower case.
 --> Dropping 10012 emails from 2830 documents.
 --> Dropping 0 URL links from 0 documents.
 --> Dropping 2214 HTML tags from 1304 documents.
 --> Dropping 2 emojis from 1 documents.
 --> Dropping 31222 numbers from 2843 documents.
 --> Dropping punctuation from the text.
In [24]:
# Have a look at the removed items
atom.drops
Out[24]:
|      | email | url | html | emoji | number |
| ---- | ----- | --- | ---- | ----- | ------ |
| 0    | [mcdonald@aries.scs.uiuc.edu, 78846@cup.portal... | NaN | [<>] | NaN | [895, 734049502, 13] |
| 1    | [victor@hpfrcu03.france.hp.com, alf@st.nepean.... | NaN | [<key>] | NaN | [5000, 240, 61, 10, 3, 10, 3, 5, 10, 4] |
| 2    | [1993apr22.210631.13300@aio.jsc.nasa.gov, spen... | NaN | [<>] | NaN | [81] |
| 3    | [jbrown@batman.bmd.trw.com, 1993apr20.062328.1... | NaN | [<>] | NaN | [67, 2, 3, 5, 6, 2, 6] |
| 4    | [andy@maine.maine.edu, andy@maine.edu] | NaN | [<>] | NaN | [8] |
| ...  | ... | ... | ... | ... | ... |
| 2841 | [jim.zisfein@factory.com, jim.zisfein@factory.... | NaN | [<g>] | NaN | [212, 274, 31, 2, 1] |
| 2842 | [davewood@bruno.cs.colorado.edu, davewood@cs.c... | NaN | [<x11/xlib.h>, <xm/xm.h>, <xm/pushb.h>, <stdio... | NaN | [91, 0, 0, 50, 50, 0, 25, 0, 0, 0, 0, 1, 0, 20... |
| 2843 | [marc@comp.lancs.ac.uk, marc@comp.lancs.ac.uk,... | NaN | NaN | NaN | [24, 3] |
| 2844 | [mcovingt@aisun3.ai.uga.edu, mcovingt@ai.uga.edu] | NaN | [<>] | NaN | [12, 706, 542, 0358, 30602, 7415] |
| 2845 | [reznik@robios.me.wisc.edu, reznik@robios5.me.... | NaN | NaN | NaN | [22, 93, 18, 22, 55, 13] |
2846 rows × 5 columns
In [25]:
# Check how the first document changed
atom.Corpus[0]
Out[25]:
'from j d mcdonald\nsubject re jiggers\narticleid ariesmcdonald\norganization uiuc scs\nlines \n\nin article mark robert thorson writes\n\nthis wouldnt happen to be the same thing as chiggers would it\na truly awful parasitic affliction as i understand it tiny bugs\ndig deeply into the skin burying themselves yuck they have these\nthings in oklahoma\n\nclose my mother comes from gainesville tex right across the border\nthey claim to be the chigger capitol of the world and i believe them\nwhen i grew up in fort worth it was bad enough but in gainesville\nin the summer an attack was guaranteed\n\ndoug mcdonald\n'
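For intuition, the cleaning step behaves roughly like a chain of regex substitutions. The sketch below is a simplified approximation with patterns of my own choosing, not ATOM's actual implementation:

import re

def basic_textclean(doc):
    """Rough approximation of the cleaning steps applied above."""
    doc = doc.lower()                        # convert text to lower case
    doc = re.sub(r"\S+@\S+", "", doc)        # drop email addresses
    doc = re.sub(r"https?://\S+", "", doc)   # drop URL links
    doc = re.sub(r"<[^>]+>", "", doc)        # drop HTML-like tags
    doc = re.sub(r"\d+", "", doc)            # drop numbers
    doc = re.sub(r"[^\w\s]", "", doc)        # drop punctuation
    return doc

print(basic_textclean("From: mcdonald@aries.scs.uiuc.edu\nLines: 13"))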
In [26]:
# Convert the strings to a sequence of words
atom.tokenize()
Tokenizing the corpus...
In [27]:
# Print the first few words of the first document
atom.Corpus[0][:7]
Out[27]:
['from', 'j', 'd', 'mcdonald', 'subject', 're', 'jiggers']
In [28]:
# Normalize the text to a predefined standard
atom.normalize(stopwords="english", lemmatize=True)
Normalizing the corpus...
 --> Dropping stopwords.
 --> Applying lemmatization.
In [29]:
atom.Corpus[0][:7] # Check changes...
Out[29]:
['j', 'mcdonald', 'subject', 'jigger', 'articleid', 'ariesmcdonald', 'organization']
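This normalization is comparable to filtering NLTK's English stopword list and lemmatizing each remaining token. A rough stand-alone sketch (it assumes the NLTK corpora are downloaded; ATOM's own implementation may differ in detail):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)   # required corpora for this sketch
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stops = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

tokens = ["from", "j", "mcdonald", "subject", "re", "jiggers"]
print([lemmatizer.lemmatize(t) for t in tokens if t not in stops])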
In [30]:
# Visualize the most common words with a wordcloud
atom.plot_wordcloud()
In [31]:
# Have a look at the most frequent bigrams
atom.plot_ngrams(2)
In [32]:
# Create the bigrams using the tokenizer
atom.tokenize(bigram_freq=215)
Tokenizing the corpus...
 --> Creating 10 bigrams on 4178 locations.
In [33]:
atom.bigrams
Out[33]:
|   | bigram                     | frequency |
| - | -------------------------- | --------- |
| 9 | (x, x)                     | 1169      |
| 0 | (line, article)            | 714       |
| 6 | (line, nntppostinghost)    | 493       |
| 3 | (organization, university) | 367       |
| 8 | (gordon, bank)             | 266       |
| 4 | (line, distribution)       | 258       |
| 5 | (distribution, world)      | 249       |
| 1 | (distribution, usa)        | 229       |
| 2 | (usa, line)                | 217       |
| 7 | (computer, science)        | 216       |
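A bigram table like this is just a frequency count of adjacent token pairs. A minimal pure-Python illustration on made-up token lists (stand-ins for the documents in atom.Corpus):

from collections import Counter

docs = [
    ["line", "article", "line", "article"],
    ["organization", "university", "line", "article"],
]

bigram_counts = Counter()
for tokens in docs:
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs

print(bigram_counts.most_common(3))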
In [34]:
# As a last step before modelling, convert the words to vectors
atom.vectorize(strategy="tf-idf")
Vectorizing the corpus...
In [35]:
# The dimensionality of the dataset has increased a lot!
atom.shape
Out[35]:
(2846, 28904)
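Each of the roughly 29k new columns holds the tf-idf weight of one word in the vocabulary, so every document becomes a sparse numeric vector. The same idea with scikit-learn directly (a toy sketch, not ATOM's exact configuration):

from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["car engine for sale", "doctor recommends new medicine"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_docs)  # sparse matrix, shape (n_docs, n_terms)

print(tfidf.shape)
print(vectorizer.get_feature_names_out())   # the vocabulary (scikit-learn >= 1.0)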
In [36]:
# Train the model
atom.run(models="MLP", metric="f1_weighted")
Training ========================= >>
Models: MLP
Metric: f1_weighted


Results for Multi-layer Perceptron:
Fit ---------------------------------------------
Train evaluation --> f1_weighted: 1.0
Test evaluation --> f1_weighted: 0.8701
Time elapsed: 1m:02s
-------------------------------------------------
Total time: 1m:02s


Final results ==================== >>
Duration: 1m:02s
-------------------------------------
Multi-layer Perceptron --> f1_weighted: 0.8701
Analyze results
In [37]:
atom.evaluate()
Out[37]:
|     | balanced_accuracy | f1_weighted | jaccard_weighted | matthews_corrcoef | precision_weighted | recall_weighted |
| --- | ----------------- | ----------- | ---------------- | ----------------- | ------------------ | --------------- |
| MLP | 0.873552          | 0.870075    | 0.771411         | 0.839562          | 0.873802           | 0.871043        |
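For reference, f1_weighted averages the per-class F1 scores with each class weighted by its number of true samples, so the larger topics count proportionally more. A quick illustration with scikit-learn on toy labels (not the predictions above):

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 1, 2]

print(f1_score(y_true, y_pred, average=None))        # per-class F1
print(f1_score(y_true, y_pred, average="weighted"))  # support-weighted average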
In [38]:
atom.plot_confusion_matrix(figsize=(10, 10))
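To close the loop from raw text to predictions, the fitted atom instance can apply the full pipeline (cleaning, tokenizing, normalizing, vectorizing) to new documents before predicting with the trained MLP. The snippet below is a sketch assuming the trainer's prediction methods as documented for recent ATOM versions; check the ATOM documentation for the exact API in your release:

# Hypothetical new documents, shaped like the original input
new_docs = np.array([
    "My car needs new brake pads before the road trip",
    "The X server crashes whenever I resize a window",
]).reshape(-1, 1)

print(atom.predict(new_docs))  # predicted topic index for each document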