Natural Language Processing
This example shows how to use ATOM to quickly go from raw text data to model predictions.
Import the 20 newsgroups text dataset from sklearn.datasets. The dataset comprises around 18000 articles on 20 topics. The goal is to predict the topic of every article.
Load the data
In [18]:
import numpy as np
from atom import ATOMClassifier
from sklearn.datasets import fetch_20newsgroups
In [19]:
# Use only a subset of the available topics for faster processing
X_text, y_text = fetch_20newsgroups(
    return_X_y=True,
    categories=[
        'alt.atheism',
        'sci.med',
        'comp.windows.x',
        'misc.forsale',
        'rec.autos',
    ],
    shuffle=True,
    random_state=1,
)
X_text = np.array(X_text).reshape(-1, 1)
Run the pipeline
In [20]:
atom = ATOMClassifier(X_text, y_text, test_size=0.3, verbose=2, warnings=False)
<< ================== ATOM ================== >>
Algorithm task: multiclass classification.

Dataset stats ==================== >>
Shape: (2846, 2)
Scaled: False
Categorical features: 1 (100.0%)
-------------------------------------
Train set size: 1993
Test set size: 853
-------------------------------------
|    |   dataset |     train |      test |
| -- | --------- | --------- | --------- |
|  0 | 480 (1.0) | 341 (1.0) | 139 (1.0) |
|  1 | 593 (1.2) | 418 (1.2) | 175 (1.3) |
|  2 | 585 (1.2) | 416 (1.2) | 169 (1.2) |
|  3 | 594 (1.2) | 413 (1.2) | 181 (1.3) |
|  4 | 594 (1.2) | 405 (1.2) | 189 (1.4) |
In [21]:
atom.dataset # Note that the feature is automatically named 'Corpus'
Out[21]:
|      | Corpus                                             | target |
| ---- | -------------------------------------------------- | ------ |
| 0    | From: mcdonald@aries.scs.uiuc.edu (J. D. McDon...  | 4      |
| 1    | From: victor@hpfrcu03.FRance.hp.COM (Victor GA...  | 1      |
| 2    | From: uabdpo.dpo.uab.edu!gila005 (Stephen Holl...  | 4      |
| 3    | From: jbrown@batman.bmd.trw.com\nSubject: Re: ...  | 0      |
| 4    | Organization: University of Maine System\nFrom...  | 4      |
| ...  | ...                                                | ...    |
| 2841 | From: jim.zisfein@factory.com (Jim Zisfein) \n...  | 4      |
| 2842 | From: davewood@bruno.cs.colorado.edu (David Re...  | 1      |
| 2843 | From: marc@comp.lancs.ac.uk (Marc Goldman)\nSu...  | 2      |
| 2844 | From: mcovingt@aisun3.ai.uga.edu (Michael Covi...  | 4      |
| 2845 | From: reznik@robios.me.wisc.edu (Dan S Reznik)...  | 1      |
2846 rows × 2 columns
In [22]:
# Let's have a look at the first document
atom.Corpus[0]
Out[22]:
"From: mcdonald@aries.scs.uiuc.edu (J. D. McDonald)\nSubject: Re: jiggers\nArticle-I.D.: aries.mcdonald.895.734049502\nOrganization: UIUC SCS\nLines: 13\n\nIn article <78846@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:\n\n>This wouldn't happen to be the same thing as chiggers, would it?\n>A truly awful parasitic affliction, as I understand it. Tiny bugs\n>dig deeply into the skin, burying themselves. Yuck! They have these\n>things in Oklahoma.\n\nClose. My mother comes from Gainesville Tex, right across the border.\nThey claim to be the chigger capitol of the world, and I believe them.\nWhen I grew up in Fort Worth it was bad enough, but in Gainesville\nin the summer an attack was guaranteed.\n\nDoug McDonald\n"
In [23]:
# Clean the documents from noise (emails, numbers, etc...)
atom.textclean()
Filtering the corpus...
 --> Decoding unicode characters to ascii.
 --> Converting text to lower case.
 --> Dropping 10012 emails from 2830 documents.
 --> Dropping 0 URL links from 0 documents.
 --> Dropping 2214 HTML tags from 1304 documents.
 --> Dropping 2 emojis from 1 documents.
 --> Dropping 31222 numbers from 2843 documents.
 --> Dropping punctuation from the text.
In [24]:
# Have a look at the removed items
atom.drops
Out[24]:
|      | email | url | html | emoji | number |
| ---- | ----- | --- | ---- | ----- | ------ |
| 0    | [mcdonald@aries.scs.uiuc.edu, 78846@cup.portal... | NaN | [<>] | NaN | [895, 734049502, 13] |
| 1    | [victor@hpfrcu03.france.hp.com, alf@st.nepean.... | NaN | [<key>] | NaN | [5000, 240, 61, 10, 3, 10, 3, 5, 10, 4] |
| 2    | [1993apr22.210631.13300@aio.jsc.nasa.gov, spen... | NaN | [<>] | NaN | [81] |
| 3    | [jbrown@batman.bmd.trw.com, 1993apr20.062328.1... | NaN | [<>] | NaN | [67, 2, 3, 5, 6, 2, 6] |
| 4    | [andy@maine.maine.edu, andy@maine.edu] | NaN | [<>] | NaN | [8] |
| ...  | ... | ... | ... | ... | ... |
| 2841 | [jim.zisfein@factory.com, jim.zisfein@factory.... | NaN | [<g>] | NaN | [212, 274, 31, 2, 1] |
| 2842 | [davewood@bruno.cs.colorado.edu, davewood@cs.c... | NaN | [<x11/xlib.h>, <xm/xm.h>, <xm/pushb.h>, <stdio... | NaN | [91, 0, 0, 50, 50, 0, 25, 0, 0, 0, 0, 1, 0, 20... |
| 2843 | [marc@comp.lancs.ac.uk, marc@comp.lancs.ac.uk,... | NaN | NaN | NaN | [24, 3] |
| 2844 | [mcovingt@aisun3.ai.uga.edu, mcovingt@ai.uga.edu] | NaN | [<>] | NaN | [12, 706, 542, 0358, 30602, 7415] |
| 2845 | [reznik@robios.me.wisc.edu, reznik@robios5.me.... | NaN | NaN | NaN | [22, 93, 18, 22, 55, 13] |
2846 rows × 5 columns
In [25]:
# Check how the first document changed
atom.Corpus[0]
Out[25]:
'from j d mcdonald\nsubject re jiggers\narticleid ariesmcdonald\norganization uiuc scs\nlines \n\nin article mark robert thorson writes\n\nthis wouldnt happen to be the same thing as chiggers would it\na truly awful parasitic affliction as i understand it tiny bugs\ndig deeply into the skin burying themselves yuck they have these\nthings in oklahoma\n\nclose my mother comes from gainesville tex right across the border\nthey claim to be the chigger capitol of the world and i believe them\nwhen i grew up in fort worth it was bad enough but in gainesville\nin the summer an attack was guaranteed\n\ndoug mcdonald\n'
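For intuition, the cleaning step behaves roughly like a chain of regex substitutions. The sketch below is a simplified approximation with patterns of my own choosing, not ATOM's actual implementation:

import re

def basic_textclean(doc):
    """Rough approximation of the cleaning steps applied above."""
    doc = doc.lower()                        # convert text to lower case
    doc = re.sub(r"\S+@\S+", "", doc)        # drop email addresses
    doc = re.sub(r"https?://\S+", "", doc)   # drop URL links
    doc = re.sub(r"<[^>]+>", "", doc)        # drop HTML-like tags
    doc = re.sub(r"\d+", "", doc)            # drop numbers
    doc = re.sub(r"[^\w\s]", "", doc)        # drop punctuation
    return doc

print(basic_textclean("From: mcdonald@aries.scs.uiuc.edu\nLines: 13"))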
In [26]:
# Convert the strings to a sequence of words
atom.tokenize()
Tokenizing the corpus...
In [27]:
# Print the first few words of the first document
atom.Corpus[0][:7]
Out[27]:
['from', 'j', 'd', 'mcdonald', 'subject', 're', 'jiggers']
In [28]:
# Normalize the text to a predefined standard
atom.normalize(stopwords="english", lemmatize=True)
Normalizing the corpus...
 --> Dropping stopwords.
 --> Applying lemmatization.
In [29]:
atom.Corpus[0][:7] # Check changes...
Out[29]:
['j', 'mcdonald', 'subject', 'jigger', 'articleid', 'ariesmcdonald', 'organization']
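This normalization is comparable to filtering NLTK's English stopword list and lemmatizing each remaining token. A rough stand-alone sketch (it assumes the NLTK corpora are downloaded; ATOM's own implementation may differ in detail):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)   # required corpora for this sketch
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stops = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

tokens = ["from", "j", "mcdonald", "subject", "re", "jiggers"]
print([lemmatizer.lemmatize(t) for t in tokens if t not in stops])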
In [30]:
# Visualize the most common words with a wordcloud
atom.plot_wordcloud()
In [31]:
# Have a look at the most frequent bigrams
atom.plot_ngrams(2)
In [32]:
# Create the bigrams using the tokenizer
atom.tokenize(bigram_freq=215)
Tokenizing the corpus...
 --> Creating 10 bigrams on 4178 locations.
In [33]:
atom.bigrams
Out[33]:
|   | bigram                     | frequency |
| - | -------------------------- | --------- |
| 9 | (x, x)                     | 1169      |
| 0 | (line, article)            | 714       |
| 6 | (line, nntppostinghost)    | 493       |
| 3 | (organization, university) | 367       |
| 8 | (gordon, bank)             | 266       |
| 4 | (line, distribution)       | 258       |
| 5 | (distribution, world)      | 249       |
| 1 | (distribution, usa)        | 229       |
| 2 | (usa, line)                | 217       |
| 7 | (computer, science)        | 216       |
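A bigram table like this is just a frequency count of adjacent token pairs. A minimal pure-Python illustration on made-up token lists (stand-ins for the documents in atom.Corpus):

from collections import Counter

docs = [
    ["line", "article", "line", "article"],
    ["organization", "university", "line", "article"],
]

bigram_counts = Counter()
for tokens in docs:
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs

print(bigram_counts.most_common(3))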
In [34]:
# As a last step before modelling, convert the words to vectors
atom.vectorize(strategy="tf-idf")
Vectorizing the corpus...
In [35]:
# The dimensionality of the dataset has increased a lot!
atom.shape
Out[35]:
(2846, 28904)
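Each of the roughly 29k new columns holds the tf-idf weight of one word in the vocabulary, so every document becomes a sparse numeric vector. The same idea with scikit-learn directly (a toy sketch, not ATOM's exact configuration):

from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["car engine for sale", "doctor recommends new medicine"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_docs)  # sparse matrix, shape (n_docs, n_terms)

print(tfidf.shape)
print(vectorizer.get_feature_names_out())   # the vocabulary (scikit-learn >= 1.0)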
In [36]:
# Train the model
atom.run(models="MLP", metric="f1_weighted")
Training ========================= >>
Models: MLP
Metric: f1_weighted


Results for Multi-layer Perceptron:
Fit ---------------------------------------------
Train evaluation --> f1_weighted: 1.0
Test evaluation --> f1_weighted: 0.8701
Time elapsed: 1m:02s
-------------------------------------------------
Total time: 1m:02s


Final results ==================== >>
Duration: 1m:02s
-------------------------------------
Multi-layer Perceptron --> f1_weighted: 0.8701
Analyze results
In [37]:
atom.evaluate()
Out[37]:
|     | balanced_accuracy | f1_weighted | jaccard_weighted | matthews_corrcoef | precision_weighted | recall_weighted |
| --- | ----------------- | ----------- | ---------------- | ----------------- | ------------------ | --------------- |
| MLP | 0.873552          | 0.870075    | 0.771411         | 0.839562          | 0.873802           | 0.871043        |
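For reference, f1_weighted averages the per-class F1 scores with each class weighted by its number of true samples, so the larger topics count proportionally more. A quick illustration with scikit-learn on toy labels (not the predictions above):

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 1, 2]

print(f1_score(y_true, y_pred, average=None))        # per-class F1
print(f1_score(y_true, y_pred, average="weighted"))  # support-weighted average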
In [38]:
atom.plot_confusion_matrix(figsize=(10, 10))
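To close the loop from raw text to predictions, the fitted atom instance can apply the full pipeline (cleaning, tokenizing, normalizing, vectorizing) to new documents before predicting with the trained MLP. The snippet below is a sketch assuming the trainer's prediction methods as documented for recent ATOM versions; check the ATOM documentation for the exact API in your release:

# Hypothetical new documents, shaped like the original input
new_docs = np.array([
    "My car needs new brake pads before the road trip",
    "The X server crashes whenever I resize a window",
]).reshape(-1, 1)

print(atom.predict(new_docs))  # predicted topic index for each document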