Natural Language Processing
This example shows how to use ATOM to quickly go from raw text data to model predictions.
Import the 20 newsgroups text dataset from sklearn.datasets. The dataset comprises around 18,000 newsgroup articles on 20 topics; the goal is to predict the topic of each article.
Load the data
In [1]:
import numpy as np
from atom import ATOMClassifier
from sklearn.datasets import fetch_20newsgroups
In [2]:
# Use only a subset of the available topics for faster processing
X_text, y_text = fetch_20newsgroups(
    return_X_y=True,
    categories=[
        'alt.atheism',
        'sci.med',
        'comp.windows.x',
        'misc.forsale',
        'rec.autos',
    ],
    shuffle=True,
    random_state=1,
)

# ATOM expects 2D input: one row per document, a single text column
X_text = np.array(X_text).reshape(-1, 1)
Run the pipeline
In [3]:
atom = ATOMClassifier(X_text, y_text, test_size=0.3, verbose=2, random_state=1)
<< ================== ATOM ================== >>
Algorithm task: multiclass classification.

Dataset stats ====================== >>
Shape: (2846, 2)
Scaled: False
Categorical features: 1 (100.0%)
---------------------------------------
Train set size: 1993
Test set size: 853
---------------------------------------

|    | dataset   | train     | test      |
|---:|:----------|:----------|:----------|
|  0 | 480 (1.0) | 336 (1.0) | 144 (1.0) |
|  1 | 593 (1.2) | 414 (1.2) | 179 (1.2) |
|  2 | 585 (1.2) | 400 (1.2) | 185 (1.3) |
|  3 | 594 (1.2) | 440 (1.3) | 154 (1.1) |
|  4 | 594 (1.2) | 403 (1.2) | 191 (1.3) |
In [4]:
atom.dataset  # Note that the feature is automatically named 'Corpus'
Out[4]:
|      | Corpus                                            | Target |
|-----:|:--------------------------------------------------|-------:|
|    0 | From: lehr@austin.ibm.com (Ted Lehr)\nSubject:... |      4 |
|    1 | From: 02106@ravel.udel.edu (Samuel Ross)\nSubj... |      2 |
|    2 | From: callison@uokmax.ecn.uoknor.edu (James P.... |      3 |
|    3 | From: wrat@unisql.UUCP (wharfie)\nSubject: Re:... |      3 |
|    4 | From: ip02@ns1.cc.lehigh.edu (Danny Phornpraph... |      3 |
|  ... | ...                                               |    ... |
| 2841 | From: marc@ccvi.ccv.FR (Marc Bassini)\nSubject... |      1 |
| 2842 | From: mwbg9715@uxa.cso.uiuc.edu (Mark Wayne Bl... |      3 |
| 2843 | From: sasghm@theseus.unx.sas.com (Gary Merrill... |      4 |
| 2844 | From: jimf@centerline.com (Jim Frost)\nSubject... |      3 |
| 2845 | From: tjo@scr.siemens.com (Tom Ostrand)\nSubje... |      3 |

2846 rows × 2 columns
In [5]:
# Let's have a look at the first document
atom.Corpus[0]
Out[5]:
'From: lehr@austin.ibm.com (Ted Lehr)\nSubject: Re: Science and methodology (was: Homeopathy ... tradition?)\nOriginator: lehr@jan.austin.ibm.com\nDistribution: inet\nOrganization: IBM Austin\nLines: 47\n\n\nGary Merrill writes:\n> .. Not every wild flight of fancy serves\n> (or can serve) in the appropriate relation to a hypothesis. It is\n> somewhat interesting that when anyone is challanged to provide an\n> example of this sort the *only* one they come up with is the one about\n> Kekule. Surely, there must be others. But apparently this is regarded\n> as an *extreme* example of a "non-rational" process in science whereby\n> a successful hypothesis was proposed. But how non-rational is it?\n\nIndeed, an extreme example. It came "out of nowhere." The connection\nKekule saw between it and his problem is fortunate but not extraordinary.\nI, for example, often receive/conjure solutions (hypotheses for solutions) \nto my everyday problems at moments when I appear to myself to be occupied \nwith activities quite removed. Algorithms for that new software feature come\nwhen I trample the meadow on my occasional runs. Alternative (better>) ways \nto instruct and rear my sons arrive while I weed the garden. I\'ll swear I am \nnot thinking about any of it when ideas come. \n\nThese ideas are not the stuff of "great" discoveries, of course, but my\nconnecting them to particular problems is fraught with deliberation and\noccasional fits of rationality.\n\n> Surely it wasn\'t the *only* daydream [Kekule] had. What was special about\n> *this* one? Could it have had something to do with a perceived\n> *analogy* between the geometry of the snakes and problems concerning\n> geometry of molecules? \n\nYes. And he was lucky to have such a colorful, vivid image. I, alas, will\nnever figure out why returning worms to the loose soil of my garden brought, \n"have him count objects instead of merely count" to mind regarding my 2 \nyear-old\'s fledging arithmetic skills.\n\n> ... Upon close examination,\n> is there a non-rational mystical leap taking place, or is it perhaps\n> closer to a formal (though often incomplete) analogy or model?\n\nThe latter. Worms wiggling around in the dirt fascinate my son.\n\nRegards,\n\nTed \n-- \nTed Lehr | "...my thoughts, opinions and questions..."\nFuture Systems Technology Group, AWS | \nIBM \t\t\t\t | Internet: lehr@futserv.austin.ibm.com\nAustin, TX 78758\t\t | \n'
In [6]:
# Clean the documents from noise (emails, numbers, etc...)
atom.textclean()
Filtering the corpus...
 --> Decoding unicode characters to ascii.
 --> Converting text to lower case.
 --> Dropping 10012 emails from 2830 documents.
 --> Dropping 0 URL links from 0 documents.
 --> Dropping 2214 HTML tags from 1304 documents.
 --> Dropping 2 emojis from 1 documents.
 --> Dropping 31222 numbers from 2843 documents.
 --> Dropping punctuation from the text.
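Each of these filters is essentially a pattern-based substitution on the raw string. A rough standalone approximation with regular expressions (a minimal sketch for illustration; ATOM's actual patterns are more thorough):

import re

def basic_clean(doc):
    """Approximate the textclean steps on a single document."""
    doc = doc.lower()                        # convert text to lower case
    doc = re.sub(r"\S+@\S+", " ", doc)       # drop email addresses
    doc = re.sub(r"https?://\S+", " ", doc)  # drop URL links
    doc = re.sub(r"<[^>]+>", " ", doc)       # drop HTML tags
    doc = re.sub(r"\d+", "", doc)            # drop numbers
    doc = re.sub(r"[^\w\s]", "", doc)        # drop punctuation
    return doc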
In [7]:
# Have a look at the removed items
atom.drops
Out[7]:
|      | email                                             | url | html     | emoji | number                                            |
|-----:|:--------------------------------------------------|:----|:---------|:------|:--------------------------------------------------|
|    0 | [lehr@austin.ibm.com, lehr@jan.austin.ibm.com,... | NaN | NaN      | NaN   | [47, 2, 78758]                                    |
|    1 | [02106@ravel.udel.edu, 02106@chopin.udel.edu, ... | NaN | NaN      | NaN   | [25, 1986, 720]                                   |
|    2 | [callison@uokmax.ecn.uoknor.edu, 1993apr15.223... | NaN | [<>, <>] | NaN   | [26, 89]                                          |
|    3 | [wrat@unisql.uucp, c5r43y.f0d@mentor.cc.purdue... | NaN | [<>]     | NaN   | [12, 80, 55, 65, 80, 80, 1958, 80, 1993]          |
|    4 | [ip02@ns1.cc.lehigh.edu, ip02@lehigh.edu]         | NaN | NaN      | NaN   | [15, 215, 758, 4141]                              |
|  ... | ...                                               | ... | ...      | ...   | ...                                               |
| 2841 | [marc@ccvi.ccv.fr, jb@sgihbtn.sierra.com, jb@s... | NaN | NaN      | NaN   | [16, 82, 75008, 40, 08, 07, 07, 43, 87, 35, 99]   |
| 2842 | [mwbg9715@uxa.cso.uiuc.edu, zowie@daedalus.sta... | NaN | NaN      | NaN   | [12, 5, 7, 40, 40, 5, 30, 10, 30, 20, 50]         |
| 2843 | [sasghm@theseus.unx.sas.com, sasghm@theseus.un... | NaN | [<>]     | NaN   | [55, 400, 27513, 919, 677, 8000]                  |
| 2844 | [jimf@centerline.com, tcorkum@bnr.ca, jimf@cen... | NaN | NaN      | NaN   | [14, 140, 239, 3, 202, 5]                         |
| 2845 | [tjo@scr.siemens.com, tjo@scr.siemens.com]        | NaN | NaN      | NaN   | [19, 1984, 609, 734, 6569, 755, 609, 734, 6565... |

2846 rows × 5 columns
In [8]:
# Check how the first document changed
atom.Corpus[0]
Out[8]:
'from ted lehr\nsubject re science and methodology was homeopathy tradition\noriginator \ndistribution inet\norganization ibm austin\nlines \n\n\ngary merrill writes\n not every wild flight of fancy serves\n or can serve in the appropriate relation to a hypothesis it is\n somewhat interesting that when anyone is challanged to provide an\n example of this sort the only one they come up with is the one about\n kekule surely there must be others but apparently this is regarded\n as an extreme example of a nonrational process in science whereby\n a successful hypothesis was proposed but how nonrational is it\n\nindeed an extreme example it came out of nowhere the connection\nkekule saw between it and his problem is fortunate but not extraordinary\ni for example often receiveconjure solutions hypotheses for solutions \nto my everyday problems at moments when i appear to myself to be occupied \nwith activities quite removed algorithms for that new software feature come\nwhen i trample the meadow on my occasional runs alternative better ways \nto instruct and rear my sons arrive while i weed the garden ill swear i am \nnot thinking about any of it when ideas come \n\nthese ideas are not the stuff of great discoveries of course but my\nconnecting them to particular problems is fraught with deliberation and\noccasional fits of rationality\n\n surely it wasnt the only daydream kekule had what was special about\n this one could it have had something to do with a perceived\n analogy between the geometry of the snakes and problems concerning\n geometry of molecules \n\nyes and he was lucky to have such a colorful vivid image i alas will\nnever figure out why returning worms to the loose soil of my garden brought \nhave him count objects instead of merely count to mind regarding my \nyearolds fledging arithmetic skills\n\n upon close examination\n is there a nonrational mystical leap taking place or is it perhaps\n closer to a formal though often incomplete analogy or model\n\nthe latter worms wiggling around in the dirt fascinate my son\n\nregards\n\nted \n \nted lehr my thoughts opinions and questions\nfuture systems technology group aws \nibm \t\t\t\t internet \naustin tx \t\t \n'
In [9]:
# Convert the strings to a sequence of words
atom.tokenize()
Tokenizing the corpus...
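Tokenization turns each document from one string into a list of word tokens. A standalone sketch of the same idea with nltk (the backend choice here is an assumption; the punkt resource must be downloaded first):

import nltk
nltk.download("punkt", quiet=True)  # tokenizer models

from nltk.tokenize import word_tokenize

print(word_tokenize("worms wiggling in the dirt"))
# ['worms', 'wiggling', 'in', 'the', 'dirt']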
In [10]:
# Print the first few words of the first document
atom.Corpus[0][:7]
Out[10]:
['from', 'ted', 'lehr', 'subject', 're', 'science', 'and']
In [11]:
# Normalize the text to a predefined standard
atom.normalize(stopwords="english", lemmatize=True)
Normalizing the corpus...
 --> Dropping stopwords.
 --> Applying lemmatization.
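Stopword removal drops very common words ('the', 'and', ...) that carry little topical signal; lemmatization maps inflected forms to a dictionary root ('worms' -> 'worm'). A minimal standalone sketch with nltk (assuming its stopwords and wordnet resources are installed):

from nltk.corpus import stopwords        # needs the 'stopwords' resource
from nltk.stem import WordNetLemmatizer  # needs the 'wordnet' resource

stop = set(stopwords.words("english"))
lemmatize = WordNetLemmatizer().lemmatize

tokens = ["worms", "wiggling", "around", "in", "the", "dirt"]
print([lemmatize(t) for t in tokens if t not in stop])
# ['worm', 'wiggling', 'dirt']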
In [12]:
Copied!
atom.Corpus[0][:7] # Check changes...
atom.Corpus[0][:7] # Check changes...
Out[12]:
['ted', 'lehr', 'subject', 'science', 'methodology', 'homeopathy', 'tradition']
In [13]:
# Visualize the most common words with a wordcloud
atom.plot_wordcloud()
In [14]:
# Have a look at the most frequent bigrams
atom.plot_ngrams(2)
In [15]:
# Create the bigrams using the tokenizer
atom.tokenize(bigram_freq=215)
Tokenizing the corpus...
 --> Creating 10 bigrams on 4178 locations.
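A bigram is a pair of adjacent tokens; pairs that occur at least bigram_freq times across the corpus are merged into single tokens. A toy sketch of the counting step (illustrative only, not ATOM's implementation):

from collections import Counter

docs = [
    ["computer", "science", "line", "article"],
    ["line", "article", "computer", "science"],
]

# Count every pair of adjacent tokens across all documents
pairs = Counter(p for doc in docs for p in zip(doc, doc[1:]))

# Keep the pairs that reach the frequency threshold
frequent = [p for p, n in pairs.items() if n >= 2]
print(frequent)  # [('computer', 'science'), ('line', 'article')]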
In [16]:
Copied!
atom.bigrams
atom.bigrams
Out[16]:
|   | bigram                     | frequency |
|--:|:---------------------------|----------:|
| 9 | (x, x)                     |      1169 |
| 3 | (line, article)            |       714 |
| 4 | (line, nntppostinghost)    |       493 |
| 0 | (organization, university) |       367 |
| 8 | (gordon, bank)             |       266 |
| 5 | (line, distribution)       |       258 |
| 6 | (distribution, world)      |       249 |
| 1 | (distribution, usa)        |       229 |
| 2 | (usa, line)                |       217 |
| 7 | (computer, science)        |       216 |
In [17]:
# As a last step before modelling, convert the words to vectors
atom.vectorize(strategy="tf-idf")
Vectorizing the corpus...
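With tf-idf, every document becomes a numeric vector with one weight per vocabulary term: the term's frequency in the document, discounted by how many documents contain it. A standalone sketch with scikit-learn's TfidfVectorizer (presumably close to what ATOM uses for this strategy, though that is an assumption):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["worms fascinate my son", "my son weeds the garden"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)         # sparse (n_docs, n_terms) matrix
print(tfidf.get_feature_names_out())  # learned vocabulary, one column per term
print(X.shape)                        # (2, 7)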
In [18]:
# The dimensionality of the dataset has increased a lot!
# Every term in the vocabulary is now its own column.
atom.shape
Out[18]:
(2846, 28630)
In [19]:
# Train the model
atom.run(models="MLP", metric="f1_weighted")
Training ===================================== >>
Models: MLP
Metric: f1_weighted

Results for Multi-layer Perceptron:
Fit ---------------------------------------------
Train evaluation --> f1_weighted: 1.0
Test evaluation --> f1_weighted: 0.8743
Time elapsed: 42.985s
-------------------------------------------------
Total time: 42.985s

Final results ========================= >>
Duration: 42.985s
------------------------------------------
Multi-layer Perceptron --> f1_weighted: 0.8743
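For context, the chain trained here corresponds roughly to the following scikit-learn pipeline (a sketch only: it skips the cleaning, tokenization, and normalization steps, and the MLPClassifier settings are assumptions, not the ones ATOM uses):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Rough equivalent of vectorize + run on raw document strings
pipe = make_pipeline(TfidfVectorizer(), MLPClassifier(random_state=1))
# pipe.fit(docs_train, y_train); pipe.score(docs_test, y_test)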
Analyze results
In [21]:
atom.evaluate()
Out[21]:
|     | balanced_accuracy | f1_weighted | jaccard_weighted | matthews_corrcoef | precision_weighted | recall_weighted |
|:----|------------------:|------------:|-----------------:|------------------:|-------------------:|----------------:|
| MLP |          0.881896 |     0.87426 |         0.779983 |           0.84769 |           0.879316 |        0.876905 |
In [22]:
atom.plot_confusion_matrix(figsize=(10, 10))
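Finally, unseen raw text can be scored end to end: new documents pass through the same cleaning, tokenizing, and vectorizing steps before reaching the model. A minimal sketch, assuming ATOM's scikit-learn-like predict method on the trained model (the sample sentence is made up):

# One row per document, one text column, mirroring the training input
new_doc = [["The dealer said the car needs new brakes before it can be sold."]]
atom.mlp.predict(new_doc)  # -> one predicted class index per document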