Example: Utilities¶
This example shows various useful utilities that can be used to improve atom's pipelines.
The data used is a variation on the Australian weather dataset from Kaggle. You can download it from here. The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column RainTomorrow.
Load the data¶
In [1]:
# Import packages
import pandas as pd
from sklearn.metrics import fbeta_score
from atom import ATOMClassifier, ATOMLoader
In [2]:
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")
# Let's have a look
X.head()
Out[2]:
| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |
5 rows × 22 columns
Use the utility attributes¶
In [3]:
atom = ATOMClassifier(X, random_state=1)
atom.clean()
# Quickly check what columns have missing values
print(f"Columns with missing values:\n{atom.nans}")
# Or what columns are categorical
print(f"\nCategorical columns: {atom.categorical}")
# Or if the dataset is scaled
print(f"\nIs the dataset scaled? {atom.scaled}")
Columns with missing values:
MinTemp            637
MaxTemp            322
Rainfall          1406
Evaporation      60843
Sunshine         67816
WindGustDir       9330
WindGustSpeed     9270
WindDir9am       10013
WindDir3pm        3778
WindSpeed9am      1348
WindSpeed3pm      2630
Humidity9am       1774
Humidity3pm       3610
Pressure9am      14014
Pressure3pm      13981
Cloud9am         53657
Cloud3pm         57094
Temp9am            904
Temp3pm           2726
RainToday         1406
dtype: int64

Categorical columns: Index(['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday'], dtype='object')

Is the dataset scaled? False
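For readers who want to see what these attributes report, the following is a minimal plain-pandas sketch of a roughly equivalent check. It is not ATOM's implementation, and the toy DataFrame `df` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (illustrative values only)
df = pd.DataFrame({
    "Location": ["Cairns", "Adelaide", "Portland"],
    "MinTemp": [18.6, np.nan, 13.6],
    "Sunshine": [6.1, np.nan, np.nan],
    "RainToday": ["Yes", "No", None],
})

# Count missing values per column and keep only columns that have any
nans = df.isna().sum()
nans = nans[nans > 0]

# Columns whose dtype is not numeric are treated as categorical here
categorical = df.select_dtypes(exclude="number").columns

print(nans)
print(list(categorical))
```

In this sketch `Location` does not appear in `nans` because it has no missing entries, which mirrors how the output above lists only the affected columns.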
Use the stats method to assess changes in the dataset¶
In [4]:
# Note the number of missing values and categorical columns
atom.stats()
Dataset stats ==================== >>
Shape: (142193, 22)
Memory: 61.69 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicate samples: 45 (0.0%)
-------------------------------------
Train set size: 113755
Test set size: 28438
-------------------------------------
|   | dataset   | train     | test      |
| - | --------- | --------- | --------- |
| 0 | 0 (0.0)   | 0 (0.0)   | 0 (0.0)   |
| 1 | 0 (0.0)   | 0 (0.0)   | 0 (0.0)   |
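The percentages in this output can be checked by hand from the reported shape. The short sketch below redoes the arithmetic with the figures copied from the output above. It assumes that the missing-value share is taken over all cells and that the categorical share is taken over the 21 non-target columns, which is what the printed numbers are consistent with.

```python
# Figures copied from the stats output above
n_rows, n_cols = 142193, 22
missing = 316559
categorical = 5
duplicates = 45
train, test = 113755, 28438

# Missing values as a share of all cells in the dataset
pct_missing = round(100 * missing / (n_rows * n_cols), 1)

# Categorical columns as a share of the feature columns
# (assumption: the target column is excluded, leaving 21 features)
pct_categorical = round(100 * categorical / (n_cols - 1), 1)

# Duplicate rows as a share of all rows
pct_duplicates = round(100 * duplicates / n_rows, 1)

# Fraction of rows that ended up in the test set
test_fraction = round(test / n_rows, 2)

print(pct_missing, pct_categorical, pct_duplicates, test_fraction)
# → 10.1 23.8 0.0 0.2
```

The train and test sizes add up to the full row count, and the test set holds about 20% of the rows.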
In [5]:
# Now, let's impute and encode the dataset...
atom.impute()
atom.encode()
# ... and the missing values and categorical columns are gone
atom.stats()
Dataset stats ==================== >>
Shape: (56420, 22)
Memory: 9.93 MB
Scaled: False
Outlier values: 3203 (0.3%)
-------------------------------------
Train set size: 45075
Test set size: 11345
-------------------------------------
|   | dataset   | train     | test      |
| - | --------- | --------- | --------- |
| 0 | 0 (0.0)   | 0 (0.0)   | 0 (0.0)   |
| 1 | 0 (0.0)   | 0 (0.0)   | 0 (0.0)   |
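Note that the dataset shrank from 142193 to 56420 rows, which is consistent with rows containing missing values being dropped rather than filled. The plain-pandas sketch below illustrates the difference between the two approaches on a made-up frame. It is only an illustration of the general idea, not a description of what ATOM does internally.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column (illustrative values only)
df = pd.DataFrame({
    "MinTemp": [18.0, np.nan, 13.6, 16.4],
    "Sunshine": [8.9, 6.1, np.nan, 0.0],
})

# Dropping removes every row that has at least one missing value
dropped = df.dropna()

# Filling keeps all rows and replaces missing values with the column mean
filled = df.fillna(df.mean())

print(dropped.shape, filled.shape)
# → (2, 2) (4, 2)
```

Dropping is simple but can discard a large share of the data, as the change in shape above shows. Filling keeps every row at the cost of introducing estimated values.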
Inspect feature distributions¶
In [6]:
# Compare the relationships between multiple columns with a scatter matrix
atom.plot_relationships(columns=slice(0, 5))