Example: Data engines

This example shows how ATOM works with data engines other than pandas, such as polars.

Import the breast cancer dataset from sklearn.datasets. It is a small, easy-to-train dataset where the goal is to predict whether a patient has breast cancer.

Load the data

In [1]:

# Import packages
import polars as pl
from sklearn.datasets import load_breast_cancer
from atom import ATOMClassifier

In [2]:

# Load the data and convert to polars for demonstration purposes
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X = pl.from_pandas(X)
y = pl.from_pandas(y)

X.head()

Out[2]:

shape: (5, 30)

mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	radius error	texture error	perimeter error	area error	smoothness error	compactness error	concavity error	concave points error	symmetry error	fractal dimension error	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
17.99	10.38	122.8	1001.0	0.1184	0.2776	0.3001	0.1471	0.2419	0.07871	1.095	0.9053	8.589	153.4	0.006399	0.04904	0.05373	0.01587	0.03003	0.006193	25.38	17.33	184.6	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.1189
20.57	17.77	132.9	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	0.5435	0.7339	3.398	74.08	0.005225	0.01308	0.0186	0.0134	0.01389	0.003532	24.99	23.41	158.8	1956.0	0.1238	0.1866	0.2416	0.186	0.275	0.08902
19.69	21.25	130.0	1203.0	0.1096	0.1599	0.1974	0.1279	0.2069	0.05999	0.7456	0.7869	4.585	94.03	0.00615	0.04006	0.03832	0.02058	0.0225	0.004571	23.57	25.53	152.5	1709.0	0.1444	0.4245	0.4504	0.243	0.3613	0.08758
11.42	20.38	77.58	386.1	0.1425	0.2839	0.2414	0.1052	0.2597	0.09744	0.4956	1.156	3.445	27.23	0.00911	0.07458	0.05661	0.01867	0.05963	0.009208	14.91	26.5	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.173
20.29	14.34	135.1	1297.0	0.1003	0.1328	0.198	0.1043	0.1809	0.05883	0.7572	0.7813	5.438	94.44	0.01149	0.02461	0.05688	0.01885	0.01756	0.005115	22.54	16.67	152.2	1575.0	0.1374	0.205	0.4	0.1625	0.2364	0.07678

Run the pipeline

In [3]:

# Specify the data engine in the constructor
# Note that atom accepts any dataframe-like object to create the dataset
atom = ATOMClassifier(X, y, engine="polars", verbose=2, random_state=1)

<< ================== ATOM ================== >>

Configuration ==================== >>
Algorithm task: Binary classification.
Data engine: polars

Dataset stats ==================== >>
Shape: (569, 31)
Train set size: 456
Test set size: 113
-------------------------------------
Memory: 138.97 kB
Scaled: False
Outlier values: 167 (1.2%)

In [4]:

# The data attributes now return polars types
atom.X.head(5)

Out[4]:

shape: (5, 30)

mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	radius error	texture error	perimeter error	area error	smoothness error	compactness error	concavity error	concave points error	symmetry error	fractal dimension error	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
13.48	20.82	88.4	559.2	0.1016	0.1255	0.1063	0.05439	0.172	0.06419	0.213	0.5914	1.545	18.52	0.005367	0.02239	0.03049	0.01262	0.01377	0.003187	15.53	26.02	107.3	740.4	0.161	0.4225	0.503	0.2258	0.2807	0.1071
18.31	20.58	120.8	1052.0	0.1068	0.1248	0.1569	0.09451	0.186	0.05941	0.5449	0.9225	3.218	67.36	0.006176	0.01877	0.02913	0.01046	0.01559	0.002725	21.86	26.2	142.2	1493.0	0.1492	0.2536	0.3759	0.151	0.3074	0.07863
17.93	24.48	115.2	998.9	0.08855	0.07027	0.05699	0.04744	0.1538	0.0551	0.4212	1.433	2.765	45.81	0.005444	0.01169	0.01622	0.008522	0.01419	0.002751	20.92	34.69	135.1	1320.0	0.1315	0.1806	0.208	0.1136	0.2504	0.07948
15.13	29.81	96.71	719.5	0.0832	0.04605	0.04686	0.02739	0.1852	0.05294	0.4681	1.627	3.043	45.38	0.006831	0.01427	0.02489	0.009087	0.03151	0.00175	17.26	36.91	110.1	931.4	0.1148	0.09866	0.1547	0.06575	0.3233	0.06165
8.95	15.76	58.74	245.2	0.09462	0.1243	0.09263	0.02308	0.1305	0.07163	0.3132	0.9789	3.28	16.94	0.01835	0.0676	0.09263	0.02308	0.02384	0.005601	9.414	17.07	63.34	270.0	0.1179	0.1879	0.1544	0.03846	0.1652	0.07722

In [5]:

atom.y.head(5)

Out[5]:

shape: (5,)

target
i32
0
0
0
0
1

In [6]:

atom.run("LR")

Training ========================= >>
Models: LR
Metric: f1


Results for LogisticRegression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.9913
Test evaluation --> f1: 0.9861
Time elapsed: 0.129s
-------------------------------------------------
Time: 0.129s


Final results ==================== >>
Total time: 0.132s
-------------------------------------
LogisticRegression --> f1: 0.9861

Analyze the results

In [7]:

# The prediction methods also return types of the requested data engine
atom.lr.predict(X)

Out[7]:

shape: (569,)

target
i64
0
0
0
0
0
0
0
0
0
0
0
0
…
1
1
1
1
1
0
0
0
0
0
0
1

In [8]:

atom.lr.engine = "pandas-pyarrow"
atom.lr.predict(X.head(5))

Out[8]:

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64[pyarrow]

In [9]:

atom.lr.engine = "dask"
atom.lr.predict(X.head(5))

Out[9]:

Dask Series Structure:
npartitions=1
0    int64
4      ...
Name: target, dtype: int64
Dask Name: from_pandas, 1 graph layer

In [10]:

atom.lr.engine = "pyarrow"
atom.lr.predict(X.head(5))

Out[10]:

<pyarrow.lib.Int64Array object at 0x0000016E06BCD1E0>
[
  0,
  0,
  0,
  0,
  0
]
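
The cells above change the `engine` attribute and get the same predictions back in a different container each time. One way to picture this behavior is a lookup from engine name to converter function. The following is a minimal standard-library sketch of that idea, not ATOM's implementation: the names `CONVERTERS` and `to_engine` and the engine labels are hypothetical, and plain Python containers stand in for polars, pyarrow, or dask objects.

```python
from array import array
from typing import Any, Callable, Sequence

# Hypothetical registry: engine label -> function that wraps raw predictions.
# Real engines would build a polars.Series, pyarrow.Array, etc.; standard
# library containers stand in for them here.
CONVERTERS: dict[str, Callable[[Sequence[int]], Any]] = {
    "list": list,
    "tuple": tuple,
    "array": lambda values: array("q", values),  # typed 64-bit integers
}


def to_engine(values: Sequence[int], engine: str) -> Any:
    """Return `values` in the container registered for `engine`."""
    try:
        convert = CONVERTERS[engine]
    except KeyError:
        raise ValueError(
            f"Unknown engine {engine!r}, choose from {sorted(CONVERTERS)}"
        ) from None
    return convert(values)


predictions = [0, 0, 0, 0, 1]
as_tuple = to_engine(predictions, "tuple")
as_array = to_engine(predictions, "array")
print(type(as_tuple).__name__, type(as_array).__name__)
```

The values stay the same across engines and only the wrapping type changes, which mirrors what the outputs above show for the first five predictions.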