{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Example: Data engines\n",
"-----------------------\n",
"\n",
"This example shows how ATOM works with data engines other than pandas, such as [polars](https://pola.rs/).\n",
"\n",
"Import the breast cancer dataset from [sklearn.datasets](https://scikit-learn.org/stable/datasets/index.html#wine-dataset). This is a small, easy-to-train dataset whose goal is to predict whether a patient has breast cancer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import packages\n",
"import polars as pl\n",
"from sklearn.datasets import load_breast_cancer\n",
"from atom import ATOMClassifier"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
shape: (5, 30)mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
---|
f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | 1.095 | 0.9053 | 8.589 | 153.4 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.0186 | 0.0134 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.00615 | 0.04006 | 0.03832 | 0.02058 | 0.0225 | 0.004571 | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | 0.4956 | 1.156 | 3.445 | 27.23 | 0.00911 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.01149 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
"
],
"text/plain": [
"shape: (5, 30)\n",
"┌─────────────┬──────────────┬────────────────┬───────────┬───┬─────────────────┬──────────────────────┬────────────────┬─────────────────────────┐\n",
"│ mean radius ┆ mean texture ┆ mean perimeter ┆ mean area ┆ … ┆ worst concavity ┆ worst concave points ┆ worst symmetry ┆ worst fractal dimension │\n",
"│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n",
"│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │\n",
"╞═════════════╪══════════════╪════════════════╪═══════════╪═══╪═════════════════╪══════════════════════╪════════════════╪═════════════════════════╡\n",
"│ 17.99 ┆ 10.38 ┆ 122.8 ┆ 1001.0 ┆ … ┆ 0.7119 ┆ 0.2654 ┆ 0.4601 ┆ 0.1189 │\n",
"│ 20.57 ┆ 17.77 ┆ 132.9 ┆ 1326.0 ┆ … ┆ 0.2416 ┆ 0.186 ┆ 0.275 ┆ 0.08902 │\n",
"│ 19.69 ┆ 21.25 ┆ 130.0 ┆ 1203.0 ┆ … ┆ 0.4504 ┆ 0.243 ┆ 0.3613 ┆ 0.08758 │\n",
"│ 11.42 ┆ 20.38 ┆ 77.58 ┆ 386.1 ┆ … ┆ 0.6869 ┆ 0.2575 ┆ 0.6638 ┆ 0.173 │\n",
"│ 20.29 ┆ 14.34 ┆ 135.1 ┆ 1297.0 ┆ … ┆ 0.4 ┆ 0.1625 ┆ 0.2364 ┆ 0.07678 │\n",
"└─────────────┴──────────────┴────────────────┴───────────┴───┴─────────────────┴──────────────────────┴────────────────┴─────────────────────────┘"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the data and convert to polars for demonstration purposes\n",
"X, y = load_breast_cancer(return_X_y=True, as_frame=True)\n",
"\n",
"X = pl.from_pandas(X)\n",
"y = pl.from_pandas(y)\n",
"\n",
"X.head()"
]
},
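{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above converts pandas objects to polars with `pl.from_pandas`; the opposite direction is `to_pandas`. The following is a minimal sketch of that round trip, independent of ATOM. It assumes pandas, polars and pyarrow are installed, and the frame `pdf` with its values is made up for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import polars as pl\n",
"\n",
"# Hypothetical toy frame, not taken from the breast cancer dataset\n",
"pdf = pd.DataFrame({\"radius\": [10.0, 12.5, 15.0], \"texture\": [1.0, 2.0, 3.0]})\n",
"\n",
"# pandas -> polars (polars uses pyarrow for this conversion)\n",
"plf = pl.from_pandas(pdf)\n",
"\n",
"# polars -> pandas\n",
"roundtrip = plf.to_pandas()\n",
"\n",
"print(type(plf).__name__, plf.shape)\n",
"print(type(roundtrip).__name__, list(roundtrip.columns))\n",
"```\n",
"\n",
"Column names and values should survive both conversions, which is why the notebook can start from the pandas objects that sklearn returns."
]
},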
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run the pipeline"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<< ================== ATOM ================== >>\n",
"\n",
"Configuration ==================== >>\n",
"Algorithm task: Binary classification.\n",
"Data engine: polars\n",
"\n",
"Dataset stats ==================== >>\n",
"Shape: (569, 31)\n",
"Train set size: 456\n",
"Test set size: 113\n",
"-------------------------------------\n",
"Memory: 138.97 kB\n",
"Scaled: False\n",
"Outlier values: 167 (1.2%)\n",
"\n"
]
}
],
"source": [
"# Specify the data engine in the constructor\n",
"# Note that atom accepts any dataframe-like object to create the dataset\n",
"atom = ATOMClassifier(X, y, engine=\"polars\", verbose=2, random_state=1)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (5, 30)mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
---|
f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
13.48 | 20.82 | 88.4 | 559.2 | 0.1016 | 0.1255 | 0.1063 | 0.05439 | 0.172 | 0.06419 | 0.213 | 0.5914 | 1.545 | 18.52 | 0.005367 | 0.02239 | 0.03049 | 0.01262 | 0.01377 | 0.003187 | 15.53 | 26.02 | 107.3 | 740.4 | 0.161 | 0.4225 | 0.503 | 0.2258 | 0.2807 | 0.1071 |
18.31 | 20.58 | 120.8 | 1052.0 | 0.1068 | 0.1248 | 0.1569 | 0.09451 | 0.186 | 0.05941 | 0.5449 | 0.9225 | 3.218 | 67.36 | 0.006176 | 0.01877 | 0.02913 | 0.01046 | 0.01559 | 0.002725 | 21.86 | 26.2 | 142.2 | 1493.0 | 0.1492 | 0.2536 | 0.3759 | 0.151 | 0.3074 | 0.07863 |
17.93 | 24.48 | 115.2 | 998.9 | 0.08855 | 0.07027 | 0.05699 | 0.04744 | 0.1538 | 0.0551 | 0.4212 | 1.433 | 2.765 | 45.81 | 0.005444 | 0.01169 | 0.01622 | 0.008522 | 0.01419 | 0.002751 | 20.92 | 34.69 | 135.1 | 1320.0 | 0.1315 | 0.1806 | 0.208 | 0.1136 | 0.2504 | 0.07948 |
15.13 | 29.81 | 96.71 | 719.5 | 0.0832 | 0.04605 | 0.04686 | 0.02739 | 0.1852 | 0.05294 | 0.4681 | 1.627 | 3.043 | 45.38 | 0.006831 | 0.01427 | 0.02489 | 0.009087 | 0.03151 | 0.00175 | 17.26 | 36.91 | 110.1 | 931.4 | 0.1148 | 0.09866 | 0.1547 | 0.06575 | 0.3233 | 0.06165 |
8.95 | 15.76 | 58.74 | 245.2 | 0.09462 | 0.1243 | 0.09263 | 0.02308 | 0.1305 | 0.07163 | 0.3132 | 0.9789 | 3.28 | 16.94 | 0.01835 | 0.0676 | 0.09263 | 0.02308 | 0.02384 | 0.005601 | 9.414 | 17.07 | 63.34 | 270.0 | 0.1179 | 0.1879 | 0.1544 | 0.03846 | 0.1652 | 0.07722 |
"
],
"text/plain": [
"shape: (5, 30)\n",
"┌─────────────┬──────────────┬────────────────┬───────────┬───┬─────────────────┬──────────────────────┬────────────────┬─────────────────────────┐\n",
"│ mean radius ┆ mean texture ┆ mean perimeter ┆ mean area ┆ … ┆ worst concavity ┆ worst concave points ┆ worst symmetry ┆ worst fractal dimension │\n",
"│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n",
"│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │\n",
"╞═════════════╪══════════════╪════════════════╪═══════════╪═══╪═════════════════╪══════════════════════╪════════════════╪═════════════════════════╡\n",
"│ 13.48 ┆ 20.82 ┆ 88.4 ┆ 559.2 ┆ … ┆ 0.503 ┆ 0.2258 ┆ 0.2807 ┆ 0.1071 │\n",
"│ 18.31 ┆ 20.58 ┆ 120.8 ┆ 1052.0 ┆ … ┆ 0.3759 ┆ 0.151 ┆ 0.3074 ┆ 0.07863 │\n",
"│ 17.93 ┆ 24.48 ┆ 115.2 ┆ 998.9 ┆ … ┆ 0.208 ┆ 0.1136 ┆ 0.2504 ┆ 0.07948 │\n",
"│ 15.13 ┆ 29.81 ┆ 96.71 ┆ 719.5 ┆ … ┆ 0.1547 ┆ 0.06575 ┆ 0.3233 ┆ 0.06165 │\n",
"│ 8.95 ┆ 15.76 ┆ 58.74 ┆ 245.2 ┆ … ┆ 0.1544 ┆ 0.03846 ┆ 0.1652 ┆ 0.07722 │\n",
"└─────────────┴──────────────┴────────────────┴───────────┴───┴─────────────────┴──────────────────────┴────────────────┴─────────────────────────┘"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The data attributes now return polars types\n",
"atom.X.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
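"<div>\n",
"<small>shape: (5,)</small>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th>target</th></tr>\n",
"<tr><td>i32</td></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>1</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>"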
],
"text/plain": [
"shape: (5,)\n",
"Series: 'target' [i32]\n",
"[\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t1\n",
"]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"atom.y.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Training ========================= >>\n",
"Models: LR\n",
"Metric: f1\n",
"\n",
"\n",
"Results for LogisticRegression:\n",
"Fit ---------------------------------------------\n",
"Train evaluation --> f1: 0.9913\n",
"Test evaluation --> f1: 0.9861\n",
"Time elapsed: 0.129s\n",
"-------------------------------------------------\n",
"Time: 0.129s\n",
"\n",
"\n",
"Final results ==================== >>\n",
"Total time: 0.132s\n",
"-------------------------------------\n",
"LogisticRegression --> f1: 0.9861\n"
]
}
],
"source": [
"atom.run(\"LR\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyze the results"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
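"<div>\n",
"<small>shape: (569,)</small>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th>target</th></tr>\n",
"<tr><td>i64</td></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>&hellip;</td></tr>\n",
"<tr><td>1</td></tr>\n",
"<tr><td>1</td></tr>\n",
"<tr><td>1</td></tr>\n",
"<tr><td>1</td></tr>\n",
"<tr><td>1</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>0</td></tr>\n",
"<tr><td>1</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>"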
],
"text/plain": [
"shape: (569,)\n",
"Series: 'target' [i64]\n",
"[\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t…\n",
"\t1\n",
"\t1\n",
"\t1\n",
"\t1\n",
"\t1\n",
"\t1\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t0\n",
"\t1\n",
"]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The prediction methods also return objects of the requested data engine\n",
"atom.lr.predict(X)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n",
"Name: target, dtype: int64[pyarrow]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Change the model's data engine: predictions are now pandas objects with pyarrow dtypes\n",
"atom.lr.engine = \"pandas-pyarrow\"\n",
"atom.lr.predict(X.head(5))"
]
},
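{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `int64[pyarrow]` dtype in the output above is pandas' Arrow-backed storage. The following is a minimal sketch, independent of ATOM and assuming pandas and pyarrow are installed, of the same values stored with the default NumPy backend and with the Arrow backend. The labels in `values` are made up for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import pyarrow as pa\n",
"\n",
"values = [0, 0, 1]  # hypothetical labels\n",
"\n",
"# Same values, two storage backends\n",
"numpy_backed = pd.Series(values, name=\"target\")\n",
"arrow_backed = pd.Series(values, name=\"target\", dtype=\"int64[pyarrow]\")\n",
"\n",
"print(numpy_backed.dtype)\n",
"print(arrow_backed.dtype)\n",
"\n",
"# The string alias is shorthand for an ArrowDtype wrapping a pyarrow type\n",
"print(arrow_backed.dtype == pd.ArrowDtype(pa.int64()))\n",
"```\n",
"\n",
"Both series hold the same labels; only the underlying memory layout differs."
]
},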
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Dask Series Structure:\n",
"npartitions=1\n",
"0 int64\n",
"4 ...\n",
"Name: target, dtype: int64\n",
"Dask Name: from_pandas, 1 graph layer"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# With the dask engine, predictions are returned as a lazy dask Series\n",
"atom.lr.engine = \"dask\"\n",
"atom.lr.predict(X.head(5))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"[\n",
" 0,\n",
" 0,\n",
" 0,\n",
" 0,\n",
" 0\n",
"]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# With the pyarrow engine, predictions are returned as a pyarrow array\n",
"atom.lr.engine = \"pyarrow\"\n",
"atom.lr.predict(X.head(5))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}