{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Debug pipeline\n", "\n", "This document demonstrates how you might use a `DebugPipeline`. It is much like a normal scikit-learn `Pipeline` but it offers more debugging options.\n", "\n", "We'll first set up libraries and config." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.base import BaseEstimator, TransformerMixin\n", "from sklearn.pipeline import FeatureUnion\n", "\n", "from sklego.pipeline import DebugPipeline\n", "\n", "logging.basicConfig(\n", " format=('[%(funcName)s:%(lineno)d] - %(message)s'),\n", " level=logging.INFO\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next up, let's make a simple transformer." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# DebugPipeline set-up\n", "n_samples, n_features = 3, 5\n", "X = np.zeros((n_samples, n_features))\n", "y = np.arange(n_samples)\n", "\n", "\n", "class Adder(TransformerMixin, BaseEstimator):\n", " def __init__(self, value):\n", " self._value = value\n", " \n", " def fit(self, X, y=None):\n", " return self\n", "\n", " def transform(self, X):\n", " return X + self._value\n", " \n", " def __repr__(self):\n", " return f'Adder(value={self._value})'\n", " \n", " \n", "steps = [\n", " ('add_1', Adder(value=1)),\n", " ('add_10', Adder(value=10)),\n", " ('add_100', Adder(value=100)),\n", " ('add_1000', Adder(value=1000)),\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This pipeline behaves exactly the same as a normal pipeline. So let's use it." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Transformed X:\n", " [[1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111.]]\n" ] } ], "source": [ "pipe = DebugPipeline(steps)\n", "\n", "pipe.fit(X, y=y)\n", "X_out = pipe.transform(X)\n", "\n", "print('Transformed X:\\n', X_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Log statements\n", "\n", "It is possible to set a `log_callback` variable that logs in between each step.\n", "\n", "_Note: there are __three__ log statements while there are __four__ steps, because there are __three__ moments __in between__ the steps. The output can be checked outside of the pipeline._" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[default_log_callback:34] - [Adder(value=1)] shape=(3, 5) time=0s\n", "[default_log_callback:34] - [Adder(value=10)] shape=(3, 5) time=0s\n", "[default_log_callback:34] - [Adder(value=100)] shape=(3, 5) time=0s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Transformed X:\n", " [[1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111.]]\n" ] } ], "source": [ "pipe = DebugPipeline(steps, log_callback='default')\n", "\n", "pipe.fit(X, y=y)\n", "X_out = pipe.transform(X)\n", "\n", "print('Transformed X:\\n', X_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set the `log_callback` function later\n", "\n", "It is possible to set the `log_callback` later then initialisation. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[default_log_callback:34] - [Adder(value=1)] shape=(3, 5) time=0s\n", "[default_log_callback:34] - [Adder(value=10)] shape=(3, 5) time=0s\n", "[default_log_callback:34] - [Adder(value=100)] shape=(3, 5) time=0s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Transformed X:\n", " [[1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111.]]\n" ] } ], "source": [ "pipe = DebugPipeline(steps)\n", "pipe.log_callback = 'default'\n", "\n", "pipe.fit(X, y=y)\n", "X_out = pipe.transform(X)\n", "\n", "print('Transformed X:\\n', X_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Custom `log_callback`\n", "\n", "The custom log callback function expect the output of each step, which is an tuple containing the output of the step and the step itself, and the execution time of the step." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[log_callback:16] - [Adder(value=1)] shape=(3, 5) nbytes=120 time=4.935264587402344e-05\n", "[log_callback:16] - [Adder(value=10)] shape=(3, 5) nbytes=120 time=3.0040740966796875e-05\n", "[log_callback:16] - [Adder(value=100)] shape=(3, 5) nbytes=120 time=2.5510787963867188e-05\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Transformed X:\n", " [[1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111.]]\n" ] } ], "source": [ "def log_callback(output, execution_time, **kwargs):\n", " '''My custom `log_callback` function\n", " \n", " Parameters\n", " ----------\n", " output : tuple(\n", " numpy.ndarray or pandas.DataFrame\n", " :class:estimator or :class:transformer\n", " )\n", " The output of the step and a step in the pipeline.\n", " execution_time : float\n", " The execution time of the step.\n", " '''\n", " logger = logging.getLogger(__name__)\n", " step_result, step = output\n", " logger.info(f'[{step}] shape={step_result.shape} '\n", " f'nbytes={step_result.nbytes} time={execution_time}')\n", "\n", " \n", "pipe.log_callback = log_callback\n", "\n", "pipe.fit(X, y=y)\n", "X_out = pipe.transform(X)\n", "\n", "print('Transformed X:\\n', X_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature union\n", "\n", "Feature union also works with the debug pipeline." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[default_log_callback:34] - [Adder(value=1)] shape=(3, 5) time=0s\n", "[default_log_callback:34] - [Adder(value=10)] shape=(3, 5) time=0s\n", "[default_log_callback:34] - [Adder(value=100)] shape=(3, 5) time=0s\n", "[log_callback:16] - [Adder(value=1)] shape=(3, 5) nbytes=120 time=2.1219253540039062e-05\n", "[log_callback:16] - [Adder(value=10)] shape=(3, 5) nbytes=120 time=7.05718994140625e-05\n", "[log_callback:16] - [Adder(value=100)] shape=(3, 5) nbytes=120 time=2.8371810913085938e-05\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Transformed X:\n", " [[1111. 1111. 1111. 1111. 1111. 1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111. 1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111. 1111. 1111. 1111. 1111. 1111.]]\n" ] } ], "source": [ "pipe_w_default_log_callback = DebugPipeline(steps, log_callback='default')\n", "pipe_w_custom_log_callback = DebugPipeline(steps, log_callback=log_callback)\n", "\n", "pipe_union = FeatureUnion([\n", " ('pipe_w_default_log_callback', pipe_w_default_log_callback),\n", " ('pipe_w_custom_log_callback', pipe_w_custom_log_callback),\n", "])\n", "\n", "pipe_union.fit(X, y=y)\n", "X_out = pipe_union.transform(X)\n", "\n", "print('Transformed X:\\n', X_out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enough logging\n", "\n", "Remove the `log_callback` function when not needed anymore." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Transformed X:\n", " [[1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111.]\n", " [1111. 1111. 1111. 1111. 1111.]]\n" ] } ], "source": [ "pipe.log_callback = None\n", "\n", "pipe.fit(X, y=y)\n", "X_out = pipe.transform(X)\n", "\n", "print('Transformed X:\\n', X_out)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" }, "pycharm": { "stem_cell": { "cell_type": "raw", "source": [], "metadata": { "collapsed": false } } } }, "nbformat": 4, "nbformat_minor": 4 }