Preprocessing
- class sklego.preprocessing.ColumnCapper(quantile_range=(5.0, 95.0), interpolation='linear', discard_infs=False, copy=True)[source]
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Caps the values of columns according to the given quantile thresholds.
- Parameters
quantile_range (tuple or list, optional, default=(5.0, 95.0)) – The lower and upper quantiles used for capping. Both values must lie in the interval [0, 100].
interpolation (str, optional, default='linear') – The interpolation method used to compute a quantile when the desired quantile lies between two data points i and j. The available values are:
'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
'lower': i.
'higher': j.
'nearest': i or j, whichever is nearest.
'midpoint': (i + j) / 2.
discard_infs (bool, optional, default=False) – Whether to discard -np.inf and np.inf values. If False, such values are capped. If True, they are replaced by np.nan.
Note
Setting discard_infs=True is important if the inf values result from divisions by 0, which pandas interprets as -np.inf or np.inf depending on the sign of the numerator.
copy (bool, optional, default=True) – If False, try to avoid a copy and cap in place instead. This is not guaranteed to always work in place; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
- Raises
TypeError, ValueError
- Example
>>> import pandas as pd
>>> import numpy as np
>>> from sklego.preprocessing import ColumnCapper
>>> df = pd.DataFrame({'a':[2, 4.5, 7, 9], 'b':[11, 12, np.inf, 14]})
>>> df
     a     b
0  2.0  11.0
1  4.5  12.0
2  7.0   inf
3  9.0  14.0
>>> capper = ColumnCapper()
>>> capper.fit_transform(df)
array([[ 2.375, 11.1  ],
       [ 4.5  , 12.   ],
       [ 7.   , 13.8  ],
       [ 8.7  , 13.8  ]])
>>> capper = ColumnCapper(discard_infs=True) # Discarding infs
>>> df[['a', 'b']] = capper.fit_transform(df)
>>> df
       a     b
0  2.375  11.1
1  4.500  12.0
2  7.000   NaN
3  8.700  13.8
- fit(X, y=None)[source]
Computes the quantiles for each column of X.
- Parameters
X (pandas.DataFrame or numpy.ndarray) – The column(s) from which the capping limit(s) will be computed.
y – Ignored.
- Return type
- Returns
The fitted object.
- Raises
ValueError if X contains non-numeric columns.
- transform(X)[source]
Performs the capping on the column(s) of X.
- Parameters
X (pandas.DataFrame or numpy.ndarray) – The column(s) for which the capping limit(s) will be applied.
- Return type
numpy.ndarray
- Returns
X values with capped limits.
- Raises
ValueError if the number of columns in X differs from the number of columns seen when fitting.
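The fit/transform split described above can be sketched in plain NumPy. This is a minimal illustration, not sklego's implementation: the helper names `fit_caps` and `apply_caps` are hypothetical, and it assumes that infinite values are left out of the quantile computation, which is consistent with the example output shown for ColumnCapper.

```python
import numpy as np

def fit_caps(X, quantile_range=(5.0, 95.0)):
    """Return per-column (lower, upper) caps, ignoring NaN and +/-inf."""
    X = np.asarray(X, dtype=float)
    # Assumption: infs are excluded from the quantile computation.
    finite = np.where(np.isinf(X), np.nan, X)
    lower, upper = np.nanpercentile(finite, quantile_range, axis=0)
    return lower, upper

def apply_caps(X, lower, upper, discard_infs=False):
    """Clip each column to its caps; optionally turn infs into NaN first."""
    X = np.asarray(X, dtype=float).copy()
    if discard_infs:
        X[np.isinf(X)] = np.nan
    return np.clip(X, lower, upper)  # NaN entries stay NaN

X = np.array([[2.0, 11.0], [4.5, 12.0], [7.0, np.inf], [9.0, 14.0]])
lower, upper = fit_caps(X)                      # learned once, on training data
capped = apply_caps(X, lower, upper)            # inf is capped to the upper limit
discarded = apply_caps(X, lower, upper, discard_infs=True)  # inf becomes NaN
```

Because the caps are learned in a separate step, the same limits can be reused on unseen data, which mirrors how fit and transform are documented above.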
- class sklego.preprocessing.ColumnDropper(columns: list)[source]
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Allows dropping specific columns from a pandas DataFrame by name. Can be useful in a sklearn Pipeline.
- Parameters
columns – column name (str) or list of column names to be dropped
Note
Raises a TypeError if the input provided is not a DataFrame.
Raises a KeyError if the columns provided are not in the input DataFrame.
- Example
>>> # Dropping a single column from a pandas DataFrame
>>> import pandas as pd
>>> from sklego.preprocessing import ColumnDropper
>>> df = pd.DataFrame({
...     'name': ['Swen', 'Victor', 'Alex'],
...     'length': [1.82, 1.85, 1.80],
...     'shoesize': [42, 44, 45]
... })
>>> ColumnDropper(['name']).fit_transform(df)
   length  shoesize
0    1.82        42
1    1.85        44
2    1.80        45

>>> # Dropping multiple columns from a pandas DataFrame
>>> ColumnDropper(['length', 'shoesize']).fit_transform(df)
     name
0    Swen
1  Victor
2    Alex

>>> # Dropping non-existent columns results in a KeyError
>>> ColumnDropper(['weight']).fit_transform(df)
Traceback (most recent call last):
    ...
KeyError: "['weight'] column(s) not in DataFrame"

>>> # How to use the ColumnDropper in a sklearn Pipeline
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> pipe = Pipeline([
...     ('select', ColumnDropper(['name', 'shoesize'])),
...     ('scale', StandardScaler()),
... ])
>>> pipe.fit_transform(df)
array([[-0.16222142],
       [ 1.29777137],
       [-1.13554995]])
- class sklego.preprocessing.ColumnSelector(columns: list)[source]
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Allows selecting specific columns from a pandas DataFrame by name. Can be useful in a sklearn Pipeline.
- Parameters
columns – column name (str) or list of column names to be selected
Note
Raises a TypeError if the input provided is not a DataFrame.
Raises a KeyError if the columns provided are not in the input DataFrame.
- Example
>>> # Selecting a single column from a pandas DataFrame
>>> import pandas as pd
>>> from sklego.preprocessing import ColumnSelector
>>> df = pd.DataFrame({
...     'name': ['Swen', 'Victor', 'Alex'],
...     'length': [1.82, 1.85, 1.80],
...     'shoesize': [42, 44, 45]
... })
>>> ColumnSelector(['length']).fit_transform(df)
   length
0    1.82
1    1.85
2    1.80

>>> # Selecting multiple columns from a pandas DataFrame
>>> ColumnSelector(['length', 'shoesize']).fit_transform(df)
   length  shoesize
0    1.82        42
1    1.85        44
2    1.80        45

>>> # Selecting non-existent columns results in a KeyError
>>> ColumnSelector(['weight']).fit_transform(df)
Traceback (most recent call last):
    ...
KeyError: "['weight'] column(s) not in DataFrame"

>>> # How to use the ColumnSelector in a sklearn Pipeline
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> pipe = Pipeline([
...     ('select', ColumnSelector(['length'])),
...     ('scale', StandardScaler()),
... ])
>>> pipe.fit_transform(df)
array([[-0.16222142],
       [ 1.29777137],
       [-1.13554995]])
- class sklego.preprocessing.DictMapper(mapper, default)[source]
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Maps the values of columns according to the input dictionary, falling back to the default if a key is not present in the dictionary.
- Parameters
mapper – The dictionary containing the mapping of the values
default – The value to fall back to if the value is not in the mapper
- fit(X, y=None)[source]
Checks the input data and records its shape.
- Parameters
X (pandas.DataFrame or numpy.ndarray) – The column(s) to which the mapping will be applied.
y – Ignored.
- Return type
- Returns
The fitted object.
- transform(X)[source]
Performs the mapping on the column(s) of X.
- Parameters
X (pandas.DataFrame or numpy.ndarray) – The column(s) for which the mapping will be applied.
- Return type
numpy.ndarray
- Returns
X values with the mapping applied.
- Raises
ValueError if the number of columns in X differs from the number of columns seen when fitting.
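The mapping with a fallback described for DictMapper can be illustrated with a dictionary lookup applied element-wise. This is a hedged sketch in plain NumPy: `map_values` is a hypothetical helper, not part of sklego, and the data is made up for illustration.

```python
import numpy as np

def map_values(X, mapper, default):
    """Replace every element of X by mapper[element], or default if absent."""
    X = np.asarray(X, dtype=object)
    # dict.get provides the fall-back behaviour for unknown keys.
    lookup = np.vectorize(lambda value: mapper.get(value, default), otypes=[object])
    return lookup(X)

X = np.array([["a", "b"], ["c", "z"]], dtype=object)
mapped = map_values(X, mapper={"a": 1, "b": 2, "c": 3}, default=-1)
# "z" is not a key of the mapper, so it falls back to the default -1.
```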
- class sklego.preprocessing.IdentityTransformer(check_X: bool = False)[source]
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
The identity transformer returns what it is fed without applying any transformation. It exists because it lets you build more expressive pipelines.
- Parameters
check_X (bool, optional, default=False) – Whether to validate that X is a non-empty 2D array of finite values and attempt to cast X to float. If disabled, the model/pipeline is expected to handle e.g. missing, non-numeric, or non-finite values.
- class sklego.preprocessing.InformationFilter(columns, alpha=1)[source]
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
The InformationFilter uses a variant of the Gram-Schmidt process to filter information out of the dataset. This can be useful if you want to filter information out of a dataset for fairness reasons. To explain how it works: suppose a training matrix \(X\) contains columns \(x_1, ..., x_k\). If we assume columns \(x_1\) and \(x_2\) to be the sensitive columns, then the information filter removes information by applying these transformations:
\[\begin{split}
v_1 & = x_1 \\
v_2 & = x_2 - \frac{x_2 \cdot v_1}{v_1 \cdot v_1} v_1 \\
v_3 & = x_3 - \frac{x_3 \cdot v_1}{v_1 \cdot v_1} v_1 - \frac{x_3 \cdot v_2}{v_2 \cdot v_2} v_2 \\
... \\
v_k & = x_k - \frac{x_k \cdot v_1}{v_1 \cdot v_1} v_1 - \frac{x_k \cdot v_2}{v_2 \cdot v_2} v_2
\end{split}\]
Concatenating our vectors (but removing the sensitive ones) gives us a new training matrix \(X_{fair} = [v_3, ..., v_k]\).
- Parameters
columns – the columns to filter out; this can be a sequence of either int (in the case of numpy) or string (in the case of pandas).
alpha – parameter to control how much to filter: for alpha=1 we filter out all information, while for alpha=0 we don’t apply any filtering.
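The projection steps above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not sklego's code: the helper names `project` and `information_filter` are hypothetical, the sensitive columns are assumed to be linearly independent, and alpha is assumed to blend linearly between the filtered and the original non-sensitive columns (the documentation only fixes the behaviour at alpha=0 and alpha=1).

```python
import numpy as np

def project(v, x):
    """Projection of vector x onto vector v."""
    return (x @ v) / (v @ v) * v

def information_filter(X, sensitive_idx, alpha=1.0):
    X = np.asarray(X, dtype=float)
    # Orthogonalise the sensitive columns among themselves (v_1, v_2, ...).
    basis = []
    for i in sensitive_idx:
        v = X[:, i].copy()
        for b in basis:
            v -= project(b, X[:, i])
        basis.append(v)
    # Remove the sensitive directions from every remaining column.
    keep = [j for j in range(X.shape[1]) if j not in sensitive_idx]
    filtered = []
    for j in keep:
        v = X[:, j].copy()
        for b in basis:
            v -= project(b, X[:, j])
        filtered.append(v)
    X_fair = np.column_stack(filtered)
    # Assumption: alpha interpolates linearly between filtered and original.
    return alpha * X_fair + (1 - alpha) * X[:, keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X_fair = information_filter(X, sensitive_idx=[0, 1])
# Every filtered column is orthogonal to both sensitive columns.
overlap = X[:, [0, 1]].T @ X_fair
```

With alpha=1 the entries of `overlap` are zero up to floating-point error, which is the sense in which the linear information about the sensitive columns has been removed.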
- class sklego.preprocessing.IntervalEncoder(n_chunks=10, span=1, method='normal')[source]
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
The interval encoder bends features in X with regard to y. We take each column in X separately and smooth it towards y using the strategy defined in method. Note that this allows us to make certain features strictly monotonic in your machine learning model if you follow this with an appropriate model.
- Parameters
n_chunks – the number of cuts that make up the interval
span – a hyperparameter for the interpolation method; if the method is “normal” it resembles the width of the radial basis function used to weigh the points. It is ignored if the method is “increasing” or “decreasing”.
method – the interpolation method used, must be one of “average”, “normal”, “increasing”, “decreasing”; default: “normal”
- class sklego.preprocessing.OrthogonalTransformer(normalize=False)[source]
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Transforms the columns of a dataframe or numpy array into a column-orthogonal or orthonormal matrix. It computes Q and R such that X = Q*R, with Q orthogonal, from which it follows that Q = X*inv(R).
- Parameters
normalize – whether the orthogonal matrix should be orthonormal as well
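The relation between X, Q and R can be checked directly with NumPy's QR decomposition. This is a small sketch of the identity only, on made-up random data; how sklego applies the normalize option is not shown here.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))

# Reduced QR decomposition: Q is (20, 3) with orthonormal columns, R is (3, 3).
Q, R = np.linalg.qr(X)

# Since X = Q @ R, the orthogonal matrix can be recovered as X @ inv(R).
inv_R = np.linalg.inv(R)
Q_from_X = X @ inv_R

# The stored inverse can then be applied to new data with the same columns.
X_new = rng.normal(size=(5, 3))
X_new_transformed = X_new @ inv_R
```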
- class sklego.preprocessing.OutlierRemover(outlier_detector, refit=True)[source]
Bases: sklego.common.TrainOnlyTransformerMixin, sklearn.base.BaseEstimator
Removes outliers (train-time only) using the supplied removal model.
- Parameters
outlier_detector – must implement fit and predict methods
refit – If True, fits the estimator during pipeline.fit().
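As a rough illustration of the fit/predict contract expected from outlier_detector, the sketch below uses a hypothetical z-score detector. It assumes the scikit-learn convention that predict returns -1 for outliers and 1 for inliers; `ZScoreDetector` and `remove_outliers` are made-up names, not sklego APIs.

```python
import numpy as np

class ZScoreDetector:
    """Hypothetical detector: flags rows with any |z-score| above a threshold."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def predict(self, X):
        z = np.abs((np.asarray(X, dtype=float) - self.mean_) / self.std_)
        # Assumed convention: -1 marks an outlier, 1 marks an inlier.
        return np.where((z > self.threshold).any(axis=1), -1, 1)

def remove_outliers(X, detector):
    """Keep only the rows that the fitted detector does not flag."""
    X = np.asarray(X)
    return X[detector.predict(X) != -1]

X_train = np.array([[1.0], [1.1], [0.9], [1.0], [1.2],
                    [0.8], [1.0], [1.1], [0.9], [50.0]])
detector = ZScoreDetector(threshold=2.5).fit(X_train)
X_clean = remove_outliers(X_train, detector)  # the row containing 50.0 is dropped
```

Since the removal is described as train-time only, rows would be filtered like this while fitting a pipeline, while data passed at prediction time would be left untouched.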
- class sklego.preprocessing.PandasTypeSelector(include=None, exclude=None)[source]
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Select columns in a pandas dataframe based on their dtype
- Parameters
include – types to be included in the dataframe
exclude – types to be excluded from the dataframe
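Selecting by dtype with include and exclude is comparable to pandas' own `DataFrame.select_dtypes`. The sketch below shows that plain-pandas call on a made-up frame; it is meant as an analogy for what the parameters express, not as a statement about sklego's internals.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Swen", "Victor", "Alex"],
    "length": [1.82, 1.85, 1.80],
    "shoesize": [42, 44, 45],
})

# Keep only numeric columns (floats and ints).
numeric = df.select_dtypes(include=["number"])

# Keep everything except numeric columns.
non_numeric = df.select_dtypes(exclude=["number"])
```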
- class sklego.preprocessing.PatsyTransformer(formula, return_type='matrix')[source]
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
The patsy transformer offers a method to select the right columns from a dataframe as well as a DSL for transformations. It is inspired by R formulas. This can be useful as a first step in the pipeline.
- Parameters
formula – a patsy-compatible formula
return_type – either “matrix” or “dataframe”, passed on to patsy
- class sklego.preprocessing.RandomAdder(noise=1, random_state=None)[source]
Bases: sklego.common.TrainOnlyTransformerMixin, sklearn.base.BaseEstimator
- class sklego.preprocessing.RepeatingBasisFunction(column=0, remainder='drop', n_periods=12, input_range=None, width=1.0)[source]
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
This is a transformer for features with some form of circularity. E.g. for days of the week you might face the problem that, conceptually, day 7 is as close to day 6 as it is to day 1, while numerically their distances differ. This transformer remedies that problem. The transformer selects a column and transforms it with a given number of repeating (radial) basis functions. Radial basis functions are bell-curve-shaped functions which take the original data as input. The basis functions are equally spaced over the input range. The key feature of repeating basis functions is that they are continuous when moving from the max to the min of the input range. As a result, these repeating basis functions can capture how close each data point is to the center of each repeating basis function, even when the input data has a circular nature.
- Parameters
column (int or list, default=0) – Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name.
remainder ({'drop', 'passthrough'}, default="drop") – By default, only the specified column is transformed and the non-specified columns are dropped (default of 'drop'). By specifying remainder='passthrough', all remaining columns are automatically passed through. This subset of columns is concatenated with the output of the transformer.
n_periods (int, default=12) – number of basis functions to create, i.e., the number of columns that will exit the transformer.
input_range (tuple or None, default=None) – the values at which the data repeats itself. For example, for days of the week this is (1,7). If input_range=None it is inferred from the training data.
width (float, default=1.) – determines the width of the radial basis functions.
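The idea of basis functions that wrap around the input range can be sketched in NumPy. This is a simplified illustration, not sklego's implementation: `repeating_basis` is a hypothetical helper, the hour-of-day range (0, 24) is a made-up example, inputs are assumed to lie within input_range, and scaling width by the spacing between centers is an assumption.

```python
import numpy as np

def repeating_basis(x, n_periods=12, input_range=(0.0, 24.0), width=1.0):
    """Evaluate n_periods bell-shaped bases that wrap around input_range."""
    x = np.asarray(x, dtype=float)
    lo, hi = input_range
    # Scale to [0, 1]; both ends of the range map to the same circular point.
    scaled = (x - lo) / (hi - lo)
    centers = np.linspace(0.0, 1.0, n_periods, endpoint=False)
    diff = np.abs(scaled[:, None] - centers[None, :])
    circular = np.minimum(diff, 1.0 - diff)  # wrap-around distance
    scale = width / n_periods  # assumption: width is relative to center spacing
    return np.exp(-((circular / scale) ** 2))

hours = np.array([0.0, 6.0, 23.5, 24.0])
features = repeating_basis(hours, n_periods=4)
# Row 0 (hour 0) and row 3 (hour 24) are identical, and hour 23.5 is
# closest to the basis function centered at hour 0.
```

The wrap-around distance is what makes the features continuous when moving from the maximum back to the minimum of the input range.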