Preprocessing

class sklego.preprocessing.ColumnCapper(quantile_range=(5.0, 95.0), interpolation='linear', discard_infs=False, copy=True)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Caps the values of columns according to the given quantile thresholds.

Parameters
  • quantile_range (tuple or list, optional, default=(5.0, 95.0)) – The quantile range used to perform the capping. Its values must be in the interval [0, 100].

  • interpolation (str, optional, default='linear') –

    The interpolation method to compute the quantiles when the desired quantile lies between two data points i and j. The available values are:

    • 'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.

    • 'lower': i.

    • 'higher': j.

    • 'nearest': i or j whichever is nearest.

    • 'midpoint': (i + j) / 2.

  • discard_infs (bool, optional, default=False) –

    Whether to discard -np.inf and np.inf values or not. If False, such values will be capped. If True, they will be replaced by np.nan.

    Note

    Setting discard_infs=True is important if the inf values are the result of divisions by 0, which are interpreted by pandas as -np.inf or np.inf depending on the sign of the numerator.

  • copy (bool, optional, default=True) – If False, try to avoid a copy and do inplace capping instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

Raises

TypeError, ValueError

Example

>>> import pandas as pd
>>> import numpy as np
>>> from sklego.preprocessing import ColumnCapper
>>> df = pd.DataFrame({'a':[2, 4.5, 7, 9], 'b':[11, 12, np.inf, 14]})
>>> df
     a     b
0  2.0  11.0
1  4.5  12.0
2  7.0   inf
3  9.0  14.0
>>> capper = ColumnCapper()
>>> capper.fit_transform(df)
array([[ 2.375, 11.1  ],
       [ 4.5  , 12.   ],
       [ 7.   , 13.8  ],
       [ 8.7  , 13.8  ]])
>>> capper = ColumnCapper(discard_infs=True) # Discarding infs
>>> df[['a', 'b']] = capper.fit_transform(df)
>>> df
       a     b
0  2.375  11.1
1  4.500  12.0
2  7.000   NaN
3  8.700  13.8
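The quantile_range and interpolation parameters combine as in the following sketch: with interpolation='lower', the 80th percentile of [0, 1, 2, 3, 100] resolves to 3, so larger values are capped at 3 (output formatting may vary across NumPy versions).

>>> arr = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])
>>> ColumnCapper(quantile_range=(0.0, 80.0), interpolation='lower').fit_transform(arr)
array([[0.],
       [1.],
       [2.],
       [3.],
       [3.]])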
fit(X, y=None)[source]

Computes the quantiles for each column of X.

Parameters
  • X (pandas.DataFrame or numpy.ndarray) – The column(s) from which the capping limit(s) will be computed.

  • y – Ignored.

Return type

sklego.preprocessing.ColumnCapper

Returns

The fitted object.

Raises

ValueError if X contains non-numeric columns

transform(X)[source]

Performs the capping on the column(s) of X.

Parameters

X (pandas.DataFrame or numpy.ndarray) – The column(s) for which the capping limit(s) will be applied.

Return type

numpy.ndarray

Returns

X values with capped limits.

Raises

ValueError if the number of columns from X differs from the number of columns when fitting

class sklego.preprocessing.ColumnDropper(columns: list)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Allows dropping specific columns from a pandas DataFrame by name. Can be useful in a sklearn Pipeline.

Parameters

columns – column name str or list of column names to be dropped

Note

Raises a TypeError if input provided is not a DataFrame

Raises a KeyError if columns provided are not in the input DataFrame

Example

>>> # Dropping a single column from a pandas DataFrame
>>> import pandas as pd
>>> from sklego.preprocessing import ColumnDropper
>>> df = pd.DataFrame({
...     'name': ['Swen', 'Victor', 'Alex'],
...     'length': [1.82, 1.85, 1.80],
...     'shoesize': [42, 44, 45]
... })
>>> ColumnDropper(['name']).fit_transform(df)
   length  shoesize
0    1.82        42
1    1.85        44
2    1.80        45
>>> # Dropping multiple columns from a pandas DataFrame
>>> ColumnDropper(['length', 'shoesize']).fit_transform(df)
     name
0    Swen
1  Victor
2    Alex
>>> # Dropping non-existent columns results in a KeyError
>>> ColumnDropper(['weight']).fit_transform(df)
Traceback (most recent call last):
    ...
KeyError: "['weight'] column(s) not in DataFrame"
>>> # How to use the ColumnDropper in a sklearn Pipeline
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> pipe = Pipeline([
...     ('select', ColumnDropper(['name', 'shoesize'])),
...     ('scale', StandardScaler()),
... ])
>>> pipe.fit_transform(df)
array([[-0.16222142],
       [ 1.29777137],
       [-1.13554995]])
fit(X, y=None)[source]

Checks 1) if the input is a DataFrame, and 2) if the column names are present in this DataFrame.

Parameters
  • X – pd.DataFrame from which the columns will be dropped

  • y – pd.Series labels for X; unused for column dropping

Returns

ColumnDropper object.

get_feature_names()[source]
transform(X)[source]

Returns a pandas DataFrame without the dropped columns.

Parameters

X – pd.DataFrame from which the specified columns are dropped

Returns

pd.DataFrame without the dropped columns

class sklego.preprocessing.ColumnSelector(columns: list)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Allows selecting specific columns from a pandas DataFrame by name. Can be useful in a sklearn Pipeline.

Parameters

columns – column name str or list of column names to be selected

Note

Raises a TypeError if input provided is not a DataFrame

Raises a KeyError if columns provided are not in the input DataFrame

Example

>>> # Selecting a single column from a pandas DataFrame
>>> import pandas as pd
>>> from sklego.preprocessing import ColumnSelector
>>> df = pd.DataFrame({
...     'name': ['Swen', 'Victor', 'Alex'],
...     'length': [1.82, 1.85, 1.80],
...     'shoesize': [42, 44, 45]
... })
>>> ColumnSelector(['length']).fit_transform(df)
   length
0    1.82
1    1.85
2    1.80
>>> # Selecting multiple columns from a pandas DataFrame
>>> ColumnSelector(['length', 'shoesize']).fit_transform(df)
   length  shoesize
0    1.82        42
1    1.85        44
2    1.80        45
>>> # Selecting non-existent columns results in a KeyError
>>> ColumnSelector(['weight']).fit_transform(df)
Traceback (most recent call last):
    ...
KeyError: "['weight'] column(s) not in DataFrame"
>>> # How to use the ColumnSelector in a sklearn Pipeline
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> pipe = Pipeline([
...     ('select', ColumnSelector(['length'])),
...     ('scale', StandardScaler()),
... ])
>>> pipe.fit_transform(df)
array([[-0.16222142],
       [ 1.29777137],
       [-1.13554995]])
fit(X, y=None)[source]

Checks 1) if the input is a DataFrame, and 2) if the column names are present in this DataFrame.

Parameters
  • X – pd.DataFrame on which we apply the column selection

  • y – pd.Series labels for X; unused for column selection

Returns

ColumnSelector object.

get_feature_names()[source]
transform(X)[source]

Returns a pandas DataFrame with only the specified columns

Parameters

X – pd.DataFrame on which we apply the column selection

Returns

pd.DataFrame with only the selected columns

class sklego.preprocessing.DictMapper(mapper, default)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Maps the values of columns according to the input dictionary, falling back to the default if a key is not present in the dictionary.

Parameters
  • mapper – The dictionary containing the mapping of the values

  • default – The value to fall back to if the value is not in the mapper
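Example

A minimal sketch of the mapping with a fallback default (the exact output dtype and formatting may differ):

>>> import pandas as pd
>>> from sklego.preprocessing import DictMapper
>>> df = pd.DataFrame({'city': ['Amsterdam', 'Breda', 'Utrecht']})
>>> DictMapper(mapper={'Amsterdam': 1, 'Utrecht': 2}, default=0).fit_transform(df)
array([[1],
       [0],
       [2]])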

fit(X, y=None)[source]

Checks the input dataframe and records the shape of it

Parameters
  • X (pandas.DataFrame or numpy.ndarray) – The column(s) from which the mapping will be applied

  • y – Ignored.

Return type

sklego.preprocessing.DictMapper

Returns

The fitted object.

transform(X)[source]

Performs the mapping on the column(s) of X.

Parameters

X (pandas.DataFrame or numpy.ndarray) – The column(s) for which the mapping will be applied.

Return type

numpy.ndarray

Returns

X values with the mapping applied

Raises

ValueError if the number of columns from X differs from the number of columns when fitting

class sklego.preprocessing.IdentityTransformer(check_X: bool = False)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

The identity transformer returns what it is fed; it applies no transformation. It exists because it lets you build more expressive pipelines.

Parameters

check_X (bool, optional, default=False) – Whether to validate X to be a non-empty 2D array of finite values, and to attempt to cast X to float. If disabled, the model/pipeline is expected to handle e.g. missing, non-numeric, or non-finite values.
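Example

A typical use is inside a FeatureUnion, where an identity branch keeps the raw features next to a transformed copy; a minimal sketch (array formatting may vary across NumPy versions):

>>> import numpy as np
>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.preprocessing import StandardScaler
>>> from sklego.preprocessing import IdentityTransformer
>>> X = np.array([[1.0, 2.0], [3.0, 4.0]])
>>> union = FeatureUnion([
...     ('original', IdentityTransformer()),
...     ('scaled', StandardScaler()),
... ])
>>> union.fit_transform(X)
array([[ 1.,  2., -1., -1.],
       [ 3.,  4.,  1.,  1.]])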

fit(X, y=None)[source]

‘Fits’ the estimator.

transform(X)[source]

‘Applies’ the estimator.

class sklego.preprocessing.InformationFilter(columns, alpha=1)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

The InformationFilter uses a variant of the Gram–Schmidt process to filter information out of the dataset. This can be useful if you want to filter information out of a dataset because of fairness. To explain how it works: given a training matrix \(X\) that contains columns \(x_1, ..., x_k\), if we assume columns \(x_1\) and \(x_2\) to be the sensitive columns, then the information filter removes information by applying these transformations:

\begin{split}
v_1 & = x_1 \\
v_2 & = x_2 - \frac{x_2 \cdot v_1}{v_1 \cdot v_1} v_1 \\
v_3 & = x_3 - \frac{x_3 \cdot v_1}{v_1 \cdot v_1} v_1 - \frac{x_3 \cdot v_2}{v_2 \cdot v_2} v_2 \\
& \vdots \\
v_k & = x_k - \sum_{i=1}^{k-1} \frac{x_k \cdot v_i}{v_i \cdot v_i} v_i
\end{split}

Concatenating our vectors (but removing the sensitive ones) gives us a new training matrix \(X_{fair} = [v_3, ..., v_k]\).

Parameters
  • columns – the columns to filter out; this can be a sequence of either int (in the case of numpy) or string (in the case of pandas)

  • alpha – parameter to control how much to filter; for alpha=1 we filter out all information, while for alpha=0 we don't apply any
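Example

A minimal sketch of the API on random data; since the sensitive columns are removed from the output, four input columns with two sensitive ones yield two filtered columns:

>>> import numpy as np
>>> from sklego.preprocessing import InformationFilter
>>> X = np.random.randn(100, 4)
>>> InformationFilter(columns=[0, 1]).fit_transform(X).shape
(100, 2)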

fit(X, y=None)[source]

Learn the projection required to make the dataset orthogonal to sensitive columns.

transform(X)[source]

Transforms X by applying the information filter.

class sklego.preprocessing.IntervalEncoder(n_chunks=10, span=1, method='normal')[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

The interval encoder bends features in X with regard to y. We take each column in X separately and smooth it towards y using the strategy that is defined in method. Note that this allows you to make certain features strictly monotonic in your machine learning model if you follow this with an appropriate model.

Parameters
  • n_chunks – the number of cuts that makes the interval

  • method – the interpolation method used; must be in ["average", "normal", "increasing", "decreasing"], default: "normal"

  • span – a hyperparameter for the interpolation method; if the method is "normal" it resembles the width of the radial basis function used to weigh the points. It is ignored if the method is "increasing" or "decreasing".
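Example

A minimal sketch on synthetic data; each input column is smoothed towards y, so the output keeps the input's shape, and method='increasing' additionally forces the encoding to be monotonic:

>>> import numpy as np
>>> from sklego.preprocessing import IntervalEncoder
>>> np.random.seed(42)
>>> X = np.random.uniform(0, 1, (100, 1))
>>> y = 2 * X[:, 0] + np.random.normal(0, 0.1, size=100)
>>> IntervalEncoder(n_chunks=5, method='increasing').fit_transform(X, y).shape
(100, 1)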

fit(X, y)[source]

Fits the estimator

transform(X)[source]

Transforms each column such that it bends smoothly towards y.

class sklego.preprocessing.OrthogonalTransformer(normalize=False)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transform the columns of a dataframe or numpy array to a column-orthogonal or orthonormal matrix. Computes Q, R such that X = Q*R, with Q orthogonal, from which follows Q = X*inv(R).

Parameters

normalize – whether the orthogonal matrix should be orthonormal as well
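Example

With normalize=True the resulting columns should be orthonormal, which this sketch verifies:

>>> import numpy as np
>>> from sklego.preprocessing import OrthogonalTransformer
>>> np.random.seed(0)
>>> X = np.random.randn(50, 3)
>>> Q = OrthogonalTransformer(normalize=True).fit_transform(X)
>>> np.allclose(Q.T @ Q, np.eye(3))
True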

fit(X, y=None)[source]

Store the inverse of R of the QR decomposition of X, which can be used to calculate the orthogonal projection of X. If normalization is required, also stores a vector with normalization terms

transform(X)[source]

Transforms X using the fitted inverse of R. Normalizes the result if required

class sklego.preprocessing.OutlierRemover(outlier_detector, refit=True)[source]

Bases: sklego.common.TrainOnlyTransformerMixin, sklearn.base.BaseEstimator

Removes outliers (train-time only) using the supplied removal model.

Parameters
  • outlier_detector – must implement fit and predict methods

  • refit – If True, fits the estimator during pipeline.fit().
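Example

A minimal sketch using an IsolationForest as the outlier detector; rows flagged as outliers at train time are dropped, so the obvious outlier below should be removed (exactly which rows are dropped depends on the detector):

>>> import numpy as np
>>> from sklearn.ensemble import IsolationForest
>>> from sklego.preprocessing import OutlierRemover
>>> X = np.array([[0.0], [0.1], [0.2], [10.0]])
>>> OutlierRemover(IsolationForest(contamination=0.25, random_state=0)).fit_transform(X).shape
(3, 1)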

fit(X, y=None)[source]

Calculates the hash of X_train

transform_train(X)[source]
class sklego.preprocessing.PandasTypeSelector(include=None, exclude=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Select columns in a pandas dataframe based on their dtype

Parameters
  • include – types to be included in the dataframe

  • exclude – types to be excluded in the dataframe
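Example

A minimal sketch; include and exclude take the same values as pandas.DataFrame.select_dtypes:

>>> import pandas as pd
>>> from sklego.preprocessing import PandasTypeSelector
>>> df = pd.DataFrame({'name': ['Swen', 'Victor', 'Alex'],
...                    'length': [1.82, 1.85, 1.80]})
>>> PandasTypeSelector(include='number').fit_transform(df)
   length
0    1.82
1    1.85
2    1.80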

fit(X, y=None)[source]

Saves the column names for a check during transform.

Parameters
  • X – pandas dataframe to select dtypes out of

  • y – not used in this class

get_feature_names(*args, **kwargs)[source]
transform(X)[source]

Transforms pandas dataframe by (de)selecting columns based on their dtype.

Parameters

X – pandas dataframe to select dtypes for

class sklego.preprocessing.PatsyTransformer(formula, return_type='matrix')[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

The patsy transformer offers a method to select the right columns from a dataframe as well as a DSL for transformations. It is inspired by R formulas. This can be useful as a first step in the pipeline.

Parameters
  • formula – a patsy-compatible formula

  • return_type – either "matrix" or "dataframe", passed on to patsy
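Example

A minimal sketch; with the default return_type="matrix" the result is a patsy design matrix, reduced here to its shape (an intercept, one dummy column for b, and a itself give three columns):

>>> import numpy as np
>>> import pandas as pd
>>> from sklego.preprocessing import PatsyTransformer
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['yes', 'no', 'yes']})
>>> np.asarray(PatsyTransformer('a + b').fit_transform(df)).shape
(3, 3)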

fit(X, y=None)[source]

Fits the estimator

transform(X)[source]

Applies the formula to the matrix/dataframe X.

Returns
  • A patsy.DesignMatrix, if return_type="matrix" (the default)

  • A pandas.DataFrame, if return_type="dataframe"

class sklego.preprocessing.RandomAdder(noise=1, random_state=None)[source]

Bases: sklego.common.TrainOnlyTransformerMixin, sklearn.base.BaseEstimator
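Judging by its name and its train-only base class, the RandomAdder presumably adds random noise of scale noise to X during training only, passing the data through unchanged at prediction time. A minimal sketch under that assumption:

>>> import numpy as np
>>> from sklego.preprocessing import RandomAdder
>>> X = np.zeros((3, 2))
>>> adder = RandomAdder(noise=0.5, random_state=42).fit(X)
>>> adder.transform(X).shape  # noise is added at train time; the shape is unchanged
(3, 2)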

fit(X, y)[source]

Calculates the hash of X_train

transform_train(X)[source]
class sklego.preprocessing.RepeatingBasisFunction(column=0, remainder='drop', n_periods=12, input_range=None, width=1.0)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

This is a transformer for features with some form of circularity. E.g. for days of the week you might face the problem that, conceptually, day 7 is as close to day 6 as it is to day 1, while numerically their distance is different. This transformer remedies that problem. The transformer selects a column and transforms it with a given number of repeating (radial) basis functions. Radial basis functions are bell-curve-shaped functions which take the original data as input. The basis functions are equally spaced over the input range. The key feature of repeating basis functions is that they are continuous when moving from the max to the min of the input range. As a result, these repeating basis functions can capture how close each datapoint is to the center of each repeating basis function, even when the input data has a circular nature.

Parameters
  • column (int or str, default=0) – Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name.

  • remainder ({'drop', 'passthrough'}, default='drop') – By default ('drop'), only the specified column is transformed and the non-specified columns are dropped. With remainder='passthrough', all remaining columns are passed through untouched and concatenated with the output of the transformer.

  • n_periods (int, default=12) – number of basis functions to create, i.e., the number of columns that will exit the transformer.

  • input_range (tuple or None, default=None) – the values at which the data repeats itself. For example, for days of the week this is (1,7). If input_range=None it is inferred from the training data.

  • width (float, default=1.0) – determines the width of the radial basis functions.
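Example

A minimal sketch for a day-of-week feature; the output contains one column per basis function:

>>> import numpy as np
>>> from sklego.preprocessing import RepeatingBasisFunction
>>> X = np.arange(1, 8).reshape(-1, 1)  # days of the week, 1..7
>>> RepeatingBasisFunction(column=0, n_periods=3, input_range=(1, 7)).fit_transform(X).shape
(7, 3)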

fit(X, y=None)[source]
transform(X)[source]