Model Selection

class sklego.model_selection.GroupTimeSeriesSplit(n_splits)[source]

Bases: sklearn.model_selection._split._BaseKFold

Sliding window time series split

Create n_splits folds of as equal a size as possible, using a smart variant of a brute-force search. The groups parameter in .split() should be filled with the time groups (e.g. years).

Parameters

n_splits (int) – the number of train-test combinations.

With n_splits = 3 (* = train, x = test):

|-----------------------|
| * * * x x x - - - - - |
| - - - * * * x x x - - |
| - - - - - - * * * x x |
|-----------------------|

get_n_splits(X=None, y=None, groups=None)[source]

Get the number of splits

Parameters
  • X (Object, optional) – Always ignored, exists for compatibility, defaults to None

  • y (Object, optional) – Always ignored, exists for compatibility, defaults to None

  • groups (Object, optional) – Always ignored, exists for compatibility, defaults to None

Returns

the number of splits

Return type

int

split(X=None, y=None, groups=None)[source]

Returns the train-test splits of all the folds

Parameters
  • X (np.array, optional) – array-like of shape (n_samples, n_features). Training data, where n_samples is the number of samples and n_features is the number of features, defaults to None

  • y (np.array, optional) – array-like of shape (n_samples,) The target variable for supervised learning problems, defaults to None

  • groups (np.array) – Group labels for the samples used while splitting the dataset into train/test set, defaults to None

Returns

the indices of the train and test splits

Return type

List[np.array]

summary()[source]

Generates a pd.DataFrame which displays the group splits and extra statistics about them. Can only be run after having applied the .split() method to the GroupTimeSeriesSplit instance.

Returns

a pd.DataFrame with info about where the splits were made

Return type

pd.DataFrame
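As an illustration, the sliding-window behaviour shown in the diagram can be sketched in plain NumPy. Note that group_time_series_split below is a hypothetical helper, not the library's implementation: it simply chops the groups into equal chunks, whereas the real GroupTimeSeriesSplit searches for the most balanced fold sizes.

```python
import numpy as np

def group_time_series_split(groups, n_splits):
    # Sketch: chop the sorted unique time groups into n_splits + 1
    # roughly equal chunks; fold i trains on chunk i and tests on
    # chunk i + 1, sliding the window forward one chunk per fold.
    groups = np.asarray(groups)
    unique = np.unique(groups)  # sorted unique time groups, e.g. years
    chunks = np.array_split(unique, n_splits + 1)
    for i in range(n_splits):
        train_idx = np.flatnonzero(np.isin(groups, chunks[i]))
        test_idx = np.flatnonzero(np.isin(groups, chunks[i + 1]))
        yield train_idx, test_idx

# Example: six years of data, two observations per year, two splits
years = np.repeat([2015, 2016, 2017, 2018, 2019, 2020], 2)
for train_idx, test_idx in group_time_series_split(years, n_splits=2):
    print(years[train_idx], "->", years[test_idx])
```

Each fold trains on one block of years and tests on the next, and the whole window slides forward one block per fold, mirroring the diagram above.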

class sklego.model_selection.KlusterFoldValidation(cluster_method=None)[source]

Bases: object

KlusterFold cross validator

  • Create folds based on provided cluster method

Parameters

cluster_method – Clustering method with a fit_predict method

split(X, y=None, groups=None)[source]

Generator to iterate over the indices

Parameters
  • X – Array to split on

  • y – Always ignored, exists for compatibility

  • groups – Always ignored, exists for compatibility
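The idea can be sketched as follows, where each cluster becomes the test set once. ThresholdClusterer and kluster_fold_split are hypothetical stand-ins (any estimator with a fit_predict method, such as sklearn.cluster.KMeans, could serve as the cluster method), and the exact fold construction in sklego may differ.

```python
import numpy as np

class ThresholdClusterer:
    # Toy stand-in for any clusterer with a fit_predict method:
    # assigns cluster 0 or 1 depending on whether the first
    # feature exceeds a threshold.
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit_predict(self, X):
        return (np.asarray(X)[:, 0] > self.threshold).astype(int)

def kluster_fold_split(X, cluster_method):
    # Each cluster becomes the test set once; all remaining rows
    # form the training set for that fold.
    labels = cluster_method.fit_predict(X)
    for label in np.unique(labels):
        yield np.flatnonzero(labels != label), np.flatnonzero(labels == label)

X = np.array([[0.1], [0.2], [0.8], [0.9]])
for train_idx, test_idx in kluster_fold_split(X, ThresholdClusterer()):
    print("train:", train_idx, "test:", test_idx)
```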

class sklego.model_selection.TimeGapSplit(date_serie, valid_duration, train_duration=None, gap_duration=datetime.timedelta(0), n_splits=None, window='rolling')[source]

Bases: object

Provides train/test indices to split time series data samples. This cross-validation object is a variation of TimeSeriesSplit with the following differences:

  • The splits are made based on datetime duration, instead of number of rows.

  • The user specifies the validation duration and either train_duration or n_splits.

  • The user can specify a 'gap' duration that is added after the training split and before the validation split.

The three duration parameters can be used to replicate how the model is going to be used in production with batch learning. Validation folds do not overlap. The entire 'window' moves forward by one valid_duration until there is not enough data. If this would lead to more splits than specified with n_splits, the 'window' instead moves by valid_duration times the ratio of possible to requested splits:

  • n_possible_splits = (total_length - train_duration - gap_duration) // valid_duration

  • time_shift = valid_duration * n_possible_splits / n_splits

so that the CV spans the whole dataset. If train_duration is not passed but n_splits is, the training duration is increased to

  • train_duration = total_length - (gap_duration + valid_duration * n_splits)

such that shifting the entire window by one validation duration spans the whole training set.

Parameters
  • date_serie (pandas.Series) – Series with the dates; it should contain all the indices of X used in split().

  • train_duration (datetime.timedelta) – historical training data.

  • valid_duration (datetime.timedelta) – retraining period.

  • gap_duration (datetime.timedelta) – forward-looking window of the target. The period of the forward-looking window necessary to create your target variable. This period is dropped at the end of your training folds due to lack of recent data. In production you would not have been able to create the target for that period, and you would have had to drop it from the training data.

  • n_splits (int) – number of splits

  • window (string) – 'rolling': the window has a fixed size and is shifted entirely each fold; 'expanding': the left side of the window is fixed while the right border increases each fold.

get_n_splits(X=None, y=None, groups=None)[source]
split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters
  • X (pandas.DataFrame)

  • y – Always ignored, exists for compatibility

  • groups – Always ignored, exists for compatibility

summary(X)[source]

Describe all folds.

Parameters
  • X (pandas.DataFrame)

Returns

pd.DataFrame summary of all folds
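A minimal sketch of the rolling-window behaviour described above, assuming daily data and half-open windows. time_gap_split is a hypothetical helper, not the library's implementation: it trains on train_duration of history, skips gap_duration, validates on valid_duration, then shifts the whole window by one valid_duration until the data runs out.

```python
import datetime as dt
import pandas as pd

def time_gap_split(date_serie, train_duration, valid_duration, gap_duration):
    # Windows are half-open [start, end); daily frequency is assumed
    # for the end-of-data check.
    start = date_serie.min()
    data_end = date_serie.max() + dt.timedelta(days=1)  # exclusive end of data
    while start + train_duration + gap_duration + valid_duration <= data_end:
        train_end = start + train_duration
        valid_start = train_end + gap_duration
        valid_end = valid_start + valid_duration
        train_idx = date_serie.index[(date_serie >= start) & (date_serie < train_end)]
        valid_idx = date_serie.index[(date_serie >= valid_start) & (date_serie < valid_end)]
        yield train_idx, valid_idx
        start = start + valid_duration  # the whole window moves by one valid_duration

# Example: ten daily observations, 4-day training window,
# 1-day gap, 2-day validation window
dates = pd.Series(pd.date_range("2024-01-01", periods=10, freq="D"))
for train_idx, valid_idx in time_gap_split(
    dates,
    train_duration=dt.timedelta(days=4),
    valid_duration=dt.timedelta(days=2),
    gap_duration=dt.timedelta(days=1),
):
    print(list(train_idx), "->", list(valid_idx))
```

Note how the day between each training window and its validation window (the gap) appears in neither index set, mimicking a target that could not yet have been computed in production.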