Model Selection
- class sklego.model_selection.GroupTimeSeriesSplit(n_splits)
Bases: sklearn.model_selection._split._BaseKFold
Sliding window time series split
Creates n_splits folds that are as equal in size as possible, using a smart variant of a brute-force search. The groups parameter of .split() should be filled with the time groups (e.g. years).
- Parameters
n_splits (int) – the number of train-test combinations.
with n_splits at 3
* = train
x = test
|-----------------------|
| * * * x x x - - - - - |
| - - - * * * x x x - - |
| - - - - - - * * * x x |
|-----------------------|
- get_n_splits(X=None, y=None, groups=None)
Get the number of splits
- Parameters
X (Object, optional) – Always ignored, exists for compatibility, defaults to None
y (Object, optional) – Always ignored, exists for compatibility, defaults to None
groups (Object, optional) – Always ignored, exists for compatibility, defaults to None
- Returns
the number of splits (n_splits)
- Return type
int
- split(X=None, y=None, groups=None)
Returns the train-test splits of all the folds
- Parameters
X (np.array, optional) – array-like of shape (n_samples, n_features). Training data, where n_samples is the number of samples and n_features is the number of features, defaults to None
y (np.array, optional) – array-like of shape (n_samples,). The target variable for supervised learning problems, defaults to None
groups (np.array) – Group labels for the samples, used while splitting the dataset into train and test sets, defaults to None
- Returns
the indices of the train and test splits
- Return type
List[np.array]
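The sliding-window idea in the diagram above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name `sliding_group_splits` is hypothetical, and it balances blocks by the number of distinct groups, whereas the class described here searches for blocks that are balanced in size. It assumes the ordered groups are cut into n_splits + 1 contiguous blocks, with fold i training on block i and testing on block i + 1.

```python
def sliding_group_splits(groups, n_splits):
    """Hypothetical helper: yield (train_idx, test_idx) lists per fold.

    Simplification: blocks hold a near-equal number of distinct groups,
    not a near-equal number of samples.
    """
    ordered = sorted(set(groups))
    n_blocks = n_splits + 1
    if len(ordered) < n_blocks:
        raise ValueError("need at least n_splits + 1 distinct groups")

    # Cut the ordered groups into contiguous blocks of near-equal length.
    base, extra = divmod(len(ordered), n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(set(ordered[start:start + size]))
        start += size

    # Fold i: train on block i, test on the block right after it.
    for i in range(n_splits):
        train = [idx for idx, g in enumerate(groups) if g in blocks[i]]
        test = [idx for idx, g in enumerate(groups) if g in blocks[i + 1]]
        yield train, test


# One group label (a year) per sample.
groups = [2000, 2000, 2001, 2002, 2002, 2003, 2004, 2005]
folds = list(sliding_group_splits(groups, n_splits=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each test block becomes the training block of the next fold, which is the staircase pattern shown in the diagram.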
- class sklego.model_selection.KlusterFoldValidation(cluster_method=None)
Bases: object
KlusterFold cross validator
Create folds based on provided cluster method
- Parameters
cluster_method – Clustering method with a fit_predict method
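A minimal sketch of cluster-based folds, under stated assumptions: the text above only says that folds are created from the provided cluster method, so the rule used here (each cluster is held out once as the test fold) is an assumption, and both `ParityClusterer` and `cluster_folds` are hypothetical names introduced for illustration.

```python
class ParityClusterer:
    """Hypothetical stand-in for any clustering object exposing fit_predict."""

    def fit_predict(self, X):
        # Toy rule: cluster label is the parity of the first feature.
        return [int(row[0]) % 2 for row in X]


def cluster_folds(X, cluster_method):
    """Yield (train_idx, test_idx); assumes one held-out cluster per fold."""
    labels = list(cluster_method.fit_predict(X))
    for label in sorted(set(labels)):
        test = [i for i, lab in enumerate(labels) if lab == label]
        train = [i for i, lab in enumerate(labels) if lab != label]
        yield train, test


X = [[0], [1], [2], [3], [4]]
folds = list(cluster_folds(X, ParityClusterer()))
for train, test in folds:
    print("train:", train, "test:", test)
```

With a real estimator, any object whose fit_predict returns one label per row could be passed in place of the toy clusterer.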
- class sklego.model_selection.TimeGapSplit(date_serie, valid_duration, train_duration=None, gap_duration=datetime.timedelta(0), n_splits=None, window='rolling')
Bases: object
Provides train/test indices to split time series data samples. This cross-validation object is a variation of TimeSeriesSplit with the following differences:
- The splits are made based on datetime duration instead of number of rows.
- The user specifies valid_duration and either train_duration or n_splits.
- The user can specify a ‘gap’ duration that is added after the training split and before the validation split.
The three duration parameters can be used to replicate how the model is going to be used in production with batch learning. The validation folds do not overlap. The entire ‘window’ moves by one valid_duration at a time until there is not enough data left. If this would lead to more splits than specified with n_splits, the ‘window’ instead moves by valid_duration times the ratio of possible splits to requested splits:
– n_possible_splits = (total_length - train_duration - gap_duration) // valid_duration
– time_shift = valid_duration * n_possible_splits / n_splits
so that the CV spans the whole dataset. If train_duration is not passed but n_splits is, the training duration is increased to
– train_duration = total_length - (gap_duration + valid_duration * n_splits)
such that shifting the entire window by one validation duration at a time spans the whole training set.
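As a worked example of the arithmetic above, the following sketch evaluates the three formulas with `datetime.timedelta` values. The numbers (a 100-day dataset, 30-day training window, 5-day gap, 10-day validation window, 3 requested splits) are illustrative assumptions, and the variable names simply mirror the formulas; this is not the class's internal code.

```python
from datetime import timedelta

# Illustrative values only.
total_length = timedelta(days=100)
train_duration = timedelta(days=30)
gap_duration = timedelta(days=5)
valid_duration = timedelta(days=10)
n_splits = 3

# Floor-dividing two timedeltas yields an int: how many validation
# windows fit after the first training window and the gap.
n_possible_splits = (total_length - train_duration - gap_duration) // valid_duration

# More splits are possible than requested, so the window is shifted by a
# larger step to keep spanning the whole dataset.
time_shift = valid_duration * n_possible_splits / n_splits

# If train_duration were omitted, it would be derived from n_splits instead.
train_duration_inferred = total_length - (gap_duration + valid_duration * n_splits)

print(n_possible_splits)        # 6
print(time_shift)               # 20 days, 0:00:00
print(train_duration_inferred)  # 65 days, 0:00:00
```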
- Parameters
date_serie (pandas.Series) – Series of dates, which should contain all the indices of X used in split()
train_duration (datetime.timedelta) – duration of the historical training data.
valid_duration (datetime.timedelta) – duration of each validation fold, i.e. the retraining period.
gap_duration (datetime.timedelta) – forward-looking window of the target, i.e. the period needed to create your target variable. This period is dropped at the end of your training folds due to lack of recent data. In production you would not have been able to create the target for that period, and you would have had to drop it from the training data.
n_splits (int) – number of splits
window (string) – ‘rolling’: the window has a fixed size and is shifted entirely; ‘expanding’: the left side of the window is fixed and the right border increases with each fold
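The difference between the two window modes can be sketched with plain datetime arithmetic. This is a simplified illustration, not the class's implementation: `window_bounds` is a hypothetical helper, intervals are treated as half-open, the gap is placed between the end of training and the start of validation, and the window always advances by one valid_duration.

```python
from datetime import datetime, timedelta


def window_bounds(start, end, train_duration, valid_duration,
                  gap_duration=timedelta(0), window="rolling"):
    """Hypothetical helper: yield (train_start, train_end, valid_start, valid_end)."""
    if window not in ("rolling", "expanding"):
        raise ValueError("window must be 'rolling' or 'expanding'")
    if valid_duration <= timedelta(0):
        raise ValueError("valid_duration must be positive")

    offset = timedelta(0)
    while True:
        train_end = start + offset + train_duration
        valid_start = train_end + gap_duration
        valid_end = valid_start + valid_duration
        if valid_end > end:
            break  # not enough data left for another validation fold
        # 'expanding' pins the left edge; 'rolling' shifts it with the window.
        train_start = start if window == "expanding" else start + offset
        yield train_start, train_end, valid_start, valid_end
        offset += valid_duration


start = datetime(2020, 1, 1)
end = start + timedelta(days=100)
durations = dict(train_duration=timedelta(days=30),
                 valid_duration=timedelta(days=10),
                 gap_duration=timedelta(days=5))

rolling = list(window_bounds(start, end, window="rolling", **durations))
expanding = list(window_bounds(start, end, window="expanding", **durations))

for r, e in zip(rolling, expanding):
    print("rolling train starts", r[0].date(), "| expanding train starts", e[0].date())
```

In both modes the validation windows are identical; only the start of the training window differs from the second fold onward.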