[1]:
import matplotlib.pylab as plt
import numpy as np

from collections import Counter

Datasets

Scikit-lego includes several datasets which can be used for testing purposes. Each dataset has different options for returning the data:

  • When setting as_frame to True, the data, including the target, is returned as a dataframe.

  • When setting return_X_y to True, the data is returned directly as (data, target) instead of a dict object.

This notebook describes the different sets included in Scikit-lego:

  • sklego.datasets.load_abalone loads in the abalone dataset

  • sklego.datasets.load_arrests loads in a dataset with fairness concerns

  • sklego.datasets.load_chicken loads in the joyful chickweight dataset

  • sklego.datasets.load_heroes loads a heroes of the storm dataset

  • sklego.datasets.load_hearts loads a dataset about hearts

  • sklego.datasets.load_penguins loads a lovely dataset about penguins

  • sklego.datasets.fetch_creditcard fetch a fraud dataset from openml

  • sklego.datasets.make_simpleseries make a simulated timeseries

Abalone

Loads the abalone dataset where the goal is to predict the gender of the creature.

[2]:
from sklego.datasets import load_abalone

df_abalone = load_abalone(as_frame=True)
df_abalone.head()
[2]:
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
[3]:
X, y = load_abalone(return_X_y=True)

plt.bar(Counter(y).keys(), Counter(y).values())
plt.title('Distribution of sex (target)');
_images/datasets_4_0.png

Arrests

Loads the arrests dataset which can serve as a benchmark for fairness. It is data on the police treatment of individuals arrested in Toronto for simple possession of small quantities of marijuana. The goal is to predict whether or not the arrestee was released with a summons while maintaining a degree of fairness.

[4]:
from sklego.datasets import load_arrests

df_arrests = load_arrests(as_frame=True)
df_arrests.head()
[4]:
released colour year age sex employed citizen checks
0 Yes White 2002 21 Male Yes Yes 3
1 No Black 1999 17 Male Yes Yes 3
2 Yes White 2000 24 Male Yes Yes 3
3 No Black 2000 46 Male Yes Yes 1
4 Yes Black 1999 27 Female Yes Yes 1
[5]:
X, y = load_arrests(return_X_y=True)

plt.bar(Counter(y).keys(), Counter(y).values())
plt.title('Distribution of released (target)');
_images/datasets_7_0.png

Chicken

Loads the chicken dataset. The chicken data has 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks. The body weights of the chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21. There were four groups on chicks on different protein diets.

[6]:
from sklego.datasets import load_chicken

df_chicken = load_chicken(as_frame=True)
df_chicken.head()
[6]:
weight time chick diet
0 42 0 1 1
1 51 2 1 1
2 59 4 1 1
3 64 6 1 1
4 76 8 1 1
[7]:
X, y = load_chicken(return_X_y=True)

plt.hist(y)
plt.title('Distribution of weight (target)');
_images/datasets_10_0.png

Heroes

A dataset from a video game: “heroes of the storm”. The goal of the dataset is to predict the attack type. Note that the pandas dataset returns more information.

[8]:
from sklego.datasets import load_heroes

df_heroes = load_heroes(as_frame=True)
df_heroes.head()
[8]:
name attack_type role health attack attack_spd
0 Artanis Melee Bruiser 2470.0 111.0 1.00
1 Chen Melee Bruiser 2473.0 90.0 1.11
2 Dehaka Melee Bruiser 2434.0 100.0 1.11
3 Imperius Melee Bruiser 2450.0 122.0 0.83
4 Leoric Melee Bruiser 2550.0 109.0 0.77
[9]:
X, y = load_heroes(return_X_y=True)

plt.bar(Counter(y).keys(), Counter(y).values())
plt.title('Distribution of attack_type (target)');
_images/datasets_13_0.png

Hearts

Loads the Cleveland Heart Diseases dataset. The goal is to predict the presence of a heart disease (target values 1, 2, 3, and 4). The data originates from research to heart diseases by four institutions and originally contains 76 attributes. Yet, all published experiments refer to using a subset of 13 features and one target. This implementation loads the Cleveland dataset of the research which is the only set used by ML researchers to this date.

[10]:
from sklego.datasets import load_hearts

df_hearts = load_hearts(as_frame=True)
df_hearts.head()
[10]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 63 1 1 145 233 1 2 150 0 2.3 3 0 fixed 0
1 67 1 4 160 286 0 2 108 1 1.5 2 3 normal 1
2 67 1 4 120 229 0 2 129 1 2.6 2 2 reversible 0
3 37 1 3 130 250 0 0 187 0 3.5 3 0 normal 0
4 41 0 2 130 204 0 2 172 0 1.4 1 0 normal 0
[11]:
X, y = load_hearts(return_X_y=True)

y_counted = Counter([str(integer) for integer in y])

plt.bar(y_counted.keys(), y_counted.values())
plt.title('Distribution of target (presence of heart disease)');
_images/datasets_16_0.png

Load penguins

Loads the penguins dataset, which is a lovely alternative for the iris dataset. We’ve added this dataset for educational use.

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The goal of the dataset is to predict which species of penguin a penguin belongs to.

[12]:
from sklego.datasets import load_penguins

df_penguins = load_penguins(as_frame=True)
df_penguins.head()
[12]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female
[13]:
X, y = load_penguins(return_X_y=True)

plt.bar(Counter(y).keys(), Counter(y).values())
plt.title('Distribution of species (target)');
_images/datasets_19_0.png

Fetch creditcard

Load the creditcard dataset. Download it if necessary.

Note that internally this is using fetch_openml from scikit-learn, which is experimental.

==============   ==============
Samples total            284807
Dimensionality               29
Features                   real
Target                 int 0, 1
==============   ==============

The datasets contains transactions made by credit cards in September 2013 by european cardholders. This dataset present transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

Please cite: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

[14]:
from sklego.datasets import fetch_creditcard

dict_creditcard = fetch_creditcard(as_frame=True)

df_creditcard = dict_creditcard['frame']
df_creditcard.head()
[14]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 30 columns

[15]:
X, y = dict_creditcard['data'], dict_creditcard['target']

plt.bar(Counter(y).keys(), Counter(y).values())
plt.title('Distribution of fraud (target)');
_images/datasets_22_0.png

Simple time series

Generate a very simple timeseries dataset to play with. The generator assumes to generate daily data with a season, trend and noise.

[16]:
from sklego.datasets import make_simpleseries

df_simpleseries = make_simpleseries(as_frame=True, n_samples=1500, trend=0.001)
df_simpleseries.head()
[16]:
yt
0 -1.066620
1 0.303379
2 -0.412812
3 -0.227176
4 0.105581
[17]:
plt.plot(df_simpleseries['yt'])
plt.title('Timeseries yt');
_images/datasets_25_0.png