Preprocessing

Classes:

GBMDiscretizer(estimator, columns[, ...])

Feature discretizer based on GBDT.

GBMFeaturizer(estimator[, one_hot, append])

Feature generator for any GBDT model.

class skgbm.preprocessing.GBMDiscretizer(estimator, columns, one_hot=True, append=False)[source]

Bases: BaseEstimator, TransformerMixin, GBM

Feature discretizer based on GBDT.

Internally, it uses ArbitraryDiscretiser to handle the discretization step once the optimal thresholds have been found.

Parameters:
  • estimator (object) – A gradient boosting model from scikit-learn, XGBoost, LightGBM or CatBoost library

  • one_hot (bool) – Transform the output categorical features using one-hot encoding

  • columns (list of str) – List of column names to be transformed

  • append (bool) – Append the newly created features to the original ones

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from catboost import CatBoostClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from skgbm.preprocessing import GBMDiscretizer
>>>
>>> # Load the iris data into a DataFrame with a trailing 'target' column
>>> iris = load_iris()
>>> data = pd.DataFrame(
...     data=np.c_[iris['data'], iris['target']],
...     columns=iris['feature_names'] + ['target']
... )
>>> # Turn 'sepal length (cm)' into 'sepal_length', and so on
>>> data.columns = data.columns.str.replace(' (cm)', '', regex=False)
>>> data.columns = data.columns.str.replace(' ', '_')
>>>
>>> # Data splitting
>>> X, y = data.iloc[:, :4], data.iloc[:, 4:]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0
... )
>>> X_cols = X.columns.tolist()
>>> gbm_discretizer = GBMDiscretizer(CatBoostClassifier(verbose=0),
...                                  X_cols, one_hot=False)
>>> X_train_disc = gbm_discretizer.fit_transform(X_train, y_train)
>>> #      sepal_length  sepal_width  petal_length  petal_width
>>> # 60              7            0             9            5
>>> # 116            22            9            29           13
>>> # 144            24           12            31           20
>>> # 119            17            1            24           10
>>> # 108            24            4            32           13
>>> # ..            ...          ...           ...          ...
>>> # 9               6           10             4            0
>>> # 103            20            8            30           13
>>> # 67             15            6            15            5
>>> # 117            32           17            38           17
>>> # 47              3           11             3            1
fit(X, y, **kwargs)[source]

Fit a set of GBDT models (one per discretized feature), extract the split thresholds from them, and create an internal ArbitraryDiscretiser instance based on those values.

Parameters:
  • X ({array-like} of shape (n_samples, n_features)) – A data frame (matrix) of all the features.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (this is a supervised transformation).

Returns:

self – Fitted discretizer.

Return type:

object
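
The threshold-extraction idea described for fit can be illustrated with scikit-learn alone. The following is a minimal sketch, not skgbm's actual implementation: it assumes a GradientBoostingClassifier trained on a single feature, collects the split points from each tree's tree_ attribute, and uses np.digitize as a stand-in for ArbitraryDiscretiser.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
x = X[:, [2]]  # one feature only (petal length), kept 2-D for scikit-learn

# Assumption: a small model trained on this single feature
model = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=0)
model.fit(x, y)

# estimators_ holds one regression tree per (boosting stage, class)
thresholds = set()
for tree in model.estimators_.ravel():
    t = tree.tree_
    # Internal nodes have feature >= 0; leaves are marked with a negative value
    thresholds.update(t.threshold[t.feature >= 0].tolist())
thresholds = np.array(sorted(thresholds))

# Trees send samples with x <= threshold to the left, hence right=True
bins = np.digitize(x[:, 0], thresholds, right=True)
print(len(thresholds), "thresholds,", len(np.unique(bins)), "bins in use")
```

Each distinct threshold adds one bin boundary, so the number of bins per feature grows with the number and depth of the trees.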

fit_transform(X, y, **kwargs)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X, **kwargs)[source]

Discretize the specified subset of columns.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data to discretize.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)
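
Discretizing only a subset of columns can be pictured as follows. This is a hedged sketch: discretize_columns is a hypothetical helper, and the threshold values are made up for illustration rather than learned from a model.

```python
import numpy as np
import pandas as pd


def discretize_columns(frame, thresholds):
    """Hypothetical helper: bin only the columns listed in `thresholds`."""
    out = frame.copy()
    for column, cuts in thresholds.items():
        # right=True mirrors the "x <= threshold goes left" rule of tree splits
        out[column] = np.digitize(frame[column].to_numpy(), np.sort(cuts), right=True)
    return out


frame = pd.DataFrame({"a": [0.1, 0.5, 0.9], "b": [1.0, 2.0, 3.0]})
# Illustrative thresholds for column "a" only; column "b" is left untouched
result = discretize_columns(frame, {"a": [0.3, 0.7]})
print(result)
```

Columns that are not listed pass through unchanged, which corresponds to the role of the columns parameter.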

class skgbm.preprocessing.GBMFeaturizer(estimator, one_hot=True, append=True)[source]

Bases: BaseEstimator, TransformerMixin, GBM

Feature generator for any GBDT model.

Parameters:
  • estimator (object) – A gradient boosting model from scikit-learn, XGBoost, LightGBM or CatBoost library

  • one_hot (bool) – Transform the output categorical features using one-hot encoding

  • append (bool) – Append the newly created features to the original ones

Examples

>>> from lightgbm import LGBMRegressor
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import train_test_split
>>> from skgbm.preprocessing import GBMFeaturizer
>>>
>>> X, y = load_diabetes(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> gbm_featurizer = GBMFeaturizer(LGBMRegressor())
>>> gbm_featurizer.fit(X_train, y_train)
>>> gbm_featurizer.transform(X_test)
fit(X, y, **kwargs)[source]

Fit a GBDT model and OneHotEncoder.

Parameters:
  • X ({array-like} of shape (n_samples, n_features)) – A data frame (matrix) of all the features.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values.

Returns:

self – Fitted featurizer.

Return type:

object
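
Fitting a GBDT model together with a OneHotEncoder is commonly done by encoding the leaf index that each tree assigns to a sample. The sketch below shows that general technique with scikit-learn only; it is an assumption about the approach, not a copy of skgbm's code, and it substitutes GradientBoostingRegressor for the supported third-party models.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: fit the boosting model
model = GradientBoostingRegressor(n_estimators=20, max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Step 2: map each sample to the leaf it reaches in every tree
leaves_train = model.apply(X_train)  # shape (n_samples, n_estimators)

# Step 3: fit the encoder on the training leaf indices
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(leaves_train)

# New data goes through the same two steps: leaf lookup, then encoding
leaf_features = encoder.transform(model.apply(X_test))  # sparse matrix
print(leaves_train.shape, leaf_features.shape)
```

Setting handle_unknown="ignore" keeps the transform from failing if a leaf index was never seen while fitting the encoder.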

fit_transform(X, y, **kwargs)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X, **kwargs)[source]

Return features distilled from the trees of the GBM model. The number of output features depends on the one_hot and append parameters.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data to transform.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_trees) or (n_samples, 1)
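
How one_hot and append might change the output width can be sketched as below. The tree_features helper is hypothetical and the flag semantics are assumed from the parameter descriptions; the shapes skgbm actually returns may differ.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OneHotEncoder

X, y = load_diabetes(return_X_y=True)
n_trees = 5
model = GradientBoostingRegressor(
    n_estimators=n_trees, max_depth=2, random_state=0
).fit(X, y)
encoder = OneHotEncoder(handle_unknown="ignore").fit(model.apply(X))


def tree_features(X, one_hot, append):
    """Hypothetical helper: leaf features, optionally encoded and appended."""
    leaves = model.apply(X)  # one leaf index per tree
    if one_hot:
        # One column per distinct leaf seen while fitting the encoder
        leaves = encoder.transform(leaves).toarray()
    # Assumed meaning of append: keep the original columns in front
    return np.hstack([X, leaves]) if append else leaves


for one_hot in (False, True):
    for append in (False, True):
        shape = tree_features(X, one_hot, append).shape
        print(f"one_hot={one_hot}, append={append}: {shape}")
```

Without one-hot encoding the sketch yields one column per tree; with it, the width equals the total number of distinct leaves across all trees.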