Preprocessing

Classes:

`GBMDiscretizer`(estimator, columns[, ...])	Feature discretizer based on GBDT.
`GBMFeaturizer`(estimator[, one_hot, append])	Feature generator for any GBDT model.

class skgbm.preprocessing.GBMDiscretizer(estimator, columns, one_hot=True, append=False)[source]

Bases: BaseEstimator, TransformerMixin, GBM

Feature discretizer based on GBDT.

Internally, it uses ArbitraryDiscretiser to handle the discretization step after finding the optimal thresholds.

Parameters:

estimator (object) – A gradient boosting model from scikit-learn, XGBoost, LightGBM or CatBoost library
columns (list of str) – List of column names to be transformed
one_hot (bool) – Transform the output categorical features using one-hot encoding
append (bool) – Append the newly created features to the original ones

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from catboost import CatBoostClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from skgbm.preprocessing import GBMDiscretizer
>>>
>>> iris = load_iris()
>>> data = pd.DataFrame(
...     data=np.c_[iris['data'], iris['target']],
...     columns=iris['feature_names'] + ['target']
... )
>>> # Drop the ' (cm)' unit suffix and replace spaces with underscores
>>> data.columns = data.columns.str.replace(' (cm)', '', regex=False)
>>> data.columns = data.columns.str.replace(' ', '_')
>>>
>>> # Data splitting
>>> X, y = data.iloc[:, :4], data.iloc[:, 4:]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0
... )
>>> X_cols = X.columns.tolist()
>>> gbm_discretizer = GBMDiscretizer(CatBoostClassifier(verbose=0),
...                                  X_cols, one_hot=False)
>>> X_train_disc = gbm_discretizer.fit_transform(X_train, y_train)
>>> #      sepal_length  sepal_width  petal_length  petal_width
>>> # 60              7            0             9            5
>>> # 116            22            9            29           13
>>> # 144            24           12            31           20
>>> # 119            17            1            24           10
>>> # 108            24            4            32           13
>>> # ..            ...          ...           ...          ...
>>> # 9               6           10             4            0
>>> # 103            20            8            30           13
>>> # 67             15            6            15            5
>>> # 117            32           17            38           17
>>> # 47              3           11             3            1

fit(X, y, **kwargs)[source]

Fit a set of GBDT models (one per discretized feature), distil split thresholds from them, and create an internal ArbitraryDiscretiser instance based on those values.

Parameters:

X (array-like of shape (n_samples, n_features)) – A data frame (matrix) of all the features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (this is a supervised transformation).

Returns:

self – Fitted discretizer.

Return type:

object
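
The fitting procedure described for `fit` can be illustrated with a minimal sketch. It assumes scikit-learn's `GradientBoostingClassifier` as the estimator and reads split thresholds directly from each tree's `tree_` attribute; the helper name `distil_thresholds` is hypothetical, and the way skgbm extracts thresholds for XGBoost, LightGBM or CatBoost models may well differ.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)


def distil_thresholds(column, target, **gbm_params):
    """Fit a GBDT on a single feature and return its sorted, unique split thresholds.

    Hypothetical helper for illustration; not part of skgbm.
    """
    model = GradientBoostingClassifier(random_state=0, **gbm_params)
    model.fit(column.reshape(-1, 1), target)
    thresholds = set()
    # estimators_ is a 2D array of fitted DecisionTreeRegressor objects
    for tree in model.estimators_.ravel():
        t = tree.tree_
        # Internal nodes have a feature index >= 0; leaves are marked with -2
        thresholds.update(t.threshold[t.feature >= 0].tolist())
    return sorted(thresholds)


# One list of bin edges per column, keyed by column position
edges = {
    j: distil_thresholds(X[:, j], y, n_estimators=10, max_depth=2)
    for j in range(X.shape[1])
}
```

Because each model sees only one column, every threshold it learns is a cut point on that column alone, which is what makes the resulting edges usable as bin boundaries.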

fit_transform(X, y, **kwargs)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X, **kwargs)[source]

Discretize the specified subset of columns.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data to discretize.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)
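
As a rough illustration of the discretization step itself, the sketch below maps values of one column to bin indices given a set of thresholds. The threshold values are made up for the example, and `np.digitize` is used here as a stand-in; it is not necessarily how the internal ArbitraryDiscretiser assigns bins.

```python
import numpy as np

# Hypothetical thresholds, standing in for values distilled from fitted trees
thresholds = np.array([2.45, 4.75, 5.15])
petal_length = np.array([1.4, 4.7, 5.0, 6.1, 2.45])

# right=True puts a value equal to a threshold into the lower bin,
# matching the "x <= threshold" rule of a scikit-learn tree split
bins = np.digitize(petal_length, thresholds, right=True)

# Optional one-hot encoding: k thresholds define k + 1 bins
one_hot = np.eye(len(thresholds) + 1, dtype=int)[bins]
```

With `one_hot=False` the transformer would return something like `bins` for each selected column; with `one_hot=True` each column expands into one indicator column per bin.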

class skgbm.preprocessing.GBMFeaturizer(estimator, one_hot=True, append=True)[source]

Bases: BaseEstimator, TransformerMixin, GBM

Feature generator for any GBDT model.

Parameters:

estimator (object) – A gradient boosting model from scikit-learn, XGBoost, LightGBM or CatBoost library
one_hot (bool) – Transform the output categorical features using one-hot encoding
append (bool) – Append the newly created features to the original ones

Examples

>>> from lightgbm import LGBMRegressor
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import train_test_split
>>> from skgbm.preprocessing import GBMFeaturizer
>>>
>>> X, y = load_diabetes(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> gbm_featurizer = GBMFeaturizer(LGBMRegressor())
>>> gbm_featurizer.fit(X_train, y_train)
>>> gbm_featurizer.transform(X_test)

fit(X, y, **kwargs)[source]

Fit a GBDT model and OneHotEncoder.

Parameters:

X (array-like of shape (n_samples, n_features)) – A data frame (matrix) of all the features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values.

Returns:

self – Fitted featurizer.

Return type:

object
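
The combination of a GBDT model and a OneHotEncoder mentioned for `fit` can be sketched with plain scikit-learn. This is an assumption-laden illustration of the general leaf-index technique, using `GradientBoostingRegressor.apply`; it is not the GBMFeaturizer implementation, which also has to support XGBoost, LightGBM and CatBoost.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=20, max_depth=3, random_state=0)
gbm.fit(X_train, y_train)

# apply() returns, for every sample, the index of the leaf it reaches in each tree
leaves_train = gbm.apply(X_train)  # shape (n_samples, n_estimators)

# handle_unknown="ignore" keeps transform from failing on a leaf index
# that was not seen when the encoder was fitted
encoder = OneHotEncoder(handle_unknown="ignore").fit(leaves_train)

# Sparse matrix with one indicator column per (tree, leaf) pair
leaf_features = encoder.transform(gbm.apply(X_test))
```

Each sample is thus described by which leaf it lands in for every tree, and these indicator columns can be fed to a downstream model.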

fit_transform(X, y, **kwargs)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X, **kwargs)[source]

Return features distilled from the GBM model trees. The number of output features depends on the one_hot and append parameters.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data to transform.

Returns:

X_tr – Transformed array.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_trees) or (n_samples, 1)
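
To make the effect of `one_hot` and `append` on the output width more concrete, here is a small NumPy sketch. The `assemble` function and the leaf indices are hypothetical, and the column layout shown is an assumption; the exact shapes GBMFeaturizer produces may differ.

```python
import numpy as np

X = np.array([[0.1, 1.0], [0.4, 2.0], [0.9, 3.0]])  # original features
leaves = np.array([[1, 4], [2, 4], [2, 5]])          # hypothetical leaf index per tree


def assemble(X, leaves, one_hot=True, append=True):
    """Combine leaf-based features with the originals (illustrative only)."""
    if one_hot:
        blocks = []
        for col in leaves.T:
            # A real transformer would store these categories at fit time
            cats = np.unique(col)
            blocks.append((col[:, None] == cats[None, :]).astype(int))
        new = np.hstack(blocks)
    else:
        new = leaves
    # append=True keeps the original columns in front of the new ones
    return np.hstack([X, new]) if append else new


wide = assemble(X, leaves, one_hot=True, append=True)       # 2 original + 4 indicators
narrow = assemble(X, leaves, one_hot=False, append=False)   # one column per tree
```

In this toy case the one-hot output has four indicator columns because each of the two trees has two distinct leaves among the samples.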