Preprocessing
Classes:
GBMDiscretizer – Feature discretizer based on GBDT.
GBMFeaturizer – Feature generator for any GBDT model.
- class skgbm.preprocessing.GBMDiscretizer(estimator, columns, one_hot=True, append=False)[source]
Bases:
BaseEstimator, TransformerMixin, GBM
Feature discretizer based on GBDT.
Internally, it uses ArbitraryDiscretiser to handle the discretization step once the optimal thresholds have been found.
- Parameters:
estimator (object) – A gradient boosting model from scikit-learn, XGBoost, LightGBM or CatBoost library
one_hot (bool) – Transform the output categorical features using one-hot encoding
columns (list of str) – List of column names to be transformed
append (bool) – Append the newly created features to the original ones
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from catboost import CatBoostClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from skgbm.preprocessing import GBMDiscretizer
>>>
>>> iris = load_iris()
>>> data = pd.DataFrame(
...     data=np.c_[iris['data'], iris['target']],
...     columns=iris['feature_names'] + ['target']
... )
>>> # Drop the trailing " (cm)" from the feature names (this also shortens 'target' to 't')
>>> data.columns = data.columns.str[:-5]
>>> data.columns = data.columns.str.replace(' ', '_')
>>>
>>> # Data splitting
>>> X, y = data.iloc[:, :4], data.iloc[:, 4:]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0
... )
>>> X_cols = X.columns.tolist()
>>> gbm_discretizer = GBMDiscretizer(CatBoostClassifier(verbose=0),
...                                  X_cols, one_hot=False)
>>> X_train_disc = gbm_discretizer.fit_transform(X_train, y_train)
>>> #      sepal_length  sepal_width  petal_length  petal_width
>>> # 60              7            0             9            5
>>> # 116            22            9            29           13
>>> # 144            24           12            31           20
>>> # 119            17            1            24           10
>>> # 108            24            4            32           13
>>> # ..            ...          ...           ...          ...
>>> # 9               6           10             4            0
>>> # 103            20            8            30           13
>>> # 67             15            6            15            5
>>> # 117            32           17            38           17
>>> # 47              3           11             3            1
- fit(X, y, **kwargs)[source]
Fit a set of GBDT models (one per discretized feature), distill the split thresholds from them, and create an internal ArbitraryDiscretiser instance based on those values.
- Parameters:
X ({array-like} of shape (n_samples, n_features)) – A data frame (matrix) of all the features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (this is a supervised transformation).
- Returns:
self – Fitted discretizer.
- Return type:
object
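The idea behind fit can be illustrated without skgbm. The sketch below is not the library's implementation: it assumes scikit-learn's GradientBoostingClassifier as the GBDT model, the helper gbdt_thresholds is a hypothetical name introduced here, and np.digitize stands in for ArbitraryDiscretiser. It fits a model on a single feature, collects the split thresholds of all its trees, and uses them as bin edges.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier


def gbdt_thresholds(x, y, **gbdt_params):
    """Hypothetical helper: split thresholds of a GBDT fitted on one feature."""
    model = GradientBoostingClassifier(**gbdt_params)
    model.fit(x.reshape(-1, 1), y)
    thresholds = set()
    # estimators_ is a 2-D array of fitted DecisionTreeRegressor objects
    for tree in model.estimators_.ravel():
        is_split = tree.tree_.feature >= 0  # leaf nodes are marked with -2
        thresholds.update(tree.tree_.threshold[is_split].tolist())
    return np.array(sorted(thresholds))


X, y = load_iris(return_X_y=True)
petal_length = X[:, 2]
cuts = gbdt_thresholds(
    petal_length, y, n_estimators=5, max_depth=2, random_state=0
)

# Map every value to the index of the interval it falls into.
# right=True mirrors the "x <= threshold goes left" rule of the trees.
bins = np.digitize(petal_length, cuts, right=True)
```

The number of bins is driven by the model's capacity (number of trees and their depth), so shallow, small ensembles give coarser discretizations.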
- fit_transform(X, y, **kwargs)[source]
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
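To picture what the one_hot and append options of GBMDiscretizer control, the following sketch uses plain pandas and NumPy. The thresholds, the "_bin" suffix, and the resulting column names are assumptions made for illustration; the exact output layout produced by the library may differ.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"petal_length": [1.4, 4.7, 6.0, 5.1]})
cuts = {"petal_length": [2.5, 4.9]}  # hypothetical thresholds, not learned here

# Replace each value by the index of its interval
codes = X.copy()
for col, edges in cuts.items():
    codes[col] = np.digitize(X[col], edges, right=True)

# Analogue of one_hot=False: one integer bin code per discretized column
ordinal = codes

# Analogue of one_hot=True: one indicator column per (column, bin) pair
one_hot = pd.get_dummies(codes, columns=list(cuts), dtype=int)

# Analogue of append=True: keep the original columns next to the new ones
appended = pd.concat([X, codes.add_suffix("_bin")], axis=1)
```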
- class skgbm.preprocessing.GBMFeaturizer(estimator, one_hot=True, append=True)[source]
Bases:
BaseEstimator, TransformerMixin, GBM
Feature generator for any GBDT model.
- Parameters:
estimator (object) – A gradient boosting model from scikit-learn, XGBoost, LightGBM or CatBoost library
one_hot (bool) – Transform the output categorical features using one-hot encoding
append (bool) – Append the newly created features to the original ones
References
[1] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, J. Q. Candela, “Practical Lessons from Predicting Clicks on Ads at Facebook”, 2014.
[2] C. Mougan, “Feature Generation with Gradient Boosted Decision Trees”, Towards Data Science, 2021.
[3] David Masip, “sktools — Helpers for scikit learn”
[4] xgboostExtension: xgboost Extension for Easy Ranking & TreeFeature
Examples
>>> from lightgbm import LGBMRegressor
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import train_test_split
>>> from skgbm.preprocessing import GBMFeaturizer
>>>
>>> X, y = load_diabetes(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> gbm_featurizer = GBMFeaturizer(LGBMRegressor())
>>> gbm_featurizer.fit(X_train, y_train)
>>> gbm_featurizer.transform(X_test)
- fit(X, y, **kwargs)[source]
Fit a GBDT model and OneHotEncoder.
- Parameters:
X ({array-like} of shape (n_samples, n_features)) – A data frame (matrix) of all the features.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values.
- Returns:
self – Fitted featurizer.
- Return type:
object
- fit_transform(X, y, **kwargs)[source]
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- transform(X, **kwargs)[source]
Return features distilled from the trees of the GBM model. The number of output features depends on the one_hot and append parameters.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The data to transform.
- Returns:
X_tr – Transformed array.
- Return type:
{ndarray, sparse matrix} of shape (n_samples, n_trees) or (n_samples, 1)
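The kind of features transform returns can be reproduced with scikit-learn alone. The sketch below is an illustration under assumptions, not the skgbm implementation: it uses GradientBoostingRegressor and its apply method to obtain, for each sample, the index of the leaf reached in every tree, then one-hot encodes those indices and, separately, appends them to the original features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OneHotEncoder

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = GradientBoostingRegressor(n_estimators=10, max_depth=3, random_state=0)
model.fit(X, y)

# One leaf index per sample and tree: shape (n_samples, n_trees)
leaves = model.apply(X)

# Analogue of one_hot=True: one indicator column per (tree, leaf) pair,
# returned as a sparse matrix
encoder = OneHotEncoder(handle_unknown="ignore")
leaves_one_hot = encoder.fit_transform(leaves)

# Analogue of append=True: original features followed by the leaf indices
X_appended = np.hstack([X, leaves])
```

Each row of leaves_one_hot contains exactly one active column per tree, which is the sparse, high-dimensional representation that a downstream linear model can consume.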