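How can I adapt this deprecated StratifiedKFold code?

Problem description

I have a dataset with imbalanced response values: the "rejected" and "not rejected" classes occur in very different numbers, so I want to balance my dataset.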

For this, there is code that works with the now-deprecated cross_validation.StratifiedKFold, but I now need to adapt it and I don't understand it well, so I am asking for help.

The original code is:

def stratified_cv(X,y,clf_class,shuffle=True,n_folds=10,**kwargs):
    stratified_k_fold = cross_validation.StratifiedKFold(y,n_folds=n_folds,shuffle=shuffle)
    y_pred = y.copy()
    # ii -> train
    # jj -> test indices
    for ii,jj in stratified_k_fold:
        X_train,X_test = X[ii],X[jj]
        y_train = y[ii]
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        y_pred[jj] = clf.predict(X_test)
    return y_pred

Here X is the dataset after fit_transform, converted to a numpy float array and scaled, and y is the "rejected" vs "not rejected" classification converted to int (an array of 0s and 1s). Finally, clf_class(**kwargs) can be a classifier such as:

ensemble.GradientBoostingClassifier

svm.SVC

ensemble.RandomForestClassifier
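
For readers who want to reproduce the setup, the inputs described above could be prepared roughly as follows. This is only a sketch: the raw feature matrix, the label strings, and the choice of StandardScaler as the scaler are assumptions for illustration, since the question does not show how X and y were built.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data: 50 rows, 3 integer features (not from the question).
rng = np.random.default_rng(0)
raw = rng.integers(0, 100, size=(50, 3))

# Hypothetical imbalanced labels: 10 "rejected" vs 40 "not rejected".
labels = np.array(["rejected"] * 10 + ["not rejected"] * 40)

# X: converted to a float array and scaled (StandardScaler is an assumed choice).
X = StandardScaler().fit_transform(raw.astype(float))

# y: integer array with 1 for "rejected" and 0 for "not rejected".
y = (labels == "rejected").astype(int)
```

Both X and y are plain numpy arrays here, which matters because stratified_cv indexes them with integer index arrays (X[ii], y[jj]).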

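Solution

StratifiedKFold has moved to sklearn.model_selection. In the new API the constructor takes n_splits instead of y and n_folds, and the train/test indices come from calling split(X,y) on the object rather than iterating over it directly. So you should do this: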

from sklearn.model_selection import StratifiedKFold

def stratified_cv(X, y, clf_class, shuffle=True, n_folds=10, **kwargs):
    # n_splits replaces n_folds; y is now passed to split() instead of the constructor
    stratified_k_fold = StratifiedKFold(n_splits=n_folds, shuffle=shuffle)
    y_pred = y.copy()
    # ii -> train indices
    # jj -> test indices
    for ii, jj in stratified_k_fold.split(X, y):
        X_train, X_test = X[ii], X[jj]
        y_train = y[ii]
        # Train a fresh classifier on each training fold
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        # Fill in the out-of-fold predictions for the test indices
        y_pred[jj] = clf.predict(X_test)
    return y_pred
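
As a rough check of what the stratified split does, the sketch below uses a synthetic imbalanced dataset (an assumption standing in for the real data) and measures the share of class 1 in every test fold. It also shows cross_val_predict from sklearn.model_selection, which should yield the same kind of out-of-fold predictions as the hand-written loop; the dataset sizes, the classifier settings, and random_state=0 are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the real data: roughly 80% class 0, 20% class 1.
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Share of class 1 in each test fold; stratification keeps these close together.
fold_pos_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

# Out-of-fold predictions: each sample is predicted by a model
# that was trained on the folds not containing that sample.
clf = RandomForestClassifier(n_estimators=20, random_state=0)
y_pred = cross_val_predict(clf, X, y, cv=cv)
```

Passing the StratifiedKFold object as cv keeps the folds explicit, so the same splitter can be reused when comparing several classifiers.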

machine-learning python scikit-learn sklearn-pandas