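Oversampling with SMOTETomek: "could not convert string to float" error

Problem description

When I run the following code to balance an imbalanced dataset by oversampling with SMOTETomek: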

from imblearn.combine import SMOTETomek
import pandas as pd

# read the input dataset
input_dataset = pd.read_csv('sample.csv')

# get all the column names as a list
columns = input_dataset.columns.tolist()

# separate the target column from all the other columns
target_column = "svalue_numerical"
other_columns = [item for item in columns if item != target_column]

X = input_dataset[other_columns]
y = input_dataset[target_column]

# over-sample with SMOTE, then clean the result using Tomek links
oversample = SMOTETomek(random_state=42)
X_oversampled, y_oversampled = oversample.fit_resample(X, y)

I get the following error:

ValueError: Could not convert string to float: '2019-04-04 00:00:00.000049'

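My understanding is that I cannot pass a DataFrame containing string-typed columns to the fit_resample() method. I tried converting the DataFrame to a NumPy array, as suggested elsewhere, but did not get the expected result.

A sample of my dataset looks like this: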

timestamp                      evalue       svalue        evalue_numerical        svalue_numerical
2019-04-04 00:00:00.000049     cam          cam_on        0                       1
2019-04-04 00:00:15.020115     fan          fan_off       1                       3
2019-04-04 00:00:15.031492     pc           pc_on         2                       5
2019-04-04 00:00:15.050193     scr          scr_on        3                       7
................................................................................................
................................................................................................
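
To see which columns trigger the error, it can help to list every feature column whose dtype is not numeric. The snippet below is a minimal sketch: the small DataFrame is only a stand-in for sample.csv, built from the first two sample rows above, and it assumes the CSV was read without date parsing, so timestamp arrives as plain strings.

```python
import pandas as pd

# Stand-in for sample.csv, built from the first two sample rows above.
input_dataset = pd.DataFrame({
    "timestamp": ["2019-04-04 00:00:00.000049", "2019-04-04 00:00:15.020115"],
    "evalue": ["cam", "fan"],
    "svalue": ["cam_on", "fan_off"],
    "evalue_numerical": [0, 1],
    "svalue_numerical": [1, 3],
})

# Feature matrix as in the question: everything except the target column.
X = input_dataset.drop(columns=["svalue_numerical"])

# Columns that are not numeric are the ones that cannot be cast to float.
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

For these rows the list contains timestamp, evalue and svalue. The timestamp column comes first, which matches the value quoted in the error message.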

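Can you suggest a way to solve this problem? Thanks for your help.

Note: I can drop the evalue and svalue columns if necessary, but the timestamp column must be present in the oversampled dataset.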
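
One possible workaround, sketched under assumptions rather than offered as a confirmed fix: turn the timestamp into a numeric feature (seconds elapsed since the earliest row), drop the string columns, resample, and then convert the numeric column back into timestamps. The helper names encode_elapsed_seconds, decode_elapsed_seconds and oversample_with_timestamps are hypothetical and do not come from imbalanced-learn. Keep in mind that the synthetic rows SMOTE generates will carry interpolated timestamps, which may or may not be meaningful for your data.

```python
import pandas as pd


def encode_elapsed_seconds(timestamps):
    """Convert timestamps to float seconds since the earliest one."""
    ts = pd.to_datetime(timestamps)
    origin = ts.min()
    return (ts - origin).dt.total_seconds(), origin


def decode_elapsed_seconds(seconds, origin):
    """Turn float seconds since `origin` back into timestamps,
    rounded to microseconds to absorb floating-point noise."""
    return origin + pd.to_timedelta(seconds, unit="s").dt.round("us")


def oversample_with_timestamps(df, target, time_column="timestamp",
                               random_state=42):
    """Hypothetical wrapper: resample numeric features plus an encoded time column."""
    # Imported here so the two helpers above work without imbalanced-learn.
    from imblearn.combine import SMOTETomek

    seconds, origin = encode_elapsed_seconds(df[time_column])

    # Keep numeric features only; this drops string columns such as
    # evalue and svalue, which the question says may be removed.
    features = (
        df.drop(columns=[target, time_column])
          .select_dtypes(include="number")
    )
    X = features.assign(**{time_column: seconds})
    y = df[target]

    X_res, y_res = SMOTETomek(random_state=random_state).fit_resample(X, y)

    # Restore a real datetime column in the resampled features.
    X_res[time_column] = decode_elapsed_seconds(X_res[time_column], origin)
    return X_res, y_res
```

With the dataset from the question this would be called as X_oversampled, y_oversampled = oversample_with_timestamps(input_dataset, "svalue_numerical"). Elapsed seconds are used instead of raw nanosecond integers because a float64 cannot hold nanosecond epoch values exactly. Note also that SMOTE with its default k_neighbors=5 needs more than five rows in every class it oversamples.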


imblearn oversampling pandas scikit-learn smote