How to specify train and test sets from separate dataframes?

Problem description

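I have a dataframe that mixes news articles and Facebook posts (full text), with corresponding labels (a single set of labels covering all texts, articles and posts alike). I want to train my classifier on both kinds of text, but have only Facebook posts in my test set. Is there a way to specify the set of rows (grouped by the "source" column) from which the test set is drawn?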

I am using simpletransformers for the classification model.

Thanks!

Solution

Split the data as follows:

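from sklearn.model_selection import train_test_split

# features: select the input column(s)
X = df[<columns>]
# target: select the label column
y = df[<one column>]
# hold out 30% for testing; stratify=y keeps the label proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)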
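
The split above draws test rows from every source. To keep only Facebook posts in the test set, as the question asks, one option is to split the posts alone and then add all articles to the training portion. The following is a minimal sketch, not a definitive implementation: the column names "text", "labels" and "source", the source value "facebook", the helper name split_with_posts_only_test and the toy data are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_with_posts_only_test(df, source_col="source", test_source="facebook",
                               label_col="labels", test_size=0.3, random_state=123):
    """Hypothetical helper: the test set comes only from rows of `test_source`."""
    is_post = df[source_col] == test_source
    posts = df[is_post]
    articles = df[~is_post]

    # Split only the posts; stratify so label proportions match in both parts.
    posts_train, posts_test = train_test_split(
        posts,
        test_size=test_size,
        random_state=random_state,
        stratify=posts[label_col],
    )

    # Training data = every article + the posts that were not held out, shuffled.
    train_df = (
        pd.concat([articles, posts_train])
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )
    test_df = posts_test.reset_index(drop=True)
    return train_df, test_df


# Toy data (assumed layout): 10 articles and 10 Facebook posts, two labels.
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(20)],
    "labels": [i % 2 for i in range(20)],
    "source": ["article"] * 10 + ["facebook"] * 10,
})

train_df, test_df = split_with_posts_only_test(df)
```

The resulting train_df and test_df are ordinary dataframes, so the "source" column can be dropped before they are handed to the model if it is not needed as a feature.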

If you have two dataframes, you need to combine them first:

import pandas as pd
df = pd.concat([df1, df2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
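
If the two dataframes do not already record where each row came from, it may help to tag them while combining, so that rows can later be filtered by origin. A small sketch, assuming hypothetical frames named articles and posts with "text" and "labels" columns, and an added "source" column:

```python
import pandas as pd

# Toy frames standing in for the two separate dataframes (assumed layout).
articles = pd.DataFrame({"text": ["a1", "a2"], "labels": [0, 1]})
posts = pd.DataFrame({"text": ["p1", "p2", "p3"], "labels": [1, 0, 1]})

# assign() returns a copy with the extra column, so the originals stay untouched.
df = pd.concat(
    [articles.assign(source="article"), posts.assign(source="facebook")],
    ignore_index=True,  # rebuild a clean 0..n-1 index
)
```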

classification pandas python scikit-learn training-data