TypeError: data must be list or dict-like in cuDF

Problem description

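I am adopting cuDF to speed up my Python process. First, I imported cuDF, removed the multiprocessing code, and initialized my variables with cuDF. After switching to cuDF, I get the "list or dict-like" error shown in the traceback.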

How can I remove these loops to get an efficient implementation?

Code

import more_itertools
import pandas as pd
import numpy as np
import itertools
from os import cpu_count
from pathlib import Path
from sklearn.metrics import confusion_matrix,accuracy_score,roc_curve,auc
import matplotlib.pyplot as plt
import json
import os
import gc
from tqdm import tqdm
import cudf

gc.collect()
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import logging
import sys

logging.basicConfig(stream=sys.stdout,level=logging.DEBUG)
import logging

mpl_logger = logging.getLogger('matplotlib')
mpl_logger.setLevel(logging.WARNING)

with open(Path(__file__).parent / "ageDB.json","r") as f:
    identities = json.load(f)

positives = cudf.DataFrame()

for value in tqdm(identities.values(),desc="Positives"):
    positives = positives.append(cudf.DataFrame(itertools.combinations(value,2),columns=["file_x","file_y"]),ignore_index=True)

positives["decision"] = "Yes"
print(positives)

samples_list = list(identities.values())
negatives = cudf.DataFrame()


######################====================Functions=============##############

def compute_cross_samples(x):
    return cudf.DataFrame(itertools.product(*x),columns=["file_x","file_y"])

####################################
if Path("positives_negatives.csv").exists():
    df = cudf.read_csv("positives_negatives.csv")
else:
    for combos in tqdm(more_itertools.ichunked(itertools.combinations(identities.values(),2),cpu_count())):
        for cross_samples in map(compute_cross_samples,combos):
            negatives = negatives.append(cross_samples)

negatives["decision"] = "No"
negatives = negatives.sample(positives.shape[0])
df = cudf.concat([positives,negatives]).reset_index(drop=True)
df.to_csv("positives_negatives.csv",index=False)

df.file_x = "deepface/tests/dataset/" + df.file_x
df.file_y = "deepface/tests/dataset/" + df.file_y

Traceback

Traceback (most recent call last):
  File "Ensemble-Face-Recognition.py", line 36, in <module>
    positives = positives.append(cudf.DataFrame(itertools.combinations(value,2),columns=["file_x","file_y"]),ignore_index=True)
  File "/home/khawar/anaconda3/envs/rapids-0.17/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/khawar/anaconda3/envs/rapids-0.17/lib/python3.7/site-packages/cudf/core/dataframe.py", line 289, in __init__
    raise TypeError("data must be list or dict-like")
TypeError: data must be list or dict-like

Solution

itertools.combinations returns a lazy iterator rather than a list, so you need to call list explicitly to get a list-like value:

cudf.DataFrame(list(itertools.combinations(value,2)), ...)
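
The difference is easy to check with the standard library alone. This is a minimal sketch; the file names are made up for illustration and are not taken from ageDB.json.

```python
import itertools

# Hypothetical file names standing in for one identity's images.
value = ["a.jpg", "b.jpg", "c.jpg"]

# combinations() returns a lazy iterator object, which is neither a list nor a dict.
pairs = itertools.combinations(value, 2)
print(isinstance(pairs, (list, dict)))  # False

# list() consumes the iterator and materializes every pair as a tuple.
pair_list = list(itertools.combinations(value, 2))
print(pair_list)
# [('a.jpg', 'b.jpg'), ('a.jpg', 'c.jpg'), ('b.jpg', 'c.jpg')]
```

The resulting list of tuples is the kind of "list-like" input the DataFrame constructor in the traceback is asking for.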

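As an aside, I'm not sure whether this also holds for cuDF, but in pandas it is much faster to build a list of DataFrames and concatenate them once at the end than to create an empty DataFrame and keep appending to it. Your loop reassigns positives to a newly appended DataFrame on every iteration.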
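
The "collect, then concatenate once" pattern could look roughly like this in pandas. The identities dictionary here is a made-up stand-in for the contents of ageDB.json. The question's own code already calls cudf.concat, so the same shape may carry over to cuDF, but that part is untested here.

```python
import itertools

import pandas as pd

# Hypothetical stand-in for the identities loaded from ageDB.json.
identities = {
    "person_a": ["a1.jpg", "a2.jpg", "a3.jpg"],
    "person_b": ["b1.jpg", "b2.jpg"],
}

# Build one small frame per identity and keep them in a plain Python list.
frames = [
    pd.DataFrame(list(itertools.combinations(files, 2)), columns=["file_x", "file_y"])
    for files in identities.values()
]

# Concatenate once at the end instead of appending inside the loop.
positives = pd.concat(frames, ignore_index=True)
positives["decision"] = "Yes"
print(positives.shape)  # (4, 3)
```

Appending inside the loop copies the accumulated rows on every iteration, while a single concat copies each row once.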


Formatting your last comment:

positives = positives.append(cudf.DataFrame(
   list(itertools.combinations(value,2),columns=["file_x","file_y"]),

TypeError: list() takes no keyword arguments

Because of how you paired the parentheses, columns is being passed as an argument to list rather than to DataFrame.
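
A small sketch of the two parenthesizations side by side. It uses pandas as a stand-in for cuDF so it runs without a GPU, and the file names are made up; the assumption is that the cuDF constructor accepts the same columns keyword, as the question's code suggests.

```python
import itertools

import pandas as pd

value = ["a.jpg", "b.jpg", "c.jpg"]  # made-up file names

# Misplaced parenthesis: columns= lands inside list(...), so list() itself raises
# before any DataFrame is constructed.
try:
    list(itertools.combinations(value, 2), columns=["file_x", "file_y"])
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Close list(...) first, then pass columns= to the DataFrame constructor.
pairs = pd.DataFrame(
    list(itertools.combinations(value, 2)),
    columns=["file_x", "file_y"],
)
print(pairs.shape)  # (3, 2)
```

Putting each argument of the outer call on its own line, as above, makes it easier to see which call a keyword belongs to.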