Iterating over a Dask DataFrame

Question

I'm trying to build a Keras Tokenizer from a single column spread across hundreds of large CSV files. Dask seems like a good tool for this. My current approach eventually runs into memory problems:

df = dd.read_csv('data/*.csv', usecols=['MyCol'])

# Process column and get underlying Numpy array.
# This greatly reduces memory consumption, but eventually materializes
# the entire dataset into memory
my_ids = df.MyCol.apply(process_my_col).compute().values

tokenizer = Tokenizer()
tokenizer.fit_on_texts(my_ids)

How can I do this piece by piece? Something like:

df = pd.read_csv('a-single-file.csv', chunksize=1000)
for chunk in df:
    # Process a chunk at a time

Answers

A Dask DataFrame is, under the hood, a collection of pandas DataFrames called partitions. When you pull out the underlying NumPy array, you discard that partitioned structure and end up with one large array in memory. I recommend using Dask DataFrame's map_partitions function to apply an ordinary pandas function to each partition separately.
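A minimal sketch of that suggestion, assuming process_my_col from the question operates on one value at a time (the stub below is only a placeholder for it):

import dask.dataframe as dd

# Placeholder for the question's process_my_col; the real implementation
# would do whatever per-value processing is needed.
def process_my_col(value):
    return str(value)

df = dd.read_csv('data/*.csv', usecols=['MyCol'])

# map_partitions hands each partition to this function as a plain pandas
# DataFrame, so regular pandas operations such as .apply work directly.
def process_partition(partition):
    return partition['MyCol'].apply(process_my_col)

# Lazy: one task per partition; no data is loaded until a compute step.
processed = df.map_partitions(process_partition, meta=('MyCol', 'object'))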

Another answer

I would also suggest map_partitions for this problem. However, if you really just want sequential access, with an API similar to read_csv(chunksize=...), then what you are looking for is the partitions attribute:

for part in df.partitions:
    process(model,part.compute())
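Putting this together for the tokenizer use case, a hedged sketch (assuming the Keras Tokenizer's fit_on_texts accumulates vocabulary counts across repeated calls, that the import path matches your TensorFlow/Keras version, and with a stub standing in for the question's process_my_col):

import dask.dataframe as dd
from tensorflow.keras.preprocessing.text import Tokenizer

def process_my_col(value):   # placeholder for the question's function
    return str(value)

df = dd.read_csv('data/*.csv', usecols=['MyCol'])
tokenizer = Tokenizer()

# df.partitions yields one lazy partition at a time; .compute() materializes
# only that partition as a pandas object, keeping peak memory bounded.
for part in df.partitions:
    chunk = part['MyCol'].compute()
    my_ids = chunk.apply(process_my_col)   # same per-value step as the question
    tokenizer.fit_on_texts(my_ids)         # adds to counts from earlier chunks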
