Problem description
I am trying to build a Keras Tokenizer from a single column of hundreds of large CSV files. Dask seems like a good tool for this, but my current approach eventually runs into memory problems:
import dask.dataframe as dd
from keras.preprocessing.text import Tokenizer

df = dd.read_csv('data/*.csv', usecols=['MyCol'])

# Process the column and get the underlying NumPy array.
# This greatly reduces memory consumption, but it eventually materializes
# the entire dataset into memory.
my_ids = df.MyCol.apply(process_my_col).compute().values

tokenizer = Tokenizer()
tokenizer.fit_on_texts(my_ids)
How can I do this piece by piece? Something similar to:
df = pd.read_csv('a-single-file.csv', chunksize=1000)
for chunk in df:
    # Process one chunk at a time
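To make the intent concrete, here is a minimal sketch of that chunked pattern for a single file (assuming the process_my_col function from above; Keras's fit_on_texts accumulates word counts across calls, so it can be fed one chunk at a time):

import pandas as pd
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
for chunk in pd.read_csv('a-single-file.csv', usecols=['MyCol'], chunksize=1000):
    # Each call to fit_on_texts updates the same tokenizer's counts,
    # so earlier chunks never need to stay in memory.
    tokenizer.fit_on_texts(chunk['MyCol'].apply(process_my_col).values)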
Solution
A Dask DataFrame is technically a collection of pandas DataFrames, called partitions. When you pull out the underlying NumPy array, you destroy that partition structure, and the result is one big array. Instead, I recommend Dask DataFrame's map_partitions
function, which applies a regular pandas function to each partition separately.
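A minimal sketch of that approach (process_my_col is the function from the question; the meta argument is an assumption about its output type):

def process_partition(pdf):
    # pdf is a plain pandas DataFrame holding one partition
    return pdf['MyCol'].apply(process_my_col)

# Still lazy and partitioned; meta tells Dask the result is an
# object-dtype Series (an assumption about process_my_col's output).
my_ids = df.map_partitions(process_partition, meta=('MyCol', 'object'))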
map_partitions is what I would suggest whenever it fits your problem. However, if you really just want sequential access, with an API similar to read_csv(chunksize=...), then you are probably looking for the partitions attribute:
for part in df.partitions:
    process(model, part.compute())
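Applied to the original problem, that loop might look like this (a sketch, assuming process_my_col returns texts the tokenizer can consume; fit_on_texts accumulates counts, so only one partition is ever in memory at a time):

import dask.dataframe as dd
from keras.preprocessing.text import Tokenizer

df = dd.read_csv('data/*.csv', usecols=['MyCol'])
tokenizer = Tokenizer()

for part in df.partitions:
    # Materialize a single partition as a pandas DataFrame.
    pdf = part.compute()
    # fit_on_texts updates the tokenizer's word counts incrementally.
    tokenizer.fit_on_texts(pdf['MyCol'].apply(process_my_col).values)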