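Why do I get an "OutOfMemoryError" in my K-means CuPy code?

Problem

I am really new to GPU programming. I found this K-means CuPy code, and my goal is to run it on a large dataset of shape (n, 3), for example, to measure the time difference between the GPU and the CPU. I want a large number of clusters, but I am getting a memory management error. Can someone tell me which route I should take to research and fix it? I have already searched, but I do not have a clear starting point yet.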

import contextlib
import time

import cupy
import matplotlib.pyplot as plt
import numpy

@contextlib.contextmanager
def timer(message):
    cupy.cuda.Stream.null.synchronize()
    start = time.time()
    yield
    cupy.cuda.Stream.null.synchronize()
    end = time.time()
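    print('%s:  %f sec' % (message, end - start))


# Squared Euclidean distance between a 2D point (x0, x1) and a center (c0, c1).
var_kernel = cupy.ElementwiseKernel(
    'T x0, T x1, T c0, T c1', 'T out',
    'out = (x0 - c0) * (x0 - c0) + (x1 - c1) * (x1 - c1)',
    'var_kernel'
)
# Sum of x over the entries selected by mask.
sum_kernel = cupy.ReductionKernel(
    'T x, S mask', 'T out',
    'mask ? x : 0',
    'a + b', 'out = a', '0',
    'sum_kernel'
)
# Number of entries selected by mask.
count_kernel = cupy.ReductionKernel(
    'T mask', 'float32 out',
    'mask ? 1.0 : 0.0',
    'a + b', 'out = a', '0.0',
    'count_kernel'
)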
    
    
def fit_xp(X, n_clusters, max_iter):
    assert X.ndim == 2

    # Get NumPy or CuPy module from the supplied array.
    xp = cupy.get_array_module(X)

    n_samples = len(X)

    # Make an array to store the labels indicating which cluster each sample
    # belongs to.
    pred = xp.zeros(n_samples)

    # Choose the initial centroid for each cluster.
    initial_indexes = xp.random.choice(n_samples, n_clusters, replace=False)
    centers = X[initial_indexes]

    for _ in range(max_iter):
        # Compute the new label for each sample.
        distances = xp.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_pred = xp.argmin(distances, axis=1)

        # If no label has changed, we suppose the algorithm has converged
        # and exit from the loop.
        if xp.all(new_pred == pred):
            break
        pred = new_pred

        # Compute the new centroid for each cluster.
        i = xp.arange(n_clusters)
        mask = pred == i[:, None]
        sums = xp.where(mask[:, :, None], X, 0).sum(axis=1)
        counts = xp.count_nonzero(mask, axis=1).reshape((n_clusters, 1))
        centers = sums / counts

    return centers, pred
    
    
def fit_custom(X, n_clusters, max_iter):
    assert X.ndim == 2

    n_samples = len(X)

    pred = cupy.zeros(n_samples, dtype='float32')

    initial_indexes = cupy.random.choice(n_samples, n_clusters, replace=False)
    centers = X[initial_indexes]

    for _ in range(max_iter):
        # Dense (n_samples, n_clusters) matrix of squared distances.
        distances = var_kernel(X[:, None, 0], X[:, None, 1],
                               centers[None, :, 1], centers[None, :, 0])
        new_pred = cupy.argmin(distances, axis=1)
        if cupy.all(new_pred == pred):
            break
        pred = new_pred

        i = cupy.arange(n_clusters)
        mask = pred == i[:, None]
        sums = sum_kernel(X, mask[:, :, None], axis=1)
        counts = count_kernel(mask, axis=1).reshape((n_clusters, 1))
        centers = sums / counts

    return centers, pred
    
    
def draw(X, n_clusters, centers, pred, output):
    # Plot the samples and centroids of the fitted clusters into an image file.
    for i in range(n_clusters):
        labels = X[pred == i]
        plt.scatter(labels[:, 0], labels[:, 1], c=numpy.random.rand(3))
    plt.scatter(
        centers[:, 0], centers[:, 1], s=120, marker='s', facecolors='y',
        edgecolors='k')
    plt.savefig(output)
  
    
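def run_cpu(gpuid, n_clusters, num, max_iter, use_custom_kernel):##,output
    samples = numpy.random.randn(num, 3)
    X_train = numpy.r_[samples + 1, samples - 1]

    with timer(' cpu '):
        centers, pred = fit_xp(X_train, n_clusters, max_iter)


def run_gpu(gpuid, n_clusters, num, max_iter, use_custom_kernel):##,output
    samples = numpy.random.randn(num, 3)
    X_train = numpy.r_[samples + 1, samples - 1]

    with cupy.cuda.Device(gpuid):
        X_train = cupy.asarray(X_train)

        with timer(' GPU '):
            if use_custom_kernel:
                centers, pred = fit_custom(X_train, n_clusters, max_iter)
            else:
                centers, pred = fit_xp(X_train, n_clusters, max_iter)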

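By the way, I am working in Colab Pro with 25 GB of RAM. The code works with n_clusters=200 and num=1000000, but if I use bigger numbers I get the error. I am running the code like this:

run_gpu(0, 200, 1000000, 10, True)

This is the error that I have

Any suggestions are welcome. Thanks for your time.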

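Solution

Assuming that CuPy is smart enough not to create explicit copies of the broadcast inputs to var_kernel, the output distances must hold 2 * num * n_clusters elements, which matches the 6,400,000,000 bytes it is trying to allocate. You can get a much smaller memory footprint by never actually writing the distances to memory, which means fusing var_kernel with argmin. See the section of the CuPy documentation on kernel fusion.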
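
The size estimate can be checked with a few lines of arithmetic. This is a minimal sketch that assumes float64 data (the default for numpy.random.randn, which cupy.asarray preserves); the helper name distances_nbytes is made up for illustration, and the 400-cluster run is a hypothetical configuration that happens to reproduce the reported allocation size, not a value stated in the question.

```python
def distances_nbytes(num, n_clusters, itemsize=8):
    """Bytes needed for a dense (2 * num, n_clusters) distance matrix."""
    # run_gpu stacks `samples + 1` and `samples - 1`, so there are 2 * num rows.
    n_samples = 2 * num
    return n_samples * n_clusters * itemsize


# Configuration from the question that still runs.
print(distances_nbytes(1_000_000, 200))  # 3200000000
# Hypothetical larger run: doubling the cluster count doubles the footprint.
print(distances_nbytes(1_000_000, 400))  # 6400000000
```

The footprint grows linearly in both num and n_clusters, so any increase in either one scales this single array by the same factor.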

If I understand the examples in the documentation correctly, this should work:

@cupy.fuse(kernel_name='argmin_distance')
def argmin_distance(x1, y1, x2, y2):
    return cupy.argmin((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2), axis=1)
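
If fusion turns out not to be applicable, a different technique with the same goal is to process the samples in chunks, so that only a slice of the distance matrix exists at any time. The sketch below is not the fused kernel above; it is plain array code, shown and exercised with NumPy only. The function name argmin_distance_chunked, the chunk_size parameter, and the xp parameter are all hypothetical, and passing xp=cupy with CuPy arrays is an untested assumption.

```python
import numpy


def argmin_distance_chunked(X, centers, chunk_size=100_000, xp=numpy):
    """Label each row of X with the index of its nearest center.

    The largest temporary has shape (chunk_size, n_clusters, n_features)
    instead of covering all n_samples rows at once.
    """
    n_samples = len(X)
    labels = xp.empty(n_samples, dtype=xp.int64)
    for start in range(0, n_samples, chunk_size):
        stop = min(start + chunk_size, n_samples)
        # Squared Euclidean distances for this block of rows only.
        diff = X[start:stop, None, :] - centers[None, :, :]
        sq_dist = (diff * diff).sum(axis=2)
        labels[start:stop] = xp.argmin(sq_dist, axis=1)
    return labels


# Small CPU usage example with made-up data.
rng = numpy.random.default_rng(0)
X_demo = rng.standard_normal((1000, 2))
centers_demo = X_demo[:5]
labels_demo = argmin_distance_chunked(X_demo, centers_demo, chunk_size=128)
```

The chunk size trades memory for the number of loop iterations; the labels do not depend on it.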

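The next question is where the other 13.7 GB come from. A large part of it is probably just instances of distances from earlier iterations. I am not a CuPy expert, but at least in Python/NumPy, the way you use distances inside the loop does not reuse the same memory; instead, more memory is allocated on every call to var_kernel. The same problem is visible with pred, which is allocated before the loop. If CuPy does things the NumPy way, the solution is simply to put [:] in there, like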

pred[:] = new_pred

or

distances[:, :] = var_kernel(X[:, None, 0], X[:, None, 1], centers[None, :, 1], centers[None, :, 0])

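For this you also need to allocate distances before the loop. Note that this is no longer needed once you use kernel fusion, so take it only as an example. It is probably best to allocate everything beforehand and then use this syntax everywhere in the loop.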
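
The difference between rebinding a name and assigning through [:] can be observed directly in NumPy by comparing buffer addresses. This is a NumPy-only sketch with made-up values; that CuPy arrays behave the same way is an assumption, as stated above.

```python
import numpy

pred = numpy.zeros(4, dtype=numpy.int64)
# Address of the memory block that currently backs `pred`.
buffer_before = pred.__array_interface__['data'][0]

new_pred = numpy.array([2, 0, 1, 2])

# In-place assignment copies the values into the existing buffer.
pred[:] = new_pred
same_after_inplace = pred.__array_interface__['data'][0] == buffer_before

# Rebinding makes the name refer to new_pred's separate buffer instead.
pred = new_pred
same_after_rebind = pred.__array_interface__['data'][0] == buffer_before

print(same_after_inplace, same_after_rebind)  # True False
```

Note that the right-hand side of an in-place assignment is still evaluated into a temporary array before it is copied, so this keeps the destination buffer stable but does not by itself avoid the temporary.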

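I do not know enough about CuPy to say why fit_xp does not have the same problem (or does it?). My guess is that garbage collection of CuPy objects works differently there. If garbage collection in fit_custom were "fast enough", it should work even without kernel fusion or reusing already-allocated arrays.

Other problems, or at least oddities, in your code: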

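Why do you compare the first coordinate of X with the second coordinate of centers, and vice versa? Wouldn't it make more sense to call

distances = var_kernel(X[:, None, 0], X[:, None, 1], centers[None, :, 0], centers[None, :, 1])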

Why do you create 3D data when you only use its projection onto a 2D plane? Why not

samples = numpy.random.randn(num, 2)

Why do you use floating-point numbers for (the initial version of) pred? argmin should give a result of integer type.
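
The integer result type of argmin is easy to confirm in NumPy. The sketch below uses made-up distances, and the -1 initialisation is a hypothetical choice rather than something from the original code: it is a label argmin can never return, so the first convergence check cannot pass by accident.

```python
import numpy

distances = numpy.array([[0.5, 0.1, 0.9],
                         [0.2, 0.7, 0.3]])
new_pred = numpy.argmin(distances, axis=1)
print(new_pred, new_pred.dtype.kind)  # [1 0] i

# Integer label array with the same dtype as the argmin result.
pred = numpy.full(len(distances), -1, dtype=new_pred.dtype)
print(bool(numpy.all(new_pred == pred)))  # False
```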

cupy k-means memory-management out-of-memory