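Does Google Colab use the GPU for NLTK-based lemmatization?

Question

I am trying to run the code below on Google Colab. Here, Corpus['text'] was produced by running Corpus['text'] = [word_tokenize(entry) for entry in Corpus['text']]. Before that line ran, Corpus['text'] was a DataFrame column of 1M sentences; it now holds the tokenized words. The code below takes a very long time to run. I would like to know whether it uses the GPU that Google Colab provides. If not, what can I do to speed up preprocessing of this dataset? Please do not suggest reducing the dataset.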

from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm

# Note: tag_map is defined elsewhere in the asker's code and is not shown here.
for index, entry in tqdm(enumerate(Corpus['text'])):
    # Declaring Empty List to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for Stop words and consider only alphabets
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration will be stored in 'text_final'
    Corpus.loc[index, 'text_final'] = str(Final_words)

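Note: I wrapped my iterator in tqdm, which shows a speed of about 30 it/s. At that rate, lemmatizing this dataset would take roughly 10-11 hours. Also, can the GPU only be used for training, and not for loops like this?

Answer

Operations such as lemmatization are not performed on the GPU. Your problem is probably the line where you check for stop words. The NLTK stop word collection is a list, and checking membership in a list inside a loop is very slow. Try converting it to a set, like this: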

# Put this at the top
stops = set(stopwords.words('english'))

# Make this your loop
if word.lower() not in stops and word.isalpha():

Note that I used word.lower(). This is fairly standard, so that capitalized stop words such as "The" are filtered out as well.
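
To illustrate the difference between the two membership checks, here is a minimal, self-contained sketch that runs without NLTK. The names STOP_LIST, STOP_SET, TOKENS, keep_with_list and keep_with_set are hypothetical, and the short word list is only a stand-in for stopwords.words('english'). Actual timings depend on your machine and on the length of the real list.

```python
import timeit

# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is
# longer, which makes scanning it for every word more expensive.
STOP_LIST = ["i", "me", "my", "we", "our", "you", "he", "she", "it", "they",
             "the", "a", "an", "and", "or", "but", "if", "of", "to", "in",
             "on", "is", "are", "was"]
STOP_SET = set(STOP_LIST)  # built once, outside any loop

# Made-up tokens standing in for one tokenized sentence.
TOKENS = ["The", "cats", "are", "sitting", "on", "3", "mats", "and", "it", "rains"]


def keep_with_list(tokens):
    # Each `in` test scans the list element by element.
    return [t for t in tokens if t.lower() not in STOP_LIST and t.isalpha()]


def keep_with_set(tokens):
    # Each `in` test is a hash lookup, constant time on average.
    return [t for t in tokens if t.lower() not in STOP_SET and t.isalpha()]


if __name__ == "__main__":
    print(keep_with_set(TOKENS))
    print("list:", timeit.timeit(lambda: keep_with_list(TOKENS), number=20_000))
    print("set: ", timeit.timeit(lambda: keep_with_set(TOKENS), number=20_000))
```

Both functions return the same tokens; only the cost of each lookup differs. In the loop from the question, the same idea means building stops once before the for loop and reusing it for every word.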
