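Problem description
My dataset has 42,000 rows. This is the code I use to clean my text before vectorizing it. The problem is that it has a nested for loop, which I suspect makes it very slow, and I can't use it on more than 1,500 rows. Can anyone suggest a better approach?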
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

corpus = []
for i in range(2):
    # Replace every non-letter character with a space
    rev = re.sub('[^a-zA-Z]', ' ', df['text'][i])
    rev = rev.lower()
    rev = rev.split()
    filtered = []
    for word in rev:
        if word not in stopwords.words("english"):
            word = PorterStemmer().stem(word)
            filtered.append(word)
    filtered = " ".join(filtered)
    corpus.append(filtered)
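Solution
The most time-consuming part of the code is the stopword check.
For every word, stopwords.words("english") calls into the library to fetch the stopword list again.
It is therefore better to load the stopwords once and reuse them on every iteration.
I rewrote the code as follows (the other differences are only for readability):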
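import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

corpus = []
texts = df['text']
# Load the stopwords once; a set makes each membership test O(1) on average
stopwords_set = set(stopwords.words("english"))
# Create the stemmer once instead of once per word
stemmer = PorterStemmer()
for text in texts:
    rev = re.sub('[^a-zA-Z]', ' ', text)
    rev = rev.lower()
    rev = rev.split()
    filtered = [stemmer.stem(word) for word in rev if word not in stopwords_set]
    filtered = " ".join(filtered)
    corpus.append(filtered)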
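To illustrate why the container used for the stopwords matters, here is a minimal timing sketch comparing a list with a set for the membership test. The stopword list and the tokens are made-up stand-ins (assumptions for illustration), not NLTK's actual data, and the absolute timings will differ from machine to machine.

```python
import timeit

# Hypothetical stand-in for stopwords.words("english"); the real NLTK list
# is longer, which makes scanning a list even more expensive.
STOPWORD_LIST = ["the", "a", "an", "and", "or", "but", "if", "of", "to", "in",
                 "on", "for", "with", "is", "are", "was", "were", "be", "this", "that"]
STOPWORD_SET = set(STOPWORD_LIST)  # built once, outside any loop

# Hypothetical tokens, shaped like the output of rev.split() in the code above.
tokens = "this movie was a wonderful surprise and the acting is superb".split() * 1000


def filter_with(container):
    """Keep the tokens that are not in the given stopword container."""
    return [word for word in tokens if word not in container]


# Both containers give the same result; only the lookup cost differs.
assert filter_with(STOPWORD_LIST) == filter_with(STOPWORD_SET)

list_time = timeit.timeit(lambda: filter_with(STOPWORD_LIST), number=20)
set_time = timeit.timeit(lambda: filter_with(STOPWORD_SET), number=20)
print(f"list lookup: {list_time:.3f}s, set lookup: {set_time:.3f}s")
```

A list is scanned element by element on every `in` test, while a set uses hashing, so the gap usually widens as the stopword collection grows.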
I used line_profiler to measure the speed of the code you posted.
The measurements are as follows.
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def profile_nltk():
    10         1     435819.0 435819.0      0.3      df = pd.read_csv('IMDB_Dataset.csv')  # (50000,2)
    11         1          1.0      1.0      0.0      filtered = []
    12         1        247.0    247.0      0.0      reviews = df['review'][:4000]
    13         1          0.0      0.0      0.0      corpus = []
    14      4001     216341.0     54.1      0.1      for i in range(len(reviews)):
    15      4000     221885.0     55.5      0.2          rev = re.sub('[^a-zA-Z]',' ',df['review'][i])
    16      4000       3878.0      1.0      0.0          rev = rev.lower()
    17      4000      30209.0      7.6      0.0          rev = rev.split()
    18      4000       1097.0      0.3      0.0          filtered = []
    19    950808     235589.0      0.2      0.2          for word in rev:
    20    946808  115658060.0    122.2     78.2              if word not in stopwords.words("english"):
    21    486614   30898223.0     63.5     20.9                  word = PorterStemmer().stem(word)
    22    486614     149604.0      0.3      0.1                  filtered.append(word)
    23      4000      11290.0      2.8      0.0          filtered = " ".join(filtered)
    24      4000       1429.0      0.4      0.0          corpus.append(filtered)
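As @parsa-abbasi pointed out, the stopword check accounts for about 80% of the total time (78.2% in the profile above).
The measurements for the modified script are below. The same check now takes roughly 1/100 of its original processing time.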
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def profile_nltk():
    10         1     441467.0 441467.0      1.4      df = pd.read_csv('IMDB_Dataset.csv')  # (50000,2)
    11         1          1.0      1.0      0.0      filtered = []
    12         1        335.0    335.0      0.0      reviews = df['review'][:4000]
    13         1          1.0      1.0      0.0      corpus = []
    14         1       2696.0   2696.0      0.0      stopwords_set = stopwords.words('english')
    15      4001      59013.0     14.7      0.2      for i in range(len(reviews)):
    16      4000     186393.0     46.6      0.6          rev = re.sub('[^a-zA-Z]',' ',df['review'][i])
    17      4000       3657.0      0.9      0.0          rev = rev.lower()
    18      4000      27357.0      6.8      0.1          rev = rev.split()
    19      4000        999.0      0.2      0.0          filtered = []
    20    950808     220673.0      0.2      0.7          for word in rev:
    21                                                       # if word not in stopwords.words("english"):
    22    946808    1201271.0      1.3      3.8              if word not in stopwords_set:
    23    486614   29479712.0     60.6     92.8                  word = PorterStemmer().stem(word)
    24    486614     141242.0      0.3      0.4                  filtered.append(word)
    25      4000      10412.0      2.6      0.0          filtered = " ".join(filtered)
    26      4000       1329.0      0.3      0.0          corpus.append(filtered)
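In this second profile the stemming line dominates (92.8% of the time), and the profiled code still constructs a new PorterStemmer for every word. One possible next step, sketched below, is to create the stemmer once and cache the stems, since reviews repeat the same words many times. The names `make_cached_stemmer` and `toy_stem` are hypothetical and exist only for this illustration; `toy_stem` is a crude stand-in so the sketch runs without NLTK. The benefit of this change was not measured here.

```python
from functools import lru_cache


def make_cached_stemmer(stem_func, maxsize=None):
    """Wrap a word -> stem function so each distinct word is stemmed only once."""
    return lru_cache(maxsize=maxsize)(stem_func)


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# With NLTK installed you would instead write (assumption, not run here):
#   from nltk.stem import PorterStemmer
#   stem = make_cached_stemmer(PorterStemmer().stem)
stem = make_cached_stemmer(toy_stem)

words = "acting acted actors acting acted acting".split()
stems = [stem(word) for word in words]
print(stems)
print(stem.cache_info())  # hits count the repeated words served from the cache
```

An unbounded cache (`maxsize=None`) keeps one entry per distinct word, which is usually small compared with the corpus itself; pass a finite `maxsize` if memory is a concern.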
I hope this helps.