Problem description
`User` `Text`
49 there is a cat under the table
21 the sun is hot
431 Could you please close the window?
65 there is a cat under the table
21 the sun is hot
53 there is a cat under the table
My expected output is:
Text Freq
there is a cat under the table 3
the sun is hot 2
Could you please close the window? 1
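My approach is to use fuzz.partial_ratio to score the match (similarity) between all sentences, and then use groupby to count the frequencies.
fuzz.partial_ratio returns a score from 0 to 100, so it returns 100 for an exact match: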
check_match =df.apply(lambda row: ((fuzz.partial_ratio(row['Text'],row['Text'])) >= value),axis=1)
where value is the threshold used to decide whether two sentences count as a match.
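Note that the apply call above compares each row's Text with itself, so it always scores 100. A minimal sketch of comparing sentences against each other instead is shown here. It swaps in the standard library's difflib.SequenceMatcher for fuzz.partial_ratio so that it runs without extra packages; the helper names similarity and fuzzy_counts, and the threshold of 90, are assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (a stand-in for a fuzz ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_counts(texts, threshold=90):
    """Greedily assign each text to the first group representative it matches."""
    representatives = []  # first-seen sentence of each group
    counts = Counter()
    for text in texts:
        for rep in representatives:
            if similarity(text, rep) >= threshold:
                counts[rep] += 1
                break
        else:
            # No existing group is similar enough: start a new one.
            representatives.append(text)
            counts[text] += 1
    return counts.most_common()  # (sentence, frequency), most frequent first


texts = [
    "there is a cat under the table",
    "the sun is hot",
    "Could you please close the window?",
    "there is a cat under the table",
    "the sun is hot",
    "there is a cat under the table",
]
result = fuzzy_counts(texts)
for sentence, freq in result:
    print(sentence, freq)
```

The greedy grouping depends on the order of the rows and on the threshold, so near-duplicates that sit close to the cutoff may land in different groups.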
Solution
You can use value_counts()
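A minimal sketch of the value_counts() approach, assuming the question's data sits in a DataFrame named df with User and Text columns. The output column names Text and Freq are taken from the expected output; note that this counts exact matches only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "User": [49, 21, 431, 65, 21, 53],
    "Text": [
        "there is a cat under the table",
        "the sun is hot",
        "Could you please close the window?",
        "there is a cat under the table",
        "the sun is hot",
        "there is a cat under the table",
    ],
})

# value_counts() counts exact duplicates and sorts by frequency, descending.
freq = (
    df["Text"]
    .value_counts()
    .rename_axis("Text")       # name the index so it becomes the Text column
    .reset_index(name="Freq")  # turn the counts into a Freq column
)
print(freq)
```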
,
Try this:
df = df.groupby('Text').count()
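With groupby('Text').count(), the counts end up in a column named User and the rows are ordered by sentence rather than by frequency. A possible variant that produces the Text/Freq layout from the expected output might look like the sketch below; it assumes the same df as in the question, and the use of size() and sort_values() is a choice made here, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "User": [49, 21, 431, 65, 21, 53],
    "Text": [
        "there is a cat under the table",
        "the sun is hot",
        "Could you please close the window?",
        "there is a cat under the table",
        "the sun is hot",
        "there is a cat under the table",
    ],
})

freq = (
    df.groupby("Text")
    .size()                                # rows per distinct sentence
    .reset_index(name="Freq")              # columns: Text, Freq
    .sort_values("Freq", ascending=False)  # most frequent first
    .reset_index(drop=True)
)
print(freq)
```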
,
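The following approach should work:
from collections import Counter
import pandas as pd

# Count exact occurrences of each sentence in the Text column
d = dict(Counter(df.Text))
new_df = pd.DataFrame({'Text': list(d.keys()), 'Freq': list(d.values())})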