Problem description
I need to extract the domains from the email addresses in a dataset and count the top 5 domains.
import re
from collections import Counter
with open("emails")
domain = re.search('@[\w.)]+,email')
print(domain.group())
jbutt@gmail.com http://www.bentonjohnbjr.com
josephine_darakjy@darakjy.org http://www.chanayjeffreyaesq.com
art@venere.org http://www.chemeljameslcpa.com
lpaprocki@hotmail.com http://www.feltzprintingservice.com
donette.foller@cox.net http://www.printingdimensions.com
Solution
This lists the top 5 domains:
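import re
from collections import Counter

resultList = []
with open("emails", "r") as email:
    for line in email:
        # Capture the run of non-whitespace characters after "@" (the domain).
        # \S+ stops at the first space, so no trailing whitespace is captured.
        result = re.search(r'@(\S+)', line)
        if result:  # skip lines that contain no email address
            resultList.append(result.group(1))

# Count how often each domain occurs and print the 5 most common
occurrence_count = Counter(resultList)
print(occurrence_count.most_common(5))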
Output:
[('gmail.com', 1), ('darakjy.org', 1), ('venere.org', 1), ('hotmail.com', 1), ('cox.net', 1)]
The output is the 5 most common domain names, each paired with its count.
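
If you want to reuse this or try it without a file, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: the name top_domains and the pattern DOMAIN_RE are made up for illustration, domains are lowercased so that "Gmail.com" and "gmail.com" count as one, and the third sample line is invented to show a count above 1.

```python
import re
from collections import Counter

# Hypothetical helper names, not part of the original answer.
# Matches letters, digits, underscores, dots and hyphens after "@".
DOMAIN_RE = re.compile(r'@([\w.-]+)')

def top_domains(lines, n=5):
    """Return the n most common email domains found in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = DOMAIN_RE.search(line)
        if match:
            # Assumption: treat domains case-insensitively
            counts[match.group(1).lower()] += 1
    return counts.most_common(n)

sample = [
    "jbutt@gmail.com http://www.bentonjohnbjr.com",
    "art@venere.org http://www.chemeljameslcpa.com",
    "someone@Gmail.com http://example.com",  # invented line for illustration
]
print(top_domains(sample))  # [('gmail.com', 2), ('venere.org', 1)]
```

Because the function accepts any iterable of strings, an open file object can be passed directly, for example top_domains(open("emails")).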