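Extract elements from multiple local .html files with glob and BS4 and write them to CSV (Excel)

Problem description

I am trying to extract the words between tags from multiple locally downloaded .html files and write them to a CSV file. With print(title) I get the list of titles, but as soon as I export to CSV only a single entry shows up.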

import glob
import os
import lxml
import csv
from bs4 import BeautifulSoup
    
path = "C:\\Users\\user1\\Downloads\\lksd\\"
for infile in glob.glob(os.path.join(path,"*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup,"r").read(),'lxml')
    title = soup.find_all('title')
    title.append(title)
    print ([title])

with open('output2.csv','w') as myfile:
   writer = csv.writer(myfile)
   writer.writerows((title))

Any suggestions?

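Solution

What happens?

Inside the loop you append title to itself. The name title is also reassigned for every file, so once the loop ends it only holds the result for the last file: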

title = soup.find_all('title')
title.append(title)

Try defining an empty list outside the loop and appending each file's title to that list instead.
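The difference between the two patterns can be seen without any HTML at all. The following is a minimal sketch that uses plain lists as stand-ins for what find_all returns; the sample data and the names found_per_file, last_only and all_titles are hypothetical and only serve the illustration.

```python
# Stand-in for the titles found in three files (hypothetical sample data).
found_per_file = [["First page"], ["Second page"], ["Third page"]]

# Pattern from the question: the name is rebound on every pass.
for found in found_per_file:
    title = list(found)
    title.append(title)  # the list now contains a reference to itself

# After the loop, only the last file's result is still reachable.
last_only = title[0]

# Pattern from the fix: one accumulator defined before the loop.
all_titles = []
for found in found_per_file:
    all_titles.append(list(found))
```

Here last_only holds the title of the third file only, while all_titles holds one entry per file.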

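...
titleList = []  # defined once, before the loop, so it accumulates across files

for infile in glob.glob(os.path.join(path, "*.html")):
    # Close each file as soon as it has been parsed.
    with open(infile, "r") as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    # Keep only the text inside each <title> tag; one row per file.
    title = [t.get_text() for t in soup.find_all('title')]
    titleList.append(title)

# newline='' prevents blank rows between entries on Windows.
with open('output2.csv', 'w', newline='') as myfile:
    writer = csv.writer(myfile)
    writer.writerows(titleList)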
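For reference, here is one way the whole script could look as a self-contained sketch. Several details are assumptions and not part of the original answer: the helper names collect_titles and write_rows are made up for illustration, the files are assumed to be UTF-8, the file name is added as a first column, and the parser is "html.parser" (bundled with Python) rather than 'lxml', which should work the same way if it is installed.

```python
import csv
import glob
import os

from bs4 import BeautifulSoup


def collect_titles(folder):
    """Return one row per .html file: [file name, title text, ...]."""
    rows = []
    # sorted() keeps the row order stable across runs.
    for infile in sorted(glob.glob(os.path.join(folder, "*.html"))):
        # The encoding is an assumption; adjust it to match the files.
        with open(infile, "r", encoding="utf-8") as handle:
            # "html.parser" ships with Python; 'lxml' can be used instead.
            soup = BeautifulSoup(handle.read(), "html.parser")
        # Keep the text between the tags, not the tag objects themselves.
        texts = [tag.get_text(strip=True) for tag in soup.find_all("title")]
        rows.append([os.path.basename(infile)] + texts)
    return rows


def write_rows(rows, out_path):
    """Write the collected rows to a CSV file."""
    # newline="" stops the csv module from emitting blank lines on Windows.
    with open(out_path, "w", newline="", encoding="utf-8") as myfile:
        csv.writer(myfile).writerows(rows)
```

With the folder from the question, a call might look like write_rows(collect_titles("C:\\Users\\user1\\Downloads\\lksd\\"), "output2.csv").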
