Problem description
I am trying to scrape data contained in a textarea, and I have already extracted its text. I tried converting it to a CSV file, but when I open the resulting file the data is jumbled and piled up in row 1.
Here is how I am using BeautifulSoup:
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.ndbc.noaa.gov/station_page.PHP?station=46410'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63'
}
datas = []
req = requests.get(url,headers=headers)
soup = BeautifulSoup(req.text,'html.parser')
dt = soup.find_all('pre')
for text in dt:
    contents = text.find(id="data").text
    datas.append([contents])
kepala = ['Year','Month','Day','Hour','Minute','Second','T','Height']
writer = csv.writer(open('result/station-56003','w',newline=''))
writer.writerow(kepala)
for d in datas: writer.writerow(d)
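The symptom can be reproduced in isolation: csv.writer treats each element of the list as one cell, so appending the whole textarea text as a single-element list puts every line into a single quoted field. A minimal sketch (the sample string below is made up):

```python
import csv
import io

# A multi-line string appended as a one-element list becomes ONE CSV cell:
# csv.writer quotes the embedded newlines instead of creating new rows.
sample = "2021 02 24 12 00 00 1 4067.123\n2021 02 24 12 15 00 1 4067.125"
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([sample])

# Reading it back shows a single row with a single field.
rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(len(rows), len(rows[0]))  # prints: 1 1
```

This is why everything appears "stacked in row 1": the entire dataset is one cell.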
Solution
It looks like the input data is itself CSV-like. You need to strip the header lines and split each line on whitespace.
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.ndbc.noaa.gov/station_page.php?station=46410'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63'
}
datas = []
req = requests.get(url,headers=headers)
soup = BeautifulSoup(req.text,'html.parser')
dt = soup.find_all('pre')
for text in dt:
    contents = text.find(id="data").text
    datas.append(contents)
kepala = ['Year','Month','Day','Hour','Minute','Second','T','Height']
writer = csv.writer(open('file.csv','w'))
writer.writerow(kepala)
for d in datas:
    writer.writerows([i.split() for i in d.split('\n')[2:]])
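One detail worth noting: str.split(' ') keeps empty strings wherever a line is padded with extra spaces, while str.split() with no argument collapses runs of whitespace. If the station file pads its columns, the no-argument form is safer (the sample line below is made up):

```python
# A made-up, space-padded line imitating the station data layout.
line = "2021  2 24 12  0  0 1 4067.12"

# Splitting on a single space keeps empty strings for each extra space.
print(line.split(" "))
# Splitting on no argument treats any run of whitespace as one separator.
print(line.split())
```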
Another solution
The data is plain text, not a table or CSV, so you need to parse it manually. The format appears to be fixed-width fields separated by spaces. The code below yields 429 rows, and assumes each line is 30 characters wide and consists only of digits, spaces, and period characters.
Incidentally, writer = csv.writer(open('file.csv','w')) leaks the file handle, because the file is never closed. Use with open(...) to create a context manager that automatically closes the resource at the end of the block.
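The with statement point can be shown in a few lines; the file path here is just illustrative:

```python
import csv
import os
import tempfile

# Illustrative path for the demo; any writable location works.
path = os.path.join(tempfile.gettempdir(), "demo.csv")

# The with-block closes the handle automatically when the block exits,
# even if an exception is raised inside it.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Year", "Height"])

print(f.closed)  # prints: True
```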
import csv
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.ndbc.noaa.gov/station_page.php?station=46410"
request_headers = {
    "user-agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML,like Gecko) "
                   "Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63")
}
response = requests.get(url,headers=request_headers)
response.raise_for_status()
soup = BeautifulSoup(response.text,"html.parser")
headers = ["Year","Month","Day","Hour","Minute","Second","T","Height"]
with open("station-56003", "w") as f:
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow(headers)
    for line in soup.select_one("#data").text.split("\n"):
        if re.fullmatch(r"[\d. ]{30}", line) and len(line.split()) == len(headers):
            writer.writerow(line.split())
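The filtering condition can be checked against a few sample lines (the strings below are made up to imitate the station format; only the middle one is 30 characters of digits, dots, and spaces with 8 fields):

```python
import re

headers = ["Year", "Month", "Day", "Hour", "Minute", "Second", "T", "Height"]
lines = [
    "#yy  mm dd hh mm ss t height",    # header line: contains letters, rejected
    "2021 02 24 12 00 00 1 4067.123",  # 30 chars of digits/dots/spaces, 8 fields
    "",                                # blank trailing line, rejected
]

# Keep only lines that match the fixed-width pattern AND split into
# exactly as many fields as there are column headers.
kept = [l.split() for l in lines
        if re.fullmatch(r"[\d. ]{30}", l) and len(l.split()) == len(headers)]
print(kept)
```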
Another solution
Try splitting the data on '\n', then splitting each line again on whitespace.
Note that in the code below I have changed it to find the 'textarea' tag instead of the 'pre' tag.
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.ndbc.noaa.gov/station_page.php?station=46410'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63'
}
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.text, 'html.parser')
dt = soup.find_all('textarea')[0].text
datas = dt.split('\n')[2:]
kepala = ['Year', 'Month', 'Day', 'Hour', 'Minute', 'Second', 'T', 'Height']
with open('station-56003', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(kepala)
    for d in datas:
        writer.writerow(d.split())