Python: scraping text data from a page into a CSV file

Problem description

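I am trying to scrape data that sits inside a textarea, and I have already retrieved the text content.

I tried to convert it to a CSV file, but when I open the converted file the data is jumbled and piled up in row 1.

Here is how I use BeautifulSoup: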

import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.ndbc.noaa.gov/station_page.php?station=46410'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63'
}

datas = []
req = requests.get(url,headers=headers)
soup = BeautifulSoup(req.text,'html.parser')
dt = soup.find_all('pre')
for text in dt :
    contents = text.find(id = "data").text
    datas.append([contents])

kepala = ['Year','Month','Day','Hour','Minute','Second','T','Height']
writer = csv.writer(open('result/station-56003','w',newline=''))
writer.writerow(kepala)
for d in datas: writer.writerow(d)

Solution

The input data already looks CSV-like. You need to drop the header lines and split each remaining line on ' '.

import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.ndbc.noaa.gov/station_page.php?station=46410'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63'
}

datas = []
req = requests.get(url,headers=headers)
soup = BeautifulSoup(req.text,'html.parser')
dt = soup.find_all('pre')
for text in dt :
    contents = text.find(id = "data").text
    datas.append(contents)

kepala = ['Year','Month','Day','Hour','Minute','Second','T','Height']
writer = csv.writer(open('file.csv','w'))
writer.writerow(kepala)
for d in datas: 
    writer.writerows([i.split(' ') for i in d.split('\n')[2:]])
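
To see why the original output piled up in one row, here is a minimal offline sketch. The `sample` string is invented data that only imitates the assumed layout of the textarea (two header lines, then space-separated values); it is not copied from the station page. Passing the whole text as a one-element list produces a single quoted cell, while splitting on newlines and then on spaces produces one row per line.

```python
import csv
import io

# Hypothetical sample imitating the textarea layout; the values are invented.
sample = (
    "#YY  MM DD hh mm ss T   HEIGHT\n"
    "#yr  mo dy hr mn  s -        m\n"
    "2021 02 14 05 00 00 1 3381.392\n"
    "2021 02 14 04 45 00 1 3381.401"
)

# What the question's code does: the whole text becomes one cell of one row.
single_cell = io.StringIO()
csv.writer(single_cell, lineterminator="\n").writerow([sample])

# Splitting first: skip the two header lines, one list of fields per data line.
rows = [line.split(" ") for line in sample.split("\n")[2:]]
split_out = io.StringIO()
csv.writer(split_out, lineterminator="\n").writerows(rows)

print(len(rows), len(rows[0]))
print(split_out.getvalue())
```

In the first case the csv module has to quote the cell because it contains line breaks, which is why a spreadsheet shows everything stacked in a single field.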

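The data is plain text, not a table or CSV, so you need to parse it manually. The format appears to be fixed-width fields separated by spaces. The code below yields 429 rows and assumes that each data line consists only of digits, spaces, and periods and is exactly 30 characters wide.

As an aside, writer = csv.writer(open('file.csv','w')) never explicitly closes the file, so the file handle is leaked. Use with open(...) to create a context manager that closes the resource automatically when the block ends.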

import csv
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.ndbc.noaa.gov/station_page.php?station=46410"
request_headers = {
    "user-agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML,like Gecko) "
                   "Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63")
}
response = requests.get(url,headers=request_headers)
response.raise_for_status()
soup = BeautifulSoup(response.text,"html.parser")
headers = ["Year","Month","Day","Hour","Minute","Second","T","Height"]

with open("station-56003","w") as f:
    writer = csv.writer(f,lineterminator="\n")
    writer.writerow(headers)

    for line in soup.select_one("#data").text.split("\n"):
        if re.fullmatch(r"[\d. ]{30}",line) and len(line.split()) == len(headers):
            writer.writerow(line.split())
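
The line filter can be tried without a network request. The sketch below applies the same `re.fullmatch` test to a few made-up lines (the `sample_lines` values, the `is_data_line` helper, and the temporary file path are assumptions introduced for illustration) and writes the surviving rows inside a `with` block. Checking `f.closed` afterwards shows that the context manager has closed the file.

```python
import csv
import os
import re
import tempfile

headers = ["Year", "Month", "Day", "Hour", "Minute", "Second", "T", "Height"]

# Invented lines: two header-like lines, one valid row, one blank, one too short.
sample_lines = [
    "#YY  MM DD hh mm ss T   HEIGHT",
    "#yr  mo dy hr mn  s -        m",
    "2021 02 14 05 00 00 1 3381.392",
    "",
    "2021 02 14 05 00",
]

def is_data_line(line):
    # Exactly 30 characters drawn from digits, periods, and spaces,
    # and as many whitespace-separated fields as there are column headers.
    return bool(re.fullmatch(r"[\d. ]{30}", line)) and len(line.split()) == len(headers)

kept = [line.split() for line in sample_lines if is_data_line(line)]

# Write to a throwaway file; the with block closes it on exit.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(kept)

print(len(kept), f.closed)
```

A strict filter like this silently drops any line that deviates from the assumed width, so it is worth comparing the number of rows kept against the number of lines in the source.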

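Try splitting the data on '\n', then splitting each line again on spaces.

Note that in the code below I changed it to look for the 'textarea' tag instead of the 'pre' tag.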

import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.ndbc.noaa.gov/station_page.php?station=46410'
headers = {
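    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63'
}

req = requests.get(url,headers=headers)
soup = BeautifulSoup(req.text,'html.parser')
# take the text of the first <textarea> on the page
dt = soup.find_all('textarea')[0].text

# skip the first two lines (the headers)
datas = dt.split('\n')[2:]

kepala = ['Year','Month','Day','Hour','Minute','Second','T','Height']
with open('station-56003','w',newline='') as file:
  writer = csv.writer(file)
  writer.writerow(kepala)
  for d in datas:
    writer.writerow(d.split(' '))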
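
One detail worth keeping in mind when splitting this way: `str.split(' ')` and `str.split()` behave differently once a line contains more than one consecutive space, and splitting text on '\n' leaves an empty last element when the text ends with a newline. The short sketch below illustrates both cases with invented strings; whether the real page data contains such cases is an assumption that should be checked against the actual text.

```python
# Invented line with a double space between two fields.
line = "2021 02 14  05 00 00 1 3381.392"

by_single_space = line.split(" ")   # the double space yields an empty string field
by_whitespace = line.split()        # runs of whitespace count as one separator

print(by_single_space)
print(by_whitespace)

# Invented text that ends with a newline.
text = "a b\nc d\n"

by_newline = text.split("\n")       # keeps an empty string after the final newline
by_splitlines = text.splitlines()   # no trailing empty element

print(by_newline)
print(by_splitlines)
```

If empty fields or a blank final row show up in the CSV, switching to `split()` and `splitlines()` is one way to avoid them.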
