Scraping data from URLs with BeautifulSoup and saving it to a CSV file

Problem description

Okay, I'm new to Python and BeautifulSoup. I wrote some code that scrapes the HTML and saves all the data I need to a CSV file. Values from the ALL_NUMBERS file are substituted into the URL template, which produces a large number of URLs.

Here is the code:

import requests
from bs4 import BeautifulSoup

#--READ NAMES--
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7'
}
all_names = [] # TO KEEP ALL NAMES IN MEMORY

with open('ALL_NUMBERS.txt','r') as text_file:
    for line in text_file:
        line = line.strip()
        all_names.append(line)

url_template = 'https://www.investing.com/news/stock-market-news/learjet-the-private-plane-synonymous-with-the-jetset-nears-end-of-runway-{}'

all_urls = [] # TO KEEP ALL URLs IN MEMORY

with open("url_requests.txt","w") as text_file:
    for name in all_names:
        url = url_template.format(name)
        print('url:',url)
        all_urls.append(url)
        text_file.write(url + "\n")

# --- read data ---

for name,url in zip(all_names,all_urls):
    # print('name:',name)
    # print('url:',url)
    r1 = requests.get(url,headers = headers)

page = r1.content
soup = BeautifulSoup(page,'html5lib')
results = soup.find('div',class_= 'WYSIWYG articlePage')
para = results.findAll("p")
results_2 = soup.find('div',class_= 'contentSectionDetails')
para_2 = results_2.findAll("span")
#for n in results_2:
    #print n.find('p').text

#cont = soup.select_one("div.contentSectionDetails")
#ram = cont.select_one("span")
#[x.extract() for x in ram.select_one('span')]


with open('stock_market_news_' + name + '.csv','w') as text_file:
    text_file.write(str(para))
    text_file.write(str(para_2))

It works fine, but only with one URL. I want to save para and para_2 from every URL in a single CSV file, i.e. save both values for each URL on one row:

Text                  Time
para from URL(1)      para_2 from URL(1)
para from URL(2)      para_2 from URL(2)
...                   ...

Unfortunately, I don't know how best to handle that many URLs in my case.

Solution

You can store all the values in a list and then save the results to your file:

import csv

# ...

# --- read data ---

params = []
for name,url in zip(all_names,all_urls):
    r1 = requests.get(url,headers = headers)
    page = r1.content
    soup = BeautifulSoup(page,'html5lib')
    results = soup.find('div',class_= 'WYSIWYG articlePage')
    para = '\n'.join([r.text for r in results.findAll("p")])
    results_2 = soup.find('div',class_= 'contentSectionDetails')
    para_2 = results_2.findAll("span")[0].text
    params.append([str(para),str(para_2)])

with open('stock_market_news_' + name + '.csv','w',newline='') as text_file:
    # note: `name` here is the last value left over from the loop above
    wr = csv.writer(text_file,quoting=csv.QUOTE_ALL)
    wr.writerow(['Text','Time'])
    wr.writerows(params)
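
If the list of URLs grows large, an alternative is to open the CSV once and write each row as soon as it is scraped, skipping pages where the expected divs are missing. This is only a rough sketch, not tested against the live site; it reuses headers and all_urls from the code above, and the output filename stock_market_news_all.csv is just a placeholder:

import csv

import requests
from bs4 import BeautifulSoup

# headers and all_urls are assumed to be defined as in the question
with open('stock_market_news_all.csv', 'w', newline='', encoding='utf-8') as f:
    wr = csv.writer(f, quoting=csv.QUOTE_ALL)
    wr.writerow(['Text', 'Time'])  # header row

    for url in all_urls:
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.content, 'html5lib')

        article = soup.find('div', class_='WYSIWYG articlePage')
        details = soup.find('div', class_='contentSectionDetails')
        if article is None or details is None:
            print('skipping', url, '- expected content not found')
            continue

        text = '\n'.join(p.text for p in article.findAll('p'))
        time_ = details.find('span').text
        wr.writerow([text, time_])  # one row per URL, written immediately

Writing inside the loop means an interrupted run still leaves a usable partial file, and QUOTE_ALL keeps article text containing commas, semicolons or newlines inside a single CSV field.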

