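BeautifulSoup scraping: title text of expandable rows is missing

Problem description

I am trying to use BeautifulSoup to extract data from the Yahoo! Finance website and store everything in a list. In the list, the titles of the expandable rows (Total Revenue, Operating Expenses) are missing, although the numbers are still there. Is there a way to include the titles in the output?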

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur

url = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'

# Download the page and parse it with the lxml parser
read_data = ur.urlopen(url).read()
soup = BeautifulSoup(read_data, 'lxml')

ls = []  # Collect the text of every <div>
for l in soup.find_all('div'):
    ls.append(l.string)  # .string is None when a div has more than one child

# Drop None and empty entries
new_ls = list(filter(None, ls))

Current output:

 'Expand All','ttm','9/30/2019','9/30/2018','9/30/2017','9/30/2016','273,857,000','260,174,'265,595,'229,234,'215,639,

Expected output:

 'Expand All','Total Revenue',

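Update: if I extract from the 'span' elements instead, the values that are 0 are missing from the output, which causes another problem later when building the DataFrame.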

for l in soup.select(r'div.D\(tbr\)'):  # raw string so the CSS escapes reach the selector unchanged
    for n in l.select('span'):
        print(n.text)

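Solution

I know this is a bit off-topic, but it looks like you just want the Yahoo Finance data, right? If so, there is already a Python package for that, and using it would probably be easier than scraping the site.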

https://pypi.org/project/yahoo-finance/

You can import Share:

from yahoo_finance import Share

You can also get a large amount of data by using the following:

apple = Share('AAPL')

The following will give you all of the data, and you can then filter out whatever you don't need:

for row in soup.select('div[data-test="fin-row"]'):     
    for r in row:
        for l in r:
            print(l.text)
    print('-------\n')

Output:

Total Revenue
273,857,000
260,174,000
265,595,000
-
215,639,000
-------

Cost of Revenue
169,277,000
161,782,000
163,756,000
-
131,376,000
-------

Gross Profit

and so on.
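As a possible explanation of why the original loop loses those titles: `Tag.string` returns None whenever a tag has more than one child, which is plausibly the case for an expandable row whose title sits next to an expand button. The sketch below shows this on a small hypothetical fragment; the markup, the `title` and `value` class names, and the empty button are assumptions made for illustration, not Yahoo's actual HTML. It uses `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first row has an expand button beside its title,
# the second row has only a span. This is NOT the real Yahoo Finance HTML.
html = (
    '<div data-test="fin-row">'
    '<div class="title"><button aria-label="Expand"></button><span>Total Revenue</span></div>'
    '<div class="value"><span>273,857,000</span></div>'
    '</div>'
    '<div data-test="fin-row">'
    '<div class="title"><span>Cost of Revenue</span></div>'
    '<div class="value"><span>169,277,000</span></div>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
titles = soup.select("div.title")

# .string is None for the first title because that div has two children.
via_string = [t.string for t in titles]

# Reading the text of the inner span works for both rows.
via_span = [t.find("span").get_text(strip=True) for t in titles]

# stripped_strings walks every descendant string, so nothing is dropped.
rows = [list(row.stripped_strings)
        for row in soup.select('div[data-test="fin-row"]')]

print(via_string)
print(via_span)
print(rows)
```

If the real page behaves the same way, collecting `stripped_strings` per row keeps each title together with its numbers, which is easier to filter than one flat list of every div.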

If you also want to get the headers programmatically, try:

head_ind = [55,58,60,62,64,66]
for i in head_ind:
    heads = f'span[data-reactid="{i}"]:not([class])'
    for head in soup.select(heads):
        print(head.text)

Output:

Breakdown
ttm
9/30/2019
9/30/2018
9/30/2017
9/30/2016
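Since the question mentions building a DataFrame later, here is a minimal sketch of how the scraped headers and rows might be combined and cleaned. The helper name `parse_amount` is hypothetical, the rows are copied by hand from the output shown earlier rather than scraped, and treating "-" as a missing value is an assumption about what the page means by it.

```python
def parse_amount(text):
    """Turn '273,857,000' into 273857000; return None for '-' or an empty cell."""
    text = text.strip()
    if text in ("", "-"):
        return None
    return int(text.replace(",", ""))

# Column headers without the leading 'Breakdown' label.
headers = ["ttm", "9/30/2019", "9/30/2018", "9/30/2017", "9/30/2016"]

# Rows as printed above: the title first, then one cell per column.
raw_rows = [
    ["Total Revenue", "273,857,000", "260,174,000", "265,595,000", "-", "215,639,000"],
    ["Cost of Revenue", "169,277,000", "161,782,000", "163,756,000", "-", "131,376,000"],
]

# Map each title to a {header: value} dictionary.
table = {
    title: dict(zip(headers, (parse_amount(c) for c in cells)))
    for title, *cells in raw_rows
}

print(table["Total Revenue"])
```

A nested dictionary of this shape can be handed to `pandas.DataFrame.from_dict(table, orient="index")` to get one row per line item, with the missing cells becoming NaN.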
