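Problem description
I am trying to scrape all the information related to the balance sheet, income statement, and cash flow for a ticker such as KO, but my code does not detect the collapsed rows. Can anyone help?
Website: https://finance.yahoo.com/quote/KO/financials?p=KO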
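import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_financials(stock='KO'):
    # Wrapped in a function so the early return is valid; the original excerpt came from a class method.
    response = requests.get(f'https://finance.yahoo.com/quote/{stock}/financials?p={stock}')
    soup = BeautifulSoup(response.text, 'html.parser')
    try:
        # Header row: the first span is the label column, the rest are the period columns.
        div = soup.find_all('div', class_="D(tbr) C($primaryColor)")[0].find_all('span')
    except IndexError:
        return stock + ' is not available on Yahoo Finance'
    columns = [span.text for span in div[1:]]
    div = soup.find_all('div', class_='D(tbrg)')[0]
    values = dict()
    for element in div:
        try:
            title = element.find_all('div', class_="D(ib) Va(m) Ell Mt(-3px) W(215px)--mv2 W(200px) undefined")[0].get('title')
            values[title] = [i.text for i in element.find_all('div', {'data-test': 'fin-col'})]
        except (AttributeError, IndexError):
            # Skip children that are not data rows.
            pass
    return pd.DataFrame(values)

print(scrape_financials())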
This code returns 32 rows instead of the 45+ rows that appear once the collapsed rows are expanded.
Solution
Welcome. It looks like the collapsed rows are not part of the static page. You may need a web driver such as selenium to expand all the rows.
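A minimal sketch of the selenium route, in case it helps. It assumes selenium 4 with a local Chrome install, and it assumes the page exposes the toggle as a button wrapping a span that reads "Expand All"; that locator is a guess and may need adjusting. The helper names financials_url and fetch_expanded_html are made up for this illustration.

```python
def financials_url(ticker):
    """Build the financials page URL in the same form the question uses."""
    return f"https://finance.yahoo.com/quote/{ticker}/financials?p={ticker}"


def fetch_expanded_html(ticker, timeout=10):
    """Load the page in a real browser, click "Expand All", return the HTML."""
    # Imported lazily so the URL helper above works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes a local Chrome and a matching driver
    try:
        driver.get(financials_url(ticker))
        # ASSUMPTION: the toggle is a button wrapping a span that reads "Expand All".
        locator = (By.XPATH, '//button[.//span[text()="Expand All"]]')
        button = WebDriverWait(driver, timeout).until(
            EC.element_to_be_clickable(locator)
        )
        button.click()
        # The rendered DOM, including rows added after the click.
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML could then be passed to BeautifulSoup(html, 'html.parser') in place of response.text, leaving the rest of the parsing code as it is.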
I did a quick search for all the breakdown titles:
import pprint
pprint.pprint(sorted([i.text for i in soup.find_all('span',attrs={'class': 'Va(m)'})]))
Here is what I see:
['Basic Average Shares','Basic EPS','Cost of Revenue','Diluted Average Shares','Diluted EPS','Diluted NI Available to Com Stockholders','EBIT','EBITDA','Gross Profit','Interest Expense','Interest Income','Net Income Common Stockholders','Net Income from Continuing & Discontinued Operation','Net Income from Continuing Operation Net Minority Interest','Net Interest Income','Net Non Operating Interest Income Expense','Normalized EBITDA','Normalized Income','Operating Expense','Operating Income','Other Income Expense','Pretax Income','Reconciled Cost of Revenue','Reconciled Depreciation','Tax Effect of Unusual Items','Tax Provision','Tax Rate for Calcs','Total Expenses','Total Operating Income as Reported','Total Revenue','Total Unusual Items','Total Unusual Items Excluding Goodwill']
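Since "Operating Revenue" is not in the list, I think it is generated dynamically. The page uses a template to describe which breakdown titles belong under each collapsed title. In the static HTML, the template is named FinancialTemplateStore.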
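As a rough sketch of how the template could be read without a browser: the code below assumes the page embeds its state as a JSON object assigned to root.App.main, with stores nested under context.dispatcher.stores, and that each template node carries a key and an optional children list. Those names are assumptions about the page layout, not something confirmed here, and the sample HTML is synthetic.

```python
import json


def extract_store(html, store_name, marker="root.App.main = "):
    """Return one store from the JSON blob embedded in the page, or None."""
    start = html.find(marker)
    if start == -1:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it.
    data, _ = json.JSONDecoder().raw_decode(html, start + len(marker))
    # ASSUMPTION: stores live under context.dispatcher.stores.
    stores = data.get("context", {}).get("dispatcher", {}).get("stores", {})
    return stores.get(store_name)


def walk_template(nodes, depth=0):
    """Yield (depth, key) for each node and, recursively, its children."""
    for node in nodes:
        yield depth, node.get("key")
        yield from walk_template(node.get("children", []), depth + 1)


# Tiny synthetic page, only to show the shape this sketch expects.
sample_html = (
    '<script>root.App.main = {"context": {"dispatcher": {"stores": '
    '{"FinancialTemplateStore": {"template": [{"key": "TotalRevenue", '
    '"children": [{"key": "OperatingRevenue"}]}, {"key": "GrossProfit"}]}}}}};'
    "</script>"
)
store = extract_store(sample_html, "FinancialTemplateStore")
rows = list(walk_template(store["template"]))
print(rows)
```

On a real page you would pass response.text instead of sample_html. Note that the template only describes the row hierarchy; the values for the collapsed rows still have to come from wherever the page loads them.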
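In addition, you may want to look at the following script to see how the template is used to build the collapsed rows. It may help you find where the collapsed data comes from: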
https://s.yimg.com/uc/finance/dd-site/js/Quote.financials.7de24e8fee85155f7e70.modern.js