BeautifulSoup link attributes in Python

Question

I am exploring BeautifulSoup by working through Ryan Mitchell's "Web Scraping with Python".

The book has some sample code that shows how to scrape article links from Wikipedia. For brevity, I have left out the imports. The code is:

html = urlopen("http://en.wikipedia.org")
bsObj = BeautifulSoup(html)

for link in bsObj.findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

I am confused about why the code needs to include the if statement:

if 'href' in link.attrs:

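Doesn't findAll return only the anchor tags whose href matches the given pattern? If so, isn't it safe to assume that every "link" has "href" as an attribute? Thanks in advance!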

Answer

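Your reasoning is sound. Still, you can test it: add an else branch that prints link.attrs and see whether any link comes back without an href attribute. That should never happen, but you never know. Good luck.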
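The experiment suggested above could look roughly like the sketch below. It is only an illustration under a few assumptions: SAMPLE_HTML is made-up markup standing in for the Wikipedia page (so no network access is needed), split_links is a hypothetical helper name, the parser is explicitly set to "html.parser", and find_all is used as the current spelling of findAll.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Wikipedia page.
SAMPLE_HTML = """
<a href="/wiki/Python_(programming_language)" title="Python">Python</a>
<a href="/wiki/File:Logo.png">Logo</a>
<a href="https://example.org/">External</a>
<a name="top">Anchor without href</a>
"""

# Same pattern as in the question: starts with /wiki/ and contains no colon.
WIKI_LINK = re.compile("^(/wiki/)((?!:).)*$")


def split_links(html):
    """Return (hrefs, attrs_of_tags_without_href) for the filtered anchors."""
    soup = BeautifulSoup(html, "html.parser")
    with_href, without_href = [], []
    for link in soup.find_all("a", href=WIKI_LINK):
        if "href" in link.attrs:
            with_href.append(link.attrs["href"])
        else:
            # Would only run if the filter let through a tag lacking href.
            without_href.append(link.attrs)
    return with_href, without_href


with_href, without_href = split_links(SAMPLE_HTML)
print(with_href)
print(without_href)
```

On this sample the else branch never runs: the anchor without an href is dropped by the href filter before the loop body sees it.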

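The reason is that this line

for link in bsObj.findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):

returns whole tags, so link.attrs holds not only 'href' but every other attribute of the tag as well.

The line if 'href' in link.attrs: therefore confirms that 'href' is among them before link.attrs['href'] reads only that one value. Because findAll has already filtered on href, the check is a defensive guard rather than a necessity.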
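To see what link.attrs actually contains, here is a small sketch. The one-tag document and its attribute values are invented for illustration. It suggests that attrs behaves like a dictionary of all attributes on the tag, and that indexing with 'href' is what picks out the single value.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, used only for illustration.
html = '<a href="/wiki/Kevin_Bacon" title="Kevin Bacon" class="mw-redirect">Kevin Bacon</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# attrs maps every attribute name on the tag to its value.
attribute_names = sorted(link.attrs)

# Indexing selects just the href value.
href_value = link.attrs["href"]

# Tag.get returns None for a missing attribute instead of raising KeyError.
missing_id = link.get("id")

print(attribute_names)
print(href_value)
print(missing_id)
```

If a missing attribute is a real possibility, link.get('href') is a compact alternative to the membership test followed by indexing.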
