Unable to find a paragraph text element inside a collapsible window with Selenium in Python

Problem description

I am trying to get the paragraph text of a collapsible element on a web page with Selenium in Python. So far, the collapsible window opens in Selenium via .click(), but afterwards Selenium cannot find the desired paragraph with the class "object-viewer__ocr-articletext".

Selenium does not seem to be able to focus on the collapsible window that contains the newly visible elements (such as the desired paragraph).

Link to the page: https://www.delpher.nl/nl/kranten/view?query=kernenergie&facets%5Bpapertitle%5D%5B%5D=Algemeen+Dagblad&facets%5Bpapertitle%5D%5B%5D=De+Volkskrant&facets%5Bpapertitle%5D%5B%5D=De+Telegraaf&facets%5Bpapertitle%5D%5B%5D=Trouw&page=1&sortfield=date&cql%5B%5D=%28date+_gte_+%2201-01-1970%22%29&cql%5B%5D=%28date+_lte_+%2201-01-2018%22%29&coll=ddd&redirect=true&identifier=ABCDDD:010818460:mpeg21:a0207&resultsidentifier=ABCDDD:010818460:mpeg21:a0207&rowid=1

Full code

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import numpy as np
import re
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options


chrome_options = Options()  
chrome_options.add_argument("--no-proxy-server")
chrome_options.add_argument("--proxy-server='direct://'");
chrome_options.add_argument("--proxy-bypass-list=*");

driver = webdriver.Chrome(options=chrome_options) 
driver.set_window_size(1400,1080)

# Grab the root <html> element of the start page (not used later in the script)
html = driver.find_element_by_tag_name('html')

all_details = []
for c in range(1,2):
    try:
        driver.get("https://www.delpher.nl/nl/kranten/results?query=kernenergie&facets%5Bpapertitle%5D%5B%5D=Algemeen+Dagblad&facets%5Bpapertitle%5D%5B%5D=De+Volkskrant&facets%5Bpapertitle%5D%5B%5D=De+Telegraaf&facets%5Bpapertitle%5D%5B%5D=Trouw&page={}&sortfield=date&cql%5B%5D=(date+_gte_+%2201-01-1970%22)&cql%5B%5D=(date+_lte_+%2201-01-2018%22)&coll=ddd".format(c))
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
        incategory = driver.find_elements_by_class_name("search-result")
        print(driver.current_url)
        
        links = [ i.find_element_by_class_name("search-result__link").get_attribute("href") for i in incategory]
            
        # Loop through each link to access the page of each article
        for link in links:
            # open the article page
            driver.get(link)
                      
            # newspaper 
            newspaper = driver.find_element_by_xpath("//*[@id='content']/div[2]/div/div[2]/header/h1/span[2]")
            
            # date of the article
            date = driver.find_element_by_xpath("//*[@id='content']/div[2]/div/div[2]/header/div/ul/li[1]")
            
            #click button and find title
            div_element = WebDriverWait(driver, 60).until(expected_conditions.presence_of_element_located((By.XPATH, '//*[@id="object"]/div/div/div')))
            hover = ActionChains(driver).move_to_element(div_element)
            hover.perform()
            div_element.click()
            
            button = WebDriverWait(driver, 10).until(expected_conditions.presence_of_element_located((By.XPATH, '//*[@id="object-viewer__ocr-button"]')))
            hover = ActionChains(driver).move_to_element(button)
            hover.perform()
            
            button.click()
            
                         
            element = driver.find_element_by_css_selector(".object-viewer__ocr-panel-results")
            driver.execute_script("$(arguments[0]).click();",element)
            driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
                
                               
            # content of the article; initialise to None so a failed lookup
            # does not raise a NameError further down
            content = None
            try:
                content = driver.find_element_by_class_name("object-viewer__ocr-articletext")
            except Exception as e:
                print(str(e))

            # Define a dictionary with the details we need
            r = {
                "1Newspaper": newspaper.text,
                "2Date": date.text,
                "3Content": content.text if content is not None else "",
            }
            # append r to all_details
            all_details.append(r)
            
    except Exception as e:
        print(str(e))
        pass
            
# save the information into a CSV file (the output filename here is arbitrary)
df = pd.DataFrame(all_details)
df.to_csv("articles.csv", index=False)

time.sleep(3)
driver.close()

In particular, this part of the code:

element = driver.find_element_by_css_selector(".object-viewer__ocr-panel-results")
driver.execute_script("$(arguments[0]).click();", element)
driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

# content of article
try:
    content = driver.find_element_by_class_name("object-viewer__ocr-articletext")
except Exception as e:
    print(str(e))
    pass

Does anyone have a suggestion for finding the paragraph text inside the collapsible window?

Thanks in advance.

Solution

Without a link to the web page in question it is hard to pinpoint the problem.

My guess is that when you click the collapsible object the DOM changes, which means the collapsible object itself no longer has the same class, id, or name.
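
If that is the case, re-locating the paragraph with an explicit wait after the click, instead of reusing a reference obtained earlier, is a common remedy. A minimal sketch, assuming the class name from the question is still valid once the panel is open:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# After clicking the OCR button, wait until the paragraph is actually
# rendered and visible instead of looking it up right away.
content = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located(
        (By.CLASS_NAME, "object-viewer__ocr-articletext")
    )
)
print(content.text)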

My second guess is that we are dealing with an iframe, which would require grabbing its id and switching focus into it.
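
In that case the driver has to switch into the frame before looking for the paragraph, and switch back afterwards. A sketch under that assumption (the iframe locator below is a placeholder, since the actual page structure is unknown):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the (hypothetical) iframe and move the driver's context into it.
WebDriverWait(driver, 10).until(
    EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, "iframe"))
)

content = driver.find_element(By.CLASS_NAME, "object-viewer__ocr-articletext")
print(content.text)

# Switch back to the top-level document when done.
driver.switch_to.default_content()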

What is the exact exception you are getting?
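
Printing the full traceback rather than only str(e) makes it easier to tell whether the failure is a NoSuchElementException, a TimeoutException, or something else, for example:

import traceback

try:
    content = driver.find_element_by_class_name("object-viewer__ocr-articletext")
except Exception:
    # Show the exception type and the full stack trace, not just the message.
    traceback.print_exc()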


It turned out that the text of the expanded element is present as a whole in the page HTML, so a new script was written using urllib and BeautifulSoup instead.

If anyone is interested in the new code, let me know!
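
For reference, a minimal sketch of that approach, under the assumption that the OCR text marked with the class object-viewer__ocr-articletext is already contained in the HTML the server returns (the URL is the article link from the question):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Article view URL taken from the question; any article link collected
# from the result pages could be used here instead.
url = (
    "https://www.delpher.nl/nl/kranten/view?query=kernenergie"
    "&facets%5Bpapertitle%5D%5B%5D=Algemeen+Dagblad"
    "&facets%5Bpapertitle%5D%5B%5D=De+Volkskrant"
    "&facets%5Bpapertitle%5D%5B%5D=De+Telegraaf"
    "&facets%5Bpapertitle%5D%5B%5D=Trouw"
    "&page=1&sortfield=date"
    "&cql%5B%5D=%28date+_gte_+%2201-01-1970%22%29"
    "&cql%5B%5D=%28date+_lte_+%2201-01-2018%22%29"
    "&coll=ddd&redirect=true"
    "&identifier=ABCDDD:010818460:mpeg21:a0207"
    "&resultsidentifier=ABCDDD:010818460:mpeg21:a0207&rowid=1"
)

html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Print the OCR text of the article, if it is present in the static HTML.
for p in soup.find_all(class_="object-viewer__ocr-articletext"):
    print(p.get_text(strip=True))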