Problem description
I have two scripts in Python:
Login: goes to the website, signs in through the login form, and stores the cookies in a JSON file for later use.
import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(slow_mo=50)
    context = browser.new_context(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/89.0.4389.114 Safari/537.36')
    page = context.new_page()
    page.goto('https://www.url.us/signin')
    try:
        # Wait for the login form, then fill it in and submit it.
        page.wait_for_selector('#signInFormPage input[name="userName"]', state='visible')
        page.type('#signInFormPage input[name="userName"]', "aaa")
        page.type('#signInFormPage input[name="password"]', "aa")
        page.click('#userNamePasswordSignInButton')
        # Give the login request time to finish before reading the cookies.
        page.wait_for_timeout(3000)
        cookies = context.cookies()
        page.wait_for_timeout(10000)
        # A context manager makes sure the file is flushed and closed.
        with open('./cookies.json', 'w') as f:
            json.dump(cookies, f)
    except Exception as e:
        print("Error in playwright script:", e)
    finally:
        # Clean up on both the success and the error path.
        page.close()
        context.close()
        browser.close()
This script works fine. The second script loads the stored cookies from the file and prints the page source of another page on the same website:
import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    context = browser.new_context(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/89.0.4389.114 Safari/537.36')
    page = context.new_page()
    # Load the cookies saved by the login script into this context.
    with open('./cookies.json') as cookie_file:
        cookies = json.load(cookie_file)
    context.add_cookies(cookies)
    page.goto('https://www.url.us/Product/10aaa')
    try:
        page.wait_for_timeout(6000)
        print(page.content())
    except Exception as e:
        print("Error in playwright script:", e)
    finally:
        # Clean up on both the success and the error path.
        page.close()
        context.close()
        browser.close()
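This script works fine as well.
The problem is that the website has some APIs that return the information I want to extract, and that information is not available in the page source that front-end users see. When I put the API link into the second script, I get an empty JSON page. These API requests use a token value, but since I only use cookies to get the page source, I do not have the token. I use these scripts because they are the only way I have found to get past the Cloudflare protection on the site. Is there a way to combine the requests module with the playwright module, for example? Or any other suggestion for this situation: how can I fetch the JSON page using the cookies?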
Updated code using a persistent context:
Script 1:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # launch_persistent_context returns a BrowserContext, not a Browser.
    context = p.chromium.launch_persistent_context(r'C:\Users\test\Downloads\pyyy', headless=False)
    page = context.new_page()
    page.goto('https://www.url.us/signin')
    try:
        # Wait for the login form, then fill in the user name and password.
        page.wait_for_selector('#signInFormPage input[name="userName"]', state='visible')
        page.type('#signInFormPage input[name="userName"]', "aaaaa")
        page.type('#signInFormPage input[name="password"]', "aaaa")
        page.click('#userNamePasswordSignInButton')
        page.wait_for_timeout(3000)
    except Exception as e:
        print("Error in playwright script:", e)
    finally:
        page.close()
        # Closing the context shuts the browser down cleanly.
        context.close()
Script 2:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Reuse the same user data directory so the login from script 1 is still there.
    context = p.chromium.launch_persistent_context(r'C:\Users\test\Downloads\pyyy', headless=False)
    page = context.new_page()
    page.goto('https://www.url.us/Product/aaa')
    try:
        page.wait_for_timeout(6000)
        print(page.content())
    except Exception as e:
        print("Error in playwright script:", e)
    finally:
        page.close()
        context.close()
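Solution
Rather than saving and loading cookies, I would launch a persistent context. A persistent context keeps its state (cookies, local storage, and so on) in the user_data_dir you provide, so a later run that points at the same directory starts out already logged in.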
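
As a rough sketch of how that could be combined with the API calls from the question: every Playwright browser context exposes a `request` object whose HTTP calls share the context's cookies, so it may be able to stand in for the requests module here. The profile directory, the `fetch_json` helper, and the `main` wrapper below are hypothetical names chosen for illustration, and the API URL is something you would have to supply yourself. If the API also expects a token header, the cookies alone may not be enough.

```python
import json

# Hypothetical profile directory; use the same one the login script wrote to.
USER_DATA_DIR = "./playwright-profile"


def fetch_json(context, url):
    """Fetch url through the context's own HTTP client and parse the JSON body.

    context.request sends the cookies stored in the browser context,
    so the call is made with the logged-in session.
    """
    response = context.request.get(url)
    if not response.ok:
        raise RuntimeError(f"{url} returned HTTP {response.status}")
    return response.json()


def main(api_url):
    # Imported here so fetch_json can be reused without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(USER_DATA_DIR, headless=False)
        try:
            # Visit a normal page first so any site checks run in a real browser.
            page = context.new_page()
            page.goto("https://www.url.us/Product/10aaa")
            data = fetch_json(context, api_url)
            print(json.dumps(data, indent=2))
        finally:
            context.close()


# Example call (needs Playwright installed and a profile that is already logged in):
# main("<the API URL you see in the browser's network tab>")
```

If the response is still empty, the missing piece is probably the token. One option would be to let the page make the API call itself and read the result with `page.expect_response`, which captures the response the front end receives.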