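Problem description
I want to fill in this form using Python.
I tried Mechanize, but this is a Microsoft form that relies on JavaScript, with no form tag and no GET/POST URL. Maybe BeautifulSoup or Selenium could do it, but I have no experience scraping JS forms. Can anyone help and suggest an approach?
Here is what I tried; Mechanize does not recognize any forms on the page: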
import mechanize

def main():
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_handle_refresh(False)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    response = br.open("https://forms.office.com/Pages/ResponsePage.aspx?id=8Pm7rtoj40mYvzIXGrvJvCxQDveyljlCrKN2Teo3EHFUQVNaWDlYRkhYR09JRTZWRFpKTTNIQU9HUC4u")
    for form in br.forms():
        print("Form name:", form.name)  # prints nothing
        print(form)  # prints nothing

if __name__ == '__main__':
    main()
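One way to see why Mechanize finds nothing is to count the form tags in the HTML the server actually returns, before any JavaScript runs. The sketch below uses only the standard library; `FormCounter` and `count_forms` are hypothetical helper names, and the sample HTML strings are made up for illustration rather than taken from the real page.

```python
from html.parser import HTMLParser


class FormCounter(HTMLParser):
    """Counts <form> and <script> start tags in a static HTML document."""

    def __init__(self):
        super().__init__()
        self.forms = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.forms += 1
        elif tag == "script":
            self.scripts += 1


def count_forms(html):
    """Return (number of form tags, number of script tags) in the given HTML."""
    parser = FormCounter()
    parser.feed(html)
    parser.close()
    return parser.forms, parser.scripts


# Illustrative input only: a page whose content is built entirely by a script.
js_page = "<html><body><div id='root'></div><script src='app.js'></script></body></html>"
print(count_forms(js_page))
```

If the first number is zero for the downloaded page, the form is being built client-side, and a tool that does not execute JavaScript has nothing to work with.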
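Solution
Selenium works fine.
- Install Selenium
pip install selenium
- Make sure you download the correct chromedriver (or other driver) for your browser and OS version, and add it to your PATH
Then run: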
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
url = "https://forms.office.com/Pages/ResponsePage.aspx?id=8Pm7rtoj40mYvzIXGrvJvCxQDveyljlCrKN2Teo3EHFUQVNaWDlYRkhYR09JRTZWRFpKTTNIQU9HUC4u"
driver.get(url)

# Current Selenium 4 releases no longer have find_element_by_xpath; use find_element(By.XPATH, ...).
name = driver.find_element(By.XPATH, "//div[@class='question-title-box'][.//span[text()='NAME']]/following-sibling::*//input")
name.send_keys("hello,World")

section_selection = "F"
section = driver.find_element(By.XPATH, "//div[@class='question-title-box'][.//span[text()='Section']]/following-sibling::*//input[@value='" + section_selection + "']")
section.click()

date = driver.find_element(By.XPATH, "//input[contains(@placeholder,'Please input date')]")
date.send_keys("01/12/2020")

submit = driver.find_element(By.XPATH, "//div[text()='Submit']")
submit.click()
The XPaths are a bit long, but they are based on the question text, so they are probably stable.
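Because the form is rendered by JavaScript, the inputs may not exist yet at the moment the page load returns. A possible refinement, sketched here as an assumption rather than something tested against this form, is to build the question XPaths in one helper and wrap each lookup in an explicit wait. `question_xpath`, `wait_for`, and `fill_form` are hypothetical names; the XPath pattern itself is the one used in the answer.

```python
def question_xpath(title, value=None):
    """Build the XPath for the input that follows a question title.

    If value is given, only the input with that value attribute is matched
    (useful for radio-button choices).
    """
    xpath = (
        "//div[@class='question-title-box']"
        f"[.//span[text()='{title}']]"
        "/following-sibling::*//input"
    )
    if value is not None:
        xpath += f"[@value='{value}']"
    return xpath


def wait_for(driver, xpath, timeout=10):
    """Block until the element at xpath is clickable, then return it."""
    # Imported lazily so question_xpath stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.XPATH, xpath))
    )


def fill_form(driver, name, section, date):
    """Fill in the three questions and submit, waiting for each element."""
    wait_for(driver, question_xpath("NAME")).send_keys(name)
    wait_for(driver, question_xpath("Section", section)).click()
    wait_for(driver, "//input[contains(@placeholder,'Please input date')]").send_keys(date)
    wait_for(driver, "//div[text()='Submit']").click()
```

After `driver.get(url)`, a call such as `fill_form(driver, "hello,World", "F", "01/12/2020")` would replace the individual lookups. The 10-second timeout is an arbitrary default.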
As an alternative approach: when you say there is no POST URL, did you check devtools? The network tab exposes the form's destination:
Request URL: https://forms.office.com/formapi/api/aebbf9f0-23da-49e3-98bf-32171abbc9bc/users/f70e502c-96b2-4239-aca3-764dea371071/forms('8Pm7rtoj40mYvzIXGrvJvCxQDveyljlCrKN2Teo3EHFUQVNaWDlYRkhYR09JRTZWRFpKTTNIQU9HUC4u')/responses
Request Method: POST
It also exposes the payload. This is the first submission:
{startDate: "2020-08-17T10:40:18.504Z",submitDate: "2020-08-17T10:40:18.507Z",…}
answers: "[{"questionId":"r8f09d63e6f6f42feb2f8f4f8ed3f9389","answer1":"Hello,World"},{"questionId":"r28fe12073dfa47399f8ce95ae679dccf","answer1":"G"},{"questionId":"r8f9e9fedcc2e410c80bfa1e0e3ef9750","answer1":"2020-08-28"}]"
startDate: "2020-08-17T10:40:18.504Z"
submitDate: "2020-08-17T10:40:18.507Z"
The UUIDs/GUIDs in the POST URL and the question IDs appear to be fixed for this form. They do not change from run to run. This is the second run:
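{startDate: "2020-08-17T10:43:48.544Z",submitDate: "2020-08-17T10:43:48.546Z",…}
answers: "[{"questionId":"r8f09d63e6f6f42feb2f8f4f8ed3f9389","answer1":"test me"},…,{"questionId":"r8f9e9fedcc2e410c80bfa1e0e3ef9750","answer1":"2020-08-12"}]"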
startDate: "2020-08-17T10:43:48.544Z"
submitDate: "2020-08-17T10:43:48.546Z"
Once you have captured this, you can submit through the API without a GUI.
Just to be sure, I tried it and got a success response:
import requests

url = "https://forms.office.com/formapi/api/aebbf9f0-23da-49e3-98bf-32171abbc9bc/users/f70e502c-96b2-4239-aca3-764dea371071/forms('8Pm7rtoj40mYvzIXGrvJvCxQDveyljlCrKN2Teo3EHFUQVNaWDlYRkhYR09JRTZWRFpKTTNIQU9HUC4u')/responses"
# Note that "answers" is itself a JSON-encoded string, not a nested list.
myobj = {"startDate":"2020-08-17T10:48:40.118Z","submitDate":"2020-08-17T10:48:40.121Z","answers":"[{\"questionId\":\"r8f09d63e6f6f42feb2f8f4f8ed3f9389\",\"answer1\":\"Hello again,World\"},{\"questionId\":\"r28fe12073dfa47399f8ce95ae679dccf\",\"answer1\":\"F\"},{\"questionId\":\"r8f9e9fedcc2e410c80bfa1e0e3ef9750\",\"answer1\":\"2020-08-26\"}]"}
x = requests.post(url, data=myobj)
print(x.status_code)  # check whether the submission was accepted
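My answers are simply hard-coded into the data object, but it seems to work.
If you don't have requests installed yet, remember to pip install requests.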
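Rather than hand-escaping the answers string, the payload could be assembled with `json.dumps`. This is only a sketch under assumptions: the question IDs are the ones captured above, but the labels `name`, `section`, and `date`, and the helpers `iso_millis` and `build_payload`, are names made up here. It also sets `startDate` and `submitDate` to the same timestamp, which the captured requests did not do (they differed by a few milliseconds), so whether the server accepts that is untested.

```python
import json
from datetime import datetime, timezone

# Question IDs as captured in devtools; the dictionary keys are illustrative labels.
QUESTION_IDS = {
    "name": "r8f09d63e6f6f42feb2f8f4f8ed3f9389",
    "section": "r28fe12073dfa47399f8ce95ae679dccf",
    "date": "r8f9e9fedcc2e410c80bfa1e0e3ef9750",
}


def iso_millis(moment):
    """Format a timezone-aware datetime like 2020-08-17T10:48:40.118Z."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def build_payload(name, section, date, now=None):
    """Return the form data dict, with answers serialized as a JSON string."""
    now = now or datetime.now(timezone.utc)
    answers = [
        {"questionId": QUESTION_IDS["name"], "answer1": name},
        {"questionId": QUESTION_IDS["section"], "answer1": section},
        {"questionId": QUESTION_IDS["date"], "answer1": date},
    ]
    stamp = iso_millis(now)
    return {
        "startDate": stamp,
        "submitDate": stamp,
        # Compact separators mirror the captured payload, which has no spaces.
        "answers": json.dumps(answers, separators=(",", ":")),
    }


payload = build_payload("Hello again,World", "F", "2020-08-26")
print(payload["answers"])
```

The resulting dict can be passed as `data=payload` to `requests.post` in place of the hard-coded `myobj`.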