Downloading files from a web server with Python and win32com

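Problem

I need to download a batch of PDF files from the web. I would normally use the urllib3 library, but this is a corporate website that requires authentication. I can download ordinary HTML pages with the following approach: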

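import win32com.client

url = 'https://corpweb.example/index.html'
h = win32com.client.Dispatch('WinHTTP.WinHTTPRequest.5.1')
h.SetAutologonPolicy(0)  # 0 = always send the logged-on user's credentials
h.Open('GET', url, False)  # False = synchronous request
h.Send()
result = h.responseText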

However, this approach does not work for PDFs.

url = "https://corpweb.example/file.pdf"
filename = 'file.pdf'
h = win32com.client.Dispatch('WinHTTP.WinHTTPRequest.5.1')
h.SetAutologonPolicy(0)
h.Open('GET', url, False)
h.Send()
with open(filename, 'wb') as f:
    f.write(h.responseText)  # responseText is a str, so this write fails

I get an error:

TypeError: a bytes-like object is required, not 'str'

What can I do?

Solution

As Microsoft's WinHttpRequest documentation states, responseText contains the response body as Unicode text. To get the response body as raw bytes, use responseBody instead.
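A minimal sketch of that change, assuming pywin32 is installed and the code runs on Windows. The helper names save_response_body and download are illustrative and not part of any library. The property is spelled ResponseBody here, as in the WinHTTP type library. pywin32 typically hands the byte array back as a buffer-like object, so the sketch wraps it in bytes() before writing.

```python
def save_response_body(request, path):
    """Write the raw response bytes of a completed WinHttpRequest to path."""
    # ResponseBody is a COM byte array; bytes() accepts the buffer-like
    # object pywin32 returns for it (assumption: bytes or memoryview).
    data = bytes(request.ResponseBody)
    with open(path, 'wb') as f:
        f.write(data)
    return len(data)


def download(url, path):
    """Fetch url with WinHTTP (Windows only) and save the body as bytes."""
    import win32com.client  # imported lazily: only available on Windows

    h = win32com.client.Dispatch('WinHTTP.WinHTTPRequest.5.1')
    h.SetAutologonPolicy(0)  # same auto-logon setting as in the question
    h.Open('GET', url, False)
    h.Send()
    return save_response_body(h, path)
```

With these helpers, the failing snippet from the question would become download('https://corpweb.example/file.pdf', 'file.pdf').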

Also consider using responseStream rather than either of these, to avoid holding the whole file in memory at once.
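A rough sketch of the streaming variant. It rests on an assumption that should be checked on a real Windows machine: that pywin32 returns ResponseStream as a COM object which can be queried for IStream and then read with Read(n). The helper names copy_stream and open_response_stream, and the 64 KiB chunk size, are illustrative choices.

```python
def copy_stream(stream, path, chunk_size=64 * 1024):
    """Copy an object with a Read(n) method to path in fixed-size chunks."""
    total = 0
    with open(path, 'wb') as f:
        while True:
            chunk = stream.Read(chunk_size)
            if not chunk:  # an empty read marks the end of the stream
                break
            f.write(chunk)
            total += len(chunk)
    return total


def open_response_stream(request):
    """Return the IStream behind a completed WinHttpRequest (Windows only)."""
    import pythoncom  # part of pywin32; imported lazily

    # Assumption: ResponseStream comes back as IUnknown and has to be
    # queried for the IStream interface before it can be read.
    return request.ResponseStream.QueryInterface(pythoncom.IID_IStream)
```

After h.Send(), the call would look like copy_stream(open_response_stream(h), 'file.pdf'). Only one chunk is held in Python at a time, which is the point of preferring the stream for large files.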

Have you tried urllib.request.urlretrieve(url, filepath)?

import urllib.request

url = "https://corpweb/file.pdf"
# Do not alias the module as `url`: the string above would then shadow it
urllib.request.urlretrieve(url, "file.pdf")

This is probably the best solution. Alternatively, you can use requests:

import requests

url = "https://corpweb/file.pdf"
resp = requests.get(url)  # Get the response
resp.raise_for_status()  # Stop on HTTP errors instead of saving an error page
# Opening with "wb" creates the file, so no separate creation step is needed
with open("file.pdf", "wb") as f:
    f.write(resp.content)  # Write the raw bytes

You opened the file in binary mode:

with open(fname,'rb') as f:
    ...

This means that all data read from the file is returned as bytes objects, not str. You therefore cannot use a string in a containment test:

if 'some-pattern' in tmp:
    continue
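One way around this is to test against a bytes literal instead. The sketch below is illustrative only: the function name count_matching_lines and the temporary demo file are made up for the example, and tmp stands for a line read from the file as in the snippet above.

```python
import os
import tempfile


def count_matching_lines(path, pattern):
    """Count lines in a binary-mode file that contain the bytes pattern."""
    count = 0
    with open(path, 'rb') as f:
        for tmp in f:  # in 'rb' mode every line is a bytes object
            if pattern in tmp:  # bytes in bytes: no TypeError
                count += 1
    return count


# Demo on a throwaway file so the example is self-contained.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'first some-pattern\nsecond line\nsome-pattern again\n')

matches = count_matching_lines(demo_path, b'some-pattern')
os.remove(demo_path)
print(matches)  # prints 2
```

The other option is to decode each line to str first (for example tmp.decode('utf-8')) and keep the string pattern, which only makes sense when the file really is text.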

python win32com winhttprequest