pycurl and curl behave differently when requesting the same resource: curl correctly returns a JSON object, PycURL returns an HTML page

Problem description

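ipinfo.io provides information about the website/server that corresponds to an IP address, either when you enter the address on their website or when you send them a request with the curl command-line utility, for example: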

$ curl  https://ipinfo.io/172.217.169.6

The output, in JSON format:

{
  "ip": "172.217.169.68",
  "hostname": "lhr48s09-in-f4.1e100.net",
  "city": "London",
  "region": "England",
  "country": "GB",
  "loc": "51.5085,-0.1257",
  "org": "AS15169 Google LLC",
  "postal": "EC1A",
  "timezone": "Europe/London",
  "readme": "https://ipinfo.io/missingauth"
}

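What I ultimately want to do is perform this request in Python and store the result as a JSON object. I believe the following code, which uses pycURL, should produce the same output: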

import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL,"https://ipinfo.io/172.217.169.6")
c.setopt(c.WRITEDATA,buffer)
c.perform()
c.close()

body = buffer.getvalue()
print(body.decode('iso-8859-1'))

That is, it should write the same JSON string to the buffer.

However, it prints a large amount of HTML instead. I suspect pycURL is requesting the HTML of the actual web page rather than the JSON data. For example:

<!DOCTYPE html>
<html>
<head>
    <title>
    172.217.169.6 IP Address Details
 - IPinfo.io</title>
    <meta charset="utf-8">
    <meta name="apple-itunes-app" content="app-id=917634022">
    <meta name="viewport" content="width=device-width,initial-scale=1,shrink-to-fit=no,user-scalable=no">
    <meta name="description" content="Full IP address details for 172.217.169.6 (AS15169 Google LLC) including geolocation and map, hostname, and API details.">

    <link rel="manifest" href="/static/manifest.json">
    <link rel="icon" sizes="48x48" href="/static/deviceicons/android-icon-48x48.png">


...
    

</html>

Basically, how can I get pycURL to receive this JSON data as well?



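I tried comparing the verbose output of the two, but I can't work out why they behave differently. All I can see is that the Content-Type field differs: "application/json" for curl and "text/html" for pycURL, which explains the different output. At the risk of making this post long and tedious, I have included both below: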

curl (command line) verbose output

$ curl -v https://ipinfo.io/172.217.169.6
*   Trying 34.117.59.81:443...
* TCP_NODELAY set
* Connected to ipinfo.io (34.117.59.81) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), Server hello (2):
* TLSv1.3 (IN), Encrypted Extensions (8):
* TLSv1.3 (IN), Certificate (11):
* TLSv1.3 (IN), CERT verify (15):
* TLSv1.3 (IN), Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=ipinfo.io
*  start date: Jul 10 20:18:59 2021 GMT
*  expire date: Oct  8 21:18:59 2021 GMT
*  subjectAltName: host "ipinfo.io" matched cert's "ipinfo.io"
*  issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1D4
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55a887a40e10)
> GET /172.217.169.6 HTTP/2
> Host: ipinfo.io
> user-agent: curl/7.68.0
> accept: */*
> 
* TLSv1.3 (IN), Newsession Ticket (4):
* TLSv1.3 (IN), Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200 
< access-control-allow-origin: *
< x-frame-options: DENY
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< referrer-policy: strict-origin-when-cross-origin
< content-type: application/json; charset=utf-8
< content-length: 286
< date: Tue,27 Jul 2021 21:03:50 GMT
< x-envoy-upstream-service-time: 1
< via: 1.1 google
< alt-svc: clear
< 
{
  "ip": "172.217.169.6","hostname": "lhr25s26-in-f6.1e100.net","readme": "https://ipinfo.io/missingauth"
* Connection #0 to host ipinfo.io left intact
}

pycURL verbose output

$ python3 ip_helper.py
*   Trying 34.117.59.81:443...
* TCP_NODELAY set
* Connected to ipinfo.io (34.117.59.81) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN,server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x19d65c0)
> GET /172.217.169.6 HTTP/2
Host: ipinfo.io
user-agent: PycURL/7.43.0.6 libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
accept: */*

* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200 
< access-control-allow-origin: *
< x-frame-options: DENY
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< referrer-policy: strict-origin-when-cross-origin
< content-type: text/html; charset=utf-8
< content-length: 44645
< date: Tue,27 Jul 2021 21:07:50 GMT
< x-envoy-upstream-service-time: 13
< via: 1.1 google
< alt-svc: clear
< 
* Connection #0 to host ipinfo.io left intact
<!DOCTYPE html>
<html>
<head>
    <title>
    172.217.169.6 IP Address Details
 - IPinfo.io</title>
    <meta charset="utf-8">
    <meta name="apple-itunes-app" content="app-id=917634022">
    <meta name="viewport" content="width=device-width,user-scalable=no">
    <meta name="description" content="
    
        Full IP address details for 172.217.169.6 (AS15169 Google LLC) including geolocation and map, and API details.
    
">

    <link rel="manifest" href="/static/manifest.json">
    <link rel="icon" sizes="48x48" href="/static/deviceicons/android-icon-48x48.png">


...

</html>
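
One way to narrow this down is to compare only the request headers of the two verbose logs. The sketch below is a minimal illustration, not part of the original scripts: the helper names request_headers and differing_headers are hypothetical, the sample logs are abbreviated from the output above (the PycURL user-agent is shortened), and it assumes every request line carries curl's "> " prefix.

```python
def request_headers(verbose_log: str) -> dict:
    """Collect request headers (lines curl prefixes with '> ') from a verbose log."""
    headers = {}
    for line in verbose_log.splitlines():
        if not line.startswith("> "):
            continue  # response headers ('< ') and info lines ('* ') are ignored
        name, sep, value = line[2:].partition(":")
        if sep:  # the request line ("GET ... HTTP/2") has no colon and is skipped
            headers[name.strip().lower()] = value.strip()
    return headers


def differing_headers(first: dict, second: dict) -> dict:
    """Return {header: (value_in_first, value_in_second)} for headers that differ."""
    names = sorted(set(first) | set(second))
    return {n: (first.get(n), second.get(n)) for n in names if first.get(n) != second.get(n)}


# Abbreviated sample logs (assumption: trimmed copies of the output above).
CURL_LOG = """\
> GET /172.217.169.6 HTTP/2
> Host: ipinfo.io
> user-agent: curl/7.68.0
> accept: */*
< HTTP/2 200
< content-type: application/json; charset=utf-8
"""

PYCURL_LOG = """\
> GET /172.217.169.6 HTTP/2
> Host: ipinfo.io
> user-agent: PycURL/7.43.0.6 libcurl/7.68.0
> accept: */*
< HTTP/2 200
< content-type: text/html; charset=utf-8
"""

diff = differing_headers(request_headers(CURL_LOG), request_headers(PYCURL_LOG))
print(diff)
```

On these samples the only request header that differs is user-agent, which suggests the server may be choosing the response format based on it.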

Thank you for your time.

Solution

From the docs:

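We try to automatically detect when someone wants to call our API rather than view our website, and then we send back the appropriate JSON response instead of HTML. We do this based on the user agent of known popular programming languages, tools, and frameworks. However, when a JSON response does not happen automatically, there are a couple of other ways to force one. One is to add /json to the URL, and the other is to set an Accept header of application/json.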
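
To make the quoted behaviour concrete, here is a rough sketch of the kind of decision it describes. This is purely illustrative and is not ipinfo.io's actual implementation: the function wants_json, the KNOWN_API_CLIENTS tuple, and the prefix match on the user agent are all assumptions made for the example.

```python
# Hypothetical list of user-agent prefixes treated as API clients (assumption).
KNOWN_API_CLIENTS = ("curl",)


def wants_json(path: str, headers: dict) -> bool:
    """Guess whether a request should get JSON, following the three rules in the docs quote."""
    normalized = {name.lower(): value for name, value in headers.items()}

    # Rule 1: the URL ends in /json.
    if path.rstrip("/").endswith("/json"):
        return True

    # Rule 2: the client explicitly asks for JSON via the Accept header.
    if "application/json" in normalized.get("accept", ""):
        return True

    # Rule 3: the user agent looks like a known API client.
    agent = normalized.get("user-agent", "").lower()
    return agent.startswith(KNOWN_API_CLIENTS)


pycurl_headers = {"User-Agent": "PycURL/7.43.0.6 libcurl/7.68.0", "Accept": "*/*"}
curl_headers = {"User-Agent": "curl/7.68.0", "Accept": "*/*"}

print(wants_json("/172.217.169.6", curl_headers))
print(wants_json("/172.217.169.6", pycurl_headers))
```

Under these assumptions the default curl user agent matches the list while the default PycURL one does not, which would be consistent with the two responses seen in the question.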

So it looks like there are three different ways to get JSON back with pycurl.

  1. Append /json to your URL:
c.setopt(c.URL, "https://ipinfo.io/172.217.169.6/json")
  2. Set your Accept header so that only JSON responses are accepted:
c.setopt(c.HTTPHEADER, ["Accept: application/json"])
  3. Set your User-Agent header so the site thinks it is talking to curl rather than pycurl:
c.setopt(c.HTTPHEADER, ["User-Agent: curl"])
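
Putting the Accept-header option together with the original script might look like the sketch below. It is a minimal example, not a definitive implementation: the function names fetch_ip_info and decode_json_body are made up for illustration, and it assumes the server labels JSON responses with an application/json content type and encodes them as UTF-8, as the verbose output in the question shows.

```python
import json
from io import BytesIO


def decode_json_body(body: bytes, content_type: str) -> dict:
    """Parse a response body as JSON, refusing anything not labelled as JSON."""
    if "application/json" not in content_type.lower():
        raise ValueError(f"expected a JSON response, got {content_type!r}")
    return json.loads(body.decode("utf-8"))


def fetch_ip_info(ip: str) -> dict:
    """Request https://ipinfo.io/<ip> with pycurl and return the decoded JSON."""
    # Imported here so decode_json_body can be used without pycurl installed.
    import pycurl

    buffer = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(pycurl.URL, f"https://ipinfo.io/{ip}")
        # Ask for JSON explicitly instead of relying on user-agent detection.
        c.setopt(pycurl.HTTPHEADER, ["Accept: application/json"])
        c.setopt(pycurl.WRITEDATA, buffer)
        c.perform()
        content_type = c.getinfo(pycurl.CONTENT_TYPE) or ""
    finally:
        c.close()  # note the parentheses: c.close alone does not call the method
    return decode_json_body(buffer.getvalue(), content_type)
```

Calling fetch_ip_info("172.217.169.6") should then give back a Python dict rather than an HTML string, provided the service behaves as its docs describe. The content-type check is there so that an unexpected HTML page fails loudly instead of surfacing as a JSON parse error.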