Fetch failed with protocol status: exception(16), lastModified=0: Http code=406, url=https://www.randolphnj.org/

Problem description

I am trying to crawl the URL https://www.randolphnj.org/

but the fetch fails with this error:

2020-09-22 15:03:08,395 INFO httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2020-09-22 15:03:08,395 INFO httpclient.Http: http.enable.cookie.header = true
2020-09-22 15:03:08,399 INFO conf.Configuration: found resource httpclient-auth.xml at file:/tmp/hadoop-unjar7802696204891280694/httpclient-auth.xml

Fetch Failed with protocol status: exception(16), lastModified=0: Http code=406, url=https://www.randolphnj.org/

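What is causing this, and how can I fix it?

Solution

The server most likely blocks requests whose HTTP request header "User-Agent" contains the string "Nutch". I was able to reproduce the behavior with wget: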

$> wget --header='User-Agent: mycrawler/Nutch-1.17' https://www.randolphnj.org/
--2020-09-25 10:55:42--  https://www.randolphnj.org/
Resolving www.randolphnj.org (www.randolphnj.org)... 63.247.128.112
Connecting to www.randolphnj.org (www.randolphnj.org)|63.247.128.112|:443... connected.
HTTP request sent, awaiting response... 406 Not Acceptable
2020-09-25 10:55:43 ERROR 406: Not Acceptable.

$> wget https://www.randolphnj.org/
--2020-09-25 11:02:25--  https://www.randolphnj.org/
Resolving www.randolphnj.org (www.randolphnj.org)... 63.247.128.112
Connecting to www.randolphnj.org (www.randolphnj.org)|63.247.128.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’
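
To check which agent strings the server accepts, the wget comparison above can be scripted. This is only a minimal sketch using the Python standard library. The helper names `status_for_user_agent` and `compare_user_agents` are made up for illustration, and every candidate string except `mycrawler/Nutch-1.17` is an assumption, not something tested in the transcript.

```python
import urllib.error
import urllib.request


def status_for_user_agent(url, user_agent, timeout=10.0):
    """Return the HTTP status code the server sends for a given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx replies; the status code is on the exception.
        return err.code


def compare_user_agents(url, user_agents):
    """Map each User-Agent string to the status code it receives."""
    return {agent: status_for_user_agent(url, agent) for agent in user_agents}


if __name__ == "__main__":
    # Only the first candidate comes from the wget test; the others are illustrative.
    candidates = ["mycrawler/Nutch-1.17", "mycrawler/1.17", "Wget/1.20"]
    results = compare_user_agents("https://www.randolphnj.org/", candidates)
    for agent, status in results.items():
        print(f"{status}  {agent}")
```

If a string without "Nutch" comes back with 200 while the Nutch one gets 406, the agent string Nutch sends is the thing to change. As far as I know the version part comes from the `http.agent.version` property, which defaults to `Nutch-<version>`, so overriding it in nutch-site.xml would be the first thing to try.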