Nutch fetch failed with protocol status: moved(12), lastModified=0: https://moorecompletedental.com/

Problem description

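When I run parsechecker on the URL https://moorecompletedental.com/, the output is:

2020-09-02 19:43:26,757 INFO conf.Configuration: found resource httpclient-auth.xml at file:/tmp/hadoop-unjar8666322013990061416/httpclient-auth.xml
Fetch failed with protocol status: moved(12), lastModified=0: https://moorecompletedental.com/
Redirect(s) not handled due to configuration.
Max Redirects to handle per config: 10
Number of Redirects handled: 0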

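I found some links that suggest setting the property http.redirect.max to 10, but I still get the same error. Can anyone tell me what I need to change so that I can crawl sites like this one? I am new to Nutch.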

Solution

The parsechecker tool provides the command-line flag -followRedirects to follow redirects:

$> bin/nutch parsechecker
Usage:
  ParserChecker [OPTIONS] <url>
    Fetch single URL and parse it
  ParserChecker [OPTIONS] -stdin
    Read URLs to be parsed from stdin
  ParserChecker [OPTIONS] -listen <port> [-keepClientCnxOpen]
    Listen on <port> for URLs to be parsed
Options:
  -D<property>=<value>  set/overwrite Nutch/Hadoop properties
                        (a generic Hadoop option to be passed
                         before other command-specific options)
  -normalize            normalize URLs
  -followRedirects      follow redirects when fetching URL
  -checkRobotsTxt       fail if the robots.txt disallows fetching
  -dumpText             also show the plain-text extracted by parsers
  -forceAs <mimeType>   force parsing as <mimeType>
  -md <key>=<value>     metadata added to CrawlDatum before parsing

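With that flag set, the property http.redirect.max determines how many redirects are followed recursively. If -followRedirects is not given, the property is ignored, which is why the output above reports a configured maximum of 10 but 0 redirects handled.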
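
Putting the two together, a call along the following lines should make parsechecker follow the redirect. This is a sketch rather than a verified run: it assumes the command is started from the Nutch runtime directory (so that bin/nutch resolves) and reuses the value 10 from the question. Per the usage text above, the generic -D option has to come before the command-specific options.

```shell
# Assumption: the current directory is the Nutch runtime directory (e.g. runtime/local).
# -Dhttp.redirect.max=10 sets the redirect limit for this run only;
# -followRedirects is what actually enables redirect handling in parsechecker.
bin/nutch parsechecker -Dhttp.redirect.max=10 -followRedirects https://moorecompletedental.com/
```

If http.redirect.max is already set to 10 in the configuration, as the output in the question suggests, adding -followRedirects alone should be enough. Whether the fetch then succeeds still depends on where the site redirects to.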
