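How do I exclude specific DIVs by ID/class, and the header and footer sections of the HTML, when crawling web pages with StormCrawler?

Problem description

I am trying to crawl a website whose HTML pages share a common header and footer and contain two separate DIVs with IDs. I would like to store only the content of the divs with id=firstSection and id=secondSection in my ES index. However, the entire HTML text, including the header and footer, is being stored in ES. Is there a way to crawl only specific DIV IDs, or to exclude specific DIV content so that it is not stored in ES?

Note: I tried adding the class/id of the div to the "exclude" tags, but it did not work.

I am using Storm Crawler 1.17 and ES 7.6.

Below are my configuration and HTML.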

crawler-conf.yaml

# text extraction for JSoupParserBolt
  textextractor.include.pattern:
   - DIV[id="maincontent"]
   - DIV[itemprop="articleBody"]
   - ARTICLE
   - DIV[id="block-edu-bootstrap-subtheme-content" class="block block-system block-system-main-block"]
   - MAIN[role="main"]
   - DIV[id="content--news"]
   - DIV[id="content--person"]
   - ARTICLE[class="node container node--type-facility facility-full node-101895 node--promoted node--view-mode-full py-5"]
   - ARTICLE[class="node container node--type-spotlight spotlight-full node-90543 node--promoted node--view-mode-full py-5"]
   - DIV[class="field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items"]
   - BODY


  textextractor.exclude.tags:
   - STYLE
   - SCRIPT
   - HEADER[class="fixed-header"]
   - FOOTER[class="fixed-footer"]

And the sample HTML file I used:

index.html


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Implement Sticky Header and Footer with CSS</title>
<style>
/* Add some padding on document's body to prevent the content
    to go underneath the header and footer */
body {
    padding-top: 60px;
    padding-bottom: 40px;
}


.fixed-header,.fixed-footer {
    width: 100%;
    position: fixed;
    background: #333;
    padding: 10px 0;
    color: #fff;
}


.fixed-header {
    top: 0;
}


.fixed-footer {
    bottom: 0;
}


.container {
    width: 80%;
    margin: 0 auto; /* Center the DIV horizontally */
}


.welcome-box {
    background: #cccccc;
    border: 1px solid #333333;
}


nav a {
    color: #fff;
    text-decoration: none;
    padding: 7px 25px;
    display: inline-block;
}
</style>
</head>
<body>
    <div class="fixed-header">
        <div class="container">
            <nav>
                <a href="#">Home</a> <a href="#">About</a> <a href="#">Products</a>
                <a href="#">Services</a> <a href="#">Contact Us</a>
            </nav>
        </div>
    </div>
    <div id="firstSection">
        <div class="container">
            <div class="welcome-box">
                <h1>Welcome</h1>
                <p>Hi,welcome to our website.</p>
            </div>
            <div class="welcome-box">
                <h1>Welcome</h1>
                <p>Lorem ipsum dolor sit amet,consectetur adipiscing elit...</p>
            </div>
        </div>
    </div>
    <div id="secondSection">
        <div class="container">
            <div class="welcome-box">
                <h1>Welcome2</h1>
                <p>Hi,welcome to our second section.</p>
            </div>
            <div class="welcome-box">
                <h1>Second section</h1>
                <p>Hi,welcome to our second section.</p>
            </div>
        </div>
    </div>
    <div id="thirdSection">
        <div class="container">
            <div class="welcome-box">
                <h1>Welcome</h1>
                <p>Dont crawl this section.</p>
            </div>
            <div class="welcome-box">
                <h1>Welcome</h1>
                <p>Dont crawl this section</p>
            </div>
        </div>
    </div>
    <div class="fixed-footer">
        <div class="container">Copyright &copy; 2020 Your Company</div>
    </div>
</body>
</html>

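Solution

Not sure why the exclusion isn't working (and I have no time to test the example you provided), but you could configure the XPathFilter to extract the elements you are interested in under a specific key in the metadata, and then configure the indexer to store the content of that key as a field in ES.

For an example of how to set up the XPathFilter, see the parse filter config from the archetype.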
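
As a rough sketch of that suggestion (not tested against the sample page), the fragment below adds an XPathFilter entry to the parse filter configuration, which is the parsefilters.json file in a project generated from the archetype. The metadata key `parse.sections` is a made-up name chosen for illustration, and the single XPath expression is an assumption based on the ids in the question's index.html. A wildcard element name is used so the expression does not depend on how tag names are cased in the parsed DOM.

```json
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "parse.sections": "//*[@id=\"firstSection\" or @id=\"secondSection\"]"
      }
    }
  ]
}
```

To get that key into ES, an entry along the lines of `- parse.sections=sections` under `indexer.md.mapping` in the crawler configuration should map it to a field named `sections` (again a hypothetical name). The default text field would presumably still hold the full page text unless the text extractor settings are changed as well.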

Tags: apache-tika, html, jsoup, stormcrawler, text-extraction
