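Which regex option should I use with the Mechanize link-finding function?

Problem description

I retrieved every link whose URL contains /title/tt from a web page and stored them in a list.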

my @url_links= $mech->find_all_links( url_regex => qr/title\/tt/i );

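But the list is too long, so I would like to add a filter to find_all_links so that a link only counts if it also sits inside a tag that starts with <id="actor-tt...">. That is where the links (/title/tt...) are located in the page source I retrieved from cmd.exe: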

<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
&nbsp;2009
</span>
<b><a href="/title/tt0361748/"
>Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>

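I suppose tag_regex is the option to use, but I don't know how to use it: when I run the script from the command prompt, tag_regex does not seem to be taken into account.

Solution

Use HTML::TreeBuilder and HTML::Element instead of Mechanize: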

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

my $html_string = join "",<DATA>;

my $tree = HTML::TreeBuilder->new_from_content($html_string);

my @url_links = map { $_->attr_get_i("href") }
                map { $_->look_down(href => qr{/title/tt}) }
                $tree->look_down(id => qr/^actor-tt/);

say for @url_links;

__DATA__
<div class="filmo-row odd" id="actor-tt0361748">
    <span class="year_column">
      &nbsp;2009
    </span>
    <b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
    <br/>
    Lt. Aldo Raine
</div>
<div id="not-the-right-id">
    <a href="/title/tt-looks-correct-but-wrong-id/"></a>
</div>
<div class="filmo-row odd" id="actor-tt0123456">
    <b><a href="/title/tt0123456/">Another movie</a></b>
</div>
<div class="filmo-row odd" id="actor-tt0123456">
    the id will match,but no href in here
</div>

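$tree->look_down(id => qr/^actor-tt/) finds every element whose id starts with actor-tt. Then $_->look_down(href => qr{/title/tt}) finds, within each of those, every element whose href attribute matches /title/tt. Finally, $_->attr_get_i("href") returns the value of that href attribute.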

You may be interested in the new_from_file or new_from_url methods of HTML::TreeBuilder, rather than the new_from_content I used here.
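
As a rough sketch of the new_from_file variant, assuming the page has already been saved to disk: the temporary file and its two-line HTML below are made up for illustration and simply stand in for a real download.

use strict;
use warnings;
use feature 'say';
use File::Temp ();
use HTML::TreeBuilder;

# Hypothetical saved page: a temporary file stands in for a real download.
my $tmp = File::Temp->new(SUFFIX => '.html');
print {$tmp} <<'HTML';
<div id="actor-tt0361748"><a href="/title/tt0361748/">Inglourious Basterds</a></div>
<div id="not-the-right-id"><a href="/title/tt9999999/">Ignored</a></div>
HTML
close $tmp or die "Cannot close temp file: $!";

# new_from_file parses the named file and returns the finished tree.
my $tree = HTML::TreeBuilder->new_from_file($tmp->filename);

# Same two-step search as above: first the container, then the link inside it.
my @file_links = map { $_->attr('href') }
                 map { $_->look_down(_tag => 'a', href => qr{/title/tt}) }
                 $tree->look_down(id => qr/^actor-tt/);

say for @file_links;

$tree->delete;    # release the tree explicitly once it is no longer needed

new_from_url is called the same way but takes a URL instead of a file name.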

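WWW::Mechanize is not sophisticated enough for what you are trying to do. Its link search only tests properties of the link itself, such as its URL, text or tag name, and it turns the matches into WWW::Mechanize::Link objects, which do not keep their ancestry (that is, their position in the DOM tree).

Mechanize is meant to be a browser, not a scraper. Picking the right tool for the job matters.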
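
To illustrate why tag_regex does not help here, the following sketch loads a small local page (a made-up temporary file, so nothing is fetched from a real site) and filters on tag_regex. It relies on the documented behaviour that tag_regex is matched against the tag the link came from, not against any enclosing element.

use strict;
use warnings;
use feature 'say';
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Hypothetical local page, so the sketch runs without touching any real site.
my $tmp = File::Temp->new(SUFFIX => '.html');
print {$tmp} <<'HTML';
<html><body>
<div id="actor-tt0361748"><a href="/title/tt0361748/">Inglourious Basterds</a></div>
<div id="not-the-right-id"><a href="/title/tt9999999/">Other</a></div>
</body></html>
HTML
close $tmp or die "Cannot close temp file: $!";

my $mech = WWW::Mechanize->new;
$mech->get( URI::file->new_abs( $tmp->filename ) );

# tag_regex is compared with the link's own tag name ("a" here),
# not with the surrounding <div>, so both links are returned.
my @mech_links = $mech->find_all_links( tag_regex => qr/^a$/ );

for my $link (@mech_links) {
    # A WWW::Mechanize::Link exposes its tag, URL, text and own attributes,
    # but carries no reference to its parent element.
    say join ' | ', $link->tag, $link->url, $link->text;
}

The link under not-the-right-id is returned as well, and nothing on the Link objects says which <div> each one came from.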

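As Dada suggested in their answer, you can do the search with a parser of your own. You can still fetch the HTML with WWW::Mechanize and then run the code they suggested on it. Use $mech->content or $mech->content_raw to get the HTML.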
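
A minimal sketch of that combination might look like the following. The helper name actor_title_links is hypothetical, and a temporary local file stands in for the real page so the example runs offline; in real use you would pass the page URL to $mech->get instead.

use strict;
use warnings;
use feature 'say';
use File::Temp ();
use URI::file;
use WWW::Mechanize;
use HTML::TreeBuilder;

# Hypothetical helper: return the href of every /title/tt link that sits
# inside an element whose id starts with "actor-tt".
sub actor_title_links {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @hrefs = map { $_->attr('href') }
                map { $_->look_down(_tag => 'a', href => qr{/title/tt}) }
                $tree->look_down(id => qr/^actor-tt/);
    $tree->delete;
    return @hrefs;
}

# Made-up local page standing in for the real one.
my $page = File::Temp->new(SUFFIX => '.html');
print {$page} <<'HTML';
<html><body>
<div id="actor-tt0361748"><a href="/title/tt0361748/">Inglourious Basterds</a></div>
<div id="not-the-right-id"><a href="/title/tt9999999/">Other</a></div>
</body></html>
HTML
close $page or die "Cannot close temp file: $!";

# Mechanize does the fetching; the parser does the filtering.
my $mech = WWW::Mechanize->new;
$mech->get( URI::file->new_abs( $page->filename ) );

my @content_links = actor_title_links( $mech->content );
say for @content_links;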

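There are several options for this. Although I personally like Web::Scraper for this kind of task, its interface is a bit odd and it has a learning curve.

Instead, I would suggest Mojo::UserAgent and Mojo::DOM. In fact, a one-liner using the convenient ojo package should do the trick.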

perl -Mojo -E 'g("https://www.imdb.com/name/nm0000093/")->dom->find("div[id^=actor-tt] a")->map(sub {say $_->attr("href")})'

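In short, it does the following:

Fetches the page with Mojo::UserAgent
Looks at the DOM tree
Finds every <a> inside a <div> whose id starts with actor-tt (see https://metacpan.org/pod/Mojo::DOM::CSS#SELECTORS for details)
Prints the href attribute of each one

You can customize it as much as you need.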
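
The same steps can also be written as a small script rather than a one-liner. The sketch below parses a made-up HTML string with Mojo::DOM so that it runs offline; the selector is slightly stricter than the one-liner's, in that it also requires the href to start with /title/tt.

use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Made-up markup modelled on the snippet from the question.
my $html = <<'HTML';
<div class="filmo-row odd" id="actor-tt0361748">
  <b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
</div>
<div id="not-the-right-id">
  <a href="/title/tt9999999/">Other</a>
</div>
<div class="filmo-row odd" id="actor-tt0123456">
  the id matches, but there is no link in here
</div>
HTML

my $dom = Mojo::DOM->new($html);

# Select every <a> with a /title/tt href that is a descendant of a <div>
# whose id begins with "actor-tt", then collect the href values.
my @dom_links = $dom->find('div[id^="actor-tt"] a[href^="/title/tt"]')
                    ->map( attr => 'href' )
                    ->each;

say for @dom_links;

To work on a live page instead, Mojo::UserAgent->new->get($url)->result->dom gives the same kind of Mojo::DOM object, assuming a reasonably recent Mojolicious that provides the result method.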

Note that, according to their Terms of Service, scraping IMDb is not allowed.
