Perl: should the TreeBuilder call be rewritten when looping with foreach?

Problem description

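My code takes an actor's name as input and then, using that actor's filmography on IMDB, builds a hash table listing every genre of the films he has made, along with how often each genre occurs. However, I have a problem: when I run the program and type a name such as "brad pitt" or "bruce willis" at the prompt, execution goes on indefinitely. How can I find out where the problem lies?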

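Another problem: when I type "nicolas bedos" (the actor name I used from the start), it works, but the index seems to cover only a single film out of those in the @url_links list. Should the look_down function of the TreeBuilder module be modified inside the foreach loop? I told myself that the @genres list is overwritten on each iteration, so I added a push(), but the result stays the same.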

use LWP::Simple;
use PerlIO::locale;
use HTML::TreeBuilder;
use WWW::Mechanize;
binmode STDOUT,':locale';
use strict;
use warnings;


print "Enter the actor's name:";
my $acteur1 = <STDIN>;  # the user enters the name of the actor
print "We will analyze the filmography of the actor $acteur1 by genre\n";

#we put the link with the given actor in Mechanize variable in order to browse the internet links
my $lien1 = "https://www.imdb.com/find?s=nm&q=$acteur1";
my $mech = WWW::Mechanize->new();
$mech->get($lien1); #we access the search page with the get function
$mech->follow_link( url_regex => qr/nm0/i ); #we access the first result using the follow_link function and the regular expression nm0 which is in the URL
my @url_links= $mech->find_all_links( url_regex => qr/title\/tt/i ); #we insert into an array all the links whose URL matches "title/tt"
my $nb_links = @url_links; #we record the number of links in the list in this variable

my $tree = HTML::TreeBuilder->new(); #we create the TreeBuilder module to access a specific text on the page via the tags
my %index; #we create a hashing table

my @genres = (); #we create the genre list to insert all the genres encountered
foreach (@url_links) { #we make a loop to browse all the saved links
    my $mech2 = WWW::Mechanize->new();
    my $html = $_->url(); #we take the url of the link
    if ($html =~ m=^/title=) { #if the url starts with "/title"
        $mech2 ->get("https://www.imdb.com$html"); #we complete the link
        my $content = $mech2->content; #we take the content of the page
        $tree->parse($content); #we access the url and we use the tree to find the strings that interest us
        @genres = $tree->look_down ('class','see-more inline canwrap',#We have as criterion to access the class = "see-more .."
        sub {
                my $link = $_[0]->look_down('_tag','a'); #new conditions: <a> tags
                $link->attr('href') =~ m{genres=}; #other condition: "genres=" must be in the URL
        }
        );      
    }       
}   

my @genres1 = (); #we create a new list to insert the words found (the genres of films)
foreach my $e (@genres){ #we create a loop to browse the list
    my $genre = $e->as_text;  #the text of the list element is inserted into the variable
    @genres1 = split(/[à| ]/,$genre); #we split on the unneeded characters (space, à and |) so that only the genre terms remain
}

foreach my $e (@genres1){ #another loop to filter listing errors (Genres: etc ..) and add the correct words to the hash table
    if ($e ne ("Genres:" or "") ) {
        $index{$e}++;
    }
}

$tree->delete; #we delete the tree as we no longer need it

foreach my $cle (sort{$index{$b} <=> $index{$a}} keys %index){
    print "$cle : $index{$cle}\n"; #we display the hash table with the genres and the number of times that appear in the filmography of the given actor
}

Thanks in advance for your help.

Solution

The IMDB Conditions of Use say this:

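Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.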

So you may want to reconsider what you are doing. Perhaps you could use the OMDB API instead.
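As a rough illustration of that alternative, here is a minimal sketch of the genre tally built on JSON responses rather than scraped HTML. It rests on several assumptions that are not in the question: that the OMDB API is queried at https://www.omdbapi.com/ with the `apikey` and `t` (title) parameters, that each response carries a comma-separated `Genre` field, and that you already have a list of film titles. The sub names `fetch_movie_json` and `count_genres`, the `OMDB_API_KEY` environment variable, and the two example titles are hypothetical placeholders. Check the API's own documentation before relying on any of this.

```perl
use strict;
use warnings;
use HTTP::Tiny;                    # core module; https requests also need IO::Socket::SSL
use JSON::PP qw(decode_json);      # core module

# Fetch the record for one title (endpoint and parameter names are assumptions).
sub fetch_movie_json {
    my ($title, $api_key) = @_;
    my $http  = HTTP::Tiny->new;
    my $query = $http->www_form_urlencode({ apikey => $api_key, t => $title });
    my $res   = $http->get("https://www.omdbapi.com/?$query");
    return $res->{success} ? $res->{content} : undef;
}

# Tally genres over all JSON bodies; "Genre" is assumed to look like "Drama, Crime".
sub count_genres {
    my @bodies = @_;
    my %index;
    for my $body (grep { defined } @bodies) {
        my $movie = decode_json($body);
        next unless defined $movie->{Genre};   # e.g. an error response has no Genre
        $index{$_}++ for split /\s*,\s*/, $movie->{Genre};
    }
    return \%index;
}

# Hypothetical usage: runs only when an API key is provided in the environment.
if ($ENV{OMDB_API_KEY}) {
    my @titles = ('Fight Club', 'Se7en');      # placeholder title list
    my $index  = count_genres(map { fetch_movie_json($_, $ENV{OMDB_API_KEY}) } @titles);
    print "$_ : $index->{$_}\n"
        for sort { $index->{$b} <=> $index->{$a} || $a cmp $b } keys %$index;
}
```

Because the counts are accumulated in a single hash across all responses, no per-page list gets overwritten between iterations, and there is no HTML tree to rebuild for each film.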