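Problem description
My code takes an actor's name as input and, using that actor's filmography on IMDb, lists in a hash table every film genre the actor has appeared in, along with how often each genre occurs. However, I have a problem: when I run the program from the command prompt and type a name such as "brad pitt" or "bruce willis", execution goes on indefinitely. How can I find out where the problem is?
Another problem: when I type "nicolas bedos" (the actor name I used from the start), it works, but the index seems to cover only a single film out of those in the @url_links list. Should the look_down call from the TreeBuilder module inside the foreach loop be modified? I figured that the @genres list was being overwritten on each iteration, so I added a push(), but the result stays the same.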
use LWP::Simple;
use PerlIO::locale;
use HTML::TreeBuilder;
use WWW::Mechanize;
binmode STDOUT,':locale';
use strict;
use warnings;
print "Enter the actor's name:";
my $acteur1 = <STDIN>; # the user enters the name of the actor
print "We will analyze the filmography of the actor $acteur1 by genre\n";
# put the search URL for the given actor in a variable so that Mechanize can browse the links
my $lien1 = "https://www.imdb.com/find?s=nm&q=$acteur1";
my $mech = WWW::Mechanize->new();
$mech->get($lien1); # access the search page with the get function
$mech->follow_link( url_regex => qr/nm0/i ); # follow the first result whose URL matches the regular expression nm0
my @url_links = $mech->find_all_links( url_regex => qr/title\/tt/i ); # store in an array all the links whose URL matches "title/tt"
my $nb_links = @url_links; # record the number of links in the list
my $tree = HTML::TreeBuilder->new(); # create the TreeBuilder object used to reach specific text on the page via its tags
my %index; # create a hash table
my @genres = (); # create the genre list that will hold all the genres encountered
foreach (@url_links) { # loop over all the saved links
    my $mech2 = WWW::Mechanize->new();
    my $html = $_->url(); # take the URL of the link
    if ($html =~ m=^/title=) { # if the URL starts with "/title"
        $mech2->get("https://www.imdb.com$html"); # complete the link and fetch the page
        my $content = $mech2->content; # take the content of the page
        $tree->parse($content); # parse the page so the tree can be searched for the strings of interest
        @genres = $tree->look_down('class', 'see-more inline canwrap', # criterion: class="see-more inline canwrap"
            sub {
                my $link = $_[0]->look_down('_tag', 'a'); # further condition: <a> tags
                $link->attr('href') =~ m{genres=}; # further condition: "genres=" must be in the URL
            }
        );
    }
}
my @genres1 = (); # new list for the words found (the film genres)
foreach my $e (@genres) { # loop over the list
    my $genre = $e->as_text; # the text of the list element goes into the variable
    @genres1 = split(/[à| ]/, $genre); # split on spaces, "à" and "|" so that only the genre terms remain
}
foreach my $e (@genres1) { # another loop to filter out listing noise ("Genres:" etc.) and add the correct words to the hash table
    if ($e ne ("Genres:" or "") ) {
        $index{$e}++;
    }
}
$tree->delete; # delete the tree, as it is no longer needed
foreach my $cle (sort { $index{$b} <=> $index{$a} } keys %index) {
    print "$cle : $index{$cle}\n"; # display the hash table: each genre and the number of times it appears in the given actor's filmography
}
Thanks in advance for your help, Robot
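Solution
"Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below."
So you might want to reconsider what you are doing. Perhaps you could use the OMDB API instead.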
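If you go the API route, the genre tally could look roughly like the sketch below. It is only an illustration under several assumptions that are mine and not from your post: that you already have a list of film titles for the actor (the two titles in the code are hypothetical), that you have an API key in an OMDB_API_KEY environment variable (a name I made up), that the endpoint accepts `apikey` and `t` (title) query parameters, and that each JSON response carries a comma-separated `Genre` string with "N/A" for missing values. Check the API's own documentation before relying on any of that. Only core modules (HTTP::Tiny, JSON::PP) are used.

```perl
use strict;
use warnings;
use HTTP::Tiny;   # core module; HTTPS additionally needs IO::Socket::SSL
use JSON::PP;     # core module

# Count genres across a list of decoded film records.
# ASSUMPTION: each record is a hash ref whose "Genre" field is a
# comma-separated string such as "Crime, Drama".
sub count_genres {
    my @records = @_;
    my %index;
    for my $record (@records) {
        next unless defined $record->{Genre};
        (my $list = $record->{Genre}) =~ s/^\s+|\s+$//g;   # trim both ends
        for my $genre (split /\s*,\s*/, $list) {
            # Compare against each unwanted value separately.
            next if $genre eq '' or $genre eq 'N/A';
            $index{$genre}++;   # accumulate inside the loop, never reassign
        }
    }
    return %index;
}

# Fetch one film by title and return the decoded JSON, or nothing on failure.
# ASSUMPTION: the endpoint URL and the "apikey" / "t" parameter names.
sub fetch_film {
    my ($api_key, $title) = @_;
    my $http  = HTTP::Tiny->new(timeout => 10);
    my $query = $http->www_form_urlencode({ apikey => $api_key, t => $title });
    my $res   = $http->get("https://www.omdbapi.com/?$query");
    return unless $res->{success};
    my $data = eval { decode_json($res->{content}) };
    return unless ref $data eq 'HASH';
    return $data;
}

# The network is only touched when a key is provided.
if (my $key = $ENV{OMDB_API_KEY}) {
    my @titles = ('Fight Club', 'Se7en');   # hypothetical input list
    my @films  = map { fetch_film($key, $_) } @titles;
    my %index  = count_genres(@films);
    for my $cle (sort { $index{$b} <=> $index{$a} || $a cmp $b } keys %index) {
        print "$cle : $index{$cle}\n";
    }
}
```

The counting pattern does not depend on where the data comes from. The hash is incremented inside the per-film loop, so nothing from an earlier film is replaced. In the posted code, by contrast, `@genres = ...` and `@genres1 = split(...)` are plain assignments, which replace the array contents on every pass, and `$e ne ("Genres:" or "")` only ever compares against "Genres:" because `"Genres:" or ""` evaluates to "Genres:".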