Perl script, web scraper

Posted by siotufzp on 2023-03-03 in Perl


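I have this script that scrapes reviews from the Amazon website. Every time I run it, I get a compilation error. I was wondering whether anyone could shed some light on what is wrong with it.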

#!/usr/bin/perl
# get_reviews.pl
#
# A script to scrape Amazon, retrieve reviews, and write to a file
# Usage: perl get_reviews.pl <asin>
use strict;
use warnings;
use LWP::Simple;

# Take the asin from the command-line
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

# Assemble the URL from the passed asin.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";
  
# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;

#Remove everything before the reviews
$content =~ s!.*?Number of Reviews:!!ms;

# Loop through the HTML looking for matches
while ($content =~ m!<img.*?stars-(\d)-0.gif.*?>.*?<b>(.*?)</b>, (.*?)[RETURN]
    \n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {

  my($rating,$title,$date,$reviewer,$review) = [RETURN] 
  ($1||'',$2||'',$3||'',$4||'',$5||'');
  $reviewer =~ s!<.+?>!!g;   # drop all HTML tags
  $reviewer =~ s!\(.+?\)!!g;   # remove anything in parenthesis
  $reviewer =~ s!\n!!g;      # remove newlines
  $review =~ s!<.+?>!!g;     # drop all HTML tags
  $review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape.

  # Print the results
  print "$title\n" . "$date\n" . "by $reviewer\n" . "$rating stars.\n\n" . "$review\n\n";
}
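
A likely cause, offered as a sketch rather than a definitive diagnosis: the `[RETURN]` tokens look like line-continuation markers from a printed listing, not Perl. In the assignment, Perl would most likely parse `[RETURN]` as an anonymous array holding a bareword, which `use strict` rejects, and the parenthesized list on the following line then becomes a syntax error. Inside the match, `[RETURN]` is read as a character class and the line break becomes literal text, so the pattern would compile but probably never match. The version below removes the markers and joins the split lines. The sample markup and the `parse_reviews` helper are hypothetical, added only so the example runs without a network request; escaping the dot in `0\.gif` and dropping the `/i` flag on the unescape step are small changes of mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup standing in for the fetched page (an assumption,
# shaped only to fit the pattern from the question).
my $sample = <<'HTML';
<img src="stars-4-0.gif" border=0>
<b>Great book</b>, March 3, 2003
<table><tr><td>Reviewer:
<b>
Jane Doe (Seattle)</b></td></tr></table>
Clear &amp; useful.<br>
<br>
HTML

# Unescape rules, as in the question.
my %unescape    = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|', keys %unescape;

# The pattern from the question with "[RETURN]" removed and both halves on one line.
my $review_re = qr!<img.*?stars-(\d)-0\.gif.*?>.*?<b>(.*?)</b>, (.*?)\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!is;

# Hypothetical helper: returns one hash reference per review found in $content.
sub parse_reviews {
    my ($content) = @_;
    my @reviews;
    while ($content =~ /$review_re/g) {
        # Copy the captures first; the substitutions below overwrite $1..$5.
        my ($rating, $title, $date, $reviewer, $review) =
            ($1 || '', $2 || '', $3 || '', $4 || '', $5 || '');
        $reviewer =~ s!<.+?>!!g;      # drop all HTML tags
        $reviewer =~ s!\(.+?\)!!g;    # remove anything in parentheses
        $reviewer =~ s!\n!!g;         # remove newlines
        $review   =~ s!<.+?>!!g;      # drop all HTML tags
        $review   =~ s/($unescape_re)/$unescape{$1}/g;    # unescape entities
        push @reviews, {
            rating   => $rating,
            title    => $title,
            date     => $date,
            reviewer => $reviewer,
            review   => $review,
        };
    }
    return @reviews;
}

# Print the results in the same layout as the original script.
for my $r (parse_reviews($sample)) {
    print "$r->{title}\n$r->{date}\nby $r->{reviewer}\n$r->{rating} stars.\n\n$r->{review}\n\n";
}
```

To try it against a live page, pass the result of `get($url)` from `LWP::Simple` to `parse_reviews` instead of `$sample`. The pattern is tied to one specific page layout, so it may well match nothing if the markup has changed since the script was written.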


Source: https://stackoverflow.com/questions/19568975/perl-script-web-scraper