Web scraping with Symfony/Panther in Laravel: can't get HTML

Posted by ryoqjall on 2022-11-26 in Other


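I want to scrape a site from a Laravel application using the symfony/panther package. According to the documentation (https://github.com/symfony/panther#a-polymorphic-feline), I can't use the HttpBrowser and HttpClient classes because they don't support JavaScript.
So I tried the ChromeClient, which uses a local Chrome executable and the chromedriver binary shipped with the Panther package.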

$client = Client::createChromeClient();
$crawler = $client->request('GET', 'http://example.com');
dd($crawler->html());

Unfortunately, all I get back is the HTML of Chrome's empty default page:

<html><head></head><body></body></html>

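Every other method I call on the $client or $crawler instance fails with the error "no nodes available".
I also tried the basic example from the documentation (https://github.com/symfony/panther#basic-usage), with the same result.
I'm running an Ubuntu 18.04 server under WSL on Windows and installed the google-chrome-stable deb package. That seems to work, because the "binary not found" error no longer occurs after installing it.
I also tried pointing Panther manually at the executable on the Windows host system, but that only opens an empty CMD window that always reopens when closed. I have to kill the process via Task Manager.
Is this because the Ubuntu server has no X server available?
How can I get the HTML?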

laravel

Source: https://stackoverflow.com/questions/61665180/webscraping-symfony-panther-cant-get-html

2 Answers


64jmpszr1#

$client = Client::createChromeClient();
$crawler = $client->request('GET', 'http://example.com');

// Get the full HTML of the page
$html = $client->getCrawler()->html();

// For example, select the element with ID "AuthenticationBlock" and get its text

$loginUsername = $client->getCrawler()->filter('#AuthenticationBlock')->text();
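
If the element is rendered by JavaScript after the initial page load, reading it immediately may fail with "no nodes available". The following is only a sketch under that assumption: it waits for the element with Panther's waitFor() before reading it. The Composer autoload path and the 10-second timeout are assumptions, and #AuthenticationBlock is just the ID from the snippet above (it does not exist on example.com).

```php
<?php
// Sketch only: assumes symfony/panther is installed via Composer and that
// Chrome plus a matching chromedriver are available on this machine.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$client->request('GET', 'http://example.com');

// Block until the element exists in the DOM; the second argument is the
// timeout in seconds. A timeout exception is thrown if it never appears.
$client->waitFor('#AuthenticationBlock', 10);

// Read the text only after the wait has succeeded.
$text = $client->getCrawler()->filter('#AuthenticationBlock')->text();
echo $text, PHP_EOL;

// Shut down the browser and the chromedriver process.
$client->quit();
```

This does not address a browser that fails to start at all; it only helps when the page loads but the target node is not there yet.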

2022-11-26

fgw7neuy2#

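So, I'm probably late, but I ran into the same problem and there is a very simple solution: just open a plain crawler on the response content.
This crawler differs from the Panther DomCrawler, especially in its methods, but it is safer when evaluating the HTML structure.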

$client = Client::createChromeClient();
$client->request('GET', 'http://example.com');

$html = $client->getInternalResponse()->getContent();
$crawler = new Symfony\Component\DomCrawler\Crawler($html);

// you can use following to get the whole HTML
$crawler->outerHtml();

// or specific parts
$crawler->filter('.some-class')->outerHtml();
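
To try the plain DomCrawler part on its own, without a browser, here is a minimal sketch. It assumes symfony/dom-crawler and symfony/css-selector are installed via Composer; the HTML string is made up and stands in for the value returned by getInternalResponse()->getContent().

```php
<?php
// Sketch only: assumes symfony/dom-crawler and symfony/css-selector
// are installed via Composer (filter() needs the CSS selector component).
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up markup standing in for the page source fetched by Panther.
$html = '<html><body><div class="some-class"><p>Hello</p></div></body></html>';

$crawler = new Crawler($html);

// outerHtml() includes the matched element itself;
// html() returns only the markup of its children.
$outer = $crawler->filter('.some-class')->outerHtml();
$inner = $crawler->filter('.some-class')->html();

echo $outer, PHP_EOL;
echo $inner, PHP_EOL;
```

Because this crawler works on a static string, it reflects the page as it was when the content was fetched and does not see later DOM changes.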

2022-11-26
