I'm trying to use Puppeteer to get the contents of iframes nested several levels deep.
Here is an example of nested iframes.
top.html:

<html>
  <title>top</title>
  <body>
    <p>top text</p>

    <iframe src="1.html"></iframe>

    <hr />

    <iframe src="2.html"></iframe>
  </body>
</html>

1.html:

<html>
  <title>1</title>
  <body>
    <p>1 text</p>

    <iframe src="1-1.html"></iframe>

  </body>
</html>

1-1.html:

<html>
  <title>1-1</title>
  <body>
    <p>1-1 text</p>
  </body>
</html>

2.html:

<html>
  <title>2</title>
  <body>
    <p>2 text</p>
  </body>
</html>

My end goal is to get an HTML string like this:

<html>
  <title>top</title>
  <body>
    <p>top text</p>

    <iframe>

      <html>
        <title>1</title>
        <body>
          <p>1 text</p>

          <iframe>
            <html>
              <title>1-1</title>
              <body>
                <p>1-1 text</p>
              </body>
            </html>
          </iframe>

        </body>
      </html>

    </iframe>

    <hr />

    <iframe>
      <html>
        <title>2</title>
        <body>
          <p>2 text</p>
        </body>
      </html>
    </iframe>

  </body>
</html>

The presence or position of the iframe, html, and body tags is not very important, so the following would also work for me:

<p>top text</p>

  <p>1 text</p>

    <p>1-1 text</p>

  <p>2 text</p>

After a lot of trial and error, I've had some success at a single depth:

import { launch } from 'puppeteer';

(async () => {
  const browser = await launch({
    headless: 'new',
    args: [
      '--disable-web-security',
      '--disable-features=IsolateOrigins,site-per-process'
    ]
  });
  const page = await browser.newPage();
  await page.goto('file:///C:/test/src/top.html', { waitUntil: 'networkidle0' });

  const iframes = await page.$$("iframe");
  for (const iframe of iframes) {
    const frame = await iframe.contentFrame();
    if (!frame) continue;
    const context = await frame.executionContext();
    const res = await context.evaluate(() => document.querySelector("*").outerHTML);
    if (res) {
      await iframe.evaluate((a, res) => {
        a.insertAdjacentHTML('afterend', res);
        a.remove();
      }, res);
    }
  }

  const htmlContent = await page.content();
  console.log(htmlContent);

})();

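This only works one level deep.
I've tried to fix it with recursion, without success.
In particular, I don't fully understand the difference between code that runs inside evaluate and code that runs outside it.
I also expect there is an easier way, completely different from what I've tried.
I imagine there are many cases where you want all of the content at a given URL, with nothing left out.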

<head></head><body><p>top text</p> <iframe src="1.html"><head></head><body><p>1 text</p> <iframe src="1-1.html"><head></head><body><p>1-1 text</p> </body></iframe> </body></iframe> <hr> <iframe src="2.html"><head></head><body><p>2 text</p> </body></iframe> </body>

1 Answer


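After writing hundreds of Puppeteer scripts, I've come to realize that writing browser console code is easier than working with element handles. If you don't need trusted events, you can treat Puppeteer as a thin wrapper that lets you programmatically run plain, native console code.
Your goal can be achieved with handles, but I'd rather do it entirely in the browser, where you can work with the DOM directly and synchronously.
Here's a sketch. It's not perfect, but it should give you a reasonable starting point to adapt to your use case. You can use outerHTML instead of innerHTML, or try adding the <html> root using techniques from other threads as needed.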
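A minimal sketch of this in-browser approach might look like the following. It is an illustration under stated assumptions, not the answer's original code: the function name inlineFrames and the command-line URL argument are hypothetical, the launch flags are copied from the question, and every iframe is assumed to be readable through contentDocument (same-origin, or web security disabled as in the question).

```javascript
// Serializes a document with the HTML of each nested iframe placed inside
// its <iframe> tag. When passed to page.evaluate it runs in the page, where
// `doc` defaults to the page's own document.
function inlineFrames(doc = document) {
  // Work on a detached deep copy so the live page is left untouched.
  const root = doc.documentElement.cloneNode(true);
  const originals = [...doc.querySelectorAll("iframe")];
  const copies = [...root.querySelectorAll("iframe")];
  originals.forEach((frame, i) => {
    // contentDocument is null for frames the page may not read; skip those.
    const inner = frame.contentDocument;
    if (inner) {
      // Text inside an <iframe> element is serialized literally, so the
      // nested markup is stored as text and reappears as-is in innerHTML.
      copies[i].textContent = inlineFrames(inner);
    }
  });
  return root.innerHTML;
}

async function main(url) {
  // Loaded lazily so inlineFrames can be exercised without Puppeteer installed.
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch({
    // Flags taken from the question; they relax cross-frame access checks.
    args: [
      "--disable-web-security",
      "--disable-features=IsolateOrigins,site-per-process",
    ],
  });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle0" });
    // Puppeteer stringifies the function and runs it in the page.
    console.log(await page.evaluate(inlineFrames));
  } finally {
    await browser.close();
  }
}

// Hypothetical usage: node inline-frames.js file:///C:/test/src/top.html
if (process.argv[2]) {
  main(process.argv[2]).catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

Because the whole walk happens inside one evaluate call, no element handles cross the Node/browser boundary. Swapping root.innerHTML for root.outerHTML would keep the <html> wrapper at each level.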

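The resulting output is the single-line HTML string shown at the end of the question above; if it is hard to read, run it through Prettier with --parser html.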

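Here's pseudocode for the algorithm above:

For the root document, call a recursive function that performs the following steps:
1. Recursively walk all child iframe elements, collecting the HTML of their documents into an array.
2. Parse the current node's HTML and fill each iframe with the HTML from the recursive step above.
3. Pass this document's HTML string, with the filled-in iframe content, back up the call stack so its parent can use it to fill its own iframe.

The base case is a document with no children. It simply passes its HTML straight up to its parent, which starts filling in its iframes.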
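To trace the recursion without a browser, the same steps can be modelled on plain strings. This is only a toy: the pages map is a hypothetical in-memory stand-in for the four files in the question, expand is an illustrative name, and the regular expression only recognizes the exact empty-iframe markup used here, so it is not a general HTML parser.

```javascript
// Hypothetical stand-in for top.html, 1.html, 1-1.html and 2.html.
const pages = {
  "top.html":
    '<p>top text</p><iframe src="1.html"></iframe><hr><iframe src="2.html"></iframe>',
  "1.html": '<p>1 text</p><iframe src="1-1.html"></iframe>',
  "1-1.html": "<p>1-1 text</p>",
  "2.html": "<p>2 text</p>",
};

// Returns the HTML of `name` with every iframe filled by its own expanded document.
function expand(name) {
  return pages[name].replace(
    /<iframe src="([^"]+)"><\/iframe>/g,
    // Recurse into the child first, then place its HTML inside the iframe tag.
    // A page without iframes never reaches this callback: that is the base case.
    (_match, src) => `<iframe>${expand(src)}</iframe>`
  );
}

console.log(expand("top.html"));
// <p>top text</p><iframe><p>1 text</p><iframe><p>1-1 text</p></iframe></iframe><hr><iframe><p>2 text</p></iframe>
```

Each call returns a finished string to its caller, which is the "pass it back up the call stack" step from the pseudocode.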
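Caveat emptor: in web scraping, the goal is rarely to get all of the HTML content, so if you're treating this as a sub-step you believe is necessary for a larger goal, be careful not to fall into the XY problem. In 99.9% of typical web scraping or testing situations, this isn't necessary.
Also, there's no silver bullet in web scraping, so I'd expect this to break on plenty of websites for all sorts of surprising reasons.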
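One likely failure is a frame the page's own script is not allowed to read, where contentDocument is null. As a hedged alternative for the looser output the question says is acceptable, the sketch below walks Puppeteer's frame tree from Node and collects each frame's HTML separately instead of nesting it. The name collectFrames and the command-line URL argument are hypothetical, and the order returned by childFrames() is assumed, not guaranteed, to follow document order.

```javascript
// Depth-first walk over a Puppeteer frame tree. Returns one entry per frame
// with its nesting depth, URL and serialized HTML.
async function collectFrames(frame, depth = 0) {
  const entries = [{ depth, url: frame.url(), html: await frame.content() }];
  for (const child of frame.childFrames()) {
    entries.push(...(await collectFrames(child, depth + 1)));
  }
  return entries;
}

async function main(url) {
  // Loaded lazily so collectFrames can be exercised without Puppeteer installed.
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle0" });
    const entries = await collectFrames(page.mainFrame());
    for (const { depth, url: frameUrl, html } of entries) {
      // Indent by depth to show the nesting.
      console.log(`${"  ".repeat(depth)}${frameUrl} (${html.length} chars)`);
    }
  } finally {
    await browser.close();
  }
}

// Hypothetical usage: node collect-frames.js https://example.com/
if (process.argv[2]) {
  main(process.argv[2]).catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

Each frame.content() call here runs in that frame's own context, which is the "outside evaluate" side of the boundary the question asks about: Node code orchestrates, and only the serialization happens in the browser.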

2023-05-17

NodeJS: How to recursively get iframe content with Puppeteer
