JavaScript: Get the complete web page source HTML with Puppeteer, but some parts are always missing

Posted by rhfm7lfc on 2023-03-28 in Java

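I am trying to scrape specific strings from the web page below:
https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;
The information I want from this page's source is the serial number in the strings below (I can find them by right-clicking and choosing "View Page source"):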
 name="nr_rooms_4377601_232287150_0_1_0"/ name="nr_rooms_4377601_232287150_1_1_0"

I am using Puppeteer, and here is my code:

const puppeteer = require('puppeteer');
(async() => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //await page.goto('https://example.com');
    const response = await page.goto("My-url-above");
    let bodyHTML = await page.evaluate(() => document.body.innerHTML);
    let outbodyHTML = await page.evaluate(() => document.body.outerHTML);
    console.log(await response.text());
    console.log(await page.content());
    await browser.close();
})()

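But I cannot find the strings I am looking for in either response.text() or page.content().
Am I using the wrong methods on the page?
How can I dump the actual page source, exactly the same as what I see when I right-click and view the source?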

JavaScript

Source: https://stackoverflow.com/questions/63614065/get-complete-web-page-source-html-with-puppeteer-but-some-part-always-missing

2 Answers

kzipqqlq1#

If you look at where these strings appear, you can see that they are in <select> elements with a specific class (.hprt-nos-select):

<select
  class="hprt-nos-select"
  name="nr_rooms_4377601_232287150_0_1_0"
  data-component="hotel/new-rooms-table/select-rooms"
  data-room-id="4377601"
  data-block-id="4377601_232287150_0_1_0"
  data-is-fflex-selected="0"
  id="hprt_nos_select_4377601_232287150_0_1_0"
  aria-describedby="room_type_id_4377601 rate_price_id_4377601_232287150_0_1_0 rate_policies_id_4377601_232287150_0_1_0"
>

Wait until this element has been loaded into the DOM; after that it will also be visible in the page source:

await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
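
The wait-then-read step above can be sketched as a pair of small helpers. This is only an illustration: the function names `extractRoomSelectNames` and `findRoomSelectNames`, the regular expression, and the 30-second default timeout are assumptions rather than part of the original answer, and `page` is expected to be a Puppeteer Page that has already navigated to the hotel URL.

```javascript
// Pure helper (hypothetical name): collect every name="nr_rooms_..." attribute
// value from an HTML string, in document order.
function extractRoomSelectNames(html) {
  const pattern = /name="(nr_rooms_[0-9_]+)"/g;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

// Hypothetical helper: wait until the room <select> elements exist in the DOM,
// then search the serialized DOM returned by page.content().
// The 30000 ms default is an assumption; the answer itself uses { timeout: 0 }.
async function findRoomSelectNames(page, timeoutMs = 30000) {
  await page.waitForSelector(".hprt-nos-select", { timeout: timeoutMs });
  const html = await page.content();
  return extractRoomSelectNames(html);
}
```

Calling `findRoomSelectNames(page)` right after `page.goto(...)` would return an array of the `nr_rooms_...` names, or reject with a timeout error if the elements never appear, which is itself a useful signal that the offers table was not rendered.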

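But your actual problem is that the URL you visit has some extra URL parameters: ?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl; which are not taken into account by Puppeteer. If you take a full-page screenshot, you will see that the page still shows the default hotel search form without the specific hotel offers, rather than what you expected.
You should interact with the search form using Puppeteer (page.click() and so on) to set the dates and the origin country yourself and get the expected page content.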
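
Before scripting the search form, it may help to confirm what the headless browser actually rendered. The sketch below takes the full-page screenshot mentioned above and counts the room dropdowns. The helper name `captureDiagnostics`, the default file name `booking-debug.png`, and the shape of the returned object are assumptions made for illustration.

```javascript
// Hypothetical diagnostic helper; `page` is a Puppeteer Page after page.goto().
async function captureDiagnostics(page, screenshotPath = "booking-debug.png") {
  // A full-page screenshot shows what the headless browser really rendered.
  await page.screenshot({ path: screenshotPath, fullPage: true });

  // Count the room-selection dropdowns. $$eval passes an empty array when
  // nothing matches, so a missing offers table yields 0 instead of throwing.
  const roomSelectCount = await page.$$eval(
    ".hprt-nos-select",
    (elements) => elements.length
  );

  // page.url() reports the URL after any redirects.
  return { finalUrl: page.url(), roomSelectCount, screenshotPath };
}
```

A `roomSelectCount` of 0 together with a screenshot of the default search form would match the situation described in this answer.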

2023-03-28

zzlelutf2#

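It seems that booking.com is blocking you. I strongly recommend using Puppeteer together with the puppeteer-extra and puppeteer-extra-plugin-stealth packages, which help prevent websites from detecting that you are using headless Chromium or a web driver.
After navigating to the URL, you need to wait for the page to load: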

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

const { executablePath } = require("puppeteer");

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ["--no-sandbox", "--disable-setuid-sandbox", "--window-size=1600,900", "--single-process"],
    executablePath: executablePath(),
  });

  const page = await browser.newPage();
  await page.setViewport({
    width: 1280,
    height: 720,
  });
  const url = "https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl";
  await page.goto(url);
  // wait for load selector with id=hp_hotel_name
  await page.waitForSelector("#hp_hotel_name");

  // now you can do what you want

  await browser.close();
})();
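
One way to check whether a site might be refusing the request is to inspect the response that `page.goto()` resolves to. The helper below is a rough sketch: the name `describeNavigation` is hypothetical, and treating 403 and 429 as "possibly blocked" is an assumption, not documented booking.com behavior.

```javascript
// Hypothetical helper: summarize the HTTPResponse (or null) returned by page.goto().
function describeNavigation(response) {
  // page.goto() can resolve to null, for example on a same-page hash navigation.
  if (response === null) {
    return "no response";
  }
  const status = response.status();
  if (response.ok()) {
    // ok() is true for status codes in the 200-299 range.
    return `ok (${status})`;
  }
  // Assumption: these codes often indicate bot detection or rate limiting.
  if (status === 403 || status === 429) {
    return `possibly blocked (${status})`;
  }
  return `unexpected status (${status})`;
}
```

In the script above this could be used as `const response = await page.goto(url); console.log(describeNavigation(response));` before waiting for `#hp_hotel_name`.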

As an alternative, to get all the information about the hotel, you can use the hotels-scraper-js library. Your code would then be:

import { booking } from "hotels-scraper-js";

booking.getHotelInfo("https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html").then((result) => console.dir(result, { depth: null }));

The output looks like this:

{
   "title":"Sanadome Nijmegen",
   "type":"Hotel",
   "stars":4,
   "preferredBadge":true,
   "subwayAccess":false,
   "sustainability":"",
   "address":"Weg door Jonkerbos 90, 6532 SZ Nijmegen, Netherlands",
   "highlights":[

   ],
   "description":"You're eligible for a Genius discount at Sanadome Nijmegen!"... and more description,
   "descriptionHighlight":"Couples particularly like the location — they rated it 8.3 for a two-person trip.",
   "descriptionSummary":"Sanadome Nijmegen has been welcoming Booking.com guests since 10 Jun 2010.",
   "facilities":["Indoor swimming pool", "Parking on site",... and more facilities],
   "areaInfo":[
      {
         "What's nearby":[
            {
               "place":"Goffertpark",
               "distance":"650 m"
            },
            ... and more nearby places
         ]
      },
      ... and other area info
   ],
   "link":"https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html",
   "photos":[
      "https://cf.bstatic.com/xdata/images/hotel/max1024x768/196181914.jpg?k=e37d21c8a403e920b868bcd7845dbca656d772bc114dc10473a76de52afc67bc&o=&hp=1",
      "https://cf.bstatic.com/xdata/images/hotel/max1024x768/225703925.jpg?k=0d4938ca6752057ba607d2fd7fb8cf95cec000770a68738b92ef3b6688e8a62e&o=&hp=1",
      ... and other photos
   ],
   "reviewsInfo":{
      "score":7.8,
      "scoreDescription":"Rated good",
      "totalReviews":823,
      "categoriesRating":[
         {
            "Staff":8.5
         },
         ... and other categories
      ],
      "reviews":[
         {
            "name":"Ewelina",
            "avatar":"https://cf.bstatic.com/static/img/review/avatars/ava-e/8d80ab6bf73fa873e990c76bfc96a1bf23708307.png",
            "country":"Poland",
            "date":"16 February 2023",
            "reting":"10",
            "review":[
               {
                  "liked":"very beautiful surroundings.  I love the peace and quiet around 🥰"
               }
            ]
         },
         ... and other reviews
      ]
   }
}
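
Given a result shaped like the output above, a few fields can be picked out as follows. This is a minimal sketch that relies only on the field names shown in that output (`title`, `stars`, `reviewsInfo.score`, `reviewsInfo.totalReviews`); the helper name `summarizeHotel` and the fallback values are assumptions.

```javascript
// Hypothetical helper: build a one-line summary from a hotel result object.
// Missing fields fall back to placeholder values instead of throwing.
function summarizeHotel(result) {
  const title = result.title ?? "(unknown hotel)";
  const stars = result.stars ?? "?";
  const score = result.reviewsInfo?.score ?? "n/a";
  const totalReviews = result.reviewsInfo?.totalReviews ?? 0;
  return `${title} (${stars} stars): score ${score} from ${totalReviews} reviews`;
}
```

It could be chained onto the call above, for example `.then((result) => console.log(summarizeHotel(result)))`.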

2023-03-28
