Extracting a reCAPTCHA from a web page to solve it externally via cURL, then returning the result to view the page

368yc8dk  posted on 2022-11-13  in  Other

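I am building a web scraper for personal use that scrapes car dealership sites based on my own input, but several of the sites I am trying to collect data from redirect me to a CAPTCHA page that blocks the request.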

<html>
   <head>
      <title>You have been blocked</title>
      <style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style>
   </head>
   <body style="margin:0">
      <p id="cmsg">Please enable JS and disable any ad blocker</p>
      <script>
            var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
      </script>
      <script src="https://ct.captcha-delivery.com/c.js"></script>
   </body>
</html>
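The only page-specific data in this block page is the `dd` object, so pulling it out is a natural first step. Below is a minimal sketch of how that could be done in PHP. The function name `extract_dd_config` is hypothetical, and the quote swap assumes the values never contain quote characters, as in the sample above.

```php
<?php
// Hypothetical helper (not part of the original scraper): returns the `dd`
// settings from a block page as an associative array, or null if absent.
function extract_dd_config(string $html): ?array
{
    // Capture the object literal assigned to `var dd = {...}`.
    if (!preg_match('/var\s+dd\s*=\s*(\{.*?\})/s', $html, $m)) {
        return null;
    }
    // Assumption: values contain no quotes, so swapping ' for " yields valid JSON.
    $json   = str_replace("'", '"', $m[1]);
    $config = json_decode($json, true);

    return is_array($config) ? $config : null;
}

// The script block copied from the block page shown above.
$blockPage = <<<'HTML'
<script>
      var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
</script>
HTML;

$dd = extract_dd_config($blockPage);
echo $dd['host'], "\n"; // geo.captcha-delivery.com
```

A `null` return means the page did not contain a `dd` object, which can double as a rough "was I blocked?" check.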

I am using this to scrape the page:

<?php

function web_scrape($url)
{
    $ch = curl_init();
    $imei = "013977000272744";

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, 1);
    curl_setopt($ch, CURLOPT_COOKIE, '_ym_uid=1460051101134309035;  _ym_isad=1; cxx=80115415b122e7c81172a0c0ca1bde40; _ym_visorc_20293771=w');
    curl_setopt($ch, CURLOPT_POSTFIELDS, array(
        'imei' => $imei,
    ));

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $server_output = curl_exec($ch);
    curl_close($ch); // close the handle before returning

    return $server_output;

}
echo web_scrape($url);

?>

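To restate what I am trying to do: I want to pull the CAPTCHA out of this page, so that when I want to view the page details from my external site I can complete the CAPTCHA there and then scrape the page I originally requested. Any response would be great!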

rryofs0p #1

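Since code was strongly requested, here is my upgraded scraper, which gets around this particular problem. However, my attempts to fetch the CAPTCHA itself did not work, and I still have not worked out how to get it.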

include "simple_html_dom.php";
  /**
   * Get a web file (HTML, XHTML, XML, image, etc.) from a URL.  Return an
   * array containing the HTTP server response header fields and content.
   */
  // This function is where the magic comes from. It bypasses every piece of security carsales.com.au can throw at me
  function get_web_page( $url ) { 
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_USERAGENT      => "spider", // who am i
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT        => 120,      // timeout on response
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        CURLOPT_SSL_VERIFYPEER => false     // Disabled SSL Cert checks
    );

    $ch      = curl_init( $url );        // create a cURL handle for the target URL
    curl_setopt_array( $ch, $options );  // apply all of the options defined above
    $content = curl_exec( $ch );         // run the request; holds the page HTML, or false on failure
    $err     = curl_errno( $ch );        // numeric cURL error code (0 means no error)
    $errmsg  = curl_error( $ch );        // human-readable cURL error message (empty string when there is no error)
    $header  = curl_getinfo( $ch );      // array of transfer info (final URL, HTTP status code, timings, ...)
    curl_close( $ch );                   // release the cURL handle

    $header['errno']   = $err;     // add the error code to the result array
    $header['errmsg']  = $errmsg;  // add the error message to the result array
    $header['content'] = $content; // add the page HTML to the result array
    return $header;                // transfer info plus errno, errmsg and content, used when presenting the search results
  }

  // Using the function we just made, fetch the URL generated by the form to get a developer view of the scrape.
  $response_dev = get_web_page($url);
  
  // print_r($response_dev);

  $response = $response_dev['content']; // keep only the page HTML; the rest is for my eyes only in case the site runs into an issue
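
Before handing the HTML to the parser, it may help to check what actually came back. The following is a rough sketch, not something I have run against the live site. `classify_response` is a hypothetical helper, the "blocked" test assumes the block page always references captcha-delivery.com as it does in the question, and the sample arrays (including the 403 status) are made-up stand-ins for `get_web_page()` output.

```php
<?php
// Hypothetical helper: classifies the array returned by get_web_page().
// Assumption: a block page always mentions captcha-delivery.com, as the
// one shown in the question does.
function classify_response(array $result): string
{
    if (($result['errno'] ?? 0) !== 0 || !is_string($result['content'] ?? null)) {
        return 'transport_error'; // cURL itself failed (DNS, timeout, TLS, ...)
    }
    if (strpos($result['content'], 'captcha-delivery.com') !== false) {
        return 'blocked';         // the CAPTCHA page came back instead of the listing
    }
    $status = $result['http_code'] ?? 0;
    return ($status >= 200 && $status < 300) ? 'ok' : 'http_error';
}

// Stand-in arrays shaped like get_web_page() output, so no network is needed.
$blocked = ['errno' => 0, 'http_code' => 403,
            'content' => '<script src="https://ct.captcha-delivery.com/c.js"></script>'];
$fine    = ['errno' => 0, 'http_code' => 200, 'content' => '<html>listing</html>'];

echo classify_response($blocked), "\n"; // blocked
echo classify_response($fine), "\n";    // ok
```

The `errno` and `http_code` keys are the ones `curl_errno()` and `curl_getinfo()` already put into the array above, so the helper can be called directly on `$response_dev`.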

lb3vh1jj #2

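DataDome currently uses reCAPTCHA v2 and GeeTest CAPTCHAs, so your script should do the following:
1. Navigate to the redirect https://geo.captcha-delivery.com/captcha/?initialCid=…
2. Detect which type of CAPTCHA is being used.
3. Use any CAPTCHA-solving service (such as Anti-Captcha) to obtain a token for this CAPTCHA.
4. Submit the token and check whether you are redirected to the target page.
5. Sometimes the target page contains an iframe whose address is https://geo.captcha-delivery.com/captcha/?initialCid=. . , so you need to repeat from step 2 inside that iframe.
I am not sure whether the steps above can be done in PHP, but you can do them with a browser automation engine such as Puppeteer, a library for NodeJS. It launches a Chromium instance and simulates the presence of a real user. NodeJS is a must if you want to build professional scrapers, and it is worth spending some time on YouTube to learn it. Here is a script that performs all of the steps above: https://github.com/MoterHaker/bypass-captcha-examples/blob/main/geo.captcha-delivery.com.js You will need a proxy to get past the GeeTest protection.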
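
For what it is worth, the purely "inspect the HTML" parts of the list (steps 2 and 5) need no browser and can be sketched in plain PHP. This is only an illustration: both function names are hypothetical, the markers used to recognise each CAPTCHA type (`g-recaptcha`, `recaptcha/api.js`, `initGeetest`) are assumptions about typical widget markup rather than anything DataDome documents, and `EXAMPLE` is a placeholder value. Solving and submitting the token (steps 3 and 4) are not covered.

```php
<?php
// Hypothetical helper for step 2: guesses the CAPTCHA type from page markup.
// Assumed markers: reCAPTCHA v2 pages contain "g-recaptcha" or load
// recaptcha/api.js; GeeTest pages call initGeetest().
function detect_captcha_type(string $html): string
{
    if (stripos($html, 'g-recaptcha') !== false || stripos($html, 'recaptcha/api.js') !== false) {
        return 'recaptcha_v2';
    }
    if (stripos($html, 'initGeetest') !== false) {
        return 'geetest';
    }
    return 'unknown';
}

// Hypothetical helper for step 5: returns the src of the first iframe that
// points at the CAPTCHA host, or null when the page has none.
function find_captcha_iframe(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        $src = $iframe->getAttribute('src');
        if (strpos($src, 'https://geo.captcha-delivery.com/captcha/') === 0) {
            return $src;
        }
    }
    return null;
}

// Placeholder markup; EXAMPLE stands in for real values.
$page = '<html><body><iframe src="https://geo.captcha-delivery.com/captcha/?initialCid=EXAMPLE"></iframe></body></html>';
echo find_captcha_iframe($page), "\n";
echo detect_captcha_type('<div class="g-recaptcha" data-sitekey="EXAMPLE"></div>'), "\n"; // recaptcha_v2
```

If `find_captcha_iframe()` returns a URL, that URL would be fetched and passed through `detect_captcha_type()` again, which mirrors the "repeat from step 2" instruction.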
