PHP curl HEAD request returns 404, but the page loads fine in a browser (or with a default curl GET request)

Posted by w8f9ii69 on 2023-10-19 in PHP

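I have a form with fields for several URLs. I wrote a Zend Framework validator that runs a simple preg_match to screen out nonsensical strings, then sends a curl HEAD request (CURLOPT_NOBODY) to screen out 404s and other connectivity problems. During testing I ran into mysterious return codes of 0 with the message "Unknown SSL protocol error", so I added a check that accepts as valid anything whose message contains "SSL", on the grounds that this would indicate the URL reached a web server.
However, one particular URL that our customers are likely to use in practice redirects to an s3.amazonaws.com URL for a PDF file. In a browser, both the original URL and the s3 URL it redirects to display the PDF just fine. Since I use CURLOPT_FOLLOWLOCATION, I expected my validator to accept it. Instead, it got a 404. I then tried specifying the s3 URL directly and got a 403 (!). I thought the 403 might be triggered by the header I was sending, 'HTTP_X_REQUESTED_WITH: XMLHttpRequest', so I commented out that line in the code. It still returned a 403.
How can that be? It seems to me that Amazon S3 would have to explicitly look for HEAD requests and deliberately issue a 404 or a 403 depending on whether the request came through a redirect???
I suppose I could remove CURLOPT_NOBODY and let it send a GET request, but that seems silly since I don't care about the body.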

Here is my complete code:

<?php

class Oshk_ZendX_Validate_Url {
    static $debug = true;
    // Based on https://stackoverflow.com/a/42619410/467590
    const PATTERN = '/^(https?:\/\/)?[^" ]+(\.[^" ]+)*$/';

    public static function isValid($value) {
        $STDERR = fopen("php://stderr", "w");
        $value = (string) $value;
        $matches = array();
        if (! preg_match(self::PATTERN, $value, $matches)) {
            fwrite($STDERR, sprintf("File '%s', line %d, value '%s' does not match pattern '%s'\n", __FILE__, __LINE__, $value, self::PATTERN));
            fclose($STDERR);
            return false;
        }
        if (! array_key_exists(1, $matches)) {
            $value = "https://$value";
        }
        if (self::$debug) {
            fwrite($STDERR, sprintf("File '%s', line %d, \$value = '%s', \$matches = %s", __FILE__, __LINE__, $value, print_r($matches, true)));
        }
        // URL looks well-formed. Ask curl to send a HEAD request to it
        $ch = curl_init($value);
        if ($ch === false) {
            throw new Exception("curl_init($value) failed!");
        }
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HEADER, 0); // From https://www.php.net/manual/en/curl.examples-basic.php
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('HTTP_X_REQUESTED_WITH: XMLHttpRequest'));
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36');
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_NOBODY, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        if (self::$debug) {
            curl_setopt($ch, CURLOPT_VERBOSE, true);
            curl_setopt($ch, CURLOPT_STDERR, $STDERR);
        }
        $data = curl_exec($ch);
        $msg = curl_error($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        if (self::$debug) {
            // https://stackoverflow.com/a/14436877/467590
            $allinfo = curl_getinfo($ch);
            fwrite($STDERR, sprintf("File '%s', line %d, \$allinfo = %s\n", __FILE__, __LINE__, print_r($allinfo, true)));
        }
        curl_close($ch);
        if (self::$debug) {
            fwrite($STDERR,  sprintf("File '%s', line %d, data = '%s'\n", __FILE__, __LINE__, substr($data, 0, 255)));
        }
        if(! strlen($data) && $status != 0 && false === strpos($msg, 'SSL')) {
            fwrite($STDERR, sprintf("File '%s', line %d, '%s' gives bad status code %d when accessed, with message '%s'\n", __FILE__, __LINE__, $value, $status, $msg));
            fclose($STDERR);
            return false;
        }
        if (self::$debug) {
            fwrite($STDERR, sprintf("File '%s', line %d, url = '%s'\n", __FILE__, __LINE__, $value));
            fwrite($STDERR, sprintf("File '%s', line %d, data = '%s'\n", __FILE__, __LINE__, substr($data, 0, 255)));
        }
        unset($data);
        if (self::$debug) {
            fwrite($STDERR, sprintf("File '%s', line %d, \$msg = '%s'\n", __FILE__, __LINE__, $msg));
            fwrite($STDERR, sprintf("File '%s', line %d, \$status = '%s'\n", __FILE__, __LINE__, $status));
            fwrite($STDERR, sprintf("File '%s', line %d, \$value = '%s'\n", __FILE__, __LINE__, $value));
        }
        if (($status >= 100 && $status < 400) || false !== strpos($msg, 'SSL')) {
            fclose($STDERR);
            return true;
        }
        fwrite($STDERR, sprintf("File '%s', line %d, '%s' gives bad status code %d when accessed, with message '%s'\n", __FILE__, __LINE__, $value, $status, $msg));
        fclose($STDERR);
        return false;
    }
}

var_dump(Oshk_ZendX_Validate_Url::isValid($argv[1]));

Here is a bash shell session running it with the original URL:

$ php curltest.php 'https://americandrivingsociety.org/docs.ashx?id=1037680'
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 21, $value = 'https://americandrivingsociety.org/docs.ashx?id=1037680', $matches = Array
(
        [0] => https://americandrivingsociety.org/docs.ashx?id=1037680
        [1] => https://
)
*   Trying 208.66.171.71:443...
* Connected to americandrivingsociety.org (208.66.171.71) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: \xampp7412\apache\bin\curl-ca-bundle.crt
    CApath: none
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=americandrivingsociety.org
*  start date: Sep  2 00:00:00 2022 GMT
*  expire date: Oct  3 23:59:59 2023 GMT
*  subjectAltName: host "americandrivingsociety.org" matched cert's "americandrivingsociety.org"
*  issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Domain Validation Secure Server CA
*  SSL certificate verify ok.
> HEAD /docs.ashx?id=1037680 HTTP/1.1
Host: americandrivingsociety.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
Accept: */*
HTTP_X_REQUESTED_WITH: XMLHttpRequest

* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
* The requested URL returned error: 404 Not Found
* Closing connection 0
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 46, $allinfo = Array
(
        [url] => https://americandrivingsociety.org/docs.ashx?id=1037680
        [content_type] =>
        [http_code] => 404
        [header_size] => 0
        [request_size] => 250
        [filetime] => -1
        [ssl_verify_result] => 0
        [redirect_count] => 0
        [total_time] => 0.132769
        [namelookup_time] => 0.009406
        [connect_time] => 0.035694
        [pretransfer_time] => 0.090879
        [size_upload] => 0
        [size_download] => 0
        [speed_download] => 0
        [speed_upload] => 0
        [download_content_length] => -1
        [upload_content_length] => -1
        [starttransfer_time] => 0.132714
        [redirect_time] => 0
        [redirect_url] =>
        [primary_ip] => 208.66.171.71
        [certinfo] => Array
                (
                )

        [primary_port] => 443
        [local_ip] => 16.1.1.151
        [local_port] => 55977
        [http_version] => 2
        [protocol] => 2
        [ssl_verifyresult] => 0
        [scheme] => HTTPS
        [appconnect_time_us] => 90757
        [connect_time_us] => 35694
        [namelookup_time_us] => 9406
        [pretransfer_time_us] => 90879
        [redirect_time_us] => 0
        [starttransfer_time_us] => 132714
        [total_time_us] => 132769
)

File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 50, data = ''
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 53, 'https://americandrivingsociety.org/docs.ashx?id=1037680' gives bad status code 404 when accessed, with message 'The requested URL returned error: 404 Not Found'
C:\xampp1826\htdocs\OSH0\curltest.php:77:
bool(false)

repete@DESKTOP-CLQS7C1 /cygdrive/c/xampp1826/htdocs/OSH0
$

Here is the same thing with the s3 URL that it redirects to:

$ php curltest.php 'https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D'
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 21, $value = 'https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D', $matches = Array
(
        [0] => https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D
        [1] => https://
)
*   Trying 52.216.56.0:443...
* Connected to s3.amazonaws.com (52.216.56.0) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: \xampp7412\apache\bin\curl-ca-bundle.crt
    CApath: none
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=s3.amazonaws.com
*  start date: Apr 11 00:00:00 2023 GMT
*  expire date: Dec 20 23:59:59 2023 GMT
*  subjectAltName: host "s3.amazonaws.com" matched cert's "s3.amazonaws.com"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
*  SSL certificate verify ok.
> HEAD /ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D HTTP/1.1
Host: s3.amazonaws.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36
Accept: */*
HTTP_X_REQUESTED_WITH: XMLHttpRequest

* Mark bundle as not supporting multiuse
* The requested URL returned error: 403 Forbidden
* Closing connection 0
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 46, $allinfo = Array
(
        [url] => https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D
        [content_type] =>
        [http_code] => 403
        [header_size] => 0
        [request_size] => 523
        [filetime] => -1
        [ssl_verify_result] => 0
        [redirect_count] => 0
        [total_time] => 0.128771
        [namelookup_time] => 0.027331
        [connect_time] => 0.043198
        [pretransfer_time] => 0.107906
        [size_upload] => 0
        [size_download] => 0
        [speed_download] => 0
        [speed_upload] => 0
        [download_content_length] => -1
        [upload_content_length] => -1
        [starttransfer_time] => 0.128721
        [redirect_time] => 0
        [redirect_url] =>
        [primary_ip] => 52.216.56.0
        [certinfo] => Array
                (
                )

        [primary_port] => 443
        [local_ip] => 16.1.1.151
        [local_port] => 56277
        [http_version] => 2
        [protocol] => 2
        [ssl_verifyresult] => 0
        [scheme] => HTTPS
        [appconnect_time_us] => 107740
        [connect_time_us] => 43198
        [namelookup_time_us] => 27331
        [pretransfer_time_us] => 107906
        [redirect_time_us] => 0
        [starttransfer_time_us] => 128721
        [total_time_us] => 128771
)

File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 50, data = ''
File 'C:\xampp1826\htdocs\OSH0\curltest.php', line 53, 'https://s3.amazonaws.com/ClubExpressClubFiles/548049/documents/Omnibus_02-01-2023_Black_Prong_Driving_Derby_4_581817244.pdf?AWSAccessKeyId=AKIA6MYUE6DNNNCCDT4J&Expires=1683645984&response-content-disposition=inline%3B%20filename%3DOmnibus_02-01-2023_Black_Prong_Driving_Derby_4.pdf&Signature=YQGemVm9Gphf2EZ%2F4K%2FIyK%2Bmm7I%3D' gives bad status code 403 when accessed, with message 'The requested URL returned error: 403 Forbidden'
C:\xampp1826\htdocs\OSH0\curltest.php:77:
bool(false)

repete@DESKTOP-CLQS7C1 /cygdrive/c/xampp1826/htdocs/OSH0
$

Answer 1, by oaxa6hgo:

> I added a check that accepts as valid anything whose message contains "SSL"
This seems dangerous. What if the error message is "invalid SSL certificate"?
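> on the grounds that this would indicate the URL reached a web server
That is true of any response: 300, 400, 500, whatever. If your connection doesn't time out, you have successfully connected to something, regardless of the status code. In other words, by this logic, if what you are validating is "reaches a web server", then only timeouts should fail.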
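
To make that distinction concrete, here is a minimal sketch that separates "no HTTP response at all" from "some HTTP response, whatever its status". The function names `classifyProbe` and `probeUrl` are hypothetical and do not appear in the original code, and the timeout values are arbitrary assumptions. It relies on curl reporting a response code of 0 when no server response was received.

```php
<?php
// Hypothetical helper: decide what a finished curl transfer tells us.
// $httpStatus is the value of CURLINFO_HTTP_CODE, which curl reports as 0
// when no HTTP response was received at all (DNS failure, refused
// connection, timeout, failed TLS handshake, ...).
function classifyProbe(int $httpStatus): string
{
    if ($httpStatus === 0) {
        return 'unreachable';          // nothing answered over HTTP
    }
    if ($httpStatus >= 400) {
        return 'reached-error-status'; // a server answered, but with 4xx/5xx
    }
    return 'reached-ok';               // a server answered with a status below 400
}

// Hypothetical wrapper that performs the request (needs the curl extension).
function probeUrl(string $url): string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return 'unreachable';
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD, as in the question
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // assumed values
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return classifyProbe($status);
}
```

With this split, a validator that only cares about "reaches a web server" would reject `'unreachable'` and accept the other two results, without inspecting the error message text.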
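> I suppose I could remove CURLOPT_NOBODY and let it send a GET request, but that seems silly since I don't care about the body.
You can't expect every URL to be reachable with a HEAD request, or the result of a HEAD request to always be the same as the result of a GET request.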
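
As an illustration of that point, the sketch below sends a GET request but asks for only the first byte through a Range header, so little of the body is transferred when the server honors it. This is one possible workaround, not something the answer proposes, and `isAcceptableStatus` and `statusViaRangedGet` are hypothetical names. It may also be relevant to the S3 case: pre-signed S3 URLs are signed for a specific HTTP method, so a URL signed for GET is generally rejected with 403 when requested with HEAD. Whether that is the cause here is an assumption; the thread does not confirm it.

```php
<?php
// Hypothetical helper: which final status codes count as "the URL works".
// 206 (Partial Content) is what a server sends when it honors the Range header.
function isAcceptableStatus(int $status): bool
{
    return $status >= 200 && $status < 400;
}

// Hypothetical helper: obtain the final status with GET instead of HEAD
// (needs the curl extension). Returns 0 if no HTTP response was received.
function statusViaRangedGet(string $url): int
{
    $ch = curl_init($url);
    if ($ch === false) {
        return 0;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body out of stdout
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RANGE, '0-0');         // request the first byte only
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed value
    curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $status;
}
```

A server that ignores the Range header will still send the whole body, so this only reduces the download where ranges are supported.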
> curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
Don't do this. If verification fails, you want the request to fail; that is the whole point of SSL.
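In general, I don't think there is any point in even making the request if you are not going to validate the actual contents of the page. Just validate the syntax of the URL. Otherwise you will get failures for things like transient network errors, maintenance downtime, ad blockers, IP-based filtering, and so on. You have a large amount of code there that should be just one line: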

class Oshk_ZendX_Validate_Url {
    public static function isValid(string $url): bool
    {
        return (bool) filter_var($url, FILTER_VALIDATE_URL);
    }
}
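
One detail worth noting is that FILTER_VALIDATE_URL requires a scheme and also accepts schemes other than http and https, whereas the question's code prepends "https://" when the scheme is missing. The following is a minimal sketch of how both behaviors could be combined; `normalizeAndValidateUrl` is a hypothetical helper, not part of the original answer.

```php
<?php
// Hypothetical helper: add a default scheme, validate the syntax, and
// restrict the result to http/https. Returns the normalized URL or null.
function normalizeAndValidateUrl(string $value): ?string
{
    $value = trim($value);
    // FILTER_VALIDATE_URL rejects scheme-less input, so default to https.
    if (strpos($value, '://') === false) {
        $value = 'https://' . $value;
    }
    $url = filter_var($value, FILTER_VALIDATE_URL);
    if ($url === false) {
        return null;
    }
    // The filter accepts other schemes (ftp, for example), so check explicitly.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true) ? $url : null;
}
```

A form validator could then treat a null result as invalid input and store the returned string otherwise.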

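If you also want to test connectivity and make sure a live server is answering requests at the time the form is submitted, then the status doesn't matter, and you can simply check for a non-false return value from the HTTP wrapper via file_get_contents():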

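class Oshk_ZendX_Validate_Url {
    public static function isValid(string $url): bool
    {
        // By default the HTTP wrapper makes file_get_contents() return false
        // on 4xx/5xx responses; ignore_errors makes any response count.
        $context = stream_context_create(['http' => ['ignore_errors' => true]]);
        return filter_var($url, FILTER_VALIDATE_URL) &&
            file_get_contents($url, false, $context) !== false;
    }
}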
