curl: how to get a list of external domains/origins from a webpage

wko9yo5t asked on 2022-11-13

I'm looking for a way to get all the external domains a website uses.
For example, stackoverflow.com uses:
googletagservices.com, google-analytics.com, fbcdn.net, i.stack.imgur.com, cdn.sstatic.net.
Is there a way to get this list of domains in bash or PHP? My google-fu has failed me.
Basically, this list:

Another example, using webpagetest.org:


cig3rfwq #1

<?php
// Download the remote web page
$websiteURL = "https://www.google.com";
$curl = curl_init($websiteURL);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
$webPageContent = curl_exec($curl);
if ($webPageContent === false)
    die('Download failed: ' . curl_error($curl));
print("Download size of main page: " . curl_getinfo($curl, CURLINFO_SIZE_DOWNLOAD) . "\n");
curl_close($curl);

// Match and extract the URLs from src attributes and <link> href attributes
preg_match_all('/src="([^"]*)"/m', $webPageContent, $matchessrc);              // all src URLs
preg_match_all('/<link[^>]*href="([^"]*)"/m', $webPageContent, $matcheslink); // all <link> href URLs

$matches = array_merge($matchessrc[1], $matcheslink[1]);
$domain = parse_url($websiteURL, PHP_URL_SCHEME) . '://' . parse_url($websiteURL, PHP_URL_HOST);
$path = parse_url($websiteURL, PHP_URL_PATH) ?? '';
$checked = array();
foreach ($matches as $m)
{
    if ($m === '')
        continue;
    if (substr($m, 0, 2) == '//')       // protocol-relative URL: prepend the scheme
        $m = parse_url($websiteURL, PHP_URL_SCHEME) . ':' . $m;
    elseif ($m[0] == '/')               // root-relative path: prepend the main domain
        $m = $domain . $m;
    elseif (substr($m, 0, 5) != 'http:' and substr($m, 0, 6) != 'https:')
        $m = $domain . rtrim($path, '/') . '/' . $m; // relative path: resolve against the page path
    if (in_array($m, $checked))         // skip duplicate resource URLs
        continue;
    $checked[] = $m;
}
print_r($checked); // print all unique resource URLs
?>
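
Note that the snippet above only collects resource URLs. To get the list of external domains the question actually asks for, you can reduce those URLs to their hosts with parse_url() and drop the page's own host. A minimal sketch, assuming $checked and $websiteURL from the code above:

<?php
// Assumes $checked (unique resource URLs) and $websiteURL from the snippet above.
$ownHost = parse_url($websiteURL, PHP_URL_HOST);
$domains = array();
foreach ($checked as $url) {
    $host = parse_url($url, PHP_URL_HOST);
    if ($host && $host !== $ownHost)  // keep only external hosts
        $domains[$host] = true;       // array keys deduplicate the list
}
print_r(array_keys($domains)); // e.g. googletagservices.com, google-analytics.com, ...
?>

Using array keys deduplicates the hosts without an extra in_array() scan on every iteration.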
