curl: How do I get a list of external domains/sources from a webpage?

Posted by wko9yo5t on 2022-11-13 in Other


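I'm looking for a way to get all the external domains used by a website.
For example, stackoverflow.com uses:
googletagservices.com, google-analytics.com, fbcdn.net, i.stack.imgur.com, cdn.sstatic.net.
Is there a way to get this list of domains in bash or PHP? My Google-fu has failed me.
Basically this list:

Another example, using webpagetest.org: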

curl

Source: https://stackoverflow.com/questions/53636919/how-do-i-get-a-list-of-external-domains-sources-from-a-webpage

1 answer


#1 cig3rfwq

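<?php
// Download the remote web page
$websiteURL = "https://www.google.com";
$curl = curl_init($websiteURL);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow redirects to the final page
$webPageContent = curl_exec($curl);
if ($webPageContent === false) {
    exit("Download failed: " . curl_error($curl) . PHP_EOL);
}
print("Download size of main page: " . curl_getinfo($curl, CURLINFO_SIZE_DOWNLOAD) . PHP_EOL);

// Match and extract the URLs in src attributes and in <link href="..."> attributes
preg_match_all('/\bsrc=["\']([^"\']*)["\']/i', $webPageContent, $matchessrc); // all src URLs
preg_match_all('/<link\b[^>]*?\bhref=["\']([^"\']*)["\']/i', $webPageContent, $matcheslink); // all link->href URLs

$matches = array_merge($matchessrc[1], $matcheslink[1]);
$scheme = parse_url($websiteURL, PHP_URL_SCHEME);
$host = parse_url($websiteURL, PHP_URL_HOST);
$domain = $scheme . '://' . $host;
$path = (string) parse_url($websiteURL, PHP_URL_PATH);
$dir = rtrim(substr($path, 0, (int) strrpos($path, '/')), '/'); // directory of the page, no trailing slash
$checked = array();       // unique absolute resource URLs
$externalHosts = array(); // unique hosts that differ from the page's own host
print_r($matches); // print all resource URLs as found in the HTML
foreach ($matches as $m)
{
    if ($m === '')
        continue;
    if (substr($m, 0, 2) == '//')                        // protocol-relative URL: reuse the page's scheme
        $m = $scheme . ':' . $m;
    elseif ($m[0] == '/')                                // root-relative path: prepend the main domain
        $m = $domain . $m;
    elseif (!preg_match('#^[a-z][a-z0-9+.-]*:#i', $m))   // no scheme: path relative to the page's directory
        $m = $domain . $dir . '/' . $m;
    if (in_array($m, $checked)) // skip duplicate resource URLs
        continue;
    $checked[] = $m;
    $h = parse_url($m, PHP_URL_HOST);
    if ($h && strcasecmp($h, $host) != 0) // keep only hosts other than the page's own
        $externalHosts[strtolower($h)] = true;
}
print_r(array_keys($externalHosts)); // print the list of external hosts
?>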
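
Since the question also asks about bash, the same idea can be sketched as a shell pipeline. This is a minimal sketch under stated assumptions, not a definitive implementation: `external_hosts` is a hypothetical helper name, it only looks at double-quoted `src`/`href` attributes holding absolute or protocol-relative URLs, and, like the PHP above, it only sees the static HTML, so hosts that are loaded later by JavaScript (which a tool such as webpagetest.org records) will not appear.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the unique external hosts referenced by
# src="..." / href="..." attributes in the HTML read from stdin.
# $1 is the host of the page itself; it is excluded from the output.
external_hosts() {
    local site_host="$1"
    # 1. grep: keep only the part of each attribute up to the end of the host,
    #    e.g.  src="https://cdn.sstatic.net   or   href="//fbcdn.net
    # 2. sed:  strip the attribute name, quote, optional scheme and the "//",
    #          then any user@ prefix and any :port suffix
    # 3. tr:   lowercase, because host names are case-insensitive
    # 4. grep -vxF: drop lines that are exactly the page's own host
    # 5. sort -u: remove duplicates
    grep -oiE '(src|href)="(https?:)?//[^"/?#]+' \
        | sed -E -e 's#^[^"]*"([a-zA-Z]+:)?//##' -e 's#^[^@]*@##' -e 's#:[0-9]+$##' \
        | tr '[:upper:]' '[:lower:]' \
        | grep -vxF -- "$site_host" \
        | sort -u
}

# Usage (needs network access, so it is left commented out here):
#   curl -sL "https://stackoverflow.com" | external_hosts stackoverflow.com
```

Relative and root-relative URLs are skipped on purpose, because they always point at the page's own host. Note that subdomains of the page's host (for example `www.` variants) count as external in this sketch; pass the exact host you want excluded.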

2022-11-13
