How do I extract h1 headings from an HTML page using regular expressions?

Posted by oxcyiej7 on 2023-02-05 in Other


I'm still learning regular expressions and have what should be a simple task: I'm trying to parse my website's home page and extract the H1 tag.

<?php
    $string_get = file_get_contents("http://davidelks.com/");
    
    
    $replace = "$1";
    
    $matches = preg_replace ("/<h1 class=\"title\"><a href=\"([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*\">([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*<\/a><\/h1>/", $replace, $string_get, 1);
    
    $string_construct = "Mum " . $matches .  " Dad";
    
    echo ($string_construct);
    
    ?>

However, instead of showing only the first HTML link captured as $1, it pulls in the entire page.

regex

Source: https://stackoverflow.com/questions/5010791/how-to-extract-h1-headings-from-an-html-page-using-regular-expressions

3 Answers


dly7yett #1

This looks like a task that is easily done with a DOM parser:

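libxml_use_internal_errors(true); // suppress warnings from malformed real-world HTML
$dom = new DOMDocument;
$dom->loadHTMLFile('http://davidelks.com/'); // loadHTMLFile() parses HTML; load() expects XML
$h1 = $dom->getElementsByTagName('h1')->item(0); // first <h1> on the page
echo $h1->textContent;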

You should get:

Let's make things happen in and around Stoke-on-Trent

Note: I'm not sure whether this is your own site or one you manage, but an HTML page should not have more than one <h1> tag (the home page has two).
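
Since the page has more than one <h1>, it can help to collect every heading together with its link rather than only the first one. The following is a minimal sketch of that idea, not the answerer's code: the inline $html string is made-up sample markup standing in for the fetched page, and the $headings array is a name introduced here for illustration.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = <<<HTML
<html><body>
<h1 class="title"><a href="/first-post">First heading</a></h1>
<h1 class="title"><a href="/second-post">Second heading</a></h1>
</body></html>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);

// Collect the text and the first link target of every <h1>.
$headings = [];
foreach ($dom->getElementsByTagName('h1') as $h1) {
    $link = $h1->getElementsByTagName('a')->item(0);
    $headings[] = [
        'text' => trim($h1->textContent),
        'href' => $link !== null ? $link->getAttribute('href') : null,
    ];
}

print_r($headings);
```

For a live page, loadHTMLFile() with the URL would presumably replace loadHTML($html); the loop itself stays the same.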


ycggw6v2 #2

The mistake is in your use of preg_replace. You want to extract something, and preg_match is the function for that:

<?php
 $text = file_get_contents("http://davidelks.com/");

 preg_match('#<h1 class="title"><a href="([\w\s\x21\/\-\.\£\:]*)">([^<>]*)</a></h1>#', $text, $match);

 echo "Mum " . $match[1] .  " Dad";
?>

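Note in particular that you can merge character classes: you don't need [A-Z]|[a-z]|[..], because you can combine them into a single bracketed list, [A-Za-z...].
If you need to match double quotes, also consider writing the PHP string in single quotes, which saves a lot of extra escaping, just as using # instead of / as the regex delimiter does.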
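
To make the difference between the two functions concrete, here is a small sketch that runs both against the same input. It rests on some assumptions: $text is invented sample markup rather than the real page, and the character class is a simplified version of the one above (the £ is left out to avoid encoding questions).

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$text = '<p>intro</p><h1 class="title"><a href="/first-post">First heading</a></h1><p>outro</p>';

// One combined character class for the href, [^<>]* for the link text.
// Single quotes and the # delimiter avoid escaping " and /.
$pattern = '#<h1 class="title"><a href="([\w\s!/.:-]*)">([^<>]*)</a></h1>#';

// preg_replace returns the WHOLE subject with only the matched part replaced,
// which is why the original code printed the entire page.
$replaced = preg_replace($pattern, '$1', $text, 1);

// preg_match fills $match with the captured groups and leaves $text untouched.
$href = null;
$title = null;
if (preg_match($pattern, $text, $match) === 1) {
    $href = $match[1];
    $title = $match[2];
}

echo $replaced, "\n"; // <p>intro</p>/first-post<p>outro</p>
echo $href, "\n";     // /first-post
echo $title, "\n";    // First heading
```

Checking the return value of preg_match before reading $match avoids undefined-index notices when the page contains no matching heading.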


dldeef67 #3

A DOM parser would make this easier, but if you want to do it with regex, you should use PHP's preg_match_all function:

preg_match_all("/<h1 class=\"title\"><a href=\"([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*\">([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*<\/a><\/h1>/",$string_get,$matches);
var_dump($matches);
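
One thing to watch in the pattern above: the * sits outside the parentheses, so each capturing group keeps only the last character it matched, not the whole href or title. A hedged sketch of a variant with the quantifier inside the group follows. The sample string and the simplified classes [^"]* and [^<>]* are assumptions made for illustration, and PREG_SET_ORDER is used so that each match arrives as one array.

```php
<?php
// Hypothetical sample markup with two headings.
$string_get = '<h1 class="title"><a href="/a">One</a></h1>'
            . '<h1 class="title"><a href="/b">Two</a></h1>';

// Quantifiers are inside the groups, so each group captures the full value.
$pattern = '#<h1 class="title"><a href="([^"]*)">([^<>]*)</a></h1>#';

// PREG_SET_ORDER groups the results per match: each entry of $matches holds
// [0] the full match, [1] the href, [2] the link text.
$count = preg_match_all($pattern, $string_get, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n";
}
```

Without PREG_SET_ORDER the default ordering is by group instead, so $matches[1] would hold all hrefs and $matches[2] all titles.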

