Update src value using preg_replace

Posted by xdnvmnnf on 2023-01-28 in Other


I have some <img> tags like the following:

<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />
<img src="{filedir_14}test.png" alt="" />

I need to update the src value, extracting the file name and placing it inside a WordPress shortcode:

<img src="[my-shortcode file='test.png']" ... />

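The regex I use to extract the file name is [a-zA-Z_0-9-()]+\.[a-zA-Z]{2,4}, but I can't build the complete regex because the attributes of the image tags are in a different order in each example.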
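The file-name step on its own can be sketched as below. This is only an illustration under stated assumptions: `buildShortcodeSrc` is a hypothetical helper name, and the pattern is the one from the question with the hyphen moved to the end of the character class so that it is unambiguously literal. It works on a src value that has already been isolated, so it does not by itself solve the attribute-order problem.

```php
<?php
// Hypothetical helper (not from the original post): extract the file name
// from a raw src value and wrap it in the shortcode shown in the question.
function buildShortcodeSrc(string $src): ?string
{
    // The question's pattern; the hyphen is placed last so it is a literal.
    $pattern = '/[a-zA-Z_0-9()-]+\.[a-zA-Z]{2,4}/';

    // preg_match returns 1 when the pattern matches, 0 when it does not.
    if (preg_match($pattern, $src, $matches) !== 1) {
        return null; // no file name found in this src value
    }

    return "[my-shortcode file='{$matches[0]}']";
}

echo buildShortcodeSrc('{assets_8170:{filedir_14}test.png}'), PHP_EOL;
echo buildShortcodeSrc('{filedir_14}test.png'), PHP_EOL;
```

Both sample values from the question yield `[my-shortcode file='test.png']`, because `test.png` is the first run of allowed characters that is followed by a dot and a 2 to 4 letter extension.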


Source: https://stackoverflow.com/questions/75222526/update-src-value-using-preg-replace

1 Answer


xwmevbvl1#

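PHP - parse the HTML content, transform it, and return the resulting HTML

This answer grew over its lifetime as it tried to solve the problem.
It went through several attempts, and the latest one (loadXML/saveXML) works.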

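DOMDocument - loadHTML and saveHTML

If you need to parse an HTML string in PHP so that you can later read and modify its content in a structured and safe way, without breaking the encoding, you can use DOMDocument::loadHTML():
https://www.php.net/manual/en/domdocument.loadhtml.php
Here I show how to parse the HTML string, fetch all of its <img> elements, and, for each one, read its src attribute and set it to an arbitrary value.
Finally, to return the HTML string of the transformed document, you can use DOMDocument::saveHTML:
https://www.php.net/manual/en/domdocument.savehtml
By default the document contains a basic HTML skeleton that wraps your original content. To make sure the generated HTML is limited to your content only, I show here how to get the body and loop over its children to build the final output:
https://onlinephp.io/c/157de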

<?php

$html = "
<img alt=\"\" src=\"{assets_8170:{filedir_14}test.png}\" style=\"width: 700px; height: 181px;\" />
<img src=\"{filedir_14}test.png\" alt=\"\" />
";

$transformed = processImages($html);

echo $transformed;

function processImages($html){

    //parse the html fragment
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    
    //fetch the <img> elements
    $images = $dom->getElementsByTagName('img');
    
    //for each <img>
    foreach ($images as $img) {
        //get the src attribute
        $src = $img->getAttribute('src');
        //set the src attribute
        $img->setAttribute('src', 'bogus');
    }
    
    //return the html modified so far (body content only)
    $body = $dom->getElementsByTagName('body')->item(0);
    $bodyChildren = $body->childNodes;
    $bodyContent = '';
    foreach ($bodyChildren as $child) {
        $bodyContent .= $dom->saveHTML($child);
    }
    return $bodyContent;
}

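The src attribute value restriction problem

After reading the comments, where you pointed out that saveHTML returns HTML in which the special characters in the image src attribute values are escaped, I did some more research...
This happens because DOMDocument wants to make sure the src attribute contains a valid URL, and { and } are not valid URL characters.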

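Evidence that this does not happen with custom data attributes

For example, if I add an attribute such as data-test="mycustomcontent: {wildlyusingwhatever}", it is returned untouched, because it does not have to follow those strict rules.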
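A minimal sketch to check this yourself, assuming a single <img> that carries both attributes (the combined tag is an illustration, not taken from the original demo). It prints the serialized tag and then the in-memory src value, so you can compare how saveHTML treats the two attributes on your PHP/libxml2 build:

```php
<?php
// Illustrative fragment: one <img> with both a src and a data-test attribute.
$dom = new DOMDocument();
$dom->loadHTML('<img src="{filedir_14}test.png" data-test="mycustomcontent: {wildlyusingwhatever}">');

$img = $dom->getElementsByTagName('img')->item(0);

// Serialize only the <img> node, without the html/body wrapper.
$serialized = $dom->saveHTML($img);
echo $serialized, PHP_EOL;

// The value held in the DOM is the original string; any escaping happens
// only when the node is serialized.
echo $img->getAttribute('src'), PHP_EOL;
```

Because getAttribute('src') still returns the original string, the braces are only lost at the saveHTML step, which is why changing the serializer (as in the loadXML/saveXML approach) is enough to avoid the problem.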

Quick fix to make it work (defeating the purpose of the parser)

As a workaround, the best I have come up with so far is:
https://onlinephp.io/c/0e334

//VERY UNSAFE -- in $bodyContent, replace %7B with { and %7D with }
$bodyContent = str_replace("%7B", "{", $bodyContent);
$bodyContent = str_replace("%7D", "}", $bodyContent);
return $bodyContent;

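Of course, it is neither safe nor smart, and not a very good solution: first because it defeats the whole purpose of using a parser instead of regex, and second because it could badly break the result.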

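A better approach using loadXML and saveXML

To keep the HTML rules from kicking in, you can try parsing the text as XML instead of HTML. That still respects the nested markup syntax (which is hard or impossible to handle with regex) but does not apply all the restrictions on content. It does require the fragment to be well-formed XML, which holds for the self-closed <img /> tags above.
I changed the core logic as follows: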

//load the html content as xml, wrapping it in a root element
//({$html} replaces the ${html} form, which is deprecated since PHP 8.2)
$dom->loadXML("<root>{$html}</root>");

//...

//return the xml of each child of <root> as processed so far
$rootNode = $dom->documentElement;
$children = $rootNode->childNodes;
$content = '';
foreach ($children as $child) {
    $content .= $dom->saveXML($child);
}

return $content;

Here is a working demo: https://onlinephp.io/c/f9de1
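Putting the pieces together, a complete version might look like the sketch below. It is not the code behind the demo link: `replaceImageSources` is a hypothetical name, the file-name pattern is the one from the question (hyphen moved to the end of the class), and it assumes the input fragment is well-formed XML.

```php
<?php
// Hypothetical end-to-end helper: parse the fragment as XML, rewrite every
// <img> src to the shortcode from the question, and return the fragment.
function replaceImageSources(string $html): string
{
    $dom = new DOMDocument();
    // Assumption: $html is well-formed XML (e.g. <img ... /> is self-closed).
    $dom->loadXML("<root>{$html}</root>");

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Attribute order is irrelevant here: the parser already isolated src.
        if (preg_match('/[a-zA-Z_0-9()-]+\.[a-zA-Z]{2,4}/', $src, $m) === 1) {
            $img->setAttribute('src', "[my-shortcode file='{$m[0]}']");
        }
    }

    // Serialize only the children of <root>, dropping the wrapper itself.
    $content = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $content .= $dom->saveXML($child);
    }
    return $content;
}

$html = '
<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />
<img src="{filedir_14}test.png" alt="" />
';

echo replaceImageSources($html);
```

An <img> whose src contains no recognizable file name is left unchanged rather than being overwritten, which seemed the safer default for a sketch.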

