JavaScript replace() regular expression too greedy

Posted by w9apscun on 2023-05-05 in Java


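I'm trying to sanitize an HTML input field. I want to keep some of the markup but not all of it, so I can't just use .text() when reading the element's value. I've run into a problem with a regular expression in JavaScript on Safari. Here is the snippet (I copied this regex from an answer in another SO thread):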

aString.replace (/<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi, '$2 (Link->$1)' ) ;

Here is sample input that fails:

<a href="http://blar.pirates.net/black/ship.html">Go here please.</a></p><p class="p1"><a href="http://blar.pirates.net/black/ship.html">http://blar.pirates.net/black/ship.html</a></p>

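The idea is that the href gets pulled out and emitted as plain text next to the text that was linked. So the output for the above should end up looking like this: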

Go here please (Link->http://blar.pirates.net/black/ship.html)
http://blar.pirates.net/black/ship.html (Link->http://blar.pirates.net/black/ship.html)

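However, the regex grabs everything up to the second </a> tag in the first match, so I lose the first line of output. (In fact, it swallows every anchor in the list as long as the anchor elements are adjacent.) The input is one long string, not split into lines by CR/LF or anything else.
I tried using a non-greedy modifier like this (note the second question mark):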

/<\s*a.*href=\"(.*?)\".*?>(.*?)<\/a>/ig

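But that doesn't seem to change anything (at least not in the few testers/parsers I tried, such as https://regex101.com/r/yhmT8w/1). I also tried the /U flag, but it didn't help (or those parsers didn't recognize it).
Any suggestions?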

JavaScript

Source: https://stackoverflow.com/questions/21323041/javascript-replace-regular-expression-too-greedy

3 Answers


4uqofj5v1#

There are several errors in the pattern, and some possible improvements:

/<
\s*    #  not needed (browsers don't recognize "< a" as an "a" tag)

a      #  if you want to avoid confusing an "a" tag with the start of an
       # "abbr" tag, you can add a word boundary or, better, a "\s+", since
       # there is at least one whitespace character after it.

.      #  The dot matches everything except newlines, so if an "a" tag spans
       # several lines your pattern will fail. JavaScript had no "singleline"
       # or "dotall" mode before ES2018 added the "s" flag, so you can replace
       # the dot with `[\s\S]`, which matches any character (everything that
       # is whitespace + everything that is not whitespace)

*      #  Quantifiers are greedy by default. ".*" will match everything up to the
       # end of the line, and "[\s\S]*" everything up to the end of the string!
       # This forces the regex engine to backtrack a long way until it finds
       # the last "href" (which is not always the one you want)

href=  # You can add a word boundary before the "h" and put optional spaces around
       # the equal sign to make your pattern more "waterproof": \bhref\s*=\s*

\"     #  Doesn't need to be escaped. As Markasoftware notes, an attribute
       # value is not always between double quotes. You can have single quotes or
       # no quotes at all. (1)
(.*?)
\"     # same thing
.*     # same thing: match all until the last >
>(.*?)<\/a>/gi

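To make the greedy-versus-lazy point above concrete, here is a minimal sketch. The two-anchor input string and the `a.example` / `b.example` URLs are made up for illustration, and both patterns are simplified variants of the one in the question, not the final recommendation.

```javascript
// Hypothetical one-line input with two adjacent anchors.
const input =
  '<a href="http://a.example/">first</a> <a href="http://b.example/">second</a>';

// Greedy: the ".*" before "href" runs to the end of the line, then backtracks
// only as far as the LAST "href=".
const greedy = /<a.*href="(.*?)".*>(.*?)<\/a>/i;

// Lazy: ".*?" grows one character at a time, so it stops at the FIRST "href=".
// "[^>]*" cannot leave the opening tag, so it cannot run into the next anchor.
const lazy = /<a.*?href="(.*?)"[^>]*>(.*?)<\/a>/i;

const greedyMatch = input.match(greedy);
const lazyMatch = input.match(lazy);

console.log(greedyMatch[0] === input); // true: one match swallows both anchors
console.log(greedyMatch[1]);           // http://b.example/
console.log(lazyMatch[1]);             // http://a.example/
console.log(lazyMatch[2]);             // first
```
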
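(1) About quotes and the href attribute value:
To handle single quotes, double quotes, or no quotes at all, you can use a capture group and a backreference: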

\bhref\s*=\s*(["']?)([^"'\s>]*)\1

Details:

\bhref\s*=\s*
(["']?)     # capture group 1: can contain a single, a double quote or nothing 
([^"'\s>]*) # capture group 2: all that is not a quote to stop before the possible
            # closing quote, a space (urls don't have spaces, however javascript
            # code can contain spaces) or a ">" to stop at the first space or
            # before the end of the tag if quotes are not used. 
\1          # backreference to the capture group 1
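A short sketch of how this subpattern behaves with the three quoting styles. The sample tags and `example.com` URLs are invented for illustration; only the pattern itself comes from the answer.

```javascript
// The href subpattern from above: group 1 is the optional quote, group 2 the value.
const hrefPattern = /\bhref\s*=\s*(["']?)([^"'\s>]*)\1/i;

// Hypothetical opening tags: double quotes, single quotes, no quotes,
// and spaces around "=" with other attributes present.
const samples = [
  '<a href="http://example.com/a">',
  "<a href='http://example.com/b'>",
  '<a href=http://example.com/c>',
  '<a class="x" href = "http://example.com/d" target="_blank">',
];

// Pull the attribute value (capture group 2) out of each tag.
const urls = samples.map((tag) => tag.match(hrefPattern)[2]);

console.log(urls);
```

In the unquoted case group 1 is the empty string, so the backreference `\1` matches nothing and the value simply ends at the first whitespace or `>`.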

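Note that this subpattern adds a capture group, so the URL is now in capture group 2 and the content between the a tags is in capture group 3. In the replacement string, change $2 to $3 and $1 to $2.
Finally, you can write your pattern like this: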

aString.replace(/<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi,
               '$3 (Link->$2)');
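As a rough check, the pattern can be run against the sample input from the question. In this sketch the replacement string takes the URL from capture group 2, since group 1 only holds the optional quote character; the variable names are illustrative.

```javascript
// Sample input from the question, as one long string.
const aString =
  '<a href="http://blar.pirates.net/black/ship.html">Go here please.</a></p>' +
  '<p class="p1"><a href="http://blar.pirates.net/black/ship.html">' +
  'http://blar.pirates.net/black/ship.html</a></p>';

// Group 1: optional quote, group 2: href value, group 3: link text.
const anchorPattern =
  /<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi;

const cleaned = aString.replace(anchorPattern, '$3 (Link->$2)');

console.log(cleaned);
// Go here please. (Link->http://blar.pirates.net/black/ship.html)</p><p class="p1">http://blar.pirates.net/black/ship.html (Link->http://blar.pirates.net/black/ship.html)</p>
```

Each anchor is now replaced separately; the surrounding `</p><p class="p1">` markup is untouched because the pattern only matches `a` elements.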


nuypyhwy2#

Use
href="([^"]+)"
instead of
href=\"(.*?)\"
Basically, this grabs any character until it meets the next ".
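A small sketch of why a negated character class is a safe way to capture the attribute value. The tag and its `title` attribute are made up for illustration, and the capture group is kept so the value is still available as `$1`.

```javascript
// Hypothetical tag with a second quoted attribute after href.
const tag = '<a href="http://example.com/page" title="Example">link</a>';

// [^"]+ cannot cross a double quote, so it must stop at the closing quote of href.
const negatedClass = /href="([^"]+)"/;

// A greedy dot, by contrast, runs on to the LAST double quote on the line.
const greedyDot = /href="(.*)"/;

console.log(tag.match(negatedClass)[1]); // http://example.com/page
console.log(tag.match(greedyDot)[1]);    // http://example.com/page" title="Example
```
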
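That said, it might be easier to implement something like Markdown syntax, so you don't have to worry about stripping the wrong tags: just strip all tags and, when displaying the text, replace the Markdown with its HTML tag counterparts.
For example, on SO you can write
[link text](http://linkurl.com)
and the regex that performs the replacement is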

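var displayText = "This is just some text [and this is a link](http://example.com) and then more text";
// Group 1 is the link text, group 2 the URL; the g flag replaces every link.
var linkMarkdown = /\[([^\]]+)\]\(([^\)]+)\)/g;
// replace() returns a new string, so keep the result.
var html = displayText.replace(linkMarkdown, '<a href="$2">$1</a>');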

Or use an existing library to do the conversion.


nwo49xxi3#

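Thanks for all the suggestions; they helped me a lot and gave me plenty of ideas for improving this.
But I think I found the specific reason the original regex failed. Casimir's answer touched on it, but I didn't understand it until I happened on this fix.
I had been looking for the problem in the wrong place, here: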

/<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi
                       ^

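I was able to fix my original query by inserting a question mark after the a.* in the a.*href region (keeping the non-greedy .*? before the > that I had already tried), like so:

/<\s*a.*?href=\"(.*?)\".*?>(.*?)<\/a>/gi
        ^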
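A quick check against the sample input from the question, as a sketch: it counts matches for four variants of the pattern that differ only in which `.*` is lazy (the variant names are made up here). On this single-line input, only the variant with both quantifiers lazy finds the two anchors separately, because a greedy `.*` before `>` can still run into the next tag.

```javascript
// Sample input from the question, as one long string.
const sample =
  '<a href="http://blar.pirates.net/black/ship.html">Go here please.</a></p>' +
  '<p class="p1"><a href="http://blar.pirates.net/black/ship.html">' +
  'http://blar.pirates.net/black/ship.html</a></p>';

// Same pattern four times; only the laziness of the two ".*" parts changes.
const variants = {
  bothGreedy: /<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi,
  secondLazy: /<\s*a.*href=\"(.*?)\".*?>(.*?)<\/a>/gi,
  firstLazy: /<\s*a.*?href=\"(.*?)\".*>(.*?)<\/a>/gi,
  bothLazy: /<\s*a.*?href=\"(.*?)\".*?>(.*?)<\/a>/gi,
};

// Count how many separate anchors each variant finds.
const counts = {};
for (const [name, pattern] of Object.entries(variants)) {
  counts[name] = (sample.match(pattern) || []).length;
}
console.log(counts); // { bothGreedy: 1, secondLazy: 1, firstLazy: 1, bothLazy: 2 }

const fixed = sample.replace(variants.bothLazy, '$2 (Link->$1)');
console.log(fixed);
```
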

I do intend to use the other suggestions here to further improve my expression.
-- C

