regex: Find all span tags in a string using JavaScript

hec6srdp posted on 2022-12-05 in Java

I have a piece of text like the following; it is essentially a string of HTML code.

hello
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<div>....</div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<div>....</div>
<div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
</div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>

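What I want is to capture the innerText of every span tag (so in the example above, that would be Professional Referee) and store the results in an array.
Regex seemed like the way to go. This is what I have so far: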

^/(<span)([\a-zA-Z0-9\s]*)(<\/span>)/$

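I'm no whiz at regex, and an added complication is that each span tag may carry attributes that differ from those of the other tags.
I figure that if I can get the complete span tags into an array, I can manage to strip out the rest myself.
Here is a regex101 link: https://regex101.com/r/9K90pa/1
Can anyone help me choose the right approach?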


bkhjykvo1#

Regex is not an ideal tool for parsing HTML. The DOM API provides DOMParser for this:

const html = `hello
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<div>....</div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<div>....</div>
<div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
</div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>`;

const doc = new DOMParser().parseFromString(html, "text/html");
const spanTexts = Array.from(doc.querySelectorAll("span"), span => span.textContent);

console.log(spanTexts);

tpxzln5u2#

A somewhat crude solution: I have the regex worked out, but I'm no whiz at JS.

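// Regex literal with the g flag, which matchAll requires for RegExp arguments;
// capture group 1 holds the text between the opening and closing tags.
const regexp = /<span.*?>(.*?)<\/span>/g;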

const html = `hello
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<div>....</div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<div>....</div>
<div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
</div>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>
<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>`;
const array = [...html.matchAll(regexp)];

console.log(array);

This outputs a two-dimensional array in which the second item of each inner array is the innerText:

> Array [Array ["<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" class="aDTYNe snByac OvPDhc OIC90c">Professional Referee</span>", "Professional Referee"]]
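To get the flat array of texts that the question asks for, capture group 1 can be pulled out of each match. This is only a minimal sketch, assuming the same pattern as above; the shortened `sample` markup and the names `spanPattern` and `flatTexts` are illustrative and not part of the original answer.

```javascript
// Illustrative input; the shortened markup is an assumption, not the original page.
const sample = `hello
<span dir="auto" class="a">Professional Referee</span>
<div>....</div>
<div>
<span class="b">Assistant Referee</span>
</div>`;

// matchAll requires the g flag; group 1 is the text between the tags.
const spanPattern = /<span.*?>(.*?)<\/span>/g;

// The mapping callback of Array.from keeps only capture group 1 of each match.
const flatTexts = Array.from(sample.matchAll(spanPattern), match => match[1]);

console.log(flatTexts);
```

Mapping inside `Array.from` avoids building the intermediate two-dimensional array when only the text is needed.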

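If a span's closing tag is on a different line, this breaks down further, because `.` does not match line breaks by default. DOMParser is much better.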
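The line-break problem can be reduced, though not eliminated, by letting the pattern cross newlines. The following is a sketch under stated assumptions: the spans are not nested, no attribute value contains `>`, and the runtime supports the `s` (dotAll) flag. The input and the names `multilineHtml`, `multilinePattern`, and `multilineTexts` are hypothetical.

```javascript
// Illustrative input where the text and the opening tag each span several lines.
const multilineHtml = `<span class="a">Professional
Referee
</span>
<span
  dir="auto">Linesman</span>`;

// \b stops "span" from matching longer tag names; [^>]* consumes the attributes
// (newlines included) up to the end of the opening tag; the s flag lets .
// match newlines inside the tag body.
const multilinePattern = /<span\b[^>]*>(.*?)<\/span>/gs;

// Collapse internal whitespace runs so text broken across lines reads as one string.
const multilineTexts = Array.from(
  multilineHtml.matchAll(multilinePattern),
  match => match[1].replace(/\s+/g, " ").trim()
);

console.log(multilineTexts);
```

Nested spans or a `>` inside an attribute value would still defeat this pattern, which is why a real parser such as DOMParser remains the more robust choice.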
