Regex: extracting content from an XML file

Posted by iyfjxgzm 12 months ago in Other

I have XML content like the following:

<Artificial name="Artifical name">
    <Machine>
        <MachineEnvironment uri="environment" />
    </Machine>
    <Mobile>taken phone, test

when r1
    100m SUV
then
    FireFly is High
end

when r2
    Order of the Phonenix 
    
then
    Magic is High
end

</Mobile>
</Artificial>

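I want to write a function that takes a line (string) and the content (string), and returns the content of the nearest tag that the given line belongs to.
For example, if I pass the line FireFly is High, it should return the following, because that is the nearest tag the line belongs to: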

<Mobile>taken phone, test

when r1
    100m SUV
then
    FireFly is High
end

when r2
    Order of the Phonenix 

then
    Magic is High
end

</Mobile>


Here is my code:

getLineContent(line: string, content: string) {
    const trimmedLine = line.trim()
    const isSelfClosingTag = /\/\s*>$/.test(trimmedLine)
    const isPlainTextLine = !/<|>/.test(trimmedLine)
    const regex = new RegExp(`(${trimmedLine}[^>]*>)([\\s\\S]*?)</(${trimmedLine.split(' ')[0].substr(1)}>)`)
    const isClosingTag = /^<\/\w+>$/.test(trimmedLine)
    const match = content.match(regex)

    if (!isClosingTag) {
      if (isSelfClosingTag) {
        return trimmedLine
      }

      if (match && match[2]) {
        return match[1] + match[2] + match[3]
      }
      if (isPlainTextLine) {
        const regex = new RegExp(`(<[^>]*>)([\\s\\S]*?${trimmedLine.split(' ')[0].substr(1)}[\\s\\S]*?</[a-zA-Z]+>)`)
        const match = content.match(regex)
        console.log('isPlainTextLine', match)
        if (match && match[1] && match[2]) {
          return match[2]
        }
      }
      return trimmedLine
    }
  }


It works almost perfectly, but not quite. The problem lies in this part of the code:

if (isPlainTextLine) {
        const regex = new RegExp(`(<[^>]*>)([\\s\\S]*?${trimmedLine.split(' ')[0].substr(1)}[\\s\\S]*?</[a-zA-Z]+>)`)
        const match = content.match(regex)
        console.log('isPlainTextLine', match)
        if (match && match[1] && match[2]) {
          return match[2]
        }
      }


For example, if I pass FireFly is High, the returned value is:

<Machine>
        <MachineEnvironment uri="environment" />
    </Machine>
    <Mobile>taken phone, test

when r1
    100m SUV
then
    FireFly is High
end

when r2
    Order of the Phonenix 

then
    Magic is High
end

</Mobile>


Regex is not my strong suit. Any help is appreciated.

vnzz0bqm #1

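Regular expressions are not the right tool for this task. Use an XML parser instead; there are many to choose from. For example, you could use fast-xml-parser, which converts XML into a nested object structure. Demo: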

const { XMLParser } = require("fast-xml-parser");

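// Depth-first search: returns { tagName: text } for the first string value containing `find`
function findText(obj, find, key = "") {
    if (typeof obj === "string" && obj.includes(find)) {
        return { [key]: obj };
    }
    if (Object(obj) === obj) { // true for objects and arrays, false for primitives
        for (const childKey in obj) {
            const result = findText(obj[childKey], find, childKey);
            if (result) return result;
        }
    }
}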

const xml = `<Artificial name="Artifical name">
    <Machine>
        <MachineEnvironment uri="environment" />
    </Machine>
    <Mobile>taken phone, test
    ...
    FireFly is High
    ...
    </Mobile>
</Artificial>`;

const obj = new XMLParser().parse(xml);
const result = findText(obj, "FireFly");
console.log(result); // { Mobile: "taken phone, ....... " }
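
The parsed object above holds the element's text but not its original markup. If the goal is the <Mobile>...</Mobile> source text itself, as in the question, one dependency-free option is a small stack-based scan over the tags. The sketch below is not a full XML parser: nearestEnclosingElement is a hypothetical helper (not part of any library), it is meant for plain-text lines rather than tag lines, and it assumes well-formed input with no comments, CDATA sections, or ">" characters inside attribute values.

```javascript
// Hypothetical helper: returns the source text of the innermost element
// whose content contains `line`, or undefined if there is none.
function nearestEnclosingElement(xml, line) {
    const target = xml.indexOf(line.trim());
    if (target === -1) return undefined;

    // Groups: 1 = "/" for a closing tag, 2 = tag name, 3 = "/" for a self-closing tag
    const tagPattern = /<(\/?)([A-Za-z_][\w.:-]*)[^>]*?(\/?)>/g;
    const stack = [];     // elements currently open: { name, start }
    let enclosing = null; // innermost element open at the target position

    for (const m of xml.matchAll(tagPattern)) {
        const [tag, closing, name, selfClosing] = m;

        // First tag at or after the target: the top of the stack encloses the line.
        if (enclosing === null && m.index >= target) {
            if (stack.length === 0) return undefined;
            enclosing = stack[stack.length - 1];
        }

        if (selfClosing) continue; // opens and closes nothing
        if (!closing) {
            stack.push({ name, start: m.index });
            continue;
        }

        // Closing tag: if it closes the enclosing element, slice out its source.
        const open = stack.pop();
        if (open === enclosing) {
            return xml.slice(open.start, m.index + tag.length);
        }
    }
    return undefined;
}

const sampleXml = `<Artificial name="Artifical name">
    <Machine>
        <MachineEnvironment uri="environment" />
    </Machine>
    <Mobile>taken phone, test
    FireFly is High
    </Mobile>
</Artificial>`;

console.log(nearestEnclosingElement(sampleXml, "FireFly is High"));
```

The stack is what the regex approach lacks: it records which elements are still open when the line is reached, so the already closed <Machine> element is never a candidate.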

As a second example, in a browser context you can use DOMParser from the Web API:

function* iterNodes(doc, whatToShow) { // Generator wrapping createTreeWalker
    const walker = doc.createTreeWalker(doc.documentElement, whatToShow);
    // nextNode() returns null once the walk is finished
    for (let node = walker.nextNode(); node; node = walker.nextNode()) yield node;
}

function findTagByContent(xml, content) {
    const doc = new DOMParser().parseFromString(xml, "text/xml");
    for (const node of iterNodes(doc, NodeFilter.SHOW_TEXT)) {
        if (node.textContent.includes(content)) return node.parentNode.outerHTML;
    }
}

// Example run

const xml = `<Artificial name="Artifical name">
    <Machine>
        <MachineEnvironment uri="environment" />
    </Machine>
    <Mobile>taken phone, test
    ...
    FireFly is High
    ...
    </Mobile>
</Artificial>`;

console.log(findTagByContent(xml, "FireFly"));
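
As for why the plain-text regex in the question returns too much: a regex match starts at the leftmost position where it can succeed, so (<[^>]*>) captures the first tag in the document, and the lazy [\s\S]*? then simply grows until it reaches the searched text. The sketch below reproduces that, then shows a narrower pattern as a stopgap. It only works under the assumption that the enclosing element holds plain text and no child elements; escapeRegExp and enclosingTextOnlyElement are hypothetical helper names, and a parser remains the more robust choice.

```javascript
// The plain-text regex from the question, as built for the line "FireFly is High"
// (`.split(' ')[0].substr(1)` reduces the line to "ireFly").
const questionRegex = /(<[^>]*>)([\s\S]*?ireFly[\s\S]*?<\/[a-zA-Z]+>)/;

// Escape characters that have a special meaning inside a regex.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: matches an element whose text-only content contains `line`.
// [^<]* cannot cross another tag, so the opening tag must be the nearest one,
// and \1 requires the closing tag to have the same name.
function enclosingTextOnlyElement(content, line) {
    const pattern = new RegExp(
        `<([A-Za-z_][\\w.:-]*)[^<>]*>[^<]*${escapeRegExp(line.trim())}[^<]*</\\1>`
    );
    const m = content.match(pattern);
    return m ? m[0] : undefined;
}

const content = `<Artificial name="Artifical name">
    <Machine>
        <MachineEnvironment uri="environment" />
    </Machine>
    <Mobile>taken phone, test
    FireFly is High
    </Mobile>
</Artificial>`;

const overMatch = content.match(questionRegex);
console.log(overMatch[1]); // <Artificial name="Artifical name">
console.log(enclosingTextOnlyElement(content, "FireFly is High"));
```

Because group 1 is the root tag, group 2 begins right after it, which is why the returned value in the question starts at <Machine>.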
