PowerShell script to replace non-HTML tags

hjqgdpho, posted 12 months ago in Shell

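I'm writing a PowerShell script that is meant to find lines in HTML files containing angle brackets that are not part of an HTML tag. The script should replace those angle brackets with &lt; and &gt;. However, I'm having trouble with my current script: the replacement logic doesn't seem to work as expected. Note that I'm operating on markdown files, and I need to do this with PowerShell version 5.1.22621.2428, without any external dependencies.
To keep the script simple, these are the conditions I've settled on for something to count as a tag: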

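  • Every tag must be closed. For example, <hello> is a valid tag, whereas <hello is not. Whether there is a space after the first word doesn't matter, since otherwise tags with attributes wouldn't be detected correctly.
  • There must be no whitespace between the opening bracket and the first word. For example, < hello> is not a tag, whereas <hello> and <hello > are. This prevents something like 3 < 4 > 2 from being interpreted as a valid tag.
  • A tag must contain something. For example, <> is not a valid tag.
  • A tag cannot start with a number.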
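
Taken together, the four conditions above can be expressed as a single regular expression. The following is a minimal sketch, not the script from the question: the function name `Test-IsTagLike` is hypothetical, and allowing a leading `/` for closing tags is an assumption based on the expected output further down, where `</b>` is left untouched.

```powershell
# Hypothetical helper: returns $true when $Text satisfies the four conditions.
function Test-IsTagLike {
    param([string]$Text)
    # ^<        opening bracket
    # /?        optional slash for a closing tag (an assumption, see the lead-in)
    # [A-Za-z]  first character is a letter: rules out whitespace, digits and '<>'
    # [^<>]*    anything except brackets (attributes, trailing whitespace)
    # >$        the tag must be closed
    return $Text -match '^</?[A-Za-z][^<>]*>$'
}

# Print each candidate next to the verdict.
'<hello>', '<hello >', '<hello', '< hello>', '<>', '<1st>', '</b>', '< /b>' |
    ForEach-Object { '{0,-10} {1}' -f $_, (Test-IsTagLike $_) }
```

With these inputs the helper reports True for `<hello>`, `<hello >` and `</b>`, and False for the others.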

Here is the markdown test file I'm using to test these scripts:

<
text
>
<ul>
   <li>[Message processing time] < [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
   <li>[Total process time of a group of messages] < [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
   </li>
</ul>
<>
> <
5 > 3 < 2 > 1

<a href="www.google.com">placeholder!!</a>

<hello >
< ciao>
<hello>
<hello></hello>
</b>
< /b>

Its correct output should be:

&lt;
text
&gt;
<ul>
   <li>[Message processing time] &lt; [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
   <li>[Total process time of a group of messages] &lt; [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
   </li>
</ul>
&lt;&gt;
&gt; &lt;
5 &gt; 3 &lt; 2 &gt; 1

<a href="www.google.com">placeholder!!</a>

<hello >
&lt; ciao&gt;
<hello>
<hello></hello>
</b>
&lt; /b&gt;

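What I've tried

I've tried several different approaches. In my first attempt, I used a plain PowerShell script, as the situation requires. This is the script I'm trying to fix.
It works well for standalone tags, but it falls apart when tags are nested inside one another. Here is a sample text for demonstration purposes: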

<
text
>
<ul>
   <li>[Message processing time] < [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
   <li>[Total process time of a group of messages] < [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
   </li>
</ul>
<>
> <
5 > 3 < 2 > 1

<a href="www.google.com">placeholder!!</a>


which gets converted to:

&lt;
text
&gt;
<ul>
   &lt;li&gt;[Message processing time] &lt; [time to send ack to Azure Service Bus] (about less than 100ms per message)&lt;/li&gt;
   &lt;li&gt;[Total process time of a group of messages] &lt; [Message Lock Time] (default: 1 min)&lt;br&gt;&lt;b&gt;Strictly REQUIRED&lt;/b&gt; to avoid lock loss and messages processed more than one time
   </li>
</ul>
<>
&gt; &lt;
5 &gt; 3 &lt; 2 &gt; 1

<a href="www.google.com">placeholder!!</a>


Here is the code:

function find-nonHTMLtags($files) {
    foreach ($file in $files) {
        try {
            # Read the content of the file
            $content = Get-Content -Path $file.FullName -Raw

            # Process each line
            $modifiedContent = foreach ($line in $content -split '\r?\n') {
                # Replace < with &lt; if it is not part of a closed HTML tag or has a space after it
                if ($line -notmatch '<\s*(?:[^>]+)?>' -or $line -match '<\s') {
                    $line = $line -replace '<', '&lt;'
                }

                # Replace > with &gt; if it is not part of a closed HTML tag
                if ($line -notmatch '<\s*(?:[^>]+)?>') {
                    $line = $line -replace '>', '&gt;'
                }

                # Output the modified line or the original line if no changes were made
                $line
            }

            # Join the modified lines into the modified content
            $modifiedContent = $modifiedContent -join "`r`n"

            # Check if both $content and modified content are non-empty before determining modification
            if (-not [string]::IsNullOrEmpty($content) -and $content -ne $modifiedContent) {
                # Write the modified content back to the file
                $modifiedContent | Set-Content -Path $file.FullName -Encoding UTF8
                Write-Host "Changed non-HTML tag(s) at: $($file.FullName)"
            }
        }
        catch {
            Write-Host "`nCouldn't change non-HTML tag(s) at: $($file.FullName). $_"
        }
    }
}

$mdFiles = Get-ChildItem -Path $path -File -Recurse -Filter '*.md'
find-nonHTMLtags $mdFiles
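
The over-escaping in the converted sample follows from the two conditions being evaluated once per line rather than once per bracket. The short sketch below replays the function's logic on a single line that mixes real tags with a stray `<`. The shortened sample line and the variable names are chosen here for illustration; the regexes are the ones used in the function.

```powershell
# One line that mixes real tags with a stray '<'.
$line = '<li>[Message processing time] < [time to send ack]</li>'

# The line contains '< ' (a bracket followed by whitespace), so the first condition is true...
$firstCondition = $line -notmatch '<\s*(?:[^>]+)?>' -or $line -match '<\s'

# ...and -replace then rewrites every '<' on the line, including those of <li> and </li>.
$afterFirst = $line -replace '<', '&lt;'

# With no '<' left, the line no longer matches the tag pattern,
# so the second condition is true as well and every '>' is rewritten.
$secondCondition = $afterFirst -notmatch '<\s*(?:[^>]+)?>'
$afterSecond = $afterFirst -replace '>', '&gt;'

$firstCondition
$secondCondition
$afterSecond
```

Both conditions evaluate to True, and the final string has the brackets of `<li>` and `</li>` escaped along with the stray `<`, which matches the behaviour seen in the converted sample. Deciding per bracket instead of per line avoids this.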


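The second approach I tried was using HAP (HtmlAgilityPack) via a .dll file. This works well, but unfortunately I've been told I can't use such files because they could pose a security threat. Here is the code: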

param (
    $path
)

function ReplaceSymbols($files) {
    foreach ($file in $files) {
        try {
            $content = Get-Content -Path $file.FullName -Raw

            Add-Type -Path (Join-Path $PSScriptRoot 'HtmlAgilityPack.dll')

            $htmlDocument = New-Object HtmlAgilityPack.HtmlDocument
            $htmlDocument.LoadHtml($content)

            # Iterate through each HTML node
            foreach ($node in $htmlDocument.DocumentNode.DescendantsAndSelf()) {
                # Check if the node is text
                if ($node.NodeType -eq 'Text') {
                    # Replace < with &lt; and > with &gt; only in text nodes
                    $node.InnerHtml = $node.InnerHtml -replace '<', '&lt;' -replace '>', '&gt;'
                }
            }

            if (-not [string]::IsNullOrEmpty($content) -and $content -ne $htmlDocument.DocumentNode.OuterHtml) {
                $htmlDocument.DocumentNode.OuterHtml | Set-Content -Path $file.FullName -Encoding UTF8
                Write-Host "File content modified: $($file.FullName)"
            }
        }
        catch {
            Write-Host "Error modifying file content: $($file.FullName). $_"
        }
    }
}

$mdFiles = Get-ChildItem -Path $path -File -Recurse -Include '*.md'
Write-Host "Markdown Files Count $($mdFiles.Count)"
ReplaceSymbols $mdFiles


Another approach that works is using JavaScript and NodeJS, but unfortunately I can't use it because NodeJS is not supported. Code:

const fs = require('fs');
const path = require('path');

function replaceNonHTMLtags(files) {
    files.forEach(filePath => {
        try {
            const content = fs.readFileSync(filePath, 'utf8');

            String.prototype.replaceAt = function (index, char) {
                let arr = this.split('');
                arr[index] = char;
                return arr.join('');
            };
            
            String.prototype.escape = function () {
                let p = /(?:<[a-zA-Z]+\s*[^>]*>)|(?:<\/[a-zA-Z]+>)|(?<lt><)|(?<gt>>)/g,
                result = this,
                match = p.exec(result);
            
                while (match !== null) {
                    if (match.groups.lt !== undefined) {
                        result = result.replaceAt(match.index, '&lt;');
                    } else if (match.groups.gt !== undefined) {
                        result = result.replaceAt(match.index, '&gt;');
                    }
                    match = p.exec(result);
                }
                return result;
            };
            
            

            // Perform modifications on the content
            const modifiedContent = content.escape();

            // Check if both content and modifiedContent aren't empty before doing anything else
            if (content !== '' && content !== modifiedContent) {
                // Write the modified content back to the file
                fs.writeFileSync(filePath, modifiedContent, 'utf8');
                console.log(`Edited: ${modifiedContent}`);
                console.log(`Edited: ${filePath}`);
            }
        } catch (error) {
            console.log(`Couldn't edit: ${filePath}. ${error.message}`);
        }
    });
}

const dynamicPath = ''; // Empty to use only __dirname
const orderFiles = fs.readdirSync(path.join(__dirname, dynamicPath)).filter(file => file.endsWith('.md')).map(file => path.join(dynamicPath, file));

console.log(`Markdown Files Count: ${orderFiles.length}`);
replaceNonHTMLtags(orderFiles);
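
Because Node.js is not an option here, the single-pass idea behind the JavaScript version (match whole tags first, then any leftover bracket) can also be sketched in PowerShell 5.1 with `[regex]::Replace` and a match evaluator. This is an illustrative port rather than the asker's code: the function name `ConvertTo-EscapedBrackets` and the simplified tag pattern are assumptions.

```powershell
# Hypothetical function name; a sketch of a single-pass replacement.
function ConvertTo-EscapedBrackets {
    param([string]$Text)

    # Alternatives are tried left to right at each position:
    #   1. a tag-like run: '<', optional '/', a letter, no further brackets, then '>'
    #   2. any other '<' (captured in group 'lt')
    #   3. any other '>' (captured in group 'gt')
    $pattern = '</?[a-z][^<>]*>|(?<lt><)|(?<gt>>)'

    $evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
        param($match)
        if     ($match.Groups['lt'].Success) { '&lt;' }
        elseif ($match.Groups['gt'].Success) { '&gt;' }
        else   { $match.Value }   # tag-like text is passed through unchanged
    }

    $options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
    return [regex]::Replace($Text, $pattern, $evaluator, $options)
}

ConvertTo-EscapedBrackets '5 > 3 < 2 > 1'
ConvertTo-EscapedBrackets '<li>[a] < [b]</li>'
```

Since tag-like matches consume their own brackets before the `lt` and `gt` alternatives are considered, no look-around is needed. Note that `[^<>]*` also spans line breaks, so a tag split across lines would still count as a tag in this sketch.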

Answer #1 (qgelzfjb)

As with any regex-based HTML processing, the following isn't fully robust, but it may work in your case:

$modifiedContent = 
  (Get-Content -Raw $file) `
    -replace '<(?!(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?>)',  '&lt;' `
    -replace '(?<!<(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?)>', '&gt;'

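The gist of this approach is to use a negative look-ahead assertion ((?!…)) to ensure that what follows < isn't the rest of an HTML tag and, analogously, a negative look-behind assertion ((?<!…)) to ensure that what precedes > isn't the start of an HTML tag.
For a detailed explanation of the regexes and the option to experiment with them, see this regex101.com page; for simplicity, the two regexes have been combined there into a single regex via alternation (|), and the placeholder substitution string &[gl]t; is used to symbolize the two distinct substitutions in the code above, &gt; and &lt;.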
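
To see the two substitutions at work, the sketch below applies them to a few lines taken or adapted from the question's test file. The sample strings and variable names are chosen here for illustration; the two patterns are copied unchanged from the code above.

```powershell
# A handful of test lines: stray brackets, real tags, and near-misses.
$samples = @(
    '5 > 3 < 2 > 1'
    '<li>[a] < [b]</li>'
    '<hello >'
    '< ciao>'
    '<>'
)

$results = foreach ($s in $samples) {
    # First escape every '<' that does not start a tag,
    # then every '>' that does not end one.
    $s -replace '<(?!(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?>)',  '&lt;' `
       -replace '(?<!<(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?)>', '&gt;'
}
$results
```

The first, fourth and fifth samples come back fully escaped, `<hello >` is left alone, and in the `<li>` line only the stray `<` changes. One thing to keep in mind: `[a-z]+` admits only letters in the tag name, so a tag such as `<h1>` would be escaped too. If such tags occur in your files, one possible adjustment is to use `[a-z][a-z0-9]*` in both patterns.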
