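I'm writing a PowerShell script that finds lines in HTML files containing angle brackets that are not part of an HTML tag. The script should replace those brackets with &lt; and &gt;. However, I'm struggling with my current script: the replacement logic doesn't work as expected. Note that I'm working on markdown files, and I need to do this with PowerShell version 5.1.22621.2428, without anything external.
To keep the script simple, these are the conditions I came up with for something to be interpreted as a tag: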

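Any tag must be closed. For example, <hello> is a valid tag, while <hello is not. Whether there is a space after the first word doesn't matter, since otherwise tags with attributes would not be detected correctly.
There must be no whitespace between the opening bracket and the first word. For example, < hello> is not a tag, while <hello> and <hello > are. This prevents things like 3 < 4 > 2 from being interpreted as a valid tag.
A tag must contain something. For example, <> is not a valid tag.
A tag cannot start with a number.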
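
The four conditions above could be expressed as a single pattern. The following is a minimal sketch, assuming that a tag name starts with an ASCII letter and that closing tags such as </b> count as tags; the helper name Test-IsTag is hypothetical and not part of the script in question.

```powershell
# Hypothetical helper (not part of the original script): returns $true when
# $Text is, in its entirety, a tag according to the four conditions above.
function Test-IsTag {
    param([string]$Text)

    # '<'        opening bracket
    # '/?'       optional slash, so closing tags such as </b> qualify
    # '[a-z]'    first character must be a letter: no space, no digit, not empty
    # '[^<>]*'   anything except another bracket (attributes, trailing space)
    # '>'        the tag must be closed
    # -match is case-insensitive by default, so <UL> qualifies as well.
    return $Text -match '^</?[a-z][^<>]*>$'
}

# Sample inputs based on the conditions above; prints each input with True/False.
'<hello>', '<hello >', '</b>', '<hello', '< hello>', '<>', '<3>', '< /b>' |
    ForEach-Object { '{0,-10} {1}' -f $_, (Test-IsTag $_) }
```

With these inputs, only <hello>, <hello > and </b> are reported as tags.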

Here is the markdown test file I use to test these scripts:

<
text
>
<ul>
   <li>[Message processing time] < [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
   <li>[Total process time of a group of messages] < [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
   </li>
</ul>
<>
> <
5 > 3 < 2 > 1

<a href="www.google.com">placeholder!!</a>

<hello >
< ciao>
<hello>
<hello></hello>
</b>
< /b>

The correct output should be:

&lt;
text
&gt;
<ul>
   <li>[Message processing time] &lt; [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
   <li>[Total process time of a group of messages] &lt; [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
   </li>
</ul>
&lt;&gt;
&gt; &lt;
5 &gt; 3 &lt; 2 &gt; 1

<a href="www.google.com">placeholder!!</a>

<hello >
&lt; ciao&gt;
<hello>
<hello></hello>
</b>
&lt; /b&gt;

What I've tried

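I've tried a few different approaches. In the first attempt, I used a plain PowerShell script, as the situation requires. This is the script I'm trying to fix.
It works well for standalone tags, but it breaks when they are nested inside one another. Here is a sample text for demonstration purposes: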

<
text
>
<ul>
   <li>[Message processing time] < [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
   <li>[Total process time of a group of messages] < [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
   </li>
</ul>
<>
> <
5 > 3 < 2 > 1

<a href="www.google.com">placeholder!!</a>

It gets converted to:

&lt;
text
&gt;
<ul>
   &lt;li&gt;[Message processing time] &lt; [time to send ack to Azure Service Bus] (about less than 100ms per message)&lt;/li&gt;
   &lt;li&gt;[Total process time of a group of messages] &lt; [Message Lock Time] (default: 1 min)&lt;br&gt;&lt;b&gt;Strictly REQUIRED&lt;/b&gt; to avoid lock loss and messages processed more than one time
   </li>
</ul>
<>
&gt; &lt;
5 &gt; 3 &lt; 2 &gt; 1

<a href="www.google.com">placeholder!!</a>

Here is the code:

function find-nonHTMLtags($files) {
    foreach ($file in $files) {
        try {
            # Read the content of the file
            $content = Get-Content -Path $file.FullName -Raw

            # Process each line
            $modifiedContent = foreach ($line in $content -split '\r?\n') {
                # Replace < with &lt; if it is not part of a closed HTML tag or has a space after it
                if ($line -notmatch '<\s*(?:[^>]+)?>' -or $line -match '<\s') {
                    $line = $line -replace '<', '&lt;'
                }

                # Replace > with &gt; if it is not part of a closed HTML tag
                if ($line -notmatch '<\s*(?:[^>]+)?>') {
                    $line = $line -replace '>', '&gt;'
                }

                # Output the modified line or the original line if no changes were made
                $line
            }

            # Join the modified lines into the modified content
            $modifiedContent = $modifiedContent -join "`r`n"

            # Check if both $content and modified content are non-empty before determining modification
            if (-not [string]::IsNullOrEmpty($content) -and $content -ne $modifiedContent) {
                # Write the modified content back to the file
                $modifiedContent | Set-Content -Path $file.FullName -Encoding UTF8
                Write-Host "Changed non-HTML tag(s) at: $($file.FullName)"
            }
        }
        catch {
            Write-Host "`nCouldn't changed non-HTML tag(s) at: $($file.FullName). $_"
        }
    }
}

$mdFiles = Get-ChildItem -Path $path -File -Recurse -Filter '*.md'
find-nonHTMLtags $mdFiles
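
One likely reason for the output shown earlier is that both conditions are evaluated against the whole line, and -replace then rewrites every bracket on that line. A single stray < followed by a space is therefore enough for the real tags on the same line to be escaped too. The trace below follows the two checks on one line; the shortened sample line and the variable names are chosen for illustration only.

```powershell
# Shortened sample line (illustrative): two real tags plus one stray '<'.
$line    = '<li>a < b</li>'
$tagLike = '<\s*(?:[^>]+)?>'   # the pattern used by both if-conditions in the script

# First condition: the line does contain something tag-like, but it also
# contains '<' followed by a space, so the -or branch fires for the whole line.
$firstIf = ($line -notmatch $tagLike) -or ($line -match '<\s')
if ($firstIf) { $line = $line -replace '<', '&lt;' }

# Second condition: no '<' is left at this point, so nothing looks like a tag
# any more and every '>' on the line is replaced as well.
$secondIf = $line -notmatch $tagLike
if ($secondIf) { $line = $line -replace '>', '&gt;' }

$line   # &lt;li&gt;a &lt; b&lt;/li&gt;
```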

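The second approach I tried uses HAP (HtmlAgilityPack) through a .dll file. This works well, but unfortunately I was told I can't use such files because they may pose a security threat. Here is the code: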

param (
    $path
)

function ReplaceSymbols($files) {
    foreach ($file in $files) {
        try {
            $content = Get-Content -Path $file.FullName -Raw

            Add-Type -Path (Join-Path $PSScriptRoot 'HtmlAgilityPack.dll')

            $htmlDocument = New-Object HtmlAgilityPack.HtmlDocument
            $htmlDocument.LoadHtml($content)

            # Iterate through each HTML node
            foreach ($node in $htmlDocument.DocumentNode.DescendantsAndSelf()) {
                # Check if the node is text
                if ($node.NodeType -eq 'Text') {
                    # Replace < with &lt; and > with &gt; only in text nodes
                    $node.InnerHtml = $node.InnerHtml -replace '<', '&lt;' -replace '>', '&gt;'
                }
            }

            if (-not [string]::IsNullOrEmpty($content) -and $content -ne $htmlDocument.DocumentNode.OuterHtml) {
                $htmlDocument.DocumentNode.OuterHtml | Set-Content -Path $file.FullName -Encoding UTF8
                Write-Host "File content modified: $($file.FullName)"
            }
        }
        catch {
            Write-Host "Error modifying file content: $($file.FullName). $_"
        }
    }
}

$mdFiles = Get-ChildItem -Path $path -File -Recurse -Include '*.md'
Write-Host "Markdown Files Count $($mdFiles.Count)"
ReplaceSymbols $mdFiles

Another approach that works uses JavaScript and NodeJS, but unfortunately I can't use it because NodeJS is not supported. The code:

const fs = require('fs');
const path = require('path');

function replaceNonHTMLtags(files) {
    files.forEach(filePath => {
        try {
            const content = fs.readFileSync(filePath, 'utf8');

            String.prototype.replaceAt = function (index, char) {
                let arr = this.split('');
                arr[index] = char;
                return arr.join('');
            };
            
            String.prototype.escape = function () {
                let p = /(?:<[a-zA-Z]+\s*[^>]*>)|(?:<\/[a-zA-Z]+>)|(?<lt><)|(?<gt>>)/g,
                result = this,
                match = p.exec(result);
            
                while (match !== null) {
                    if (match.groups.lt !== undefined) {
                        result = result.replaceAt(match.index, '&lt;');
                    } else if (match.groups.gt !== undefined) {
                        result = result.replaceAt(match.index, '&gt;');
                    }
                    match = p.exec(result);
                }
                return result;
            };
            
            

            // Perform modifications on the content
            const modifiedContent = content.escape();

            // Check if both content and modifiedContent aren't empty before doing anything else
            if (content !== '' && content !== modifiedContent) {
                // Write the modified content back to the file
                fs.writeFileSync(filePath, modifiedContent, 'utf8');
                console.log(`Edited: ${modifiedContent}`);
                console.log(`Edited: ${filePath}`);
            }
        } catch (error) {
            console.log(`Couldn't edit: ${filePath}. ${error.message}`);
        }
    });
}

const dynamicPath = ''; // Empty to use only __dirname
const orderFiles = fs.readdirSync(path.join(__dirname, dynamicPath)).filter(file => file.endsWith('.md')).map(file => path.join(dynamicPath, file));

console.log(`Markdown Files Count: ${orderFiles.length}`);
replaceNonHTMLtags(orderFiles);

1 Answer

qgelzfjb1#

As with any regex-based HTML processing, the following isn't fully robust, but it may work in your case:

$modifiedContent = 
  (Get-Content -Raw $file) `
    -replace '<(?!(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?>)',  '&lt;' `
    -replace '(?<!<(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?)>', '&gt;'

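The gist of this approach is to use a negative look-ahead assertion ((?!…)) to ensure that what follows < is not the rest of an HTML tag and, analogously, a negative look-behind assertion ((?<!…)) to ensure that what precedes > is not the start of an HTML tag.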
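
To see the two assertions at work, the replacements can be wrapped in a small function and run on a few short strings. This is only a sketch: the function name ConvertTo-EscapedBrackets and the shortened sample strings are assumptions made for illustration, while the two patterns are copied unchanged from the code above.

```powershell
# Hypothetical wrapper around the two -replace operations shown above.
function ConvertTo-EscapedBrackets {
    param([string]$Text)

    # Step 1: escape every '<' that is NOT followed by the rest of a tag.
    $step = $Text -replace '<(?!(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?>)', '&lt;'

    # Step 2: escape every '>' that is NOT preceded by the start of a tag.
    $step -replace '(?<!<(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?)>', '&gt;'
}

ConvertTo-EscapedBrackets '<li>a < b</li>'   # <li>a &lt; b</li>
ConvertTo-EscapedBrackets '5 > 3 < 2 > 1'    # 5 &gt; 3 &lt; 2 &gt; 1
ConvertTo-EscapedBrackets '< /b>'            # &lt; /b&gt;
ConvertTo-EscapedBrackets '<hello >'         # <hello >
```

Because [a-z]+ must be followed directly by whitespace, an optional slash, or >, a tag whose name contains a digit (for example <h1>) is not recognized by these patterns and would be escaped.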
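For a detailed explanation of the regexes and the ability to experiment with them, see this regex101.com page; for simplicity, the two regexes were merged into a single one via alternation (|), and the placeholder substitution string &[gl]t; is used to symbolize the two distinct replacements in the code above, &gt; and &lt;.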
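
A single-pass variant can also be tried directly in PowerShell by passing a script block as the match evaluator to [regex]::Replace. The sketch below is an alternative under stated assumptions, not the code from this answer: it uses a simplified pattern (a tag is <, an optional /, a letter, then anything up to the next >), and the function name Convert-StrayBrackets is hypothetical.

```powershell
# Hypothetical single-pass variant: match either a whole tag or a lone bracket,
# keep tags untouched and escape everything else.
function Convert-StrayBrackets {
    param([string]$Text)

    # (?i)                 case-insensitive matching for the whole pattern
    # </?[a-z][^<>]*>      first alternative: a tag under the simplified rule
    # [<>]                 second alternative: any bracket not consumed by a tag
    $pattern = '(?i)</?[a-z][^<>]*>|[<>]'

    # The script block is converted to a MatchEvaluator delegate and is called
    # once per match; its output becomes the replacement text.
    $evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
        param($m)
        if     ($m.Value -eq '<') { '&lt;' }
        elseif ($m.Value -eq '>') { '&gt;' }
        else                      { $m.Value }   # a complete tag: leave it as it is
    }

    [regex]::Replace($Text, $pattern, $evaluator)
}

Convert-StrayBrackets '<ul><li>a < b</li></ul>'   # <ul><li>a &lt; b</li></ul>
Convert-StrayBrackets '5 > 3 < 2 > 1'             # 5 &gt; 3 &lt; 2 &gt; 1
Convert-StrayBrackets '<>'                        # &lt;&gt;
```

Since the tag alternative is tried first at each position, brackets that belong to a tag are consumed together with it and never reach the escaping branch. Note that [^<>]* also matches line breaks, so on text read with -Raw a tag-like sequence spanning several lines would be kept as a tag.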
