我正在编写一个PowerShell脚本,该脚本旨在查找HTML文件中包含不属于HTML标签的尖括号的行。该脚本应将上述尖括号替换为<
和>
。然而,我在使用当前脚本时遇到了困难,替换逻辑似乎无法按预期工作。值得注意的是,我正在对markdown文件进行操作,我需要使用来执行此操作。Powershell版本5.1.22621.2428不包含任何外部内容。
为了简化脚本,以下是我提出的标签被解释为这样的必要条件:
- 任何标签都必须关闭。例如,
<hello>
将是一个有效的标签,而<hello
则不是。第一个单词后面是否有空格并不重要,因为否则将无法正确检测具有属性的标签。 - 在开始标记和第一个单词之间不能有任何空格。例如,
< hello>
将不是标记,而<hello>
或<hello >
将是。这是为了防止像3 < 4 > 2
这样的东西被解释为有效标记。 - 标记必须包含某些内容。例如,
<>
将不是有效的标记。 - 标签不能以数字开头。
下面是我用来测试这些脚本的markdown测试文件:
<
text
>
<ul>
<li>[Message processing time] < [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
<li>[Total process time of a group of messages] < [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
</li>
</ul>
<>
> <
5 > 3 < 2 > 1
<a href="www.google.com">placeholder!!</a>
<hello >
< ciao>
<hello>
<hello></hello>
</b>
< /b>
字符串
其正确输出应为:
<
text
>
<ul>
<li>[Message processing time] < [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
<li>[Total process time of a group of messages] < [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
</li>
</ul>
<>
> <
5 > 3 < 2 > 1
<a href="www.google.com">placeholder!!</a>
<hello >
< hello>
<hello></hello>
</b>
< /b>
型
我所尝试的
我已经尝试了几种不同的方法。在第一次尝试中,我根据情况的需要使用了一个普通的Powershell脚本。这是我试图修复的脚本。
这对单独的标签很有效,但是当它们一个嵌套在另一个里面时,它会分开。下面是一个示例文本用于演示目的:
<
text
>
<ul>
<li>[Message processing time] < [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
<li>[Total process time of a group of messages] < [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
</li>
</ul>
<>
> <
5 > 3 < 2 > 1
<a href="www.google.com">placeholder!!</a>
型
其翻译为:
<
text
>
<ul>
<li>[Message processing time] < [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
<li>[Total process time of a group of messages] < [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
</li>
</ul>
<>
> <
5 > 3 < 2 > 1
<a href="www.google.com">placeholder!!</a>
型
代码如下:
function find-nonHTMLtags($files) {
foreach ($file in $files) {
try {
# Read the content of the file
$content = Get-Content -Path $file.FullName -Raw
# Process each line
$modifiedContent = foreach ($line in $content -split '\r?\n') {
# Replace < with < if it is not part of a closed HTML tag or has a space after it
if ($line -notmatch '<\s*(?:[^>]+)?>' -or $line -match '<\s') {
$line = $line -replace '<', '<'
}
# Replace > with > if it is not part of a closed HTML tag
if ($line -notmatch '<\s*(?:[^>]+)?>') {
$line = $line -replace '>', '>'
}
# Output the modified line or the original line if no changes were made
$line
}
# Join the modified lines into the modified content
$modifiedContent = $modifiedContent -join "`r`n"
# Check if both $content and modified content are non-empty before determining modification
if (-not [string]::IsNullOrEmpty($content) -and $content -ne $modifiedContent) {
# Write the modified content back to the file
$modifiedContent | Set-Content -Path $file.FullName -Encoding UTF8
Write-Host "Changed non-HTML tag(s) at: $($file.FullName)"
}
}
catch {
Write-Host "`nCouldn't changed non-HTML tag(s) at: $($file.FullName). $_"
}
}
}
$mdFiles = Get-ChildItem -Path $path -File -Recurse -Filter '*.md'
find-nonHTMLtags $mdFiles
型
我尝试的第二种方法是通过.dll文件使用HAP。这很好用,但遗憾的是我被告知不能使用这样的文件,因为它们可能会造成安全威胁。下面是代码:
param (
$path
)
function ReplaceSymbols($files) {
foreach ($file in $files) {
try {
$content = Get-Content -Path $file.FullName -Raw
Add-Type -Path (Join-Path $PSScriptRoot 'HtmlAgilityPack.dll')
$htmlDocument = New-Object HtmlAgilityPack.HtmlDocument
$htmlDocument.LoadHtml($content)
# Iterate through each HTML node
foreach ($node in $htmlDocument.DocumentNode.DescendantsAndSelf()) {
# Check if the node is text
if ($node.NodeType -eq 'Text') {
# Replace < with < and > with > only in text nodes
$node.InnerHtml = $node.InnerHtml -replace '<', '<' -replace '>', '>'
}
}
if (-not [string]::IsNullOrEmpty($content) -and $content -ne $htmlDocument.DocumentNode.OuterHtml) {
$htmlDocument.DocumentNode.OuterHtml | Set-Content -Path $file.FullName -Encoding UTF8
Write-Host "File content modified: $($file.FullName)"
}
}
catch {
Write-Host "Error modifying file content: $($file.FullName). $_"
}
}
}
$mdFiles = Get-ChildItem -Path $path -File -Recurse -Include '*.md'
Write-Host "Markdown Files Count $($mdFiles.Count)"
ReplaceSymbols $filesToProcess
型
另一种有效的方法是使用JavaScript和NodeJS,但遗憾的是我不能使用这种方法,因为NodeJS不支持。代码:
const fs = require('fs');
const path = require('path');
function replaceNonHTMLtags(files) {
files.forEach(filePath => {
try {
const content = fs.readFileSync(filePath, 'utf8');
String.prototype.replaceAt = function (index, char) {
let arr = this.split('');
arr[index] = char;
return arr.join('');
};
String.prototype.escape = function () {
let p = /(?:<[a-zA-Z]+\s*[^>]*>)|(?:<\/[a-zA-Z]+>)|(?<lt><)|(?<gt>>)/g,
result = this,
match = p.exec(result);
while (match !== null) {
if (match.groups.lt !== undefined) {
result = result.replaceAt(match.index, '<');
} else if (match.groups.gt !== undefined) {
result = result.replaceAt(match.index, '>');
}
match = p.exec(result);
}
return result;
};
// Perform modifications on the content
const modifiedContent = content.escape();
// Check if both content and modifiedContent aren't empty before doing anything else
if (content !== '' && content !== modifiedContent) {
// Write the modified content back to the file
fs.writeFileSync(filePath, modifiedContent, 'utf8');
console.log(`Edited: ${modifiedContent}`);
console.log(`Edited: ${filePath}`);
}
} catch (error) {
console.log(`Couldn't edit: ${filePath}. ${error.message}`);
}
});
}
const dynamicPath = ''; // Empty to use only __dirname
const orderFiles = fs.readdirSync(path.join(__dirname, dynamicPath)).filter(file => file.endsWith('.md')).map(file => path.join(dynamicPath, file));
console.log(`Markdown Files Count: ${orderFiles.length}`);
replaceNonHTMLtags(orderFiles);
型
1条答案
按热度按时间qgelzfjb1#
与任何基于regex的HTML处理一样,以下isn't fully robust,但可能适用于您的情况:
字符串
这种方法的要点是使用否定的look-ahead Assert(
(?!…)
)来确保<
之后的内容不是HTML标记的其余部分,类似地,使用否定的look-behind Assert((?<!…)
)来确保>
之前的内容不是HTML标记的开始。有关正则表达式的详细说明以及使用它们的选项,请参见this regex101.com page;为简单起见,两个正则表达式已通过交替(
|
)合并为一个正则表达式,并使用 * 占位符 * 替换字符串&[gl]t;
来符号化上述代码中的两个不同替换>
和<