PowerShell: extract a URL from a text file

Posted by nuypyhwy on 2023-06-06 in Shell


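I have a large text file that contains the text "View this email in your browser" followed by a URL. The format varies, and sometimes part of the URL wraps onto the next line.
When it does wrap, the line ends with an equals sign that needs to be removed, but none of the other equals signs in the URL should be touched.
A few examples: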

View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)
View this email in your browser <https://mail.com/?e=3D14=
60&u=3Df612577510b&id=3D2c8be>
View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)

I need to extract that URL with PowerShell, without the enclosing brackets (usually parentheses, sometimes < >), so that I can download it as an HTML file.

if ($str -match '(?<=\()https?://[^)]+') {
  # ... remove any line breaks from it, and output the result.
  $Matches.0 -replace '\r?\n'
}
if ($str -match '(?<=\<)https?://[^>]+') {
  # ... remove any line breaks from it, and output the result.
  $Matches.0 -replace '\r?\n'
}

powershell

Source: https://stackoverflow.com/questions/76226800/extract-url-from-text-file

2 Answers


uxh89sit1#

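Because you are trying to match *across lines*, you need to make sure the text file is read *as a whole*, i.e. as a single multiline string. You can do that with the -Raw switch of the Get-Content cmdlet.
Apart from that, the only thing missing from your regex approach is matching and removing the = that precedes the newline.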
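As a minimal sketch of why -Raw matters here: without it, Get-Content emits one string per line, so a pattern that spans a line break has nothing to match against. The temp-file name `raw-demo.txt` and the shortened sample URL are assumptions made only for this illustration.

```powershell
# Hypothetical temp file, created only for this demonstration.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'raw-demo.txt'
Set-Content -Path $path -Value "browser <https://mail.com/?e=3D14=`n60&u=3Df6>"

$lines = Get-Content -Path $path        # one string per line
$whole = Get-Content -Path $path -Raw   # a single string, line breaks included

$lineCount      = @($lines).Count       # the URL is split across separate strings
$isSingleString = $whole -is [string]   # -Raw yields exactly one string
$seesSoftBreak  = $whole -match '=\r?\n'  # the '=' + newline is visible to the regex

$lineCount; $isSingleString; $seesSoftBreak
Remove-Item -Path $path
```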

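The following code extracts all URLs from the input file file.txt and outputs them as an array of strings (with the newlines and the line-ending = removed):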

# Note the '=' before '\r?\n'
[regex]::Matches(
  (Get-Content -Raw file.txt),
  '(?<=[<(])https://[^>)]+'
).Value -replace '=\r?\n'

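Calling the [regex]::Matches() .NET API directly lets you extract all matches at once, whereas PowerShell's -match operator only ever finds one match.
For a proposal to introduce a -matchall operator in the future, see GitHub issue #7867.
-replace is then used to remove each newline (\r?\n) from the matches, along with the = that precedes it.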
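The difference between the two approaches can be seen in a small sketch. The input string and the `one.example` / `two.example` hosts are made up for illustration; the pattern is the one used in this answer.

```powershell
$text    = 'a (https://one.example/x) b <https://two.example/y>'
$pattern = '(?<=[<(])https://[^>)]+'

# -match stops at the first hit and stores it in the automatic $Matches variable.
if ($text -match $pattern) { $first = $Matches[0] }

# [regex]::Matches() returns every hit; .Value collects the matched strings.
$all = [regex]::Matches($text, $pattern).Value

$first   # only the first URL
$all     # both URLs
```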

For an explanation of the URL-matching regex and the ability to experiment with it, see this regex101.com page.
An example using a multiline string literal:

[regex]::Matches('
View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)
View this email in your browser <https://mail.com/?e=3D14=
60&u=3Df612577510b&id=3D2c8be>
View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)
  ',
  '(?<=[<(])https://[^>)]+'
).Value -replace '=\r?\n'

Output:

https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be
https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be
https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be
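A side note, hedged: the `=3D` sequences and the trailing `=` at the line break look like quoted-printable encoding, in which `=3D` stands for a literal `=`. That is an assumption about the input, not something the question states. If it holds for your file, the extracted URLs may need decoding before they can be downloaded. A minimal sketch that handles only single-byte escapes, with `$extracted` as a hypothetical variable holding one extracted URL:

```powershell
# Assumption: the URL is quoted-printable encoded and contains only ASCII escapes.
$extracted = 'https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be'

# Replace each '=XX' hex escape with the character it encodes ('=3D' becomes '=').
$decoded = [regex]::Replace($extracted, '=([0-9A-Fa-f]{2})', {
    param($m)
    [string][char][Convert]::ToInt32($m.Groups[1].Value, 16)
})

$decoded
```

Run the decoding only after the `=` plus newline pairs have been removed, otherwise a soft line break followed by two hex-looking characters could be misread as an escape.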


Posted 2023-06-06

huus2vyu2#

This solution works for the examples you provided:

$text = @(
    'View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)',
    'View this email in your browser <https://mail.com/?e=3D14=
60&u=3Df612577510b&id=3D2c8be>',
    'View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)'
)
$text = $text | ForEach-Object {
    # Turn <...> into (...), drop the "=" plus line break (CRLF or LF),
    # then keep only the text between "(" and ")".
    $PSItem.Replace('<','(').Replace('>',')').Replace("=`r`n",'').Replace("=`n",'').Split('(')[1].Replace(')','')
}
$text

The output looks like this:

https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be
https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be
https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be

I only use Replace, without any regex. The difficulty you had with URLs that are split across lines can be solved by calling

.Replace("=`n", '')
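One caveat with a literal "=`n" replacement: String.Replace matches the exact characters, so it finds nothing when the input uses Windows-style CRLF line endings. A small sketch of the difference, using a shortened, made-up sample string:

```powershell
# Sample with a CRLF line break after the trailing '='.
$crlf = "<https://mail.com/?e=3D14=`r`n60&u=3Df6>"

# LF-only replacement: the '=' is followed by CR, so nothing is replaced.
$lfOnly = $crlf.Replace("=`n", '')

# Handling CRLF first and then LF covers both line-ending styles.
$both = $crlf.Replace("=`r`n", '').Replace("=`n", '')

$lfOnly -eq $crlf   # True: the string is unchanged
$both               # the URL rejoined on one line
```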


Posted 2023-06-06
