PowerShell: extracting URLs from a web page

Posted by sycxhyv7 on 2024-01-08 in Shell


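I want to extract the URLs from a web page that contains many URLs and save them to a txt file.
Each URL on the page is prefixed with '127.0.0.1', but I want to drop the '127.0.0.1' and extract only the URL. When I run the PowerShell script below, it saves only '127.0.0.1'. Any help in solving this would be appreciated.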

$threatFeedUrl = "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt"

# Download the threat feed data
$threatFeedData = Invoke-WebRequest -Uri $threatFeedUrl

# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1(?:[^\s]*)'

# Use the regular expression to find matches in the threat feed data
$matches = [regex]::Matches($threatFeedData.Content, $pattern)

# Create a list to store the matched URLs
$urlList = @()

# Populate the list with matched URLs
foreach ($match in $matches) {
    $urlList += $match.Value
}

# Specify the output file path
$outputFilePath = "output.txt"

# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath

Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."


powershell

Source: https://stackoverflow.com/questions/77525425/extract-urls-from-webpages

1 answer


q7solyqu1#

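Preface:

The target URL happens to be a (semi-structured) *plain-text* resource, so regex-based processing *is* appropriate.
In general, however, a dedicated parser is preferable for HTML content, because regexes cannot *robustly* parse HTML.[1] See this answer for an example of extracting links from an HTML document.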

'127\.0\.0\.1(?:[^\s]*)'


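You mistakenly used a *non-capturing* group ((?:…)) rather than a *capturing* group ((…)).
In the downloaded content, 127.0.0.1 is followed by a *space*, so [^\s]* matches zero characters and the overall match is just 127.0.0.1.
Therefore, use the following regex (\S is a simpler equivalent of [^\s], and + matches only a *non-empty* run of non-whitespace characters):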

'127\.0\.0\.1 (\S+)'
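
As a rough illustration of the difference, the sketch below runs both patterns against a made-up two-line sample. The host names are placeholders and are not taken from the actual feed, and the variable names are hypothetical.

```powershell
# Hypothetical sample in hosts-file format; the host names are invented for illustration.
$sample = "127.0.0.1 bad.example.com`n127.0.0.1 worse.example.net"

# Original pattern: [^\s]* may match zero characters, and the character
# right after '127.0.0.1' is a space, so each match stops there.
$originalMatches = [regex]::Matches($sample, '127\.0\.0\.1(?:[^\s]*)') |
    ForEach-Object { $_.Value }

# Corrected pattern: a literal space, then a non-empty run of non-whitespace.
$fixedMatches = [regex]::Matches($sample, '127\.0\.0\.1 (\S+)') |
    ForEach-Object { $_.Value }

$originalMatches   # every match is only the address prefix
$fixedMatches      # every match now spans the prefix, the space, and the host name
```

With the corrected pattern the whole match still includes the 127.0.0.1 prefix; the host name alone is available through the capture group.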


$matches = …


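Although $matches technically causes no problem here, it is the name of the automatic variable $Matches (PowerShell variable names are case-insensitive), so it should not be used for custom purposes.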
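
For context, here is a minimal sketch of where $Matches normally comes from: the -match operator fills it in after a successful match. The sample line and the variable names $line, $whole, and $hostName are made up for illustration.

```powershell
# Made-up sample line in hosts-file format.
$line = '127.0.0.1 bad.example.com'

# On success, -match populates the automatic variable $Matches:
# index 0 holds the entire match, index 1 the first capture group.
if ($line -match '127\.0\.0\.1 (\S+)') {
    $whole    = $Matches[0]
    $hostName = $Matches[1]
}

$whole
$hostName
```

Assigning your own value to a variable of that name risks it being silently overwritten the next time -match succeeds in the same scope, which is why a different name such as $matchList is the safer choice.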

$match.Value


$match.Value is the *whole* text that your regex matched, whereas you only need the text of the *capture group*.
Use $match.Groups[1].Value instead.
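
A small sketch of the two properties side by side, using a made-up input line (the variable names $wholeText and $groupText are hypothetical):

```powershell
# Match a single made-up line with the corrected pattern.
$match = [regex]::Match('127.0.0.1 bad.example.com', '127\.0\.0\.1 (\S+)')

$wholeText = $match.Value             # entire match, prefix included
$groupText = $match.Groups[1].Value   # only what the parentheses captured

$wholeText
$groupText
```

Groups[0] always refers to the entire match, so numbered capture groups start at index 1.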

$urlList +=


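Building an array *iteratively* with += is *inefficient*, because a *new* array must be allocated behind the scenes on every iteration; simply use the foreach statement *as an expression* and let PowerShell collect the results for you. See this answer for details.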
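
The two approaches can be compared on a toy input. This is only a sketch: the numbers and the variable names $slow and $fast are placeholders, and wrapping the loop in @(...) is an optional addition of this example, used so the result is an array even when the loop emits zero or one items.

```powershell
$numbers = 1..5

# Inefficient: every += allocates a new array and copies the existing elements.
$slow = @()
foreach ($n in $numbers) {
    $slow += $n * 2
}

# Preferred: use the foreach statement as an expression and let PowerShell
# collect whatever the loop body outputs.
$fast = @(foreach ($n in $numbers) {
    $n * 2
})

$slow -join ','
$fast -join ','
```

Both variables end up with the same elements; the difference only becomes noticeable with large collections.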

Invoke-WebRequest -Uri $threatFeedUrl


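Since you are only interested in the *text content* of the response, Invoke-RestMethod is simpler than Invoke-WebRequest; the former returns the content *directly* (no need to access the .Content property).

Putting it all together: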

$threatFeedUrl = 'https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt'
    
# Download the threat feed data
$threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl
    
# Define a regular expression pattern to match URLs starting with '127.0.0.1'
$pattern = '127\.0\.0\.1 (\S+)'
    
# Use the regular expression to find matches in the threat feed data
$matchList = [regex]::Matches($threatFeedData, $pattern)
    
# Create and populate the list with matched URLs
$urlList = 
  foreach ($match in $matchList) {
    $match.Groups[1].Value
  }
    
# Specify the output file path
$outputFilePath = 'output.txt'
    
# Save the URLs to the output file
$urlList | Out-File -FilePath $outputFilePath
    
Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."

[1] For background information, see this blog post.

2024-01-08
