PowerShell: HTML table to CSV

ua4mk5z4 posted on 2023-06-23 in Shell

I'm new to PowerShell, and I'm using JohnLBevan's code to convert an HTML table to CSV:

function ConvertFrom-HtmlTableRow {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTableRow
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        $headers
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        [switch]$isHeader

    )
    process {
        $cols = $htmlTableRow | select -expandproperty td
        if($isHeader.IsPresent) {
            0..($cols.Count - 1) | %{$x=$cols[$_] | out-string; if(($x) -and ($x.Trim() -gt [string]::Empty)) {$x} else {("Column_{0:0000}" -f $_)}} #clean the headers to ensure each col has a name        
        } else {
            $colCount = ($cols | Measure-Object).Count - 1
            $result = new-object -TypeName PSObject
            0..$colCount | %{
                $colName = if($headers[$_]){$headers[$_]}else{("Column_{0:00000}" -f $_)} #in case we have more columns than headers 
                $colValue = $cols[$_]
                $result | Add-Member NoteProperty $colName $colValue
            } 
            write-output $result
        }
    }
}

function ConvertFrom-HtmlTable {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTable
    )
    process {
        #currently only very basic <table><tr><td>...</td></tr></table> structure supported
        #could be improved to better understand tbody, th, nested tables, etc

        #$htmlTable.childNodes | ?{ $_.tagName -eq 'tr' } | ConvertFrom-HtmlTableRow

        #remove any tags that aren't td or tr (simplifies our parsing of the data)
        [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>{0}</root>" -f ($htmlTable | select -ExpandProperty innerHTML | %{(($_ | out-string) -replace '(</?t[rdh])[^>]*(/?>)|(?:<[^>]*>)','$1$2') -replace '(</?)(?:th)([^>]*/?>)','$1td$2'})) 
        [string[]]$headers = $cleanedHtml.root.tr | select -first 1 | ConvertFrom-HtmlTableRow -isHeader
        if ($headers.Count -gt 0) {
            $cleanedHtml.root.tr | select -skip 1 | ConvertFrom-HtmlTableRow -Headers $headers | select $headers
        }
    }
}

But whenever I run it on the "table" elements I get by tag name from my parsedHTML variable, I get this error:

Cannot convert value "<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>

</root>" to type "System.Xml.XmlDocument". Error: "The 'Tr' start tag on line 16 position 124 does not match the end tag of 'td'. Line 20, position 3."
At line:108 char:9
+         [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp '  ...
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvalidCastToXmlDocument

I hope someone can help me. Thanks in advance.
I'm trying to work with an external website. This is the HTML of the table:

<table class="organization-admin__table table">
          <thead>
            <tr>
              <th colspan="2">Name</th>
              <th>Email address</th>
              <th>Timezone</th>
              <th>Last logged in</th>
              <th>Actions</th>
            </tr>
          </thead>
          <tbody>
            
            <tr>
              <td width="48px">
  <a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user1 &lt;user1@site.com&gt;" title="user1 &lt;user1@site.com&gt;">
    <img src="https://portal.website.com/avatar/0fd7f51cee04789c617b1cc973e0b245.jpg?s=64&amp;r=g&amp;d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F87d37c%2Ffff%26text%3DTM" alt="user1 &lt;user1@site.com&gt;" width="32" height="32">
  </a>

</td>
              <td><a href="/site/users/samluser">user1</a></td>
              <td><a href="mailto:user1@site.com">user1@site.com</a></td>
              <td>Canada/Eastern</td>
              <td>05 Aug 2021</td>
              <td>
                <ul class="button-group">
                  
                  
                  <li>
                    <a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
                      <i class="fa fa-pencil-alt"></i>
                      Edit
                    </a>
                  </li>
                  
                  
                  <li>
                    <a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
                      <i class="fa fa-trash-alt"></i>
                      Delete
                    </a>
                  </li>
                  
                </ul>
              </td>
            </tr>
            
            <tr>
              <td width="48px">
  <a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user2 &lt;user2@site.ca&gt;" title="user2 &lt;user2@site.ca&gt;">
    <img src="https://portal.website.com/avatar/481355c93fa79e47ca56110da63d6da5.jpg?s=64&amp;r=g&amp;d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F044f67%2Ffff%26text%3DVS" alt="user2 &lt;user2@site.ca&gt;" width="32" height="32">
  </a>

</td>
              <td><a href="/site/users/samluser">user2</a></td>
              <td><a href="mailto:user2@site.ca">user2@site.ca</a></td>
              <td>Canada/Eastern</td>
              <td>16 Jul 2021</td>
              <td>
                <ul class="button-group">
                  
                  
                  <li>
                    <a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
                      <i class="fa fa-pencil-alt"></i>
                      Edit
                    </a>
                  </li>
                  
                  
                  <li>
                    <a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
                      <i class="fa fa-trash-alt"></i>
                      Delete
                    </a>
                  </li>
                  
                </ul>
              </td>
            </tr>
            
            <tr>
              <td width="48px">
  <a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user3 &lt;user3@site.com&gt;" title="user3 &lt;user3@site.com&gt;">
    <img src="https://portal.website.com/avatar/450f564aaba30e75fe70dc5f4bbefaf6.jpg?s=64&amp;r=g&amp;d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2Fffb61e%2Ffff%26text%3DWP" alt="Wilfred &lt;user3@site.com&gt;" width="32" height="32">
  </a>

</td>
              <td><a href="/site/users/samluser">Wilfred</a></td>
              <td><a href="mailto:user3@site.com">Wilfred@site.com</a></td>
              <td>UTC</td>
              <td>26 Jul 2021</td>
              <td>
                <ul class="button-group">
                  
                  
                  <li>
                    <a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
                      <i class="fa fa-pencil-alt"></i>
                      Edit
                    </a>
                  </li>
                  
                  
                  <li>
                    <a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
                      <i class="fa fa-trash-alt"></i>
                      Delete
                    </a>
                  </li>
                  
                </ul>
              </td>
            </tr>
            
          </tbody>
    </table>

gcuhipw9 #1

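As commented, XML has strict rules: when the HTML neglects to write a closing tag such as </tr>, loading it as XML fails. The same goes for <img> tags that have no closing </img> tag.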
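
To see that strictness in isolation, here is a minimal sketch. The helper name Test-XmlCast and both sample strings are made up for illustration; they are not taken from the page in the question. It casts two fragments to [xml]: one where <img> is self-closed, and one written HTML-style without a closing tag.

```powershell
# Hypothetical helper: returns $true when the text can be cast to [xml], $false otherwise
function Test-XmlCast {
    param([string]$Text)
    try {
        $null = [xml]$Text
        $true
    }
    catch {
        $false
    }
}

# Same content twice; only the <img> tag differs
$wellFormed = '<root><tr><td><img src="a.png" /></td></tr></root>'
$htmlStyle  = '<root><tr><td><img src="a.png"></td></tr></root>'

Test-XmlCast $wellFormed   # True: every element is closed
Test-XmlCast $htmlStyle    # False: <img> is still open when </td> arrives
```

The second cast fails with the same kind of "start tag does not match the end tag" error as in the question, which is why the functions below read the table through the DOM instead of through an XML cast.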
I don't have the complete HTML you are loading, but maybe try the function below:

function ConvertFrom_HtmlTable {
    # adapted from: https://www.leeholmes.com/blog/2015/01/05/extracting-tables-from-powershells-invoke-webrequest/
    [CmdletBinding(DefaultParameterSetName = 'ByIndex')]
    param(
        [Parameter(Mandatory = $true, Position = 0)]
        [Microsoft.PowerShell.Commands.HtmlWebResponseObject]$WebRequest,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
        [int]$TableIndex = 0,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
        [string]$TableId,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
        [string]$TableName,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
        [string]$TableClassName
    )

    # Extract the table out of the web request
    switch ($PSCmdlet.ParameterSetName) {
        'ById'    { $table = $WebRequest.ParsedHtml.getElementByID($TableId) }
        'ByIndex' { $table = @($WebRequest.ParsedHtml.getElementsByTagName('table'))[$TableIndex]}
        'ByName'  { $table = @($WebRequest.ParsedHtml.getElementsByName($TableName))[0] }
        'ByClass' { $table = @($WebRequest.ParsedHtml.getElementsByClassName($TableClassName))[0] }
    }
    if (!$table) {
        Write-Warning "Could not find the given table."
        return $null
    }

    # load the System.Web assembly to be able to decode HTML entities
    Add-Type -AssemblyName System.Web

    $headers = @()
    # Go through all of the rows in the table
    foreach ($row in $table.Rows) {
        $cells = @($row.Cells)
        # If there is a table header, remember its titles
        if($cells[0].tagName -eq "TH") {
            $i = 0
            $headers = @($cells | ForEach-Object {
                $i++
                # decode HTML entities; Export-Csv escapes embedded quotes itself, so they are not doubled here
                $th = [System.Web.HttpUtility]::HtmlDecode($_.InnerText).Trim()
                # if the table header is empty, create it
                if ([string]::IsNullOrEmpty($th)) { "H$i" } else { $th }
            })
            # proceed with the next row
            continue
        }
        # if we haven't found any table headers, make up names "H1", "H2", etc.
        if(-not $headers) {
            $headers = @(1..($cells.Count + 2) | ForEach-Object { "H$_" })
        }

        # Now go through the cells in the row. For each, try to find the
        # title that represents that column and create a hashtable mapping those
        # titles to content
        $hash = [Ordered]@{}
        for ($i = 0; $i -lt $cells.Count; $i++) {
            # decode HTML entities; Export-Csv escapes embedded quotes itself, so they are not doubled here
            $value = [System.Web.HttpUtility]::HtmlDecode($cells[$i].InnerText).Trim()
            # fall back to a generated name when the row has more cells than headers (e.g. below a colspan header)
            $th = if ($i -lt $headers.Count) { $headers[$i] } else { "H$($i + 1)" }
            $hash[$th] = $value
        }
        # And finally cast that hashtable to a PSCustomObject
        [PSCustomObject]$hash
    }
}

Call it like this:

$request = Invoke-WebRequest $uri
$table = ConvertFrom_HtmlTable -WebRequest $request -TableClassName 'organization-admin__table table'

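Or, if you know it is the first or the nth table in the HTML, use the TableIndex parameter instead, since the table apparently has no id or name.
If that succeeds, you can simply write the result to CSV: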

$table | Export-Csv -Path 'X:\path\to\theTable.csv' -NoTypeInformation
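
As a quick check of what ends up in the file, ConvertTo-Csv produces the same text that Export-Csv writes. The sketch below uses made-up sample rows (not data from the real page) and shows that the cmdlet quotes the fields and escapes embedded double quotes by itself.

```powershell
# Made-up sample rows standing in for the objects the function returns
$sampleRows = @(
    [PSCustomObject]@{ Name = 'user "one"'; 'Email address' = 'user1@site.com' }
    [PSCustomObject]@{ Name = 'user2';      'Email address' = 'user2@site.ca' }
)

# ConvertTo-Csv returns the CSV lines as strings instead of writing a file
$csv = @($sampleRows | ConvertTo-Csv -NoTypeInformation)
$csv
# "Name","Email address"
# "user ""one""","user1@site.com"
# "user2","user2@site.ca"
```

The header line comes from the property names of the first object, so every row should carry the same set of properties.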

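From your comment, it seems that for some reason you cannot use Invoke-WebRequest and have to parse the page with the IE COM object instead.
Please try this version of the function: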

function ConvertFrom_HtmlTable {
    [CmdletBinding(DefaultParameterSetName = 'ByIndex')]
    param(
        [Parameter(ValueFromPipeline = $true, Mandatory = $true, Position = 0)]
        [string]$Url,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
        [int]$TableIndex = 0,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
        [string]$TableId,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
        [string]$TableName,
    
        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
        [string]$TableClassName,

        [switch]$FirstRowHasHeaders
    )

    $ie = New-Object -ComObject 'InternetExplorer.Application'
    $ie.Visible = $false
    $ie.Silent = $true
    $ie.Navigate($Url)
    # wait for IE to fully load the document
    while($ie.Busy) { Start-Sleep -Milliseconds 100 }

    $doc = $ie.Document

    switch ($PSCmdlet.ParameterSetName) {
        'ById'    { $table = $doc.IHTMLDocument3_getElementByID($TableId) }
        'ByIndex' { $table = @($doc.IHTMLDocument3_getElementsByTagName('table'))[$TableIndex]}
        'ByName'  { $table = @($doc.IHTMLDocument3_getElementsByName($TableName))[0] }
        'ByClass' { $table = @($doc.IHTMLDocument3_getElementsByClassName($TableClassName))[0] }
    }

    if ($table) {
        # Extracting table rows as a collection.
        $tbody = $table.childNodes | Where-Object { $_.tagName -eq "tbody" }
        if ($tbody) {
            $rows = $tbody.childNodes | Where-Object { $_.tagName -eq "tr" }
        }
        else {
            $rows = $table.childNodes | Where-Object { $_.tagName -eq "tr" }
        }

        # read or create table headers
        # assume the first row has headers either in <th> or <td> tags
        $firstRow = 1
        $headers = @($rows[0].childNodes | Where-Object { $_.tagName -eq "th" } | Foreach-Object { $_.innerHTML })

        if (!($headers)) {
            # there were no <th> tags found, so either use the first row as headers or create from scratch
            $values = @($rows[0].childNodes | Where-Object { $_.tagName -eq "td" } | Foreach-Object { $_.innerHTML })
            if ($FirstRowHasHeaders) {
                # the headers are considered to be the values from the first row
                $headers = $values
            }
            else {
                # the table has no headers, so dynamically create them
                $firstRow = 0
                $headers =  for ($i = 1; $i -le $values.Count; $i++) { "Column_$i" }
            }
        }

        # create a List object to store the values found as PSObjects
        $result = [System.Collections.Generic.List[object]]::new()
        for ($i = $firstRow; $i -lt $rows.Count; $i++) {
            $values = @($rows[$i].childNodes | Where-Object { $_.tagName -eq "td" } | Foreach-Object { $_.innerHTML })
            $valuesCount = $values.Count
            while ($headers.Count -lt $valuesCount) {
                $colName = "Column_{0}" -f ($headers.Count + 1)
                $headers += $colName
                # we have just added a new header column. Make sure the first item also has this new column
                if ($result.Count) { 
                    $result[0] | Add-Member -MemberType NoteProperty -Name $colName -Value $null
                }
            }
            # create an ordered Hashtable to store the values
            $data = [ordered]@{}
            for ($j = 0; $j -lt $valuesCount; $j++) { $data[$headers[$j]] = $values[$j] }
            # add the hash cast to PsCustomObject to the list
            $result.Add(([PsCustomObject]$data))
        }
    }
    else { Write-Warning "Could not find the given table." }

    # quit IE and clean up
    $ie.Quit()
    $null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($ie)
    [System.GC]::Collect()
    [System.GC]::WaitForPendingFinalizers()

    return $result
}

$table = ConvertFrom_HtmlTable -Url 'your URL here' -TableClassName 'organization-admin__table'
$table | Export-Csv -Path 'X:\path\to\theTable.csv' -NoTypeInformation

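The second function, which uses the InternetExplorer.Application COM object, has to find the table object through the DOM. For that, it currently uses the IHTMLDocument3 interface. This works for me when tested on Windows 10 Pro, PowerShell 5.1 and IE version 11.789.19041.0, for example: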

ConvertFrom_HtmlTable -Url 'https://www.w3schools.com/html/html_tables.asp' -TableId 'customers'

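According to your comment, you receive this error message:
Method invocation failed because [mshtml.HTMLDocumentClass] does not contain a method named 'IHTMLDocument3_getElementsByClassName'.
This means you have a different (not updated or corrupted) version on your machine, and you will have to find out for yourself which method works:
1. First check your IE version by typing this in the PowerShell console: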

(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Internet Explorer').SvcVersion

If that returns nothing, try

(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Internet Explorer').Version

2. Next, in the switch, change the call from

$doc.IHTMLDocument3_getElementsByClassName($TableClassName)

to

$doc.getElementsByClassName($TableClassName)

or

$doc.documentElement.getElementsByClassName($TableClassName)

If all of these fail, I'm afraid there is a serious problem on your computer (maybe that is also the reason Invoke-WebRequest does not work?). Try to repair it with sfc /scannow
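
The trial and error over those three method names can also be scripted. The following is only a sketch: Invoke-FirstWorking is a hypothetical helper, and $doc is replaced by a stand-in object so the example runs without IE. With the real COM document you would pass the same three script blocks.

```powershell
# Hypothetical helper: run each script block in turn and return the output of the first one that does not throw
function Invoke-FirstWorking {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true)]
        [scriptblock[]]$Candidate
    )
    foreach ($block in $Candidate) {
        try {
            return (& $block)
        }
        catch {
            # a missing method raises "Method invocation failed ..."; note it and try the next candidate
            Write-Verbose "Candidate failed: $($_.Exception.Message)"
        }
    }
    throw 'None of the candidate calls succeeded.'
}

# Stand-in for the IE document (assumption for this demo): it only offers the plain method name
$doc = New-Object PSObject
$doc | Add-Member -MemberType ScriptMethod -Name getElementsByClassName -Value { param($name) "found:$name" }
$TableClassName = 'organization-admin__table'

$elements = Invoke-FirstWorking -Candidate @(
    { $doc.IHTMLDocument3_getElementsByClassName($TableClassName) },
    { $doc.getElementsByClassName($TableClassName) },
    { $doc.documentElement.getElementsByClassName($TableClassName) }
)
$elements   # found:organization-admin__table
```

Note that a call which exists but finds no element does not throw, so with a real document the result should still be checked for $null. Running the helper with -Verbose shows which candidates were skipped.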


ctzwtxfj #2

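I wanted to convert a local HTML file containing a table. I also don't have IE installed, so I made the following update to Theo's great answer above, and it worked.
Change: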

$ie = New-Object -ComObject 'InternetExplorer.Application'
$ie.Visible = $false
$ie.Silent = $true
$ie.Navigate($Url)
# wait for IE to fully load the document
while($ie.Busy) { Start-Sleep -Milliseconds 100 }

$doc = $ie.Document

to:

$Source = Get-Content -path $Url -raw
$HTML = New-Object -Com "HTMLFile"
$HTML.IHTMLDocument2_write($Source)

$doc = $HTML.childNodes.ie9_item().document

and remove:

$ie.Quit()
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($ie)
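
Where neither IE nor the HTMLFile COM object is available, a plain-text fallback may be enough for simple tables. The sketch below is a swapped-in technique, not what the answers above use: it pulls rows and cells out of the HTML string with regular expressions, which is fragile and does not handle nested tables. The function name ConvertFrom-SimpleHtmlTable and the sample HTML are made up for illustration. A header cell with colspan="n" is expanded into n column names so that headers and data cells line up.

```powershell
function ConvertFrom-SimpleHtmlTable {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true)]
        [string]$Html
    )

    $opts = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase, Singleline'
    $headers = [System.Collections.Generic.List[string]]::new()

    foreach ($row in [regex]::Matches($Html, '<tr\b[^>]*>(.*?)</tr>', $opts)) {
        # group 1 = th or td, group 2 = the cell's attributes, group 3 = the cell's inner HTML
        $cells = [regex]::Matches($row.Groups[1].Value, '<(t[hd])\b([^>]*)>(.*?)</\1>', $opts)
        if ($cells.Count -eq 0) { continue }

        $isHeaderRow = $cells[0].Groups[1].Value -eq 'th'
        $values = [System.Collections.Generic.List[string]]::new()
        foreach ($cell in $cells) {
            # strip nested tags, decode HTML entities, collapse whitespace
            $text = $cell.Groups[3].Value -replace '<[^>]+>', ' '
            $text = ([System.Net.WebUtility]::HtmlDecode($text) -replace '\s+', ' ').Trim()
            $values.Add($text)
            # a header cell with colspan="n" covers n columns; add numbered names for the extra ones
            if ($isHeaderRow -and $cell.Groups[2].Value -match 'colspan\s*=\s*["'']?(\d+)') {
                for ($n = 2; $n -le [int]$Matches[1]; $n++) { $values.Add("${text}_$n") }
            }
        }

        if ($isHeaderRow) {
            $headers = $values
            continue
        }

        # map each value to its header, or to a generated name when no header exists
        $record = [ordered]@{}
        for ($i = 0; $i -lt $values.Count; $i++) {
            $name = if ($i -lt $headers.Count -and $headers[$i]) { $headers[$i] } else { "Column_$($i + 1)" }
            $record[$name] = $values[$i]
        }
        [PSCustomObject]$record
    }
}

# Made-up sample with the same shape as the table in the question
$sample = @'
<table>
  <thead><tr><th colspan="2">Name</th><th>Email address</th></tr></thead>
  <tbody>
    <tr><td><img src="a.jpg"></td><td><a href="/u/1">user1</a></td><td>user1@site.com</td></tr>
    <tr><td><img src="b.jpg"></td><td><a href="/u/2">user2</a></td><td>user2@site.ca</td></tr>
  </tbody>
</table>
'@

$rows = @(ConvertFrom-SimpleHtmlTable -Html $sample)
$rows | ConvertTo-Csv -NoTypeInformation
```

For a local file, the string can come from Get-Content -Raw. In the sample, the avatar cell ends up as an empty "Name" column and the user name lands in "Name_2", mirroring how the colspan header spans both cells.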
