I'm very new to PowerShell, and I'm using JohnLBevan's code to convert an HTML table to CSV:
function ConvertFrom-HtmlTableRow {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTableRow
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        $headers
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        [switch]$isHeader
    )
    process {
        $cols = $htmlTableRow | select -expandproperty td
        if($isHeader.IsPresent) {
            0..($cols.Count - 1) | %{$x=$cols[$_] | out-string; if(($x) -and ($x.Trim() -gt [string]::Empty)) {$x} else {("Column_{0:0000}" -f $_)}} #clean the headers to ensure each col has a name
        } else {
            $colCount = ($cols | Measure-Object).Count - 1
            $result = new-object -TypeName PSObject
            0..$colCount | %{
                $colName = if($headers[$_]){$headers[$_]}else{("Column_{0:00000}" -f $_)} #in case we have more columns than headers
                $colValue = $cols[$_]
                $result | Add-Member NoteProperty $colName $colValue
            }
            write-output $result
        }
    }
}
function ConvertFrom-HtmlTable {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTable
    )
    process {
        #currently only very basic <table><tr><td>...</td></tr></table> structure supported
        #could be improved to better understand tbody, th, nested tables, etc
        #$htmlTable.childNodes | ?{ $_.tagName -eq 'tr' } | ConvertFrom-HtmlTableRow
        #remove any tags that aren't td or tr (simplifies our parsing of the data)
        [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>{0}</root>" -f ($htmlTable | select -ExpandProperty innerHTML | %{(($_ | out-string) -replace '(</?t[rdh])[^>]*(/?>)|(?:<[^>]*>)','$1$2') -replace '(</?)(?:th)([^>]*/?>)','$1td$2'}))
        [string[]]$headers = $cleanedHtml.root.tr | select -first 1 | ConvertFrom-HtmlTableRow -isHeader
        if ($headers.Count -gt 0) {
            $cleanedHtml.root.tr | select -skip 1 | ConvertFrom-HtmlTableRow -Headers $headers | select $headers
        }
    }
}
But whenever I run it on the "table" elements I get by tag name from my parsedHTML variable, I get this error:
Cannot convert value "<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>
</root>" to type "System.Xml.XmlDocument". Error: "The 'Tr' start tag on line 16 position 124 does not match the end tag of 'td'. Line 20, position 3."
At line:108 char:9
+ [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp ' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (:) [], RuntimeException
+ FullyQualifiedErrorId : InvalidCastToXmlDocument
I hope someone can help me. Thanks in advance.
I'm trying to work with an external website. Here is the HTML of the table:
<table class="organization-admin__table table">
<thead>
<tr>
<th colspan="2">Name</th>
<th>Email address</th>
<th>Timezone</th>
<th>Last logged in</th>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td width="48px">
<a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user1 <user1@site.com>" title="user1 <user1@site.com>">
<img src="https://portal.website.com/avatar/0fd7f51cee04789c617b1cc973e0b245.jpg?s=64&r=g&d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F87d37c%2Ffff%26text%3DTM" alt="user1 <user1@site.com>" width="32" height="32">
</a>
</td>
<td><a href="/site/users/samluser">user1</a></td>
<td><a href="mailto:user1@site.com">user1@site.com</a></td>
<td>Canada/Eastern</td>
<td>05 Aug 2021</td>
<td>
<ul class="button-group">
<li>
<a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
<i class="fa fa-pencil-alt"></i>
Edit
</a>
</li>
<li>
<a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
<i class="fa fa-trash-alt"></i>
Delete
</a>
</li>
</ul>
</td>
</tr>
<tr>
<td width="48px">
<a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user2 <user2@site.ca>" title="user2 <user2@site.ca>">
<img src="https://portal.website.com/avatar/481355c93fa79e47ca56110da63d6da5.jpg?s=64&r=g&d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F044f67%2Ffff%26text%3DVS" alt="user2 <user2@site.ca>" width="32" height="32">
</a>
</td>
<td><a href="/site/users/samluser">user2</a></td>
<td><a href="mailto:user2@site.ca">user2@site.ca</a></td>
<td>Canada/Eastern</td>
<td>16 Jul 2021</td>
<td>
<ul class="button-group">
<li>
<a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
<i class="fa fa-pencil-alt"></i>
Edit
</a>
</li>
<li>
<a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
<i class="fa fa-trash-alt"></i>
Delete
</a>
</li>
</ul>
</td>
</tr>
<tr>
<td width="48px">
<a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user3 <user3@site.com>" title="user3 <user3@site.com>">
<img src="https://portal.website.com/avatar/450f564aaba30e75fe70dc5f4bbefaf6.jpg?s=64&r=g&d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2Fffb61e%2Ffff%26text%3DWP" alt="Wilfred <user3@site.com>" width="32" height="32">
</a>
</td>
<td><a href="/site/users/samluser">Wilfred</a></td>
<td><a href="mailto:user3@site.com">Wilfred@site.com</a></td>
<td>UTC</td>
<td>26 Jul 2021</td>
<td>
<ul class="button-group">
<li>
<a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
<i class="fa fa-pencil-alt"></i>
Edit
</a>
</li>
<li>
<a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
<i class="fa fa-trash-alt"></i>
Delete
</a>
</li>
</ul>
</td>
</tr>
</tbody>
</table>
2 Answers
#1
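As mentioned, conversion to XML follows strict rules. When the HTML neglects to write a closing tag </tr>, loading it as XML fails. The same goes for <img> tags that have no closing </img> tag. I don't have the complete HTML you are loading, but perhaps try the function below: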
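What follows is not the function from this answer, only a minimal sketch of one way to sidestep the XML cast altogether: it pulls rows and cells out of the raw HTML text with regular expressions, so unclosed <img> tags and '>' characters inside attribute values no longer matter. The function name ConvertFrom-HtmlTableText, the -Html parameter, the zero-based -TableIndex, and the way colspan is handled (suffixed header names, empty data cells) are all assumptions made for illustration. Nested tables are not supported.

```powershell
# Hypothetical stand-in: regex-based table extraction, no [xml] cast and no IE.
function ConvertFrom-HtmlTableText {
    [CmdletBinding()]
    param (
        # Raw HTML source containing one or more <table> elements
        [Parameter(Mandatory = $true)]
        [string]$Html,

        # Zero-based index of the table to convert (assumed convention)
        [int]$TableIndex = 0
    )

    # Attribute section of a tag; quoted values may contain '>' (e.g. title="user1 <user1@site.com>")
    $attrs = '(?:"[^"]*"|''[^'']*''|[^''">])*'

    $tables = [regex]::Matches($Html, "(?is)<table\b$attrs>.*?</table>")
    if ($TableIndex -lt 0 -or $TableIndex -ge $tables.Count) {
        throw "Table index $TableIndex not found; the HTML contains $($tables.Count) table(s)."
    }

    $headers = $null
    foreach ($row in [regex]::Matches($tables[$TableIndex].Value, "(?is)<tr\b$attrs>(.*?)</tr>")) {
        $values = New-Object System.Collections.Generic.List[string]
        foreach ($cell in [regex]::Matches($row.Groups[1].Value, "(?is)<t[dh]\b($attrs)>(.*?)</t[dh]>")) {
            # Strip nested tags, decode entities, collapse whitespace
            $text = $cell.Groups[2].Value -replace "<$attrs>", ' '
            $text = [System.Net.WebUtility]::HtmlDecode($text)
            $text = ($text -replace '\s+', ' ').Trim()

            $span = 1
            if ($cell.Groups[1].Value -match 'colspan\s*=\s*["'']?(\d+)') { $span = [int]$Matches[1] }

            $values.Add($text)
            for ($i = 2; $i -le $span; $i++) {
                # Extra columns covered by colspan: suffixed name in the header row, empty value in data rows
                if ($null -eq $headers) { $values.Add("${text}_$i") } else { $values.Add('') }
            }
        }
        if ($values.Count -eq 0) { continue }

        if ($null -eq $headers) {
            # The first row supplies the column names; blank headers get a generated name
            $headers = @(for ($i = 0; $i -lt $values.Count; $i++) {
                if ($values[$i]) { $values[$i] } else { 'Column_{0:0000}' -f $i }
            })
            continue
        }

        $record = [ordered]@{}
        for ($i = 0; $i -lt $values.Count; $i++) {
            $name = if ($i -lt $headers.Count) { $headers[$i] } else { 'Column_{0:0000}' -f $i }
            $record[$name] = $values[$i]
        }
        [PSCustomObject]$record
    }
}

# Example call (the file name is hypothetical):
# $html = Get-Content -Path '.\users.html' -Raw
# ConvertFrom-HtmlTableText -Html $html -TableIndex 0
```

With the table from the question, the header cell with colspan="2" would become two columns, Name and Name_2, so the avatar cell and the user name each get their own property. Regex parsing is fragile compared with a real HTML parser, so treat this as a fallback rather than a general solution.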
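Call it like this:
Or, if you know it is the first or the nth table in the HTML, use the TableIndex parameter, since the table apparently has no id or name.
If that succeeds, you can simply write the result to CSV: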
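A minimal sketch of that export step, assuming the converted rows are held in a variable called $table (a hypothetical name) and that a file in the temp folder is an acceptable destination. The sample object only stands in for real conversion output.

```powershell
# $table stands in for the objects returned by the conversion function (sample data, not real output)
$table = @(
    [PSCustomObject]@{ Name = 'user1'; 'Email address' = 'user1@site.com'; Timezone = 'Canada/Eastern' }
)

# Hypothetical output location; replace with the path you actually want
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'users.csv'

# -NoTypeInformation keeps Windows PowerShell 5.1 from writing a '#TYPE ...' line at the top of the file
$table | Export-Csv -Path $csvPath -NoTypeInformation -Encoding UTF8
```

Export-Csv takes its column list from the first object it receives, so rows with extra properties later in the pipeline would lose those columns.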
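From your comment, it seems that for some reason you cannot use Invoke-WebRequest and have to parse the page with the IE COM object instead. Try the following version of the function: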
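The second function, which uses the InternetExplorer.Application COM object, needs to find the table object through the DOM. For that, the function currently uses the IHTMLDocument3 interface, which works for me when testing on Windows 10 Pro, PowerShell 5.1 and IE version 11.789.19041.0. According to your comment, however, you receive this error message:
Method invocation failed because [mshtml.HTMLDocumentClass] does not contain a method named 'IHTMLDocument3_getElementsByClassName'.
This means you have a different (not updated or corrupted) version on your machine, and you will have to find out for yourself which method works: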
1. First, test your IE version by typing the following in a PowerShell console:
If that returns blank, try
2. Next, in the switch, change from
to
or
If all of that fails, I'm afraid there is a serious problem on your computer (perhaps also the reason why Invoke-WebRequest doesn't work?). Try fixing it with sfc /scannow.
#2
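I wanted to convert a local HTML file that contains a table. I also don't have IE installed, so I made the following changes to Theo's great answer above, and it worked:
Changed:
to
and removed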