Fast line search in a big array - PowerShell

Posted by hrirmatl on 2022-11-10 in Shell


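I have an array (imported from a CSV file) with about 100k rows. The file is about 22 MB on disk, and all I need is to find the row that contains certain data, process it, and load it into MSSQL (I need to sync the CSV data with MSSQL).
The problem is that a single search takes almost 1 second (~TotalMilliseconds: 655,0788)!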

$csv.Where({$_.'Device UUID' -eq 'lalala'})

Is there any way to speed this up?

powershell

Source: https://stackoverflow.com/questions/67034845/fast-line-search-in-big-array-powershell

3 Answers


kgqe7b3p1#

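Load all 100K rows into a hashtable, using the Device UUID property as the key. This makes locating a row much faster than iterating over the whole array with .Where({...}):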

$deviceTable = @{}
Import-Csv .\path\to\device_list.csv |ForEach-Object {
  $deviceTable[$_.'Device UUID'] = $_
}

Now this will take significantly less than 1 second:

$matchingDevice = $deviceTable['lalala']
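To see the difference without touching a real file, here is a minimal sketch that builds the same kind of lookup table from synthetic rows and times both approaches. The row contents (`uuid-N`, `device-N`) and the `Name` column are made up for illustration; only the `Device UUID` column name comes from the question. Two properties of `@{}` are worth keeping in mind: string keys are matched case-insensitively, and if a UUID occurs more than once, the last row with that key wins.

```powershell
# Synthetic stand-in for the imported CSV rows (hypothetical data).
$csv = foreach ($i in 1..100000) {
    [pscustomobject]@{ 'Device UUID' = "uuid-$i"; Name = "device-$i" }
}

# Build the lookup table once: a single pass over all rows.
$deviceTable = @{}
foreach ($row in $csv) {
    $deviceTable[$row.'Device UUID'] = $row
}

# Time a linear scan versus a keyed lookup for the last row.
$scan   = Measure-Command { $csv.Where({ $_.'Device UUID' -eq 'uuid-100000' }) }
$lookup = Measure-Command { $deviceTable['uuid-100000'] }

$match = $deviceTable['uuid-100000']
"scan:   {0:N3} ms" -f $scan.TotalMilliseconds
"lookup: {0:N3} ms" -f $lookup.TotalMilliseconds
```

The one-time cost of filling the table pays off when many lookups follow, as in a sync job that checks each MSSQL row against the CSV.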


jtoj6r0c2#

If you only need one lookup or a few, consider the following alternative to Mathias R. Jessen's helpful answer. Note that, like Mathias' solution, it requires reading all rows into memory at once:


# Load all rows into memory.
$allRows = Import-Csv file.csv

# Get the *index* of the row with the column value of interest.
# Note: This lookup is case-SENSITIVE.
$rowIndex = $allRows.'Device UUID'.IndexOf('lalala')

# Retrieve the row of interest by index, if found.
($rowOfInterest = if ($rowIndex -ne -1) { $allRows[$rowIndex] })

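Once the rows are loaded into memory (as [pscustomobject] instances, which in itself won't be fast), the array lookup via member-access enumeration is reasonably fast, because .NET performs the (linear) array search with the System.Array.IndexOf() method.
The problem with the .Where({ ... }) method is that invoking a PowerShell script block ({ ... }) once per element, many times over, is computationally expensive.
It comes down to the following trade-off: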

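Either: spend more time up front building a data structure ([hashtable]) that allows efficient lookups (Mathias' answer).
Or: read the file more quickly, but spend more time on each lookup (this answer).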
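The index-based lookup can be tried on a tiny in-memory array, as a sketch. The two rows and the `Name` column below are hypothetical; only `Device UUID` is taken from the question. It also shows the case-sensitivity mentioned in the code comment above.

```powershell
# Two hypothetical rows standing in for the output of Import-Csv.
$allRows = @(
    [pscustomobject]@{ 'Device UUID' = 'abc';    Name = 'first'  }
    [pscustomobject]@{ 'Device UUID' = 'lalala'; Name = 'second' }
)

# Member-access enumeration: collects the property value of every row into an array.
$uuids = $allRows.'Device UUID'

# Linear search performed by .NET, with no script block call per element.
$hit  = $uuids.IndexOf('lalala')   # index of the matching row
$miss = $uuids.IndexOf('LALALA')   # -1: the comparison is case-sensitive

# Equivalent call through the static method.
$sameHit = [Array]::IndexOf($uuids, 'lalala')

$rowOfInterest = if ($hit -ne -1) { $allRows[$hit] }
$rowOfInterest
```

If the CSV may contain UUIDs in mixed case, normalizing them once (for example with `.ToLowerInvariant()`) before searching is one way to work around the case-sensitive match.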


axr492tv3#

Playing around with the sqlite shell.

'Device UUID' | set-content file.csv
1..2200kb | % { get-random } | add-content file.csv # 1.44 sec, 25mb 

'.mode csv
.import file.csv file' | sqlite3 file  # 2.92 sec, 81mb

# last row

'select * from file where "device uuid" = 2143292650;' | sqlite3 file

# 'select * from file where "device uuid" > 2143292649 and "device uuid" < 2143292651;' | sqlite3 file

2143292650

(history)[-1] | % { $_.endexecutiontime - $_.startexecutiontime }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 570
Ticks             : 5706795
TotalDays         : 6.60508680555556E-06
TotalHours        : 0.000158522083333333
TotalMinutes      : 0.009511325
TotalSeconds      : 0.5706795
TotalMilliseconds : 570.6795

# 34 ms after this:

# 'create index deviceindex on file("device uuid");' | sqlite3 file

# with ".timer on", it's 1ms, after the table is loaded

