shell 如何使用jq将数组分割成块？

b91juud3 于 2022-11-16 发布在 Shell

关注(0)|答案(5)|浏览(198)

我有一个非常大的JSON文件，其中包含一个数组。是否可以使用jq将该数组拆分为几个固定大小的较小数组？假设我的输入如下：[1,2,3,4,5,6,7,8,9,10]，我想把它分成3个元素长的块。

[1,2,3]
[4,5,6]
[7,8,9]
[10]

实际上，我的输入数组有将近三百万个元素，都是UUID。

shell

来源：https://stackoverflow.com/questions/51412721/how-to-split-an-array-into-chunks-with-jq

5条答案

按热度按时间

j0pj023g1#

有一个（未记录的）内置_nwise满足功能要求：

$ jq -nc '[1,2,3,4,5,6,7,8,9,10] | _nwise(3)'

[1,2,3]
[4,5,6]
[7,8,9]
[10]

还有：

$ jq -nc '_nwise([1,2,3,4,5,6,7,8,9,10];3)' 
[1,2,3]
[4,5,6]
[7,8,9]
[10]

顺便说一下，_nwise既可以用于数组，也可以用于字符串。
（我认为它是没有记录的，因为对一个合适的名称有一些疑问。）

TCO版本

不幸的是，内置版本的定义不小心，对于大型数组的性能不好。下面是一个优化的版本（它应该和非递归版本一样高效）：

def nwise($n):
 def _nwise:
   if length <= $n then . else .[0:$n] , (.[$n:]|_nwise) end;
 _nwise;

对于大小为300万的数组，这是相当高的性能：在旧Mac上为3.91s，最大驻留大小为162746368。
请注意，这个版本（使用尾部调用优化递归）实际上比本页其他地方显示的使用foreach的nwise/2版本更快。

赞(0）回复(0）举报 2022-11-16

z9smfwbn2#

以下window/3的面向流的定义是由Cédric Connes（github：connesc）提出的，它推广了_nwise，并说明了一种“装箱技术”，它避免了使用流结束标记的需要，因此可以在流包含非JSON值nan时使用。
window/3的第一个参数被解释为流。$size是窗口大小，$step指定要跳过的值的数目。例如，

window(1,2,3; 2; 1)

产量：

[1,2]
[2,3]

窗口/3和_nsize/1

def window(values; $size; $step):
  def checkparam(name; value): if (value | isnormal) and value > 0 and (value | floor) == value then . else error("window \(name) must be a positive integer") end;
  checkparam("size"; $size)
| checkparam("step"; $step)
  # We need to detect the end of the loop in order to produce the terminal partial group (if any).
  # For that purpose, we introduce an artificial null sentinel, and wrap the input values into singleton arrays in order to distinguish them.
| foreach ((values | [.]), null) as $item (
    {index: -1, items: [], ready: false};
    (.index + 1) as $index
    # Extract items that must be reused from the previous iteration
    | if (.ready | not) then .items
      elif $step >= $size or $item == null then []
      else .items[-($size - $step):]
      end
    # Append the current item unless it must be skipped
    | if ($index % $step) < $size then . + $item
      else .
      end
    | {$index, items: ., ready: (length == $size or ($item == null and length > 0))};
    if .ready then .items else empty end
  );

def _nwise($n): window(.[]; $n; $n);

来源：

https://gist.github.com/connesc/d6b87cbacae13d4fd58763724049da58

赞(0）回复(0）举报 2022-11-16

67up9zun3#

如果数组太大，内存中容纳不下，那么我将采用@CharlesDuffy建议的策略--也就是说，使用nwise的面向流的版本将数组元素流到jq的第二次调用中，例如：

def nwise(stream; $n):
  foreach (stream, nan) as $x ([];
    if length == $n then [$x] else . + [$x] end;
    if (.[-1] | isnan) and length>1 then .[:-1]
    elif length == $n then .
    else empty
    end);

上述“驱动因素”为：

nwise(inputs; 3)

但请记住使用-n命令行选项。
从任意数组创建流：

$ jq -cn --stream '
    fromstream( inputs | (.[0] |= .[1:])
                | select(. != [[]]) )' huge.json

因此，Shell管道可能如下所示：

$ jq -cn --stream '
    fromstream( inputs | (.[0] |= .[1:])
                | select(. != [[]]) )' huge.json |
  jq -n -f nwise.jq

对于使用nwise/2将300万个项的流分组为3个一组，

/usr/bin/time -lp

对于jq的第二次调用，给出：

user         5.63
sys          0.04
   1261568  maximum resident set size

注意：此定义使用nan作为流结束标记。由于nan不是JSON值，因此处理JSON流时不会出现问题。

赞(0）回复(0）举报 2022-11-16

jum4pzuy4#

下面是黑客攻击，当然--但是 * 内存效率 * 黑客攻击，即使有一个任意长的列表：

jq -c --stream 'select(length==2)|.[1]' <huge.json \
| jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])'

管道的第一部分在输入JSON文件中流动，每个元素发出一行，假设数组由原子值组成（这里[]和{}是原子值）。因为它在流模式下运行，所以不需要在内存中存储整个内容，尽管它是一个文档。
管道的第二部分重复读取多达三个项目，并将它们组装成一个列表。
这样可以避免内存中一次需要三个以上的数据。

赞(0）回复(0）举报 2022-11-16

6uxekuva5#

有一个简单的方法对我很有效：

def chunk(n):
    range(length/n|ceil) as $i | .[n*$i:n*$i+n];

示例用法：

jq -n \
'def chunk(n): range(length/n|ceil) as $i | .[n*$i:n*$i+n];
[range(5)] | chunk(2)'
[
  0,
  1
]
[
  2,
  3
]
[
  4
]

额外好处：它不使用递归，也不依赖于_nwise，所以它也可以使用jaq。

赞(0）回复(0）举报 2022-11-16

我来回答

shell 如何使用jq将数组分割成块？

5条答案

TCO版本

窗口/3和_nsize/1

来源：

相关问题

热门标签

最新问答