json: How do I use 'select' in a jq --stream command?

egmofgnx  posted on 2022-12-15  in Other

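I have a very large JSON document (~100 GB) from which I am trying to extract specific objects that meet a given condition using jq. Because it is so large, I cannot read it into memory and need to use the --stream option.
I know how to run a select to extract what I need when I am not streaming, but I could use some help working out the correct command for the streaming case.
Below is a sample of my document, example.json.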

{
  "reporting_entity_name" : "INSURANCE COMPANY",
  "reporting_entity_type" : "INSURER",
  "last_updated_on" : "2022-12-01",
  "version" : "1.0.0",
  "in_network" : [ {
    "negotiation_arrangement" : "ffs",
    "name" : "ER VISIT",
    "billing_code_type" : "CPT",
    "billing_code_type_version" : "2022",
    "billing_code" : "99285",
    "description" : "HIGHEST LEVEL ER VISIT",
    "negotiated_rates" : [ {
      "provider_groups" : [ {
        "npi" : [ 111111111, 222222222],
        "tin" : {
          "type" : "ein",
          "value" : "99-9999999"
        }
      } ],
      "negotiated_prices" : [ {
        "negotiated_type" : "negotiated",
        "negotiated_rate" : 550.50,
        "expiration_date" : "9999-12-31",
        "service_code" : [ "23" ],
        "billing_class" : "institutional"
      } ]
    } ]
  }
]
}

I am trying to get the in_network objects where billing_code equals 99285.
If I could do this without streaming, this is how I would approach it:

jq '.in_network[] | select(.billing_code == "99285")' example.json

Expected output:

{
  "negotiation_arrangement": "ffs",
  "name": "ER VISIT",
  "billing_code_type": "CPT",
  "billing_code_type_version": "2022",
  "billing_code": "99285",
  "description": "HIGHEST LEVEL ER VISIT",
  "negotiated_rates": [
    {
      "provider_groups": [
        {
          "npi": [
            111111111,
            222222222
          ],
          "tin": {
            "type": "ein",
            "value": "99-9999999"
          }
        }
      ],
      "negotiated_prices": [
        {
          "negotiated_type": "negotiated",
          "negotiated_rate": 550.5,
          "expiration_date": "9999-12-31",
          "service_code": [
            "23"
          ],
          "billing_class": "institutional"
        }
      ]
    }
  ]
}

Any help on how to set this up with the --stream option would be greatly appreciated!

dxxyhpgq  #1

If each object in the .in_network array fits into memory on its own, truncate the stream at the array items (two levels deep):

jq --stream -n '
  fromstream(2|truncate_stream(inputs | select(.[0][0] == "in_network")))
  | select(.billing_code == "99285")
' example.json

Output:

{
  "negotiation_arrangement": "ffs",
  "name": "ER VISIT",
  "billing_code_type": "CPT",
  "billing_code_type_version": "2022",
  "billing_code": "99285",
  "description": "HIGHEST LEVEL ER VISIT",
  "negotiated_rates": [
    {
      "provider_groups": [
        {
          "npi": [
            111111111,
            222222222
          ],
          "tin": {
            "type": "ein",
            "value": "99-9999999"
          }
        }
      ],
      "negotiated_prices": [
        {
          "negotiated_type": "negotiated",
          "negotiated_rate": 550.5,
          "expiration_date": "9999-12-31",
          "service_code": [
            "23"
          ],
          "billing_class": "institutional"
        }
      ]
    }
  ]
}
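
To see why the truncation depth is 2, it can help to look at the raw events that --stream emits. The following is a minimal sketch on a hypothetical two-item sample (the file name mini.json and its contents are assumptions made for illustration, not part of the question's data). Every event under in_network carries a path that starts with "in_network" and an array index, so dropping those two leading components leaves events relative to each array item, which fromstream can then reassemble one item at a time.

```shell
# Hypothetical sample file; mini.json and its contents are illustrative assumptions.
tmp=$(mktemp -d)
cat > "$tmp/mini.json" <<'EOF'
{"version":"1.0.0","in_network":[{"billing_code":"99285"},{"billing_code":"12345"}]}
EOF

# Raw stream events whose path starts with "in_network".
# Leaf events look like [["in_network",0,"billing_code"],"99285"];
# closing events carry only a path, e.g. [["in_network",0,"billing_code"]].
jq -c --stream 'select(.[0][0] == "in_network")' "$tmp/mini.json"

# Strip the first two path components ("in_network" and the index) and
# rebuild one object per array item.
jq -c --stream -n '
  fromstream(2 | truncate_stream(inputs | select(.[0][0] == "in_network")))
' "$tmp/mini.json"
# {"billing_code":"99285"}
# {"billing_code":"12345"}
```

Only one array item is assembled in memory at a time, which is the reason this approach depends on each item being small enough to fit.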
nle07wnf  #2

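You will find that jq --stream is very slow, even for 10 GB of data. Since jq is intended to complement other shell tools, I would recommend using jstream (https://github.com/bcicen/jstream) or my own jm or jm.py (https://github.com/pkoppstein/jm) to "splat" the array, and piping the result to jq.
For example, to achieve the same effect as your jq filter: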

jm --pointer /in_network example.json |
  jq 'select(.billing_code == "99285")'
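
The downstream half of such a pipeline can be tried without any splatting tool installed. In the sketch below, printf stands in for the splatter's output (an assumption: one JSON object per line, with made-up sample objects), and the billing code is passed in with --arg rather than hard-coded. Because jq handles each input object independently, its memory use is bounded by the largest single item rather than by the whole file.

```shell
# printf is a stand-in for the splatting stage; the two objects are illustrative.
printf '%s\n' \
  '{"billing_code":"99285","name":"ER VISIT"}' \
  '{"billing_code":"12345","name":"OTHER"}' |
  jq -c --arg code 99285 'select(.billing_code == $code)'
# {"billing_code":"99285","name":"ER VISIT"}
```

Note that --arg always binds a string, which matches the sample data, where billing_code is stored as a string.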
