Splitting a large JSON file with jq and awk

btxsgosb, posted 2022-11-26 in Other

I have a large file,

Metadata_01.json

which consists of blocks with the following structure:

[
 {
  "Participant_id": "P04_00001",
  "no_of_people": "Multiple",
  "apparent_gender": "F",
  "geographic_location": "AUS",
  "ethnicity": "Caucasian",
  "capture_device_used": "iOS 14",
  "camera_orientation": "Portrait",
  "camera_position": "Side View",
  "indoor_outdoor_env": "Indoors",
  "lighting_condition": "Bright",
  "Occluded": 1,
  "category": "Two Person",
  "camera_movement": "Still",
  "action": "No action",
  "indoor_outdoor_in_moving_car_or_train": "Indoor",
  "daytime_nighttime": "Nighttime"
 },
 {
  "Participant_id": "P04_00002",
  "no_of_people": "Single",
  "apparent_gender": "M",
  "geographic_location": "AUS",
  "ethnicity": "Caucasian",
  "capture_device_used": "iOS 14",
  "camera_orientation": "Portrait",
  "camera_position": "Frontal View",
  "indoor_outdoor_env": "Outdoors",
  "lighting_condition": "Bright",
  "Occluded": "None",
  "category": "Animals",
  "camera_movement": "Still",
  "action": "Small action",
  "indoor_outdoor_in_moving_car_or_train": "Outdoor",
  "daytime_nighttime": "Daytime"
 },

and so on, for thousands of entries.
I am using the following command:

jq -cr '.[]' Metadata_01.json | awk '{print > (NR ".json")}'

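It does what is expected: from the large file structured as above, I get a large number of files named by record number (1.json, 2.json, and so on), each holding one object on a single line.
Instead of these results, I need each JSON file to be named after its "Participant_id" (for example, P04_00002.json). I also want to preserve the JSON structure, so that each file looks like this: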

{
  "Participant_id": "P04_00002",
  "no_of_people": "Single",
  "apparent_gender": "M",
  "geographic_location": "AUS",
  "ethnicity": "Caucasian",
  "capture_device_used": "iOS 14",
  "camera_orientation": "Portrait",
  "camera_position": "Frontal View",
  "indoor_outdoor_env": "Outdoors",
  "lighting_condition": "Bright",
  "Occluded": "None",
  "category": "Animals",
  "camera_movement": "Still",
  "action": "Small action",
  "indoor_outdoor_in_moving_car_or_train": "Outdoor",
  "daytime_nighttime": "Daytime"
 }

What adjustments should I make to the command above to achieve this? Or is there a simpler way to do it? Thanks!

kulphzqa  #1

What adjustments should I make...?
I would suggest:

jq -cr '.[] | (.Participant_id, .)' Metadata_01.json | awk '
  NR%2==1 {fn="id." $0 ".json"; next} {print >> fn; close(fn); }
'

Then run something like jq . "$FILE" | sponge "$FILE" on each file to pretty-print it (sponge is part of moreutils).
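If sponge is not installed, a temporary file does the same job. The following is a minimal sketch and not part of the original answer: it assumes the split files are named id.<Participant_id>.json, as produced by the awk step above, and it creates two small demo files in a scratch directory (workdir is a hypothetical name) so that it can be run as-is.

```shell
# Sketch: pretty-print every split file in place without sponge.
set -eu

workdir=$(mktemp -d)   # demo directory; in practice, cd to where the split files live
cd "$workdir"

# Demo input: one compact object per file, as the awk step writes them.
printf '%s\n' '{"Participant_id":"P04_00001","Occluded":1}' > id.P04_00001.json
printf '%s\n' '{"Participant_id":"P04_00002","Occluded":"None"}' > id.P04_00002.json

for f in id.*.json; do
  tmp="$f.tmp"
  # Write to a temporary file first: redirecting straight back to "$f"
  # would truncate the file before jq gets to read it.
  jq . "$f" > "$tmp" && mv "$tmp" "$f"
done
```

The mv only runs when jq succeeds, so a file that fails to parse is left untouched.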
Alternatively, if you can deal with any problems that may arise from escaping quotes, you could have awk call jq:

jq -cr '.[] | (.Participant_id, .)' Metadata_01.json | awk -v q=$'\'' '
  NR%2==1 {fn = "id." $0 ".json"; next}
  {  system( ("jq . <<< " q $0 q " >> \"" fn "\"") );
     close(fn);
  }
'
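One way to avoid embedding the JSON text in a shell command at all is to let awk pipe each object into jq. This is only a sketch of that variation, not the original author's code. It assumes every Participant_id is safe to place inside a double-quoted shell word (no quotes, $ or backticks), which holds for ids such as P04_00001, and it writes a two-record demo Metadata_01.json into a scratch directory (workdir is a hypothetical name) so that it is runnable as-is.

```shell
# Sketch: awk pipes each object to jq, so the JSON never appears on a command line.
set -eu

workdir=$(mktemp -d)   # demo directory; in practice, use the real Metadata_01.json
cd "$workdir"
cat > Metadata_01.json <<'EOF'
[
 {"Participant_id": "P04_00001", "no_of_people": "Multiple"},
 {"Participant_id": "P04_00002", "no_of_people": "Single"}
]
EOF

jq -cr '.[] | (.Participant_id, .)' Metadata_01.json | awk '
  # Odd lines carry the id: build the jq command that writes the target file.
  NR % 2 == 1 { cmd = "jq . > \"id." $0 ".json\""; next }
  # Even lines carry the object: send it to jq on stdin, then close the pipe
  # so that only one jq process is open at a time.
  { print | cmd; close(cmd) }
'
```

Because the object travels over a pipe, single quotes inside the data cause no trouble, and the command does not rely on the shell started by awk supporting here-strings. Dropping the "id." prefix from cmd gives file names such as P04_00002.json.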

"Big data"

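Of course, if the input file is too large or too slow for jq empty, you will need to consider alternatives, such as jq's --stream option, jstream, or my own jm. For example, if you want the JSON in each file to be pretty-printed: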

while read -r json
do
   fn=$(jq -r .Participant_id <<< "$json")
   <<< "$json" jq . > "id.$fn.json"
done < <(jm Metadata_01.json)
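
The jq --stream route mentioned above can be sketched in the same way. This is a commonly used idiom for emitting the elements of a top-level array one at a time, not code from the original answer, and whether it is fast enough for a particular file is untested here. The sketch writes a two-record demo Metadata_01.json into a scratch directory (workdir is a hypothetical name) so that it can be run as-is, and it avoids here-strings so that it also runs under a plain POSIX sh.

```shell
# Sketch: stream the top-level array with jq instead of loading it whole.
set -eu

workdir=$(mktemp -d)   # demo directory; in practice, use the real Metadata_01.json
cd "$workdir"
cat > Metadata_01.json <<'EOF'
[
 {"Participant_id": "P04_00001", "no_of_people": "Multiple"},
 {"Participant_id": "P04_00002", "no_of_people": "Single"}
]
EOF

# fromstream(1 | truncate_stream(inputs)) rebuilds each array element from the
# event stream and prints it as one compact line (-c); -n leaves all input to `inputs`.
jq -cn --stream 'fromstream(1 | truncate_stream(inputs))' Metadata_01.json |
while IFS= read -r json; do
  fn=$(printf '%s\n' "$json" | jq -r .Participant_id)   # file name from the id
  printf '%s\n' "$json" | jq . > "id.$fn.json"          # pretty-printed copy
done
```

This starts two jq processes per object, which is simple but slow for very many records.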

pexxcrt2  #2

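I would suggest PowerShell, since working with objects is easier overall. Fortunately, PowerShell has a ConvertFrom-Json cmdlet, which converts the text it receives into PS objects so that properties can be referenced with dot notation (.Participant_id). After that, simply convert each object back to JSON on every iteration and export it. Here I use New-Item to create the files with the output, but you could also pipe to Out-File.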

$json = Get-Content -Path '.\Metadata_01.json' -Raw | ConvertFrom-Json 
foreach ($json_object in $json)
{
    New-Item -Path ".\Desktop\" -Name "$($json_object.Participant_id).json" -Value (ConvertTo-Json -InputObject $json_object) -ItemType 'File' -Force
}

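The problem I can see you running into is *possibly* running out of memory because of the size of the file, since in this example everything is first saved into a variable. There are ways around that, but this is just for demonstration purposes.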
