sql—如何通过snowsql从存储在snowflake中的复杂json中提取数据?

v1uwarro  于 2021-08-09  发布在  Java
关注(0)|答案(1)|浏览(452)

我在snowflake的一个变量列表中存储了数百万个json。它们的格式如下,尽管每个json的行数不同。
有人能给我一些关于如何将数据提取到一个平面表格中的指导吗?我对处理json文件还比较陌生,而且行数不一致,而且缺少定义对象名的指示符,这让我很困惑。
下面是一个json示例:

{
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit": 0.2714572,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP": 0,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP": 0,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL": 8.732743,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL": 16.13105,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip": 1.3,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor": 4.167005,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor unit": "",
  "DeviceId": "streamingdevice",
  "EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
  "EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
  "IoTHub": {
    "ConnectionDeviceGenerationId": "637199801617320690",
    "ConnectionDeviceId": "streamingdevice",
    "CorrelationId": null,
    "EnqueuedTime": "2020-05-04T22:12:21.0000000",
    "MessageId": null,
    "StreamId": null
  },
  "PartitionId": 1,
  "Timestamp": "2019-10-30 13:48:05.000000"
}

“edge 93 belgium 43-23-19 1932”是对象名称;每个json只针对一个对象。
“time_1_avg.ab2 weight on bit”是读取类型,基本上由tag1.tag2组成。
行的最后一部分是读取值。
json底部的时间戳是读取时间。
本节不是必需的:

"DeviceId": "streamingdevice",
  "EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
  "EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
  "IoTHub": {
    "ConnectionDeviceGenerationId": "637199801617320690",
    "ConnectionDeviceId": "streamingdevice",
    "CorrelationId": null,
    "EnqueuedTime": "2020-05-04T22:12:21.0000000",
    "MessageId": null,
    "StreamId": null
  },
  "PartitionId": 1,

此数据的理想输出为:

但只要得到这样的东西就很有帮助了:

谢谢你的帮助!

6l7fqoea

6l7fqoea1#

假设所需的键始终有3个周期分隔的组件,则以下是解决方案的一种形式:
使用 FLATTEN 表函数取任意 VARIANT 从表中键入列(示例中为1行常量),并将其分解为多行
依赖于生成的 THIS 列(从 FLATTEN 表)发出行常量值( Timestamp )对于每个分解的行
使用 NOT IN 筛选以排除不需要的键名
使用 SPLIT 具有索引的函数将提取的键划分为多列

SELECT
  SPLIT(KEY, '.')[0] AS "Object Name"
, SPLIT(KEY, '.')[1] AS "Tag 1"
, SPLIT(KEY, '.')[2] AS "Tag 2"
, VALUE AS "Value"
, THIS:Timestamp::TIMESTAMP AS "Timestamp"
FROM TABLE(FLATTEN(PARSE_JSON('
{
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit": 0.2714572,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP": 0,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP": 0,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL": 8.732743,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL": 16.13105,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip": 1.3,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor": 4.167005,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor unit": "",
  "DeviceId": "streamingdevice",
  "EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
  "EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
  "IoTHub": {
    "ConnectionDeviceGenerationId": "637199801617320690",
    "ConnectionDeviceId": "streamingdevice",
    "CorrelationId": null,
    "EnqueuedTime": "2020-05-04T22:12:21.0000000",
    "MessageId": null,
    "StreamId": null
  },
  "PartitionId": 1,
  "Timestamp": "2019-10-30 13:48:05.000000"
}
')))
WHERE
  KEY NOT IN ('DeviceId', 'IoTHub', 'PartitionId', 'Timestamp', 'EventEnqueuedUtcTime', 'EventProcessedUtcTime');

这将产生一个类似于第一个屏幕截图的结果:

相关问题