Spark SQL:JSON路径表达式的另一列的数组索引值给出错误:PARSE_SYNTAX_ERROR

wxclj1h5  于 2023-10-23  发布在  Apache
关注(0)|答案(1)|浏览(158)

Databricks运行时:13.2 ML(包括Apache Spark 3.4.0,Scala 2.12)
我试图使用SQL在Databricks中从JSON数组访问单个对象。然而,要选择的对象的数组索引是另一列的一部分。SQL适用于静态数组值,但不适用于另一列的整数值。例如:

-- Here it is a:[1]
SELECT a:[1] FROM VALUES('[{
        "name": "Let Us C",
        "author": "Yashavant Kanetkar"
    },
    {
        "name": "Rich Dad Poor Dad",
        "author": "Robert Kiyosaki "
    },
    {
        "name": "Introduction to Algorithms",
        "author": "Cormen"
    }
]', 1) AS T(a,b);

| 输出|
| --|
| {“name”:“Rich Dad Poor Dad”,“author”:“Robert Kiyosaki“}|
当索引是动态值时,我会出错

-- Here it is a:[b]
SELECT a:[b] FROM VALUES('[{
        "name": "Let Us C",
        "author": "Yashavant Kanetkar"
    },
    {
        "name": "Rich Dad Poor Dad",
        "author": "Robert Kiyosaki "
    },
    {
        "name": "Introduction to Algorithms",
        "author": "Cormen"
    }
]', 1) AS T(a,b);

上述SQL错误

Error in SQL statement: ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near 'b'.(line 1, pos 10)
...
... SQL 
...
    at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:337)
    at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:168)
    at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:109)
    at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:110)
    at com.databricks.sql.parser.DatabricksSqlParser.$anonfun$parsePlan$1(DatabricksSqlParser.scala:77)
...
...
...
    at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:649)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:492)
    at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:411)
    at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:259)
    at java.lang.Thread.run(Thread.java:750)

请让我知道,如果提供的信息是不够的。任何解决方案或变通办法将不胜感激。谢谢!

最终目标:使用SQL将JSON数组值放在不同的行中,而不指定详细的schema。即分解JSON数组。数组中对象的schema非常复杂和动态,因此无法传递schema。我只需要不同行中的数组值。

mmvthczy

mmvthczy1#

下面的解决方案对我来说很有效,因为我真的在用数组的索引查找所有的值。

with json_array ( 
select '[{
        "name": "Let Us C",
        "author": "Yashavant Kanetkar"
    },
    {
        "name": "Rich Dad Poor Dad",
        "author": "Robert Kiyosaki "
    },
    {
        "name": "Introduction to Algorithms",
        "author": "Cormen"
    }
]' as  a, 1 as b ),
array_idx as( select posexplode(from_json(a,"Array<String>") ) as (idx, idx_json) , b  from json_array )
select * from array_idx
where idx = b
 ;

输出:

SQL函数from_json通过Array的部分模式传递,这就完成了任务。

相关问题