Why is the .simpleString() method of a Spark schema truncating my output?

Posted by voj3qocg on 2023-06-30 in Apache


I have a very long schema that I want to return as a string:

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
...
SparkSession spark = SparkSession.builder().config(new SparkConf().setAppName("YourApp").setMaster("local")).getOrCreate();
Dataset<Row> parquetData = spark.read().parquet("/Users/demo/test.parquet");
String schemaString = parquetData.schema().simpleString();

The problem is that the resulting schema looks like this (note the "10 more fields"):

struct<test:struct<countryConfidence:struct<value:double>,... 10 more fields> etc etc>

I am using:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.2.4</version>
</dependency>

Is there a configuration option that stops .simpleString() from truncating? I have tried parquetData.schema().toDDL(), but it does not print the format I need.

apache-spark

Source: https://stackoverflow.com/questions/76575614/why-is-simplestring-method-of-spark-schema-truncating-my-output

1 Answer


u3r8eeie1#

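If you dig into the simpleString method, you can see that Spark calls truncatedString and passes SQLConf.get.maxToStringFields as the limit on the number of fields (the maxFields argument).
This configuration is defined as follows: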

val MAX_TO_STRING_FIELDS = buildConf("spark.sql.debug.maxToStringFields")
  .doc("Maximum number of fields of sequence-like entries can be converted to strings " +
    "in debug output. Any elements beyond the limit will be dropped and replaced by a" +
    """ "... N more fields" placeholder.""")
  .version("3.0.0")
  .intConf
  .createWithDefault(25)
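
To make the placeholder in that doc string concrete, here is a minimal plain-Java sketch of this kind of truncation. It is not Spark's code: the class TruncationSketch and the method truncated are hypothetical names, and the rule for how many leading fields survive (one fewer than the limit) is an assumption loosely modeled on truncatedString, so the exact count Spark keeps may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not part of the Spark API.
class TruncationSketch {

    // Joins field descriptions into a struct<...> string. If there are more
    // fields than maxFields, keeps the leading ones and replaces the rest
    // with a "... N more fields" placeholder.
    public static String truncated(List<String> fields, int maxFields) {
        if (fields.size() <= maxFields) {
            return "struct<" + String.join(",", fields) + ">";
        }
        // Assumption: keep one fewer than the limit, never a negative count.
        int kept = Math.max(0, maxFields - 1);
        int dropped = fields.size() - kept;
        String head = String.join(",", fields.subList(0, kept));
        return "struct<" + head + ",... " + dropped + " more fields>";
    }

    public static void main(String[] args) {
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < 30; i++) {
            fields.add("f" + i + ":double");
        }
        // 30 fields exceed a limit of 25, so the tail is replaced by a placeholder.
        System.out.println(truncated(fields, 25));
        // A limit of 50 covers all 30 fields, so nothing is dropped.
        System.out.println(truncated(fields, 50));
    }
}
```

The point of the sketch is that the output is complete only when the limit is at least as large as the widest struct in the schema, which is why raising the setting removes the placeholder.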

Solution
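Set spark.sql.debug.maxToStringFields to a number greater than 25, for example 50 (an arbitrary value; pick one based on your use case, large enough to cover the widest struct in your schema):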

SparkSession spark = SparkSession.builder()
  .appName("Spark app name")
  .master("local[*]")
  .config("spark.sql.debug.maxToStringFields", 50)
  .getOrCreate();

Good luck!


Answered 2023-06-30
