Java: dataset loaded from a CSV file with a schema declaring fields non-nullable, yet printSchema() reports the fields as nullable

i34xakig  posted 2021-05-27 in Spark

I load a CSV file with Apache Spark:

Dataset<Row> csv = session.read().schema(schema()).format("csv")
    .option("header", "true").option("delimiter", ";")
    .load("myFile.csv").selectExpr("*");

For this I supply a schema:

public StructType schema() {
   StructType schema = new StructType();

   schema = schema.add("CODGEO", StringType, false)
     .add("P16_POP1564", DoubleType, false)
     .add("P16_POP1524", DoubleType, false)
     .add("P16_POP2554", DoubleType, false)
     .add("P16_POP5564", DoubleType, false)
     .add("P16_H1564", DoubleType, false)
      ....
   return schema;
}

The dataset loads, and printSchema() then shows this on the console:

root
 |-- CODGEO: string (nullable = true)
 |-- P16_POP1564: double (nullable = true)
 |-- P16_POP1524: double (nullable = true)
 |-- P16_POP2554: double (nullable = true)
 |-- P16_POP5564: double (nullable = true)
 |-- P16_H1564: double (nullable = true)
 ...

But every field comes back marked nullable = true, even though I explicitly declared each of them non-nullable.
What is going wrong?


8wigbo56 (answer 1)

It works as expected for me.
By default, when reading CSV, Spark treats an empty string ("") as null.
(Note that the tests below read the CSV lines from an in-memory Dataset<String>; when loading from an actual file, Spark may force a user-supplied schema to nullable = true, which would explain the printSchema() output in the question.)
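The empty-string-to-null mapping comes from the CSV reader's `nullValue` option, which defaults to the empty string. The helper below is only an illustration of that default, not Spark's actual code:

```java
public class NullValueSketch {
    // Illustrative only: map a parsed CSV cell to null when it equals the
    // configured nullValue, mimicking Spark CSV's default nullValue of "".
    static String applyNullValue(String cell, String nullValue) {
        return nullValue.equals(cell) ? null : cell;
    }

    public static void main(String[] args) {
        System.out.println(applyNullValue("E", ""));   // prints E
        System.out.println(applyNullValue("", ""));    // prints null
    }
}
```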

Test 1: dataset containing a null, schema with nullable = false

String data = "id  Col_1 Col_2 Col_3 Col_4 Col_5\n" +
        "1    A     B      C     D     E\n" +
        "2    X     Y      Z     P     \"\"";

// The string literal uses "\n", so split on "\n" (System.lineSeparator()
// would fail to split on Windows, where it is "\r\n")
List<String> list = Arrays.stream(data.split("\n"))
        .map(s -> s.replaceAll("\\s+", ","))
        .collect(Collectors.toList());
List<StructField> fields = Arrays.stream("id  Col_1 Col_2 Col_3 Col_4 Col_5".split("\\s+"))
        .map(s -> new StructField(s, DataTypes.StringType, false, Metadata.empty()))
        .collect(Collectors.toList());
Dataset<Row> df1 = spark.read()
        .schema(new StructType(fields.toArray(new StructField[fields.size()])))
        .option("header", true)
        .option("sep", ",")
        .csv(spark.createDataset(list, Encoders.STRING()));
df1.show();
df1.printSchema();

Output:

java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow_0_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3387)
    at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3384)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

Conclusion: expected behavior, test passed (a null value in a nullable = false field fails).
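The whitespace-to-comma conversion used in these tests can be checked in isolation with plain Java (no Spark required); this is just the test's own transformation extracted into a standalone sketch:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class WhitespaceToCsv {
    public static void main(String[] args) {
        String data = "id  Col_1 Col_2 Col_3 Col_4 Col_5\n" +
                "1    A     B      C     D     E\n" +
                "2    X     Y      Z     P     \"\"";

        // Collapse each run of whitespace into a single comma,
        // turning the aligned table into CSV lines.
        List<String> lines = Arrays.stream(data.split("\n"))
                .map(s -> s.replaceAll("\\s+", ","))
                .collect(Collectors.toList());

        lines.forEach(System.out::println);
        // id,Col_1,Col_2,Col_3,Col_4,Col_5
        // 1,A,B,C,D,E
        // 2,X,Y,Z,P,""
    }
}
```

The last cell of the third line becomes a quoted empty field (`""`), which the CSV reader unquotes to an empty string and, per the default `nullValue`, turns into null.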

Test 2: dataset containing a null, schema with nullable = true

String data = "id  Col_1 Col_2 Col_3 Col_4 Col_5\n" +
        "1    A     B      C     D     E\n" +
        "2    X     Y      Z     P     \"\"";

List<String> list = Arrays.stream(data.split("\n"))
        .map(s -> s.replaceAll("\\s+", ","))
        .collect(Collectors.toList());
List<StructField> fields = Arrays.stream("id  Col_1 Col_2 Col_3 Col_4 Col_5".split("\\s+"))
        .map(s -> new StructField(s, DataTypes.StringType, true, Metadata.empty()))
        .collect(Collectors.toList());
Dataset<Row> df1 = spark.read()
        .schema(new StructType(fields.toArray(new StructField[fields.size()])))
        .option("header", true)
        .option("sep", ",")
        .csv(spark.createDataset(list, Encoders.STRING()));
df1.show();
df1.printSchema();

Output:

+---+-----+-----+-----+-----+-----+
| id|Col_1|Col_2|Col_3|Col_4|Col_5|
+---+-----+-----+-----+-----+-----+
|  1|    A|    B|    C|    D|    E|
|  2|    X|    Y|    Z|    P| null|
+---+-----+-----+-----+-----+-----+

root
 |-- id: string (nullable = true)
 |-- Col_1: string (nullable = true)
 |-- Col_2: string (nullable = true)
 |-- Col_3: string (nullable = true)
 |-- Col_4: string (nullable = true)
 |-- Col_5: string (nullable = true)

Conclusion: expected behavior, test passed.

Test 3: dataset without nulls, schema with nullable = true

String data1 = "id  Col_1 Col_2 Col_3 Col_4 Col_5\n" +
        "1    A     B      C     D     E\n" +
        "2    X     Y      Z     P     F";

List<String> list1 = Arrays.stream(data1.split("\n"))
        .map(s -> s.replaceAll("\\s+", ","))
        .collect(Collectors.toList());
List<StructField> fields1 = Arrays.stream("id  Col_1 Col_2 Col_3 Col_4 Col_5".split("\\s+"))
        .map(s -> new StructField(s, DataTypes.StringType, true, Metadata.empty()))
        .collect(Collectors.toList());
Dataset<Row> df2 = spark.read()
        .schema(new StructType(fields1.toArray(new StructField[fields1.size()])))
        .option("header", true)
        .option("sep", ",")
        .csv(spark.createDataset(list1, Encoders.STRING()));
df2.show();
df2.printSchema();

Output:

+---+-----+-----+-----+-----+-----+
| id|Col_1|Col_2|Col_3|Col_4|Col_5|
+---+-----+-----+-----+-----+-----+
|  1|    A|    B|    C|    D|    E|
|  2|    X|    Y|    Z|    P|    F|
+---+-----+-----+-----+-----+-----+

root
 |-- id: string (nullable = true)
 |-- Col_1: string (nullable = true)
 |-- Col_2: string (nullable = true)
 |-- Col_3: string (nullable = true)
 |-- Col_4: string (nullable = true)
 |-- Col_5: string (nullable = true)

Conclusion: expected behavior, test passed.
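Test 1's NullPointerException comes from Spark's generated row projection, which checks non-nullable fields when it materializes a row. The analogy below is plain Java, not Spark code; the class and method names are made up for illustration:

```java
import java.util.Objects;

public class NonNullableField {
    // Hypothetical helper: enforce a "nullable = false" contract at access
    // time, the way Spark's generated projection does in test 1.
    static String requireField(String value, String name) {
        return Objects.requireNonNull(
                value, "field " + name + " is declared non-nullable but the row holds null");
    }

    public static void main(String[] args) {
        System.out.println(requireField("F", "Col_5")); // a real value passes through
        requireField(null, "Col_5");                    // throws NullPointerException
    }
}
```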
