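java - Partitioning and storing a dataset with a string column whose values look like numbers: when read back, the data are still "strings" but have lost their leading zeros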

Posted by dfddblmv on 2021-07-09 in Spark

With Spark 3.0.2, I am writing a Dataset to a Parquet file. My code ends like this:

etablissements = etablissements.repartition(col("codeDepartement"));
etablissements = etablissements.sortWithinPartitions(col("siret"));
etablissements = etablissements.persist();

// Write it to a file whose name includes the year of the data, the selections, and the sorting.
// Underlying statement writing the parquet file is :
// ds.write().partitionBy(colonnesPartionnement /* = codeDepartement */)
saveToStore(etablissements, new String[] {"codeDepartement"}, 
   "{0}_{1,number,#0}_{2}_{3}", "etablissements", anneeSIRENE,  actifsSeulement, 
   communesValides);

The codeDepartement column has type StringType, because French department codes are strings of up to three characters.

schema():

|-- codeDepartement: string (nullable = true)

It is visible in the last third of the show() output (three columns before the city name in upper case) and has the value "01":

+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|siren    |nic  |siret         |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse         |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie              |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex         |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1          |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|codeDepartement|arrondissement|typeNomEtCharniere|nomMajuscules           |nomCommune              |libelle                 |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI                           |libelleNAF                                                                                   |
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|015850944|00024|01585094400024|O                           |2007-04-01               |11                    |2017                       |null                             |2019-11-14T14:00:12  |false             |2                          |ZONE INDUSTRIELLE         |null      |null            |CHE       |DE THIL                  |01700     |null               |null                |01376      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2008-01-01            |A                             |null               |null     |null     |null                     |25.73B            |NAFRev2                       |O                              |true  |2017                        |true              |false|2             |COM        |84        |01             |012           |0                 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113      |null              |5            |210103768   |4006            |3967                |39                   |210103768         |240100800|CC de Miribel et du Plateau       |Fabrication d'autres outillages                                                              |
|015851793|00479|01585179300479|O                           |2005-01-01               |11                    |2017                       |null                             |2019-06-24T13:04:28  |false             |2                          |null                      |null      |null            |null      |ZONE INDUST LA FONTAINE  |01290     |null               |null                |01134      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2008-01-01            |A                             |null               |null     |null     |null                     |46.73A            |NAFRev2                       |O                              |true  |2017                        |true              |false|2             |COM        |84        |01             |012           |0                 |CROTTET                 |Crottet                 |Crottet                 |0123      |null              |3            |210101341   |1777            |1734                |43                   |210101341         |200070555|CC de la Veyle                    |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |
|015851793|00743|01585179300743|O                           |2012-09-01               |02                    |2017                       |null                             |2019-06-24T13:04:28  |false             |1                          |ZA ACTIPARC               |null      |null            |null      |PRE LION                 |01190     |null               |null                |01057      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2012-09-01            |A                             |null               |null     |null     |DORAS                    |46.73A            |NAFRev2                       |O                              |true  |2017                        |true              |false|1             |COM        |84        |01             |012           |0                 |BOZ                     |Boz                     |Boz                     |0117      |null              |3            |210100574   |519             |512                 |7                    |210100574         |200071371|CC Bresse et Saône                |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |
|015851793|00917|01585179300917|O                           |2020-01-01               |null                  |null                       |null                             |2020-01-31T16:13:25  |false             |1                          |null                      |28        |null            |AV        |DE MARBOZ                |01000     |null               |null                |01053      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2020-01-01            |A                             |CLEAU              |null     |null     |null                     |46.73A            |NAFRev2                       |O                              |true  |null                        |true              |false|1             |COM        |84        |01             |012           |0                 |BOURG EN BRESSE         |Bourg-en-Bresse         |Bourg-en-Bresse         |0199      |null              |8            |210100533   |43306           |41527               |1779                 |210100533         |200071751|CA du Bassin de Bourg-en-Bresse   |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |

The folders under my Parquet file look fine:

codeDepartement=01
codeDepartement=2A
codeDepartement=75
codeDepartement=971

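Note: because of values such as 2A (for Corsica), department codes cannot all be converted to numeric values.
The snappy.parquet chunks are each stored in a folder like /data/tmp/etablissements_2020_true_true/codeDepartement=01: that is fine.
Later, I read the content of that store back. When I search for a city code (which in France starts with the department code) beginning with "01", the appropriate Parquet file and chunk are read: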

2021-03-24 07:14:33.825  INFO 13860 --- [er for task 106] o.a.s.s.e.datasources.FileScanRDD        : Reading File path: file:/data/tmp/etablissements_2020_true_true/codeDepartement=01/part-00024-f7d33eea-6d79-4f1a-bf35-0666dcc5e0f5.c000.snappy.parquet, range: 0-5246504, partition values: [1]

But when the department is displayed (it is now the last column in the show() output), it has the value "1" instead of "01":

+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|siren    |nic  |siret         |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse         |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie              |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex         |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1          |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|arrondissement|typeNomEtCharniere|nomMajuscules           |nomCommune              |libelle                 |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI                           |libelleNAF                                                                                   |codeDepartement|
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|015850944|00024|01585094400024|O                           |2007-04-01               |11                    |2017                       |null                             |2019-11-14T14:00:12  |false             |2                          |ZONE INDUSTRIELLE         |null      |null            |CHE       |DE THIL                  |01700     |null               |null                |01376      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2008-01-01            |A                             |null               |null     |null     |null                     |25.73B            |NAFRev2                       |O                              |true  |2017                        |true              |false|2             |COM        |84        |012           |0                 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113      |null              |5            |210103768   |4006            |3967                |39                   |210103768         |240100800|CC de Miribel et du Plateau       |Fabrication d'autres outillages                                                              |1              |
|015851793|00479|01585179300479|O                           |2005-01-01               |11                    |2017                       |null                             |2019-06-24T13:04:28  |false             |2                          |null                      |null      |null            |null      |ZONE INDUST LA FONTAINE  |01290     |null               |null                |01134      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2008-01-01            |A                             |null               |null     |null     |null                     |46.73A            |NAFRev2                       |O                              |true  |2017                        |true              |false|2             |COM        |84        |012           |0                 |CROTTET                 |Crottet                 |Crottet                 |0123      |null              |3            |210101341   |1777            |1734                |43                   |210101341         |200070555|CC de la Veyle                    |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |1              |
|015851793|00743|01585179300743|O                           |2012-09-01               |02                    |2017                       |null                             |2019-06-24T13:04:28  |false             |1                          |ZA ACTIPARC               |null      |null            |null      |PRE LION                 |01190     |null               |null                |01057      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2012-09-01            |A                             |null               |null     |null     |DORAS                    |46.73A            |NAFRev2                       |O                              |true  |2017                        |true              |false|1             |COM        |84        |012           |0                 |BOZ                     |Boz                     |Boz                     |0117      |null              |3            |210100574   |519             |512                 |7                    |210100574         |200071371|CC Bresse et Saône                |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |1              |
|015851793|00917|01585179300917|O                           |2020-01-01               |null                  |null                       |null                             |2020-01-31T16:13:25  |false             |1                          |null                      |28        |null            |AV        |DE MARBOZ                |01000     |null               |null                |01053      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2020-01-01            |A                             |CLEAU              |null     |null     |null                     |46.73A            |NAFRev2                       |O                              |true  |null                        |true              |false|1             |COM        |84        |012           |0                 |BOURG EN BRESSE         |Bourg-en-Bresse         |Bourg-en-Bresse         |0199      |null              |8            |210100533   |43306           |41527               |1779                 |210100533         |200071751|CA du Bassin de Bourg-en-Bresse   |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |1              |

Even though the schema of the Parquet file still declares a StringType:

|-- codeDepartement: string (nullable = true)

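What is happening?
I am inclined to blame the repartition() statement for this mess, but I don't see how it could be responsible. If that command were at fault and partitioning could not be done on string values, how could programs partition their data by colors spelled out as red, blue, and yellow?
I don't understand the overall behavior (problem?) I am facing.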


Answer 1 (f0ofjuux)

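You can disable the option spark.sql.sources.partitionColumnTypeInference.enabled.
From the Partition Discovery section of the docs:
[...] Sometimes users may not want to automatically infer the data types of the partitioning columns. For these use cases, the automatic type inference can be configured by spark.sql.sources.partitionColumnTypeInference.enabled, which is default to true. When type inference is disabled, string type will be used for the partitioning columns.
To set the option (the line below uses the Scala/PySpark form; in Java, the equivalent call is spark.conf().set(...) with the same arguments):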

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
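
To make the effect of that option more concrete, here is a minimal, Spark-free Java sketch. It is not Spark's implementation: the class and method names are hypothetical, and it assumes a simplified model in which a directory value that parses as an integer is first turned into a number and only rendered back to text afterwards. That model is enough to reproduce the symptom in the question ("01" read back as "1" while "2A" is untouched).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, Spark-free model of partition column type inference.
// Assumption: this only mimics the behaviour observed in the question;
// it is not how Spark is actually implemented.
final class PartitionInferenceSketch {

    // Returns the values a reader would see for the given raw directory values.
    static List<String> readBack(List<String> rawValues, boolean inferenceEnabled) {
        if (!inferenceEnabled) {
            // Inference off: the raw text of each directory name is kept as-is.
            return new ArrayList<>(rawValues);
        }
        List<String> result = new ArrayList<>();
        for (String raw : rawValues) {
            result.add(inferLiteral(raw));
        }
        return result;
    }

    // Inference on: a value that parses as an integer goes through a number,
    // so only its canonical rendering (without leading zeros) survives.
    static String inferLiteral(String raw) {
        try {
            return Integer.toString(Integer.parseInt(raw));
        } catch (NumberFormatException e) {
            return raw; // e.g. "2A" is not numeric and stays unchanged
        }
    }

    public static void main(String[] args) {
        List<String> dirs = List.of("01", "2A", "75", "971");
        System.out.println(readBack(dirs, true));  // prints [1, 2A, 75, 971]
        System.out.println(readBack(dirs, false)); // prints [01, 2A, 75, 971]
    }
}
```

With inference switched off in this model, every value keeps its original text, which is the behaviour the quoted documentation describes for the real option.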

Answer 2 (ddhy6vgd)

I can reproduce the issue.

spark.sql("select '01' key, 123 val union all select 'ab', 456").show()
+---+---+
|key|val|
+---+---+
| 01|123|
| ab|456|
+---+---+

spark.sql("select '01' key, 123 val union all select 'ab', 456").write().partitionBy("key").parquet("test")

spark.read().parquet("test").show()
+---+---+
|val|key|
+---+---+
|456| ab|
|123|  1|
+---+---+
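
Note that key now appears as the last column: with partitionBy, the partition value is kept in directory names such as key=01 rather than inside the Parquet files, so it is rebuilt from the path at read time. The following stdlib-only Java sketch (the class name, method name, and part-file name are hypothetical, and path unescaping is ignored) shows that the raw text "01" is still recoverable from such a path, so nothing is lost on disk:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: extracts "name=value" directory segments from a path.
final class PartitionPathSketch {

    // Collects every "name=value" segment of a path, keeping values as raw text.
    static Map<String, String> partitionValues(String filePath) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String segment : filePath.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                // Left of '=' is the column name, right of it the untouched value.
                values.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed file name, for illustration only.
        String path = "test/key=01/part-00000.snappy.parquet";
        System.out.println(partitionValues(path)); // prints {key=01}
    }
}
```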

To work around this, you can provide the schema when reading:

spark.read().schema(spark.read().parquet("test").schema()).parquet("test").show()
+---+---+
|val|key|
+---+---+
|456| ab|
|123| 01|
+---+---+

(Tested in PySpark; hopefully it also works in Java.)
