With Spark 3.0.2, I am writing a Dataset to a Parquet file. My code ends like this:
etablissements = etablissements.repartition(col("codeDepartement"));
etablissements = etablissements.sortWithinPartitions(col("siret"));
etablissements = etablissements.persist();
// Write it in a file named with the year of data, selections, and sorting in its name.
// Underlying statement writing the parquet file is :
// ds.write().partitionBy(colonnesPartionnement /* = codeDepartement */)
saveToStore(etablissements, new String[] {"codeDepartement"},
"{0}_{1,number,#0}_{2}_{3}", "etablissements", anneeSIRENE, actifsSeulement,
communesValides);
The codeDepartement column has a StringType, because department codes in France are codes of up to three characters.
# schema() :
|-- codeDepartement: string (nullable = true)
It is visible in the last third of the show() output (three columns before the uppercase city name), and has the value "01":
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|siren |nic |siret |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1 |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|codeDepartement|arrondissement|typeNomEtCharniere|nomMajuscules |nomCommune |libelle |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI |libelleNAF |
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|015850944|00024|01585094400024|O |2007-04-01 |11 |2017 |null |2019-11-14T14:00:12 |false |2 |ZONE INDUSTRIELLE |null |null |CHE |DE THIL |01700 |null |null |01376 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |25.73B |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |01 |012 |0 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113 |null |5 |210103768 |4006 |3967 |39 |210103768 |240100800|CC de Miribel et du Plateau |Fabrication d'autres outillages |
|015851793|00479|01585179300479|O |2005-01-01 |11 |2017 |null |2019-06-24T13:04:28 |false |2 |null |null |null |null |ZONE INDUST LA FONTAINE |01290 |null |null |01134 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |46.73A |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |01 |012 |0 |CROTTET |Crottet |Crottet |0123 |null |3 |210101341 |1777 |1734 |43 |210101341 |200070555|CC de la Veyle |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
|015851793|00743|01585179300743|O |2012-09-01 |02 |2017 |null |2019-06-24T13:04:28 |false |1 |ZA ACTIPARC |null |null |null |PRE LION |01190 |null |null |01057 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2012-09-01 |A |null |null |null |DORAS |46.73A |NAFRev2 |O |true |2017 |true |false|1 |COM |84 |01 |012 |0 |BOZ |Boz |Boz |0117 |null |3 |210100574 |519 |512 |7 |210100574 |200071371|CC Bresse et Saône |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
|015851793|00917|01585179300917|O |2020-01-01 |null |null |null |2020-01-31T16:13:25 |false |1 |null |28 |null |AV |DE MARBOZ |01000 |null |null |01053 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2020-01-01 |A |CLEAU |null |null |null |46.73A |NAFRev2 |O |true |null |true |false|1 |COM |84 |01 |012 |0 |BOURG EN BRESSE |Bourg-en-Bresse |Bourg-en-Bresse |0199 |null |8 |210100533 |43306 |41527 |1779 |210100533 |200071751|CA du Bassin de Bourg-en-Bresse |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
I can see that the folders under my Parquet file look fine:
codeDepartement=01
codeDepartement=2A
codeDepartement=75
codeDepartement=971
Note: because of some values such as 2A (for Corsica), department codes cannot be converted to numeric values.
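To make that note concrete, here is a small standalone Java sketch. It is not part of the original code, and the class and method names are made up for illustration. It shows that a code such as 2A has no numeric form, and that a code such as 01 loses its leading zero as soon as it is treated as a number.

```java
import java.util.OptionalInt;

// Hypothetical helper, for illustration only: it is not part of the Spark job above.
final class DepartmentCodeDemo {

    /** Returns the code as an int when it is purely numeric, otherwise an empty OptionalInt. */
    static OptionalInt asNumber(String code) {
        try {
            return OptionalInt.of(Integer.parseInt(code));
        } catch (NumberFormatException e) {
            // Codes such as "2A" cannot be represented as a number.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        for (String code : new String[] {"01", "2A", "75", "971"}) {
            OptionalInt n = asNumber(code);
            String shown = n.isPresent() ? String.valueOf(n.getAsInt()) : "not numeric";
            System.out.println(code + " -> " + shown);
        }
    }
}
```

Running it prints `01 -> 1`, `2A -> not numeric`, `75 -> 75` and `971 -> 971`: the numeric form of `01` no longer round-trips to the original string.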
The snappy.parquet chunks are each stored in a folder like /data/tmp/etablissements_2020_true_true/codeDepartement=01: that's fine.
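On the reading side, I try to read the content of that store. When searching for a city code (which, in France, starts with the department code) beginning with "01", the appropriate Parquet file and chunk are read: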
2021-03-24 07:14:33.825 INFO 13860 --- [er for task 106] o.a.s.s.e.datasources.FileScanRDD : Reading File path: file:/data/tmp/etablissements_2020_true_true/codeDepartement=01/part-00024-f7d33eea-6d79-4f1a-bf35-0666dcc5e0f5.c000.snappy.parquet, range: 0-5246504, partition values: [1]
But when the department is displayed (it is now at the end of the dataset in the show() output), it has the value "1" instead of "01":
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|siren |nic |siret |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1 |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|arrondissement|typeNomEtCharniere|nomMajuscules |nomCommune |libelle |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI |libelleNAF |codeDepartement|
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|015850944|00024|01585094400024|O |2007-04-01 |11 |2017 |null |2019-11-14T14:00:12 |false |2 |ZONE INDUSTRIELLE |null |null |CHE |DE THIL |01700 |null |null |01376 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |25.73B |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |012 |0 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113 |null |5 |210103768 |4006 |3967 |39 |210103768 |240100800|CC de Miribel et du Plateau |Fabrication d'autres outillages |1 |
|015851793|00479|01585179300479|O |2005-01-01 |11 |2017 |null |2019-06-24T13:04:28 |false |2 |null |null |null |null |ZONE INDUST LA FONTAINE |01290 |null |null |01134 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |46.73A |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |012 |0 |CROTTET |Crottet |Crottet |0123 |null |3 |210101341 |1777 |1734 |43 |210101341 |200070555|CC de la Veyle |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
|015851793|00743|01585179300743|O |2012-09-01 |02 |2017 |null |2019-06-24T13:04:28 |false |1 |ZA ACTIPARC |null |null |null |PRE LION |01190 |null |null |01057 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2012-09-01 |A |null |null |null |DORAS |46.73A |NAFRev2 |O |true |2017 |true |false|1 |COM |84 |012 |0 |BOZ |Boz |Boz |0117 |null |3 |210100574 |519 |512 |7 |210100574 |200071371|CC Bresse et Saône |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
|015851793|00917|01585179300917|O |2020-01-01 |null |null |null |2020-01-31T16:13:25 |false |1 |null |28 |null |AV |DE MARBOZ |01000 |null |null |01053 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2020-01-01 |A |CLEAU |null |null |null |46.73A |NAFRev2 |O |true |null |true |false|1 |COM |84 |012 |0 |BOURG EN BRESSE |Bourg-en-Bresse |Bourg-en-Bresse |0199 |null |8 |210100533 |43306 |41527 |1779 |210100533 |200071751|CA du Bassin de Bourg-en-Bresse |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
Even though it is still declared as a StringType on the Parquet file:
|-- codeDepartement: string (nullable = true)
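What's happening?
I tend to believe that the repartition() statement is responsible for this mess, but I don't know how. If that command is misleading and partitioning cannot be done on string values, how could a program ever partition its data by values such as red, blue, or yellow?
I don't understand the overall behavior (problem?) I'm facing.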
2 Answers
f0ofjuux1#
You can disable the option spark.sql.sources.partitionColumnTypeInference.enabled. From the docs on Partition Discovery:
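[…] Sometimes users may not want to automatically infer the data types of the partitioning columns. For these use cases, the automatic type inference can be configured by spark.sql.sources.partitionColumnTypeInference.enabled, which is default to true. When type inference is disabled, string type will be used for the partitioning columns.
To set the option: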
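The snippet that followed here did not survive, so what follows is a minimal Java sketch of one way to set it, not the answer's original code. The application name and the local master are assumptions for illustration; the path is the one from the question. It needs Spark on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadWithoutPartitionTypeInference {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("read-etablissements") // hypothetical application name
            .master("local[*]")             // assumption: local run
            // Keep partition column values as strings instead of inferring their type.
            .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
            .getOrCreate();

        // On an already running session, the same option can be set with:
        // spark.conf().set("spark.sql.sources.partitionColumnTypeInference.enabled", "false");

        Dataset<Row> etablissements =
            spark.read().parquet("/data/tmp/etablissements_2020_true_true");
        etablissements.select("codeDepartement").distinct().show();

        spark.stop();
    }
}
```

The option has to be in place before the Parquet store is read, since partition values are resolved when the files are listed.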
ddhy6vgd2#
I can reproduce the issue.
To work around it, you can provide the schema when reading:
(tested in PySpark; hopefully it also works in Java)
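The PySpark snippet did not survive here. As a rough Java equivalent, and only a sketch: the schema below lists just three of the question's columns to stay short (a real read would list every column needed), and the session settings are assumptions. Declaring codeDepartement as STRING in the user-supplied schema is the point of the workaround.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadWithExplicitSchema {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("read-etablissements-with-schema") // hypothetical application name
            .master("local[*]")                         // assumption: local run
            .getOrCreate();

        // Shortened schema for illustration: only three of the columns are listed.
        // The partition column is explicitly declared as a string.
        String schema = "siret STRING, codeCommune STRING, codeDepartement STRING";

        Dataset<Row> etablissements = spark.read()
            .schema(schema)
            .parquet("/data/tmp/etablissements_2020_true_true");

        etablissements.show(4, false);

        spark.stop();
    }
}
```

Unlike the configuration option, this approach is local to one read and leaves the session's other reads unchanged.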