pyspark-从hive分区列逻辑获取最新分区

tzdcorbm  于 2021-06-27  发布在  Hive
关注(0)|答案(1)|浏览(789)

我是Pypark的新手。我正在尝试使用pyspark dataframes获取配置单元表的最新分区(日期分区),如下所示。但我确信有一种更好的方法可以使用Dataframe函数(而不是编写sql)。你能分享一些更好的方法吗。
这个解决方案是扫描配置单元表上的所有数据以获得它。

df_1 = sqlContext.table("dbname.tablename");

df_1_dates = df_1.select('partitioned_date_column').distinct().orderBy(df_1['partitioned_date_column'].desc())

lat_date_dict=df_1_dates.first().asDict()

lat_dt=lat_date_dict['partitioned_date_column']
1qczuiv0

1qczuiv01#

我同意@philantroft在评论中提到的观点。您可以使用下面的分区修剪过滤方法来限制为配置单元表扫描的分区数。

>>> spark.sql("""show partitions test_dev_db.newpartitiontable""").show();
+--------------------+
|           partition|
+--------------------+
|tran_date=2009-01-01|
|tran_date=2009-02-01|
|tran_date=2009-03-01|
|tran_date=2009-04-01|
|tran_date=2009-05-01|
|tran_date=2009-06-01|
|tran_date=2009-07-01|
|tran_date=2009-08-01|
|tran_date=2009-09-01|
|tran_date=2009-10-01|
|tran_date=2009-11-01|
|tran_date=2009-12-01|
+--------------------+

>>> max_date=spark.sql("""show partitions test_dev_db.newpartitiontable""").rdd.flatMap(lambda x:x).map(lambda x : x.replace("tran_date=","")).max()
>>> print max_date
2009-12-01
>>> query = "select city,state,country from test_dev_db.newpartitiontable where tran_date ='{}'".format(max_date)

>>> spark.sql(query).show();
+--------------------+----------------+--------------+
|                city|           state|       country|
+--------------------+----------------+--------------+
|         Southampton|         England|United Kingdom|
|W Lebanon        ...|              NH| United States|
|               Comox|British Columbia|        Canada|
|           Gasperich|      Luxembourg|    Luxembourg|
+--------------------+----------------+--------------+

>>> spark.sql(query).explain(True)
== Parsed Logical Plan ==
'Project ['city, 'state, 'country]
+- 'Filter ('tran_date = 2009-12-01)
   +- 'UnresolvedRelation `test_dev_db`.`newpartitiontable`

== Analyzed Logical Plan ==
city: string, state: string, country: string
Project [city#9, state#10, country#11]
+- Filter (tran_date#12 = 2009-12-01)
   +- SubqueryAlias newpartitiontable
      +- Relation[city#9,state#10,country#11,tran_date#12] orc

== Optimized Logical Plan ==
Project [city#9, state#10, country#11]
+- Filter (isnotnull(tran_date#12) && (tran_date#12 = 2009-12-01))
   +- Relation[city#9,state#10,country#11,tran_date#12] orc

== Physical Plan ==

* (1) Project [city#9, state#10, country#11]

+- *(1) FileScan orc test_dev_db.newpartitiontable[city#9,state#10,country#11,tran_date#12] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxx/dev/hadoop/database/test_dev..., PartitionCount: 1, PartitionFilters: [isnotnull(tran_date#12), (tran_date#12 = 2009-12-01)], PushedFilters: [], ReadSchema: struct<city:string,state:string,country:string>

您可以在上面的计划中看到partitioncount:1它只扫描了12个可用分区中的一个分区。

相关问题