python - How to rename a column in a DataFrame

gcuhipw9 · posted 2021-05-27 in Spark

I have a DataFrame named df2 with two columns (DEST_COUNTRY_NAME, count).
I created a new DataFrame from it as follows:

df3 = df2.groupBy("DEST_COUNTRY_NAME").sum('count')

I intend to rename the "sum(count)" column to "destination_total":

df5 = df3.selectExpr("cast(DEST_COUNTRY_NAME as string) DEST_COUNTRY_NAME", "cast(sum(count) as int) destination_total")

However, the following error occurs:

Py4JJavaError                             Traceback (most recent call last)
/content/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a,**kw)
     62         try:
---> 63             return f(*a,**kw)
     64         except py4j.protocol.Py4JJavaError as e:

4 frames
Py4JJavaError: An error occurred while calling o1195.selectExpr.
: org.apache.spark.sql.AnalysisException: cannot resolve '`count`' given input columns: [DEST_COUNTRY_NAME, sum(count)]; line 1 pos 5;
'Project [cast(DEST_COUNTRY_NAME#1110 as string) AS DEST_COUNTRY_NAME#1154, cast('count as int) AS count#1155]
+- AnalysisBarrier
      +- Aggregate [DEST_COUNTRY_NAME#1110], [DEST_COUNTRY_NAME#1110, sum(cast(count#1112 as bigint)) AS sum(count)#1120L]
         +- Project [cast(DEST_COUNTRY_NAME#1090 as string) AS DEST_COUNTRY_NAME#1110, cast(ORIGIN_COUNTRY_NAME#1091 as string) AS ORIGIN_COUNTRY_NAME#1111, cast(count#1092 as int) AS count#1112]
            +- Relation[DEST_COUNTRY_NAME#1090,ORIGIN_COUNTRY_NAME#1091,count#1092] csv

    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:122)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:127)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:127)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3296)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1307)
    at org.apache.spark.sql.Dataset.selectExpr(Dataset.scala:1342)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

During handling of the above exception, another exception occurred:

AnalysisException                         Traceback (most recent call last)
/content/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a,**kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: "cannot resolve '`count`' given input columns: [DEST_COUNTRY_NAME, sum(count)]; line 1 pos 5;\n'Project [cast(DEST_COUNTRY_NAME#1110 as string) AS DEST_COUNTRY_NAME#1154, cast('count as int) AS count#1155]\n+- AnalysisBarrier\n      +- Aggregate [DEST_COUNTRY_NAME#1110], [DEST_COUNTRY_NAME#1110, sum(cast(count#1112 as bigint)) AS sum(count)#1120L]\n         +- Project [cast(DEST_COUNTRY_NAME#1090 as string) AS DEST_COUNTRY_NAME#1110, cast(ORIGIN_COUNTRY_NAME#1091 as string) AS ORIGIN_COUNTRY_NAME#1111, cast(count#1092 as int) AS count#1112]\n            +- Relation[DEST_COUNTRY_NAME#1090,ORIGIN_COUNTRY_NAME#1091,count#1092] csv\n"

I intend to rename the "sum(count)" column to "destination_total". How can I fix this? I am working with Spark, not pandas.
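
The error happens because selectExpr parses sum(count) as a new aggregation over a column named count, which no longer exists after the groupBy; the column produced by .sum('count') is literally named "sum(count)". A minimal sketch of one way around this, using Spark SQL's backtick quoting to reference that literal column name (assuming the df3 defined above):

# Backticks make Spark treat `sum(count)` as a column name, not an expression
df5 = df3.selectExpr("cast(DEST_COUNTRY_NAME as string) DEST_COUNTRY_NAME",
                     "cast(`sum(count)` as int) destination_total")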


iezvtpos1#

Assuming your DataFrame has only these two columns, here are two ways to rename it.

df3 = df2.groupBy("DEST_COUNTRY_NAME").sum('count').toDF(*['DEST_COUNTRY_NAME', 'destination_total'])

You can also rename it with the alias function, like this:

df3.select("DEST_COUNTRY_NAME", col("sum(count)").alias("destination_total"))

P.S. Don't forget to import col:

from pyspark.sql.functions import col
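
Note that toDF renames positionally and therefore needs a name for every column in the DataFrame, which is why the first approach assumes only these two columns exist; the alias approach renames only the column you select.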

woobm2wo2#

df5 = df3.withColumnRenamed("sum(count)","destination_total")
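
This is probably the most direct option here. One caveat: withColumnRenamed is a no-op if no existing column matches the old name, so a typo in "sum(count)" would fail silently rather than raise an error.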

bfrts1fy3#

You can also use agg instead of calling sum directly, and alias the aggregated column:

from pyspark.sql.functions import sum  # Spark's sum; shadows the Python builtin

df3 = df2.groupBy("DEST_COUNTRY_NAME").agg(sum('count').alias('count'))

k5ifujac4#

from pyspark.sql.functions import *
df3 = df2.groupBy("DEST_COUNTRY_NAME") \
         .agg(sum("count").alias("destination_total"))
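
One caveat with the wildcard import: from pyspark.sql.functions import * shadows Python built-ins such as sum, min, and max. A sketch of the same aggregation with a qualified import, which avoids the shadowing:

from pyspark.sql import functions as F  # qualified import keeps the builtins intact

df3 = df2.groupBy("DEST_COUNTRY_NAME") \
         .agg(F.sum("count").alias("destination_total"))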
