Extracting the decade from a year

Posted by xriantvc on 2021-05-27 in Spark

DataFrame:

+--------------------+--------------------+---------------------------------+----+
|             core_id|                guid|movie_theatrical_release_date_upd|year|
+--------------------+--------------------+---------------------------------+----+
|12f99f04-5168-438...|98e199d5-37b6-40a...|              2003-04-16 00:00:00|2003|
|32c7d12f-6bf2-4e5...|871e14c1-d046-41a...|              2004-05-28 00:00:00|2004|
|9f067041-3b49-4db...|419d8142-3e1f-489...|              2014-11-26 00:00:00|2014|
|c6d203cb-afcf-4e8...|6a2248de-7024-44c...|              2015-02-06 00:00:00|2015|
|b02416f9-5761-48f...|d7b505c2-5bc6-439...|              2008-06-27 00:00:00|2008|
|4b8a824d-a4f1-4f1...|3843b77d-61ae-427...|              2013-02-14 00:00:00|2013|
|2e522688-8332-4b3...|65e825ec-0486-42f...|              2003-11-14 00:00:00|2003|
|89632328-9a2c-499...|ac307c5e-f55a-40e...|              2012-08-17 00:00:00|2012|
|b670e071-6e9c-437...|e2490660-2fbe-44e...|              1995-12-15 00:00:00|1995|
|064d1587-0b18-434...|b84a04aa-013a-4bf...|              2011-07-22 00:00:00|2011|
|cfac2d11-81b6-408...|f9db54bc-6dc3-471...|              2015-03-13 00:00:00|2015|

What I want to do is create a decade column with the following contents:

+--------------------+--------------------+---------------------------------+----+------+
|             core_id|                guid|movie_theatrical_release_date_upd|year|decade|
+--------------------+--------------------+---------------------------------+----+------+
|12f99f04-5168-438...|98e199d5-37b6-40a...|              2003-04-16 00:00:00|2003|2000  | 
|32c7d12f-6bf2-4e5...|871e14c1-d046-41a...|              2004-05-28 00:00:00|2004|2000  |
|9f067041-3b49-4db...|419d8142-3e1f-489...|              2014-11-26 00:00:00|2014|2010  |
|c6d203cb-afcf-4e8...|6a2248de-7024-44c...|              2015-02-06 00:00:00|2015|2010  |
|b02416f9-5761-48f...|d7b505c2-5bc6-439...|              2008-06-27 00:00:00|2008|2000  |

I'm new to PySpark, so any help would be greatly appreciated.

lc8prwob


Divide year by 10, apply floor() to the result, then multiply by 10:

from pyspark.sql import functions as F

df.withColumn("decade", (F.floor(F.col("year")/10)*10).cast("int")).show()

# +----+------+
# |year|decade|
# +----+------+
# |2003|  2000|
# |2004|  2000|
# |2014|  2010|
# |2015|  2010|
# |2008|  2000|
# +----+------+
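
The arithmetic behind this can be checked outside Spark. Below is a minimal plain-Python sketch (not PySpark; the helper name `decade_floor` is hypothetical) that mirrors floor(year / 10) * 10 on a few of the years from the question:

```python
import math


def decade_floor(year: int) -> int:
    """Mirror floor(year / 10) * 10 (hypothetical helper, plain Python)."""
    # Dividing by 10 moves the last digit behind the decimal point,
    # floor() discards it, and multiplying by 10 restores the magnitude.
    return math.floor(year / 10) * 10


years = [2003, 2004, 2014, 2015, 2008, 1995]
decades = [decade_floor(y) for y in years]
print(decades)  # [2000, 2000, 2010, 2010, 2000, 1990]
```

Because floor() always rounds down, a year such as 2019 still lands in 2010 rather than being rounded up to 2020.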

We can also get the decade by replacing the last digit of year with 0.
Using concat and substring:

from pyspark.sql import functions as F

df.withColumn("decade", F.expr("""concat(substring(year,1,length(year)-1),0)""")).show()
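
To illustrate what the concat/substring expression does, here is a rough plain-Python equivalent (not PySpark; `decade_by_slicing` is a hypothetical name). It keeps every character of the year except the last and appends a literal 0:

```python
def decade_by_slicing(year: int) -> str:
    """Drop the last character of the year and append "0" (hypothetical helper)."""
    text = str(year)
    # text[:-1] plays the role of substring(year, 1, length(year) - 1).
    return text[:-1] + "0"


print(decade_by_slicing(2003))       # 2000
print(int(decade_by_slicing(2014)))  # 2010, after converting back to an integer
```

The result of string concatenation is a string, so a cast back to an integer is needed if a numeric decade column is required.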

Using regexp_replace:

from pyspark.sql import functions as F

# Raw string so that Python does not treat \d as a string escape sequence.
df.withColumn("decade", F.regexp_replace("year", r'\d(?!.*\d)', '0')).show()
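
The pattern may look cryptic: `\d(?!.*\d)` matches a digit that is not followed by any other digit later in the string, which is the last digit. As a small sketch, the same pattern can be tried with Python's `re` module (an assumption for illustration only; Spark evaluates the pattern with Java's regex engine, which handles this lookahead the same way):

```python
import re

# A digit with no further digit anywhere after it, i.e. the last digit.
LAST_DIGIT = r"\d(?!.*\d)"

for year in ["2003", "2014", "1995"]:
    print(year, "->", re.sub(LAST_DIGIT, "0", year))

# For strings made only of digits, the simpler pattern \d$ gives the same result.
simple = re.sub(r"\d$", "0", "2014")
print(simple)  # 2010
```

As with the concat variant, the replacement produces a string, so cast the result if an integer column is needed.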

Using right and subtracting the result from year:

from pyspark.sql import functions as F

df.withColumn("decade", F.expr("""int(year-right(year,1))""")).show()
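
This last variant subtracts the final digit from the year itself. A plain-Python sketch of the idea (not PySpark; `decade_by_subtraction` is a hypothetical name), together with a quick check that it agrees with the floor-based arithmetic over a range of years:

```python
def decade_by_subtraction(year: int) -> int:
    """Subtract the last digit from the year (hypothetical helper)."""
    # str(year)[-1] plays the role of right(year, 1).
    last_digit = int(str(year)[-1])
    return year - last_digit


# Compare against integer floor division for every year in an arbitrary range.
mismatches = [
    y for y in range(1900, 2030)
    if decade_by_subtraction(y) != (y // 10) * 10
]
print(decade_by_subtraction(2008))  # 2000
print(mismatches)                   # []
```

For positive years the approaches give the same decade, so the choice is mostly a matter of readability; the arithmetic versions avoid the round trip through strings.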
