Dataframe:
+--------------------+--------------------+---------------------------------+----+
|             core_id|                guid|movie_theatrical_release_date_upd|year|
+--------------------+--------------------+---------------------------------+----+
|12f99f04-5168-438...|98e199d5-37b6-40a...|              2003-04-16 00:00:00|2003|
|32c7d12f-6bf2-4e5...|871e14c1-d046-41a...|              2004-05-28 00:00:00|2004|
|9f067041-3b49-4db...|419d8142-3e1f-489...|              2014-11-26 00:00:00|2014|
|c6d203cb-afcf-4e8...|6a2248de-7024-44c...|              2015-02-06 00:00:00|2015|
|b02416f9-5761-48f...|d7b505c2-5bc6-439...|              2008-06-27 00:00:00|2008|
|4b8a824d-a4f1-4f1...|3843b77d-61ae-427...|              2013-02-14 00:00:00|2013|
|2e522688-8332-4b3...|65e825ec-0486-42f...|              2003-11-14 00:00:00|2003|
|89632328-9a2c-499...|ac307c5e-f55a-40e...|              2012-08-17 00:00:00|2012|
|b670e071-6e9c-437...|e2490660-2fbe-44e...|              1995-12-15 00:00:00|1995|
|064d1587-0b18-434...|b84a04aa-013a-4bf...|              2011-07-22 00:00:00|2011|
|cfac2d11-81b6-408...|f9db54bc-6dc3-471...|              2015-03-13 00:00:00|2015|
What I want to do is create a decade column with the following contents:
+--------------------+--------------------+---------------------------------+----+------+
|             core_id|                guid|movie_theatrical_release_date_upd|year|decade|
+--------------------+--------------------+---------------------------------+----+------+
|12f99f04-5168-438...|98e199d5-37b6-40a...|              2003-04-16 00:00:00|2003|  2000|
|32c7d12f-6bf2-4e5...|871e14c1-d046-41a...|              2004-05-28 00:00:00|2004|  2000|
|9f067041-3b49-4db...|419d8142-3e1f-489...|              2014-11-26 00:00:00|2014|  2010|
|c6d203cb-afcf-4e8...|6a2248de-7024-44c...|              2015-02-06 00:00:00|2015|  2010|
|b02416f9-5761-48f...|d7b505c2-5bc6-439...|              2008-06-27 00:00:00|2008|  2000|
I am new to PySpark, so any help would be greatly appreciated.
1 Answer
Use floor() on the year divided by 10 (year/10), then multiply the result by 10.
We can also get the same result by replacing the last digit of the year with 0.
One way is to use concat and substring: keep the leading digits of the year and append "0".
Another is to use regexp_replace to replace the final digit with 0.
A third is to take the last digit with right and subtract it from year.