How do I reset the index and find a specific id?

col17t5w  posted on 2021-07-12  in Spark

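I have an id column that identifies each person (rows with the same id belong to the same person). I want two things:
Right now the id column is not a sequential number; it is a 10-digit value. How can I reset id to consecutive integers, e.g. 1, 2, 3, 4?
For example: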

id     col1
12a4   summer
12a4   goest
3b     yes
3b     No
3b     why
4t     Hi

Output:

id   col1
1    summer
1    goest
2    yes
2    No
2    why
3    Hi

How can I then get the rows with id=2?
For the example above:

id   col1
2    yes
2    No
2    why

k0pti3hp1#

from pyspark.sql import SparkSession
from pyspark.sql import Window, functions as F

spark = SparkSession.builder.getOrCreate()

data = [
('12a4', 'summer'),
('12a4', 'goest'),
('3b', 'yes'),
('3b', 'No'),
('3b', 'why'),
('4t', 'Hi')
]
df1 = spark.createDataFrame(data, ['id', 'col1'])
df1.show()

# +----+------+
# |  id|  col1|
# +----+------+
# |12a4|summer|
# |12a4| goest|
# |  3b|   yes|
# |  3b|    No|
# |  3b|   why|
# |  4t|    Hi|
# +----+------+

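# Number the distinct ids 1, 2, 3, ... in sorted order.
# Note: a window without partitionBy moves all rows to a single partition,
# which is acceptable here because only the distinct ids are ranked.
df = df1.select('id').distinct()
df = df.withColumn('new_id', F.row_number().over(Window.orderBy('id')))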
df.show()

# +----+------+
# |  id|new_id|
# +----+------+
# |12a4|     1|
# |  3b|     2|
# |  4t|     3|
# +----+------+

# Attach new_id to every original row by joining on the old id.
df = df.join(df1, 'id', 'full')
df.show()

# +----+------+------+
# |  id|new_id|  col1|
# +----+------+------+
# |12a4|     1|summer|
# |12a4|     1| goest|
# |  4t|     3|    Hi|
# |  3b|     2|   yes|
# |  3b|     2|    No|
# |  3b|     2|   why|
# +----+------+------+

# Drop the original id and let new_id take its place.
df = df.drop('id').withColumnRenamed('new_id', 'id')
df.show()

# +---+------+
# | id|  col1|
# +---+------+
# |  1|summer|
# |  1| goest|
# |  3|    Hi|
# |  2|   yes|
# |  2|    No|
# |  2|   why|
# +---+------+

# Keep only the rows that belong to the person with the new id 2.
df = df.filter(F.col('id') == 2)
df.show()

# +---+----+
# | id|col1|
# +---+----+
# |  2| yes|
# |  2|  No|
# |  2| why|
# +---+----+
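
The renumbering in this answer is a dense rank: every distinct id gets its 1-based position among the sorted distinct ids. The sketch below shows that logic in plain Python on the same sample data, so it can be checked without a Spark session. It is only an illustration for small in-memory data, not a replacement for the Spark code; `renumber_ids` and `rows_for_id_2` are hypothetical names introduced here. In Spark itself, `F.dense_rank()` over the same `Window.orderBy('id')` applied directly to `df1` should produce the same numbers without the distinct-and-join steps.

```python
# Plain-Python sketch of the renumbering logic (not Spark code).
rows = [
    ('12a4', 'summer'),
    ('12a4', 'goest'),
    ('3b', 'yes'),
    ('3b', 'No'),
    ('3b', 'why'),
    ('4t', 'Hi'),
]


def renumber_ids(rows):
    """Replace each original id with its 1-based rank among the sorted distinct ids."""
    # Sort the distinct ids (string order, matching an orderBy on a string column).
    distinct_ids = sorted({row_id for row_id, _ in rows})
    # Map each old id to a consecutive integer starting at 1.
    new_id = {old: i for i, old in enumerate(distinct_ids, start=1)}
    # Rewrite every row with its new id, keeping the original row order.
    return [(new_id[row_id], value) for row_id, value in rows]


renumbered = renumber_ids(rows)

# Select the rows whose new id is 2.
rows_for_id_2 = [row for row in renumbered if row[0] == 2]
print(rows_for_id_2)
```

Because the ids are compared as strings, '12a4' sorts before '3b', which sorts before '4t'; if the real ids are numeric, the ordering (and therefore the assigned numbers) may differ.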
