How do I drop duplicates but keep the first occurrence in a PySpark DataFrame?

Posted by yuvru6vn on 2021-05-27 in Spark


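I am trying to remove duplicates from a DataFrame, but the first occurrence of each row must be kept. Every duplicate other than that first record should be stored in a separate DataFrame.
For example, suppose the DataFrame looks like this: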

col1,col2,col3,col4
r,t,s,t
a,b,c,d
b,m,c,d
a,b,c,d
a,b,c,d
g,n,d,f
e,f,g,h
t,y,u,o
e,f,g,h
e,f,g,h

In this case I should end up with two DataFrames.

df1:
r,t,s,t
a,b,c,d
b,m,c,d
g,n,d,f
e,f,g,h
t,y,u,o

The other DataFrame should be:

a,b,c,d
a,b,c,d
e,f,g,h
e,f,g,h

python apache-spark pyspark

Source: https://stackoverflow.com/questions/63343958/how-to-drop-duplicates-but-keep-first-in-pyspark-dataframe

1 Answer


swvgeqrz #1

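Try the `row_number()` window function. The window partitions on every column, so all rows within a partition are identical and it does not matter which copy receives `rn == 1`; `orderBy(lit(1))` only supplies the ordering that `row_number()` requires. Example:

```
from pyspark.sql import Window
from pyspark.sql.functions import col, lit, row_number

df.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# |   r|   t|   s|   t|
# |   a|   b|   c|   d|
# |   b|   m|   c|   d|
# |   a|   b|   c|   d|
# |   g|   n|   d|   f|
# |   e|   f|   g|   h|
# |   t|   y|   u|   o|
# |   e|   f|   g|   h|
# +----+----+----+----+

# Partition on all columns so that identical rows land in the same partition.
w = Window.partitionBy("col1", "col2", "col3", "col4").orderBy(lit(1))

# Number the rows inside each partition once and reuse the result.
numbered = df.withColumn("rn", row_number().over(w))

# rn == 1: one copy of every distinct row.
df1 = numbered.filter(col("rn") == 1).drop("rn")
df1.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# |   b|   m|   c|   d|
# |   r|   t|   s|   t|
# |   g|   n|   d|   f|
# |   t|   y|   u|   o|
# |   a|   b|   c|   d|
# |   e|   f|   g|   h|
# +----+----+----+----+

# rn > 1: every additional copy of a row that occurs more than once.
df2 = numbered.filter(col("rn") > 1).drop("rn")
df2.show()
# +----+----+----+----+
# |col1|col2|col3|col4|
# +----+----+----+----+
# |   a|   b|   c|   d|
# |   e|   f|   g|   h|
# +----+----+----+----+
```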
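As a rough illustration of why filtering on `rn` separates the first copy from the rest, the sketch below models the window in plain Python rather than Spark. The function name `split_by_row_number` and the in-memory list `rows` are assumptions made for this illustration only. Note one difference: this model numbers rows in input order, whereas `orderBy(lit(1))` in Spark leaves the choice of copy unspecified (harmless here, since the copies are identical).

```python
from collections import defaultdict

# The ten sample rows from the question, in their original order.
rows = [
    ("r", "t", "s", "t"),
    ("a", "b", "c", "d"),
    ("b", "m", "c", "d"),
    ("a", "b", "c", "d"),
    ("a", "b", "c", "d"),
    ("g", "n", "d", "f"),
    ("e", "f", "g", "h"),
    ("t", "y", "u", "o"),
    ("e", "f", "g", "h"),
    ("e", "f", "g", "h"),
]


def split_by_row_number(rows):
    """Plain-Python model of row_number() over a partition on all columns.

    Returns (first_rows, extra_rows): rows numbered 1 and rows numbered > 1.
    """
    seen = defaultdict(int)  # partition key (the whole row) -> rows counted so far
    first_rows, extra_rows = [], []
    for row in rows:
        seen[row] += 1  # this row's number within its partition
        if seen[row] == 1:
            first_rows.append(row)
        else:
            extra_rows.append(row)
    return first_rows, extra_rows


df1_rows, df2_rows = split_by_row_number(rows)
print(df1_rows)  # six distinct rows
print(df2_rows)  # two extra copies each of (a,b,c,d) and (e,f,g,h)
```

On the question's data this yields the six distinct rows in the first list and the four surplus copies in the second, which matches the two DataFrames the question asks for.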




Running the window-function code on the full ten-row DataFrame from the question gives the following input, df1, and df2 (the row order of the results may vary):

df:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   r|   t|   s|   t|
|   a|   b|   c|   d|
|   b|   m|   c|   d|
|   a|   b|   c|   d|
|   a|   b|   c|   d|
|   g|   n|   d|   f|
|   e|   f|   g|   h|
|   t|   y|   u|   o|
|   e|   f|   g|   h|
|   e|   f|   g|   h|
+----+----+----+----+

df1:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   b|   m|   c|   d|
|   r|   t|   s|   t|
|   g|   n|   d|   f|
|   t|   y|   u|   o|
|   a|   b|   c|   d|
|   e|   f|   g|   h|
+----+----+----+----+

df2:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   a|   b|   c|   d|
|   a|   b|   c|   d|
|   e|   f|   g|   h|
|   e|   f|   g|   h|
+----+----+----+----+
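An alternative worth considering, not part of the answer above, is to combine `dropDuplicates()` with `exceptAll()`. The helper below is a minimal sketch: the name `split_first_and_rest` is hypothetical, `df` is assumed to be a `pyspark.sql.DataFrame`, and `exceptAll` is assumed to be available (it was added in Spark 2.4). It applies only when duplicates are defined over all columns, as in the question.

```python
def split_first_and_rest(df):
    """Split a DataFrame into (distinct rows, surplus duplicate copies).

    Hypothetical helper for illustration; `df` is assumed to be a
    pyspark.sql.DataFrame whose duplicates are full-row duplicates.
    """
    # One copy of every distinct row (all columns are compared).
    df1 = df.dropDuplicates()
    # Multiset difference: a row that occurs n times in df and once in df1
    # is kept n - 1 times, so only the surplus copies remain.
    df2 = df.exceptAll(df1)
    return df1, df2
```

Called as `df1, df2 = split_first_and_rest(df)` on the question's data, `df2` should contain two copies each of `a,b,c,d` and `e,f,g,h`. The window-function version remains the more flexible choice if duplicates are ever defined on a subset of columns.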
