How to find duplicate columns in Pandas? [duplicate]

Posted by t5fffqht on 2023-05-15 in Other


This question already has answers here:

Identifying/removing redundant columns in a pandas dataframe (5 answers)
Closed yesterday.
I need to dynamically drop duplicate columns in Pandas, meaning columns whose values are identical across all records.
For example:
df:

Id  ProductName  ProductSize  ProductSize  ProductDesc  Quantity  SoldCount  Sales
1   Shoes        9            9            Shoes        143       143        6374
2   Bag          XL           XL           Bag          342       342        2839
3   Laptop       16INCH       16INCH       Laptop       452       452        8293
4   Shoes        9            9            Shoes        143       143        3662
5   Laptop       14INCH       14INCH       Laptop       452       452        7263

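In the table above there are duplicate columns with exactly the same name, and there are also columns with different names whose values are identical in every record. I am trying to remove those columns, keeping the first occurrence by default.
df_output: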

Id  ProductName  ProductSize  Quantity  Sales
1   Shoes        9            143       6374
2   Bag          XL           342       2839
3   Laptop       16INCH       452       8293
4   Shoes        9            143       3662
5   Laptop       14INCH       452       7263
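
To try the answers below, the example frame can be rebuilt roughly as follows. This is only a sketch: storing ProductSize as strings and the other numeric-looking columns as integers is an assumption, since the question does not state the dtypes.

```python
import pandas as pd

# Column names are passed separately so the repeated "ProductSize" label
# survives; a dict of columns could not hold the same key twice.
columns = ["Id", "ProductName", "ProductSize", "ProductSize",
           "ProductDesc", "Quantity", "SoldCount", "Sales"]
rows = [
    [1, "Shoes", "9", "9", "Shoes", 143, 143, 6374],
    [2, "Bag", "XL", "XL", "Bag", 342, 342, 2839],
    [3, "Laptop", "16INCH", "16INCH", "Laptop", 452, 452, 8293],
    [4, "Shoes", "9", "9", "Shoes", 143, 143, 3662],
    [5, "Laptop", "14INCH", "14INCH", "Laptop", 452, 452, 7263],
]
df = pd.DataFrame(rows, columns=columns)

print(df.shape)               # (5, 8)
print(df.columns.is_unique)   # False, because ProductSize appears twice
```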

pandas

Source: https://stackoverflow.com/questions/76247222/how-to-find-duplicate-column-in-pandas

2 Answers

km0tfn4u #1

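Method 1: transpose the DataFrame and use the duplicated() method to flag every column whose values repeat an earlier column, so that only the first occurrence is kept. The non-duplicate columns are then selected from the original DataFrame, and the result is assigned to df_output.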

# Transpose the DataFrame to make columns as rows
transposed_df = df.transpose()

# Find duplicate columns (excluding the first occurrence)
duplicate_columns = transposed_df.duplicated(keep='first')

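# Build a positional mask of the columns to keep (first occurrences only)
keep_mask = ~duplicate_columns.to_numpy()

# Select by position rather than by label: df[labels] would bring back every
# column that shares a repeated name such as ProductSize
df_output = df.loc[:, keep_mask].copy()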

# Print the resulting DataFrame
print(df_output)
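
When column names repeat, it matters whether the kept columns are selected by label or by position. The sketch below uses a small hypothetical frame (the names name, size and desc are made up for illustration) to compare the two selections.

```python
import pandas as pd

# Hypothetical frame: "size" appears twice and "desc" repeats "name"
df = pd.DataFrame(
    [["Shoes", "9", "9", "Shoes"], ["Bag", "XL", "XL", "Bag"]],
    columns=["name", "size", "size", "desc"],
)

# True for every column whose values repeat an earlier column
is_dup = df.T.duplicated(keep="first")

# Label-based selection: "size" matches both columns that carry that name
kept_labels = df.columns[~is_dup.to_numpy()]
by_label = df[kept_labels]

# Position-based selection: exactly one column per kept position
by_mask = df.loc[:, ~is_dup.to_numpy()]

print(list(by_label.columns))  # ['name', 'size', 'size']
print(list(by_mask.columns))   # ['name', 'size']
```

The positional mask is the safer choice whenever the input may contain repeated column names, as in the question.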

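What if Id contains duplicates? This updated version assumes Id is the DataFrame's index. The index is reset at the start with df.reset_index(inplace=True), which turns Id into a regular column. After the duplicate columns are removed, df_output.set_index('Id', inplace=True) makes Id the index again.
Resetting and then reassigning the index ensures that repeated Ids are retained in the resulting DataFrame.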

# Reset the index to convert the Id column to a regular column
df.reset_index(inplace=True)

# Transpose the DataFrame to make columns as rows
transposed_df = df.transpose()

# Find duplicate columns (excluding the first occurrence)
duplicate_columns = transposed_df.duplicated(keep='first')

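# Build a positional mask of the columns to keep (first occurrences only)
keep_mask = ~duplicate_columns.to_numpy()

# Select by position rather than by label: df[labels] would bring back every
# column that shares a repeated name such as ProductSize
df_output = df.loc[:, keep_mask].copy()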

# Set the Id column as the index again
df_output.set_index('Id', inplace=True)

print(df_output)

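Method 2: use the nunique() method to identify columns that hold only a single unique value. Note that this drops constant columns; it does not detect two columns that duplicate each other, so on the example df above it removes nothing.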

# Count the distinct values in each column
value_counts = df.nunique()

# Positional mask of the columns that hold more than one distinct value
varying_mask = (value_counts > 1).to_numpy()

# Keep only those columns; a mask avoids re-selecting columns by repeated name
df_output = df.loc[:, varying_mask].copy()

# Print the resulting DataFrame
print(df_output)
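
To make the behaviour of the nunique() filter concrete, here is a minimal sketch on a hypothetical three-column frame (the names a, b and c are made up). Column b copies a, and column c is constant.

```python
import pandas as pd

# Hypothetical frame: "b" copies "a", while "c" holds a single value
df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [7, 7, 7]})

# Number of distinct values per column: a=3, b=3, c=1
n_unique = df.nunique()

# Keep the columns with more than one distinct value
varying = df.loc[:, (n_unique > 1).to_numpy()]

print(list(varying.columns))  # ['a', 'b']
```

The constant column c is dropped, but the copied column b survives, so this filter complements Method 1 rather than replacing it.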

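What if Ids are duplicated? After keeping only the columns that pass the filter, df_output.index.duplicated() identifies the repeated Ids. The index is then reset to turn Id into a regular column, and df_output[~df_output['Id'].duplicated()] drops the rows with a repeated Id. Finally, df_output.set_index('Id', inplace=True) makes Id the index again.
This way you can handle duplicate Ids while filtering columns by the number of unique values they hold.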

# Count the distinct values in each column
value_counts = df.nunique()

# Positional mask of the columns that hold more than one distinct value
varying_mask = (value_counts > 1).to_numpy()

# Keep only those columns; a mask avoids re-selecting columns by repeated name
df_output = df.loc[:, varying_mask].copy()

# Identify duplicate IDs
duplicate_ids = df_output.index[df_output.index.duplicated()]

# Reset index for duplicate IDs
df_output.reset_index(inplace=True)

# Remove duplicate IDs from the DataFrame
df_output = df_output[~df_output['Id'].duplicated()]

# Set the ID column as the index again
df_output.set_index('Id', inplace=True)

print(df_output)

2023-05-15

ne5o7dgx #2

Use transpose together with drop_duplicates.
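
The answer gives no code, so the following is only one possible reading of it: transpose, drop the duplicate rows, and transpose back. The sample frame is a cut-down, hypothetical version of the question's data, and the infer_objects() call is an addition that the answer does not mention.

```python
import pandas as pd

# Cut-down sample: ProductDesc repeats ProductName, SoldCount repeats Quantity
df = pd.DataFrame(
    [[1, "Shoes", "Shoes", 143, 143], [2, "Bag", "Bag", 342, 342]],
    columns=["Id", "ProductName", "ProductDesc", "Quantity", "SoldCount"],
)

# Rows of df.T are the original columns, so drop_duplicates removes
# repeated columns and keeps the first occurrence of each
df_output = df.T.drop_duplicates().T

# Transposing a mixed-type frame yields object columns; infer_objects
# restores numeric dtypes where that is possible
df_output = df_output.infer_objects()

print(list(df_output.columns))  # ['Id', 'ProductName', 'Quantity']
```

Because the frame is transposed twice, this approach may be slow on wide or very long DataFrames.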