Error when using pandas drop_duplicates()

dfddblmv  posted on 2023-01-24  in  Other

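I am trying to use pandas drop_duplicates() while considering only a subset of the columns, but I get the error KeyError: Index(['days'], dtype='object')
The columns are as follows: id, event_description, attribute1, attribute 2, attribute 3, days, days_supply, days_equivalent
I want to ignore attribute 2 and attribute 3, so I ran the following command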

df = df.drop_duplicates(subset=['id', 'event_description', 'attribute1', 'days', 'days_supply', 'days_equivalent'])

The call returns:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-3f7da32b380f> in <module>
      7 
      8 df = df.drop_duplicates(subset=['id', 'event_description', 'attribute1', 'days', 
->    9 'days_supply', 'days_equivalent'])
     10 
     11 print(df)

/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in drop_duplicates(self, subset, keep, inplace)
   4892 
   4893         inplace = validate_bool_kwarg(inplace, "inplace")
-> 4894         duplicated = self.duplicated(subset, keep=keep)
   4895 
   4896         if inplace:

/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in duplicated(self, subset, keep)
   4949         diff = Index(subset).difference(self.columns)
   4950         if not diff.empty:
-> 4951             raise KeyError(diff)
   4952 
   4953         vals = (col.values for name, col in self.items() if name in subset)

KeyError: Index(['days'], dtype='object')

As soon as I remove days from the subset, dropping duplicates runs fine, but I need to make sure days is taken into account

6ojccjat #1

You need to re-check the column names. Column labels are case-sensitive, so Days and days are different names.
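One way to run this check is sketched below. The frame is hypothetical: it assumes the days column was loaded as 'Days ' (capitalised, with a trailing space), which is only a guess at what might have happened to the asker's data.

```python
import pandas as pd

# Hypothetical frame: the 'days' column arrived as 'Days ' (capital D, trailing space).
df = pd.DataFrame({"id": [1, 1, 2], "Days ": [30, 30, 60], "days_supply": [30, 30, 60]})

subset = ["id", "days", "days_supply"]

# Labels requested in `subset` that do not exist exactly as written.
missing = [c for c in subset if c not in df.columns]
print(missing)

# repr() makes stray spaces and capital letters visible.
print([repr(c) for c in df.columns])

# One possible fix: strip whitespace and lower-case every label, then retry.
df.columns = df.columns.str.strip().str.lower()
deduped = df.drop_duplicates(subset=subset)
```

Normalising every label is a blunt tool; renaming only the offending column with df.rename(columns=...) may be preferable if the other names must stay untouched.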

wn9m85ua #2

Also check whether the column name has gone missing for some reason, possibly as the result of a merge:
df.columns
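As an illustration of how a merge can make a column "disappear", here is a small sketch with two made-up frames (left and right are not from the question). When both sides carry a non-key column with the same name, merge appends the default suffixes _x and _y, so the plain name days no longer exists afterwards.

```python
import pandas as pd

# Two hypothetical frames that both contain a non-key column called 'days'.
left = pd.DataFrame({"id": [1, 2], "days": [30, 60]})
right = pd.DataFrame({"id": [1, 2], "days": [30, 90], "days_supply": [30, 90]})

merged = left.merge(right, on="id")

# The overlapping column is renamed with the default suffixes.
print(merged.columns.tolist())   # ['id', 'days_x', 'days_y', 'days_supply']
print("days" in merged.columns)  # False
```

If this is what happened, either pass the suffixed name in subset or drop/rename the duplicate column before calling drop_duplicates.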

yyyllmsg #3

Try:

df.drop_duplicates(subset=['id', 'event_description', 'attribute1', 'days', 'days_supply', 'days_equivalent'],inplace=True)

Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
Maybe your df is badly formatted. In any case, if you think the problem is related to dtype, you can use apply to check every value in df['date'], like this:

def checkType(someDate):
    # Do the verification here; as a placeholder the value is returned unchanged
    dateCorrected = someDate
    return dateCorrected

df['date'] = df['date'].apply(checkType)

anhgbhbe #4

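I reproduced a somewhat similar situation: a DataFrame whose columns are misconfigured (one extra pair of square brackets) still displays a result that looks fine (Figure 1).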

import pandas as pd

array = [
    ['001', 3, 3, 3, 1, 5, 4, 3],
    ['002', 7, 2, 1, 1, 1, 5, 1],
    ['003', 1, 6, 7, 6, 6, 7, 7]]

# NG configuration of the columns.
df_NG = pd.DataFrame(
    array,
    columns=[
        ['id', 'event_description', 'attribute1', 'attribute 2', 'attribute 3',
         'days', 'days_supply', 'days_equivalent']])

Figure 1: A pseudo-OK DataFrame (rotten inside)
But if you then try to drop duplicates,

df_NG = df_NG.drop_duplicates(
    subset=[
        'id', 'event_description', 'attribute1',
        'days', 'days_supply', 'days_equivalent'])

pandas returns:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [71], in <cell line: 1>()
----> 1 df_NG = df_NG.drop_duplicates(
      2     subset=[
      3         'id', 'event_description', 'attribute1',
      4         'days', 'days_supply', 'days_equivalent'])

File /usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File /usr/local/lib/python3.9/site-packages/pandas/core/frame.py:6125, in DataFrame.drop_duplicates(self, subset, keep, inplace, ignore_index)
   6123 inplace = validate_bool_kwarg(inplace, "inplace")
   6124 ignore_index = validate_bool_kwarg(ignore_index, "ignore_index")
-> 6125 duplicated = self.duplicated(subset, keep=keep)
   6127 result = self[-duplicated]
   6128 if ignore_index:

File /usr/local/lib/python3.9/site-packages/pandas/core/frame.py:6259, in DataFrame.duplicated(self, subset, keep)
   6257 diff = Index(subset).difference(self.columns)
   6258 if not diff.empty:
-> 6259     raise KeyError(diff)
   6261 vals = (col.values for name, col in self.items() if name in subset)
   6262 labels, shape = map(list, zip(*map(f, vals)))

KeyError: Index(['attribute1', 'days', 'days_equivalent', 'days_supply',
       'event_description', 'id'],
      dtype='object')

So I followed David's suggestion and found the culprit!

>>> df_NG.columns

MultiIndex([(               'id',),
        ('event_description',),
        (       'attribute1',),
        (      'attribute 2',),
        (      'attribute 3',),
        (             'days',),
        (      'days_supply',),
        (  'days_equivalent',)],
       )

Of course, the correct configuration looks like this :)

df_OK = pd.DataFrame(
    array,
    columns=[
        'id', 'event_description', 'attribute1', 'attribute 2', 'attribute 3',
        'days', 'days_supply', 'days_equivalent'])
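If the frame already exists and cannot easily be rebuilt, the one-level MultiIndex can probably be flattened in place instead. The following is a minimal sketch, not part of the original answer; df_fix and names are illustrative names, and it assumes the MultiIndex has exactly one level, as in the output above.

```python
import pandas as pd

array = [
    ['001', 3, 3, 3, 1, 5, 4, 3],
    ['002', 7, 2, 1, 1, 1, 5, 1],
    ['003', 1, 6, 7, 6, 6, 7, 7]]
names = ['id', 'event_description', 'attribute1', 'attribute 2', 'attribute 3',
         'days', 'days_supply', 'days_equivalent']

# Same mistake as df_NG: the extra brackets produce a one-level MultiIndex.
df_fix = pd.DataFrame(array, columns=[names])
print(isinstance(df_fix.columns, pd.MultiIndex))

# Flatten the one-level MultiIndex back to plain string labels.
if isinstance(df_fix.columns, pd.MultiIndex) and df_fix.columns.nlevels == 1:
    df_fix.columns = df_fix.columns.get_level_values(0)

# With plain labels the subset lookup succeeds.
df_fix = df_fix.drop_duplicates(
    subset=['id', 'event_description', 'attribute1',
            'days', 'days_supply', 'days_equivalent'])
```

The isinstance check is the quickest way to tell the two cases apart, since the printed frame looks the same either way.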
