pandas: find the consecutive indices of common values within groups

vdzxcuhz, posted 2023-02-02 in Other

For the following DataFrame:

import pandas as pd

df = pd.DataFrame({'Name': ['Tesla','Tesla','Tesla','Toyota','Ford','Ford','Ford','BMW','BMW','BMW','Mercedes','Mercedes','Mercedes'],
                   'Type': ['Model X','Model X','Model X','Corolla','Bronco','Bronco','Mustang','3 Series','-','-','Alpha','Alpha','S-Class'],
                   'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020, 2015, 2015, 2017, 2018, 2018, 2020],
                   'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000, 40000, 40000, 65000, 50000, 50000, 75000],
                   'Color': ['White','White','White','Red','Blue','Blue','Yellow','Silver','Silver','Black','White','White','Black']
                  })

df

    Name      Type      Year  Price  Color
0   Tesla     Model X   2015  85000  White
1   Tesla     Model X   2015  90000  White
2   Tesla     Model X   2015  95000  White
3   Toyota    Corolla   2017  20000  Red
4   Ford      Bronco    2018  35000  Blue
5   Ford      Bronco    2018  35000  Blue
6   Ford      Mustang   2020  45000  Yellow
7   BMW       3 Series  2015  40000  Silver
8   BMW       -         2015  40000  Silver
9   BMW       -         2017  65000  Black
10  Mercedes  Alpha     2018  50000  White
11  Mercedes  Alpha     2018  50000  White
12  Mercedes  S-Class   2020  75000  Black

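For each Name group, I am trying to find the start index and end index of consecutive common values in the other columns. Values that repeat consecutively should be treated as common, and the result should be stored in a dictionary with the start index as the key and the end index as the value.
For example, Tesla has Model X in common in the Type column, so for the Type column I should get the start index and end index of Model X as the first entry in the dictionary, stored as a start index: end index key-value pair.
Similarly, for the Color column, Blue should have a start index of 4 and an end index of 5, and for the Year column, 2015 should have a start index of 0 and an end index of 2.
Expected output: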
Name

{0: 2, 3: 3, 4: 6, 7: 9, 10: 12}

Type

{0: 2, 3: 3, 4: 5, 6: 6, 7: 7, 8: 9, 10: 11, 12: 12}

Year

{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}

Price

{0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}

Color

{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}

What I tried:

def find_indices(df, column):
    df1 = df.index.to_series().groupby(df[column]).agg(['first', 'last']).reset_index()
    df1 = df1.sort_values("first").reset_index()
    first_last_rows = df1.set_index('first')['last'].to_dict()
    print(column + ":")
    print(first_last_rows)

The output I get:

Name:
{0: 2, 3: 3, 4: 6, 7: 9, 10: 12}
Type:
{0: 2, 3: 3, 4: 5, 6: 6, 7: 7, 8: 9, 10: 11, 12: 12}
Year:
{0: 8, 3: 9, 4: 11, 6: 12}
Price:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
Color:
{0: 11, 3: 3, 4: 5, 6: 6, 7: 8, 9: 12}

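In this output, the end values for the Year and Color columns come from disjoint groups: rows with the same value are pooled together even when they are not adjacent (or belong to a different Name), so the consecutive common values are not found.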
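To illustrate the problem, here is a minimal sketch on the Year values alone (the variable names `year` and `spans` are only for illustration). Grouping on the values themselves pools every row that shares a year, wherever it appears, so `first` and `last` span rows that are not adjacent:

```python
import pandas as pd

# The Year column from the DataFrame above, with its default 0..12 index.
year = pd.Series([2015, 2015, 2015, 2017, 2018, 2018, 2020,
                  2015, 2015, 2017, 2018, 2018, 2020])

# Group the index labels by the year value itself, as find_indices does.
spans = year.index.to_series().groupby(year).agg(["first", "last"])
print(spans)

# 2015 occurs at rows 0-2 and again at rows 7-8, so its span is reported
# as 0..8 even though rows 3-6 hold other years.
print(dict(zip(spans["first"].tolist(), spans["last"].tolist())))
```

The dictionary printed at the end is the same one shown for Year in the output above.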

jyztefdp #1

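It looks like you need groups of consecutive values. A well-known way to get them is to combine shift and cumsum and then use groupby. For your problem, you can then take the first and last index of each group and build a dict as follows: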

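col = "Year"
col_dict = dict(
    # ne(shift()) flags each row whose value differs from the previous row;
    # cumsum() turns those flags into one id per consecutive run.
    df.groupby(df[col].ne(df[col].shift()).cumsum())[col]
    # For each run, keep the (first index, last index) pair.
    .agg(lambda x: (x.index[0], x.index[-1]))
    .values
)
print(col_dict)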

{0: 2, 3: 3, 4: 5, 6: 6, 7: 8, 9: 9, 10: 11, 12: 12}
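The same idea can be applied to every column at once. The following is a minimal sketch, not part of the original answer: the helper name `consecutive_runs` and its `group_col` parameter are hypothetical, and it assumes that a run should also be cut whenever Name changes, so that equal values in two adjacent Name groups are never merged.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tesla', 'Tesla', 'Tesla', 'Toyota', 'Ford', 'Ford', 'Ford',
             'BMW', 'BMW', 'BMW', 'Mercedes', 'Mercedes', 'Mercedes'],
    'Type': ['Model X', 'Model X', 'Model X', 'Corolla', 'Bronco', 'Bronco',
             'Mustang', '3 Series', '-', '-', 'Alpha', 'Alpha', 'S-Class'],
    'Year': [2015, 2015, 2015, 2017, 2018, 2018, 2020,
             2015, 2015, 2017, 2018, 2018, 2020],
    'Price': [85000, 90000, 95000, 20000, 35000, 35000, 45000,
              40000, 40000, 65000, 50000, 50000, 75000],
    'Color': ['White', 'White', 'White', 'Red', 'Blue', 'Blue', 'Yellow',
              'Silver', 'Silver', 'Black', 'White', 'White', 'Black'],
})


def consecutive_runs(frame, column, group_col="Name"):
    """Map start index -> end index for each run of equal consecutive values."""
    # True wherever the value differs from the previous row ...
    value_changed = frame[column].ne(frame[column].shift())
    # ... or wherever the group differs (assumption: a run must not span two Names).
    group_changed = frame[group_col].ne(frame[group_col].shift())
    # The cumulative sum of the change flags gives each run its own integer id.
    run_id = (value_changed | group_changed).cumsum()
    # First and last index label of every run.
    bounds = frame.index.to_series().groupby(run_id).agg(["first", "last"])
    return dict(zip(bounds["first"].tolist(), bounds["last"].tolist()))


runs = {col: consecutive_runs(df, col) for col in df.columns}
for col, mapping in runs.items():
    print(f"{col}: {mapping}")
```

On the sample data this should reproduce the five dictionaries listed under the expected output. The extra Name check makes no difference here, because every Name boundary already coincides with a value change, but it matters for data where two neighbouring groups share a value.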
