pandas: return a value by range matching [duplicate]

Posted by to94eoyn 5 months ago in Other

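This question already has answers here:

How to join two dataframes for which column values are within a certain range? (9 answers)
Merge pandas dataframes where one value is between two others [duplicate] (4 answers)
Closed 15 days ago.
I'm having trouble finding a solution to this problem. Normally I would use merge() for something like this, but I can't work out whether it can also attach the additional value here.
Data: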

import pandas as pd
import numpy as np

num = {'serial':[10,20,30,50]}
df = pd.DataFrame(num)
cols = {'StartSerial':[9,19,29,39],'StopSerial':[15,25,35,45],'Job':[564,859,748,125]}
df2 = pd.DataFrame(cols)

I tried something like this, but it is not possible because the indexes do not match:

df['Job'] = np.where((df['serial'] >= df2['StartSerial']) & (df['serial'] <= df2['StopSerial']),df2['Job'],'')


This does not work either:

df.loc[(df['serial'] >= df2['StartSerial']) & (df['serial'] <= df2['StopSerial']),Job] = df2['Job']


Desired output:

serial | Job
------------
10     | 564
20     | 859
30     | 748
50     |

Answer 1 (7xzttuei)

You can try creating a pd.IntervalIndex and then use that index to look up the correct values:

idx = pd.IntervalIndex.from_arrays(df2.StartSerial, df2.StopSerial, closed="both")
good_keys = df2.index.intersection(idx.get_indexer(df.serial))

df["job"] = df2.loc[good_keys, "Job"]
print(df)

Prints:

   serial    job
0      10  564.0
1      20  859.0
2      30  748.0
3      50    NaN
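
One caveat worth noting about the lookup above: get_indexer returns positions into df2, and assigning df2.loc[good_keys, "Job"] aligns on those labels, not on the rows of df. It gives the right result here only because the i-th serial happens to fall into the i-th range. Below is a minimal sketch, assuming the ranges do not overlap, that uses the positions directly. The reordered frame `serials` and the names `ranges`, `pos` and `jobs` are my own, chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same ranges as in the question; the serials are deliberately out of order.
serials = pd.DataFrame({"serial": [20, 10, 30, 50]})
ranges = pd.DataFrame({
    "StartSerial": [9, 19, 29, 39],
    "StopSerial": [15, 25, 35, 45],
    "Job": [564, 859, 748, 125],
})

idx = pd.IntervalIndex.from_arrays(
    ranges["StartSerial"], ranges["StopSerial"], closed="both"
)

# Position of the containing interval for each serial, -1 when none contains it.
pos = idx.get_indexer(serials["serial"])

# Look up Job by position, then blank out the rows that had no match.
jobs = ranges["Job"].to_numpy(dtype=float)
serials["job"] = np.where(pos >= 0, jobs[pos], np.nan)
print(serials)
```

As far as I know, get_indexer on an IntervalIndex raises an error when the intervals overlap, so this sketch only applies to non-overlapping ranges.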

Answer 2 (mqkwyuun)

# links each row of df to all rows of df2
df = df.merge(df2, how='cross')
#
    serial  StartSerial  StopSerial  Job
0       10            9          15  564
1       10           19          25  859
2       10           29          35  748
3       10           39          45  125
4       20            9          15  564
5       20           19          25  859
6       20           29          35  748
7       20           39          45  125
8       30            9          15  564
9       30           19          25  859
10      30           29          35  748
11      30           39          45  125
12      50            9          15  564
13      50           19          25  859
14      50           29          35  748
15      50           39          45  125
#
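# keep Job only where serial lies inside [StartSerial, StopSerial]; np.nan replaces np.NaN, which NumPy 2.0 removed
df['Job'] = np.where((df['serial'] >= df['StartSerial']) & (df['serial'] <= df['StopSerial']), df['Job'], np.nan)
# the tricky part is here: sum() turns an all-NaN group into 0, which is not desirable, so 0 is mapped back to NaN
df.groupby("serial", as_index=False)["Job"].sum().replace({0.0: np.nan})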
#
   serial    Job
0      10  564.0
1      20  859.0
2      30  748.0
3      50    NaN
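
Replacing 0 with NaN after the sum would also wipe out a genuine Job value of 0. A possible alternative, sketched here under the assumption that each serial falls into at most one range, is to pass min_count=1 to sum() so that an all-NaN group stays NaN. The names `crossed`, `in_range` and `result` are mine, and the frames are rebuilt so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"serial": [10, 20, 30, 50]})
df2 = pd.DataFrame({
    "StartSerial": [9, 19, 29, 39],
    "StopSerial": [15, 25, 35, 45],
    "Job": [564, 859, 748, 125],
})

# Pair every serial with every range (how="cross" needs pandas 1.2 or later).
crossed = df.merge(df2, how="cross")

# True where the serial lies inside the closed range of that row.
in_range = crossed["serial"].between(crossed["StartSerial"], crossed["StopSerial"])

# Keep Job only for matching pairs; non-matching pairs become NaN.
crossed["Job"] = crossed["Job"].where(in_range)

# min_count=1 makes sum() return NaN for groups that contain no valid value.
result = crossed.groupby("serial", as_index=False)["Job"].sum(min_count=1)
print(result)
```

If a serial could fall into several ranges, the sum would add their Job values together, so another aggregation such as first() would probably be a better fit in that case.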

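Other solutions do exist.
A solution that is easier to understand is to expand each Start/Stop range into the individual values it covers and then merge on those values.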

df2["serial"] = df2.apply(lambda x: [*range(x["StartSerial"], x["StopSerial"]+1)], axis=1)
#
   StartSerial  StopSerial  Job                        serial
0            9          15  564   [9, 10, 11, 12, 13, 14, 15]
1           19          25  859  [19, 20, 21, 22, 23, 24, 25]
2           29          35  748  [29, 30, 31, 32, 33, 34, 35]
3           39          45  125  [39, 40, 41, 42, 43, 44, 45]
#
df2 = df2.explode("serial")[["Job","serial"]].reset_index(drop=True)
#
    Job serial
0   564      9
1   564     10
2   564     11
3   564     12
4   564     13
5   564     14
6   564     15
7   859     19
8   859     20
9   859     21
10  859     22
11  859     23
12  859     24
13  859     25
14  748     29
15  748     30
16  748     31
17  748     32
18  748     33
19  748     34
20  748     35
21  125     39
22  125     40
23  125     41
24  125     42
25  125     43
26  125     44
27  125     45
#
# left merge keeps every serial; serials outside all ranges get NaN for Job
df.merge(df2, how="left", on="serial")
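
Since the first approach in this answer overwrites df with the cross join, the expand-and-merge steps need the original frames. Here is a self-contained sketch of the same idea, assuming integer serials and reasonably short ranges. The names `lookup` and `result` are mine, and the explicit cast to int64 is my addition so that both merge keys have the same dtype after explode.

```python
import pandas as pd

df = pd.DataFrame({"serial": [10, 20, 30, 50]})
df2 = pd.DataFrame({
    "StartSerial": [9, 19, 29, 39],
    "StopSerial": [15, 25, 35, 45],
    "Job": [564, 859, 748, 125],
})

# One list of covered serials per range (both endpoints included).
covered = pd.Series(
    [list(range(start, stop + 1))
     for start, stop in zip(df2["StartSerial"], df2["StopSerial"])],
    index=df2.index,
)

# One row per (serial, Job) pair; explode leaves an object column, so cast it back.
lookup = (
    df2.assign(serial=covered)
       .explode("serial")[["serial", "Job"]]
       .astype({"serial": "int64"})
       .reset_index(drop=True)
)

# Left merge keeps every serial from df; unmatched ones get NaN for Job.
result = df.merge(lookup, how="left", on="serial")
print(result)
```

The lookup table has one row per covered serial, so its size grows with the total width of all ranges, and the approach does not carry over to non-integer serials.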
