Serpentine sorting of a pandas DataFrame in Python

lnlaulya, posted 12 months ago in Python

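I am trying to find a way to do a special kind of sort on a pandas DataFrame in Python: a serpentine sort.
Here is a brief description of serpentine sorting:
Serpentine sorting reverses the sort order each time a boundary is crossed for a higher-level sort variable, which helps ensure that adjacent records are similar with respect to as many sort variables as possible.
Below is an example of a serpentine sort using three variables, each with three categories (low, medium, high). When variable 2 is sorted within the categories of variable 1, the sort order is reversed at each category boundary, and when variable 3 is sorted within the categories of variable 2, the sort order is reversed again, so that adjacent records are similar on all three variables.
[Image: An example of serpentine sorting]
When I was able to work in SAS, I used a SAS procedure (PROC SURVEYSELECT) to do the serpentine sort: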
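To make the description concrete, here is a minimal plain-Python sketch for just two variables. It is only an illustration of the idea, not the SAS or R implementation; the names `levels` and `order` are made up for the example.

```python
# Two variables, each with the same three ordered categories.
levels = ["Low", "Medium", "High"]

order = []
for i, v1 in enumerate(levels):
    # Walk the second variable forwards for even-numbered groups of the
    # first variable and backwards for odd-numbered ones.
    inner = levels if i % 2 == 0 else levels[::-1]
    for v2 in inner:
        order.append((v1, v2))

for row in order:
    print(row)
```

The pair on either side of each boundary of the first variable shares the same value of the second variable, which is the "adjacent records are similar" property described above.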

proc surveyselect data=&outfile sort=serp method=seq rate=1;
 
/* Grouping variables*/
 strata &class;
 
 /* Sort Variables */
 control &sortvar;

run;

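There is also a utility in an R package that performs serpentine sorting:
https://rdrr.io/github/adamMaier/samplingTools/man/serpentine.html
Does anyone know of a corresponding way to do this in Python with pandas?
Here is my best attempt. It works, but it breaks down after a few variables. I have never been able to match the output of the other methods above (e.g. PROC SURVEYSELECT).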

import pandas


def serpentine_sort(testframe, classvar, isortvar):
    
    
    newframe=pandas.DataFrame(testframe)
    
    # One ascending flag (True) per sort variable, used for the
    # initial within-class sort below.
    ascenlist = [True for x in isortvar]
    
    
    
    # Perform an initial sort where we first sort by the class, and then the sort variables
    # in ascending order within the class
    newframe = newframe.sort_values(classvar, ascending=True) \
    .groupby(classvar, sort=True) \
    .apply(lambda x: x.sort_values(isortvar, ascending=ascenlist)) \
    .reset_index(drop=True).copy(deep=True)
    
    
    
    
    
    ##### SERPENTINE SORT THE DATA ONE COLUMN AT A TIME                   #####
    ##### ==============================================                  #####
    ##### Create a sort variable that is a cumulative count within        #####
    ##### each group in the preceding variable. Use modulus division      #####
    ##### to reverse the count. If the count from the prior grouping      #####
    ##### variable is 0, sort ascending; if not, sort descending          #####
    
    
    grouplist = list(classvar)  # copy, so append() below does not mutate the caller's list
    for i in range(1, len(isortvar)):

        print ("Iteration: " , i )
        ranklist=[]
        ranklist.append(classvar[0])
        ranklist.append(isortvar[i-1])
        
        
        grouplist.append(isortvar[i-1])
        print (ranklist)
        print (grouplist)
        
        
        # Count the groups within the prior column within the class
        newframe["counter"] = newframe.groupby([isortvar[i-1]]).ngroup()
        
        newframe["serpvar"] = newframe.loc[ newframe["counter"] % 2 == 0 ].groupby(ranklist)[isortvar[i-1]].cumcount(ascending=True ) + 1
        newframe["t"]       = newframe.loc[ newframe["counter"] % 2 != 0 ].groupby(ranklist)[isortvar[i-1]].cumcount(ascending=False) + 1
        newframe.loc[ newframe["serpvar"].isna() , "serpvar" ] = newframe["t"]
        
        #print ("")
        #print ("Pre sorted")
        #print ("==========")
        #print ("")
        #print (newframe)
    
    
        newframe = newframe.groupby(grouplist, sort=False) \
        .apply(lambda x: x.sort_values("serpvar", ascending=[True])) \
        .reset_index(drop=True).copy(deep=True)
    
    
    return newframe


nscg=serpentine_sort(testframe=df, classvar=["BTHRGN"], isortvar=["AGEGRP", "WAPRI", "PRMBR"])


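Note: the classvar parameter is a grouping variable; I want to sort within the groups. The isortvar parameter is the list of my sort variables.
Any help would be sincerely appreciated.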

kyvafyod (answer #1)

Assuming this example input is similar to yours:

from itertools import product
import pandas as pd

cat = pd.CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True)

df = (pd.DataFrame(product(['Low', 'Medium', 'High'], repeat=3), dtype=cat)
        .rename(columns=lambda x: f'Variable{x+1}')
     )

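You could sort in the normal order first. Then, for each column, number the repeated occurrences of each combination of that column and the lower-priority columns to its right (a cumulative count per group), and turn that count into a sign: +1 for even occurrences, -1 for odd ones. Finally, multiply the category codes by those signs and sort again on the signed codes: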

cols = ['Variable1', 'Variable2', 'Variable3']

tmp =(df[cols]
 .sort_values(by=cols)
 .apply(lambda x: x.cat.codes)
)

order = (tmp.apply(lambda x: tmp.groupby(list(tmp.loc[:, x.name:]))
                                .cumcount().mod(2).mul(2).rsub(1))
            .mul(tmp).sort_values(by=cols).index
        )

out = df.reindex(order)

  • Note: I'm not sure how this generalizes to other datasets; if you think it doesn't fit all use cases, feel free to update the question.

Output:

   Variable1 Variable2 Variable3
0        Low       Low       Low
1        Low       Low    Medium
2        Low       Low      High
5        Low    Medium      High
4        Low    Medium    Medium
3        Low    Medium       Low
6        Low      High       Low
7        Low      High    Medium
8        Low      High      High
17    Medium      High      High
16    Medium      High    Medium
15    Medium      High       Low
12    Medium    Medium       Low
13    Medium    Medium    Medium
14    Medium    Medium      High
11    Medium       Low      High
10    Medium       Low    Medium
9     Medium       Low       Low
18      High       Low       Low
19      High       Low    Medium
20      High       Low      High
23      High    Medium      High
22      High    Medium    Medium
21      High    Medium       Low
24      High      High       Low
25      High      High    Medium
26      High      High      High
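A quick way to sanity-check an ordering like the one above: on a full grid of categories, a serpentine ordering should move by exactly one category step in exactly one variable between neighbouring rows. The sketch below rebuilds the example frame and tests that property. It builds the frame with `astype` instead of the `dtype=` argument (my own variation), and the names `signs`, `codes` and `steps` are mine, not part of the answer. The check only makes sense when every combination of categories is present.

```python
from itertools import product

import pandas as pd

levels = ["Low", "Medium", "High"]
cat = pd.CategoricalDtype(categories=levels, ordered=True)
cols = ["Variable1", "Variable2", "Variable3"]
df = pd.DataFrame(list(product(levels, repeat=3)), columns=cols).astype(cat)

# Same steps as in the answer: integer codes, alternating signs, re-sort.
tmp = df[cols].sort_values(by=cols).apply(lambda s: s.cat.codes)
signs = tmp.apply(
    lambda s: tmp.groupby(list(tmp.loc[:, s.name:])).cumcount().mod(2).mul(2).rsub(1)
)
out = df.reindex(signs.mul(tmp).sort_values(by=cols).index)

# Total number of category steps between each row and the previous one.
codes = out.apply(lambda s: s.cat.codes).astype(int)
steps = codes.diff().abs().sum(axis=1).iloc[1:]
print((steps == 1).all())
```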


Intermediates:

# tmp
    Variable1  Variable2  Variable3
0           0          0          0
1           0          0          1
2           0          0          2
3           0          1          0
4           0          1          1
5           0          1          2
6           0          2          0
7           0          2          1
8           0          2          2
9           1          0          0
10          1          0          1
11          1          0          2
12          1          1          0
13          1          1          1
14          1          1          2
15          1          2          0
16          1          2          1
17          1          2          2
18          2          0          0
19          2          0          1
20          2          0          2
21          2          1          0
22          2          1          1
23          2          1          2
24          2          2          0
25          2          2          1
26          2          2          2

# tmp.apply(lambda x: tmp.groupby(list(tmp.loc[:, x.name:]))
#                        .cumcount().mod(2).mul(2).rsub(1))
    Variable1  Variable2  Variable3
0           1          1          1
1           1          1          1
2           1          1          1
3           1          1         -1
4           1          1         -1
5           1          1         -1
6           1          1          1
7           1          1          1
8           1          1          1
9           1         -1         -1
10          1         -1         -1
11          1         -1         -1
12          1         -1          1
13          1         -1          1
14          1         -1          1
15          1         -1         -1
16          1         -1         -1
17          1         -1         -1
18          1          1          1
19          1          1          1
20          1          1          1
21          1          1         -1
22          1          1         -1
23          1          1         -1
24          1          1          1
25          1          1          1
26          1          1          1
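Since it is unclear how the sign trick generalizes, one option for cross-checking it on other data is a slow but direct implementation of the rule from the question: each sort variable switches direction every time a new group of the higher-level variables is entered. The helper name `serpentine_reference` and the demo frame are hypothetical. The sketch assumes a unique index, no missing values in the sort columns, and columns that are either plain sortable values or ordered categoricals. I have not compared it against PROC SURVEYSELECT.

```python
from itertools import product

import pandas as pd


def serpentine_reference(df, cols):
    """Hypothetical helper: reorder df by the hierarchic serpentine rule."""
    # ascending[k] is the direction in which the next group at level k is walked.
    ascending = [True] * len(cols)

    def walk(block, level):
        if level == len(cols):
            # Rows tied on every sort column keep their original order.
            return list(block.index)
        col = cols[level]
        values = block[col].drop_duplicates().sort_values(ascending=ascending[level])
        # The next group reached at this level is walked the other way.
        ascending[level] = not ascending[level]
        labels = []
        for value in values:
            labels.extend(walk(block[block[col] == value], level + 1))
        return labels

    return df.loc[walk(df, 0)]


# Demo on a 3 x 3 x 3 grid of integer codes (same layout as tmp above).
demo = pd.DataFrame(list(product(range(3), repeat=3)), columns=["a", "b", "c"])
result = serpentine_reference(demo, ["a", "b", "c"])
print(result.index.tolist())
```

On this grid the index order agrees with the output shown earlier in the answer. To sort within a class variable, one possibility is to call the helper separately on each group and concatenate the pieces, though whether that matches what SAS does within strata is something to verify.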
