Python: Optimizing Excel import into Pandas and the wide-to-long data transformation

zengzsys, posted 2022-10-30 in Python

I need to import and transform xlsx files. They are written in a wide format, and I need to take some of the cell information from each row and pair it with the information from all the other rows:
[Edit: changed the format to reflect a more complex requirement]

Source format

| ID | Property | Activity1name | Activity1timestamp | Activity2name | Activity2timestamp |
| --- | --- | --- | --- | --- | --- |
| 1 | A | a | 1.1.22 00:00 | b | 2.1.22 10:05 |
| 2 | B | a | 1.1.22 00:03 | b | 5.1.22 20:16 |

Target format

| ID | Property | Activity | Timestamp |
| --- | --- | --- | --- |
| 1 | A | a | 1.1.22 00:00 |
| 1 | A | b | 2.1.22 10:05 |
| 2 | B | a | 1.1.22 00:03 |
| 2 | B | b | 5.1.22 20:16 |
The following code transforms the data correctly, but the transformation is very, very slow:

import pandas as pd
from tqdm import tqdm

# `columns`, `column_matching`, `process_matching` and `timestamp` are
# configuration values defined elsewhere in the script.
def transform(data_in):
    data = pd.DataFrame(columns=columns)
    # Determine number of processes entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - (len(columns) - 2)) / len(process_matching) + 1)
    data_in = data_in.to_dict("records") # Convert to dict for speed optimization
    for row_dict in tqdm(data_in): # Iterate over each row of the original file
        new_row = {}
        # Set common columns for each process step
        for column in column_matching:
            new_row[column] = row_dict[column_matching[column]]
        for step in range(0, steps_per_row):
            rep = str(step+1) if step > 0 else ""
            # Iterate for as many times as there are process steps in one row of the original file and
            # set specific columns for each process step, keeping common column values identical for current row
            for column in process_matching:
                new_row[column] = row_dict[process_matching[column]+rep]
            data = data.append(new_row, ignore_index=True) # Append new_row dict to data (this per-row append is the main bottleneck; DataFrame.append was removed in pandas 2.0)
    data.index.name = "SortKey"
    data[timestamp].replace(r'\.000', '', regex=True, inplace=True) # Remove trailing '.000' from timestamps (dot escaped so it does not match any character)
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True) # Replace cells with only spaces with nan
    data.dropna(axis=0, how="all", inplace=True) # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True) # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True) # Drop rows with empty Timestamp
    data.fillna('', inplace=True) # Replace NaN values with empty cells
    return data

Obviously, iterating over every row and even every column is not how Pandas is meant to be used, but I don't see how this transformation could be vectorized.
I have tried parallelization (modin) and working with dicts, but it didn't work / didn't help. The rest of the script really just opens and saves the files, so the problem lies here.
Any ideas on how to speed this up would be greatly appreciated!

2w3kk1z5 1#

The df.melt function should be able to perform this type of operation much faster.

import pandas as pd

df = pd.DataFrame({'ID' : [1, 2],
                   'Property' : ['A', 'B'],
                   'Info1' : ['x', 'a'],
                   'Info2' : ['y', 'b'],
                   'Info3' : ['z', 'c'],
                   })

data = df.melt(id_vars=['ID', 'Property'], value_vars=['Info1', 'Info2', 'Info3'])
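
For reference, `data` then holds one ID/Property pair per Info column and row:

| ID | Property | variable | value |
| --- | --- | --- | --- |
| 1 | A | Info1 | x |
| 2 | B | Info1 | a |
| 1 | A | Info2 | y |
| 2 | B | Info2 | b |
| 1 | A | Info3 | z |
| 2 | B | Info3 | c |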

Edited to address the revised question: a combination of df.melt and df.pivot operations.


import numpy as np
import pandas as pd

# create data

df = pd.DataFrame({'ID' : [1, 2, 3],
                   'Property' : ['A', 'B', 'C'],
                   'Activity1name' : ['a', 'a', 'a'],
                   'Activity1timestamp' : ['1_1_22', '1_1_23', '1_1_24'],
                   'Activity2name' : ['b', 'b', 'b'],
                   'Activity2timestamp' : ['2_1_22', '2_1_23', '2_1_24'],
                   })

# melt dataframe

df_melted = df.melt(id_vars=['ID','Property'], 
             value_vars=['Activity1name', 'Activity1timestamp',
                         'Activity2name', 'Activity2timestamp',],
             )

# merge categories, i.e. Activity1name Activity2name become Activity

df_melted.loc[df_melted['variable'].str.contains('name'), 'variable'] = 'Activity'
df_melted.loc[df_melted['variable'].str.contains('timestamp'),'variable'] = 'Timestamp'

# add category ids (dataframe may need to be sorted before this operation)

u_category_ids = np.arange(1,len(df_melted.variable.unique())+1)
category_ids = np.repeat(u_category_ids,len(df)*2).astype(str)
df_melted.insert(0, 'unique_id', df_melted['ID'].astype(str) +'_'+ category_ids)

# pivot table

table = df_melted.pivot_table(index=['unique_id','ID','Property',], 
                              columns='variable', values='value',
                                    aggfunc=lambda x: ' '.join(x))
table = table.reset_index().drop(['unique_id'], axis=1)
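
The pivot pairs each activity name back up with its timestamp, one step per row; for the sample data above, `table` comes out as:

| ID | Property | Activity | Timestamp |
| --- | --- | --- | --- |
| 1 | A | a | 1_1_22 |
| 1 | A | b | 2_1_22 |
| 2 | B | a | 1_1_23 |
| 2 | B | b | 2_1_23 |
| 3 | C | a | 1_1_24 |
| 3 | C | b | 2_1_24 |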

8ehkhllq 2#

Using pd.melt, as suggested by @Pantelis, I was able to speed this transformation up incredibly. Previously, a file of roughly 13k rows took 4-5 hours on a brand-new ThinkPad X1; now it takes less than two minutes! That is a 150x speed-up, amazing. :)
Here is my new code, in case anyone with a similar data structure can use it as inspiration/reference:

import numpy as np
import pandas as pd

# `column_matching`, `process_matching` and `timestamp` are the same
# configuration values used in the original transform() from the question.
def transform(data_in):
    # Determine number of processes entered in a single row of the original file
    steps_per_row = int((data_in.shape[1] - len(column_matching)) / len(process_matching) )
    # Specify columns for pd.melt, transforming wide data format to long format
    id_columns = column_matching.values()
    var_names = {"Erledigungstermin Auftragsschrittbeschreibung":data_in["Auftragsschrittbeschreibung"].replace(" ", np.nan).dropna().values[0]}
    var_columns = ["Erledigungstermin Auftragsschrittbeschreibung"]
    for _ in range(2, steps_per_row+1):
        try:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = data_in["Auftragsschrittbeschreibung" + str(_)].replace(" ", np.nan).dropna().values[0]
        except IndexError:
            var_names["Erledigungstermin Auftragsschrittbeschreibung" + str(_)] = data_in.loc[0,"Auftragsschrittbeschreibung" + str(_)]
        var_columns.append("Erledigungstermin Auftragsschrittbeschreibung" + str(_))
    data = pd.melt(data_in, id_vars=id_columns, value_vars=var_columns, var_name="ActivityName", value_name=timestamp)
    data.replace(var_names, inplace=True) # Replace "Erledigungstermin Auftragsschrittbeschreibung" with ActivityName
    data.sort_values(["Auftrags-\npositionsnummer",timestamp], ascending=True, inplace=True)
    # Improve column names
    data.index.name = "SortKey"
    column_names = {v: k for k, v in column_matching.items()}
    data.rename(mapper=column_names, axis="columns", inplace=True)
    data[timestamp].replace(r'\.000', '', regex=True, inplace=True) # Remove trailing '.000' from timestamps (dot escaped so it does not match any character)
    data.replace(r'^\s*$', float('NaN'), regex=True, inplace=True) # Replace cells with only spaces with nan
    data.dropna(axis=0, how="all", inplace=True) # Remove empty rows
    data.dropna(axis=1, how="all", inplace=True) # Remove empty columns
    data.dropna(axis=0, subset=[timestamp], inplace=True) # Drop rows with empty Timestamp
    data.fillna('', inplace=True) # Replace NaN values with empty cells
    return data
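
As an alternative for the simpler column layout shown in the question, pd.wide_to_long can express the same wide-to-long reshape directly, without merging melted categories by hand. A minimal, self-contained sketch (not the code above; it assumes the Activity{N}name / Activity{N}timestamp naming from the sample tables):

import re

import pandas as pd

df = pd.DataFrame({'ID' : [1, 2],
                   'Property' : ['A', 'B'],
                   'Activity1name' : ['a', 'a'],
                   'Activity1timestamp' : ['1.1.22 00:00', '1.1.22 00:03'],
                   'Activity2name' : ['b', 'b'],
                   'Activity2timestamp' : ['2.1.22 10:05', '5.1.22 20:16'],
                   })

# Move the step number behind the stub (Activity1name -> name1) so that
# pd.wide_to_long can split each column name into stub + numeric suffix.
df = df.rename(columns=lambda c: re.sub(r'Activity(\d+)(name|timestamp)', r'\2\1', c))

long_df = (pd.wide_to_long(df, stubnames=['name', 'timestamp'],
                           i=['ID', 'Property'], j='step')
             .reset_index()
             .sort_values(['ID', 'step'])
             .drop(columns='step')
             .rename(columns={'name': 'Activity', 'timestamp': 'Timestamp'}))

This yields exactly the target format from the question, one activity/timestamp pair per row.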
