Python - Parse a CSV to join multiple rows with subbalance IDs into one row

avkwfej4  posted on 2024-01-03  in  Python

Hey, I'm having some trouble parsing a CSV in Python with pandas.
My file has this structure:

user_id, user_main_account_id, user_subbalance_id 
abc1,uuid1,toJoin1 
abc1,uuid1,toJoin2
abc2,uuid2,toJoin3
abc2,uuid2,toJoin4

I need to convert it to a final format with multiple subbalance columns, one column after another, like this:

user_id ,user_main_account_id ,user_subbalance1_id ,user_subbalance2_id 
abc1    ,uuid1                ,toJoin1             ,toJoin2             
abc2    ,uuid2                ,toJoin3             ,toJoin4


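How can I do this easily with Python and pandas? In my attempts I ended up with duplicate rows for the same user_id, which is what I want to avoid.
I tried splitting the CSV by unique user_id + main_account_id and by main_account_id + subbalance_id, but after merging I got duplicates again.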

66bbxpm5  #1

Code:

# Assumes df already holds the CSV rows (e.g. from pd.read_csv).
# Number each subbalance within its (user_id, user_main_account_id) group.
df['subbalance_label'] = df.groupby(['user_id', 'user_main_account_id']).cumcount() + 1
result = (
    df.pivot_table(index=['user_id', 'user_main_account_id'], columns='subbalance_label', values='user_subbalance_id', aggfunc='first').reset_index()
)
# Rename only the pivoted (non-string) labels; keep the two key columns as they are.
result.columns = [col if isinstance(col, str) else f'user_subbalance{col}_id' for col in result.columns]
print(result)

Output:

  user_id user_main_account_id user_subbalance1_id user_subbalance2_id
0    abc1                uuid1             toJoin1             toJoin2
1    abc2                uuid2             toJoin3             toJoin4
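
The sample in the question has a space after each comma in the header and trailing spaces on some lines, and pandas would keep those in the column names and values unless told otherwise. The following is a minimal sketch of the same pivot run on that exact text. It assumes the whitespace is noise that should be stripped; the in-memory `raw` string stands in for the real file.

```python
import io

import pandas as pd

# The sample from the question, including the spaces after the commas
# in the header and the trailing spaces on some lines.
raw = (
    "user_id, user_main_account_id, user_subbalance_id \n"
    "abc1,uuid1,toJoin1 \n"
    "abc1,uuid1,toJoin2\n"
    "abc2,uuid2,toJoin3\n"
    "abc2,uuid2,toJoin4\n"
)

# skipinitialspace drops whitespace that follows a delimiter;
# trailing whitespace still has to be stripped by hand.
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
df.columns = df.columns.str.strip()
df["user_subbalance_id"] = df["user_subbalance_id"].str.strip()

keys = ["user_id", "user_main_account_id"]
# Position of each subbalance within its (user, main account) group: 1, 2, ...
df["subbalance_label"] = df.groupby(keys).cumcount() + 1
result = df.pivot_table(
    index=keys,
    columns="subbalance_label",
    values="user_subbalance_id",
    aggfunc="first",
).reset_index()
# Only the pivoted labels are non-strings, so the key columns keep their names.
result.columns = [
    col if isinstance(col, str) else f"user_subbalance{col}_id"
    for col in result.columns
]
print(result)
```

Without the two `strip` calls, the last column would be named `'user_subbalance_id '` (with a trailing space) and the `values=` lookup would fail.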

pdtvr36n  #2

I'd suggest thinking about why you want to do this. But this works:

import pandas as pd

columns = ['user_id', 'user_main_account_id', 'user_subbalance_id']

rows = [
    ['abc1', 'uuid1', 'toJoin1'],
    ['abc1', 'uuid1', 'toJoin2'],
    ['abc2', 'uuid2', 'toJoin3'],
    ['abc2', 'uuid2', 'toJoin4']
]

df = pd.DataFrame(rows, columns=columns)

# store all matching user_id and main_account as list
df = df.groupby(
    ['user_id', 'user_main_account_id']
)['user_subbalance_id'].agg(list).reset_index()

# how many new columns you will need
num_new_columns = df['user_subbalance_id'].apply(len).max()

# names of new columns
new_columns = [
    f'user_subbalance{i+1}_id' for i in range(num_new_columns)
]

# pad every list with None so it matches the number of new columns
df['user_subbalance_id'] = df['user_subbalance_id'].apply(
    lambda x: x + [None] * (num_new_columns-len(x))
)

# create new columns
df[new_columns] = pd.DataFrame(
    df.pop('user_subbalance_id').tolist(), 
    index=df.index
)

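
The padding step only matters when users have different numbers of subbalances, which the question's sample does not exercise. As a rough check, here is a sketch that wraps the same list-based steps in a hypothetical helper, `widen_subbalances` (the name is an assumption, not part of the answer), and runs it on uneven groups. It builds the new columns with `pd.concat` rather than column assignment, which is a small variation on the code above.

```python
import pandas as pd


def widen_subbalances(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper wrapping the list-based steps from this answer."""
    keys = ["user_id", "user_main_account_id"]
    # One row per (user, main account), with the subbalances gathered in a list.
    grouped = df.groupby(keys)["user_subbalance_id"].agg(list).reset_index()
    lists = grouped.pop("user_subbalance_id")
    # The longest list decides how many new columns are needed.
    width = int(lists.apply(len).max())
    names = [f"user_subbalance{i + 1}_id" for i in range(width)]
    # Pad shorter lists with None so every row has the same length.
    padded = [ids + [None] * (width - len(ids)) for ids in lists]
    wide = pd.DataFrame(padded, columns=names, index=grouped.index)
    return pd.concat([grouped, wide], axis=1)


rows = [
    ["abc1", "uuid1", "toJoin1"],
    ["abc1", "uuid1", "toJoin2"],
    ["abc3", "uuid3", "toJoin5"],
    ["abc4", "uuid4", "toJoin6"],
    ["abc4", "uuid4", "toJoin7"],
    ["abc4", "uuid4", "toJoin8"],
]
long_df = pd.DataFrame(
    rows, columns=["user_id", "user_main_account_id", "user_subbalance_id"]
)
wide_df = widen_subbalances(long_df)
print(wide_df)
```

Users with fewer subbalances end up with missing values in the trailing columns, which `to_csv` writes as empty fields by default.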

uyto3xhc  #3

Code:

import pandas as pd

df = pd.read_csv('file.csv')

final_df = df.pivot_table(index=['user_id', 'user_main_account_id'],
                      columns=df.groupby(['user_id', 'user_main_account_id']).cumcount() + 1,
                      values='user_subbalance_id',
                      aggfunc='first').reset_index()

# Bracket the comprehension so the expression can span lines without a syntax error
final_df.columns = ['user_id', 'user_main_account_id'] + [
    f'user_subbalance{i}_id' for i in range(1, len(final_df.columns) - 1)
]
print(final_df)
final_df.to_csv('output.csv', index=False)


93ze6v8z  #4

If you don't have to use pandas, or just don't want to, you can do this with Python's csv module and a simple intermediate dict that aggregates the other columns under user_id, like:

users = {
    "abc1": {
        "main_id": "uuid1",
        "sub_ids": ["toJoin1 ", "toJoin2"],
    },
    "abc2": {
        "main_id": "uuid2",
        "sub_ids": ["toJoin3", "toJoin4"],
    },
    "abc3": {
        "main_id": "uuid3",
        "sub_ids": ["toJoin5"],
    },
    "abc4": {
        "main_id": "uuid4",
        "sub_ids": ["toJoin6", "toJoin7", "toJoin8"],
    },
}

You can read the input and populate that structure like this:

import csv

users = {}
with open("input.csv", newline="") as f:
    reader = csv.reader(f, skipinitialspace=True)
    header = next(reader)

    for row in reader:
        uid = row[0]
        data = users.get(uid, {"main_id": "", "sub_ids": []})
        data["main_id"] = row[1]
        data["sub_ids"].append(row[2])

        users[uid] = data


You need to work out the maximum number of subbalance columns to create:

max_subids = max([len(data["sub_ids"]) for data in users.values()])
subid_cols = [f"user_subbalance{i}_id" for i in range(1, max_subids + 1)]


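Then loop over the users and write each user's subbalances into a single row, padding short rows with empty strings so that every subbalance column is filled: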

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header[:2] + subid_cols)

    for uid, data in users.items():
        row = [uid, data["main_id"], *data["sub_ids"]]
        # Pad row with empty strings to fill all columns (for proper CSV)
        row += [""] * (max_subids - len(data["sub_ids"]))

        writer.writerow(row)

| user_id | user_main_account_id | user_subbalance1_id | user_subbalance2_id | user_subbalance3_id |
|---------|----------------------|---------------------|---------------------|---------------------|
| abc1    | uuid1                | toJoin1             | toJoin2             |                     |
| abc2    | uuid2                | toJoin3             | toJoin4             |                     |
| abc3    | uuid3                | toJoin5             |                     |                     |
| abc4    | uuid4                | toJoin6             | toJoin7             | toJoin8             |

Using a TypedDict made this easier for me in the IDE, giving me autocomplete/suggestions and errors:

from typing import TypedDict

class UserData(TypedDict):
    main_id: str
    sub_ids: list[str]

users: dict[str, UserData] = {}

Everything else stays the same.
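
For reference, the read, count, and write steps of this answer can be combined into a single function. This is a sketch rather than the answer's exact code: the `widen` helper name is hypothetical, `io.StringIO` objects stand in for the real input and output files, and stripping trailing spaces from the subbalance IDs is an assumption about the data.

```python
import csv
import io
from typing import TypedDict


class UserData(TypedDict):
    main_id: str
    sub_ids: list[str]


def widen(src, dst) -> None:
    """Hypothetical helper: read long rows from src, write wide rows to dst."""
    users: dict[str, UserData] = {}
    reader = csv.reader(src, skipinitialspace=True)
    header = next(reader)
    for row in reader:
        if not row:  # skip blank lines
            continue
        uid, main_id, sub_id = row[:3]
        data = users.setdefault(uid, {"main_id": main_id, "sub_ids": []})
        # Assumption: trailing spaces in the IDs are noise, so strip them.
        data["sub_ids"].append(sub_id.strip())

    # The widest user decides how many subbalance columns are written.
    max_subids = max((len(d["sub_ids"]) for d in users.values()), default=0)
    subid_cols = [f"user_subbalance{i}_id" for i in range(1, max_subids + 1)]

    writer = csv.writer(dst)
    writer.writerow(header[:2] + subid_cols)
    for uid, data in users.items():
        # Pad short rows with empty strings so every row has all columns.
        padding = [""] * (max_subids - len(data["sub_ids"]))
        writer.writerow([uid, data["main_id"], *data["sub_ids"], *padding])


src = io.StringIO(
    "user_id, user_main_account_id, user_subbalance_id\n"
    "abc1,uuid1,toJoin1 \n"
    "abc1,uuid1,toJoin2\n"
    "abc2,uuid2,toJoin3\n"
    "abc2,uuid2,toJoin4\n"
    "abc3,uuid3,toJoin5\n"
    "abc4,uuid4,toJoin6\n"
    "abc4,uuid4,toJoin7\n"
    "abc4,uuid4,toJoin8\n"
)
dst = io.StringIO()
widen(src, dst)
print(dst.getvalue())
```

With real files, pass handles opened with `newline=""` as in the snippets above, for example `open("input.csv", newline="")` and `open("output.csv", "w", newline="")`.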
