如何添加更多的列到dataframe提取列内的一个列来自CSV?

wn9m85ua  于 2023-03-27  发布在  其他
关注(0)|答案(2)|浏览(145)

I have a .CSV with multiple columns, okay, that's normal, but one of the columns have multiple columns inside, and I'm trying to get those columns and add it as new columns to the dataframe, to work with it, doing some correlations tests, etc...

Just to ilustrate, this is the format of the table coming from the CSV:

Those keys need to be new columns and the values new rows.

  • This is the first row of the CSV, it contains the name of the columns:

time,"labels_stats","lockers","keys","panels","glass","std_glass","avg_glass","sand","std_sand","avg_sand","gas","std_gas","avg_gas","temperature","std_ temperature","avg_ temperature","cracks","std_cracks","avg_cracks","cracks_forjed"

  • Here is the second row, where de values begins(it begins in second row and goes until row 8500):

03/24/2018 00:00:00,"color=123,brightness=16,rowling=9,rowling_gone=5,clipper=304,avg_clipper=19,std_clipper=51.917883880861964,billedclipper=152,avg_billedclipper=9.5,std_clipper=25.958941940430982,billedbox=2,avg_billedbox=0.125,std_billedbox=0.3415650255319866,box=4,avg_box=0.25,std_box=0.6831300510639732 color=1251,brightness=33,rowling=2,rowling_gone=13,clipper=0,avg_clipper=0,std_clipper=0,billedclipper=0,avg_billedclipper=0,std_clipper=0,billedbox=0,avg_billedbox=0,std_billedbox=0,box=0,avg_box=0,std_box=0 color=1252,brightness=13,rowling=1,rowling_gone=13,clipper=0,avg_clipper=0,std_clipper=0,billedclipper=0,avg_billedclipper=0,std_clipper=0,billedbox=0,avg_billedbox=0,std_billedbox=0,box=0,avg_box=0,std_box=0 color=443,brightness=506,rowling=186,rowling_gone=181,clipper=256995539,avg_clipper=507896.32213438733,std_clipper=2496550.2621190706,billedclipper=119301198,avg_billedclipper=235773.1185770751,std_clipper=1028066.899577621,billedbox=1025151,avg_billedbox=2025.990118577075,std_billedbox=7164.158497695328,box=2222381,avg_box=4392.057312252964,std_box=16220.701791372332 color=5228,brightness=2,rowling=2,rowling_gone=1,clipper=46370,avg_clipper=23185,std_clipper=13382.702940736599,billedclipper=14734,avg_billedclipper=7367,std_clipper=5987.780223087684,billedbox=313,avg_billedbox=156.5,std_billedbox=142.12846301849606,box=644,avg_box=322,std_box=284.2569260369921 color=53,brightness=38,rowling=15,rowling_gone=9,clipper=3570,avg_clipper=93.94736842105263,std_clipper=149.71828453003778,billedclipper=1171,avg_billedclipper=30.81578947368421,std_clipper=41.77924951337,billedbox=14,avg_billedbox=0.3684210526315789,std_billedbox=0.4888515295293514,box=28,avg_box=0.7368421052631579,std_box=0.9777030590587028 color=80,brightness=77,rowling=21,rowling_gone=41,clipper=1025327,avg_clipper=13315.935064935065,std_clipper=89590.98844622735,billedclipper=293454,avg_billedclipper=3811.090909090909,std_clipper=31167.5384803779,billedbox=2879,avg_billedbox=37.38961038961039,std_billedbox=297.0974568016429,box=7217,avg_box=93.72727272727273,std_box=765.3906060988116","77787.1720811062","97.85714285714286","33.714285714285715","686.9673467265001","35284.503610805674","317.1961642884739","937679.6209745132","181.6107821426806","67.57147890866946","6071.24803753883","386395.40130831","2682.444513691296",258071110,263,685,236,2230274,119610709,1028359
If you look closer, will realise that in the respective second column values, exists another columns inside with some sort of values. To be clear, they are inside one big quotation mark, begining in the "color=123(...) and finishing in the (..),std_box=765.3906060988116", then, the other columns have their values normally.

  • Highlighting the column that has multiple columns inside:

03/24/2018 00:00:00,***"color=123,brightness=16,rowling=9,rowling_gone=5,clipper=304,avg_clipper=19,std_clipper=51.917883880861964,billedclipper=152,avg_billedclipper=9.5,std_clipper=25.958941940430982,billedbox=2,avg_billedbox=0.125,std_billedbox=0.3415650255319866,box=4,avg_box=0.25,std_box=0.6831300510639732 color=1251,brightness=33,rowling=2,rowling_gone=13,clipper=0,avg_clipper=0,std_clipper=0,billedclipper=0,avg_billedclipper=0,std_clipper=0,billedbox=0,avg_billedbox=0,std_billedbox=0,box=0,avg_box=0,std_box=0 color=1252,brightness=13,rowling=1,rowling_gone=13,clipper=0,avg_clipper=0,std_clipper=0,billedclipper=0,avg_billedclipper=0,std_clipper=0,billedbox=0,avg_billedbox=0,std_billedbox=0,box=0,avg_box=0,std_box=0 color=443,brightness=506,rowling=186,rowling_gone=181,clipper=256995539,avg_clipper=507896.32213438733,std_clipper=2496550.2621190706,billedclipper=119301198,avg_billedclipper=235773.1185770751,std_clipper=1028066.899577621,billedbox=1025151,avg_billedbox=2025.990118577075,std_billedbox=7164.158497695328,box=2222381,avg_box=4392.057312252964,std_box=16220.701791372332 color=5228,brightness=2,rowling=2,rowling_gone=1,clipper=46370,avg_clipper=23185,std_clipper=13382.702940736599,billedclipper=14734,avg_billedclipper=7367,std_clipper=5987.780223087684,billedbox=313,avg_billedbox=156.5,std_billedbox=142.12846301849606,box=644,avg_box=322,std_box=284.2569260369921 color=53,brightness=38,rowling=15,rowling_gone=9,clipper=3570,avg_clipper=93.94736842105263,std_clipper=149.71828453003778,billedclipper=1171,avg_billedclipper=30.81578947368421,std_clipper=41.77924951337,billedbox=14,avg_billedbox=0.3684210526315789,std_billedbox=0.4888515295293514,box=28,avg_box=0.7368421052631579,std_box=0.9777030590587028 color=80,brightness=77,rowling=21,rowling_gone=41,clipper=1025327,avg_clipper=13315.935064935065,std_clipper=89590.98844622735,billedclipper=293454,avg_billedclipper=3811.090909090909,std_clipper=31167.5384803779,billedbox=2879,avg_billedbox=37.38961038961039,std_billedbox=297.0974568016429,box=7217,avg_box=93.72727272727273,std_box=765.3906060988116"***,"77787.1720811062","97.85714285714286","33.714285714285715","686.9673467265001","35284.503610805674","317.1961642884739","937679.6209745132","181.6107821426806","67.57147890866946","6071.24803753883","386395.40130831","2682.444513691296",258071110,263,685,236,2230274,119610709,1028359

  • I tried to do it using python lists and creating a function, but some columns have missing data and it's hard to reconstruct back, correctly, every row together with the new columns that were extracted, with the right time. But I have a pattern, that's: In the column with multiple columns inside, the inside columns begins with the field "color=(...)". I want to create new columns with the fields and therespective data that are after the field "color=(...)".

An example: For the field "color=123(...)", would be like this:
"color=123,brightness123=16,rowling123=9,rowling_gone123=5,clipper123=304,avg_clipper123=19,std_clipper123=51.917883880861964,billedclipper123=152,avg_billedclipper123=9.5,std_clipper123=25.958941940430982,billedbox123=2,avg_billedbox123=0.125,std_billedbox123=0.3415650255319866,box123=4,avg_box123=0.25,std_box123=0.6831300510639732"
And for the "color=5228(...)" would be like:
"color=5228,brightness5228=16,rowling5228=9,rowling_gone5228=5,clipper5228=304,avg_clipper5228=19,std_clipper5228=51.917883880861964,billedclipper5228=152,avg_billedclipper5228=9.5,std_clipper5228=25.958941940430982,billedbox5228=2,avg_billedbox5228=0.125,std_billedbox5228=0.3415650255319866,box5228=4,avg_box5228=0.25,std_box5228=0.6831300510639732 "
I tried to use python lists, but turns out that I can't rebuild the table without messing the rows sequence.
I need some ideas of how to do it using pandas or something from scikit learn. Any clues?

In the end, to ilustrate, I would like something like this:

2skhul33

2skhul331#

这并不优雅,但从一个像你这样的dummy df开始:

df = pd.DataFrame([
    {"time": 10, "labels": "color=123,brightness=456"}, 
    {"time": 10, "labels": "color=234,brightness=567"}
])

您可以应用lambda来拆分labels列,并创建一个新的DataFrame,如下所示:

attrs_df = pd.DataFrame(
    df["labels"].apply(
        lambda x: {
            y.split("=")[0]: y.split('=')[1] for y in x.split(',')
        }
    ).to_list()
)
attrs_df

attrs_df

color brightness
0   123 456
1   234 567

从那里你可以连接两个DataFrame。如果我有更多的时间,我会让它更优雅。如果可能的话,我可能会第一次使用cross-sections
编辑:更好地阅读你的问题(大量的文本)我看到你有多种颜色,并希望color=value在行中有value。这是对上面的一个小调整,但attrs_df不会在=上拆分键,而只是键:

pd.DataFrame(
    df["labels"].apply(
        lambda x: {
            y: y.split('=')[1] for y in x.split(',')
        }
    ).to_list()
)
ssm49v7z

ssm49v7z2#

尝试单独处理列:

# Extract labels_stats Series and flatten it
df1 = (df.pop('labels_stats').str.split().explode()
         .str.extractall(r'(?P<key>[^=]+)=(?P<val>[^,]+),?')
         .droplevel('match'))

# Add the numeric id (123, 1251, etc)
df1['key'] += df1['val'].where(df1['key'] == 'color').ffill().astype(str)

# Reshape the dataframe as the original one
df1 = df1.pivot_table(index=df1.index, columns='key', values='val', sort=False)

# Get the expected output
out = pd.concat([df, df1], axis=1)

输出:

time       lockers       keys     panels       glass     std_glass   avg_glass           sand    std_sand  ...  std_clipper80  billedclipper80  avg_billedclipper80  billedbox80  avg_billedbox80  std_billedbox80   box80  avg_box80   std_box80
0  03/24/2018 00:00:00  77787.172081  97.857143  33.714286  686.967347  35284.503611  317.196164  937679.620975  181.610782  ...   89590.988446         293454.0          3811.090909       2879.0         37.38961       297.097457  7217.0  93.727273  765.390606

[1 rows x 124 columns]

相关问题