csv: Reshaping a stacked-style data file into a DataFrame with pandas

wvt8vs2t  posted on 2024-01-03  in  Other

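I have a CSV input file in the format shown below, and I am looking for a reasonably simple way to convert it into a normally shaped DataFrame in pandas. The CSV data file stacks all the data into two columns, with each data block separated by an empty row, as shown below. Note that, for ease of explanation, I used the same timestamp values for all three blocks, but in practice they can differ.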

Trace Name,SignalName1
Signal,<signal info>
Timestamp,Value
2023-10-04 15:36:43.757193 EDT,13
2023-10-04 15:36:43.829083 EDT,14
2023-10-04 15:36:43.895651 EDT,17
2023-10-04 15:36:43.931145 EDT,11
,
Trace Name,SignalName2
Signal,<signal info>
Timestamp,Value
2023-10-04 15:36:43.757193 EDT,131
2023-10-04 15:36:43.829083 EDT,238
2023-10-04 15:36:43.895651 EDT,413
2023-10-04 15:36:43.931145 EDT,689
,
Trace Name,SignalName3
Signal,<signal info>
Timestamp,Value
2023-10-04 15:36:43.757193 EDT,9867
2023-10-04 15:36:43.829083 EDT,1257
2023-10-04 15:36:43.895651 EDT,5736
2023-10-04 15:36:43.931145 EDT,4935

The desired output after reshaping should look like this:

    Timestamp           SignalName1 SignalName2 SignalName3
0   10/4/2023 15:36:43  13          131         9867
1   10/4/2023 15:36:43  14          238         1257
2   10/4/2023 15:36:43  17          413         5736
3   10/4/2023 15:36:43  11          689         4935

e3bfsja2 1#

You can split the file on the separator lines (a lone comma), call read_csv on each chunk, and then concat the results:

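import re
import io
import pandas as pd

with open('csv_file.csv') as f:
    # Split the text on the ',' separator lines, read each chunk on its own
    # (row 0 becomes the header, rows 1-2 are skipped), then join side by side
    out = (pd.concat([pd.read_csv(io.StringIO(chunk),
                                  header=0, skiprows=[1,2])
                        .set_index('Trace Name')
                      for chunk in re.split('(?:\n,)+\n', f.read())
                      if chunk], axis=1)
             .rename_axis('Timestamp').reset_index()
          )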

Assumptions (adjust if needed):

  • each header spans 3 lines
  • the first line of each header is "Trace Name", followed by the name to use as the column

Output:

             Timestamp  SignalName1  SignalName2  SignalName3
0  2023-10-04 15:36:43           13          131         9867
1  2023-10-04 15:36:43           14          238         1257
2  2023-10-04 15:36:43           17          413         5736
3  2023-10-04 15:36:43           11          689         4935
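
The question mentions that the timestamps may differ between blocks. A minimal sketch of how the same split-and-concat approach behaves in that case, assuming a made-up two-block sample (the names A, B and the timestamps t1..t3 are placeholders, not from the original data): concat with axis=1 does an outer join on the index by default, so a timestamp missing from one block shows up as NaN in that column.

```python
import io
import re

import pandas as pd

# Hypothetical sample: block B has a timestamp (t3) that block A lacks,
# and block A has one (t1) that block B lacks.
text = (
    "Trace Name,A\n"
    "Signal,<signal info>\n"
    "Timestamp,Value\n"
    "t1,1\n"
    "t2,2\n"
    ",\n"
    "Trace Name,B\n"
    "Signal,<signal info>\n"
    "Timestamp,Value\n"
    "t2,20\n"
    "t3,30\n"
)

frames = [
    pd.read_csv(io.StringIO(chunk), header=0, skiprows=[1, 2]).set_index("Trace Name")
    for chunk in re.split(r"(?:\n,)+\n", text)
    if chunk
]

# Outer join on the index: every timestamp seen in any block gets a row,
# and values missing from a block are filled with NaN.
out = pd.concat(frames, axis=1).rename_axis("Timestamp").reset_index()
print(out)
```

Columns that contain NaN are upcast to float; if that matters, the gaps could be filled or the frame joined with join='inner' instead.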

zpf6vheq 2#

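Sometimes I find it useful to intercept a text stream (for example, a file opened as a TextIOWrapper) and modify it on the fly, in streaming fashion.
This approach works for arbitrarily large files. Example applications include:

  • correcting known errors;
  • skipping (filtering out) parts of the file;
  • injecting extra data.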
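
As a rough illustration of those three uses, here is a minimal generator-based sketch. The function name modify_lines, the "teh" typo, and the injected header line are all made up for the example; only one line is held in memory at a time, which is what lets the idea scale to large files.

```python
import io


def modify_lines(lines):
    """Lazily yield lines, injecting, skipping, and correcting along the way."""
    yield "# injected header\n"            # inject extra data
    for line in lines:
        if not line.strip():               # skip (filter out) blank lines
            continue
        yield line.replace("teh", "the")   # correct a known error


# An in-memory stream stands in for an open file here.
source = io.StringIO("teh first line\n\nsecond line\n")
result = "".join(modify_lines(source))
print(result)
```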

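You can do that here and inject a "key" into every row. The key is picked up whenever the processor encounters a Trace Name,key line.
With a suitable custom Proc and the utility class FileModifier (both defined below), usage is simple: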

with open('test.csv') as f:
    df = pd.read_csv(FileModifier(f, Proc().proc))

The custom processor for your case would then be:

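import re

class Proc:
    def __init__(self):
        self.key = None  # name of the signal currently being read

    def proc(self, line):
        if (m := re.match(r'^Trace\s+Name,(.*)', line)):
            # Emit the CSV header only once, on the first block
            send_header = self.key is None
            self.key = m.group(1)
            if send_header:
                return 'SignalName,Timestamp,Value\n'
        elif re.match(r'^\d{4}', line):
            # Data row: prefix it with the current signal name
            return f'{self.key},{line}'
        # Any other line returns None and is dropped by FileModifier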


Finally, the FileModifier class (which can be used for many different kinds of on-the-fly modification) is:

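import io

class FileModifier:
    """File-like wrapper that passes each line of f through proc before it is read."""
    def __init__(self, f, proc):
        self.f = f
        self.buffer = io.StringIO()
        self.proc = proc  # callable: line -> modified line, or None to drop it
        self.linestream = iter(self._lines())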

    def _lines(self):
        for line in self.f:
            out = self.proc(line)
            if out is not None:
                yield out
        
    def read(self, size=-1):
        try:
            while size < 0 or self.buffer.tell() < size:
                self.buffer.write(next(self.linestream))
        except StopIteration:
            pass
        if size >= 0 and self.buffer.tell() > size:
            newbuf = io.StringIO()
            self.buffer.seek(0)
            out = self.buffer.read(size)
            newbuf.write(self.buffer.read())
            self.buffer = newbuf
        else:
            out = self.buffer.getvalue()
            self.buffer.seek(0)
            self.buffer.truncate()
        return out

Example

Using a file test.csv that contains the sample data:

with open('test.csv') as f:
    df = pd.read_csv(FileModifier(f, Proc().proc))
>>> df
     SignalName                       Timestamp  Value
0   SignalName1  2023-10-04 15:36:43.757193 EDT     13
1   SignalName1  2023-10-04 15:36:43.829083 EDT     14
2   SignalName1  2023-10-04 15:36:43.895651 EDT     17
3   SignalName1  2023-10-04 15:36:43.931145 EDT     11
4   SignalName2  2023-10-04 15:36:43.757193 EDT    131
5   SignalName2  2023-10-04 15:36:43.829083 EDT    238
6   SignalName2  2023-10-04 15:36:43.895651 EDT    413
7   SignalName2  2023-10-04 15:36:43.931145 EDT    689
8   SignalName3  2023-10-04 15:36:43.757193 EDT   9867
9   SignalName3  2023-10-04 15:36:43.829083 EDT   1257
10  SignalName3  2023-10-04 15:36:43.895651 EDT   5736
11  SignalName3  2023-10-04 15:36:43.931145 EDT   4935


Of course, you can then reshape:

>>> df.pivot(index='Timestamp', columns='SignalName', values='Value')
SignalName                      SignalName1  SignalName2  SignalName3
Timestamp                                                            
2023-10-04 15:36:43.757193 EDT           13          131         9867
2023-10-04 15:36:43.829083 EDT           14          238         1257
2023-10-04 15:36:43.895651 EDT           17          413         5736
2023-10-04 15:36:43.931145 EDT           11          689         4935
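
One caveat worth sketching: DataFrame.pivot raises a ValueError when the same (Timestamp, SignalName) pair occurs more than once, whereas pivot_table aggregates the duplicates. The tiny frame below is invented for illustration, and averaging the duplicates is only one possible choice.

```python
import pandas as pd

# Hypothetical long-format data with a duplicated (t1, A) pair.
long_df = pd.DataFrame({
    "Timestamp": ["t1", "t1", "t1"],
    "SignalName": ["A", "A", "B"],
    "Value": [1, 3, 10],
})

try:
    long_df.pivot(index="Timestamp", columns="SignalName", values="Value")
    pivot_failed = False
except ValueError:
    # pivot cannot decide which of the two values for (t1, A) to keep
    pivot_failed = True

# pivot_table resolves duplicates with an aggregation function instead.
wide = long_df.pivot_table(index="Timestamp", columns="SignalName",
                           values="Value", aggfunc="mean")
print(pivot_failed)
print(wide)
```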

9cbw7uwe 3#

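Since every line of the file consists of two comma-separated fields, another possibility is to read it with minimal read_csv arguments and then post-process the result.

1. Minimal read_csv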

import pandas as pd

raw = pd.read_csv('your_file.csv', 
                  sep=',', 
                  header=None,                  # Prevents first file line from being mistaken for column names
                  names=['Timestamp','Value'])  # Select columns names yourself instead
raw

                         Timestamp          Value
0                       Trace Name    SignalName1
1                           Signal  <signal info>
2                        Timestamp          Value
3   2023-10-04 15:36:43.757193 EDT             13
4   2023-10-04 15:36:43.829083 EDT             14
5   2023-10-04 15:36:43.895651 EDT             17
6   2023-10-04 15:36:43.931145 EDT             11
7                              NaN            NaN
8                       Trace Name    SignalName2
9                           Signal  <signal info>
10                       Timestamp          Value
11  2023-10-04 15:36:43.757193 EDT            131
12  2023-10-04 15:36:43.829083 EDT            238
13  2023-10-04 15:36:43.895651 EDT            413
14  2023-10-04 15:36:43.931145 EDT            689
15                             NaN            NaN
16                      Trace Name    SignalName3
17                          Signal  <signal info>
18                       Timestamp          Value
19  2023-10-04 15:36:43.757193 EDT           9867
20  2023-10-04 15:36:43.829083 EDT           1257
21  2023-10-04 15:36:43.895651 EDT           5736
22  2023-10-04 15:36:43.931145 EDT           4935

At this stage, raw.dtypes are all object because the rows are not homogeneous, but that is quickly sorted out.

Timestamp    object
Value        object
dtype: object

2. Post-processing

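import numpy as np

df = raw.copy()

# Move signal name to a new column
df['SignalName'] = np.where(df['Timestamp']=='Trace Name', df['Value'], np.nan)
df['SignalName'] = df['SignalName'].ffill()

# Drop all non-numerical rows (the regex matches only integers or decimals)
df = df[df['Value'].str.match(r'^([\d\.]+)$', na=False)].copy()
# Convert the remaining values from strings to numbers so they can be aggregated
df['Value'] = pd.to_numeric(df['Value'])
df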

                         Timestamp Value   SignalName
3   2023-10-04 15:36:43.757193 EDT    13  SignalName1
4   2023-10-04 15:36:43.829083 EDT    14  SignalName1
5   2023-10-04 15:36:43.895651 EDT    17  SignalName1
6   2023-10-04 15:36:43.931145 EDT    11  SignalName1
11  2023-10-04 15:36:43.757193 EDT   131  SignalName2
12  2023-10-04 15:36:43.829083 EDT   238  SignalName2
13  2023-10-04 15:36:43.895651 EDT   413  SignalName2
14  2023-10-04 15:36:43.931145 EDT   689  SignalName2
19  2023-10-04 15:36:43.757193 EDT  9867  SignalName3
20  2023-10-04 15:36:43.829083 EDT  1257  SignalName3
21  2023-10-04 15:36:43.895651 EDT  5736  SignalName3
22  2023-10-04 15:36:43.931145 EDT  4935  SignalName3

# Regroup by signal name
pd.pivot_table(data = df,
               values = 'Value',
               columns = 'SignalName',
               index = 'Timestamp')

SignalName                     SignalName1 SignalName2 SignalName3
Timestamp                                                         
2023-10-04 15:36:43.757193 EDT        13.0       131.0      9867.0
2023-10-04 15:36:43.829083 EDT        14.0       238.0      1257.0
2023-10-04 15:36:43.895651 EDT        17.0       413.0      5736.0
2023-10-04 15:36:43.931145 EDT        11.0       689.0      4935.0
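
The values come back as floats because pivot_table averages by default. If each (Timestamp, SignalName) pair occurs only once, one possible variation is to pass aggfunc='first', which should keep integer values as integers as long as no cell is missing. The small frame below is a stand-in for the post-processed df, with placeholder timestamps t1 and t2.

```python
import pandas as pd

# Stand-in for the post-processed long-format frame (values already numeric).
long_df = pd.DataFrame({
    "Timestamp": ["t1", "t2", "t1", "t2"],
    "SignalName": ["SignalName1", "SignalName1", "SignalName2", "SignalName2"],
    "Value": [13, 14, 131, 238],
})

# 'first' simply takes the single value of each group instead of averaging it.
wide = pd.pivot_table(data=long_df,
                      values="Value",
                      columns="SignalName",
                      index="Timestamp",
                      aggfunc="first")
print(wide)
print(wide.dtypes)
```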
