如何合并多个带有TIME头的CSV文件?

hrirmatl  于 2024-01-03  发布在  其他
关注(0)|答案(4)|浏览(171)

我有多个以下格式的CSV文件:假设我有以下文件
file1.csv:

  1. 15:06:00,16:06:00,00:06:00,18:06:00
  2. 111,133,166,126,0

字符串
file2.csv:

  1. 15:08:00,16:08:00,17:06:00,18:06:00
  2. 123,133,166,126,0


file3.csv:

  1. 15:06:00,16:06:00,17:08:00,18:08:00
  2. 123,133,166,126,0


如何获得以下格式?

  1. 15:06:00,15:08:00,16:06:00,16:08:00,17:06:00,17:08:00,18:06:00,18:08:00,00:06:00,18:06:00
  2. 111,,133,,,,,,166,126,0
  3. ,123,,133,166,,126,,,,0
  4. 123,,133,,,166,,126,,,0


我如何才能实现这一点?PS:请没有硬编码的答案只为这些文件。我有多个CSV文件与边缘情况下应用的解决方案。
我的尝试

  1. import pandas as pd
  2. import os
  3. import numpy
  4. file_paths = ['file1.csv', 'file2.csv', 'file3.csv']
  5. headers = []
  6. data_rows = []
  7. for file_path in file_paths:
  8. df = pd.read_csv(file_path)
  9. headers.append(pd.Series(df.columns))
  10. data_rows.append(pd.Series(df.iloc[0]))
  11. all_timestamps = pd.Series(pd.concat(headers).unique())
  12. master_df = pd.DataFrame(columns=all_timestamps)
  13. for header, data_row in zip(headers, data_rows):
  14. row = pd.Series(index=all_timestamps, data='')
  15. for time, value in zip(header, data_row):
  16. row[time] = value
  17. master_df = master_df.append(row, ignore_index=True)
  18. master_df.to_csv('merged_file.csv', index=False)


但是,我的尝试忽略了每个CSV文件的第一个值。我不知道如何进一步。编辑:只要您可以实现预期的输出,您就可以对数据进行排序。我删除了尾随的零以清楚地表达理解。我将我想要实现的过程可视化如下:


的数据
有没有一种垂直合并的方法来实现这一点?有什么方法来实现这一点?

deyfvvtc

deyfvvtc1#

你可以有一个中间步骤,你把所有的值都转储到一个列表中。然后你可以在列表上运行sort()。也许你必须把值转换为字符串,时间戳,或者任何与sort相关的东西。
范例:

  1. listExample = ["15:08", "15:06", "15:07", "15:10", "16:05", "16:03"]
  2. listExample.sort()
  3. Sorted listExample = ['15:06', '15:07', '15:08', '15:10', '16:03', '16:05']

字符串

hgc7kmma

hgc7kmma2#

Python的csv模块中的DictReader和DictWriter可以帮助你。
您可以利用DictWriter的两个属性来解决此问题:
1.在创建/初始化编写器时设置输出列的顺序
1.默认情况下,写入器将通过插入空字符串来“完成”缺少指定列的行
就像这样:

  1. all_rows = [
  2. {"12:33": "b", "14:55": "c", "11:22": "a"},
  3. {"12:33": "aa", "11:22": "bb" },
  4. ]
  5. with open("output.csv", "w", newline="") as f:
  6. writer = csv.DictWriter(f, fieldnames=["11:22", "12:33", "14:55"])
  7. writer.writeheader()
  8. writer.writerows(all_rows)

字符串
在输出CSV中,我们可以看到:
1.使用fieldnames属性设置的[11:22,12:33,14:55]的列顺序决定了所有行的最终顺序
1.第2行中缺少的值14:55已用空字符串(尾随逗号)填充

  1. 11:22,12:33,14:55
  2. a,b,c
  3. bb,aa,


你的解决方案需要遍历这些文件,并将每一行添加到一个大列表中,就像上面的all_rows一样。有些行会有键(列名),而其他行没有。然后你需要将所有列名收集到一个列表中,并对该列表进行排序。最后,你可以用最终的列名列表创建DictWriter,写标题,写行:

  1. import csv
  2. import glob
  3. all_rows = []
  4. for fname in sorted(glob.glob("input*.csv")):
  5. with open(fname, newline="") as f:
  6. reader = csv.DictReader(f)
  7. all_rows.extend(reader)
  8. col_names: set[str] = set()
  9. for row in all_rows:
  10. for col_name in row:
  11. col_names.add(col_name)
  12. with open("output.csv", "w", newline="") as f:
  13. writer = csv.DictWriter(f, fieldnames=sorted(col_names))
  14. writer.writeheader()
  15. writer.writerows(all_rows)

展开查看全部
l2osamch

l2osamch3#

输出中的错误源于这样一个事实,即在csv文件中,除了头部之外,每行都以0结尾,因此Pandas将每行中的第一列解析为行名称,然后扭曲所有其他赋值。改变预处理步骤以消除尾随的零,您会得到:

  1. import pandas as pd
  2. file_paths = ['file1.csv', 'file2.csv', 'file3.csv']
  3. headers = []
  4. data_rows = []
  5. for file_path in file_paths:
  6. file = open(file_path, 'r')
  7. lines = file.readlines()
  8. headers.append(lines[0].rstrip().split(','))
  9. for line in lines[1:]:
  10. data_rows.append(line.split(',')[:-1])
  11. all_timestamps = list(set([h for header in headers for h in header]))
  12. master_df = pd.DataFrame(columns=all_timestamps)
  13. for header, data_row in zip(headers, data_rows):
  14. row = pd.Series(index=all_timestamps, data='')
  15. for time, value in zip(header, data_row):
  16. row[time] = value
  17. master_df = pd.concat([master_df, pd.DataFrame([row])], ignore_index=True)
  18. master_df.to_csv('merged_file.csv', index=False)

字符串
请注意,这些列将没有特定的顺序。为了产生与您所说的完全相同的输出,您需要进行排序。解决下午开始的问题的一种替代方法是遍历每个标题,并将每个时间戳添加24,这比其前身小,然后在最终输出中取消。希望这有帮助。

展开查看全部
jobtbby3

jobtbby34#

而不是只考虑时间,我添加了日期的时间戳。

  1. # Initialize the VIZ_DATETIME column
  2. largest_group['VIZ_DATETIME'] = pd.Timestamp('1900-01-01')
  3. # Add the time to the base date
  4. largest_group['VIZ_DATETIME'] = pd.to_datetime(largest_group.VIZ_DATETIME.astype(str) + ' ' + largest_group.TIME.astype(str))
  5. # Adjust the day for times past midnight
  6. # To determine the rollover, we check if the current time is less than the previous time.
  7. largest_group['VIZ_DATETIME'] += largest_group['TIME'].diff().lt(pd.Timedelta(seconds=0)).cumsum() * pd.Timedelta(days=1)
  8. largest_group['VIZ_DATETIME'] = largest_group['VIZ_DATETIME'].dt.tz_localize(None)

字符串
然后,我初始化一个列表来保存DataFrame。将每个CSV文件加载到DataFrame中并将其追加到列表中。我使用迭代执行操作,并对DataFrame进行排序。然后,创建一个字典并填充数据。

  1. dataframes = []
  2. for file in csv_files:
  3. df = pd.read_csv(file)
  4. df['VIZ_DATETIME'] = pd.to_datetime(df['VIZ_DATETIME'])
  5. df = df.sort_values(by='VIZ_DATETIME')
  6. dataframes.append(df)
  7. all_datetimes = set()
  8. for df in dataframes:
  9. # Extract times and merge with all_datetimes
  10. df_times = df['VIZ_DATETIME'].dt.time
  11. all_datetimes |= set(df['VIZ_DATETIME'])
  12. # Sort all_datetimes
  13. all_datetimes = sorted(all_datetimes)
  14. # Create a dictionary for the new DataFrame using datetime
  15. output_dict = {datetime: [None] * len(dataframes) for datetime in all_datetimes}
  16. # Populate the dictionary with data
  17. for datetime in all_datetimes:
  18. for i, df in enumerate(dataframes):
  19. if datetime in df['VIZ_DATETIME'].values:
  20. output_dict[datetime][i] = df.loc[df['VIZ_DATETIME'] == datetime, 'VALUE1'].values[0]
  21. # Create the output DataFrame
  22. output_df = pd.DataFrame.from_dict(output_dict, orient='index')
  23. output_df.reset_index(inplace=True)
  24. output_df.columns = ['Time'] + [f'File{i+1}' for i in range(len(dataframes))]

展开查看全部

相关问题