How do I merge multiple CSV files that have TIME headers?

Posted by hrirmatl 11 months ago in Other

I have multiple CSV files in the following format. Suppose I have these files:
file1.csv:

15:06:00,16:06:00,00:06:00,18:06:00
111,133,166,126,0

file2.csv:

15:08:00,16:08:00,17:06:00,18:06:00
123,133,166,126,0


file3.csv:

15:06:00,16:06:00,17:08:00,18:08:00
123,133,166,126,0


How can I get the following format?

15:06:00,15:08:00,16:06:00,16:08:00,17:06:00,17:08:00,18:06:00,18:08:00,00:06:00,18:06:00
111,,133,,,,,,166,126,0
,123,,133,166,,126,,,,0
123,,133,,,166,,126,,,0


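How can I achieve this? PS: please do not post answers that are hard-coded for just these files. I have many CSV files, and the solution has to cope with their edge cases.
My attempt: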

import pandas as pd
import os
import numpy

file_paths = ['file1.csv', 'file2.csv', 'file3.csv'] 
headers = []
data_rows = []

for file_path in file_paths:
    df = pd.read_csv(file_path)

    headers.append(pd.Series(df.columns))
    data_rows.append(pd.Series(df.iloc[0]))

all_timestamps = pd.Series(pd.concat(headers).unique())

master_df = pd.DataFrame(columns=all_timestamps)

for header, data_row in zip(headers, data_rows):
   
    row = pd.Series(index=all_timestamps, data='')

    
    for time, value in zip(header, data_row):
        row[time] = value
    
   
master_df = master_df.append(row, ignore_index=True)

master_df.to_csv('merged_file.csv', index=False)


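However, my attempt ignores the first value of each CSV file, and I don't know how to go further. Edit: you may sort the data as long as you can produce the expected output. I removed the trailing zeros to make the idea clearer. I visualized the process I want to achieve as follows:

[image]

Is there a way to do this with a vertical merge? Is there any other way to achieve it?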

deyfvvtc1#

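You can add an intermediate step in which you dump all the header values into a list and then call sort() on that list. You may have to convert the values to strings, timestamps, or whatever type sorts correctly.
Example: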

listExample = ["15:08", "15:06", "15:07", "15:10", "16:05", "16:03"]
listExample.sort()

print(listExample)  # ['15:06', '15:07', '15:08', '15:10', '16:03', '16:05']
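
A plain string sort only matches chronological order when every label is zero-padded. As a minimal sketch of the "convert to timestamps" idea, the hypothetical helper time_key below parses each label with datetime.strptime and uses the result as the sort key. The sample labels and the assumption that labels are either HH:MM or HH:MM:SS are illustrative and not taken from the question.

```python
from datetime import datetime


def time_key(label: str) -> datetime:
    """Parse an 'HH:MM' or 'HH:MM:SS' label so that it sorts chronologically."""
    fmt = "%H:%M:%S" if label.count(":") == 2 else "%H:%M"
    return datetime.strptime(label, fmt)


# "9:30" is deliberately not zero-padded
labels = ["15:08", "9:30", "15:06", "16:03"]

by_string = sorted(labels)              # compares character by character
by_time = sorted(labels, key=time_key)  # compares the parsed times

print(by_string)
print(by_time)
```

With zero-padded labels such as those in the question, both orderings agree, so the key function only matters for less regular input.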


hgc7kmma2#

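DictReader and DictWriter from Python's csv module can help you here.
You can rely on two properties of DictWriter to solve this problem:
1. The order of the output columns is fixed when you create the writer.
2. By default, the writer "completes" any row that lacks one of the specified columns by inserting an empty string.
Like this: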

import csv

all_rows = [
    {"12:33":  "b", "14:55":  "c", "11:22": "a"},
    {"12:33": "aa", "11:22": "bb"              },
]

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["11:22", "12:33", "14:55"])
    writer.writeheader()
    writer.writerows(all_rows)

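In the output CSV we can see that:
1. The column order [11:22, 12:33, 14:55], set through the fieldnames argument, determines the final order for every row.
2. The value for 14:55 that is missing from row 2 has been filled with an empty string (hence the trailing comma).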

11:22,12:33,14:55
a,b,c
bb,aa,
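
The empty-string fill is controlled by DictWriter's restval parameter, which defaults to ''. The sketch below writes to an in-memory buffer and uses a placeholder of "NA" purely for illustration; the placeholder value and the explicit lineterminator are choices made for this example, not part of the answer above.

```python
import csv
import io

# One row that has no "14:55" key
rows = [{"12:33": "aa", "11:22": "bb"}]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["11:22", "12:33", "14:55"],
    restval="NA",          # written wherever a row lacks a column
    lineterminator="\n",   # keep the example output free of "\r\n"
)
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue())
```

Leaving restval at its default reproduces the blank cells the question asks for.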


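Your solution needs to loop over the files and add every row to one big list, like all_rows above. Some rows will have keys (column names) that other rows lack. Next, collect all column names into a single list and sort it. Finally, create the DictWriter with that final list of column names, write the header, and write the rows: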

import csv
import glob

all_rows = []
for fname in sorted(glob.glob("input*.csv")):
    with open(fname, newline="") as f:
        reader = csv.DictReader(f)
        all_rows.extend(reader)

col_names: set[str] = set()
for row in all_rows:
    for col_name in row:
        col_names.add(col_name)

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=sorted(col_names))
    writer.writeheader()
    writer.writerows(all_rows)
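
One caveat, based on the sample files in the question: each data row there carries one more value (the trailing 0) than the header has columns. DictReader stores such surplus values under its restkey, which defaults to None, and a None key would then get in the way of sorting the column names. The following sketch is one possible way around that, assuming the surplus values can simply be discarded. The names read_rows, merge, and the "_extra" key are hypothetical, and in-memory buffers stand in for real files.

```python
import csv
import io


def read_rows(f, extra_key="_extra"):
    """Yield one dict per data row, dropping values beyond the header."""
    reader = csv.DictReader(f, restkey=extra_key)
    for row in reader:
        row.pop(extra_key, None)  # discard surplus values such as the trailing 0
        yield row


def merge(files, out):
    """Merge rows from several open CSV files into one sorted-column CSV."""
    all_rows = []
    for f in files:
        all_rows.extend(read_rows(f))
    col_names = sorted({name for row in all_rows for name in row})
    writer = csv.DictWriter(out, fieldnames=col_names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(all_rows)


# Shortened stand-ins for two of the input files
file1 = io.StringIO("15:06:00,16:06:00\n111,133,0\n")
file2 = io.StringIO("15:08:00,16:06:00\n123,133,0\n")
out = io.StringIO()
merge([file1, file2], out)

print(out.getvalue())
```

If the trailing value has to be kept, it could instead be written under a fixed column name of its own; that choice depends on what the 0 means in the real data.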

l2osamch3#

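The error in your output comes from the fact that, in the CSV files, every row except the header ends with a 0. Each data row therefore has one more field than the header, so pandas parses the first column of each row as the row label (the index), which then shifts all the other assignments. Changing the preprocessing step to drop the trailing zero gives you: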

import pandas as pd

file_paths = ['file1.csv', 'file2.csv', 'file3.csv'] 
headers = []
data_rows = []

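for file_path in file_paths:
    # Read the file as plain text so pandas cannot shift the columns
    with open(file_path, 'r') as file:
        lines = file.readlines()
    headers.append(lines[0].rstrip().split(','))
    for line in lines[1:]:
        # Drop the last field (the trailing 0) from every data row
        data_rows.append(line.split(',')[:-1])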

all_timestamps = list(set([h for header in headers for h in header]))

master_df = pd.DataFrame(columns=all_timestamps)
for header, data_row in zip(headers, data_rows):
    row = pd.Series(index=all_timestamps, data='')
    for time, value in zip(header, data_row):
        row[time] = value
    master_df = pd.concat([master_df, pd.DataFrame([row])], ignore_index=True)

master_df.to_csv('merged_file.csv', index=False)

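Note that the columns will be in no particular order. To reproduce exactly the output you describe, you need to sort them. One way to handle headers that start in the afternoon and run past midnight is to walk through each header and add 24 hours to every timestamp that is smaller than its predecessor, then remove that offset again in the final output. Hope this helps.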
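The rollover idea can be sketched as follows. Rather than literally adding 24 to the hour, this version pairs each label with a day counter, which gives the same ordering and leaves the label itself untouched for the final output. The helper name with_day_offsets is hypothetical, and the sketch assumes zero-padded HH:MM:SS labels so that string comparison matches time order within a day.

```python
def with_day_offsets(header):
    """Pair each label with a day counter that increases whenever
    a label is smaller than the one before it."""
    day = 0
    previous = None
    keyed = []
    for label in header:
        if previous is not None and label < previous:
            day += 1  # time went backwards, so assume midnight was crossed
        keyed.append((day, label))
        previous = label
    return keyed


# Headers of file1.csv and file2.csv from the question
headers = [
    ["15:06:00", "16:06:00", "00:06:00", "18:06:00"],
    ["15:08:00", "16:08:00", "17:06:00", "18:06:00"],
]

# Tuples sort by day first, then by time of day
columns = sorted({key for h in headers for key in with_day_offsets(h)})
ordered_labels = [label for _, label in columns]

print(ordered_labels)
```

Because the columns are keyed by (day, label), 18:06:00 appears twice: once on the first day and once after the rollover, as in the expected header shown in the question.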

jobtbby34#

Instead of looking at the time alone, I added a date to each timestamp.

# Initialize the VIZ_DATETIME column
largest_group['VIZ_DATETIME'] = pd.Timestamp('1900-01-01')

# Add the time to the base date
largest_group['VIZ_DATETIME'] = pd.to_datetime(largest_group.VIZ_DATETIME.astype(str) + ' ' + largest_group.TIME.astype(str))

# Adjust the day for times past midnight
# To determine the rollover, we check if the current time is less than the previous time.
largest_group['VIZ_DATETIME'] += largest_group['TIME'].diff().lt(pd.Timedelta(seconds=0)).cumsum() * pd.Timedelta(days=1)
largest_group['VIZ_DATETIME'] = largest_group['VIZ_DATETIME'].dt.tz_localize(None)

Next, I initialize a list to hold the DataFrames, load each CSV file into a DataFrame, sort it by VIZ_DATETIME, and append it to the list. Then I create a dictionary and populate it with the data.

dataframes = []

for file in csv_files:
    df = pd.read_csv(file)
    df['VIZ_DATETIME'] = pd.to_datetime(df['VIZ_DATETIME'])
    df = df.sort_values(by='VIZ_DATETIME')
    dataframes.append(df)

all_datetimes = set()
for df in dataframes:
    # Collect every distinct datetime that occurs in any file
    all_datetimes |= set(df['VIZ_DATETIME'])

# Sort all_datetimes
all_datetimes = sorted(all_datetimes)

# Create a dictionary for the new DataFrame using datetime
output_dict = {datetime: [None] * len(dataframes) for datetime in all_datetimes}

# Populate the dictionary with data
for datetime in all_datetimes:
    for i, df in enumerate(dataframes):
        if datetime in df['VIZ_DATETIME'].values:
            output_dict[datetime][i] = df.loc[df['VIZ_DATETIME'] == datetime, 'VALUE1'].values[0]

# Create the output DataFrame
output_df = pd.DataFrame.from_dict(output_dict, orient='index')
output_df.reset_index(inplace=True)
output_df.columns = ['Time'] + [f'File{i+1}' for i in range(len(dataframes))]
