DataFrames not stacked vertically when saving to CSV

myzjeezk, published 2023-11-14 in Other

data1.dat
data2.dat

  21 GLY C  5.978   9.254  9.454 0 0 0 0  1  0
  22 LEU C  6.778  10.534 12.640 0 0 1 2  2  0
  23 GLU C  7.187   7.217 10.728 0 0 0 0  2  0
  24 ASN C  5.392   8.296 10.702 0 0 0 0  0  0
  25 LEU C  5.657   6.064  9.609 0 0 0 1  3  0
  26 ALA C  5.446   5.528  7.503 0 0 0 0  2  0
  27 ARG C  5.656   8.071  8.419 0 0 0 0  0  0
  28 MSE C  6.890   9.157  8.728 0 0 0 0  1  0
  29 ARG C  6.330   7.993 11.562 0 0 0 0  0  0
  30 LYS H  5.428   5.207  5.897 0 0 0 0  1  0
  31 GLN H  5.402   5.046  6.349 0 0 0 0  1  0
  32 ASP H  5.426   5.093  6.226 0 0 0 1  1  0
  33 ILE H  5.361   5.004  6.194 0 0 0 0  6  0
  34 ILE H  5.443   5.150  6.190 0 0 0 0  5  0
  35 PHE H  5.403   5.181  6.293 0 0 0 0  1  0
  36 ALA H  5.533   5.357  6.193 0 0 0 0  3  0
  37 ILE H  5.634   5.167  6.025 0 0 0 1  5  0
  38 LEU H  5.402   5.121  6.104 0 0 0 0  3  0
  39 LYS H  5.470   5.092  6.101 0 0 0 0  1  0
  40 GLN H  5.491   5.210  6.054 0 0 0 0  2  0
import os
import pandas as pd
from src.utils.get_root_dir import get_root_directory

def save_dataframe_to_ascii(df, filepath):
    df.to_csv(filepath, sep=',', index=False)

def getDataFrame(dataDirectoryPathString: str) -> pd.DataFrame:
    dataframes = []
    for filename in os.listdir(dataDirectoryPathString):
        if filename.endswith('.dat'):
            filepath = os.path.join(dataDirectoryPathString, filename)
            df = pd.read_csv(filepath, sep='\t')
            dataframes.append(df)
    concatenated_df = pd.concat(dataframes, ignore_index=True)
    return concatenated_df

if __name__ == "__main__":
    dataFrame = getDataFrame(get_root_directory() + "/data/")
    save_dataframe_to_ascii(dataFrame, get_root_directory() + "/save/save.txt")

Output:
These rows should be stacked vertically.
Why does the output break?
How can I fix it?

gab6jxml #1

If you don't need Pandas (and so far I don't see you using it for anything other than concatenation), I suggest using Python's csv module to stack CSV-like files: it will be faster than Pandas and use almost no memory:

import csv

TAB = "\t"

# Get the files first; you could just use `glob.glob(get_root_directory() + '/data/*.dat')`
dat_files = ["input1.dat", "input2.dat"]

with open("output.dat", "w", newline="", encoding="utf-8") as f_out:
    writer = csv.writer(f_out, delimiter=TAB)

    for dat_file in dat_files:
        with open(dat_file, newline="", encoding="utf-8") as f_in:
            reader = csv.reader(f_in, delimiter=TAB)

            writer.writerows(reader)

The reader variable is an iterator over every row in the file. Calling writer.writerows(reader) iterates the reader row by row and hands each row directly to the writer. This approach uses almost no memory, because rows pass straight from the input to the output.
I also collect the files first, to avoid extra indentation/complexity.
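To show the row-streaming behaviour described above without touching the disk, here is a minimal, self-contained sketch using in-memory files (io.StringIO); the two StringIO objects just stand in for data1.dat and data2.dat:

```python
import csv
import io

TAB = "\t"

# Two tab-delimited "files" standing in for data1.dat and data2.dat.
fake_files = [
    io.StringIO("21\tGLY\tC\n22\tLEU\tC\n"),
    io.StringIO("30\tLYS\tH\n31\tGLN\tH\n"),
]

out = io.StringIO()
writer = csv.writer(out, delimiter=TAB)

for f_in in fake_files:
    reader = csv.reader(f_in, delimiter=TAB)
    # writerows() consumes the iterator row by row: nothing is buffered,
    # each row goes straight from the reader to the writer.
    writer.writerows(reader)

print(out.getvalue())
```

The four input rows come out vertically stacked in `out`, exactly as they would in output.dat.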
Using the sample files you shared (after converting each run of spaces to a TAB), my output.dat contains the rows of both files stacked vertically.
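Since your sample rows are aligned with runs of spaces rather than real tabs, a variant that splits each line on arbitrary whitespace avoids converting the files by hand. This is a sketch under that assumption; the function name, glob pattern, and paths are placeholders to adapt to your layout:

```python
import csv
import glob

def stack_dat_files(pattern: str, out_path: str) -> None:
    """Stack whitespace-delimited files matching `pattern` into one
    tab-delimited file at `out_path`."""
    with open(out_path, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out, delimiter="\t")
        for dat_file in sorted(glob.glob(pattern)):
            with open(dat_file, encoding="utf-8") as f_in:
                for line in f_in:
                    fields = line.split()  # splits on any run of spaces/tabs
                    if fields:             # skip blank lines
                        writer.writerow(fields)

stack_dat_files("data/*.dat", "output.dat")
```

`str.split()` with no argument collapses any run of whitespace, so the column alignment in the .dat files no longer matters.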
