Faster way to split a large CSV file evenly into smaller CSV files?

vfh0ocws asked this on 2023-07-31

I'm sure there's a better way to do this, but I'm drawing a blank. I have a CSV file in the following format. It is sorted by the ID column, so at the very least everything with the same ID is grouped together:

Text                 ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text3, CCCC
this is sample text4, DDDD
this is sample text4, DDDD
this is sample text5, EEEE
this is sample text5, EEEE
this is sample text6, FFFF
this is sample text6, FFFF

What I want to do is quickly split the CSV into X smaller CSV files. So if X == 3, then AAAA would go into "1.csv", BBBB would go into "2.csv", CCCC would go into "3.csv", and the next group would wrap around and go back into "1.csv".
The groups vary in size, so a hard-coded split by row count won't work here.
Is there a faster way to reliably split these than my current method, which just writes them out with a pandas groupby in Python?

file_ = 0
num_files = 3

for name, group in df.groupby(by=['ID'], sort=False):
    file_ += 1
    group['File Num'] = file_
    # file_ is an int, so convert it when building the file name
    group.to_csv(str(file_) + '.csv', index=False, header=False, mode='a')
    if file_ == num_files:
        file_ = 0


This is a Python-based solution, but I'm open to something using awk or bash if it gets the job done.
EDIT:
To clarify, I want the groups split across a fixed number of files that I can set.
In this case, 3 (so X = 3). The first group (AAAA) would go into 1.csv, the second into 2.csv, the third into 3.csv, and then the fourth group would wrap around and be inserted into 1.csv, and so on.
Sample output 1.csv:

Text                 ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD


Sample output 2.csv:

Text                 ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE


Sample output 3.csv:

Text                 ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF


r3i60tvu #1

You can use this awk solution:

awk -v X=3 '
FNR == 1 {   # save 1st record as header 
   hdr = $0
   next
}
p != $NF {   # ID field changes, move to new output csv file 
   close(fn)
   fn = ((n++ % X) + 1)".csv" # construct new file name
}
!seen[fn]++ {                 # do we need to print header
   print hdr > fn 
}
{
   print >> fn                # append each record to output
   p = $NF                    # save last field in variable p
}' file



uqdfh47h #2

Using any awk in any shell on every Unix box:

$ cat tst.awk
NR==1 {
    hdr = $0
    next
}
$NF != prev {
    out = (((blockCnt++) % X) + 1) ".csv"
    if ( blockCnt <= X ) {
        print hdr > out
    }
    prev = $NF
}
{ print > out }


$ awk -v X=3 -f tst.awk input.csv

$ head [0-9]*.csv
==> 1.csv <==
Text                 ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD

==> 2.csv <==
Text                 ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE

==> 3.csv <==
Text                 ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF


If X is a large enough number that you exceed your system's limit on concurrently open files and start getting a "too many open files" error, then you either need to use GNU awk, since it handles that internally, or change the code so that only one output file is open at a time:

NR==1 {
    hdr = $0
    next
}
$NF != prev {
    close(out)
    out = (((blockCnt++) % X) + 1) ".csv"
    if ( blockCnt <= X ) {
        print hdr > out
    }
    prev = $NF
}
{ print >> out }


Or implement your own approach to managing the number of concurrently open files.
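As a rough plain-Python sketch of that one-file-at-a-time bookkeeping (assuming, like the awk versions above, that the input is already grouped by ID and that the ID is the last comma-separated field; the name split_round_robin is just for illustration):

import sys

def split_round_robin(in_path, x=3):
    # Stream the input and round-robin whole ID groups across x output CSVs,
    # keeping at most one output file open at a time.
    started = set()        # output files that already have the header
    out = None
    prev_id = None
    block = 0
    with open(in_path, newline='') as src:
        header = next(src)
        for line in src:
            cur_id = line.rsplit(',', 1)[-1].strip()   # ID = last comma field
            if cur_id != prev_id:
                if out:
                    out.close()
                name = f'{block % x + 1}.csv'
                block += 1
                if name in started:
                    out = open(name, 'a', newline='')  # wrapped around: append
                else:
                    out = open(name, 'w', newline='')
                    out.write(header)
                    started.add(name)
                prev_id = cur_id
            out.write(line)
    if out:
        out.close()

if __name__ == '__main__':
    split_round_robin(sys.argv[1], x=3)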
EDIT: the suggestion from @PaulHodges in the comments would result in a script like this:

NR == 1 {
    for ( i=1; i <= X; i++ ) {
        print > (i ".csv")
    }
    next
}
$NF != prev {
    out = (((blockCnt++) % X) + 1) ".csv"
    prev = $NF
}
{ print > out }


0wi1tuuw #3

With your shown samples, please try the following code. As mentioned above, this assumes the last column is sorted as per the shown samples.

awk -v x="3" '
BEGIN{
  count=1
  outFile=count".csv"
}
FNR==1{                ##Header line: print it and move on.
  print
  next
}
prev!=$NF && prev{     ##ID changed: close the current file and move to the next one.
  close(outFile)
  count++
  outFile=count".csv"
}
{
  print >> (outFile)
  prev=$NF
}
x==count{ count=0 }    ##Reached file x: wrap back around for the next group.
' Input_file



u4dcyp6a #4

Use groupby on the factorize of the ID, modulo the desired number of groups (N): factorize numbers the unique IDs in order of appearance, so taking it modulo N cycles successive groups through the N output files:

N = 3

for i, g in df.groupby(pd.factorize(df['ID'])[0]%N):
    g.to_csv(f'chunk{i+1}.csv', index=False)

Output files:

# chunk1.csv
Text,ID
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text4,DDDD
this is sample text4,DDDD

# chunk2.csv
Text,ID
this is sample text2,BBBB
this is sample text2,BBBB
this is sample text2,BBBB
this is sample text5,EEEE
this is sample text5,EEEE

# chunk3.csv
Text,ID
this is sample text3,CCCC
this is sample text6,FFFF
this is sample text6,FFFF

Timings

Tested on 14 million rows:

15.8 s ± 687 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


of which roughly 14 s is due to I/O.
Comparison with the other answers (using time in the shell):

# @mozway as a python script including imports and reading the file
real    0m20,834s

# @RavinderSingh13
real    1m22,952s

# @anubhava
real    1m23,790s

# @Ed Morton (updated code, original solution was 2m58,171s)
real    0m8,599s


As a function:

import pandas as pd

def split_csv(filename, N=3, id_col='ID', out_basename='chunk'):
    df = pd.read_csv(filename)
    for i, g in df.groupby(pd.factorize(df[id_col])[0]%N):
        g.to_csv(f'{out_basename}{i+1}.csv', index=False)

split_csv('my_file.csv', N=3)


zazmityj #5

Here

group.to_csv(file_+'.csv',index=False, header=False, mode='a')

you pass a string as the first argument, but the to_csv method also accepts a file-like object as its first argument, in which case you can avoid repeating the file-opening work on every call. Consider the following simple comparison:

import os
import time
import pandas as pd

REPEAT = 1000
df = pd.DataFrame({'col1': range(100)})

# variant 1: pass a filename, so the file is re-opened on every call
t1 = time.time()
for _ in range(REPEAT):
    df.to_csv('file.csv', index=False, header=False, mode='a')
t2 = time.time()

os.remove('file.csv')

# variant 2: open the file once and pass the handle
t3 = time.time()
with open('file.csv', 'a') as f:
    for _ in range(REPEAT):
        df.to_csv(f, index=False, header=False)
t4 = time.time()

print('Using filename', t2 - t1)
print('Using filehandle', t4 - t3)


which gives the output:

Using filename 0.35850977897644043
Using filehandle 0.2669696807861328


Observe that the second approach takes roughly 75% of the time of the first, so while it is faster, it is still in the same order of magnitude.
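Applied to the original groupby loop from the question, a rough sketch could open the N output files once up front and reuse the handles for every group (the function name split_with_handles is just for illustration):

import pandas as pd

def split_with_handles(df, num_files=3, id_col='ID'):
    # Round-robin whole ID groups across num_files CSVs, opening each
    # output file only once and reusing its handle for every write.
    handles = [open(f'{i + 1}.csv', 'w', newline='') for i in range(num_files)]
    try:
        for i, (name, group) in enumerate(df.groupby(id_col, sort=False)):
            f = handles[i % num_files]
            # write the header only with the first group sent to each file
            group.to_csv(f, index=False, header=(i < num_files))
    finally:
        for f in handles:
            f.close()

df = pd.read_csv('my_file.csv')
split_with_handles(df, num_files=3)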
