How do I load very large CSV files in Node.js?

ie3xauqp, posted on 2023-03-27 in Other

I'm trying to load two large CSV files into Node.js. The first is 257 597 KB and the second is 104 330 KB. I'm using the file system (fs) and csv modules. Here is my code:

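const fs = require('fs')
const csv = require('csv')

// Reads the entire file into one Buffer before parsing starts
fs.readFile('path/to/my/file.csv', (err, data) => {
  if (err) console.error(err)
  else {
    csv.parse(data, (err, dataParsed) => {
      if (err) console.error(err)
      else {
        myData = dataParsed
        console.log('csv loaded')
      }
    })
  }
})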

After ages (1-2 hours), it just crashes with this error message:

<--- Last few GCs --->

[1472:0000000000466170]  4366473 ms: Mark-sweep 3935.2 (4007.3) -> 3935.2 (4007.3) MB, 5584.4 / 0.0 ms  last resort GC in old space requested
[1472:0000000000466170]  4371668 ms: Mark-sweep 3935.2 (4007.3) -> 3935.2 (4007.3) MB, 5194.3 / 0.0 ms  last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 000002BDF12254D9 <JSObject>
    1: stringSlice(aka stringSlice) [buffer.js:590] [bytecode=000000810336DC91 offset=94](this=000003512FC822D1 <undefined>,buf=0000007C81D768B9 <Uint8Array map = 00000352A16C4D01>,encoding=000002BDF1235F21 <String[4]: utf8>,start=0,end=263778854)
    2: toString [buffer.js:664] [bytecode=000000810336D8D9 offset=148](this=0000007C81D768B9 <Uint8Array map = 00000352A16C4D01>,encoding=000002BDF1...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::DecodeWrite
 2: node_module_register
 3: v8::internal::FatalProcessOutOfMemory
 4: v8::internal::FatalProcessOutOfMemory
 5: v8::internal::Factory::NewRawTwoByteString
 6: v8::internal::Factory::NewStringFromUtf8
 7: v8::String::NewFromUtf8
 8: std::vector<v8::CpuProfileDeoptFrame,std::allocator<v8::CpuProfileDeoptFrame> >::vector<v8::CpuProfileDeoptFrame,std::allocator<v8::CpuProfileDeoptFrame> >
 9: v8::internal::wasm::SignatureMap::Find
10: v8::internal::Builtins::CallableFor
11: v8::internal::Builtins::CallableFor
12: v8::internal::Builtins::CallableFor
13: 00000081634043C1

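The largest file gets loaded, but Node then runs out of memory. Allocating more memory would probably be easy, but the main problem here is the load time, which seems very long even for files of this size. So what is the right way to do this? By the way, Python with pandas loads these CSVs very quickly (3-5 seconds).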

xtupzzrd1#

Streaming works perfectly; it took only 3-5 seconds:

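var fs = require('fs')
var csv = require('csv-parser')
var data = []

// Stream the file so rows are parsed as they arrive,
// instead of buffering the whole file first
fs.createReadStream('path/to/my/data.csv')
  .pipe(csv())
  .on('data', function (row) {
    data.push(row) // each row is an object keyed by the header names
  })
  .on('end', function () {
    console.log('Data loaded')
  })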

taor4pac2#

readFile loads the entire file into memory, whereas fs.createReadStream reads the file in chunks of the size you specify.
This keeps it from running out of memory.
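
To make the chunking concrete, here is a minimal sketch using only Node's built-in fs module. The helper name countChunks and its chunkBytes parameter are made up for illustration; the chunk size is passed through the highWaterMark option, which defaults to 64 KiB for file read streams.

```javascript
const fs = require('fs')

// Hypothetical helper: reads a file chunk by chunk and reports how many
// chunks and bytes were seen. Only one chunk is held in memory at a time.
function countChunks(filePath, chunkBytes = 64 * 1024) {
  return new Promise((resolve, reject) => {
    let chunks = 0
    let bytes = 0
    fs.createReadStream(filePath, { highWaterMark: chunkBytes })
      .on('data', (chunk) => {
        // chunk is a Buffer holding at most chunkBytes bytes
        chunks += 1
        bytes += chunk.length
      })
      .on('end', () => resolve({ chunks, bytes }))
      .on('error', reject)
  })
}

// Example call (the path is a placeholder):
// countChunks('path/to/my/file.csv', 1024 * 1024).then(console.log)
```

A larger chunkBytes means fewer, bigger reads; memory use stays bounded by the chunk size rather than by the file size.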

wn9m85ua3#

You may want to stream the CSV rather than reading it all at once.
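
As a rough illustration of what streaming could look like without any third-party package, the sketch below reads the file line by line with Node's built-in readline module. The function name streamCsv and the onRow callback are hypothetical, and the parsing is deliberately naive: it splits on commas and takes the first non-empty line as the header, so quoted fields containing commas or line breaks are not handled. A real CSV parser is the better choice for such data.

```javascript
const fs = require('fs')
const readline = require('readline')

// Hypothetical helper: streams a CSV line by line and calls onRow for each
// record. Resolves with the number of records processed.
async function streamCsv(filePath, onRow) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity // treat \r\n as a single line break
  })
  let header = null
  let count = 0
  for await (const line of rl) {
    if (line === '') continue // skip empty lines
    const fields = line.split(',') // naive split, no quote handling
    if (header === null) {
      header = fields // first non-empty line is taken as the header
      continue
    }
    const row = {}
    header.forEach((name, i) => { row[name] = fields[i] })
    onRow(row)
    count += 1
  }
  return count
}

// Example call (the path is a placeholder):
// streamCsv('path/to/my/data.csv', (row) => console.log(row))
```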

bwitn5fc4#

const fs = require('fs')
const { parse } = require('csv-parse') // assumes csv-parse v5 or later

// Build one csv-parse options object per chunk of `chunkSize` lines
const parseOptions = (chunkSize, count) => {
    const parseObjList = []
    for (let i = 0; i < (count / chunkSize); i++) {
        parseObjList.push({
            delimiter: ',',
            from_line: (i * chunkSize) + 1,
            to_line: (i + 1) * chunkSize,
            skip_empty_lines: true
        })
    }
    return parseObjList
}

// journeyValidation, journeyHeaders, logger and Journey are assumed to be
// defined elsewhere in the application
function parseJourney(filepath) {
    const chunkSize = 10
    // Note: this reads the whole file into memory just to count its lines
    const count = fs.readFileSync(filepath, 'utf8').split('\n').length - 1
    const parseObjList = parseOptions(chunkSize, count)
    // One read stream per chunk; each parser only emits its own range of lines
    for (let i = 0; i < parseObjList.length; i++) {
        fs.createReadStream(filepath)
            .pipe(parse(parseObjList[i]))
            .on('data', function (row) {
                const journey_object = {}
                if (journeyValidation(row)) {
                    // Map each column value to its header name
                    journeyHeaders.forEach((columnName, idx) => {
                        journey_object[columnName] = row[idx]
                    })
                    logger.info(journey_object)
                    Journey.create(journey_object).catch(error => {
                        logger.error(error)
                    })
                } else {
                    logger.error('Incorrect data type in this row: ' + row)
                }
            })
            .on('end', function () {
                logger.info('finished')
            })
            .on('error', function (error) {
                logger.error(error.message)
            })
    }
}

Call the function by passing the file path:

parseJourney('./filePath.csv')
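
For very large files, opening one read stream per chunk and counting lines with readFileSync can itself become expensive. One possible alternative, sketched here under assumptions and not taken from the answer above, is to read the file once and group the lines into batches. The names inBatches, importInBatches and saveBatch are hypothetical; saveBatch stands in for whatever persists a batch (for example a bulk insert), and the lines are passed on unparsed.

```javascript
const fs = require('fs')
const readline = require('readline')

// Groups items from any iterable or async iterable into arrays of at most
// `size` items.
async function* inBatches(source, size) {
  let batch = []
  for await (const item of source) {
    batch.push(item)
    if (batch.length === size) {
      yield batch
      batch = []
    }
  }
  if (batch.length > 0) yield batch // final, possibly partial, batch
}

// Reads the file in a single pass and hands the lines over `size` at a time.
// Resolves with the total number of lines handed to saveBatch.
async function importInBatches(filePath, saveBatch, size = 10) {
  const lines = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity
  })
  let total = 0
  for await (const batch of inBatches(lines, size)) {
    await saveBatch(batch) // the next batch is not requested until this resolves
    total += batch.length
  }
  return total
}

// Example call (path and callback are placeholders):
// importInBatches('./filePath.csv', async (batch) => console.log(batch.length))
```

Awaiting each batch keeps the number of pending writes bounded, whereas firing one create call per row lets them pile up while the file is still being read.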
