Parsing CSV with OpenCSV when quoted fields contain double quotes

Posted by 4nkexdtk on 2022-12-06 in Other

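I am trying to parse a CSV file with OpenCSV. One of the columns stores data in YAML-serialized form and is quoted because it can contain commas. It also contains double quotes, which are escaped by doubling them. I can parse this file easily in Ruby, but with OpenCSV I cannot get it to parse completely. The file is UTF-8 encoded.
Here is the Java snippet that tries to read the file: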

CSVReader reader = new CSVReader(new InputStreamReader(new FileInputStream(csvFilePath), "UTF-8"), ',', '\"', '\\');

Here are two rows from the file. The first row is not parsed correctly: it gets split at ""[Fair Trade Certified]"", I guess because of the escaped double quotes.

1061658767,update,1196916,Product,28613099,Product::Source,"---
product_attributes:
-
- :name: Ornaments
  :brand_id: 49120
  :size: each
  :alcoholic: false
  :details: ""[Fair Trade Certified]""
  :gluten_free: false
  :kosher: false
  :low_fat: false
  :organic: false
  :sugar_free: false
  :fat_free: false
  :vegan: false
  :vegetarian: false
",,2015-11-01 00:06:19.796944,,,,,,
1061658768,create,,,28613100,Product::Source,"---
product_id:
retailer_id:
store_id:
source_id: 333790
locale: en_us
source_type: Product::PrehistoricProductDatum
priority: 1
is_definition:
product_attributes:
",,2015-11-01 00:06:19.927948,,,,,,

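For reference, the doubled-quote convention used in the rows above (the RFC 4180 style) can be sketched with a small stdlib-only parser. This is a minimal illustration and not how OpenCSV or any other library implements it; the class name Rfc4180Sketch and the method parseRecord are hypothetical, and the sample record in main is a shortened version of the first row.

```java
import java.util.ArrayList;
import java.util.List;

class Rfc4180Sketch {
    /** Splits one CSV record into fields; "" inside a quoted field becomes one literal quote. */
    static List<String> parseRecord(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: keep a single literal quote
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // lone quote: end of the quoted section
                    }
                } else {
                    current.append(c);       // commas and newlines are plain data here
                }
            } else if (c == '"') {
                inQuotes = true;             // start of a quoted section
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the first row from the question
        String record = "28613099,Product::Source,\"---\n"
                + "  :details: \"\"[Fair Trade Certified]\"\"\n"
                + "\",,2015-11-01 00:06:19.796944";
        List<String> fields = parseRecord(record);
        System.out.println(fields.size());
        System.out.println(fields.get(2));
    }
}
```

The sketch takes a complete record as one string, so a caller reading from a file would first have to gather the physical lines that belong to one quoted, multi-line record.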
Answer 1 (wnavrhmk)

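First, I am glad FastCSV works for you, but I took the suspect substring and ran it through openCSV 3.9, and it works with both the CSVParser and the RFC4180Parser. Could you give a bit more detail about how it fails to parse, and/or try openCSV 3.9 to see whether you hit the same problem, and then try the configurations below?
Here are the tests I used.
CSVParser: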

@Test
public void parseBigStringFromStackOverflowWithMultipleQuotesInLine() throws IOException {

    String bigline = "28613099,Product::Source,\"---\n" +
            "product_attributes:\n" +
            "-\n" +
            "- :name: Ornaments\n" +
            "  :brand_id: 49120\n" +
            "  :size: each\n" +
            "  :alcoholic: false\n" +
            "  :details: \"\"[Fair Trade Certified]\"\"\n" +
            "  :gluten_free: false\n" +
            "  :kosher: false\n" +
            "  :low_fat: false\n" +
            "  :organic: false\n" +
            "  :sugar_free: false\n" +
            "  :fat_free: false\n" +
            "  :vegan: false\n" +
            "  :vegetarian: false\n" +
            "\",,2015-11-01 00:06:19.796944";

    String suspectString = "---\n" +
            "product_attributes:\n" +
            "-\n" +
            "- :name: Ornaments\n" +
            "  :brand_id: 49120\n" +
            "  :size: each\n" +
            "  :alcoholic: false\n" +
            "  :details: \"[Fair Trade Certified]\"\n" +
            "  :gluten_free: false\n" +
            "  :kosher: false\n" +
            "  :low_fat: false\n" +
            "  :organic: false\n" +
            "  :sugar_free: false\n" +
            "  :fat_free: false\n" +
            "  :vegan: false\n" +
            "  :vegetarian: false\n" ;

    StringReader stringReader = new StringReader(bigline);

    CSVReaderBuilder builder = new CSVReaderBuilder(stringReader);
    CSVReader csvReader = builder.withFieldAsNull(CSVReaderNullFieldIndicator.BOTH).build();

    String item[] = csvReader.readNext();

    assertEquals(5, item.length);
    assertEquals("28613099", item[0]);
    assertEquals("Product::Source", item[1]);
    assertEquals(suspectString, item[2]);
}

RFC4180Parser:

def 'parse big line from stackoverflow with complex string'() {
    given:
    RFC4180ParserBuilder builder = new RFC4180ParserBuilder()
    RFC4180Parser parser = builder.build()
    String bigline = "28613099,Product::Source,\"---\n" +
            "product_attributes:\n" +
            "-\n" +
            "- :name: Ornaments\n" +
            "  :brand_id: 49120\n" +
            "  :size: each\n" +
            "  :alcoholic: false\n" +
            "  :details: \"\"[Fair Trade Certified]\"\"\n" +
            "  :gluten_free: false\n" +
            "  :kosher: false\n" +
            "  :low_fat: false\n" +
            "  :organic: false\n" +
            "  :sugar_free: false\n" +
            "  :fat_free: false\n" +
            "  :vegan: false\n" +
            "  :vegetarian: false\n" +
            "\",,2015-11-01 00:06:19.796944"

    String suspectString = "---\n" +
            "product_attributes:\n" +
            "-\n" +
            "- :name: Ornaments\n" +
            "  :brand_id: 49120\n" +
            "  :size: each\n" +
            "  :alcoholic: false\n" +
            "  :details: \"[Fair Trade Certified]\"\n" +
            "  :gluten_free: false\n" +
            "  :kosher: false\n" +
            "  :low_fat: false\n" +
            "  :organic: false\n" +
            "  :sugar_free: false\n" +
            "  :fat_free: false\n" +
            "  :vegan: false\n" +
            "  :vegetarian: false\n"

    when:
    String[] values = parser.parseLine(bigline)

    then:
    values.length == 5
    values[0] == "28613099"
    values[1] == "Product::Source"
    values[2] == suspectString
}

Answer 2 (chhqkbe1)

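The solution was to use an RFC 4180-compliant CSV parser, as Paul suggested. I had been using OpenCSV's CSVReader, and either it does not work or I could not get it to work properly.
I used FastCSV, an RFC 4180 CSV parser, and it worked seamlessly.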

File file = new File(csvFilePath);
CsvReader csvReader = new CsvReader();
CsvContainer csv = csvReader.read(file, StandardCharsets.UTF_8);
for (CsvRow row : csv.getRows()) {
    System.out.println(row.getFieldCount());  
}

Answer 3 (5anewei6)

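I know this is an old question, but I stumbled on this problem while using OpenCSV, and here is a workaround I found.
Basically, as you loop over your values, for any column you expect to contain a comma ',', do a basic string operation and add a double quote '"' at the start and end of the string.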

// Temporary storage, passed to CSVWriter later to write the actual CSV file
List<String[]> lines = new ArrayList<>();

// Loop through the CSV lines (readAllLines is the answerer's own helper; its definition is not shown)
for (String[] rate : readAllLines("rating_context.csv")) {
    // Check whether the 5th column (or whichever column you expect) contains a comma
    if (rate[4].contains(",")) {
        // Replace the value with the same value enclosed in double quotes
        // (a StringBuilder could be used here to optimize)
        rate[4] = "\"" + rate[4] + "\"";
    }
    lines.add(rate);
}

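Then construct the CSVWriter with CSVWriter.NO_QUOTE_CHARACTER and CSVWriter.NO_ESCAPE_CHARACTER, so that it does not add quoting or escaping of its own.
Sample CSV line value as input: The American President (1995),Comedy|Drama|Romance,0,0
Sample CSV line value as output: The American President (1995),Comedy|Drama|Romance,0,1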
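One caveat with wrapping values by hand: if a value itself contains a double quote, an RFC 4180-style reader expects that quote to be doubled, otherwise the field ends early. A minimal stdlib-only sketch of that quoting rule follows. The class name CsvQuoting, its two methods, and the sample values are hypothetical and are not part of OpenCSV or of the answer above.

```java
class CsvQuoting {
    /** Quotes a field only when needed, doubling any embedded double quotes. */
    static String quoteIfNeeded(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field; // plain values are written unchanged
        }
        // Double each embedded quote, then wrap the whole value in quotes
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins field values into one CSV line, quoting each value as required. */
    static String toCsvLine(String... fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(quoteIfNeeded(fields[i]));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample values: one with a comma, one with embedded quotes
        System.out.println(toCsvLine("Title, The (1995)", "Comedy|Drama", "0", "1"));
        System.out.println(toCsvLine("say \"hi\"", "plain"));
    }
}
```

Lines built this way already carry their own quoting, which is why the writer would be configured not to add any more.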
