How do I read a CSV file with opencsv when fields contain '\' and embedded commas?

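Problem description

I use RFC4180Parser to read a file that contains '\' immediately before a '"', and it works well.

Here is my code. I am using RFC4180Parser together with CsvToBeanBuilder to read the CSV file: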

final RFC4180Parser rfc4180Parser = new RFC4180ParserBuilder().build();
final CSVReaderBuilder csvReaderBuilder = new CSVReaderBuilder(new FileReader(inputDpaCsvFilePath))
    .withCSVParser(rfc4180Parser);
final List<MyClass> infos = new CsvToBeanBuilder<MyClass>(csvReaderBuilder.build())
    .withType(MyClass.class)
    .withSeparator(',')
    .build().parse();

Original CSV file

"A","B","C","D"
"value 1","value 2","value 3","value 4"
"value\\" 11","value 22\\"","value 33","value 44"

But now the file format has changed. A "Header E" column was added, and its values contain commas.

New CSV file

"Header A","Header B","Header C","Header D","Header E"
"value1","value2","value3","value4","spA,spB,spC"
"value\\"5","value6\\"","value 7","value8",spC"
"value\\" 9","value 10","value 11","value 12","spC"

Reading it throws the following exception:

Exception in thread "pool-1-thread-1" java.lang.RuntimeException: com.opencsv.exceptions.CsvRequiredFieldEmptyException: Number of data fields does not match number of headers.
    at com.opencsv.bean.concurrent.ProcessCsvLine.run(ProcessCsvLine.java:101)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.opencsv.exceptions.CsvRequiredFieldEmptyException: Number of data fields does not match number of headers.
    at com.opencsv.bean.HeaderColumnNameMappingStrategy.verifyLineLength(HeaderColumnNameMappingStrategy.java:110)
    at com.opencsv.bean.AbstractMappingStrategy.populateNewBean(AbstractMappingStrategy.java:313)
    at com.opencsv.bean.concurrent.ProcessCsvLine.processLine(ProcessCsvLine.java:132)
    at com.opencsv.bean.concurrent.ProcessCsvLine.run(ProcessCsvLine.java:85)
    ... 3 more

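How can I update the code to read this CSV file?

Solution

See the RFC 4180 specification, section 2, item 4:

  4. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma. For example: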

    aaa,bbb,ccc

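This is why the error "Number of data fields does not match number of headers." occurs. When each comma is followed by a space, a line such as

"value1", "value2", "value3", "value4", "spA, spB, spC"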

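is parsed as 7 fields (note the leading spaces and the double quotes that remain part of the values):

value1
 "value2"
 "value3"
 "value4"
 "spA
 spB
 spC"

but the header contains only 5 fields.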
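To make the field count concrete, here is a minimal sketch of a strict splitter in the spirit of that rule. It treats a field as quoted only when the quote is the field's very first character, so a space after a comma turns the following quote into ordinary data. The class `StrictSplitSketch` and its `split` method are hypothetical names written for this illustration; this is not opencsv's implementation, and it only handles a single line.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a simplified strict splitter, not opencsv's RFC4180Parser. */
final class StrictSplitSketch {

    /** Splits one line on commas; spaces are kept as part of the field. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (pos <= line.length()) {
            if (pos < line.length() && line.charAt(pos) == '"') {
                // Quoted field: read up to the closing quote ("" is an escaped quote).
                StringBuilder sb = new StringBuilder();
                int i = pos + 1;
                while (i < line.length()) {
                    char c = line.charAt(i);
                    if (c == '"') {
                        if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                            sb.append('"');
                            i += 2;
                            continue;
                        }
                        i++; // step over the closing quote
                        break;
                    }
                    sb.append(c);
                    i++;
                }
                fields.add(sb.toString());
                // Continue just past the next comma, or stop at end of line.
                int comma = line.indexOf(',', i);
                pos = (comma == -1) ? line.length() + 1 : comma + 1;
            } else {
                // Unquoted field: everything up to the next comma, spaces and quotes included.
                int comma = line.indexOf(',', pos);
                int end = (comma == -1) ? line.length() : comma;
                fields.add(line.substring(pos, end));
                pos = end + 1;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String header = "\"Header A\", \"Header B\", \"Header C\", \"Header D\", \"Header E\"";
        String data = "\"value1\", \"value2\", \"value3\", \"value4\", \"spA, spB, spC\"";
        System.out.println("header fields: " + split(header).size()); // header fields: 5
        System.out.println("data fields: " + split(data).size());     // data fields: 7
    }
}
```

The header still yields 5 fields because none of its values contain a comma, while the last data value is cut at each of its inner commas.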


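Without modifying the CSV, we can use CSVParser instead of RFC4180Parser and have it ignore leading white space. The following program demonstrates how CSVParser parses the provided CSV, and how RFC4180Parser parses the fields together with their leading spaces: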

import java.io.IOException;
import java.io.StringReader;
import java.util.List;

import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReaderBuilder;
import com.opencsv.RFC4180Parser;
import com.opencsv.RFC4180ParserBuilder;
import com.opencsv.exceptions.CsvException;

public class ParseCsvFieldContainsCommaAndLeadingSpaceTest {
    public static void main(String[] args) throws IOException,CsvException {
        parseWithCSVParser();
        parseWithRFC4180Parser();
    }

    private static void parseWithCSVParser() throws IOException,CsvException {
        final CSVParser parser = new CSVParserBuilder().withIgnoreLeadingWhiteSpace(true).build();
        final CSVReaderBuilder csvReaderBuilder = new CSVReaderBuilder(
                new StringReader("\"Header A\", \"Header B\", \"Header C\", \"Header D\", \"Header E\"\r\n" +
                        "\"value1\", \"value2\", \"value3\", \"value4\", \"spA, spB, spC\""))
                                .withCSVParser(parser);
        System.out.println("Result from CSVParser");
        List<String[]> lines = csvReaderBuilder.build().readAll();
        for (String[] line : lines) {
            System.out.println(String.join(" | ",line));
        }
    }

    private static void parseWithRFC4180Parser() throws IOException,CsvException {
        final RFC4180Parser rfc4180Parser = new RFC4180ParserBuilder().build();
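        final CSVReaderBuilder csvReaderBuilder = new CSVReaderBuilder(
                // Same input as above: every comma is followed by a space.
                new StringReader("\"Header A\", \"Header B\", \"Header C\", \"Header D\", \"Header E\"\r\n" +
                        "\"value1\", \"value2\", \"value3\", \"value4\", \"spA, spB, spC\""))
                                .withCSVParser(rfc4180Parser);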
        System.out.println("Result from RFC4180Parser");
        List<String[]> lines = csvReaderBuilder.build().readAll();
        for (String[] line : lines) {
            System.out.println(String.join(" | ",line));
        }
    }
}
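The idea behind ignoring leading white space can also be sketched without opencsv. The snippet below is a rough illustration under simplifying assumptions (a single line, no escaped quotes inside quoted fields); `LenientSplitSketch` and `split` are hypothetical names, and this is not how CSVParser is actually implemented. It skips any white space before an opening quote, so the quoted value is read as one field.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: skips white space before an opening quote; not opencsv's CSVParser. */
final class LenientSplitSketch {

    /** Splits one line on commas, ignoring white space that precedes a quoted field. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (pos <= line.length()) {
            // Look past leading white space to see whether a quoted field follows.
            int start = pos;
            while (start < line.length() && Character.isWhitespace(line.charAt(start))) {
                start++;
            }
            if (start < line.length() && line.charAt(start) == '"') {
                // Quoted field: take everything up to the next quote, commas included.
                int close = line.indexOf('"', start + 1);
                int end = (close == -1) ? line.length() : close;
                fields.add(line.substring(start + 1, end));
                int comma = line.indexOf(',', end);
                pos = (comma == -1) ? line.length() + 1 : comma + 1;
            } else {
                // Unquoted field: keep it unchanged up to the next comma.
                int comma = line.indexOf(',', pos);
                int end = (comma == -1) ? line.length() : comma;
                fields.add(line.substring(pos, end));
                pos = end + 1;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String data = "\"value1\", \"value2\", \"value3\", \"value4\", \"spA, spB, spC\"";
        // Prints: value1 | value2 | value3 | value4 | spA, spB, spC
        System.out.println(String.join(" | ", split(data)));
    }
}
```

With the leading space skipped, the data line yields 5 fields, which matches the 5 header columns.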