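I am reading a CSV file that has a comment field, and the comment text contains newline characters. Even with the CSV multiLine option enabled, a new row is still created when the comment spans more than one line.
Below are the code and the data (a sample, but similar to the real thing).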
package sample.spark.com;

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class IndividualApp {
    static Logger logger = LoggerFactory.getLogger(IndividualApp.class.getName());

    public static void main(String[] args) {
        SparkSession session = SparkSession.builder().appName("IndividualApp").master("local[*]").getOrCreate();
        session.sparkContext().setLogLevel("ERROR");

        // CSV reader options: infer column types, treat the first line as a header,
        // and allow quoted fields to span several lines.
        Map<String, String> options = new HashMap<>();
        options.put("inferSchema", "true");
        options.put("header", "true");
        options.put("multiLine", "true");

        Dataset<Row> df = session.read().options(options).csv("C:\\DataSet\\sample.csv");
        df.show(false);

        // Count rows per id; a correctly parsed file should yield one row per real id.
        df = df.groupBy("id").count().orderBy(functions.col("id").desc());
        df.show(false);
        logger.info("THE VALUE IS " + df.count());
    }
}
Output
+---------------------------------------------+------------------------------------------------------------------------+
|ID |comment |
+---------------------------------------------+------------------------------------------------------------------------+
|1 |"Added business Added 80/60/1300;
200/100/1800-Name change from ""Added|
|Added Added Added - 311/271/1911 Added/Added"|null |
+---------------------------------------------+------------------------------------------------------------------------+
+---------------------------------------------+-----+
|id |count|
+---------------------------------------------+-----+
|Added Added Added - 311/271/1911 Added/Added"|1 |
|1 |1 |
+---------------------------------------------+-----+
Data
ID , comment
1 , Added business Added 80/60/1300;
200/100/1800-Name change from "Added, Added Added for Added Added Added."-Added-Added;
Added Added Added - 311/271/1911 Added/Added
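Is there a way to fix this? The read produces more records than the file actually contains. My workaround is to ignore any row whose id column contains a string, but that does not work if a comment consists only of digits. Thank you.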
1 Answer
pxyaymoc #1
I was able to solve this by adding the code below. I don't know how it works, but it works for me.
("escape", "");
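As for why this might work: Spark's CSV reader uses a backslash as its default escape character, while the output in the question shows doubled quotes (""Added), which is the RFC 4180 way of writing a literal quote inside a quoted field. The fragment above looks incomplete, so what follows rests on an assumption: that the intended call sets the escape character to the double quote itself, for example options.put("escape", "\""). The sketch below is a simplified plain-Java model of that idea, not Spark's actual parser. The names EscapeDemo, parse and SAMPLE are hypothetical, and SAMPLE is an assumed raw form of the file with the comment wrapped in quotes, as the output suggests.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeDemo {
    // Assumed raw file content: the comment is quoted and inner quotes are doubled.
    static final String SAMPLE =
        "ID,comment\n"
        + "1,\"Added business Added 80/60/1300;\n"
        + "200/100/1800-Name change from \"\"Added, Added Added for Added Added Added.\"\"-Added-Added;\n"
        + "Added Added Added - 311/271/1911 Added/Added\"\n";

    // Simplified CSV model: a quote opens a quoted field only at the start of a field.
    // Inside a quoted field, `escape` followed by `quote` is a literal quote, and any
    // other quote closes the field. Newlines inside a quoted field are kept.
    static List<List<String>> parse(String text, char quote, char escape) {
        List<List<String>> records = new ArrayList<>();
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean fieldStart = true;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                boolean nextIsQuote = i + 1 < text.length() && text.charAt(i + 1) == quote;
                if (c == escape && nextIsQuote) {
                    field.append(quote);   // escaped quote: keep one literal quote
                    i++;
                } else if (c == quote) {
                    inQuotes = false;      // unescaped quote closes the quoted field
                } else {
                    field.append(c);
                }
            } else if (c == quote && fieldStart) {
                inQuotes = true;
                fieldStart = false;
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
                fieldStart = true;
            } else if (c == '\n') {
                fields.add(field.toString());
                field.setLength(0);
                records.add(fields);
                fields = new ArrayList<>();
                fieldStart = true;
            } else {
                field.append(c);
                fieldStart = false;
            }
        }
        if (field.length() > 0 || !fields.isEmpty()) { // last record without a trailing newline
            fields.add(field.toString());
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        // Subtract 1 for the header record.
        System.out.println("escape=\\ : " + (parse(SAMPLE, '"', '\\').size() - 1) + " data rows");
        System.out.println("escape=\" : " + (parse(SAMPLE, '"', '"').size() - 1) + " data rows");
    }
}
```

In this model, a backslash escape makes the first doubled quote close the field early, so the comment is cut at the next comma and its last line becomes a second row, which resembles the output in the question. With the double quote as the escape character the whole comment stays in one row.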