Dropping a column from JSON when loading via spark.read.json()

fquxozlt asked on 2021-05-29 in Hadoop

I am stuck in a very strange situation. For example, I have a file containing these three JSON records:

{"uploadTimeStamp":"1500618037189","ID":"123ID","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"E1","Timestamp":1500618037189,"pii":{}},{"Data":{"heading":"N","loc1":"false","loc2":"13.022425","loc3":"77.760587","loc4":"false","speed":"10"},"EventID":"E2","Timestamp":1500618037189,"pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"EventID":"E3","Timestamp":1500618037189,"pii":{}},{"EventID":"E4","Data":{"value":"50","unit":"percentage"},"Timestamp":1500618037189},{"Data":{"unit":"kmph","value":"60"},"EventID":"E5","Timestamp":1500618037189,"pii":{}}]}
{"uploadTimeStamp":"1500618045735","ID":"123ID","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"E1","Timestamp":1500618045735,"pii":{}},{"Data":{"heading":"N","loc1":"false","loc2":"13.022425","loc3":"77.760587","loc4":"false","speed":"10"},"EventID":"E2","Timestamp":1500618045735,"pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"EventID":"E3","Timestamp":1500618045735,"pii":{}},{"EventID":"E4","Data":{"value":"50","unit":"percentage"},"Timestamp":1500618045735},{"Data":{"unit":"kmph","value":"60"},"EventID":"E5","Timestamp":1500618045735,"pii":{}}]}
{"REGULAR_DUMMY":"REGULAR_DUMMY", "ID":"123ID", "uploadTimeStamp":1500546893837}

I am loading this JSON with Spark SQL (spark.read.json), then registering it with df.createOrReplaceTempView("TEST") and running spark.sql("select count(*) from TEST"). I want the count of all records with ID "123ID", but I want to ignore the "REGULAR_DUMMY" row. That is, count(*) should be 2, not 3.
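Putting the pieces together, the full sequence I run is roughly this (in spark-shell, so spark is the usual session):

// load the file containing the three JSON records shown above
val df = spark.read.json("hdfs://10.2.3.4/test/path/*")

df.createOrReplaceTempView("TEST")
spark.sql("select count(*) from TEST").show()   // prints 3, but I want 2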
I tried the following to drop the column "REGULAR_DUMMY" and then get the count:
1 - val df = spark.read.json("hdfs://10.2.3.4/test/path/*").drop("REGULAR_DUMMY") - this shows the schema as uploadTimeStamp, ID and data, which is exactly right, but the count it produces is still 3.
2 - df.drop("REGULAR_DUMMY").createOrReplaceTempView("TEST") - this also shows the schema as uploadTimeStamp, ID and data, but the count is again 3.
3 - spark.sql("select count(*) from TEST").drop("REGULAR_DUMMY") - this again counts 3.
If I run hadoop fs -cat /test/path/* | grep -i "123ID" | grep -v "REGULAR_DUMMY" | wc -l, the count is 2,
and with hadoop fs -cat /test/path/* | grep -i "123ID" | wc -l the count is 3.
So, what am I missing?

anhgbhbe

.drop removes the entire column, while count counts rows. Since you never removed or filtered out the row that has the value "REGULAR_DUMMY" in the REGULAR_DUMMY column, the count of 3 is correct.
What you need to do is filter out the rows whose REGULAR_DUMMY column has the value "REGULAR_DUMMY" and then take the count, as in:

import org.apache.spark.sql.functions._
df.filter(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").select(count("*"))

However, this returns 0, because for the other records the REGULAR_DUMMY column is null, and null =!= "REGULAR_DUMMY" does not evaluate to true; so no row has a REGULAR_DUMMY value that passes the predicate and every row gets filtered out.
The way around this is to fill the null values with a temporary value, then do the filter, and finally take the count, as in:
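A quick sanity check (not needed for the fix, just to see what is going on) is to look at the column and the filter result directly:

// only the third record carries REGULAR_DUMMY, so the other rows come back as null
df.select(col("REGULAR_DUMMY")).show(false)

// the null rows never satisfy the =!= predicate, hence the empty result
df.filter(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").count()   // 0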

df.na.fill("temp").filter(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").select(count("*"))

This should show you the correct result:

+--------+
|count(1)|
+--------+
|2       |
+--------+

You can also use where to do the filtering:

df.na.fill("temp").where(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").select(count("*"))
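
Since the question goes through a temp view, the same idea also works in plain Spark SQL; a rough sketch (the fill value "temp" is arbitrary, any non-null placeholder will do):

// register the null-filled DataFrame and filter in SQL instead
df.na.fill("temp").createOrReplaceTempView("TEST")
spark.sql("select count(*) from TEST where REGULAR_DUMMY != 'REGULAR_DUMMY'").show()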
