I am stuck in a very strange situation. For example, I have a file containing these three JSON records:
{"uploadTimeStamp":"1500618037189","ID":"123ID","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"E1","Timestamp":1500618037189,"pii":{}},{"Data":{"heading":"N","loc1":"false","loc2":"13.022425","loc3":"77.760587","loc4":"false","speed":"10"},"EventID":"E2","Timestamp":1500618037189,"pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"EventID":"E3","Timestamp":1500618037189,"pii":{}},{"EventID":"E4","Data":{"value":"50","unit":"percentage"},"Timestamp":1500618037189},{"Data":{"unit":"kmph","value":"60"},"EventID":"E5","Timestamp":1500618037189,"pii":{}}]}
{"uploadTimeStamp":"1500618045735","ID":"123ID","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"E1","Timestamp":1500618045735,"pii":{}},{"Data":{"heading":"N","loc1":"false","loc2":"13.022425","loc3":"77.760587","loc4":"false","speed":"10"},"EventID":"E2","Timestamp":1500618045735,"pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"EventID":"E3","Timestamp":1500618045735,"pii":{}},{"EventID":"E4","Data":{"value":"50","unit":"percentage"},"Timestamp":1500618045735},{"Data":{"unit":"kmph","value":"60"},"EventID":"E5","Timestamp":1500618045735,"pii":{}}]}
{"REGULAR_DUMMY":"REGULAR_DUMMY", "ID":"123ID", "uploadTimeStamp":1500546893837}
I am loading this JSON with Spark SQL (spark.read.json), then registering a temp view with df.createOrReplaceTempView("TEST") and running spark.sql("select count(*) from TEST").
I want to get the count of all records whose ID is "123ID", but I want to ignore the "REGULAR_DUMMY" row. That is, count(*) should be 2, not 3.
I tried the following approaches to drop the column REGULAR_DUMMY before getting the count:
1 - val df = spark.read.json("hdfs://10.2.3.4/test/path/*").drop("REGULAR_DUMMY")
- it shows the schema as uploadTimeStamp, ID and data, which is perfectly correct. But the count still comes out as 3.
2 - df.drop("REGULAR_DUMMY").createOrReplaceTempView("TEST")
- this also shows the schema as uploadTimeStamp, ID and data. But the count is again 3.
3 - spark.sql("select count(*) from TEST").drop("REGULAR_DUMMY")
- this again gives a count of 3.
If I run: hadoop fs -cat /test/path/* | grep -i "123ID" | grep -v "REGULAR_DUMMY" | wc -l
then the count is 2.
And with: hadoop fs -cat /test/path/* | grep -i "123ID" | wc -l
the count is 3.
So, what am I missing?
1 Answer
.drop drops the whole column, while count counts the rows. Since you are not dropping or filtering out the rows that carry the value "REGULAR_DUMMY" in the REGULAR_DUMMY column, the count of 3 is correct. What you need to do is filter the rows that have the value "REGULAR_DUMMY" in the REGULAR_DUMMY column and then perform the count, as sketched below.
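A minimal sketch of that filter, assuming df is the DataFrame loaded with spark.read.json as in the question (the name naiveCount is mine, for illustration):

import org.apache.spark.sql.functions.col

// Naive attempt: keep only the rows whose REGULAR_DUMMY column differs
// from the literal string "REGULAR_DUMMY", then count what is left.
val naiveCount = df.filter(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY").count()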
However, this would give you 0, because in the other rows the value of the REGULAR_DUMMY column is null, and a comparison against null evaluates to null rather than true. So not a single row passes the predicate and all the rows are filtered out.
The solution is to fill the null values with a temporary value, apply the filter, and only then do the count, which should show you the correct result.
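A sketch of that fix; the filler string "temp" is an arbitrary placeholder chosen for illustration, not anything Spark-specific:

// Fill the nulls in REGULAR_DUMMY with a throwaway placeholder first,
// so the inequality no longer evaluates to null for those rows.
val fixedCount = df
  .na.fill("temp", Seq("REGULAR_DUMMY"))
  .filter(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY")
  .count()  // 2 for the sample data above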
You can use where for the filtering as well.
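For example, since where is an alias for filter on DataFrames, the same count can be written as:

val whereCount = df
  .na.fill("temp", Seq("REGULAR_DUMMY"))
  .where(col("REGULAR_DUMMY") =!= "REGULAR_DUMMY")
  .count()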