I'm trying to use Python's re library, or any combination of Python snippets, to fix malformed JSON strings that Kafka delivers to HDFS on Cloudera's Hadoop distribution.
The malformed JSON:
{"json_data":"{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":" 99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":" ","COL29":"PBU67H ","COL30":" 20000","COL31":2,"COL32":null}}"}
Note: the double quote near the start (right after "json_data":) and the double quote near the end (right after null}}) are actually the only errors that need to be removed (I tested it; there are no other extra quotes).
The valid, correct JSON:
{"json_data":{"table":"TEST.FUBAR","op_type":"I","op_ts":"2019-03-14 15:33:50.031848","current_ts":"2019-03-14T15:33:57.479002","pos":"1111","after":{"COL1":949494949494949494,"COL2":99,"COL3":2,"COL4":" 99999","COL5":9999999,"COL6":90,"COL7":42478,"COL8":"I","COL9":null,"COL10":"2019-03-14 15:33:49","COL11":null,"COL12":null,"COL13":null,"COL14":"x222263 ","COL15":"2019-03-14 15:33:49","COL16":"x222263 ","COL17":"2019-03-14 15:33:49","COL18":"2020-09-10 00:00:00","COL19":"A","COL20":"A","COL21":0,"COL22":null,"COL23":"2019-03-14 15:33:47","COL24":2,"COL25":2,"COL26":"R","COL27":"2019-03-14 15:33:49","COL28":" ","COL29":"PBU67H ","COL30":" 20000","COL31":2,"COL32":null}}}
I have 40,000 to 60,000 records that I need to read through every hour with PySpark, and the infrastructure team says it's up to me to fix the data.
Is there a quick and dirty way, using Python, to read every string and strip the double quotes near the beginning and the end?
1 Answer
I'd suggest sticking with the re module. A regular expression should do the trick here, because each unwanted double quote sits immediately after a colon or a closing brace and immediately before an opening or closing brace.
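A minimal sketch of such a substitution (the record is shortened here for readability; the exact pattern is an assumption based on the note above that only the two quotes wrapping the nested object are wrong):

```python
import json
import re

# Shortened version of the malformed record: the nested object is wrapped
# in stray double quotes, one after "json_data": and one before the final brace.
bad = '{"json_data":"{"table":"TEST.FUBAR","op_type":"I","after":{"COL1":949494949494949494,"COL32":null}}"}'

# Remove any double quote that sits right after a colon or closing brace
# and right before an opening or closing brace.
fixed = re.sub(r'(?<=[:}])"(?=[{}])', '', bad)

# The cleaned string now parses as valid JSON.
record = json.loads(fixed)
print(record["json_data"]["table"])  # → TEST.FUBAR
```

For the 40,000 to 60,000 hourly records mentioned in the question, the same substitution could be applied per line in PySpark, for example by wrapping it in a string UDF over the raw text column.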
Result: the cleaned string, identical to the valid JSON shown above.