我正在使用pig读取avro文件,并在写回之前对数据进行规范化/转换。avro文件的记录形式如下:
{
"type" : "record",
"name" : "KeyValuePair",
"namespace" : "org.apache.avro.mapreduce",
"doc" : "A key/value pair",
"fields" : [ {
"name" : "key",
"type" : "string",
"doc" : "The key"
}, {
"name" : "value",
"type" : {
"type" : "map",
"values" : "bytes"
},
"doc" : "The value"
} ]
}
我将avrotools命令行实用程序与jq结合使用,将第一条记录转储到json:
$ java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | ./jq --compact-output 'select(.value.pf_v != null)' | head -n 1 | ./jq .
{
"key": "some-record-uuid",
"value": {
"pf_v": "v1\u0003Basic\u0001slcvdr1rw\u001a\u0004v2\u0003DayWatch\u0001slcva2omi\u001a\u0004v3\u0003Performance\u0001slc1vs1v1w1p1g1i\u0004v4\u0003Fundamentals\u0001snlj1erwi\u001a\u0004v5\u0003My Portfolio\u0001svr1dews1b2b3k1k2\u001a\u0004v0\u00035"
}
}
我运行以下pig命令:
REGISTER avro-1.8.1.jar
REGISTER json-simple-1.1.1.jar
REGISTER piggybank-0.15.0.jar
REGISTER jackson-core-2.8.6.jar
REGISTER jackson-databind-2.8.6.jar
DEFINE AvroLoader org.apache.pig.piggybank.storage.avro.AvroStorage();
AllRecords = LOAD 'part-m-00000.avro'
USING AvroLoader()
AS (key: chararray, value: map[]);
Records = FILTER AllRecords BY value#'pf_v' is not null;
SmallRecords = LIMIT Records 10;
DUMP SmallRecords;
上面最后一条命令的相应记录如下:
...
(some-record-uuid,[pf_v#v03v1Basicslcviv2DayWatchslcva2omiv3Performanceslc1vs1v1w1p1g1i])
...
如您所见,unicode字符已从 pf_v
价值观。unicode字符实际上被用作这些值中的分隔符,因此我需要它们来将记录完全解析为所需的规范化状态。unicode字符在编码的 .avro
文件(如将文件转储为json所示)。有人知道一种方法让avrostorage在加载记录时不删除unicode字符吗?
谢谢您!
更新:我还使用avro的python datafilereader执行了相同的操作:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
reader = DataFileReader(open("part-m-00000.avro", "rb"), DatumReader())
for rec in reader:
if 'some-record-uuid' in rec['key']:
print rec
print '--------------------------------------------'
break
reader.close()
这将打印一个dict,看起来像是用十六进制字符代替unicode字符(这比完全删除它们更好):
{u'value': {u'pf_v': 'v0\x033\x04v1\x03Basic\x01slcvi\x1a\x04v2\x03DayWatch\x01slcva2omi\x1a\x04v3\x03Performance\x01slc1vs1v1w1p1g1i\x1a'}, u'key': u'some-record-uuid'}
暂无答案!
目前还没有任何答案,快来回答吧!