Can I set a Mapper's input to a HashMap instead of an input file?

hgncfbus posted on 2021-06-04 in Hadoop

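I'm trying to set up a MapReduce job that takes advantage of DynamoDB's parallel scan feature.
Basically, I want each Mapper to take a tuple as its input value.
Every example I have seen so far configures the input like this: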

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Can I change the job's input format to a HashMap instead?

628mspwn (answer #1)

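I think you want to read the file as key-value pairs, rather than in the standard way an InputSplit is read (the byte offset of each line as the key, the line itself as the value). If that is what you are asking, you can use KeyValueTextInputFormat. The following description is from Hadoop: The Definitive Guide: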

KeyValueTextInputFormat
TextInputFormat’s keys, being simply the offset within the file, are not normally
very useful. It is common for each line in a file to be a key-value pair, 
separated by a delimiter such as a tab character. For example, this is the output   
produced by TextOutputFormat, Hadoop’s default OutputFormat. To interpret such 
files correctly, KeyValueTextInputFormat is appropriate.

You can specify the separator via the key.value.separator.in.input.line property. 
It is a tab character by default. Consider the following input file, 
where → represents a (horizontal) tab character:

line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.

Like in the TextInputFormat case, the input is in a single split comprising four
records, although this time the keys are the Text sequences before the tab in
each line:

(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
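
To make the splitting rule in the quoted passage concrete, here is a minimal plain-Java sketch with no Hadoop dependency. It only mimics the behaviour described above (split each line at the first separator; a line with no separator becomes a key with an empty value). The class name KeyValueSplitSketch and its methods are hypothetical names chosen for this illustration and are not part of Hadoop.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative only: imitates the key/value split described above without using Hadoop.
class KeyValueSplitSketch {

    // Splits one line at the FIRST occurrence of the separator.
    // If the separator is absent, the whole line is the key and the value is empty.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    // Applies split() to every line, preserving input order.
    static List<Map.Entry<String, String>> splitAll(List<String> lines, char separator) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (String line : lines) {
            records.add(split(line, separator));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "line1\tOn the top of the Crumpetty Tree",
            "line2\tThe Quangle Wangle sat,",
            "line3\tBut his face you could not see,",
            "line4\tOn account of his Beaver Hat.");
        // Print each record as a (key, value) pair.
        for (Map.Entry<String, String> record : splitAll(lines, '\t')) {
            System.out.println("(" + record.getKey() + ", " + record.getValue() + ")");
        }
    }
}
```

In an actual job you would not write this yourself: calling job.setInputFormatClass(KeyValueTextInputFormat.class) on the Job makes Hadoop hand each Mapper a Text key and a Text value split in this way.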
