What is the best way to read distributed cache data?
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Filled once per map task in setup(), then reused by every map() call.
    ArrayList<String> globalFreq = new ArrayList<String>();

    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
        Path getPath = new Path(cacheFiles[0].getPath());
        BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
        String setupData = null;
        while ((setupData = bf.readLine()) != null) {
            String[] parts = setupData.split(" ");
            globalFreq.add(parts[0]);
        }
        bf.close();
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Access "globalFreq" here and do further processing.
    }
}
or
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    URI[] cacheFiles;
    FileSystem fs; // kept as a field so that map() can use it

    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);
        cacheFiles = DistributedCache.getCacheFiles(conf);
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The cache file is opened and parsed again on every map() call.
        ArrayList<String> globalFreq = new ArrayList<String>();
        Path getPath = new Path(cacheFiles[0].getPath());
        BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
        String setupData = null;
        while ((setupData = bf.readLine()) != null) {
            String[] parts = setupData.split(" ");
            globalFreq.add(parts[0]);
        }
        bf.close();
    }
}
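So if we do it as in code 2, what does that mean? Say we have 5 map tasks, and every map task reads the same copy of the data. Written this way, the data is read once for each map, so the tasks read it multiple times, right (5 times)?
Code 1: because the read is done in setup, the data is read once and the global data is then accessed in map.
Is this the right way to use the distributed cache?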
1 answer
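Do as much of the work as you can in the setup method: it is called once by each mapper, and what it loads is then cached for every record passed to that mapper. Parsing the data again for each record is avoidable overhead, because nothing in that parsing depends on the key, value, or context arguments you receive in the map method. The setup method is called once per map task, whereas map is called for every record that task processes (which can obviously be a very large number).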
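To make the difference concrete, here is a minimal sketch in plain Java, with no Hadoop dependency, that only counts how often the cache file would be parsed under each approach. The class name SetupVsMapDemo, the helpers parseFirstTokens and runTask, the sample file contents, and the figure of 1000 records per task are all assumptions made for illustration; only the 5 map tasks come from the question. It is not the Hadoop API, just a simulation of when the parsing happens.

```java
import java.util.ArrayList;
import java.util.List;

public class SetupVsMapDemo {

    // Keeps the first space-separated token of each line, like the loop in the question.
    public static List<String> parseFirstTokens(String fileContents) {
        List<String> tokens = new ArrayList<>();
        for (String line : fileContents.split("\n")) {
            tokens.add(line.split(" ")[0]);
        }
        return tokens;
    }

    // Simulates one map task and returns how many times the cache file was parsed.
    // loadInSetup = true mimics code 1; false mimics code 2.
    public static int runTask(boolean loadInSetup, String cacheContents, int recordCount) {
        int parses = 0;
        List<String> globalFreq = null;

        if (loadInSetup) {
            // "setup": runs once per task.
            globalFreq = parseFirstTokens(cacheContents);
            parses++;
        }
        for (int record = 0; record < recordCount; record++) {
            // "map": runs once per record.
            if (!loadInSetup) {
                globalFreq = parseFirstTokens(cacheContents);
                parses++;
            }
            // Real record processing would read globalFreq here.
        }
        return parses;
    }

    public static void main(String[] args) {
        String cacheContents = "apple 10\nbanana 7\ncherry 3\n"; // hypothetical cache file
        int mapTasks = 5;          // from the question
        int recordsPerTask = 1000; // assumed for illustration

        int parsesWithSetup = 0;
        int parsesInMap = 0;
        for (int task = 0; task < mapTasks; task++) {
            parsesWithSetup += runTask(true, cacheContents, recordsPerTask);
            parsesInMap += runTask(false, cacheContents, recordsPerTask);
        }
        System.out.println("code 1 (parse in setup): " + parsesWithSetup); // prints 5
        System.out.println("code 2 (parse in map): " + parsesInMap);       // prints 5000
    }
}
```

Under these assumptions, code 1 parses the file once per task (5 times in total), while code 2 parses it once per record (5000 times in total), so the cost of code 2 grows with the input size rather than with the number of tasks.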