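I am writing a simple MapReduce program using a Kaggle dataset: https://www.kaggle.com/datasnaek/youtube-new
The dataset contains 40950 video records with 16 variables, such as video ID, trending date, title, channel title, category ID, publish time, tags, views, likes, dislikes, comment count, description, etc.
The goal of my MapReduce program is to find all videos whose description contains "iphonex" and that have at least 10000 likes. The final output should contain only (title, video count).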
Driver class
package solution;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Usage: Driver <input dir> <output dir> \n");
            return -1;
        }
        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(getConf());
        job.setJarByClass(Driver.class);
        job.setJobName("iPhoneX");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        // Reuse the reducer as the combiner class
        job.setCombinerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        if (job.getCombinerClass() == null) {
            throw new Exception("Combiner not set");
        }
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    /* The main method calls the ToolRunner.run method,
     * which calls the options parser that interprets Hadoop terminal
     * options and puts them into a config object
     * */
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(exitCode);
    }
}
Reducer class
package solution;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// The Hadoop base class is fully qualified because this class shares its simple name
public class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts emitted for this key
        int video_count = 0;
        for (IntWritable value : values) {
            video_count += value.get();
        }
        context.write(key, new IntWritable(video_count));
    }
}
Mapper class
package solution;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// The Hadoop base class is fully qualified because this class shares its simple name
public class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
    private Text description = new Text();
    private IntWritable likes = new IntWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String str[] = line.split("\t");
        if (str.length > 3) {
            description.set(str[8]);
        }
        // Testing how many times the iPhoneX word is located in the data set
        // StringTokenizer itr = new StringTokenizer(line);
        //
        // while(itr.hasMoreTokens()){
        //     String token = itr.nextToken();
        //     if(token.contains("iPhoneX")){
        //         word.set("iPhoneX Count");
        //         context.write(word, new IntWritable(1));
        //     }
        // }
    }
}
1 Answer
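Your code looks fine, but you need to uncomment the part of the mapper that outputs data; as posted, the mapper never calls context.write, so nothing reaches the reducer. The mapper key should also just be "iphone", and you probably want to tokenize the description rather than the whole line.
You will also need to extract the number of likes and keep only the records that match the conditions listed in the problem statement.
By the way, reading str[8] requires at least 9 elements, not 3, so change the condition there from str.length > 3 to str.length > 8.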
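To make the checks above concrete, here is a minimal sketch in plain Java (no Hadoop types, so it can be tried without a cluster). The class name `RecordFilter`, the method `matches`, and the idea of passing the column positions as parameters are my own assumptions for illustration, as is the case-insensitive comparison; adjust the indices to wherever likes and description sit in your file.

```java
import java.util.Locale;

// Hypothetical helper, not part of the question's code.
final class RecordFilter {
    static final long MIN_LIKES = 10_000;

    // Returns true if the tab-separated line has enough columns, at least
    // MIN_LIKES likes, and a description containing "iphonex" (any case).
    static boolean matches(String line, int likesCol, int descriptionCol) {
        // limit -1 keeps trailing empty columns so the length check is reliable
        String[] fields = line.split("\t", -1);
        int needed = Math.max(likesCol, descriptionCol) + 1;
        if (fields.length < needed) {
            return false; // too short to index safely
        }
        long likes;
        try {
            likes = Long.parseLong(fields[likesCol].trim());
        } catch (NumberFormatException e) {
            return false; // header row or malformed number
        }
        if (likes < MIN_LIKES) {
            return false;
        }
        String description = fields[descriptionCol].toLowerCase(Locale.ROOT);
        return description.contains("iphonex");
    }
}
```

Inside map(), you could call something like `RecordFilter.matches(value.toString(), likesCol, descriptionCol)` and only call context.write when it returns true.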
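Alternatively, rather than pre-aggregating in the mapper, you can simply write out (token, 1) for every "iphonex" token and let the combiner and reducer do the summing for you.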
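A rough simulation of that (token, 1) approach in plain Java is below. `TokenCountSketch`, `mapTokens`, and `sum` are hypothetical names, and lower-casing the tokens is my assumption. In a real job, each pair produced by `mapTokens` would be a `context.write(new Text("iphonex"), new IntWritable(1))` call in the mapper, and `sum` stands in for the combiner and reducer. Note that this counts matching tokens, so a description mentioning the word twice contributes 2.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for the map and reduce steps, for illustration only.
final class TokenCountSketch {
    // "Map" step: one ("iphonex", 1) pair per matching token in a description.
    static List<Map.Entry<String, Integer>> mapTokens(String description) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(description);
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken().toLowerCase(Locale.ROOT);
            if (token.contains("iphonex")) {
                out.add(new AbstractMap.SimpleEntry<>("iphonex", 1));
            }
        }
        return out;
    }

    // "Reduce" step: sum the values per key, like the reducer in the question.
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }
}
```

Because the summing is associative, the same reducer class can serve as the combiner, which is what the driver in the question already configures.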