Extracting URL hit counts from a log file using MapReduce

mctunoxg  posted on 2021-05-30 in Hadoop

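I am trying to implement the following as a Hadoop MapReduce job. I have a log file in which each line contains an IP address followed by the URL that IP opened. It looks like this: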

    192.168.72.224 www.m4maths.com
    192.168.72.177 www.yahoo.com
    192.168.72.177 www.yahoo.com
    192.168.72.224 www.facebook.com
    192.168.72.224 www.gmail.com
    192.168.72.177 www.facebook.com
    192.168.198.92 www.google.com
    192.168.198.92 www.yahoo.com
    192.168.72.224 www.google.com
    192.168.72.177 www.yahoo.com
    192.168.198.92 www.google.com
    192.168.72.224 www.indiabix.com
    192.168.72.177 www.yahoo.com
    192.168.72.224 www.google.com
    192.168.72.177 www.yahoo.com
    192.168.72.224 www.yahoo.com
    192.168.198.92 www.m4maths.com
    192.168.198.92 www.facebook.com
    192.168.72.224 www.gmail.com
    192.168.72.177 www.google.com
    192.168.72.224 www.indiabix.com
    192.168.72.224 www.indiabix.com
    192.168.72.177 www.m4maths.com
    192.168.72.224 www.indiabix.com
    192.168.198.92 www.google.com
    192.168.72.177 www.yahoo.com
    192.168.198.92 www.yahoo.com
    192.168.72.177 www.yahoo.com
    192.168.198.92 www.facebook.com
    192.168.198.92 www.indiabix.com
    192.168.72.177 www.indiabix.com
    192.168.72.224 www.google.com
    192.168.198.92 www.askubuntu.com
    192.168.198.92 www.askubuntu.com
    192.168.198.92 www.facebook.com
    192.168.198.92 www.gmail.com
    192.168.72.177 www.facebook.com
    192.168.72.177 www.yahoo.com
    192.168.198.92 www.m4maths.com
    192.168.72.224 www.yahoo.com
    192.168.72.177 www.google.com
    192.168.72.177 www.m4maths.com
    192.168.72.177 www.yahoo.com
    192.168.72.224 www.m4maths.com
    192.168.72.177 www.yahoo.com
    192.168.72.177 www.yahoo.com
    192.168.72.224 www.facebook.com
    192.168.72.224 www.gmail.com
    192.168.72.177 www.facebook.com
    192.168.198.92 www.google.com
    192.168.198.92 www.yahoo.com
    192.168.72.224 www.google.com
    192.168.72.177 www.yahoo.com
    192.168.198.92 www.google.com
    192.168.72.224 www.indiabix.com
    192.168.72.177 www.yahoo.com
    192.168.72.224 www.google.com
    192.168.72.177 www.yahoo.com
    192.168.72.224 www.yahoo.com
    192.168.198.92 www.m4maths.com
    192.168.198.92 www.facebook.com
    192.168.72.224 www.gmail.com
    192.168.72.177 www.google.com
    192.168.72.224 www.indiabix.com
    192.168.72.224 www.indiabix.com
    192.168.72.177 www.m4maths.com
    192.168.72.224 www.indiabix.com

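Now I need to organize the results so that the output lists each distinct IP address, followed by each URL and the number of times that IP opened it.
For example, if 192.168.72.224 opened www.yahoo.com 15 times according to the whole log file, the output must contain: 192.168.72.224 www.yahoo.com 15. This should be done for every IP in the file, and the final output should look like this: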

    192.168.72.224 www.yahoo.com 15
                   www.m4maths.com 11
    192.168.72.177 www.yahoo.com 6
                   www.gmail.com 19
    ....
    ...
    ..
    .

The code I tried is:

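    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
        {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            // Emits every token (IP or URL) on its own with a count of 1.
            while (tokenizer.hasMoreTokens())
            {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }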

I know this code has serious flaws; please suggest an idea for moving forward.
Thank you.

gkl3eglg #1

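I would propose this design:
The mapper takes a line from the file and outputs the IP as the key and a (website, 1) pair as the value.
The combiner and the reducer take an IP as the key and a series of (website, count) pairs, aggregate them by website (using a HashMap), and output the IP, website, and count.
Implementing this requires a custom Writable to handle the pair.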
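The mapper/combiner/reducer design above can be sketched outside Hadoop as a plain-Python simulation. The function names (`map_line`, `aggregate`, `run`) and the idea of passing input splits as lists are assumptions made for illustration; a real job would use Hadoop's Mapper and Reducer classes and a custom Writable instead of tuples.

```python
from collections import Counter, defaultdict


def map_line(line):
    """Mapper: one log line -> (ip, (website, 1)); malformed lines are skipped."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[0], (parts[1], 1)


def aggregate(ip, pairs):
    """Combiner and reducer: sum the counts per website for a single IP."""
    totals = Counter()
    for website, count in pairs:
        totals[website] += count
    for website, count in totals.items():
        yield ip, (website, count)


def run(splits):
    """Simulate the job; each element of `splits` is one mapper's input lines."""
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for line in split:
            for ip, pair in map_line(line):
                local[ip].append(pair)
        # The combiner pre-aggregates one mapper's output before the shuffle.
        for ip, pairs in local.items():
            for _, pair in aggregate(ip, pairs):
                shuffled[ip].append(pair)
    result = []
    # The reducer sees every (website, count) pair for an IP and sums again.
    for ip, pairs in shuffled.items():
        for _, (website, count) in aggregate(ip, pairs):
            result.append((ip, website, count))
    return sorted(result)
```

The same `aggregate` function can serve as both combiner and reducer because summing partial counts gives the same total regardless of how the input is split. Calling `run` on two splits that hold the first four sample lines returns one `(ip, website, count)` tuple per distinct pair, with `www.yahoo.com` counted twice for `192.168.72.177`.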
Personally, I would do this with Spark, unless you are very concerned about performance. With PySpark it is as simple as this:

    # Assumes `sc` is an existing SparkContext, as in the pyspark shell.
    rdd = sc.textFile('/sparkdemo/log.txt')
    counts = (rdd.map(lambda line: line.split())
                 .map(lambda parts: ((parts[0], parts[1]), 1))
                 .reduceByKey(lambda x, y: x + y))
    # Re-key by IP so that each IP's (url, count) pairs can be grouped together.
    result = counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().collect()
    for ip, sites in result:
        print('IP: %s' % ip)
        for url, cnt in sites:
            print(' website: %s count: %d' % (url, cnt))

The output for your example is:

    IP: 192.168.72.224
     website: www.facebook.com count: 2
     website: www.m4maths.com count: 2
     website: www.google.com count: 5
     website: www.gmail.com count: 4
     website: www.indiabix.com count: 8
     website: www.yahoo.com count: 3
    IP: 192.168.72.177
     website: www.yahoo.com count: 14
     website: www.google.com count: 3
     website: www.facebook.com count: 3
     website: www.m4maths.com count: 3
     website: www.indiabix.com count: 1
    IP: 192.168.198.92
     website: www.facebook.com count: 4
     website: www.m4maths.com count: 3
     website: www.yahoo.com count: 3
     website: www.askubuntu.com count: 2
     website: www.indiabix.com count: 1
     website: www.google.com count: 5
     website: www.gmail.com count: 1

oymdgrw7 #2

I wrote the same logic in Java:

    // Each public class goes in its own .java file.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class UrlHitMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer st = new StringTokenizer(value.toString());
            // Skip malformed lines that lack an IP or a URL.
            if (st.countTokens() >= 2) {
                // Key = IP, value = URL.
                context.write(new Text(st.nextToken()), new Text(st.nextToken()));
            }
        }
    }

    public class UrlHitReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Count how often each URL occurs for this IP.
            Map<String, Integer> urlCount = new HashMap<>();
            for (Text value : values) {
                urlCount.merge(value.toString(), 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> entry : urlCount.entrySet()) {
                context.write(key, new Text(entry.getKey() + " " + entry.getValue()));
            }
        }
    }

    public class UrlHitCount extends Configured implements Tool {
        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new UrlHitCount(), args));
        }

        @Override
        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf());
            job.setJobName("url-hit-count");
            job.setJarByClass(UrlHitCount.class);
            job.setMapperClass(UrlHitMapper.class);
            job.setReducerClass(UrlHitReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.setInputPaths(job, new Path("input/urls"));
            FileOutputFormat.setOutputPath(job, new Path("url_otput" + System.currentTimeMillis()));
            // Wait for the job to finish; return 0 on success and 1 on failure.
            return job.waitForCompletion(true) ? 0 : 1;
        }
    }
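For comparison, and not part of the answer above, the same count can be sketched in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated `key<TAB>value` text lines and the reducer receives its input sorted by key. This sketch assumes a composite key of the form `ip url`, which reduces the problem to an ordinary word count over whole lines. The function names `mapper` and `reducer` are hypothetical, and wiring them to stdin/stdout for an actual streaming job is left out.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'ip url<TAB>1' for every well-formed log line."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            yield f"{parts[0]} {parts[1]}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def simulate(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

Because the key already contains both the IP and the URL, the reducer needs no per-IP map in memory; it only sums a run of adjacent lines with the same key.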