Counting with R and Hadoop

Posted by okxuctiv on 2021-06-02 in Hadoop

I am new to R and I have a problem with MapReduce in rmr2. I have a file like this to read, where each line has a date followed by some words (A, B, C, ...):

  2016-05-10, A, B, C, A, R, E, F, E
  2016-05-18, A, B, F, E, E
  2016-06-01, A, B, K, T, T, E, G, E, A, N
  2016-06-03, A, B, K, T, T, E, F, E, L, T

I would like to get a result like the following in the output:

  2016-05: A 3
  2016-05: E 4
  2016-05: E 4

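I have already solved the same problem in a Java implementation, and now I have to do the same thing in R, but I still need to figure out how to write my reducer. Is there a way to print from inside the mapper and reducer code? When I use the print command in the mapper or the reducer, I get an error in RStudio.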

  Sys.setenv(HADOOP_STREAMING = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.8.0.jar")
  Sys.setenv(HADOOP_HOME = "/usr/local/hadoop/bin/hadoop")
  Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
  library(rmr2)
  library(stringi)

  # With input.format = "text", v is a character vector holding a chunk of input lines
  customMapper = function(k, v) {
    # split each line into the date and the rest of the line
    parts = stri_split_fixed(v, pattern = ",", n = 2, simplify = TRUE)
    # keep only "YYYY-MM" from the date
    yearMonth = substr(stri_trim_both(parts[, 1]), 1, 7)
    # split the rest of each line into words and drop surrounding spaces
    words = lapply(stri_split_fixed(parts[, 2], pattern = ","), stri_trim_both)
    # emit one (year-month, word) pair per word
    keyval(rep(yearMonth, lengths(words)), unlist(words))
  }

  # v holds all the words that were emitted for the year-month k
  customReducer = function(k, v) {
    # table() plays the role of the Java Map used for counting
    counts = sort(table(unlist(v)), decreasing = TRUE)
    # keep the 3 words with the most occurrences
    top = head(counts, 3)
    # one output pair per word, e.g. key "2016-05" and value "E 4"
    keyval(rep(k, length(top)), paste(names(top), as.integer(top)))
  }

  wordcount = function(inputData, outputData = NULL) {
    mapreduce(input = inputData, output = outputData, input.format = "text",
              map = customMapper, reduce = customReducer)
  }

  hdfs.data = file.path("/user/hduser", "folder2")
  hdfs.out = file.path("/user/hduser", "output1")
  result = wordcount(hdfs.data, hdfs.out)
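
The counting that the reducer has to do can be tried out in plain R, without Hadoop, which also sidesteps the printing problem while the logic is still being worked out. The following is only a minimal sketch under that assumption: `parse_lines` and `top_words` are hypothetical helper names, not part of rmr2, and `split()` stands in for the shuffle step that groups values by key.

```r
# Sample lines copied from the question.
lines <- c(
  "2016-05-10, A, B, C, A, R, E, F, E",
  "2016-05-18, A, B, F, E, E",
  "2016-06-01, A, B, K, T, T, E, G, E, A, N",
  "2016-06-03, A, B, K, T, T, E, F, E, L, T"
)

# Hypothetical helper (mapper stand-in): one (year-month, word) row per word.
parse_lines <- function(lines) {
  fields <- strsplit(lines, ",", fixed = TRUE)
  # the first field is the date; keep only its "YYYY-MM" prefix
  keys <- vapply(fields, function(f) substr(trimws(f[1]), 1, 7), character(1))
  # the remaining fields are the words
  words <- lapply(fields, function(f) trimws(f[-1]))
  data.frame(key = rep(keys, lengths(words)),
             word = unlist(words),
             stringsAsFactors = FALSE)
}

# Hypothetical helper (reducer stand-in): the n most frequent words of one key.
top_words <- function(words, n = 3) {
  counts <- sort(table(words), decreasing = TRUE)
  head(counts, n)
}

pairs <- parse_lines(lines)
# split() groups the words by year-month, as the shuffle would do
result <- lapply(split(pairs$word, pairs$key), top_words)

# print one "YYYY-MM: word count" line per retained word
for (key in names(result)) {
  top <- result[[key]]
  cat(sprintf("%s: %s %d\n", key, names(top), as.integer(top)), sep = "")
}
```

On the sample data, E (4 occurrences) and A (3) lead 2016-05, and T (5), E (4) and A (3) lead 2016-06. As for printing inside a real job: Hadoop Streaming uses standard output to carry the key-value data, so diagnostics are usually written to standard error instead, for example with `cat("debug:", k, "\n", file = stderr())`. Running the job with `rmr.options(backend = "local")` may also make debugging from RStudio easier, since no cluster is involved.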
