Scoring a very large dataset

dnph8jn4 posted on 2021-06-02 in Hadoop

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
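I fitted a machine learning classifier in R/Python on a 1-2% sample of the data, and I am quite satisfied with the accuracy measures (precision, recall, and F-score).
Now I want to use this classifier, which is coded in R, to score a large database of 70 million rows/examples that resides in a Hadoop/Hive environment.
Information about the dataset:
70 million rows x 40 variables (columns): about 18 variables are categorical and the remaining 22 are numeric (including integers).
How should I go about this? Any suggestions?
The options I have thought of are:
a) Export the data from the Hadoop system to CSV files in chunks of 1 million rows and feed them to R.
b) Some kind of batch processing.
It is not a real-time system, so scoring does not need to happen every day, but I would still like a scoring run to take about 2-3 hours.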
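
Option (a) above could be sketched roughly as follows, in Python rather than R. This is only a minimal sketch under stated assumptions: `score_row` is a hypothetical stand-in for the fitted classifier, and the CSV layout (a header row with `id`, `n1`, and `c1` columns) is invented for illustration. The reader streams the export in fixed-size chunks so that only one chunk is held in memory at a time, and `required_rows_per_second` works out the throughput needed to finish 70 million rows inside the 2-3 hour window.

```python
import csv
import io
from itertools import islice


def score_row(row):
    """Hypothetical stand-in for the fitted classifier: one score per row."""
    return 0.1 * float(row["n1"]) + (1.0 if row["c1"] == "a" else 0.0)


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv.DictReader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def score_csv(in_file, out_file, chunk_size=1_000_000):
    """Score a CSV stream chunk by chunk, write (id, score) rows, return the row count."""
    reader = csv.DictReader(in_file)
    writer = csv.writer(out_file)
    writer.writerow(["id", "score"])
    total = 0
    for chunk in iter_chunks(reader, chunk_size):
        writer.writerows((row["id"], score_row(row)) for row in chunk)
        total += len(chunk)
    return total


def required_rows_per_second(n_rows, hours):
    """Throughput needed to score n_rows within the given number of hours."""
    return n_rows / (hours * 3600)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for one exported CSV file (assumed layout).
    sample = io.StringIO("id,n1,c1\n1,2.0,a\n2,4.0,b\n3,6.0,a\n")
    out = io.StringIO()
    print(score_csv(sample, out, chunk_size=2))
    print(out.getvalue())
    print(round(required_rows_per_second(70_000_000, 3)))
```

At 70 million rows, a 3-hour budget works out to roughly 6,500 rows per second and a 2-hour budget to roughly 9,700, which gives a target to measure a single chunk against before committing to the full run.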

7bsow1i6 #1

If you can install the R runtime on all the datanodes, you can create a simple Hadoop Streaming map job that calls your R code.
You could also take a look at Spark.
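
A streaming mapper of the kind described above might look roughly like the sketch below, written in Python instead of R purely for illustration. Hadoop Streaming hands each input record to the mapper as a line on standard input and collects whatever the mapper writes to standard output. The tab-separated layout, the column positions, the `score` function, and the `map` command-line argument are all assumptions made for this sketch, not part of the original answer.

```python
import sys


def score(fields):
    """Hypothetical model: replace with a call into the real classifier."""
    numeric = [float(x) for x in fields[1:3]]
    return sum(0.1 * x for x in numeric)


def map_lines(lines, out):
    """Read tab-separated records and emit 'key<TAB>score' for each well-formed one."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records rather than failing the whole task
        try:
            out.write(f"{fields[0]}\t{score(fields):.6f}\n")
        except ValueError:
            continue  # a non-numeric value appeared where a number was expected


if __name__ == "__main__" and sys.argv[1:] == ["map"]:
    # Only consume stdin when explicitly asked, so the module can also be
    # imported and exercised on small in-memory inputs.
    map_lines(sys.stdin, sys.stdout)
```

Because every record is scored independently, no reducer is needed. The same function can be checked locally by passing it a short list of lines and an `io.StringIO` before submitting a job.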

prdp8dxp #2

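I infer that you want to run your R code (the classifier) on the full dataset rather than on the sample.
So what you are looking for is a way to execute R code on a large-scale distributed system,
and it has to integrate tightly with the Hadoop components.
RHadoop would therefore fit your problem statement.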
http://www.rdatamining.com/big-data/r-hadoop-setup-guide

xpcnnkqh #3

Scoring 80 million records in 8.5 seconds
The code below was run on an off-lease Dell T7400 workstation with 64 GB RAM, dual quad-core 3 GHz Xeons, and two RAID 0 SSD arrays on separate channels, which I purchased for $600. I also use the free SPDE engine to partition the dataset.
For small datasets like your 80 million rows, you might want to consider SAS or WPS.
The code below scores 80 million 40 char records in 9 seconds.
In-memory R together with SAS/WPS makes a great combination. Many SAS users consider datasets smaller than 1 TB to be small.
I ran 8 parallel processes on SAS 9.4 64-bit under Windows Pro 64-bit.
%let pgm=utl_score_spde;

* assign the SPDE library first so the cleanup step below can reference it;
libname spde spde 'd:/tmp'
  datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g")
  partsize=4g;

proc datasets library=spde;
  delete gig23ful_spde;
run;quit;

* build 80 million test records with 20 numeric and 20 character variables;
data spde.littledata_spde (compress=char drop=idx);
  retain primary_key;
  array num[20] n1-n20;
  array chr[20] $4 c1-c20;
  do primary_key=1 to 80000000;
    do idx=31 to 50;
      num[idx-30]=uniform(-1);
      chr[idx-30]=repeat(byte(idx),40);
    end;
    output;
  end;
run;quit;

* command line used to launch each child SAS session;
%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -nosplash -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas));

* score it;
* write the scoring macro into the autocall library so the child sessions can find it;
data _null_;
  file "c:\oto\utl_scoreit.sas" lrecl=512;
  input;
  put _infile_;
  putlog _infile_;
cards4;
%macro utl_scoreit(beg=1,end=10000000);
  libname spde spde 'd:/tmp'
    datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g")
    partsize=4g;
  libname out "G:/wrk";
  data keyscore;
    set spde.littledata_spde(firstobs=&beg obs=&end
      keep=primary_key n1 n12 n3 n14 n5 n16 n7 n18 n9 n10 c18 c19 c12);
    score=(.1*n1 + .1*n12 + .1*n3 + .1*n14 + .1*n5 +
           .1*n16 + .1*n7 + .1*n18 + .1*n9 + .1*n10 +
           (c18='0000') + (c19='0000') + (c12='0000'))/3;
    keep primary_key score;
  run;
%mend utl_scoreit;
;;;;
run;quit;

%utl_scoreit;

* start the timer, then launch 8 child sessions, each scoring 10 million rows;
%let tym=%sysfunc(time());
systask kill sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=1,end=10000000);) -log G:\wrk\sys101.log" taskname=sys101;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=10000001,end=20000000);) -log G:\wrk\sys102.log" taskname=sys102;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=20000001,end=30000000);) -log G:\wrk\sys103.log" taskname=sys103;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=30000001,end=40000000);) -log G:\wrk\sys104.log" taskname=sys104;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=40000001,end=50000000);) -log G:\wrk\sys105.log" taskname=sys105;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=50000001,end=60000000);) -log G:\wrk\sys106.log" taskname=sys106;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=60000001,end=70000000);) -log G:\wrk\sys107.log" taskname=sys107;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=70000001,end=80000000);) -log G:\wrk\sys108.log" taskname=sys108;
waitfor _all_ sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108;
systask list;

* elapsed wall-clock seconds for all 8 sessions;
%put %sysevalf( %sysfunc(time()) - &tym);
8.56500005719863

NOTE: AUTOEXEC processing completed.
NOTE: Libref SPDE was successfully assigned as follows:
      Engine:        SPDE
      Physical Name: d:\tmp\
NOTE: Libref OUT was successfully assigned as follows:
      Engine:        V9
      Physical Name: G:\wrk
NOTE: There were 10000000 observations read from the data set SPDE.LITTLEDATA_SPDE.
NOTE: The data set WORK.KEYSCORE has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           7.05 seconds
      cpu time            6.98 seconds
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
      real time           8.34 seconds
      cpu time            7.36 seconds
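
For readers more comfortable outside SAS, the partitioning idea in this answer can be sketched in Python. This is an illustrative sketch, not a port of the SAS program: `partition_ranges`, `score_record`, and `score_range` are hypothetical names, the worker only counts rows instead of reading data, and a thread pool stands in for the eight separate SAS sessions (CPU-bound scoring in Python would more likely use separate processes).

```python
from concurrent.futures import ThreadPoolExecutor


def partition_ranges(total_rows, n_parts):
    """Split rows 1..total_rows into n_parts contiguous, 1-based inclusive ranges."""
    base, extra = divmod(total_rows, n_parts)
    ranges, start = [], 1
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first parts
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def score_record(nums, flags):
    """Mirror the SAS formula: (0.1 * sum of the numerics + count of '0000' flags) / 3."""
    return (0.1 * sum(nums) + sum(f == "0000" for f in flags)) / 3


def score_range(bounds):
    """Placeholder worker: a real one would read and score rows beg..end."""
    beg, end = bounds
    return end - beg + 1  # number of rows this worker is responsible for


if __name__ == "__main__":
    ranges = partition_ranges(80_000_000, 8)
    print(ranges[0], ranges[-1])
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(sum(pool.map(score_range, ranges)))
```

The ranges produced for 80 million rows and 8 parts match the `beg`/`end` pairs passed to `%utl_scoreit` above, so each worker touches a disjoint slice and no coordination is needed until the results are combined.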