I want to improve the performance of Spark's Word2Vec on an EMR cluster. I have about 54 GB of cleaned patent text data that I want to train Spark's Word2Vec on. It appears to be running, but I think the performance could be better. Can anyone suggest how?
Preprocessing steps taken:
Removed special characters from the text and collapsed unnecessary whitespace.
Tokenized the text into words.
Removed stopwords from the tokens.
Lemmatized the words.
Removed overly frequent words (words appearing in more than 30% of documents).
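A minimal single-document sketch of the steps above, assuming they run per document (at 54 GB the real pipeline would presumably use Spark transformers such as `RegexTokenizer` and `StopWordsRemover` instead). The stopword list and the dictionary lemmatizer here are tiny stand-ins, not the actual resources used:

```python
import re

# Stand-ins for illustration only; a real pipeline would use a full
# stopword list and a proper lemmatizer (e.g. NLTK or spaCy).
STOPWORDS = {"the", "a", "of", "is", "and", "in", "to"}
LEMMAS = {"cooled": "cool", "provides": "provide"}

def clean_document(text: str) -> list[str]:
    # 1. Remove special characters and collapse whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # 2. Tokenize on whitespace.
    tokens = text.split()
    # 3. Drop stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Lemmatize (dictionary stand-in here).
    return [LEMMAS.get(t, t) for t in tokens]

def frequent_words(docs: list[list[str]], threshold: float = 0.3) -> set[str]:
    # 5. Find words appearing in more than `threshold` of all documents,
    #    counting each word at most once per document.
    n = len(docs)
    doc_freq: dict[str, int] = {}
    for doc in docs:
        for w in set(doc):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w for w, c in doc_freq.items() if c / n > threshold}
```

The document-frequency pass in step 5 is the part that requires a full scan of the corpus before filtering, which is why it comes last.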
Sample of the cleaned data:
+----------------------------------------------------------------------------------------------------+
|[water, cooling, cooled, type, pre, burning, present, invention, provides, kind, water, cooling, ...|
|[new, energetic, liquid, invention, discloses, kind, new, energetic, liquid, made, head, outlet, ...|
|[pre, assembly, pre, disclosed, pre, cylindrical, body, member, extending, axially, opposite, pre...|
|[part, feed, ozone, feed, form, difference, ozone, concentration, space, wise, time, wise, premix...|
|[homogeneous, charge, thereof, invention, discloses, homogeneous, type, thereof, cover, arranged,...|
|[gasoline, pre, plug, pre, communicating, plug, associated, pre, respectively, gasoline, injected...|
|[pre, pre, homogeneous, charge, hcci, mode, providing, pre, fluidly, creating, radical, pre, achi...|
|[pre, 105, 351, another, aspect, pre, equal, greater, main, 107, 355, ieast, prior, main, aspect,...|
|[energy, apparatus, energy, apparatus, presented, herein, energy, conversion, module, containing,...|
|[diesel, invention, provides, inlet, processing, diesel, diesel, inlet, treatment, diesel, charac...|
+----------------------------------------------------------------------------------------------------+
only showing top 10 rows
EMR hardware setup:
Master: m5.2xlarge (8 vCores, 32 GiB memory, EBS-only storage: 128 GiB)
Core (10x): m5.4xlarge (16 vCores, 64 GiB memory, EBS-only storage: 256 GiB)

spark-submit settings:
```
spark-submit --master yarn \
  --conf "spark.executor.instances=40" \
  --conf "spark.default.parallelism=640" \
  --conf "spark.executor.cores=4" \
  --conf "spark.executor.memory=12g" \
  --conf "spark.driver.memory=12g" \
  --conf "spark.driver.maxResultSize=12g" \
  --conf "spark.dynamicAllocation.enabled=false" \
  run_program.py
```
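As a sanity check, a back-of-envelope calculation of how those flags map onto the 10 core nodes, assuming executors land only on the m5.4xlarge core nodes and the default `spark.executor.memoryOverhead` of 10%:

```python
# Cluster resources (10 m5.4xlarge core nodes).
nodes = 10
vcores_per_node = 16
mem_per_node_gib = 64

# spark-submit settings above.
executors = 40
cores_per_executor = 4
executor_mem_gib = 12
overhead_factor = 0.10  # default memoryOverhead: max(384 MiB, 10% of executor memory)

executors_per_node = executors // nodes                 # 4 executors per node
cores_used = executors_per_node * cores_per_executor    # 16 of 16 vCores per node
mem_used_gib = executors_per_node * executor_mem_gib * (1 + overhead_factor)

print(cores_used, round(mem_used_gib, 1))  # all vCores claimed, ~52.8 of 64 GiB
```

So the configuration already requests every vCore on the core nodes and roughly 53 GiB of the 64 GiB per node, which matches the observed 50-60% RAM usage being well below capacity at the container level.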
`Word2Vec` settings (defaults used where not mentioned):
`vectorSize=200`, `minCount=5`, `numIterations=15`, `numPartitions=120`

Additional notes:
During fitting, cluster CPU utilization is around 70%.
During fitting, total RAM usage is about 50-60%.
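The parameter names `numIterations` and `numPartitions` match the RDD-based `pyspark.mllib.feature.Word2Vec` API, so the settings listed above would look roughly like this sketch. `token_rdd` is a hypothetical RDD of token lists shaped like the sample rows, and a running SparkContext is assumed:

```python
from pyspark.mllib.feature import Word2Vec

# Settings from the question; unlisted parameters keep their defaults.
word2vec = (
    Word2Vec()
    .setVectorSize(200)
    .setMinCount(5)
    .setNumIterations(15)
    .setNumPartitions(120)
)

# token_rdd: hypothetical RDD[list[str]] of cleaned patent documents.
model = word2vec.fit(token_rdd)
```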
Should I increase `numPartitions` to push CPU utilization toward 100%? How much (if at all) would that reduce model accuracy? And how should I set `numIterations`? What is sufficient in this case?
Can anyone help me?
Thanks in advance!