Is EFS a substitute for HDFS for distributed storage?

r8xiu3jd  posted on 2022-12-09  in HDFS

Our business requirement is to read millions of files and process them in parallel (and later index them in ES). This is a one-time operation: after processing, we won't read those files again. We want to distribute the file storage and at the same time ensure data retention. I did some research and made this list:

  1. EBS: The data is retained even after the EC2 instance is shut down. A volume is accessible from a single EC2 instance in our AWS region, so it would work if we split the data into chunks ourselves and hand those to the different servers we have. It offers redundancy and encryption, and it is easy to scale.
  2. EFS: It allows us to mount the FS across multiple regions and instances (it is accessible from multiple EC2 instances). Since EFS is a managed service, we don't have to worry about deploying and maintaining the FS.
  3. S3: Access is not limited to EC2, but S3 is not a file system.
  4. HDFS: Extremely good at scale but is only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big of a concern this is considering our servers are pretty secure.
  5. The small-files problem in Hadoop, explained in https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/ Since most of the files we receive are less than 1 MB, this can cause memory issues once we go beyond a certain number of files, so HDFS will not give us the performance we expect.
    My confusion is about HDFS: I went through a lot of resources that compare "S3" vs "HDFS", and surprisingly there are no clear resources on "EFS" vs "HDFS", which leaves me unsure whether they are really substitutes for each other or are complementary.
  6. For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to have an EFS mount as an HDFS directory?
  7. "Using EBS volumes for HDFS prevents data locality" - What does it mean to use "EBS volume" for HDFS?
  8. What does it mean to run "HDFS in the cloud"?
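To put the small-files concern above into rough numbers, here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (a file entry or a block); that figure and the helper name `namenode_heap_bytes` are assumptions for illustration, not measurements from any cluster.

```python
# Rough NameNode heap estimate for a large number of small files.
# ASSUMPTION: ~150 bytes of heap per namespace object (file entry or block).
# This is a widely quoted rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate heap used by file and block objects (directories ignored)."""
    # A file smaller than the HDFS block size still needs one file entry
    # plus one block object.
    objects_per_file = 1 + blocks_per_file
    return num_files * objects_per_file * BYTES_PER_OBJECT


if __name__ == "__main__":
    for n in (1_000_000, 10_000_000, 100_000_000):
        gib = namenode_heap_bytes(n) / 2**30
        print(f"{n:>11,} files -> ~{gib:.2f} GiB of NameNode heap")
```

Under this assumption the cost grows with the number of files, not with their total size, which is why many sub-1 MB files are a worse fit for HDFS than the same data packed into fewer large files.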

uqcuzwp8 #1

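Any of these storage types would work, but since your case is a one-time job, you need to choose based on the following:
1. Cost optimization
2. Good performance
3. Security
I can't answer all of your questions, but regarding your use case, I assume you consume the data from EC2 instances. If you describe how these files are generated and processed, and the size of each file, maybe I can help you better.
Considerations: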

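  1. EBS has provisioned or otherwise limited throughput, and it forces you to provision the volume up front and delete the data after processing. FYI: you can set the EBS volume to be deleted when the EC2 instance is terminated, not when it is merely shut down.
    If you really need the fastest option and cost is not a concern, EBS is a good idea, provided you provision it carefully, because you are charged for the volume's lifetime and its storage.
  2. EFS is NAS storage, and it also requires you to delete the data after processing.
  3. HDFS is a distributed file system and the best choice for petabyte-scale distributed storage, but it is not suited to a one-time job, since you need to install and configure it.
  4. I personally recommend S3: you have no throughput limit to provision, you can reach up to 25 Gbps using a VPC endpoint, and you can use S3 lifecycle policies to delete data automatically based on tags or after a set number of days (for example, 1 to 365), or to archive it as needed.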
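The tag-based lifecycle deletion mentioned above could look roughly like this. It is a minimal sketch, assuming boto3 is used to talk to S3; the tag `stage=processed`, the 7-day retention, and the helper names `build_expiration_rule` and `apply_rule` are hypothetical choices for illustration.

```python
import json


def build_expiration_rule(tag_key: str, tag_value: str, days: int) -> dict:
    """Build a lifecycle configuration that expires tagged objects after `days` days."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-{tag_key}-{tag_value}-after-{days}d",
                # Only objects carrying this tag are affected by the rule.
                "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_rule(bucket: str, config: dict) -> None:
    """Send the configuration to S3. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    # Hypothetical tag and retention period; nothing is sent to AWS here.
    print(json.dumps(build_expiration_rule("stage", "processed", 7), indent=2))
```

Note that `put_bucket_lifecycle_configuration` replaces the bucket's existing lifecycle configuration, so any rules already in place would need to be included in the `Rules` list as well.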
