Our business requirement is to read millions of files and process them in parallel (and later index them in Elasticsearch). This is a one-time operation; after processing we won't read those million files again. We want to distribute the file storage while still ensuring data retention. I did some research and made the following list:
- EBS: Data is retained even after the EC2 instance is shut down. A volume is accessible from a single EC2 instance in our AWS region, so it only works for us if we split the data into chunks ourselves and attach a volume to each of our servers. It offers redundancy and encryption and is easy to scale.
- EFS: Lets us mount the file system on multiple EC2 instances (across Availability Zones within a region). Since EFS is a managed service, we don't have to worry about deploying and maintaining the file system ourselves.
- S3: Access is not limited to EC2, but S3 is an object store, not a file system.
- HDFS: Extremely good at scale, but only performant with double or triple replication. Scaling HDFS down is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." I'm not sure how big a concern this is, considering our servers are fairly well secured.
- The small-files problem in Hadoop, explained at https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/. Considering most of the files we receive are less than 1 MB, this can cause memory pressure on the NameNode once we go beyond a certain number of files, so HDFS will not give us the performance we think it should (a rough back-of-the-envelope estimate follows this list).
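To put the small-files concern in perspective, here is a minimal sketch of the usual back-of-the-envelope estimate, assuming the commonly cited figure of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); the 10-million-file count below is a placeholder, not our actual number:

```python
# Rough NameNode memory estimate for the HDFS small-files problem.
# Assumption: each namespace object (file, directory, or block) costs
# roughly 150 bytes of NameNode heap -- a widely quoted rule of thumb,
# not an exact figure.

BYTES_PER_NAMESPACE_OBJECT = 150

def namenode_heap_estimate(num_files: int, blocks_per_file: int = 1) -> float:
    """Return an approximate NameNode heap requirement in GiB."""
    objects = num_files * (1 + blocks_per_file)  # one file object + its blocks
    return objects * BYTES_PER_NAMESPACE_OBJECT / (1024 ** 3)

if __name__ == "__main__":
    # Hypothetical: 10 million files, each under 1 MB, so one block each.
    print(f"{namenode_heap_estimate(10_000_000):.2f} GiB of NameNode heap")
    # -> roughly 2.79 GiB of metadata before any actual data is read
```

The heap use itself may be manageable, but it grows linearly with file count, and each tiny file also becomes its own block and its own map task, which is where the performance loss described in the linked article comes from.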
My confusion is around HDFS: I went through a lot of resources that compare "S3" vs "HDFS", but surprisingly there are no clear resources on "EFS" vs "HDFS", so I can't tell whether they are really substitutes for each other or complementary.
- For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to use an EFS mount as an HDFS directory?
- "Using EBS volumes for HDFS prevents data locality" - What does it mean to use "EBS volume" for HDFS?
- What does it mean to run "HDFS in the cloud"?
References
- https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
- https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
- https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
- https://data-flair.training/blogs/13-limitations-of-hadoop/
1 Answer
Answer by uqcuzwp81:
Any type of storage could work, but since your case is a one-time job, you should choose based on:
1. Cost optimization
2. Performance
3. Security
I can't answer all of your questions, but regarding your use case, I assume you consume the data from EC2 instances. If you can share how these files are generated and processed, and how big each file is, maybe I can help you better.
Things to consider:
1. If you really need the fastest option and don't care about cost, EBS is a good idea, but provision it carefully, because you are billed for the provisioned capacity for as long as the volume exists.
2. I would personally suggest S3, because there is no limit on your throughput and, using a VPC endpoint, you can reach rates of up to 25 Gbps. You can also use S3 lifecycle policies to automatically delete data based on tags, delete it after 1 to 365 days, or archive it as needed (a rough sketch of the S3 approach follows).
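A minimal sketch of the S3 route for this workload, assuming the boto3 and elasticsearch Python clients: list the objects under a prefix, read them in parallel, and bulk-index them into Elasticsearch. The bucket name, prefix, index name, and the parse_document() helper are hypothetical placeholders.

```python
"""Sketch: read objects from S3 in parallel and bulk-index them into
Elasticsearch. Bucket, prefix, index name, and parse_document() are
placeholders for this example."""
from concurrent.futures import ThreadPoolExecutor

import boto3
from elasticsearch import Elasticsearch, helpers

BUCKET = "my-ingest-bucket"   # placeholder
PREFIX = "incoming/"          # placeholder
INDEX = "documents"           # placeholder

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")  # adjust for your cluster


def list_keys(bucket: str, prefix: str):
    """Yield every object key under the prefix, page by page."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def parse_document(key: str, body: bytes) -> dict:
    """Placeholder for whatever per-file processing is needed."""
    return {"key": key, "size": len(body), "content": body.decode("utf-8", "replace")}


def fetch_and_parse(key: str) -> dict:
    """Download one object and turn it into a bulk-index action."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return {"_index": INDEX, "_id": key, "_source": parse_document(key, body)}


def main():
    keys = list_keys(BUCKET, PREFIX)
    # Download and parse concurrently; bulk-index as actions stream in.
    with ThreadPoolExecutor(max_workers=32) as pool:
        actions = pool.map(fetch_and_parse, keys)
        ok, errors = helpers.bulk(es, actions, chunk_size=500, raise_on_error=False)
    print(f"indexed {ok} documents, {len(errors)} errors")


if __name__ == "__main__":
    main()
```

Threads are enough here because the work is I/O-bound; for CPU-heavy parsing, a process pool or several worker machines would fit better. Once everything is indexed, a lifecycle rule with an expiration action, as mentioned above, can clean the bucket up automatically.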