Consistent Distributed File Systems
Historically, I’ve used standard HDFS, MapR’s version of HDFS (MapR-FS), and ADLS (Azure’s data lake service). All of these behave very much like you would expect a local file system to. If you write files and another process lists files, it will immediately see them and be able to use them without issue.
Amazon s3 File System Issues
I was surprised when I started learning about Amazon s3 after using all of these prior file systems. I understand that s3 is an object store… similar to Azure Blob Storage. I also understand that it is the main data lake solution though.
Maybe it’s just because I’m new and am missing something… but there doesn’t seem to be any AWS version of ADLS.
The s3 storage service is eventually consistent. This means that if you run Spark, or similar tools on it, they will likely produce improper results or fail. This is because multiple tasks will write files in parallel and list them and they won’t necessarily get the fully up to date view of the storage. So, they may write 10 files, list them, and see 5 files, etc.
I came across a very good article describing this in detail here: https://www.opendoor.com/w/blog/why-s3guard-with-s3-as-a-filesystem-spark.
The TLDR is that you have to use a consistency layer between your big data frameworks and s3 to ensure they function well. You can confirm this by reading the short hadoop documentation site here -> https://wiki.apache.org/hadoop/AmazonS3.
Note that the first article recommends S3Guard which works based on DynamoDB, but there may be other options (e.g. EMR will have a way of dealing with this).