10 Things You Might Not Know About Using S3
Authors: Joshua Levy and Stefan Zier
Date: Nov 1, 2016
Almost everyone who’s used Amazon Web Services has used S3. In the decade since it was first released, S3 storage has become essential to thousands of companies for file storage. While using S3 in simple ways is easy, at larger scale it involves a lot of subtleties and potentially costly mistakes, especially when your data or team are scaling up.
Sadly, as with much of AWS, we often learn some of these tips the hard way, when we’ve made oversights or wish we’d done things differently. Without further ado, here are the ten things about S3 that will help you avoid costly mistakes. We’ve assembled these tips from our own experience and the collective wisdom of several engineering friends and colleagues.
If we missed something or you disagree with anything, please comment below or Tweet at us.
Tip 1: Get data into and out of S3 faster
Getting data into and out of S3 takes time. If you’re moving data on a frequent basis, there’s a good chance you can speed it up. Cutting down time you spend uploading and downloading files can be remarkably valuable in indirect ways — for example, if your team saves 10 minutes every time you deploy a staging build, you are improving engineering productivity significantly.
S3 is highly scalable, so in principle, with a big enough pipe or enough instances, you can get arbitrarily high throughput. A good example is S3DistCp, which uses many workers and instances. But almost always you’re hit with one of two bottlenecks:
- The size of the pipe between the source (typically a server on premises or EC2 instance) and S3.
- The level of concurrency used for requests when uploading or downloading (including multipart uploads).
The first takeaway from this is that regions and connectivity matter. Obviously, if you’re moving data within AWS via an EC2 instance, such as off of an EBS volume, you’re better off if your EC2 instance and S3 region correspond. More surprisingly, even when moving data within the same region, Oregon (a newer region) comes in faster than Virginia on some benchmarks (source).
If your servers are in a major data center but not in EC2, you might consider using DirectConnect ports to get significantly higher bandwidth (you pay per port). Alternately, you can use S3 Transfer Acceleration (discussed more on Hacker News) to get data into AWS faster simply by changing your API endpoints. You have to pay for that too, the equivalent of 1-2 months of storage cost for the transfer in either direction. Check speeds with the comparison tool. For distributing content quickly to users worldwide, remember you can use BitTorrent support, CloudFront, or another CDN with S3 as its origin.
Secondly, instance types matter. If you’re using EC2 servers, some instance types have higher bandwidth network connectivity than others. You can see this if you sort by “Network Performance” on the excellent ec2instances.info list.
Thirdly, and critically if you are dealing with lots of items, concurrency matters. Each S3 operation is an API request with significant latency — tens to hundreds of milliseconds, which adds up to pretty much forever if you have millions of objects and try to work with them one at a time. So what determines your overall throughput in moving many objects is the concurrency level of the transfer: How many worker threads (connections) on one instance and how many instances are used.
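To put rough numbers on "pretty much forever", here is a back-of-the-envelope sketch. The object count, the 100 ms latency, and the 200 connections are illustrative assumptions, and the concurrent estimate is idealized (it ignores bandwidth limits and throttling).

```python
def sequential_hours(num_objects: int, latency_ms: float) -> float:
    """Hours needed if each request waits for the previous one to finish."""
    return num_objects * latency_ms / 1000 / 3600


def concurrent_hours(num_objects: int, latency_ms: float, connections: int) -> float:
    """Idealized estimate: requests are spread evenly over `connections` workers."""
    return sequential_hours(num_objects, latency_ms) / connections


if __name__ == "__main__":
    # Assumed figures for illustration: 10 million objects, 100 ms per request.
    print(f"1 connection:    {sequential_hours(10_000_000, 100):.1f} hours")
    print(f"200 connections: {concurrent_hours(10_000_000, 100, 200):.1f} hours")
```

Under those assumptions, one connection needs about 278 hours, while 200 connections bring that down to roughly 1.4 hours.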
Many common S3 libraries (including the widely used s3cmd; see this issue) do not by default make many connections at once to transfer data. Both s4cmd and AWS’ own aws-cli (discussed here) do make concurrent connections, and are much faster for many files or large transfers (since multipart uploads allow parallelism). Another approach is with EMR, using Hadoop to parallelize the problem. For multipart uploads on a higher-bandwidth network, a reasonable part size is 25–50MB. It’s also possible to list objects much faster if you traverse a folder hierarchy or other prefix hierarchy in parallel.
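If you are writing your own transfer script rather than using one of these tools, the core idea is a pool of workers. The following is a minimal sketch using only the Python standard library; `transfer_all` and `transfer_one` are hypothetical names, and the default of 32 workers is an assumption you would tune for your instance and network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def transfer_all(
    keys: Iterable[str],
    transfer_one: Callable[[str], None],
    max_workers: int = 32,
) -> Dict[str, Optional[Exception]]:
    """Run transfer_one(key) for every key on a pool of worker threads.

    Returns a dict mapping each key to None on success, or to the
    exception that transfer_one raised for that key.
    """
    results: Dict[str, Optional[Exception]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(transfer_one, key): key for key in keys}
        for future in as_completed(futures):
            # exception() returns None when the call finished cleanly.
            results[futures[future]] = future.exception()
    return results
```

Here `transfer_one` would wrap whatever single-object call you already use, for example a function that calls boto3's `upload_file` for one key. Collecting failures per key instead of stopping at the first error makes it easy to retry only what failed.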
Finally, if you really have a ton of data to move in batches, just ship it.
Tip 2: Think through data lifecycles up front
Okay, we might have gotten ahead of ourselves. Before you put something in S3 in the first place, there are several things to think about. One of the most important is a simple question:
When and how should this object be deleted?
Remember, large data will probably expire — that is, the cost of paying Amazon to store it in its current form will become higher than the expected value it offers your business. You might re-process or aggregate data from long ago, but it’s unlikely you want raw unprocessed logs or builds or archives forever.
At the time you are saving a piece of data, it may seem like you can just decide later. Most files are put in S3 by a regular process via a server, a data pipeline, a script, or even repeated human processes — but you’ve got to think through what’s going to happen to that data over time. In our experience, most S3 users don’t consider lifecycle up front, which means mixing files that have short lifecycles together with ones that have longer ones. By doing this you incur significant technical debt around data organization (or equivalently, monthly debt to Amazon!).
Once you know the answers, you’ll find managed lifecycles and S3 object tagging are your friends. In particular, you want to delete or archive based on object tags, so it’s wise to tag your objects appropriately so that it is easier to apply lifecycle policies. Note that S3 allows at most 10 tags per object, and each tag key can be at most 128 Unicode characters long. (We’ll return to this in Tip 4 and Tip 5.)
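Because the tag limits are easy to trip over in a pipeline, it can help to check a tag set before you send it. This is a small sketch, not an official validator: `tag_problems` is a hypothetical helper, and besides the limits mentioned above it assumes the documented 256-character limit on tag values.

```python
from typing import Dict, List

MAX_TAGS_PER_OBJECT = 10
MAX_KEY_CHARS = 128
MAX_VALUE_CHARS = 256  # assumption: documented S3 limit for tag values


def tag_problems(tags: Dict[str, str]) -> List[str]:
    """Return human-readable problems with a tag set (empty list if none)."""
    problems = []
    if len(tags) > MAX_TAGS_PER_OBJECT:
        problems.append(
            f"{len(tags)} tags exceeds the limit of {MAX_TAGS_PER_OBJECT}"
        )
    for key, value in tags.items():
        if not key:
            problems.append("tag key must not be empty")
        elif len(key) > MAX_KEY_CHARS:
            problems.append(f"tag key {key[:20]!r}... is longer than {MAX_KEY_CHARS}")
        if len(value) > MAX_VALUE_CHARS:
            problems.append(f"value for tag {key[:20]!r} is longer than {MAX_VALUE_CHARS}")
    return problems
```

A tag set like `{"lifecycle": "raw", "realm": "production"}` passes cleanly; the tag names themselves are just an example convention.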
You’ll also want to consider compression schemes. For large data that isn’t already compressed, you almost certainly want to compress it: S3 bandwidth and cost constraints generally make compression worth it. (Also consider what tools will read it. EMR supports specific formats like gzip, bzip2, and LZO, so it helps to pick a compatible convention.)
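As a minimal sketch of compressing before upload, here is a gzip round trip using only the standard library. The function names and the compression level of 6 are assumptions for illustration; the upload and download calls themselves are left out.

```python
import gzip


def compress_for_upload(raw: bytes, level: int = 6) -> bytes:
    """Gzip a payload before it is written to S3."""
    return gzip.compress(raw, compresslevel=level)


def decompress_after_download(blob: bytes) -> bytes:
    """Reverse compress_for_upload on a payload read back from S3."""
    return gzip.decompress(blob)


if __name__ == "__main__":
    # Repetitive log-like data (made up for this example) compresses very well.
    raw = b"2016-11-01 12:00:00 INFO request served\n" * 10_000
    packed = compress_for_upload(raw)
    print(f"{len(raw)} bytes -> {len(packed)} bytes")
```

Whatever scheme you pick, a consistent suffix such as `.gz` on the key makes it obvious to downstream tools how to read the object.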
Another question to ask yourself is:
When and how is this object modified?
As with many engineering problems, prefer immutability when possible — design so objects are never modified, but only created and later deleted. However, sometimes mutability is necessary. If S3 is your sole copy of mutable log data, you should seriously consider some sort of backup — or locate the data in a bucket with versioning enabled.
If all this seems like it’s a headache and hard to document, it’s a good sign no one on the team understands it. By the time you scale to terabytes or petabytes of data and dozens of engineers, it’ll be more painful to sort out.
Tip 3: Prioritize access control, encryption, and compliance
This tip is the least sexy, and possibly the most important one here. Before you put something into S3, ask yourself the following questions:
- Are there people who should not be able to modify this data?
- Are there people who should not be able to read this data?
- How are the latter access rules likely to change in the future?
- Should the data be encrypted? (And if so, where and how will we manage the encryption keys?)
- Are there specific compliance requirements?
There’s a good chance your answers are, “I’m not sure. Am I really supposed to know that?”
Well … yes, you have to.
Some data is completely non-sensitive and can be shared with any employee. For these scenarios the answers are easy: Just put it into S3 without encryption or complex access policies. However, every business has sensitive data — it’s just a matter of which data, and how sensitive it is. Determine whether the answers to any of these questions are “yes.”
The compliance question can be confusing.
- Does the data you’re storing contain financial, PII, cardholder, or patient information?
- Do you have PCI, HIPAA, SOX, or EU Safe Harbor compliance requirements? (The latter has become rather complex recently.)
- Do you have customer data with restrictive agreements in place? For example, are you promising customers that their data is encrypted at rest and in transit?
If the answer to any of these is yes, you may need to work with (or become!) an expert on the relevant type of compliance and bring in services or consultants to help if necessary.
Minimally, you’ll probably want to store data with different needs in separate S3 buckets, regions, and/or AWS accounts, and set up documented processes around encryption and access control for that data. AWS has a lot of info on compliance.
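One way to back up an "encrypted at rest and in transit" promise is a bucket policy that rejects anything else. The sketch below builds such a policy as a Python dict. It assumes you use S3-managed server-side encryption (the `AES256` header value); if you use KMS keys the condition would differ. The bucket name is a placeholder.

```python
import json
from typing import Any, Dict


def encryption_policy(bucket: str) -> Dict[str, Any]:
    """Bucket policy that denies unencrypted uploads and non-HTTPS access."""
    objects = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject uploads that do not request SSE-S3 encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": objects,
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
                },
            },
            {
                # Reject any request that does not arrive over TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}", objects],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


if __name__ == "__main__":
    # "example-sensitive-bucket" is a hypothetical bucket name.
    print(json.dumps(encryption_policy("example-sensitive-bucket"), indent=2))
```

The printed JSON can be attached to the bucket through the console or API. Treat it as a starting point to review with whoever owns compliance, not as a complete control.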
It’s not fun digging through all this when all you want to do is save a little bit of data, but trust us, thinking about it early will save you trouble in the long run.
Tip 4: Nested S3 folder organization is great — except when it isn’t
Newcomers to S3 are always surprised to learn that latency on S3 operations depends on key names, since prefix similarities become a bottleneck at more than about 100 requests per second. If you need high volumes of operations, it is essential to consider naming schemes with more variability at the beginning of the key names, like alphanumeric or hex hash codes in the first 6 to 8 characters, to avoid internal “hot spots” within S3 infrastructure.
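A hash-based naming scheme can be as simple as prepending a few hex characters derived from the key itself. This is a sketch under assumptions: `hashed_key` is a hypothetical helper, SHA-256 is just one reasonable hash choice, and the `/` separator is a convention, not a requirement.

```python
import hashlib


def hashed_key(natural_key: str, prefix_chars: int = 6) -> str:
    """Prefix a key with hex characters from its own hash.

    The prefix is deterministic, so the full key can always be recomputed
    from the natural key without a lookup table.
    """
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_chars]}/{natural_key}"


if __name__ == "__main__":
    # Hypothetical keys: similar names get unrelated prefixes.
    for name in ("logs/2016/11/01/app-0001.log", "logs/2016/11/01/app-0002.log"):
        print(hashed_key(name))
```

The trade-off is that objects which belong together no longer share a leading prefix, so listing "everything from November" by prefix stops working.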
This used to be in conflict with Tip 2 before the announcement of new S3 storage management features such as object tagging. If you’ve thought through your lifecycles, you probably want to tag objects so you can automatically delete or transition objects based on tags, for example setting a policy like “archive everything with object tag raw to Glacier after 3 months.”
There’s no magic bullet here, other than to decide up front which you care about more for each type of data: Easy-to-manage policies or high-volume random-access operations?
A related consideration for how you organize your data is that it’s extremely slow to crawl through millions of objects without parallelism. Say you want to tally up your usage on a bucket with ten million objects. Well, if you don’t have any idea of the structure of the data, good luck! If you have sane tagging, or if you have uniformly distributed hashes with a known alphabet, the crawl can be parallelized.
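With hex-hash prefixes, the known alphabet gives you a natural way to split the crawl. The sketch below enumerates every prefix of a given length and tallies sizes in parallel. `list_sizes` is a hypothetical callable you supply that yields object sizes under one prefix; the worker count is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Callable, Iterable, List

HEX_ALPHABET = "0123456789abcdef"


def hex_prefixes(length: int = 2) -> List[str]:
    """All hex prefixes of the given length: 16 ** length of them."""
    return ["".join(chars) for chars in product(HEX_ALPHABET, repeat=length)]


def total_bytes(
    list_sizes: Callable[[str], Iterable[int]],
    prefixes: Iterable[str],
    max_workers: int = 16,
) -> int:
    """Sum object sizes across prefixes, listing each prefix on its own thread."""

    def sum_prefix(prefix: str) -> int:
        return sum(list_sizes(prefix))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(sum_prefix, prefixes))
```

In practice `list_sizes` could wrap boto3's `list_objects_v2` paginator with a `Prefix` argument and yield each object's `Size`. Each prefix then becomes an independent listing that does not wait on the others.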
Tip 5: Save money with Reduced Redundancy, Infrequent Access, or Glacier
S3’s “Standard” storage class offers very high durability (it advertises 99.999999999% durability, or “eleven 9s”), high availability, low latency access, and relatively cheap access cost.
There are three ways you can store data with lower cost per gigabyte:
- S3’s Reduced Redundancy Storage (RRS) has lower durability (99.99%, so just four nines). That is, there’s a good chance you’ll lose a small amount of data. For some datasets where data has value in a statistical way (losing say half a percent of your objects isn’t a big deal), this is a reasonable trade-off.
- S3’s Infrequent Access (IA) (confusingly also called “Standard – Infrequent Access”) lets you get cheaper storage in exchange for more expensive access. This is great for archives like logs you already processed but might want to look at later.
- Glacier gives you much cheaper storage with much slower and more expensive access. It is intended for archival usage.
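To make the durability figures concrete, here is a rough calculation. It assumes the advertised durability can be read as the chance that any one object survives a year, which is a simplification; the ten million objects are an example figure.

```python
def expected_annual_losses(num_objects: int, durability: float) -> float:
    """Expected number of objects lost per year at a given durability."""
    return num_objects * (1.0 - durability)


if __name__ == "__main__":
    objects = 10_000_000  # example figure
    print(f"RRS (four 9s):        {expected_annual_losses(objects, 0.9999):.0f}")
    print(f"Standard (eleven 9s): {expected_annual_losses(objects, 0.99999999999):.4f}")
```

Under that reading, ten million objects in RRS lose about 1,000 objects a year on average, while the same objects in Standard lose about 0.0001.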
A common policy that saves money is to set up managed lifecycles that migrate Standard storage to IA and then from IA to Glacier.
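Such a policy can be written as a lifecycle rule. The sketch below builds one in the shape S3's lifecycle configuration API accepts, filtered on an object tag. The helper names and the 30-day and 90-day thresholds are assumptions for illustration; pick thresholds that match how your data is actually read.

```python
from typing import Any, Dict, Optional


def tiering_rule(
    tag_key: str,
    tag_value: str,
    ia_after_days: int = 30,
    glacier_after_days: int = 90,
    expire_after_days: Optional[int] = None,
) -> Dict[str, Any]:
    """Lifecycle rule: Standard -> IA -> Glacier for objects with one tag."""
    if not ia_after_days < glacier_after_days:
        raise ValueError("objects must move to IA before they move to Glacier")
    rule: Dict[str, Any] = {
        "ID": f"tier-{tag_key}-{tag_value}",
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_after_days is not None:
        # Optional final step: delete the object entirely.
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


def lifecycle_configuration(*rules: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap rules in the top-level structure the API expects."""
    return {"Rules": list(rules)}
```

With boto3, the result of `lifecycle_configuration(...)` would be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. The tag used in the filter (say `lifecycle=raw`) is whatever convention your team has agreed on.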
Tip 6: Organize S3 data along the right axes
One of the most common oversights is to organize data in a way that causes business risks or costs later. You might initially assume data should be stored according to the type of data, or the product, or by team, but often that’s not enough. It’s usually best to organize your data into different buckets and paths at the highest level not on what the data is itself, but rather by considering these axes:
- Sensitivity: Who can and cannot access it? (E.g. is it helpful for all engineers or only a few admins?)
- Compliance: What are necessary controls and processes? (E.g. is it PII?)
- Lifecycle: How will it be expired or archived? (E.g. is it verbose logs only needed for a month, or important financial data?)
- Realm: Is it for internal or external use? For development, testing, staging, production?
- Visibility: Do I need to track usage for this category of data exactly?
We’ve already discussed the first three. The concept of a realm is just that you often want to partition things in terms of process: For example, to make sure no one puts test data into a production location. It’s best to assign buckets and prefixes by realm up front.
The final point is a technical one: If you want to track usage, AWS offers easy usage reporting at the bucket level. If you put millions of objects in one bucket, tallying usage by prefix or other means can be cumbersome at best. So consider individual buckets where you want to track significant S3 usage, or use a log analytics solution like Sumo Logic to analyze your S3 logs.
Tip 7: Don’t bake S3 locations into your code
This is pretty simple, but it comes up a lot. Don’t hard-code S3 locations in your code. This is tying your code to deployment details, which is almost guaranteed to hurt you later. You might want to deploy multiple production or staging environments. Or you might want to migrate all of one kind of data to a new location, or audit which pieces of code access certain data.
Decouple code and S3 locations. Especially if you follow Tip 6, this will also help with test releases, or unit or integration tests so they use different buckets, paths, or mocked S3 services. Set up some sort of configuration file or service, and read S3 locations like buckets and prefixes from that.
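As one possible shape for this, the sketch below reads buckets and prefixes from a JSON document keyed by realm. The config layout, the `S3Location` class, and the names in the example are all hypothetical; a YAML file or a config service would work just as well.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class S3Location:
    bucket: str
    prefix: str = ""

    def key_for(self, name: str) -> str:
        """Full object key for a file name under this location's prefix."""
        if not self.prefix:
            return name
        return f"{self.prefix.rstrip('/')}/{name}"


def load_locations(config_text: str, realm: str) -> Dict[str, S3Location]:
    """Parse the JSON config and return the named locations for one realm."""
    config = json.loads(config_text)
    return {name: S3Location(**fields) for name, fields in config[realm].items()}


# Hypothetical config; in practice this lives in a file or config service.
EXAMPLE_CONFIG = """
{
  "staging":    {"builds": {"bucket": "example-staging-builds", "prefix": "ci/"}},
  "production": {"builds": {"bucket": "example-prod-builds"}}
}
"""

if __name__ == "__main__":
    builds = load_locations(EXAMPLE_CONFIG, "staging")["builds"]
    print(builds.bucket, builds.key_for("app-1.2.3.tar.gz"))
```

Code then asks for a location by name ("builds") and never sees a bucket string, so pointing tests at a different bucket or a mocked service is a config change.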
Tip 8: You can deploy your own testing or production alternatives to S3
There are many services that are (more or less) compatible with S3 APIs. This is helpful both for testing and for migration to local storage. Commonly used tools for small test deployments are S3Proxy (Java) and FakeS3 (Ruby), which can make it far easier and faster to test S3-dependent code in isolation. More full-featured object storage servers with S3 compatibility include Minio (in Go), Ceph (C++), and Riak CS (Erlang).
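Switching between real S3 and a compatible service usually comes down to which endpoint the client talks to. Here is a minimal sketch that keeps that choice out of the code. The `S3_ENDPOINT_URL` variable name and the helper are assumptions; `endpoint_url` is the boto3 client argument for overriding the endpoint.

```python
import os
from typing import Dict, Mapping, Optional


def s3_client_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Keyword arguments for creating an S3 client.

    If S3_ENDPOINT_URL (a hypothetical variable name) is set, the client is
    pointed at that S3-compatible service; otherwise the default AWS
    endpoint is used.
    """
    env = os.environ if env is None else env
    kwargs: Dict[str, str] = {}
    endpoint = env.get("S3_ENDPOINT_URL", "").strip()
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs
```

With boto3 this would be used as `boto3.client("s3", **s3_client_kwargs())`. A test run can then export the variable to point at a local test server, while production leaves it unset.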
Many large enterprises have private cloud needs and deploy AWS-compatible cloud components, including layers corresponding to S3, in their own private clouds, using Eucalyptus and OpenStack. These are not quick and easy to set up but are mature open source private cloud systems. See Eucalyptus Storage and OpenStack Swift.
Tip 9: Check out newer tools for mapping filesystem and S3 data
So can you use S3 as a filesystem? One tool that’s been around a long time is s3fs, the FUSE filesystem that lets you mount S3 as a regular filesystem in Linux and Mac OS. Disappointingly, it turns out this is often more of a novelty than a good idea, as S3 doesn’t offer all the right features to make it a robust filesystem: Appending to a file requires rewriting the whole file, which cripples performance; there is no atomic rename of directories or mutual exclusion on opening files; and there are a few other issues.
That said, there are some other solutions that use a different object format and allow filesystem-like access. Riofs (C) and Goofys (Go) are more recent implementations that are generally improvements on s3fs. S3QL (discussed here) is a Python implementation that offers data de-duplication, snapshotting, and encryption. It only supports one client at a time, however. A commercial solution that offers lots of filesystem features and concurrent clients is ObjectiveFS (discussed here).
Another use case is filesystem backups to S3. The standard approach is to use EBS volumes and use snapshots for incremental backups, but this does not fit every use case. Open source backup and sync tools that can be used in conjunction with S3 include zbackup (deduplicating backups, inspired by rsync, in C++, analyzed here), restic (deduplicating backups, in Go), borg (deduplicating backups, in Python), and rclone (data syncing to cloud).
Tip 10: Don’t use S3 if another solution is better
Consider that S3 may not be the optimal choice for your use case. As discussed, Glacier and cheaper S3 variants are great for lowering cost. EBS and EFS can be much more suitable for random-access data, but cost 3 to 10 times more per gigabyte (see the table above). Traditionally, EBS (with regular snapshots) is the option of choice if you need a filesystem abstraction in AWS. Remember EBS has a very high failure rate compared to S3 (0.1-0.2% per year), so you need to use regular snapshots. You can only attach an EBS volume to one instance at a time. However, with the release of EFS, AWS’ new network file service (NFS v4.1), there is another option that allows up to thousands of EC2 instances to connect to the same drive concurrently, if you can afford it.
Of course, if you’re willing to store data outside AWS, the directly competitive cloud options include Google Cloud Storage, Azure Blob Storage, Rackspace Cloud Files, EMC Atmos, and BackBlaze B2. BackBlaze (discussed here) has a different architecture that offloads some work to the client, and is significantly cheaper.
Bonus Tips: Two S3 issues you no longer need to worry about
A few AWS “gotchas” are significant enough people remember them years later, even though they are no longer relevant. Two long-hated S3 limitations you might remember or have heard rumors of have (finally!) gone away:
- For many years, there was a hard 100-bucket limit per account, which caused many companies significant pain. You’d blithely be adding buckets, and then slam into this limit, and be stuck until you created a new account or consolidated buckets. As of 2015, it can now be raised if you ask Amazon nicely. (Up to 1000 per account; it’s still not unlimited as buckets are in a global namespace.)
- For a long time, the data consistency model in the original ‘us-standard’ region was different and more lax than in the other (newer) S3 regions. Since 2015, this is no longer the case. All regions have read-after-write consistency.