AWS - Storage Service Detail - Notes - by Rohit Singh
S3 Simplified:
S3 provides developers and IT teams with secure, durable, and highly scalable object
storage. Object storage, as opposed to block storage, is a general term that refers to
data composed of three things:
1.) the data itself
2.) metadata describing the data
3.) a globally unique identifier (the key)
This makes it a perfect candidate to host files or directories and a poor candidate to
host databases or operating systems. The key differences between object and block
storage:
o Object storage manages data as whole objects; changing any part of an object
means rewriting the entire object. It suits static files that are written once and
read often.
o Block storage splits data into fixed-size blocks and can rewrite individual
blocks in place, which is what databases and operating systems need.
Data uploaded into S3 is spread across multiple devices and facilities. Files uploaded
into S3 have an upper bound of 5 TB per file, and the number of files that can be
uploaded is virtually limitless. S3 buckets, which contain all files, are named in a
universal namespace so uniqueness is required. All successful uploads will return an
HTTP 200 response.
S3 Key Details:
Objects (regular files or directories) are stored in S3 with a key, value, version
ID, and metadata. They can also contain torrents and subresources for access
control lists, which are essentially permissions for the object itself.
The data consistency model for S3 ensures immediate read access for new
objects after the initial PUT request (read-after-write consistency). These new
objects are introduced into AWS for the first time and thus do not need to be
updated anywhere, so they are available immediately.
Since December 2020, the data consistency model for S3 also ensures
immediate read access (strong consistency) for PUTS and DELETES of already
existing objects.
Amazon guarantees 99.999999999% (or 11 9s) durability for all S3 storage
classes except its Reduced Redundancy Storage class.
S3 comes with the following main features:
1.) tiered storage and pricing variability
2.) lifecycle management to expire older content
3.) versioning for version control
4.) encryption for privacy
5.) MFA delete to protect objects from accidental deletion
6.) access control lists & bucket policies to secure the data
S3 charges by:
1.) storage size
2.) number of requests
3.) storage management pricing (inventory, analytics, and tags)
4.) data transfer pricing (objects leaving/entering AWS via the internet)
5.) transfer acceleration (an optional speed increase for moving objects via
Cloudfront)
Bucket policies secure data at the bucket level while access control lists secure
data at the more granular object level.
By default, all newly created buckets are private.
S3 can be configured to create access logs, which can be shipped into another
bucket in the current account or even a separate account altogether. This
makes it easy to monitor who accesses what inside S3.
There are 3 different ways to share S3 buckets across AWS accounts:
1.) For programmatic access only, use IAM & Bucket Policies to share entire
buckets
2.) For programmatic access only, use ACLs & Bucket Policies to share objects
3.) For access via the console & the terminal, use cross-account IAM roles
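Sharing option 1 above hinges on a bucket policy that names the other account as a principal. A minimal sketch follows; the account ID and bucket name are placeholders, not real resources, and the JSON produced is what you would paste into the bucket's policy editor (or pass to an SDK's put-bucket-policy call):

```python
import json

# Hypothetical bucket policy granting a second AWS account read access
# to every object in "example-shared-bucket". The account ID and bucket
# name are made up for illustration.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-shared-bucket",
                "arn:aws:s3:::example-shared-bucket/*",
            ],
        }
    ],
}

# Serialize to the JSON document S3 actually stores.
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Note that `s3:ListBucket` applies to the bucket ARN while `s3:GetObject` applies to the object ARN, which is why both resources appear in the statement.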
S3 is a great candidate for static website hosting. When you enable static
website hosting for S3 you need both an index.html file and an error.html file.
Static website hosting creates a website endpoint that can be accessed via the
internet.
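The website endpoint created by static hosting follows a predictable pattern. A small helper, assuming the dash-style endpoint format used by regions such as us-east-1 (some newer regions use a dot before the region instead):

```python
def website_endpoint(bucket: str, region: str) -> str:
    """Build the S3 static-website endpoint URL for a bucket.

    Assumes the dash form ("s3-website-us-east-1"); a handful of newer
    regions use "s3-website.<region>" with a dot instead.
    """
    return f"http://{bucket}.s3-website-{region}.amazonaws.com"

print(website_endpoint("my-demo-site", "us-east-1"))
```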
When you upload new files and have versioning enabled, they will not inherit
the properties of the previous version.
S3 Storage Classes:
S3 Infrequently Accessed (IA) - For data that is needed less often, but when it is
needed the data should be available quickly. The storage fee is cheaper, but you are
charged for retrieval.
S3 Glacier - low-cost storage class for data archiving. This class is for pure storage
purposes where retrieval isn't needed often at all. Retrieval times range from minutes
to hours. There are differing retrieval methods depending on how acceptable the
default retrieval times are for you:
o Expedited: 1 - 5 minutes, but this option is the most expensive
o Standard: 3 - 5 hours to restore
o Bulk: 5 - 12 hours; the lowest-cost option, good for large sets of data
S3 Glacier Deep Archive - The lowest-cost S3 storage class, where retrieval can take
12 hours.
S3 Encryption:
Encryption In Transit: When the traffic passing between one endpoint and another is
indecipherable. Anyone eavesdropping between server A and server B won't be able
to make sense of the information passing by. Encryption in transit for S3 is always
achieved by SSL/TLS.
You can encrypt on the AWS supported server-side in the following ways:
o SSE-S3: server-side encryption with S3-managed keys
o SSE-KMS: server-side encryption with AWS KMS-managed keys
o SSE-C: server-side encryption with customer-provided keys
S3 Versioning:
When enabled, versioning stores all versions of an object (including all writes and
even deletes), making it easy to retrieve and restore previous versions.
S3 Lifecycle Management:
Lifecycle management automates moving objects between storage tiers (and
eventually expiring them) on a schedule you define.
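A lifecycle configuration is expressed as a set of rules. Below is a hypothetical example in the dictionary shape accepted by boto3's `put_bucket_lifecycle_configuration`; the rule ID, prefix, and day counts are made up:

```python
# Hypothetical lifecycle rule: objects under "logs/" move to Infrequent
# Access after 30 days, to Glacier after 90 days, and are deleted after
# a year.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-old-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}
```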
S3 Transfer Acceleration:
Transfer acceleration makes use of the CloudFront network by sending or
receiving data at CDN points of presence (called edge locations) rather than
slower uploads or downloads at the origin.
This is accomplished by uploading to a distinct URL for the edge location
instead of the bucket itself. This is then transferred over the AWS network
backbone at a much faster speed.
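The distinct URL mentioned above is a region-less accelerate endpoint. A tiny helper showing its shape (bucket name is a placeholder):

```python
def accelerate_endpoint(bucket: str) -> str:
    # Transfer Acceleration uses a region-less endpoint; requests sent
    # to it are routed to the nearest edge location, then carried over
    # the AWS backbone to the bucket's region.
    return f"https://{bucket}.s3-accelerate.amazonaws.com"

print(accelerate_endpoint("my-demo-bucket"))
```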
S3 Event Notifications:
The Amazon S3 notification feature enables you to receive and send notifications
when certain events happen in your bucket. To enable notifications, you must first
configure the events you want Amazon S3 to publish (new object added, old object
deleted, etc.) and the destinations where you want Amazon S3 to send the event
notifications. Amazon S3 supports the following destinations where it can publish
events:
o Amazon SNS topics
o Amazon SQS queues
o AWS Lambda functions
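A notification configuration pairs events with a destination. Here is a hypothetical example in the dictionary shape accepted by boto3's `put_bucket_notification_configuration`; the Lambda function ARN is a placeholder:

```python
# Hypothetical configuration: invoke a Lambda function whenever a new
# object ending in ".log" is created in the bucket.
notification = {
    "LambdaFunctionConfigurations": [
        {
            "LambdaFunctionArn": (
                "arn:aws:lambda:us-east-1:111122223333:function:process-logs"
            ),
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {
                "Key": {"FilterRules": [{"Name": "suffix", "Value": ".log"}]}
            },
        }
    ]
}
```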
S3 and ElasticSearch:
If you are using S3 to store log files, ElasticSearch provides full search
capabilities for logs and can be used to search through data stored in an S3
bucket.
You can integrate your ElasticSearch domain with S3 and Lambda. In this
setup, any new logs received by S3 will trigger an event notification to
Lambda, which in turn will then run your application code on the new log
data. After your code finishes processing, the data will be streamed into your
ElasticSearch domain and be available for observation.
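The Lambda side of that pipeline can be sketched as a handler that pulls the bucket and key out of each S3 event record; the indexing step into ElasticSearch is left as a stub, and the sample event is a trimmed-down version of what S3 actually sends:

```python
import urllib.parse

def handler(event, context=None):
    """Minimal sketch of the Lambda entry point: extract the bucket and
    key from each S3 event record. Fetching the log object and streaming
    it into ElasticSearch would happen where the stub comment is."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        processed.append((bucket, key))
        # ...fetch the object and index it into ElasticSearch here...
    return processed

# Trimmed-down sample of an S3 event notification payload.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "log-bucket"},
                "object": {"key": "2024/01/app+server.log"}}}
    ]
}
print(handler(sample_event))
```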
If the request rate for reading and writing objects in S3 is extremely high, you
can spread your objects across multiple prefixes (for example, date-based
prefixes) to improve performance, because S3's request-rate limits apply per
prefix. Earlier versions of the AWS Docs also suggested using hash keys or
random strings to prefix the object's name. In such cases, the partitions used
to store the objects will be better distributed and therefore allow better
read/write performance on your objects.
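One possible key-naming scheme for spreading load across prefixes is a short hash-derived shard prefix; this is a sketch of the idea, not a prescribed AWS convention:

```python
import hashlib

def prefixed_key(key: str, shards: int = 16) -> str:
    """Spread keys across a fixed set of prefixes so that S3's
    per-prefix request limits apply to each prefix independently.
    The two-hex-digit shard prefix is one possible scheme."""
    digest = hashlib.md5(key.encode()).hexdigest()
    shard = int(digest, 16) % shards
    return f"{shard:02x}/{key}"

print(prefixed_key("2024-01-15/server.log"))
```

The mapping is deterministic, so readers can reconstruct an object's full key from its original name.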
If your S3 data is receiving a high number of GET requests from users, you
should consider using Amazon CloudFront for performance optimization. By
integrating CloudFront with S3, you can distribute content via CloudFront's
cache to your users for lower latency and a higher data transfer rate. This also
has the added bonus of sending fewer direct requests to S3 which will reduce
costs. For example, suppose that you have a few objects that are very popular.
CloudFront fetches those objects from S3 and caches them. CloudFront can
then serve future requests for the objects from its cache, reducing the total
number of GET requests it sends to Amazon S3.
Server access logging provides detailed records for the requests that are
made to a bucket. Server access logs are useful for many applications. For
example, access log information can be useful in security and access audits. It
can also help you learn about your customer base and better understand your
Amazon S3 bill.
By default, logging is disabled. When logging is enabled, logs are saved to a
bucket in the same AWS Region as the source bucket.
Each access log record provides details about a single access request, such as
the requester, bucket name, request time, request action, response status, and
an error code, if relevant.
It works in the following way:
o S3 periodically collects access log records of the bucket you want to
monitor
o S3 then consolidates those records into log files
o S3 finally uploads the log files to your secondary monitoring bucket as
log objects
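Enabling server access logging amounts to naming a target bucket and prefix. A hypothetical configuration in the dictionary shape accepted by boto3's `put_bucket_logging`; the bucket name and prefix are placeholders:

```python
# Hypothetical logging configuration: ship access logs for a source
# bucket into a separate monitoring bucket under "access-logs/".
logging_config = {
    "LoggingEnabled": {
        "TargetBucket": "my-monitoring-bucket",
        "TargetPrefix": "access-logs/",
    }
}
```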
S3 Multipart Upload:
Multipart upload allows you to upload a single object as a set of parts. Each
part is a contiguous portion of the object's data. You can upload these object
parts independently and in any order.
Multipart uploads are recommended for files over 100 MB and are the only
way to upload files over 5 GB. They boost efficiency by uploading the parts of
your data in parallel.
If transmission of any part fails, you can retransmit that part without affecting
other parts. After all parts of your object are uploaded, Amazon S3 assembles
these parts and creates the object.
Possible reasons for why you would want to use Multipart upload:
o Multipart upload delivers the ability to begin an upload before you
know the final object size.
o Multipart upload delivers improved throughput.
o Multipart upload delivers the ability to pause and resume object
uploads.
o Multipart upload delivers quick recovery from network issues.
You can use an AWS SDK to upload an object in parts. Alternatively, you can
perform the same action via the AWS CLI.
You can also parallelize downloads from S3 using byte-range fetches. If
there's a failure during the download, the failure is localized just to the specific
byte range and not the whole object.
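The part arithmetic behind both techniques is the same: carve the object into contiguous byte ranges. A small sketch (the 100 MB default mirrors the recommendation above; the same ranges could feed an HTTP `Range` header for parallel downloads):

```python
def part_ranges(object_size: int, part_size: int = 100 * 1024 * 1024):
    """Split an object into (start, end) byte ranges, one per part.
    Ends are inclusive, matching HTTP Range header semantics."""
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# A 250 MB object with 100 MB parts splits into three parts.
mb = 1024 * 1024
print(part_ranges(250 * mb, 100 * mb))
```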
S3 Pre-signed URLs:
All S3 objects are private by default; however, the owner of a private bucket
with private objects can optionally share those objects without having to
change the permissions of the bucket to be public.
This is done by creating a pre-signed URL. Using your own security
credentials, you can grant time-limited permission to download or view your
private S3 objects.
When you create a pre-signed URL for your S3 object, you must do the
following:
o Provide your security credentials.
o Specify a bucket.
o Specify an object key.
o Specify the HTTP method (GET to download the object).
o Specify the expiration date and time.
The pre-signed URLs are valid only for the specified duration and anyone who
receives the pre-signed URL within that duration can then access the object.
The following diagram highlights how Pre-signed URLs work:
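The flow can also be sketched in code. This is a toy HMAC illustration of the idea (bind a method, object, and expiry to a signature only the credential holder can produce), not AWS's actual SigV4 algorithm, and the secret, bucket, and key are all made up:

```python
import hashlib
import hmac

SECRET = b"hypothetical-secret-key"  # stands in for AWS credentials

def presign(bucket: str, key: str, expires_at: int) -> str:
    """Toy pre-signed URL: the signature covers the method, object,
    and expiry, so none of them can be changed without invalidating it."""
    message = f"GET\n{bucket}/{key}\n{expires_at}".encode()
    signature = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return (f"https://{bucket}.s3.amazonaws.com/{key}"
            f"?Expires={expires_at}&Signature={signature}")

def verify(bucket: str, key: str, expires_at: int, signature: str,
           now: int) -> bool:
    # Reject expired URLs first, then check the signature matches.
    if now > expires_at:
        return False
    message = f"GET\n{bucket}/{key}\n{expires_at}".encode()
    expected = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

url = presign("private-bucket", "report.pdf", expires_at=1_700_000_000)
print(url)
```

Anyone holding the URL can fetch the object until the expiry passes; tampering with the key or expiry breaks verification.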
S3 Select:
S3 Select is an Amazon S3 feature that is designed to pull out only the data
you need from an object, which can dramatically improve the performance
and reduce the cost of applications that need to access data in S3.
Most applications have to retrieve the entire object and then filter out only the
required data for further analysis. S3 Select enables applications to offload the
heavy lifting of filtering and accessing data inside objects to the Amazon S3
service.
As an example, let’s imagine you’re a developer at a large retailer and you
need to analyze the weekly sales data from a single store, but the data for all
200 stores is saved in a new GZIP-ed CSV every day.
o Without S3 Select, you would need to download, decompress and
process the entire CSV to get the data you needed.
o With S3 Select, you can use a simple SQL expression to return only the
data from the store you’re interested in, instead of retrieving the entire
object.
By reducing the volume of data that has to be loaded and processed by your
applications, S3 Select can improve the performance of most applications that
frequently access data from S3 by up to 400% because you’re dealing with
significantly less data.
You can also use S3 Select for Glacier.
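With S3 Select, the retailer example above becomes a single SQL expression such as `SELECT * FROM s3object s WHERE s.store_id = '42'` passed to the service. The sketch below simulates the same filtering locally over a gzipped CSV (the data is invented) to show what S3 Select would otherwise do server-side:

```python
import csv
import gzip
import io

# A tiny stand-in for the daily gzipped sales CSV from the example.
raw = "store_id,item,amount\n42,widget,9.99\n17,gadget,4.50\n42,gizmo,2.25\n"
blob = gzip.compress(raw.encode())

def rows_for_store(gzipped: bytes, store_id: str):
    """Decompress the CSV and keep only one store's rows - the local
    equivalent of the S3 Select predicate WHERE s.store_id = '<id>'."""
    text = gzip.decompress(gzipped).decode()
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row["store_id"] == store_id]

print(rows_for_store(blob, "42"))
```

With S3 Select doing this filtering inside the service, only the matching rows ever cross the network.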
Short Notes:
Amazon Simple Storage Service (Amazon S3) is a highly scalable, reliable, and
low-latency object storage service provided by Amazon Web Services (AWS).
It is designed to store and retrieve any amount of data from anywhere on the
web. Here’s a comprehensive overview of Amazon S3:
Key Features
Storage Classes