What is the maximum size of an object in an S3 bucket?
The maximum object size is 5 TB.
How do you upload a file larger than 5 GB?
You must use “multi-part upload”
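As a rough sketch of what a multi-part upload client has to do, the code below plans the byte ranges for each part of a large object. The 100 MB default part size is an assumption for illustration; the actual upload would use the S3 CreateMultipartUpload/UploadPart APIs (e.g., via boto3), which accept parts of 5 MB to 5 GB and at most 10,000 parts per object.

```python
# Sketch: split a large object into multi-part upload chunks.
# Part size of 100 MB is an assumed default; S3 allows 5 MB - 5 GB
# per part and up to 10,000 parts per object.

def plan_parts(object_size, part_size=100 * 1024 * 1024):
    """Return (start, end) byte offsets for each part, end exclusive."""
    parts = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size)
        parts.append((start, end))
        start = end
    return parts
```

Note that for a full 5 TB object, a client has to grow the part size beyond 100 MB to stay under the 10,000-part limit.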
Is Amazon S3 strongly consistent? What does that mean?
Yes, it is.
After a successful write of a new object (new PUT) or an overwrite or delete of an existing object (overwrite PUT or DELETE):
• any subsequent read request immediately receives the latest version of the object (read-after-write consistency)
• any subsequent list request immediately reflects the changes (list consistency)
What are the S3 Storage Classes?
• Amazon S3 Standard - General Purpose
• Amazon S3 Standard-Infrequent Access (IA)
• Amazon S3 One Zone-Infrequent Access
• Amazon S3 Intelligent Tiering
— Automatically moves objects between two access tiers based on changing access patterns
• Amazon Glacier
• Amazon Glacier Deep Archive
— Time to retrieve object: Standard (12 hours) / Bulk (48 hours) / Minimum storage duration of 180 days
(Slide 115)
What are the S3 Lifecycle Rules?
Transition actions
— Define when objects are transitioned to another storage class (e.g., move objects to the Standard-IA class 60 days after creation)
Expiration actions
— Configure objects to expire (be deleted) after some time
— Can be used to delete old versions of files
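A minimal sketch of a lifecycle configuration covering both rule types, written as the Python dict a boto3 caller would pass to `put_bucket_lifecycle_configuration`. The rule ID, prefix, and 365-day expiration are assumed examples; the 60-day Standard-IA transition matches the rule above.

```python
# Sketch: S3 lifecycle configuration with a transition action and an
# expiration action. Rule ID, prefix, and the 365-day expiration are
# hypothetical; the 60-day Standard-IA transition follows the notes.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-then-expire",      # hypothetical rule name
            "Filter": {"Prefix": "logs/"},    # hypothetical prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 60, "StorageClass": "STANDARD_IA"}
            ],
            "Expiration": {"Days": 365},
        }
    ]
}
# With boto3 this would be applied via:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)
```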
How does S3 Performance work?
Amazon S3 automatically scales to high request rates, with 100–200 ms latency.
Your application can achieve at least 3,500
PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket.
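Because those limits are per prefix, aggregate throughput scales with the number of prefixes you spread requests over. A tiny sketch of that arithmetic:

```python
# Per-prefix S3 request-rate baselines (from the notes above).
PUT_PER_PREFIX = 3500   # PUT/COPY/POST/DELETE requests per second
GET_PER_PREFIX = 5500   # GET/HEAD requests per second

def aggregate_get_rate(num_prefixes):
    """Spreading reads across N prefixes multiplies the GET baseline."""
    return num_prefixes * GET_PER_PREFIX
```

For example, spreading reads across four prefixes gives a baseline of 22,000 GET/HEAD requests per second.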
How does upload in S3 work?
We have two options:
• Multi-Part upload:
— recommended for files > 100 MB, required for files > 5 GB
— can help parallelize uploads (speeds up transfers)
• S3 Transfer Acceleration:
— increases transfer speed by transferring the file to an AWS edge location, which forwards the data to the S3 bucket in the target region
— compatible with multi-part upload
How does Download in S3 work?
You can use S3 byte-range fetches:
• Parallelize GETs by requesting
specific byte ranges
• Better resilience in case of failures
• Can be used to speed up downloads
• Can be used to retrieve only partial data (for example, the head of a file)
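A byte-range fetch is just a GET with an HTTP `Range` header. The sketch below builds the headers a client might send for a parallelized download; the 1 KB "head of file" range is an assumed example.

```python
# Sketch: build HTTP Range headers for parallel byte-range GETs.
# The Range header format is "bytes=start-end" with an inclusive end.

def range_headers(object_size, chunk_size):
    """One 'bytes=start-end' header value per parallel GET."""
    headers = []
    for start in range(0, object_size, chunk_size):
        end = min(start + chunk_size, object_size) - 1
        headers.append(f"bytes={start}-{end}")
    return headers

# Retrieving only the head of a file is a single small range
# (1 KB here is an arbitrary example):
head_of_file = "bytes=0-1023"
```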
How does S3 Encryption work?
There are 4 methods of encrypting objects in S3
• SSE-S3: encrypts S3 objects using keys handled & managed by AWS
— AES-256 encryption type
— Must set header: “x-amz-server-side-encryption”: “AES256”
• SSE-KMS: leverage AWS Key Management Service to manage
encryption keys
— KMS Advantages: user control + audit trail
— Must set header: “x-amz-server-side-encryption”: ”aws:kms”
• SSE-C: when you want to manage your own encryption keys
• Client Side Encryption
— Customer fully manages the keys and encryption cycle
(slide 128)
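The server-side modes above are selected per request via headers. A minimal sketch of the header sets, using the header names quoted in the notes; the KMS key alias is hypothetical, and the optional key-id header is an assumption (if omitted, SSE-KMS uses the default S3 KMS key).

```python
# Sketch: request headers selecting server-side encryption modes.
# Header names/values follow the notes; the KMS key alias is hypothetical.
sse_s3_headers = {"x-amz-server-side-encryption": "AES256"}

sse_kms_headers = {
    "x-amz-server-side-encryption": "aws:kms",
    # Optional: pin a specific KMS key (hypothetical alias):
    "x-amz-server-side-encryption-aws-kms-key-id": "alias/my-app-key",
}
```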
How does S3 Security Access work?
User based
• IAM policies - which API calls should be allowed for a specific user from IAM console
Resource Based
• Bucket Policies – bucket-wide rules from the S3 console – allows cross-account access
• Object Access Control List (ACL) – finer grain
• Bucket Access Control List (ACL) – less common
Note: an IAM principal can access an S3 object if
• the user’s IAM permissions ALLOW it OR the resource policy ALLOWS it
• AND there’s no explicit DENY
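The evaluation rule in that note can be written out as a one-line predicate:

```python
# Sketch of the access-evaluation rule above: access is granted if
# (IAM allows OR the resource policy allows) AND there is no explicit deny.

def s3_access_allowed(iam_allow, resource_allow, explicit_deny):
    return (iam_allow or resource_allow) and not explicit_deny
```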
How does S3 Security work?
Can be user based (IAM policies) or resource based (bucket policies).
Networking - Supports VPC Endpoints
Logging and Audit - S3 Access Logs can be stored in other S3 bucket / API calls can be logged in AWS CloudTrail
User Security - MFA Delete / Pre-Signed URLs: URLs that are valid only for a limited time (e.g., a premium video service for logged-in users)
How does DynamoDB partitioning work?
• WCU and RCU are spread evenly between partitions
How do DynamoDB Conditional Writes work?
• A write (PutItem, UpdateItem, DeleteItem) can include a condition expression; the write succeeds only if the condition evaluates to true
How does DynamoDB Batch Writing work? What are the benefits?
• BatchWriteItem puts or deletes up to 25 items in one call
• Fewer API round trips, so faster bulk writes
How does DynamoDB Batch Reading work? What are the benefits?
• BatchGetItem retrieves up to 100 items in one call
• Items are retrieved in parallel, minimizing latency
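Since BatchWriteItem accepts at most 25 put/delete requests per call, a bulk loader has to chunk its items. A sketch of that chunking (the limits are DynamoDB facts; the item list is arbitrary):

```python
# Sketch: chunk items into DynamoDB batch-sized calls.
# BatchWriteItem accepts at most 25 put/delete requests per call;
# BatchGetItem accepts at most 100 keys per call.
BATCH_WRITE_LIMIT = 25

def chunk_for_batch_write(items, limit=BATCH_WRITE_LIMIT):
    """Split items into lists no longer than the batch-write limit."""
    return [items[i:i + limit] for i in range(0, len(items), limit)]
```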
DynamoDB – Query
• Query items by partition key (required equality condition), optionally with a sort key condition
• Returns up to 1 MB of data – use pagination to keep on reading
DynamoDB - Scan
• Scan the entire table and then filter out data (inefficient)
• Returns up to 1 MB of data – use pagination to keep on reading
• Consumes a lot of RCU
• Limit impact using Limit or reduce the size of the result and pause
• For faster performance, use parallel scans:
• Multiple instances scan multiple partitions at the same time
• Increases the throughput and RCU consumed
• Limit the impact of parallel scans just like you would for Scans
• Can use a ProjectionExpression + FilterExpression (no change to RCU)
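A parallel scan works by giving each worker a `Segment` number out of `TotalSegments`. The sketch below builds the per-worker scan parameters; the table name is hypothetical, and in practice each kwargs dict would be passed to a DynamoDB `scan` call by a separate worker.

```python
# Sketch: build per-worker parameters for a DynamoDB parallel scan.
# Each worker scans one segment; TotalSegments splits the table.

def parallel_scan_kwargs(table_name, total_segments):
    return [
        {"TableName": table_name,       # hypothetical table name below
         "Segment": segment,
         "TotalSegments": total_segments}
        for segment in range(total_segments)
    ]
```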
What is Glue?
Serverless discovery and definition of table definitions and schemas (S3 data lakes, RDS, …)
Custom ETL jobs – fully managed, trigger-driven, on a schedule, or on demand.
How do Glue and S3 partitions work?
Glue crawler will extract partitions based on how your S3 data is organized
Think up front about how you will be querying your data lake in S3
Example: devices send sensor data every hour
Do you query primarily by time ranges?
• If so, organize your buckets as yyyy/mm/dd/device
Do you query primarily by device?
• If so, organize your buckets as device/yyyy/mm/dd
(slide 188)
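The sensor-data example above comes down to choosing a key prefix order. A small sketch of the two layouts (device/date/file names are hypothetical):

```python
# Sketch: two S3 key layouts for the sensor-data example.
# Put the partition you query by most often first in the key.

def key_time_first(device, yyyy, mm, dd, name):
    """Layout for querying primarily by time range."""
    return f"{yyyy}/{mm}/{dd}/{device}/{name}"

def key_device_first(device, yyyy, mm, dd, name):
    """Layout for querying primarily by device."""
    return f"{device}/{yyyy}/{mm}/{dd}/{name}"
```

A Glue crawler pointed at either layout would pick up the path components as partition columns.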
How do Glue and Hive work together?
• Hive lets you run SQL-like queries from EMR
• The Glue Data Catalog can serve as a Hive “metastore”
• You can also import a Hive metastore into Glue