What are the 3 V’s of Big Data?
Volume, Variety and Velocity
What is Redshift?
Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It’s a very large relational database, traditionally used in big data applications.
What are the features that make Redshift different from a traditional relational database?
Is Redshift a highly available service?
Not by default. A cluster comes online in one AZ; if you want it in multiple AZs, you will have to create multiple copies (or use the Multi-AZ deployment option available for RA3 clusters).
What is ETL?
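Extract, Transform, Load: pull data out of its source systems, reshape it into the form you need, and load it into a destination such as a data warehouse.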
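To make the three ETL stages concrete, here is a minimal sketch using only the Python standard library. The sample CSV, the `orders` table and the function names are made up for illustration; a real pipeline would read from actual source systems and load into a warehouse such as Redshift.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: one row has a missing amount, country codes are inconsistent.
RAW_CSV = """order_id,amount_usd,country
1,19.99,us
2,,de
3,5.00,US
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, cast types, upper-case country codes."""
    cleaned = []
    for row in rows:
        if not row["amount_usd"]:
            continue
        cleaned.append(
            (int(row["order_id"]), float(row["amount_usd"]), row["country"].upper())
        )
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount_usd REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text, conn):
    """Run all three stages and return the number of rows now in the table."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Calling `run_etl(RAW_CSV, sqlite3.connect(":memory:"))` runs the whole pipeline against an in-memory SQLite database, which stands in for the warehouse here.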
What is EMR?
EMR is a managed big data platform that allows you to process vast amounts of data using open-source tools, such as Spark, Hive, HBase, Flink, Hudi and Presto.
It’s AWS’s ETL tool.
The tools it runs are open source, not proprietary to Amazon.
What is the architecture of EMR?
When you spin up an EMR cluster, it will live inside of your VPC.
For the purpose of the exam, we will focus on using EC2 instances (but it can also run on EKS and Outposts).
EMR will spin up the instances for you, keep them online and manage them for you. It takes in data, processes it into the form you want, and then stores the result in an S3 bucket.
If you see a scenario asking about optimizing cost of EC2 instances in EMR, what options do you have?
You can use Reserved Instances and Spot Instances, because you have control over the types of instances used.
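As a rough illustration of that control, the sketch below builds an instance-group list in the shape that boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`. The helper name, group names, counts and the `m5.xlarge` instance type are placeholder assumptions. Putting only the task nodes on Spot is one common choice, not the only one. Reserved Instances do not appear in the request at all: they are a billing discount applied to matching On-Demand usage.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Return EMR instance groups: On-Demand primary/core nodes, Spot task nodes.

    Hypothetical helper; the dict keys follow the boto3 EMR InstanceGroups shape.
    """
    groups = [
        {
            "Name": "primary",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",  # keep the cluster coordinator stable
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",  # core nodes hold HDFS data
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {
                "Name": "task-spot",
                "InstanceRole": "TASK",
                "Market": "SPOT",  # task nodes only add compute, so interruptions are cheaper
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups
```

The returned list would be passed as part of the `Instances` argument when launching a cluster; nothing here contacts AWS.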
What is Kinesis?
Kinesis allows you to ingest, process and analyze real-time streaming data. You can think of it as a huge data highway connected to your AWS account.
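To show what ingesting looks like from the producer side, here is a minimal sketch that sends one JSON event to a Kinesis data stream. It assumes a boto3 Kinesis client is passed in (for example `boto3.client("kinesis")`); the function name, the stream name and the `device_id` field are hypothetical.

```python
import json


def send_event(client, stream_name, event, partition_field="device_id"):
    """Serialize one event as JSON and put it on a Kinesis data stream.

    `client` is assumed to expose boto3's Kinesis `put_record` method.
    """
    payload = json.dumps(event).encode("utf-8")
    return client.put_record(
        StreamName=stream_name,
        Data=payload,
        # Records with the same partition key are routed to the same shard.
        PartitionKey=str(event[partition_field]),
    )
```

A call might look like `send_event(boto3.client("kinesis"), "clickstream", {"device_id": 42, "action": "view"})`, where `clickstream` is a stream you have already created.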
What are the two types of Kinesis?
Data Streams
Firehose
What is the architecture for Kinesis Data Streams?
What is the architecture for Kinesis Firehose?
What do you use if you need to analyze data as it is flowing through Kinesis Data Streams or Firehose?
Kinesis Data Analytics (using standard SQL)
When you are looking for a message broker, which do you pick?
If you are given a scenario where you need a message broker that delivers in real-time, what would you recommend?
Kinesis (Data Streams)
If you are given a scenario where you need a message broker that delivers in near real-time, what would you recommend?
Kinesis Data Firehose
If a scenario talks about streaming data, what service would you recommend?
Some form of Kinesis
If you are given a scenario that needs to automatically scale your streaming service, what service would you recommend?
Kinesis Data Firehose (only option that offers automatic scaling)
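The scenario rules above can be summarized as a small decision helper. This is only a memory aid that encodes the answers in these notes, not an official selection guide; the function and parameter names are made up.

```python
def recommend_kinesis(real_time, needs_auto_scaling=False):
    """Pick a Kinesis service using the exam rules of thumb from these notes.

    real_time=True means true real-time delivery; False means near real-time is acceptable.
    """
    if needs_auto_scaling:
        # Per the notes, Firehose is the option that scales automatically.
        return "Kinesis Data Firehose"
    if real_time:
        return "Kinesis Data Streams"
    return "Kinesis Data Firehose"
```

For example, `recommend_kinesis(real_time=True)` gives the Data Streams answer, while any scenario that stresses automatic scaling resolves to Firehose.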
What is Athena?
Athena is a serverless, interactive query service that makes it easy to analyze data in S3 using SQL. This allows you to query data directly in your S3 bucket without first loading it into a database.
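As a sketch of what querying S3 data directly looks like in code, the helper below submits a SQL statement through boto3's Athena `start_query_execution` call. It assumes an Athena client is passed in (for example `boto3.client("athena")`); the helper name, database, table and result bucket in the usage note are hypothetical.

```python
def start_athena_query(client, sql, database, output_location):
    """Submit a SQL query to Athena and return its query execution ID.

    `client` is assumed to expose boto3's Athena `start_query_execution` method.
    Athena runs the query asynchronously and writes results to `output_location` in S3.
    """
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

A call might look like `start_athena_query(boto3.client("athena"), "SELECT status, COUNT(*) FROM access_logs GROUP BY status", "weblogs", "s3://my-athena-results/")`. The returned ID can then be used to poll for completion and fetch the results.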
What is AWS Glue?
Glue is a serverless data integration service that makes it easy to discover, prepare and combine data. It allows you to perform ETL workloads without managing underlying servers.
For ETL workloads, it can take the place of EMR.
How do you put Athena and Glue together?
Point Glue at the data in S3 to build a catalog.
Once that is built, you have a couple of options:
You could then use something like QuickSight (Amazon’s version of Tableau) to visualize the data in a dashboard
If you are given a scenario that asks about needing a serverless SQL solution to query BI data or logs, what service would you recommend?
Athena
What is QuickSight?
Amazon QuickSight is a fully managed business intelligence (BI) data visualization service. It allows you to easily create dashboards and share them within your company.
Similar to Tableau
What are common ways to incorporate QuickSight into an architecture?
Anywhere you need a data visualization tool (it integrates with Athena)