#Glue What are Glue Worker Types?
AWS Glue comes with 3 worker types:
Standard - 4 vCPU, 16GB RAM, 50GB disk, 2 Spark executors ⇒ 1 DPU
G.1X - 4 vCPU, 16GB RAM, 64GB disk, 1 Spark executor ⇒ 1 DPU
G.2X - 8 vCPU, 32GB RAM, 128GB disk, 1 Spark executor ⇒ 2 DPU
1 Standard DPU runs 2 Spark executors, which can run up to 8 Spark tasks in parallel (4 per executor)
G.1X for jobs that are memory intensive
G.2X for jobs that use AWS Glue ML workloads such as ML Transforms
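As a rough sketch of how the worker types above translate into DPUs, the helper below multiplies a worker count by the DPU-per-worker figures from this card. The name `total_dpus` and the lookup table are illustrative and not part of any AWS SDK.

```python
# DPUs per worker, taken from the worker type list in this card.
DPU_PER_WORKER = {"Standard": 1, "G.1X": 1, "G.2X": 2}


def total_dpus(worker_type: str, number_of_workers: int) -> int:
    """Return the DPUs consumed by a job with the given worker type and count."""
    if worker_type not in DPU_PER_WORKER:
        raise ValueError(f"unknown worker type: {worker_type}")
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be at least 1")
    return DPU_PER_WORKER[worker_type] * number_of_workers


# 10 G.2X workers at 2 DPU each.
print(total_dpus("G.2X", 10))
```

The same number of workers costs twice as many DPUs on G.2X as on G.1X or Standard.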
#EMR What are the three different node types in an EMR cluster?
#EMR What is EMRFS Consistent View?
EMRFS Consistent View is an optional feature that allows an EMR cluster to check for list and read-after-write consistency for S3 objects written by or synced with EMRFS.
Why? S3’s eventual consistency
#EMR How does EMRFS Consistent View work?
EMRFS Consistent View uses an Amazon DynamoDB table (the EMRFS Metadata Store) to store object metadata and track consistency with S3.
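A minimal sketch of how Consistent View might be switched on through the `emrfs-site` configuration classification when creating a cluster. The table name is believed to be the default name for the metadata store, and the retry values are illustrative assumptions rather than recommended settings.

```python
import json

# Cluster configuration in the list-of-classifications shape that EMR accepts.
# Retry values are placeholders chosen for illustration only.
emrfs_consistent_view = [
    {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.consistent": "true",
            # DynamoDB table that acts as the EMRFS Metadata Store.
            "fs.s3.consistent.metadata.tableName": "EmrFSMetadata",
            "fs.s3.consistent.retryCount": "5",
            "fs.s3.consistent.retryPeriodSeconds": "10",
        },
    }
]

print(json.dumps(emrfs_consistent_view, indent=2))
```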
#EMR What are the storage options for an EMR cluster?
Can you attach or detach EBS volumes on a running EMR cluster?
#EMR What does the Spark stack consist of?
#EMR What is Hive?
Hive - data warehouse and analytic infrastructure built on top of Hadoop
#EMR What is Tez?
Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
Both Tez and MapReduce are execution engines in Hive.
#EMR What is Presto?
Presto is an open-source, in-memory, distributed SQL query engine designed for fast interactive queries against petabytes of data from different sources.
#EMR What are EMR Notebooks?
EMR Notebooks are similar to Zeppelin but with more AWS integration.
#EMR What is HUE?
HUE stands for Hadoop User Experience. It is an open-source web interface for Apache Hadoop and other non-Hadoop applications running on EMR.
#EMR What is Flume?
Flume is another way to stream data (e.g. log data) into your EMR cluster.
#EMR What is MXNet?
MXNet is an alternative to TensorFlow for building neural networks. MXNet is included in EMR.
#EMR What is S3DistCp?
S3DistCp is a tool for copying large amounts of data between S3 and HDFS.
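A hedged sketch of what an S3-to-HDFS copy could look like as an EMR step definition, built as a plain dictionary. The helper name `s3distcp_step`, the bucket, and the paths are hypothetical; the step shape assumes `command-runner.jar` invoking `s3-dist-cp` with its `--src` and `--dest` options.

```python
def s3distcp_step(src: str, dest: str) -> dict:
    """Build an EMR step definition that runs s3-dist-cp from src to dest."""
    return {
        "Name": "Copy with S3DistCp",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


# Hypothetical bucket and HDFS path, for illustration only.
step = s3distcp_step("s3://example-bucket/logs/", "hdfs:///data/logs/")
print(step["HadoopJarStep"]["Args"])
```

Swapping the two arguments would describe a copy in the other direction, from HDFS back to S3.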
#EMR What are the Hive integrations with AWS?
- DynamoDB as an external table. Hive can process and join data stored in DynamoDB
#Quicksight What is a KPI Chart?
KPI Charts use a key performance indicator (KPI) to visualize a comparison between a key value and its target value.
#DynamoDB What are WCU and RCU for DynamoDB?
1 WCU = 1 WRITE per second for an item up to 1KB
1 RCU = 1 strongly consistent READ per second, or 2 eventually consistent READs per second, for an item up to 4KB
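The capacity arithmetic can be sketched as below: item sizes are rounded up to the next 1KB (writes) or 4KB (reads), and eventually consistent reads cost half as much. The function names are illustrative and not part of the DynamoDB API.

```python
import math


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """WCUs needed: each write costs 1 WCU per 1 KB of item size, rounded up."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second


def read_capacity_units(item_size_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed: each strongly consistent read costs 1 RCU per 4 KB of
    item size, rounded up; eventually consistent reads cost half of that."""
    total = math.ceil(item_size_bytes / 4096) * reads_per_second
    return total if strongly_consistent else math.ceil(total / 2)


# 10 writes per second of 2.5 KB items: each write rounds up to 3 KB.
print(write_capacity_units(2560, 10))
# 10 reads per second of 6 KB items: each read rounds up to 8 KB (2 units).
print(read_capacity_units(6144, 10, strongly_consistent=True))
print(read_capacity_units(6144, 10, strongly_consistent=False))
```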
#S3 What is Glacier Select?
Glacier Select allows you to query Glacier data with simple SQL queries and get results in minutes, without needing to restore it to S3.
#IoT What are types of identity principals for device or client authentication supported by AWS IoT?
I think you can also implement Federated Identity via Cognito since it can use
#Redshift What is Redshift's Elastic Resize?
Redshift’s Elastic Resize allows you to add or remove nodes and also change node types.
However, Elastic Resize only holds connections open if you change only the number of nodes, not the node type.
If you want to minimize the downtime involved, you might still use the snapshot / restore / resize approach with classic resize.
https://aws.amazon.com/blogs/big-data/scale-your-amazon-redshift-clusters-up-and-down-in-minutes-to-get-the-performance-you-need-when-you-need-it/
#EMR Which compression algorithms are splittable?
BZIP2 and LZO are splittable, great for parallel processing (LZO must be indexed first)
GZIP and SNAPPY are NOT splittable
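To illustrate why splittability matters for parallelism, the sketch below estimates how many input splits (and therefore parallel tasks) a single compressed file could yield. The codec table follows this card; the 128MB split size and the function name `estimated_splits` are assumptions made for the example.

```python
import math

# Splittability as listed in this card (LZO assumed to be indexed).
SPLITTABLE = {"bzip2": True, "lzo": True, "gzip": False, "snappy": False}


def estimated_splits(codec: str, file_size_mb: int, split_size_mb: int = 128) -> int:
    """Estimate the number of input splits for one compressed file."""
    if not SPLITTABLE[codec]:
        # A non-splittable file has to be read start to finish by one task.
        return 1
    return max(1, math.ceil(file_size_mb / split_size_mb))


# A 1 GB file: one task for gzip, several parallel tasks for bzip2.
print(estimated_splits("gzip", 1024))
print(estimated_splits("bzip2", 1024))
```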
#EMR What is HBase Read-replica in S3?
Amazon EMR version 5.7.0+ allows you to maintain read-only copies of data in Amazon S3.
You can access the data from the read-replica cluster to perform read operations simultaneously, and in the event that the primary cluster becomes unavailable.
#EMR What are the 3 ways that HBase can integrate with S3?
#EMR What is Ganglia?
Ganglia is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance.
Ganglia is installed on the master node and is the operational dashboard provided with EMR.