Spark Patterns and Anti-Patterns
Spark Patterns:
Spark Anti-Patterns:
Kinesis Retention Periods
24 Hours to 7 Days
Default is 24 Hours
EMR Consistent View
EMRFS consistent view monitors Amazon S3 list consistency for objects written by or synced with EMRFS, delete consistency for objects deleted by EMRFS, and read-after-write consistency for new objects written by EMRFS.
You can configure additional settings for consistent view by providing them in the emrfs-site configuration file:
/home/hadoop/conf/emrfs-site.xml
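As a rough sketch of what such settings can look like, the snippet below builds an `emrfs-site` configuration classification as a Python dictionary and prints it as JSON. The property values (retry period, retry count, table name) are illustrative assumptions, not recommendations, and the helper name `to_configurations_json` is hypothetical.

```python
import json

# Illustrative values only; adjust for your own cluster.
emrfs_site = {
    "Classification": "emrfs-site",
    "Properties": {
        # Turns EMRFS consistent view on.
        "fs.s3.consistent": "true",
        # How long to wait between retries when S3 looks inconsistent (assumed value).
        "fs.s3.consistent.retryPeriodSeconds": "10",
        # How many times to retry before giving up (assumed value).
        "fs.s3.consistent.retryCount": "5",
        # DynamoDB table that stores the consistent view metadata (assumed name).
        "fs.s3.consistent.metadata.tableName": "EmrFSMetadata",
    },
}


def to_configurations_json(*classifications):
    """Serialize one or more classification dicts as a JSON list."""
    return json.dumps(list(classifications), indent=2)


configurations_json = to_configurations_json(emrfs_site)
print(configurations_json)
```

The resulting JSON has the shape EMR expects for a configurations list at cluster creation; the same properties can also be set directly in emrfs-site.xml.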
DynamoDB Max number of LSI
5
Kinesis Firehose Handling
2. Redshift & Elasticsearch: 0-7200 seconds
Apache Hadoop Modules
Impala
Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS).
Kinesis Consumers
Read data from streams:
Kinesis Streams
Kinesis Streams:
EMR Data Compression Formats
Algorithm / Splittable / Compression Ratio / Compress-Decompress Speed
Presto - Patterns and Anti-Patterns
Presto Patterns:
Presto Anti-patterns:
KPL - Key Concepts
Resizing EMR Cluster
Redshift Important Operations
Redshift important operations:
DynamoDB Performance Metrics
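1 Partition = up to 10 GB of data, 3,000 RCU, and 1,000 WCU
1 RCU = one strongly consistent read per second for an item up to 4 KB
1 WCU = one write per second for an item up to 1 KB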
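The arithmetic behind these numbers can be sketched in a few lines. The functions below are a minimal illustration with hypothetical names. Two things here are assumptions not stated in the notes above: eventually consistent reads cost half as much as strongly consistent reads, and the partition estimate takes the larger of a throughput-based and a size-based count.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """RCUs needed: each read consumes one unit per 4 KB of item size, rounded up."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        # Assumption: eventually consistent reads cost half as much.
        total = total / 2
    return math.ceil(total)


def write_capacity_units(item_size_bytes, writes_per_second):
    """WCUs needed: each write consumes one unit per 1 KB of item size, rounded up."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second


def estimate_partitions(table_size_gb, rcu, wcu):
    """Rough partition count from the per-partition limits (10 GB, 3000 RCU, 1000 WCU)."""
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    by_size = math.ceil(table_size_gb / 10)
    return max(by_throughput, by_size, 1)


print(read_capacity_units(6000, 10))         # 6000-byte items, 10 reads/sec
print(write_capacity_units(1500, 10))        # 1500-byte items, 10 writes/sec
print(estimate_partitions(25, 3000, 1000))   # 25 GB table
```

For example, a 6,000-byte item rounds up to two 4 KB units per read, so 10 strongly consistent reads per second need 20 RCU.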
DynamoDB Streams Configuration Views
KPL Use Cases
- Record aggregation
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- Coordinates distributed processing
Regression Model
Use Cases
Kinesis Agent
Kinesis Firehose Destination Data Delivery
Machine Learning Algorithms
EMR Cluster sizing
2. Core Nodes - Replication Factor
10 or more node cluster - 3
4-9 node cluster - 2
3 or fewer node cluster - 1
HDFS Capacity Formula:
Data Size = Total Storage / Replication Factor
Note: AWS recommends a smaller cluster of larger nodes
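The capacity formula can be worked through with a short sketch. The function names are hypothetical, and the thresholds follow the replication factors listed above, with the assumption that a cluster of exactly 10 core nodes uses a replication factor of 3.

```python
def default_replication_factor(core_node_count):
    """Replication factor by cluster size (assumes 10 nodes counts as the large tier)."""
    if core_node_count >= 10:
        return 3
    if core_node_count >= 4:
        return 2
    return 1


def hdfs_data_capacity_gb(core_node_count, storage_per_node_gb, replication_factor=None):
    """Data Size = Total Storage / Replication Factor."""
    if replication_factor is None:
        replication_factor = default_replication_factor(core_node_count)
    total_storage_gb = core_node_count * storage_per_node_gb
    return total_storage_gb / replication_factor


# Same raw storage per node, different cluster sizes.
for nodes in (3, 6, 12):
    print(nodes, "nodes:", hdfs_data_capacity_gb(nodes, 1000), "GB of data")
```

A 12-node cluster with 1,000 GB per node has 12,000 GB of raw storage but holds about 4,000 GB of data once each block is stored three times.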
DynamoDB Performance