Redshift Spectrum
Redshift Spectrum is a feature of Amazon Redshift that enables you to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required.
Redshift Distribution Style
Redshift has 3 distribution Styles
1) Even (default) - rows distributed across the slices regardless of values in a particular column in a round-robin fashion
When to use?
No joins, reduced parallelism is ok, where KEY and ALl are not a clear choice
Key - distribute data evenly among slices; Collocate matching rows on the same slice, good for table joins performance - no cross node traffic
When to use?
- Tables used in joins
- Large fact tables in a star schema
ALL - entire copies of table is distributed to each node; More storage needed
When to use?
- Table data that does not change
- Reasonably sized tables ( a few million rows)
- No common distribution key
Redshift Sort Key
Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans
Redshift - Different Sort Keys
Compound sort key - table sorted by columns listed in the sort key definition - weight based on the sort key order in the definition
Interleave sort key - Equal weight to each column in the sort key
Redshift - Data Types
There are 12 data types, e.g.
Redshift - WLM
Workload Management is an ordered set of query queues that define how resources are allocated and how queries are routed for processing. WLM definition is part of the Parameter Group for a Redshift cluster.
Redshift - Dynamic and Static WLM Settings
Dynamic WLM settings can be applied immediately to any cluster associated with the Parameter Group; e.g. Memory
Static WLM settings require reboot for changes to be applied; e.g. User Group, Query Group
Redshift - Query Queue Settings
With each queue, you can define
Redshift - WLM Queue Assignment Rule
Superuser queue; User group queue; Query group queue; Default queue
WLM query queue hopping - if a query is timeout, it is automatically moved to the next queue and restart execution
Redshift - which 4 AWS services can copy data directly to Redshift?
S3, EC2, DynamoDB, and EMR
Redshift - when to data load manifest?
Redshift - is UPSERT supported?
No.
Redshift - what are 2 methods to update or merge data?
Using staging tables to 1) replace existing rows in target table; 2) update certain columns in the target.
Redshift - which S3 encryption supported by COPY?
SSE-S3, SSE-KMS, CSE-Symmetric
Redshift
which S3 encryption supported by UNLOAD?
SSE-S3 (default), SSE-KMS, CSE-CMK
Redshift
What are the node types?
What are their storage capacity?
How many node slices supported?
RA3 - high performance with scalable managed storage (up to 64TB/node)
RA3.4Xlarge: 2-32 nodes and upto 2PB;
12 vCPU / 4 slices
RA3.16Xlarge: 2-128 nodes and up to 8.2PB;
32vCPU / 16 slices
DC2 - High performance with fixed local SSD storage (up to 2.6TB/node)
DC2.large: 1-32 nodes and up to 20.5TB total;
2 vCPU / 2 slices
DC2.8Xlarge: 2-128 nodes and up to 327.7TB total;
32 vCPU / 16 slices
DS2 (legacy) - Large workloads with fixed local HDD storage (16TB/node)
Redshift - what is VACUUM command and why we need it?
VACUUM helps recover the space and sorts the table.
It is needed since Redshift blocks are immutable
Redshift - what are different vacuum commands?
VACUUM FULL;
VACUUM SORT ONLY;
VACUUM DELETED ONLY;
VACUUM REINDEX ;
Redshift - what is deep copy? what is the best way to perform deep copy?
Deep copy is faster than VACUUM for large table (700GB+).
It can be done by
1. Using original DDL, SELECT INSERT data from original, drop the original and rename to original;
Using original DDL is the fastest
Redshift - cross-region snapshots
For KMS-encrypted cluster, you would need to create a snapshot grant in the destination region so Redshift can use a master key in the destination region
ML - what are 3 ML models supported by Amazon Machine Learning?
S3 - what is S3 Select?
S3 Select is a new Amazon S3 capability designed to pull out only the data you need from an object, which can dramatically improve the performance and reduce the cost of applications that need to access data in S3
S3 - what is Glacier Select?
Glacier Select allow you query Glacier data using SQL within minutes. The results are written to S3 and there is no need to restore from Glacier to S3
Quiz - You have a table in your Redshift cluster, and the data in this table changes infrequently. The table has fewer than 15 million rows and does not JOIN any other tables. Which distribution style would you select for this table?
1) EVEN 2) KEY 3) DEFAULT 4) ALL
4) ALL - The ALL distribution type is appropriate for tables that change infrequently (tables that are not updated frequently or extensively). With this distribution style, the entire table is distributed to every node.