Analytics Flashcards

(83 cards)

1
Q

What is the purpose of Glue crawlers?

A

To populate the Glue Data Catalog with metadata on the data in S3

2
Q

What does crawling with Glue Crawlers enable for Athena, Redshift and EMR?

A

Lets you query your unstructured data using Athena, Redshift and EMR as if it were structured, because the crawler infers a schema and stores it in the Glue Data Catalog

3
Q

What is ResolveChoice in Glue dynamic frames?

A

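Lets you resolve ambiguity in a dynamic frame, e.g. when the same field holds values of more than one type (a choice type). You can cast to a single type, split the field into one column per type (make_cols), keep both in a struct (make_struct) or keep only one type (project)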
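One ambiguity ResolveChoice is commonly used for is a field whose values arrive with more than one type. The sketch below models the cast strategy with plain Python dictionaries standing in for a dynamic frame. The helper name `resolve_choice_cast` and the sample records are hypothetical; in a real Glue job the rough equivalent would be a call such as `resolveChoice(specs=[("price", "cast:double")])`.

```python
def resolve_choice_cast(records, field, to_type):
    """Return new records with `field` cast to a single type.

    Plain-Python stand-in for the 'cast' strategy; this is not the Glue API.
    """
    resolved = []
    for record in records:
        fixed = dict(record)  # copy so the input is left untouched
        if field in fixed and fixed[field] is not None:
            fixed[field] = to_type(fixed[field])
        resolved.append(fixed)
    return resolved


# Hypothetical sample: "price" arrives as both a string and an int.
rows = [{"id": 1, "price": "9.5"}, {"id": 2, "price": 10}]
clean = resolve_choice_cast(rows, "price", float)
print(clean)
```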

4
Q

What are dynamic frames in Glue?

A

An extension of Spark's dataframes: a collection of self-describing records, so no schema has to be declared up front. Used for semi-structured data

5
Q

What does Hive allow you to do on EMR?

A

Run SQL-like queries from EMR

6
Q

Can you modify the data catalog using a script to update the partitions or schema recorded there?

A

Yes, you can do this for certain data formats (JSON, CSV, Avro, Parquet) and if the data is in S3

7
Q

What is a job bookmark in Glue ETL jobs?

A

A way to persist the state of the previous job run, allowing you to prevent the re-processing of old data
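The idea can be sketched in plain Python, assuming a small JSON file plays the role of the bookmark and each input file has a last-modified timestamp. The function name `run_job`, the file names and the timestamps are all hypothetical; Glue manages its own bookmark state when bookmarks are enabled.

```python
import json
import os
import tempfile


def run_job(files, bookmark_path):
    """Process only files newer than the bookmark, then advance it.

    `files` maps a file name to its last-modified timestamp (hypothetical input).
    """
    last_seen = 0
    if os.path.exists(bookmark_path):
        with open(bookmark_path) as fh:
            last_seen = json.load(fh)["last_modified"]

    # Anything at or before the bookmark was handled by an earlier run.
    new_files = sorted(name for name, ts in files.items() if ts > last_seen)

    if new_files:
        newest = max(files[name] for name in new_files)
        with open(bookmark_path, "w") as fh:
            json.dump({"last_modified": newest}, fh)
    return new_files


bookmark = os.path.join(tempfile.mkdtemp(), "bookmark.json")
first_run = run_job({"a.csv": 100, "b.csv": 200}, bookmark)
second_run = run_job({"a.csv": 100, "b.csv": 200, "c.csv": 300}, bookmark)
print(first_run, second_run)
```

The second run only picks up `c.csv`, because the bookmark written by the first run already covers the two older files.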

8
Q

What is Glue Studio?

A

A visual interface for defining ETL workflows using DAGs

9
Q

What is Glue Data Quality?

A

A feature within Glue Studio that allows you to perform an action based on an evaluation of the quality of your data, e.g. fail the whole job or report the results to CloudWatch

10
Q

What is Glue DataBrew?

A

A visual data preparation tool for transforming your data with over 250 ready-made transformations

11
Q

What should you create if you have a sequence of transformations you know you will want to re-use elsewhere in Glue DataBrew?

A

A ‘recipe’

12
Q

What are 3 ways to deal with PII in Glue DataBrew?

A
  • Substitute with random numbers
  • Shuffle them around so they don’t match the other values
  • Deterministically encrypt
  • Probabilistically encrypt
  • Decrypt
  • Null out or delete
  • Mask out part or all of it
  • Hash it
13
Q

What is the purpose of Glue Workflows?

A

To design multi-job or multi-crawler workflows within AWS Glue

14
Q

Does Lake Formation itself cost money?

A

No

15
Q

What is the finest grain of access in Lake Formation?

A

Cell-level, using LF Data Filters

16
Q

What is needed to do cross-account permissions in Lake Formation?

A

Set up the recipient as a data lake administrator
Use AWS RAM for accounts external to your organisation

17
Q

Can Athena query unstructured data?

A

Yes - it can do structured, semi-structured and unstructured data

18
Q

Does Athena support all of CSV, TSV, Avro, JSON?

A

Yes, it also supports Parquet and ORC (which are the obvious ones as they are columnar)

19
Q

What are Athena workgroups?

A

A way to organise users, teams, apps and workloads into groups, so that you can control query access and track costs by group, limit the amount of data that each group can scan and keep per-group query histories

20
Q

How do you pay for Athena?

A

Per TB scanned, for successful and cancelled queries but not failed queries
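A rough cost estimate under this model might look like the sketch below. The function name `athena_query_cost` is hypothetical, and the default of 5.0 USD per TB is an assumed rate that varies by region, so treat the numbers as illustrative only.

```python
def athena_query_cost(bytes_scanned, price_per_tb=5.0, status="SUCCEEDED"):
    """Estimate the cost of one query in USD.

    price_per_tb is an assumed rate; a TB is taken as 1024**4 bytes here,
    which is also an assumption.
    """
    if status == "FAILED":
        return 0.0  # failed queries are not charged
    # Succeeded and cancelled queries pay for whatever was scanned.
    terabytes = bytes_scanned / 1024 ** 4
    return terabytes * price_per_tb


full_scan = athena_query_cost(1024 ** 4)
cancelled = athena_query_cost(1024 ** 4 // 2, status="CANCELLED")
failed = athena_query_cost(1024 ** 4, status="FAILED")
print(full_scan, cancelled, failed)
```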

21
Q

What are 2 tips for optimising performance with Athena?

A

1/ Use columnar formats such as Parquet and ORC
2/ Use a small number of large files instead of a large number of small ones
3/ Use partitions

22
Q

What type of tables in Athena are ACID compatible?

A

Iceberg tables

23
Q

What negative effect can ACID support have on your tables?

A

It can bloat them, because a lot of extra data is kept to ensure consistency for all users. You should periodically compact your data

24
Q

What regulatory use case is ACID useful for?

A

GDPR compliance

25
What do ACID compliance and Iceberg tables enable?
Rollback, querying of historical data, verification of changes between updates, changing the partitioning of your data through a simple query
26
Are Glue Data Catalogs compatible with Iceberg by default?
No
27
What is Spark?
A distributed processing framework that supports Java, Scala, Python and R
28
What does CREATE TABLE AS SELECT do?
Creates a new table from the results of a query
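A common use of CTAS is converting raw data into a partitioned, columnar copy. The sketch below only builds the statement text and does not talk to Athena. The helper `build_ctas`, the table names and the bucket path are hypothetical.

```python
def build_ctas(new_table, select_sql, location, partition_cols):
    """Build an Athena CTAS statement that writes partitioned Parquet.

    Partition columns must come last in the SELECT list.
    """
    partitions = ", ".join(f"'{col}'" for col in partition_cols)
    return (
        f"CREATE TABLE {new_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        f"  external_location = '{location}',\n"
        f"  partitioned_by = ARRAY[{partitions}]\n"
        ")\n"
        f"AS {select_sql}"
    )


# Hypothetical table names and bucket.
statement = build_ctas(
    new_table="sales_parquet",
    select_sql="SELECT order_id, amount, year FROM sales_csv",
    location="s3://example-bucket/sales_parquet/",
    partition_cols=["year"],
)
print(statement)
```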
29
What does Kinesis use to integrate with Spark?
Kinesis Client Library
30
How can you run Jupyter notebooks with Spark?
Through the Athena console
31
How can you keep costs down when using Spark to query your data in a serverless manner?
Limit the data processing units for the co-ordinator and executor sizes
32
What are Athena Federated Queries?
Queries that can be used on data outside of S3 using data source connectors
33
Where are views that are created using Athena federated queries stored?
In the Glue Data Catalog, NOT in the original data source
34
What is a 'passthrough' Federated Query in Athena?
A query that allows you to use the native query language of the data source
35
What are the 3 types of node in EMR?
Master, Core, Task
36
What is the difference between a task and core node in EMR?
Task nodes cannot store data in HDFS, they only run tasks. Core nodes both run tasks and store HDFS data
37
What happens to HDFS data when a cluster is terminated?
It is lost
38
What type of node is always removed first when EMR is scaled down?
Task nodes
39
Can you add and remove core nodes on the fly in EMR?
Yes
40
How do you submit queries and scripts for EMR serverless?
Through job run requests
41
Is EMR serverless multi-region?
No
42
Can EMR run on EKS?
Yes
43
What is the read/write of a shard in KDS?
Write: 1MB/s Read: 2MB/s
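These per-shard limits can be turned into a rough sizing calculation for provisioned mode. In the sketch below the function name `shards_needed` is hypothetical, and the 1,000 records per second write limit per shard is an extra limit not stated on the card.

```python
import math


def shards_needed(write_mb_per_s, read_mb_per_s, records_per_s=0):
    """Estimate the shard count from per-shard limits.

    Per shard: 1 MB/s write, 2 MB/s read, 1,000 records/s write.
    """
    by_write = math.ceil(write_mb_per_s / 1.0)
    by_read = math.ceil(read_mb_per_s / 2.0)
    by_records = math.ceil(records_per_s / 1000)
    # The tightest limit decides; a stream always has at least one shard.
    return max(1, by_write, by_read, by_records)


print(shards_needed(5, 4))
print(shards_needed(1, 10))
print(shards_needed(0.5, 0.5, records_per_s=2500))
```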
44
When a data record is delivered to a consumer in KDS, what 3 features does it have?
A partition key, a sequence number and a data blob
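The partition key decides which shard a record lands on: KDS takes an MD5 hash of the key and maps the resulting 128-bit integer onto a shard's hash key range. The sketch below assumes the hash key space is split evenly across the shards, which is not guaranteed after resharding. The function name `shard_for_key` and the sample keys are hypothetical.

```python
import hashlib


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)  # 128-bit integer
    range_size = 2 ** 128 // shard_count
    # Clamp so the remainder at the top of the space goes to the last shard.
    return min(hash_value // range_size, shard_count - 1)


for key in ["user-1", "user-2", "user-3"]:
    print(key, shard_for_key(key, shard_count=4))
```

Because the mapping is deterministic, all records with the same partition key go to the same shard, which is what preserves their relative order.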
45
What are KDS' 2 capacity modes?
Provisioned mode and on-demand mode
46
What 2 languages are used by the KPL?
C++ or Java
47
What API does the KDS SDK use?
PutRecord(s)
47
When might you use the KDS SDK versus the other options?
If you don't mind higher latency and lower throughput, and want a simpler API
47
What is 1 positive and 1 negative of the KPL?
Positive: high performance, with batching and an automatic, configurable retry mechanism. Negative: its batched records have to be decoded with the KCL
48
What type of servers is the Kinesis agent compatible with?
Linux
49
What is the maximum number of GetRecords KDS calls per second?
5 per shard
50
Does the KDS KCL support checkpointing?
Yes, with DynamoDB
51
What is the Kinesis Connector Library?
A separate library from the KCL, used to write data to S3, DDB, Redshift and OpenSearch
52
What does KDS enhanced fan-out mean and what combination of service and library does it work with?
Each consumer gets its own 2MB/s of provisioned throughput per shard, instead of sharing it with the other consumers. Works with Lambda and the KCL (2.0 onwards)
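The difference shows up once several consumers read the same shard. A minimal sketch, assuming the standard 2MB/s per-shard read limit is split evenly between consumers when enhanced fan-out is off (the function name is hypothetical):

```python
def read_mb_per_consumer(consumers, enhanced_fan_out):
    """Approximate read throughput per consumer for a single shard."""
    per_shard_mb = 2.0
    if enhanced_fan_out:
        return per_shard_mb  # each consumer has a dedicated 2 MB/s
    return per_shard_mb / consumers  # shared between all consumers


shared = read_mb_per_consumer(4, enhanced_fan_out=False)
dedicated = read_mb_per_consumer(4, enhanced_fan_out=True)
print(shared, dedicated)
```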
53
What is one risk when merging or splitting shards?
Your records can be read out of order, because a consumer may start reading from the child shards before it has finished reading the data in the parent shard. The KCL has logic to counteract this
54
Why can network timeouts cause duplicates from producers?
Because the ACK from KDS never reaches the producer, so the producer sends the data again until it gets an ACK back
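One way to tolerate such duplicates is to embed a unique ID in each record and make the consumer skip IDs it has already handled. The sketch below keeps the seen IDs in an in-memory set purely for illustration. The record layout and the `consume` helper are hypothetical.

```python
def consume(records, seen_ids, sink):
    """Append each payload to `sink` once, skipping repeated record IDs."""
    for record in records:
        if record["id"] in seen_ids:
            continue  # duplicate caused by a producer retry
        seen_ids.add(record["id"])
        sink.append(record["payload"])


seen, output = set(), []
batch = [
    {"id": "evt-1", "payload": "a"},
    {"id": "evt-1", "payload": "a"},  # retried after a lost ACK
    {"id": "evt-2", "payload": "b"},
]
consume(batch, seen, output)
print(output)
```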
55
Do KDS and KDF both have auto-scaling?
No. KDS in provisioned mode cannot auto-scale, KDF can
56
What is the concept of buffer size in KDF?
Firehose accumulates records in a buffer and then sends them as a batch. The buffer is defined by size and time, and it is flushed when whichever limit is reached first. To get faster delivery you can reduce the buffer, or you can increase the buffer to increase throughput.
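The "whichever comes first" rule can be modelled with a small class. This is a simplified sketch, not Firehose's implementation: the time limit is only checked when a record arrives, whereas a real service would flush on a timer. The class name `DeliveryBuffer` and the limits used are hypothetical.

```python
class DeliveryBuffer:
    """Collects records and flushes on a size or age limit."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.started = None  # arrival time of the first buffered record

    def add(self, record, now):
        """Add a record; return the flushed batch, or None if still buffering."""
        if self.started is None:
            self.started = now
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = now - self.started >= self.max_seconds
        if too_big or too_old:
            batch = self.records
            self.records, self.size, self.started = [], 0, None
            return batch
        return None


buf = DeliveryBuffer(max_bytes=10, max_seconds=60)
r1 = buf.add(b"aaaa", now=0)      # 4 bytes buffered, no flush
r2 = buf.add(b"bbbbbbb", now=1)   # 11 bytes buffered, size limit hit
r3 = buf.add(b"c", now=2)         # a new buffer starts at t=2
r4 = buf.add(b"d", now=62)        # 60 s since t=2, time limit hit
print(r1, r2, r3, r4)
```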
57
What is Managed Service for Apache Flink used for?
Streaming ETL, continuous metric generation, responsive analytics
58
Where is the data in MSK stored?
EBS volumes
59
What is 1 key difference in message size between KDS and MSK?
MSK can be configured to raise the maximum message size above 1MB, whereas KDS is stuck at a maximum of 1MB
60
Can Kafka ACLs be managed with IAM?
No
61
What is MSK Connect?
A framework for moving data from somewhere into Kafka or vice versa. It essentially lets you plug many sources and destinations into your MSK cluster
62
What do you need to define for MSK serverless?
Topics and partitions
63
What is SPICE?
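QS's in-memory engine (Super-fast, Parallel, In-memory Calculation Engine). Data imported into SPICE can be queried quickly, including by queries that would time out if run against Athena directly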
64
What level of security can QS do?
Row and column level
65
Can QS access data in another region by default?
No
66
What are embedded dashboards and how does access to them work?
Embedded dashboards are dashboards that you can share through your web app. Access is limited to users who also have QS access, which is authenticated through SSO, Active Directory or Cognito
67
What are the 4 types of ML insights QS offers?
  • Anomaly detection
  • Forecasting
  • Auto-narratives
  • Suggested insights
68
What are QS calculated fields?
New fields that you can create based on others, e.g. profit is the revenue column minus the costs column
69
What are documents in OpenSearch?
What you are searching for - any structured JSON
70
What are types in OpenSearch?
The schema and mapping shared by documents that are similar/represent the same thing, e.g. logs are all one 'type'
71
What are indexes and shards in OpenSearch?
An index is split into shards, and each document is hashed to a specific shard
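The link between a document and its shard is typically a hash of a routing value (the document ID by default) taken modulo the number of primary shards. The sketch below uses `zlib.crc32` as a stand-in hash, which is an assumption made for illustration and not the hash function OpenSearch itself uses. The function name and document IDs are hypothetical.

```python
import zlib


def shard_for_document(doc_id, primary_shards):
    """Pick a primary shard for a document ID (illustrative hash only)."""
    return zlib.crc32(doc_id.encode("utf-8")) % primary_shards


for doc_id in ["log-001", "log-002", "log-003"]:
    print(doc_id, shard_for_document(doc_id, primary_shards=2))
```

Because the result depends on the number of primary shards, that number cannot simply be changed later without re-indexing the documents.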
72
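How do indexes in OpenSearch enable the parallelisation of read/write?
Each index is split into primary shards that have replica shards, e.g. 2 primary and 2 replica shards. Writes go to the primaries and are then replicated, while reads can be served by any primary or replica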
73
What is OpenSearch used for?
  • Search
  • Log analytics (e.g. security and clickstream analytics)
  • Application monitoring based on incoming log data
74
What are OpenSearch's 3 storage speed options and what do they use?
  • Hot: EBS volumes
  • Warm/UltraWarm: S3 + caching
  • Cold: S3 only
75
Can you move between OpenSearch's storage speed options automatically?
Yes
76
Why might you run Index State Management in OpenSearch?
To delete old indexes, move indexes into a read-only state, reduce replica count over time
77
What are index roll-ups?
When you periodically roll up data into a summarised index. Reduces storage space but the new index will have less detailed data
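As an illustration of the trade-off, the sketch below rolls individual log documents up into one summary per hour. The field names (`timestamp`, `bytes`) and the helper `roll_up_hourly` are hypothetical, and plain Python lists stand in for the source and target indexes.

```python
from collections import defaultdict


def roll_up_hourly(docs):
    """Summarise documents into one count and byte total per hour."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for doc in docs:
        hour = doc["timestamp"][:13]  # e.g. "2024-01-01T10"
        totals[hour]["count"] += 1
        totals[hour]["bytes"] += doc["bytes"]
    return dict(totals)


# Hypothetical source documents.
logs = [
    {"timestamp": "2024-01-01T10:05:00", "bytes": 100},
    {"timestamp": "2024-01-01T10:45:00", "bytes": 50},
    {"timestamp": "2024-01-01T11:01:00", "bytes": 10},
]
summary = roll_up_hourly(logs)
print(summary)
```

Three documents shrink to two summaries, but the individual request times and sizes can no longer be recovered from the rolled-up data.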
78
What is the difference between index roll ups and index transforms?
Index roll-ups are more about saving space, index transforms are more about creating a different view with which to analyse the data in a new way
79
What can happen if there are unbalanced numbers of shards across nodes in OpenSearch?
You can have memory issues
80
What are the lower limits for search and indexing capacity in OpenSearch?
2 OpenSearch Compute Units (OCUs) for each
81
What are the 2 types of collections in OpenSearch Serverless?
Search (optimised for search workloads) and time series (for sequential/time-series data and better for append-only)