Analytics Flashcards by Al Them

What is the purpose of Glue crawlers?

To populate the Glue Data Catalog with metadata on the data in S3

What does crawling with Glue Crawlers enable for Athena, Redshift and EMR?

It lets you query your unstructured data with Athena, Redshift and EMR as if it were structured

What is ResolveChoice in Glue dynamic frames?

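It resolves ambiguity in a dynamic frame where one field holds values of more than one type, e.g. a column that is an integer in some records and a string in others. You can cast everything to one type, split the field into one column per type, keep both in a struct, or project to a single type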
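
A minimal plain-Python sketch of the idea, assuming the ambiguity is a hypothetical `price` field that arrives as an integer in some records and as a string in others. The helpers `resolve_cast` and `resolve_make_cols` are illustrative names only; they imitate the 'cast' and 'make_cols' style of resolution and are not the Glue API.

```python
# Plain-Python illustration only; these helpers are hypothetical, not the Glue API.

def resolve_cast(records, field, to_type):
    """Imitate a 'cast' resolution: force every value of `field` to one type."""
    return [{**r, field: to_type(r[field])} for r in records]


def resolve_make_cols(records, field):
    """Imitate a 'make_cols' resolution: one column per observed type."""
    resolved = []
    for r in records:
        out = {k: v for k, v in r.items() if k != field}
        value = r[field]
        # Name the new column after the Python type of the value it holds.
        out[f"{field}_{type(value).__name__}"] = value
        resolved.append(out)
    return resolved


rows = [{"id": 1, "price": 10}, {"id": 2, "price": "12"}]
cast_rows = resolve_cast(rows, "price", int)
split_rows = resolve_make_cols(rows, "price")
```

Casting gives one clean column; splitting keeps the original values intact at the cost of a wider table.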

What are dynamic frames in Glue?

An extension of Spark’s dataframes, a collection of records that have a schema. Used for semi-structured data

What does Hive allow you to do on EMR?

Run SQL-like queries from EMR

Can you modify the data catalog using a script to update the partitions or schema recorded there?

Yes, you can do this for certain data formats (JSON, CSV, Avro, Parquet) and if the data is in S3

What is a job bookmark in Glue ETL jobs?

A way to persist the state of the previous job run, allowing you to prevent the re-processing of old data
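
The bookmark idea can be sketched in plain Python. This is not how Glue stores bookmarks internally; it assumes a hypothetical JSON file that holds the newest timestamp already processed, and a hypothetical `run_job` function that skips anything at or before it.

```python
import json
import tempfile
from pathlib import Path


def run_job(files, bookmark_path):
    """Process only files newer than the bookmark, then advance the bookmark.

    `files` maps a file name to its last-modified timestamp (a number).
    """
    path = Path(bookmark_path)
    last_seen = json.loads(path.read_text())["last_seen"] if path.exists() else 0
    new_files = sorted(name for name, ts in files.items() if ts > last_seen)
    if new_files:
        last_seen = max(files[name] for name in new_files)
    # Persist the state so that the next run can pick up where this one stopped.
    path.write_text(json.dumps({"last_seen": last_seen}))
    return new_files


with tempfile.TemporaryDirectory() as tmp:
    bookmark = Path(tmp) / "bookmark.json"
    first_run = run_job({"a.csv": 100, "b.csv": 200}, bookmark)
    second_run = run_job({"a.csv": 100, "b.csv": 200, "c.csv": 300}, bookmark)
```

The second run sees all three files but only picks up the one that arrived after the first run.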

What is Glue Studio?

A visual interface for defining ETL workflows using DAGs

What is Glue Data Quality?

A feature within Glue Studio that allows you to perform an action based on an evaluation of the quality of your data, e.g. fail the whole job or report the results to CloudWatch

What is Glue DataBrew?

A visual data preparation tool for transforming your data with over 250 ready-made transformations

What should you create if you have a sequence of transformations you know you will want to re-use elsewhere in Glue DataBrew?

A ‘recipe’

What are 3 ways to deal with PII in Glue DataBrew?

Substitute with random numbers
Shuffle them around so they don’t match the other values
Deterministically encrypt
Probabilistically encrypt
Decrypt
Null out or delete
Mask out part or all of it
Hash it

What is the purpose of Glue Workflows?

To design multi-job or multi-crawler workflows within AWS Glue

Does Lake Formation itself cost money?

What is the finest grain of access in Lake Formation?

Cell-level, using LF Data Filters

What is needed to do cross-account permissions in Lake Formation?

Set up the recipient as a data lake administrator
Use AWS RAM for accounts external to your organisation

Can Athena query unstructured data?

Yes - it can do structured, semi-structured and unstructured data

Does Athena support all of CSV, TSV, Avro, JSON?

Yes, it also supports Parquet and ORC (which are the obvious ones as they are columnar)

What are Athena workgroups?

They let you organise users, teams, apps and workloads into groups, so that you can control query access, track costs, limit the amount of data that each group can scan, and keep query histories per group

How do you pay for Athena?

Per TB scanned, for successful and cancelled queries but not failed queries
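
A rough sketch of the arithmetic. The price of 5 USD per TB is an assumption for illustration, as is the choice of 1024^4 bytes per TB; check the current pricing for your region. `query_cost_usd` is a hypothetical helper, not an AWS API.

```python
PRICE_PER_TB_USD = 5.00   # assumption: illustrative price, varies by region
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte


def query_cost_usd(bytes_scanned, failed=False):
    """Estimate the charge for one query; failed queries are not billed."""
    if failed:
        return 0.0
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


full_tb = query_cost_usd(BYTES_PER_TB)
half_tb = query_cost_usd(BYTES_PER_TB // 2)
failed_query = query_cost_usd(BYTES_PER_TB, failed=True)
```

Because the bill follows bytes scanned, anything that reduces scanning (columnar formats, partitions) reduces cost as well as latency.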

What are 3 tips for optimising performance with Athena?

1/ Use columnar formats such as Parquet and ORC
2/ Use a small number of large files instead of a large number of small ones
3/ Use partitions
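
Why partitions help can be sketched in plain Python: with Hive-style `key=value` paths, a filter on the partition columns narrows the scan to one prefix. The bucket name, table layout and helper names below are hypothetical.

```python
def partition_prefix(table_root, **partitions):
    """Build a Hive-style partition path such as root/year=2024/month=01/."""
    parts = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"{table_root.rstrip('/')}/{parts}/"


def prune(object_keys, table_root, **partitions):
    """Keep only the objects that live under the requested partition."""
    prefix = partition_prefix(table_root, **partitions)
    return [key for key in object_keys if key.startswith(prefix)]


keys = [
    "s3://example-bucket/sales/year=2023/month=12/part-0.parquet",
    "s3://example-bucket/sales/year=2024/month=01/part-0.parquet",
    "s3://example-bucket/sales/year=2024/month=02/part-0.parquet",
]
january = prune(keys, "s3://example-bucket/sales", year="2024", month="01")
```

Only one of the three files would need to be read for a query restricted to January 2024.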

What type of tables in Athena are ACID compatible?

Iceberg tables

What negative effect can ACID support have on your tables?

Can bloat them with lots and lots of data held to ensure consistency for all users - you should periodically compact your data

What regulatory use case is ACID useful for?

GDPR compliance

What do ACID compliance and iceberg tables enable?

Rollback, querying of historical data, verification of changes between updates, changing the partitioning of your data through a simple query

Are Glue Data Catalogs compatible with Iceberg by default?

What is Spark?

A distributed processing framework that supports Java, Scala, Python and R

What does CREATE TABLE AS SELECT do?

Creates a new table from the results of a query
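
A small runnable illustration of the statement, using Python's built-in SQLite as a stand-in for Athena. The `orders` table and its columns are made up, and Athena's own CTAS accepts extra options (such as output format) that SQLite does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "eu", 10.0), (2, "us", 25.0), (3, "eu", 5.0)],
)

# The new table takes both its schema and its rows from the SELECT.
conn.execute(
    "CREATE TABLE eu_orders AS SELECT id, amount FROM orders WHERE region = 'eu'"
)
eu_rows = conn.execute("SELECT id, amount FROM eu_orders ORDER BY id").fetchall()
```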

What does Kinesis use to integrate with Spark?

Kinesis Client Library

How can you run Jupyter notebooks with Spark?

Through the Athena console

How can you keep costs down when using Spark to query your data in a serverless manner?

Limit the data processing units for the co-ordinator and executor sizes

What are Athena Federated Queries?

Queries that can be used on data outside of S3 using data source connectors

Where are views that are created using Athena federated queries stored?

In Glue, NOT in the original data source

What is a 'passthrough' Federated Query in Athena?

A query that allows you to use the native query language of the data source

What are the 3 types of node in EMR?

Master, Core, Task

What is the difference between a task and core node in EMR?

Task nodes cannot store data

What happens to HDFS data when a cluster is terminated?

It is lost

What type of node is always removed first when EMR is scaled down?

Task nodes

Can you add and remove core nodes on the fly in EMR?

Yes

How do you submit queries and scripts for EMR serverless?

Through job run requests

Is EMR serverless multi-region?

Can EMR run on EKS?

Yes

What is the read/write of a shard in KDS?

Write: 1 MB/s per shard
Read: 2 MB/s per shard
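
These per-shard limits turn into a simple sizing calculation for provisioned mode. A minimal sketch, assuming only the two throughput limits above matter (it ignores per-shard record-count limits); `shards_needed` is a hypothetical helper.

```python
import math

WRITE_MB_PER_SHARD = 1  # per-shard write limit from the card above
READ_MB_PER_SHARD = 2   # per-shard read limit from the card above


def shards_needed(write_mb_per_s, read_mb_per_s):
    """Smallest shard count that satisfies both the write and the read rate."""
    for_writes = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    return max(1, for_writes, for_reads)


write_heavy = shards_needed(write_mb_per_s=5, read_mb_per_s=6)
read_heavy = shards_needed(write_mb_per_s=2, read_mb_per_s=10)
```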

When a data record is written to a consumer in KDS, what 3 features does it have?

A partition key, a sequence number and a data blob
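
The partition key decides which shard a record lands on: KDS hashes it with MD5 into a 128-bit integer and picks the shard whose hash key range contains that value. A sketch of that mapping, assuming (for simplicity) that the hash space is split evenly across shards; `shard_for_key` is a hypothetical helper.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the few values past the last full range stay on the last shard.
    return min(hash_value // range_size, shard_count - 1)


same_key_twice = (shard_for_key("user-1", 4), shard_for_key("user-1", 4))
```

Records with the same partition key always map to the same shard, which is what preserves their ordering.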

What are KDS' 2 capacity modes?

Provisioned mode and on-demand mode

What 2 languages are used by the KPL?

C++ or Java

What API does the KDS SDK use?

PutRecord(s)

When might you use the KDS SDK versus the other options?

If you don't mind higher latency and lower throughput, and you want a simpler API

What is 1 positive and 1 negative of the KPL?

Positive: it has high performance, an automatic and configurable retry mechanism, and uses batching
Negative: its records have to be decoded with the KCL

What type of servers is the Kinesis agent compatible with?

Linux

What is the maximum number of GetRecord KDS calls per second?

Does the KDS KCL support checkpointing?

Yes, with DynamoDB

What is the Kinesis Connector Library?

It is different to the KCL. It is used to write data to S3, DDB, Redshift and OpenSearch

What does KDS enhanced fan-out mean and what combination of service and library does it work with?

It allows each consumer to get its own 2MB/s of provisioned throughput per shard. It works with Lambda and the KCL

What is one risk when merging or splitting shards?

Your records can be read out of order if a consumer starts reading from the child shards before it has finished reading the data in the parent shard. The KCL has logic to counteract this

Why can network timeouts cause duplicates from producers?

Because the ACK from KDS never reaches the producer, so the producer sends the data again until it gets an ACK back

Do KDS and KDF both have auto-scaling?

No. KDS cannot auto-scale, KDF can

What is the concept of buffer size in KDF?

Firehose accumulates records in a buffer and then sends them as a batch. The buffer is defined by size and time, and is flushed when either limit is reached, whichever comes first. Reduce the buffer to get faster delivery, or increase it to increase throughput.
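
The 'size or time, whichever comes first' rule can be sketched as a small class. This is an illustration of the rule only, not Firehose's implementation; the `Buffer` class, its limits and the explicit `now` argument are all assumptions made to keep the example deterministic.

```python
class Buffer:
    """Collects records and flushes when a size or an age limit is reached."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.started_at = None
        self.batches = []  # stands in for deliveries to the destination

    def add(self, record, now):
        if self.started_at is None:
            self.started_at = now  # the clock starts with the first record
        self.records.append(record)
        self.size += len(record)
        self.flush_if_due(now)

    def flush_if_due(self, now):
        if not self.records:
            return
        too_big = self.size >= self.max_bytes
        too_old = now - self.started_at >= self.max_seconds
        if too_big or too_old:
            self.batches.append(self.records)
            self.records, self.size, self.started_at = [], 0, None


buf = Buffer(max_bytes=10, max_seconds=60)
buf.add(b"aaaaaa", now=0)   # 6 bytes: below both limits
buf.add(b"bbbbbb", now=1)   # 12 bytes: the size limit triggers a flush
buf.add(b"cc", now=5)       # starts a new buffer
buf.flush_if_due(now=70)    # 65 seconds old: the time limit triggers a flush
```

A smaller buffer flushes sooner (lower latency); a larger one sends fewer, bigger batches (higher throughput).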

What is Managed Service for Apache Flink used for?

Streaming ETL, continuous metric generation, responsive analytics

Where is the data in MSK stored?

EBS volumes

What is 1 key difference in message size between KDS and MSK?

MSK can do custom config to make max message size higher than 1MB, KDS is stuck at the maximum of 1MB

Can Kafka ACLs be managed with IAM?

What is MSK Connect?

A framework for taking data from somewhere into Kafka or vice versa. It essentially allows you to plug many sources and destinations into your MSK cluster

What do you need to define for MSK serverless?

Topics and partitions

What is SPICE?

QS's in-memory engine (Super-fast, Parallel, In-memory Calculation Engine). Data is imported into it to speed up queries that would time out if run against Athena directly

What level of security can QS do?

Row and column level

Can QS access data in another region by default?

What are embedded dashboards and how does access to them work?

Embedded dashboards are dashboards that you can share through your web app. Access is limited to those who also have QS access, which is authenticated through SSO, Active Directory or Cognito

What are the 4 types of ML insights QS offers?

Anomaly detection
Forecasting
Auto-narratives
Suggested insights

What are QS calculated fields?

New fields that you can create based on others, e.g. profit is the revenue column - the costs column
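
The profit example, written out in plain Python rather than in QuickSight's own expression syntax. The sample rows are made up.

```python
rows = [
    {"revenue": 120.0, "costs": 80.0},
    {"revenue": 50.0, "costs": 65.0},
]

# A calculated field is derived from existing fields, row by row.
with_profit = [{**row, "profit": row["revenue"] - row["costs"]} for row in rows]
```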

What are documents in OpenSearch?

The things you are searching for: any structured JSON

What are types in OpenSearch?

The schema and mapping shared by documents that are similar or represent the same thing, e.g. logs are all one 'type'

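What are indexes and shards in OpenSearch?

An index is split into shards, and each document belongs to a specific shard

How do indexes in OpenSearch enable the parallelisation of read/write?

They are split into primary and replica shards, e.g. 2 primary and 2 replica shards. Writes go to a primary shard and are then replicated; reads can be served by a primary or any replica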
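
How a document ends up on one particular shard can be sketched as a hash of its ID modulo the number of primary shards. This is a simplification: `zlib.crc32` is used here as a stand-in for the real hash function, and `primary_shard_for` is a hypothetical helper.

```python
import zlib


def primary_shard_for(doc_id, primary_shards):
    """Pick a primary shard for a document (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % primary_shards


placement = {}
for doc_id in ["log-1", "log-2", "log-3", "log-4"]:
    placement.setdefault(primary_shard_for(doc_id, 2), []).append(doc_id)
```

Because each document maps to exactly one primary shard, different shards can index and search their own documents in parallel.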

What is OpenSearch used for?

Search
Log analytics (e.g. security and clickstream analytics)
Application monitoring based on incoming log data

What are OpenSearch's 3 storage speed options and what do they use?

Hot: EBS volumes
Warm/UltraWarm: S3 + caching
Cold: S3 only

Can you move between OpenSearch's storage speed options automatically?

Yes

Why might you run Index State Management in OpenSearch?

To delete old indexes, move indexes into a read-only state, reduce replica count over time

What are index roll-ups?

When you periodically roll up data into a summarised index. Reduces storage space but the new index will have less detailed data
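
A roll-up can be pictured as a group-and-summarise step. A minimal plain-Python sketch, assuming hypothetical log documents with an ISO `timestamp` and a `bytes` field, summarised per hour; it is not the OpenSearch roll-up job API.

```python
from collections import defaultdict


def roll_up(docs):
    """Summarise detailed documents into one count and byte total per hour."""
    summary = defaultdict(lambda: {"count": 0, "bytes": 0})
    for doc in docs:
        hour = doc["timestamp"][:13]  # e.g. "2024-01-01T10"
        summary[hour]["count"] += 1
        summary[hour]["bytes"] += doc["bytes"]
    return dict(summary)


docs = [
    {"timestamp": "2024-01-01T10:05:00", "bytes": 100},
    {"timestamp": "2024-01-01T10:45:00", "bytes": 50},
    {"timestamp": "2024-01-01T11:10:00", "bytes": 70},
]
rolled = roll_up(docs)
```

Three documents become two summary rows: smaller to store, but the individual events can no longer be recovered from the summary.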

What is the difference between index roll ups and index transforms?

Index roll-ups are more about saving space, index transforms are more about creating a different view with which to analyse the data in a new way

What can happen if there are unbalanced numbers of shards across nodes in OpenSearch?

You can have memory issues

What are the lower limits for search and indexing capacity in OpenSearch?

2 OpenSearch Compute Units (OCUs) for each

What are the 2 types of collections in OpenSearch serverless?

Search (optimised for search workloads) and time series (for sequential/time-series data and better for append-only)

(83 cards)