Lesson 22 - Data & Analytics Flashcards

(10 cards)

1
Q

Amazon Athena: Serverless SQL Query Service for Data in S3

A

Amazon Athena is a serverless query service that allows you to analyse data stored in Amazon S3 using standard SQL.
Athena supports multiple data formats including CSV, JSON, ORC, Avro, and Parquet, with Parquet and ORC recommended for performance improvements.
Performance improves with columnar data formats, compression, partitioning, and fewer, larger files (AWS guidance suggests roughly 128 MB or more per file) to reduce per-file overhead.
Athena supports Federated Queries via Lambda connectors, enabling querying across various AWS and on-premises data sources.
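To make the partitioning point concrete, here is a minimal sketch of Hive-style partitioned S3 keys and of how a filter on the partition columns narrows the set of objects a query has to scan. The `logs` prefix, file names, and both helper functions are hypothetical; Athena performs the pruning itself, and this only imitates the idea locally.

```python
from datetime import date


def partition_key(prefix: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (year=/month=/day=)."""
    return (
        f"{prefix}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


def prune(keys: list[str], year: int, month: int) -> list[str]:
    """Keep only keys that sit under the requested year/month partition."""
    wanted = f"/year={year:04d}/month={month:02d}/"
    return [k for k in keys if wanted in k]


keys = [
    partition_key("logs", date(2024, 1, 15), "part-0000.parquet"),
    partition_key("logs", date(2024, 1, 16), "part-0000.parquet"),
    partition_key("logs", date(2024, 2, 1), "part-0000.parquet"),
]

# A query filtering on year = 2024 AND month = 1 only needs these objects.
january = prune(keys, 2024, 1)
print(january)
```

The fewer objects a query touches, the less data is scanned, which is why partitioning on commonly filtered columns helps.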

2
Q

Redshift

A

Amazon Redshift is a PostgreSQL-based OLAP database designed for analytics and data warehousing.
Redshift uses a columnar storage format and a parallel query engine to enhance query performance.
It supports provisioned and serverless modes; a provisioned cluster has a leader node that plans queries and aggregates results, and compute nodes that execute them.
Redshift offers snapshot-based disaster recovery, can ingest data through services such as Amazon Kinesis Data Firehose, and can query data directly in S3 with Redshift Spectrum.
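As a rough illustration of why columnar storage helps analytic queries, the sketch below sums one column over the same data held row-wise and column-wise and counts how many values each layout has to touch. The table, column names, and both functions are made up for the example; this is a toy model, not how Redshift stores data internally.

```python
rows = [
    {"order_id": 1, "region": "eu", "amount": 20.0},
    {"order_id": 2, "region": "us", "amount": 35.5},
    {"order_id": 3, "region": "eu", "amount": 14.5},
]

# Columnar layout: one list of values per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def total_row_store(rows):
    """Sum 'amount' while reading every field of every row."""
    values_read = 0
    total = 0.0
    for r in rows:
        values_read += len(r)  # the whole row is read to reach one field
        total += r["amount"]
    return total, values_read


def total_column_store(columns):
    """Sum 'amount' while reading only the 'amount' column."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


print(total_row_store(rows))
print(total_column_store(columns))
```

Both layouts give the same total, but the columnar one reads only the values the aggregation needs.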

3
Q

Compare Redshift vs Athena

A

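Redshift generally runs queries, joins, and aggregations faster than Athena because data is loaded into its own optimized columnar storage (tuned with sort and distribution keys rather than conventional indexes). The trade-off is that a provisioned Redshift cluster must be sized and kept running, whereas Athena is serverless only and queries data in place in S3.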

4
Q

Amazon OpenSearch Service Overview

A

Amazon OpenSearch Service is the successor to Amazon Elasticsearch Service, renamed due to licensing issues.
OpenSearch allows searching any fields, including partial matches, complementing databases like DynamoDB.
OpenSearch supports both managed and serverless cluster provisioning options.
Data ingestion into OpenSearch can be done via various AWS services such as DynamoDB Streams, CloudWatch Logs, Kinesis Data Firehose, and Lambda functions.
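A partial-match search of the kind described above can be expressed in the query DSL as a `wildcard` query. The sketch below only builds the request body as a Python dict; the field name `product_name`, the helper function, and the `size` default are assumptions for illustration, and sending the request to a cluster is left out.

```python
import json


def partial_match_query(field: str, fragment: str, size: int = 10) -> dict:
    """Build a search body matching documents whose field contains fragment."""
    return {
        "size": size,
        "query": {
            "wildcard": {
                field: {"value": f"*{fragment}*"}
            }
        },
    }


body = partial_match_query("product_name", "phone")
print(json.dumps(body, indent=2))
```

Leading wildcards can be expensive on large indexes, so treat this as a way to see the shape of the query rather than a tuned search design.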

5
Q

Amazon EMR Overview

A

Amazon EMR stands for Elastic MapReduce and is used to create Hadoop clusters for big data processing on AWS.
EMR simplifies provisioning and configuration of big data tools like Apache Spark, HBase, Presto, and Apache Flink.
EMR clusters consist of Master, Core, and optional Task nodes, each with specific roles and purchasing options.
Purchasing options include On-Demand, Reserved, and Spot Instances, each suited for different node types and workload reliability needs.
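One way to picture the pairing of node types and purchasing options is the list of instance groups passed when a cluster is created. The sketch below builds such a list in the shape used by the `InstanceGroups` parameter of boto3's `run_job_flow`, without calling AWS. The specific pairing (On-Demand for master and core, Spot for task nodes) is a common pattern assumed here, not something the card states, and the instance type and helper name are hypothetical.

```python
def instance_groups(core_count: int, task_count: int,
                    instance_type: str = "m5.xlarge") -> list[dict]:
    """Describe master, core, and optional task groups for an EMR cluster."""
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",  # assumption: long-running coordinator
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",  # assumption: core nodes hold data
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append({
            "Name": "task",
            "InstanceRole": "TASK",
            "Market": "SPOT",  # assumption: task nodes only run tasks
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups


groups = instance_groups(core_count=2, task_count=4)
for g in groups:
    print(g["InstanceRole"], g["Market"], g["InstanceCount"])
```

Task nodes are optional, so the task group is added only when a positive count is requested.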

6
Q

Amazon QuickSight Overview

A

Amazon QuickSight is a serverless, machine learning-powered business intelligence service for creating interactive dashboards.
QuickSight integrates with various AWS data sources like RDS, Aurora, Athena, Redshift, S3, and third-party sources.
The SPICE engine enables fast, in-memory computation when data is imported directly into QuickSight.
QuickSight supports user and group management within the service, with dashboards as read-only snapshots of analyses that can be shared securely.

7
Q

AWS Glue: Serverless ETL and Data Cataloging Service

A

AWS Glue is a fully managed, serverless extract, transform, and load (ETL) service that prepares and transforms data for analytics.
Glue can convert data formats, such as transforming CSV files into the columnar Parquet format, which optimizes querying with services like Amazon Athena.
Glue Data Catalog crawls various data sources to collect metadata, which is leveraged by Glue jobs and other AWS services like Athena, Redshift Spectrum, and EMR.
Additional Glue features include Job Bookmarks to avoid reprocessing data, Glue DataBrew for data cleaning, Glue Studio for GUI-based ETL job management, and Glue Streaming ETL for real-time data processing using Apache Spark Structured Streaming.
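The idea behind Job Bookmarks can be sketched as a job that remembers which inputs it has already handled and skips them on the next run. This is a toy model with hypothetical file names and a plain in-memory set; Glue persists its own bookmark state per job, and the details of that state are not modelled here.

```python
def run_job(available: list[str], bookmark: set[str]) -> list[str]:
    """Process only inputs not yet recorded in the bookmark, then record them."""
    new_files = [f for f in available if f not in bookmark]
    bookmark.update(new_files)
    return new_files


bookmark: set[str] = set()

first_run = run_job(["a.csv", "b.csv"], bookmark)
second_run = run_job(["a.csv", "b.csv", "c.csv"], bookmark)

print(first_run)
print(second_run)
```

On the second run only the newly arrived file is picked up, which is the reprocessing that bookmarks are meant to avoid.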

8
Q

Lake Formation

A

AWS Lake Formation is a fully managed service that simplifies the creation of data lakes, reducing setup time from months to days.
It automates complex manual steps such as data discovery, cleansing, transformation, ingestion, and de-duplication using machine learning transforms.
Lake Formation supports combining structured and unstructured data from various sources, including Amazon S3, RDS, Aurora, and on-premises SQL/NoSQL databases, with out-of-the-box blueprints.
It provides centralized, fine-grained access control at the row and column level, enabling consistent security management across multiple analytics services like Athena, Redshift, EMR, and QuickSight.
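Row- and column-level access control can be pictured as two filters applied before a user sees any data: one that keeps only permitted rows, and one that keeps only permitted columns. The sketch below is a local illustration with a made-up `customers` table and a hypothetical `apply_grant` helper; it does not use the Lake Formation API.

```python
from typing import Callable

Row = dict[str, object]


def apply_grant(rows: list[Row], allowed_columns: set[str],
                row_filter: Callable[[Row], bool]) -> list[Row]:
    """Return only permitted rows, each reduced to the permitted columns."""
    return [
        {col: val for col, val in row.items() if col in allowed_columns}
        for row in rows
        if row_filter(row)
    ]


customers = [
    {"id": 1, "country": "DE", "email": "a@example.com"},
    {"id": 2, "country": "US", "email": "b@example.com"},
]

# Hypothetical analyst grant: German customers only, without the email column.
visible = apply_grant(customers, {"id", "country"},
                      lambda r: r["country"] == "DE")
print(visible)
```

Defining such rules once, centrally, is what lets the same restrictions apply whichever analytics service reads the data.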

9
Q

Amazon Managed Service for Apache Flink

A

Amazon Managed Service for Apache Flink is a managed service for running Apache Flink applications on AWS.
Apache Flink is a framework for real-time data stream processing, with applications written mainly in Java, Scala, or SQL.
The service provisions compute resources, supports parallel computation and automatic scaling, and manages application backups via checkpoints and snapshots.
Flink can read data from Kinesis Data Streams and Amazon MSK (Apache Kafka), but not from Amazon Data Firehose.

10
Q

MSK - Managed Streaming for Apache Kafka

A

Amazon MSK provides a fully managed Apache Kafka cluster on AWS, simplifying deployment and management.
MSK Serverless allows running Apache Kafka without provisioning or managing servers, with automatic scaling of compute and storage.
Kafka and Kinesis are both streaming data solutions, but differ in scaling, message limits, and partition management.
Data can be produced to MSK using Kafka Producers and consumed using various AWS services or custom consumers.
