Lesson 22 - Data & Analytics Flashcards

Question 1

Q

Amazon Athena: Serverless SQL Query Service for Data in S3

Answer

A

Amazon Athena is a serverless query service that allows you to analyse data stored in Amazon S3 using standard SQL.
Athena supports multiple data formats including CSV, JSON, ORC, Avro, and Parquet, with Parquet and ORC recommended for performance improvements.
Performance can be enhanced by using columnar data formats, compressing data, partitioning datasets, and using larger files.
Athena supports Federated Queries via Lambda connectors, enabling querying across various AWS and on-premises data sources.
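As a rough illustration of why partitioning helps, the sketch below mimics partition pruning on Hive-style S3 keys (`year=.../month=...`). The key layout and the `partition_values` and `prune` helpers are hypothetical; Athena performs this step internally based on the table's partition metadata.

```python
# Hypothetical S3 object keys laid out with Hive-style partitions.
KEYS = [
    "logs/year=2023/month=12/part-0000.parquet",
    "logs/year=2024/month=01/part-0000.parquet",
    "logs/year=2024/month=01/part-0001.parquet",
    "logs/year=2024/month=02/part-0000.parquet",
]


def partition_values(key):
    """Extract {'year': '2024', 'month': '01'} from a Hive-style key."""
    parts = (segment.split("=", 1) for segment in key.split("/") if "=" in segment)
    return {name: value for name, value in parts}


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every filter."""
    return [
        key for key in keys
        if all(partition_values(key).get(name) == value for name, value in wanted.items())
    ]


# A query filtering on year = 2024 AND month = 01 only needs two of the four files.
to_scan = prune(KEYS, year="2024", month="01")
print(len(to_scan), "of", len(KEYS), "files scanned")
```

Fewer files scanned means less data read, which is what drives both query time and cost in Athena.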

Question 2

Q

Redshift

Answer

A

Amazon Redshift is a PostgreSQL-based OLAP database designed for analytics and data warehousing.
Redshift uses a columnar storage format and a parallel query engine to enhance query performance.
It is available in provisioned-cluster and serverless modes; a cluster consists of a leader node, which plans queries and aggregates results, and compute nodes, which execute the queries.
Redshift offers snapshot-based disaster recovery and integrates with tools like Amazon Kinesis Data Firehose and Redshift Spectrum for data ingestion and querying from S3.
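To make the columnar-storage point concrete, here is a toy comparison in plain Python (a conceptual model, not Redshift's actual storage engine). The same hypothetical sales table is held row-wise and column-wise, and an aggregation over one column touches far fewer values in the columnar layout.

```python
# Hypothetical table stored row by row (typical of OLTP databases).
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

# The same table stored column by column (the layout OLAP engines use).
columns = {name: [row[name] for row in rows] for name in rows[0]}

# SELECT SUM(amount): the row layout reads every field of every row ...
values_read_row_layout = sum(len(row) for row in rows)
total_from_rows = sum(row["amount"] for row in rows)

# ... while the column layout reads only the 'amount' column.
values_read_column_layout = len(columns["amount"])
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
print(values_read_row_layout, "vs", values_read_column_layout, "values read")
```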

Question 3

Q

Compare Redshift vs Athena

Answer

A

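Redshift delivers faster queries, joins, and aggregations than Athena thanks to its optimized storage (sort keys play the role that indexes play in other databases), but a provisioned cluster has to be provisioned as a whole.
Athena is serverless only and queries data directly in S3.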

Question 4

Q

Amazon OpenSearch Service Overview

Answer

A

Amazon OpenSearch Service is the successor to Amazon Elasticsearch Service; it was renamed because of licensing issues.
OpenSearch allows searching any fields, including partial matches, complementing databases like DynamoDB.
OpenSearch supports both managed and serverless cluster provisioning options.
Data ingestion into OpenSearch can be done via various AWS services such as DynamoDB Streams, CloudWatch Logs, Kinesis Data Firehose, and Lambda functions.

Question 5

Q

Amazon EMR Overview

Answer

A

Amazon EMR stands for Elastic MapReduce and is used to create Hadoop clusters for big data processing on AWS.
EMR simplifies provisioning and configuration of big data tools like Apache Spark, HBase, Presto, and Apache Flink.
EMR clusters consist of Master, Core, and optional Task nodes, each with specific roles and purchasing options.
Purchasing options include On-Demand, Reserved, and Spot Instances, each suited for different node types and workload reliability needs.

Question 6

Q

Amazon QuickSight Overview

Answer

A

Amazon QuickSight is a serverless, machine learning-powered business intelligence service for creating interactive dashboards.
QuickSight integrates with various AWS data sources like RDS, Aurora, Athena, Redshift, S3, and third-party sources.
The SPICE engine enables fast, in-memory computation when data is imported directly into QuickSight.
QuickSight supports user and group management within the service, with dashboards as read-only snapshots of analyses that can be shared securely.

Question 7

Q

AWS Glue: Serverless ETL and Data Cataloging Service

Answer

A

AWS Glue is a fully managed, serverless extract, transform, and load (ETL) service that prepares and transforms data for analytics.
Glue can convert data formats, such as transforming CSV files into the columnar Parquet format, which optimizes querying with services like Amazon Athena.
Glue crawlers scan various data sources and write the collected metadata to the Glue Data Catalog, which is used by Glue jobs and other AWS services such as Athena, Redshift Spectrum, and EMR.
Additional Glue features include Job Bookmarks to avoid reprocessing data, Glue DataBrew for data cleaning, Glue Studio for GUI-based ETL job management, and Glue Streaming ETL for real-time data processing using Apache Spark Structured Streaming.
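The idea behind Job Bookmarks can be sketched in plain Python: remember what has already been processed and skip it on the next run. This is only a conceptual model with hypothetical names (`run_job`, `bookmark`); Glue tracks bookmark state itself when the feature is enabled on a job.

```python
def run_job(available_files, bookmark):
    """Process only files not seen in previous runs; return them and the new bookmark."""
    new_files = [name for name in available_files if name not in bookmark]
    # ... the transformation of new_files would happen here ...
    return new_files, bookmark | set(new_files)


bookmark = set()  # state persisted between runs (kept in memory here, for illustration)

first_run, bookmark = run_job(["a.csv", "b.csv"], bookmark)
second_run, bookmark = run_job(["a.csv", "b.csv", "c.csv"], bookmark)

print(first_run)   # both files are new on the first run
print(second_run)  # only the file added since the first run
```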

Question 8

Q

Lake Formation

Answer

A

AWS Lake Formation is a fully managed service that simplifies the creation of data lakes, reducing setup time from months to days.
It automates complex manual steps such as data discovery, cleansing, transformation, ingestion, and de-duplication using machine learning transforms.
Lake Formation supports combining structured and unstructured data from various sources, including Amazon S3, RDS, Aurora, on-premises SQL/NoSQL databases, with out-of-the-box blueprints.
It provides centralized, fine-grained access control at the row and column level, enabling consistent security management across multiple analytics services like Athena, Redshift, EMR, and QuickSight.
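Row- and column-level access control can be pictured as a filter applied before results reach the user. The sketch below is a plain-Python model with a hypothetical grant format and a hypothetical `apply_permissions` helper; it is not the Lake Formation API.

```python
# Hypothetical table rows.
CUSTOMERS = [
    {"id": 1, "country": "DE", "email": "a@example.com", "spend": 300},
    {"id": 2, "country": "US", "email": "b@example.com", "spend": 150},
    {"id": 3, "country": "DE", "email": "c@example.com", "spend": 90},
]

# Hypothetical grant: an analyst may see only German customers, and no emails.
ANALYST_GRANT = {
    "columns": ["id", "country", "spend"],
    "row_filter": lambda row: row["country"] == "DE",
}


def apply_permissions(rows, grant):
    """Return only the rows and columns the grant allows."""
    return [
        {col: row[col] for col in grant["columns"]}
        for row in rows
        if grant["row_filter"](row)
    ]


visible = apply_permissions(CUSTOMERS, ANALYST_GRANT)
print(visible)
```

Defining the grant once and applying it wherever the table is read is the point of centralizing permissions: each analytics service sees the same restricted view.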

Question 9

Q

Amazon Managed Service for Apache Flink

Answer

A

Amazon Managed Service for Apache Flink is a managed service for running Apache Flink applications on AWS.
Apache Flink is a framework for processing data streams in real time; applications are written in Java, Scala, or SQL.
The service provisions compute resources, supports parallel computation, automatic scaling, and manages application backups via checkpoints and snapshots.
Flink can read data from Kinesis Data Streams and Amazon MSK (Apache Kafka), but not from Amazon Data Firehose.

Question 10

Q

MSK - Managed Streaming for Apache Kafka

Answer

A

Amazon MSK provides a fully managed Apache Kafka cluster on AWS, simplifying deployment and management.
MSK Serverless allows running Apache Kafka without provisioning or managing servers, with automatic scaling of compute and storage.
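Kafka and Kinesis Data Streams are both streaming data solutions, but they differ in scaling, message size limits, and partition management: for example, Kinesis shards can be split and merged, whereas partitions can only be added to a Kafka topic, and Kafka's default 1 MB message size limit is configurable.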
Data can be produced to MSK using Kafka Producers and consumed using various AWS services or custom consumers.

(10 cards)