Data & Analytics Flashcards

(28 cards)

1
Q

What is Amazon Athena and what is it used for?

A

Serverless, interactive query service that analyzes data directly in Amazon S3.

Uses standard SQL via the Trino/Presto engine.

No infrastructure to manage; you pay per query (based on data scanned).

Works well for:

Ad-hoc queries

Log analysis

Data lake analytics

Supports partitioning, compression, and columnar formats (Parquet, ORC) to reduce query cost.
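Since Athena bills per byte scanned, the savings from compression and columnar formats can be put into numbers. A minimal cost sketch, assuming the commonly cited $5-per-TB-scanned rate (pricing varies by region; check current AWS pricing):

```python
# Rough Athena cost model: you pay per TB of data scanned.
# PRICE_PER_TB is an assumption (commonly cited US pricing); verify for your region.
PRICE_PER_TB = 5.00

def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD for a single query."""
    tb = bytes_scanned / (1024 ** 4)
    return tb * PRICE_PER_TB

# A full scan of 1 TB of raw CSV vs. ~25% of that after converting to
# partitioned Parquet (an illustrative ratio, not a guarantee).
full_scan = athena_query_cost(1024 ** 4)
parquet_scan = athena_query_cost(1024 ** 4 // 4)
print(full_scan, parquet_scan)
```

The same query costs a quarter as much once the engine scans a quarter of the bytes, which is why format and partitioning choices dominate Athena bills.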

2
Q

How can you improve performance and reduce cost in Amazon Athena?

A

Use columnar formats like Parquet or ORC to reduce data scanned.

Compress data (e.g., Snappy, Gzip) to speed up queries.

Partition your data (e.g., by date, region) so Athena scans only relevant subsets.

Use bucketing to speed up joins and filtering on high-cardinality columns.

Organize data with consistent file sizes (typically 128–1,024 MB).

Use the AWS Glue Data Catalog for schema management.

Avoid too many small files; compact them for efficiency.
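The columnar-format bullet above can be made concrete with a toy scan model: a row-oriented scan must read every column of every row, while a columnar scan reads only the columns the query touches. Column names and per-value byte sizes here are invented for illustration:

```python
# Toy illustration of why columnar formats (Parquet/ORC) cut scanned bytes.
ROWS = 1_000_000
COLUMN_BYTES = {"user_id": 8, "event": 16, "payload": 200, "ts": 8}  # assumed sizes

def bytes_scanned(columns_needed, row_oriented: bool) -> int:
    if row_oriented:
        cols = COLUMN_BYTES  # row formats must read whole rows
    else:
        cols = {c: COLUMN_BYTES[c] for c in columns_needed}
    return ROWS * sum(cols.values())

# Equivalent of: SELECT user_id, ts FROM events
row_scan = bytes_scanned(["user_id", "ts"], row_oriented=True)
col_scan = bytes_scanned(["user_id", "ts"], row_oriented=False)
print(row_scan, col_scan)
```

Skipping the wide `payload` column is where most of the saving comes from; partitioning adds a second, orthogonal reduction by skipping whole files.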

3
Q

What are Athena Federated Queries and what do they enable?

A

Allow Athena to query data outside S3 using SQL.

Can read from RDS, Aurora, Redshift, DynamoDB, and many third-party sources.

Uses Athena Data Source Connectors (Lambda-based).

Enables joining S3 data with external databases in a single query.

Still serverless; you pay per query for data scanned (the connector Lambda invocations are billed separately).

4
Q

What is Amazon Redshift and what is it designed for?

A

A fully managed, petabyte-scale data warehouse service.

Optimized for complex analytical queries using SQL.

Uses columnar storage, data compression, and massively parallel processing (MPP) for high performance.

Supports ingestion from S3, Kinesis, DynamoDB, RDS, and more.

Integrates with Redshift Spectrum to query the S3 data lake directly.

Ideal for BI dashboards, analytics workloads, and large-scale reporting.

5
Q

What is a Redshift Cluster and how is it structured?

A

A Redshift cluster consists of one leader node and one or more compute nodes.

Leader node:

Manages query planning and coordination.

Aggregates results and returns them to the client.

Compute nodes:

Execute queries in parallel.

Store data using columnar storage.

Nodes are organized into slices, allowing massively parallel processing (MPP).

Cluster size and node type determine performance and storage capacity.
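The slice-based parallelism described above can be sketched as a simple round-robin spread of rows across slices (the idea behind an EVEN-style distribution); node and slice counts are illustrative:

```python
# Sketch of MPP data distribution: rows are spread across slices so that a
# scan or aggregation can run on all slices in parallel.
NODES, SLICES_PER_NODE = 2, 2  # assumed cluster shape
slices = [[] for _ in range(NODES * SLICES_PER_NODE)]

for row_id in range(8):
    slices[row_id % len(slices)].append(row_id)  # round-robin assignment

print(slices)  # each of the 4 slices holds an equal share of the rows
```

Real Redshift also offers KEY and ALL distribution styles, which trade even spread for co-locating join keys or small dimension tables.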

6
Q

How does Amazon Redshift handle snapshots and disaster recovery (DR)?

A

Supports automated snapshots and manual snapshots.

Snapshots are stored in S3 and are incremental (only changes are saved).

Automated snapshots are retained based on the retention period you configure.

You can restore a cluster from any snapshot to create a new cluster.

Redshift supports cross-Region snapshot copy for disaster recovery.

Enables quick recovery from data corruption, accidental deletion, or cluster failure.

7
Q

What is the best practice for loading data into Redshift and why?

A

Use large, bulk inserts instead of many small inserts.

Redshift is optimized for batch loading using COPY from S3, not row-by-row writes.

Large inserts:

Improve throughput

Reduce transaction overhead

Allow Redshift to better compress, sort, and distribute data

Small inserts lead to:

Slower performance

More commits

Less efficient storage and query performance

Rule of thumb:
Load data in large batches → Redshift performs dramatically better.
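The "more commits" cost above can be shown with a crude timing model: each transaction carries a fixed commit overhead, so a million single-row INSERTs pay it a million times while one COPY pays it once. Both constants are invented for illustration:

```python
# Illustrative model of why one bulk COPY beats row-by-row INSERTs.
COMMIT_OVERHEAD_MS = 10.0   # assumed fixed cost per transaction commit
PER_ROW_MS = 0.001          # assumed marginal cost per row loaded

def load_time_ms(rows: int, batches: int) -> float:
    return batches * COMMIT_OVERHEAD_MS + rows * PER_ROW_MS

rows = 1_000_000
print(load_time_ms(rows, batches=rows))  # one INSERT (commit) per row
print(load_time_ms(rows, batches=1))     # a single bulk COPY
```

The per-row path is dominated entirely by commit overhead; the bulk path amortizes it away, which is the intuition behind the COPY-from-S3 best practice.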

8
Q

What is Redshift Spectrum and what does it enable?

A

Allows Redshift to query data directly in Amazon S3 without loading it into the cluster.

Ideal for extending analytics from your warehouse into your data lake.

Supports open file formats like Parquet, ORC, JSON, and CSV.

Provides massively parallel processing by using Redshift’s compute nodes plus Spectrum workers.

Reduces storage costs by keeping infrequently accessed data in S3.

Use cases:

Query historical or cold data

Combine warehouse and data lake queries

Analyze huge datasets without resizing the cluster

9
Q

What is Amazon OpenSearch Service and what is it used for?

A

Managed service for running OpenSearch and Elasticsearch clusters.

Used for search, log analytics, real-time monitoring, and observability.

Provides distributed indexing and search across large datasets.

Integrates with Kinesis, CloudWatch, S3, and many AWS ingestion pipelines.

Offers built-in dashboards, visualizations, and full-text search capabilities.

Handles scaling, backups, patching, and cluster maintenance automatically.

10
Q

How is DynamoDB commonly integrated with OpenSearch, and why?

A

Use DynamoDB Streams to capture item changes (INSERT, MODIFY, REMOVE).

A Lambda function processes each stream record and indexes the data into OpenSearch.

Enables full-text search, advanced filtering, and analytics not supported natively by DynamoDB.

Pattern provides a real-time, eventually consistent search layer on top of a NoSQL database.

Ideal for:

Product catalogs

Search-driven applications

Log/event indexing

Enriching DynamoDB queries with search features
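A minimal sketch of the Lambda side of this pattern: turning a DynamoDB Streams record into an index-or-delete action. The record shape follows the DynamoDB Streams event format; the key name `pk` is an assumption, and the actual OpenSearch bulk HTTP call is omitted:

```python
# Transform a DynamoDB Streams record into an OpenSearch-style action.
def stream_record_to_action(record: dict) -> dict:
    keys = record["dynamodb"]["Keys"]
    doc_id = keys["pk"]["S"]  # assumes a string partition key named "pk"
    if record["eventName"] == "REMOVE":
        return {"op": "delete", "_id": doc_id}
    image = record["dynamodb"]["NewImage"]
    # Flatten the DynamoDB attribute-value map ({"S": "..."} etc.) into a plain doc.
    doc = {k: next(iter(v.values())) for k, v in image.items()}
    return {"op": "index", "_id": doc_id, "doc": doc}

sample = {
    "eventName": "INSERT",
    "dynamodb": {
        "Keys": {"pk": {"S": "item-1"}},
        "NewImage": {"pk": {"S": "item-1"}, "name": {"S": "Widget"}},
    },
}
print(stream_record_to_action(sample))
```

In a real handler you would loop over `event["Records"]` and send the resulting actions to OpenSearch's bulk API, which is what makes the search layer eventually consistent with the table.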

11
Q

What is Amazon EMR and what is it used for?

A

Managed big-data platform for running Apache Spark, Hadoop, Hive, HBase, Flink, Presto, and more.

Used for large-scale data processing, ETL, machine learning, and analytics.

Can process data stored in S3, HDFS, Glue Data Catalog, and other AWS sources.

Offers auto-scaling, spot instance integration, and flexible cluster sizing.

Supports long-running clusters, EMR on EKS (containers), and EMR Serverless.

Designed for high performance at significantly lower cost than on-prem Hadoop clusters.

12
Q

What node types and purchasing options does Amazon EMR support?

A

EMR Node Types

Master Node – Coordinates the cluster, manages job scheduling.

Core Nodes – Run tasks and store data (HDFS).

Task Nodes – Run tasks only (optional, no HDFS storage).

Purchasing Options

On-Demand Instances – Flexible, no commitment.

Spot Instances – Up to 90% cheaper; best for task nodes tolerant of interruption.

Reserved Instances / Savings Plans – Lower cost for steady workloads.

13
Q

What is Amazon QuickSight and what is it used for?

A

Serverless, cloud-native business intelligence (BI) service.

Creates interactive dashboards, visualizations, and reports.

Scales automatically to thousands of users.

Uses SPICE (in-memory engine) for fast performance and parallel queries.

Integrates with S3, Athena, Redshift, RDS, Salesforce, and many other data sources.

Supports ML-powered insights like anomaly detection and forecasting.

14
Q

What data sources can Amazon QuickSight integrate with?

A

AWS sources: S3, Athena, Redshift, RDS, Aurora, EMR, OpenSearch

External databases: MySQL, PostgreSQL, SQL Server, Snowflake, and more

SaaS apps: Salesforce, ServiceNow, Jira, Adobe Analytics

Supports both direct queries and SPICE in-memory acceleration

Enables unified dashboards across multiple AWS and third-party systems

15
Q

In Amazon QuickSight, what’s the difference between an Analysis and a Dashboard?

A

Analysis

Interactive workspace where you build, explore, and edit visuals.

Used by authors to prepare data, create charts, and design layouts.

Dashboard

Published, read-only version of an Analysis.

Shared with viewers for consumption and interaction (filters, drill-downs) but no editing.

Key idea:
Analyses are for building; dashboards are for sharing.

16
Q

What is AWS Glue and what does it do?

A

Serverless data integration and ETL service.

Automatically discovers, catalogs, and prepares data using the Glue Data Catalog.

Can run ETL jobs in Python or Scala to clean and transform data at scale.

Includes crawlers to infer schemas from S3, JDBC sources, and more.

Integrates with Athena, Redshift, EMR, and data lakes.

Supports visual tools like Glue Studio and DataBrew.

17
Q

How can AWS Glue help convert data into Parquet format and why is this useful?

A

Glue can run ETL jobs that read raw data (CSV, JSON, logs, etc.) and convert it to Parquet.

Parquet is a columnar, compressed format that:

Reduces storage cost

Speeds up analytics by reducing data scanned

Works efficiently with Athena, Redshift Spectrum, EMR, and Spark

Glue crawlers can update the schema in the Glue Data Catalog after conversion.

Ideal for building optimized data lake storage in S3.

18
Q

What is the AWS Glue Data Catalog and what is it used for?

A

A central metadata repository for storing table schemas and dataset definitions.

Tracks databases, tables, partitions, and data locations in S3 or other sources.

Used by Athena, Redshift Spectrum, EMR, and Glue ETL jobs for consistent schema management.

Glue crawlers can automatically discover and update metadata.

Acts as the data lake catalog, enabling query engines to understand your data.

19
Q

What key things should you know about AWS Glue at a high level?

A

Glue is a serverless ETL and data integration service.

The Glue Data Catalog stores metadata for Athena, Redshift Spectrum, and EMR.

Crawlers automatically detect schema and create/update catalog tables.

Glue jobs (Python/Scala) run on Apache Spark for distributed processing.

Supports job scheduling, workflows, and dependency management.

Integrates tightly with S3-based data lakes and other AWS analytics services.

20
Q

What is AWS Lake Formation and what does it do?

A

A service that simplifies building a secure, well-governed data lake on AWS.

Helps ingest, catalog, clean, and organize data in S3.

Provides fine-grained access control down to database, table, and column levels.

Integrates with Athena, Redshift Spectrum, EMR, and Glue for unified permissions.

Automates common tasks:

Setting up storage locations

Managing metadata

Enforcing security and governance

Ensures consistent, centralized permissions management across analytics services.

21
Q

What is Amazon Managed Service for Apache Flink and what is it used for?

A

Fully managed service for running Apache Flink applications on AWS.

Processes streaming data in real time with low latency.

Integrates with Kinesis Data Streams, MSK (Kafka), Kinesis Data Firehose, and S3.

Automatically handles scaling, failover, monitoring, and checkpointing.

Ideal for:

Real-time analytics

Streaming ETL

Event-driven applications

Continuous data processing pipelines

22
Q

What is Amazon MSK and what is it used for?

A

Fully managed service for running Apache Kafka clusters on AWS.

Handles provisioning, patching, scaling, monitoring, and recovery.

Provides high availability across multiple AZs.

Integrates with producers/consumers using native Kafka APIs (no code changes).

Used for:

Real-time data streaming

Event pipelines

Log ingestion

Stream processing with Flink, Spark, Lambda

23
Q

What is Apache Kafka at a high level?

A

A distributed streaming platform for ingesting and processing real-time data.

Organizes data into topics, which are split into partitions for scalability.

Producers write messages, consumers read messages independently.

Provides high throughput, low latency, and fault tolerance.

Stores data durably, allowing consumers to read at their own pace.

Ideal for event streaming, log aggregation, real-time analytics, and data pipelines.
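The partition mechanics above can be sketched in a few lines: producers hash the message key to pick a partition, so all messages with the same key land on the same partition and stay ordered. Real Kafka clients use murmur2; `md5` here is just a stand-in for a stable hash:

```python
import hashlib

# Sketch of key-based partitioning: same key -> same partition -> per-key ordering.
def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("user-42", 6)
p2 = partition_for("user-42", 6)
assert p1 == p2  # every event for user-42 goes to the same partition
print(p1)
```

This is why keys matter in Kafka: ordering is guaranteed only within a partition, and the key is what pins related events to one.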

24
Q

How do Kinesis Data Streams and Amazon MSK differ?

A

Kinesis Data Streams

Fully managed, AWS-native streaming service

No servers or clusters to manage

Scales automatically with shards

Producers/consumers use Kinesis APIs

Best for simple, fully-managed streaming workloads

Amazon MSK

Fully managed Apache Kafka

Uses native Kafka APIs (no code changes for Kafka apps)

You manage some cluster configuration choices

Best for teams already using Kafka or needing Kafka’s ecosystem

Key difference:
Kinesis is AWS-native and simpler; MSK is managed Kafka with full Kafka compatibility.

25
Q

How do consumers work in Amazon MSK?

A

Consumers use native Kafka consumer APIs; no code changes required.

Multiple consumers can form a consumer group to share partitions.

Each partition is consumed by only one consumer within a group.

MSK stores data durably, so consumers can re-read data at their own pace.

Supports real-time processing with Flink, Spark, Lambda, and other stream processors.
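The consumer-group rule above (each partition owned by exactly one consumer in a group) can be sketched with a simple round-robin assignment; Kafka supports several real assignment strategies (range, round-robin, sticky), and this is only the idea:

```python
# Sketch of consumer-group balancing: divide partitions among group members
# so each partition has exactly one owner.
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    result = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        result[consumers[i % len(consumers)]].append(p)
    return result

print(assign(list(range(6)), ["c1", "c2", "c3"]))
```

Adding a consumer to the group triggers a rebalance that redistributes partitions, which is how a consumer group scales read throughput.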
26
Q

What are the main components of a Big Data ingestion pipeline on AWS?

A

A typical ingestion pipeline includes:

Producers – apps, services, and IoT devices generating data

Streaming services – Kinesis, MSK (Kafka) for real-time ingestion

Batch ingestion – S3 uploads, Glue crawlers, scheduled ETL

Processing layer – Lambda, Flink, EMR, Glue ETL

Storage layer – S3 data lake, DynamoDB, Redshift

Analytics layer – Athena, Redshift, EMR, OpenSearch

Visualization – QuickSight dashboards

Purpose: reliably ingest, process, store, and analyze large-scale data.
27
Q

What key considerations come up when designing a Big Data ingestion pipeline?

A

Data Volume & Velocity – Choose streaming (Kinesis/MSK) vs. batch (S3 uploads) based on throughput.

Schema Management – Use the Glue Data Catalog to keep schemas consistent across tools.

Data Quality & ETL – Apply cleaning, validation, and transformation with Glue, Lambda, or EMR.

Storage Strategy – Use S3 as the data lake; choose Parquet/ORC for efficient analytics.

Processing Needs – Real-time (Flink, Lambda) vs. batch (EMR, Glue).

Access Patterns – Query with Athena, Redshift Spectrum, or OpenSearch depending on use case.

Security & Governance – Apply Lake Formation, IAM, and encryption at rest/in transit.

Scalability & Cost – Use autoscaling services and optimize formats to reduce scan costs.
28