What is Amazon Athena and what is it used for?
Serverless, interactive query service that analyzes data directly in Amazon S3.
Uses standard SQL via an engine based on Trino (Presto in older engine versions).
No infrastructure to manage; you pay per query (based on data scanned).
Works well for:
Ad-hoc queries
Log analysis
Data lake analytics
Supports partitioning, compression, and columnar formats (Parquet, ORC) to reduce query cost.
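The pay-per-scan model can be sketched with a small cost estimator. The $5-per-TB rate below is an assumption based on the common list price (verify for your Region); the 10 MB per-query minimum is part of Athena's billing model.

```python
def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimate Athena query cost from bytes scanned.

    price_per_tb is an assumed list price; check your Region's pricing.
    Athena bills a minimum of 10 MB per query.
    """
    min_billed = 10 * 1024 ** 2          # 10 MB minimum billed per query
    billed = max(bytes_scanned, min_billed)
    return billed / 1024 ** 4 * price_per_tb

# Scanning 1 TB of raw CSV vs. ~100 GB after Parquet + compression:
full_scan = athena_query_cost(1024 ** 4)        # 5.0
columnar = athena_query_cost(100 * 1024 ** 3)   # ~0.49
```

This is why the columnar/compression tips below matter: cost tracks bytes scanned, not rows returned.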
How can you improve performance and reduce cost in Amazon Athena?
Use columnar formats like Parquet or ORC to reduce data scanned.
Compress data (e.g., Snappy, Gzip) to speed up queries.
Partition your data (e.g., by date, region) so Athena scans only relevant subsets.
Use bucketing to speed up joins and filtering on high-cardinality columns.
Organize data with consistent file sizes (typically 128–1,024 MB).
Use the AWS Glue Data Catalog for schema management.
Avoid too many small files; compact them for efficiency.
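Partition pruning relies on Hive-style key/value prefixes in the S3 path. A minimal sketch (the `logs` prefix and partition columns are hypothetical):

```python
from datetime import date

def partitioned_key(base: str, dt: date, region: str, filename: str) -> str:
    """Build a Hive-style partitioned S3 key, e.g.
    logs/dt=2024-06-01/region=eu-west-1/part-000.parquet.
    Athena can then prune partitions with WHERE dt = '...' AND region = '...',
    scanning only the matching prefixes."""
    return f"{base}/dt={dt.isoformat()}/region={region}/{filename}"

key = partitioned_key("logs", date(2024, 6, 1), "eu-west-1", "part-000.parquet")
# 'logs/dt=2024-06-01/region=eu-west-1/part-000.parquet'
```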
What are Athena Federated Queries and what do they enable?
Allow Athena to query data outside S3 using SQL.
Can read from RDS, Aurora, Redshift, DynamoDB, and many third-party sources.
Uses Athena Data Source Connectors (Lambda-based).
Enables joining S3 data with external databases in a single query.
Still serverless; you pay for the data Athena scans, plus the connector's Lambda invocation cost.
What is Amazon Redshift and what is it designed for?
A fully managed, petabyte-scale data warehouse service.
Optimized for complex analytical queries using SQL.
Uses columnar storage, data compression, and massively parallel processing (MPP) for high performance.
Supports ingestion from S3, Kinesis, DynamoDB, RDS, and more.
Integrates with Redshift Spectrum to query the S3 data lake directly.
Ideal for BI dashboards, analytics workloads, and large-scale reporting.
What is a Redshift Cluster and how is it structured?
A Redshift cluster consists of one leader node and one or more compute nodes.
Leader node:
Manages query planning and coordination.
Aggregates results and returns them to the client.
Compute nodes:
Execute queries in parallel.
Store data using columnar storage.
Nodes are organized into slices, allowing massively parallel processing (MPP).
Cluster size and node type determine performance and storage capacity.
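KEY distribution can be modeled as hashing a distribution key to a slice, so co-located rows can be joined without shuffling. A toy illustration (Redshift's real hash function is internal; `crc32` only stands in):

```python
import zlib

def slice_for_row(dist_key: str, num_slices: int) -> int:
    """Toy model of KEY distribution: hash the distribution key and
    map it to a slice. Identical keys always land on the same slice."""
    return zlib.crc32(dist_key.encode()) % num_slices

rows = ["cust-1", "cust-2", "cust-1", "cust-3"]
placement = [slice_for_row(r, 4) for r in rows]
# both "cust-1" rows map to the same slice
```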
How does Amazon Redshift handle snapshots and disaster recovery (DR)?
Supports automated snapshots and manual snapshots.
Snapshots are stored in S3 and are incremental (only changes are saved).
Automated snapshots are retained based on the retention period you configure.
You can restore a cluster from any snapshot to create a new cluster.
Redshift supports cross-Region snapshot copy for disaster recovery.
Enables quick recovery from data corruption, accidental deletion, or cluster failure.
What is the best practice for loading data into Redshift and why?
Use large, bulk inserts instead of many small inserts.
Redshift is optimized for batch loading using COPY from S3, not row-by-row writes.
Large inserts:
Improve throughput
Reduce transaction overhead
Allow Redshift to better compress, sort, and distribute data
Small inserts lead to:
Slower performance
More commits
Less efficient storage and query performance
Rule of thumb:
Load data in large batches → Redshift performs dramatically better.
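A sketch of the batch-load pattern: group rows into large chunks, stage each chunk as one S3 object, then issue a single COPY per prefix. Bucket, table, and role names below are hypothetical.

```python
def batches(rows, batch_size=10_000):
    """Group rows into large batches so each becomes one staged S3 object
    loaded via COPY, instead of many single-row INSERTs."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

# Hypothetical names; COPY loads the staged files in parallel across slices.
COPY_SQL = """
COPY analytics.events
FROM 's3://my-staging-bucket/events/batch-0001/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET;
"""

chunks = list(batches(list(range(25_000)), batch_size=10_000))
# 3 batches: 10,000 + 10,000 + 5,000 rows
```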
What is Redshift Spectrum and what does it enable?
Allows Redshift to query data directly in Amazon S3 without loading it into the cluster.
Ideal for extending analytics from your warehouse into your data lake.
Supports open file formats like Parquet, ORC, JSON, and CSV.
Provides massively parallel processing by using Redshift’s compute nodes plus Spectrum workers.
Reduces storage costs by keeping infrequently accessed data in S3.
Use cases:
Query historical or cold data
Combine warehouse and data lake queries
Analyze huge datasets without resizing the cluster
What is Amazon OpenSearch Service and what is it used for?
Managed service for running OpenSearch and Elasticsearch clusters.
Used for search, log analytics, real-time monitoring, and observability.
Provides distributed indexing and search across large datasets.
Integrates with Kinesis, CloudWatch, S3, and many AWS ingestion pipelines.
Offers built-in dashboards, visualizations, and full-text search capabilities.
Handles scaling, backups, patching, and cluster maintenance automatically.
How is DynamoDB commonly integrated with OpenSearch, and why?
Use DynamoDB Streams to capture item changes (INSERT, MODIFY, REMOVE).
A Lambda function processes each stream record and indexes the data into OpenSearch.
Enables full-text search, advanced filtering, and analytics not supported natively by DynamoDB.
Pattern provides a real-time, eventually consistent search layer on top of a NoSQL database.
Ideal for:
Product catalogs
Search-driven applications
Log/event indexing
Enriching DynamoDB queries with search features
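The Lambda step in this pattern boils down to translating stream records into OpenSearch `_bulk` actions. A minimal sketch, assuming a `pk` string key and an `items` index (both hypothetical); a real handler would also sign and POST the payload to the domain:

```python
import json

def stream_to_bulk_actions(records):
    """Translate DynamoDB Stream records into OpenSearch _bulk lines:
    INSERT/MODIFY become index actions, REMOVE becomes a delete."""
    lines = []
    for rec in records:
        doc_id = rec["dynamodb"]["Keys"]["pk"]["S"]
        if rec["eventName"] in ("INSERT", "MODIFY"):
            image = rec["dynamodb"]["NewImage"]
            # Flatten DynamoDB attribute-value maps ({"S": ...} / {"N": ...})
            doc = {k: v.get("S", v.get("N")) for k, v in image.items()}
            lines.append(json.dumps({"index": {"_index": "items", "_id": doc_id}}))
            lines.append(json.dumps(doc))
        else:  # REMOVE
            lines.append(json.dumps({"delete": {"_index": "items", "_id": doc_id}}))
    return "\n".join(lines) + "\n"

sample = [
    {"eventName": "INSERT",
     "dynamodb": {"Keys": {"pk": {"S": "a1"}},
                  "NewImage": {"pk": {"S": "a1"}, "name": {"S": "widget"}}}},
    {"eventName": "REMOVE",
     "dynamodb": {"Keys": {"pk": {"S": "a1"}}}},
]
payload = stream_to_bulk_actions(sample)
```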
What is Amazon EMR and what is it used for?
Managed big-data platform for running Apache Spark, Hadoop, Hive, HBase, Flink, Presto, and more.
Used for large-scale data processing, ETL, machine learning, and analytics.
Can process data stored in S3 and HDFS, using the Glue Data Catalog for table metadata.
Offers auto-scaling, spot instance integration, and flexible cluster sizing.
Supports long-running clusters as well as EMR on EKS and EMR Serverless deployment options.
Designed for high performance at significantly lower cost than on-prem Hadoop clusters.
What node types and purchasing options does Amazon EMR support?
EMR Node Types
Master (Primary) Node – Coordinates the cluster and manages job scheduling.
Core Nodes – Run tasks and store data (HDFS).
Task Nodes – Run tasks only (optional, no HDFS storage).
Purchasing Options
On-Demand Instances – Flexible, no commitment.
Spot Instances – Up to 90% cheaper; best for task nodes tolerant of interruption.
Reserved Instances / Savings Plans – Lower cost for steady workloads.
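The core-on-demand, task-on-spot split above can be put into rough numbers. The $0.20/hr rate and the 70% spot discount below are hypothetical (AWS cites savings of up to 90%; real spot prices fluctuate):

```python
def cluster_hourly_cost(on_demand_rate, core_nodes, task_nodes, spot_discount=0.7):
    """Rough EMR instance cost: core nodes on-demand, task nodes on spot.
    Rates and the discount are illustrative assumptions."""
    core = core_nodes * on_demand_rate
    task = task_nodes * on_demand_rate * (1 - spot_discount)
    return core + task

# 3 core + 10 task nodes at an assumed $0.20/hr on-demand rate:
cost = cluster_hourly_cost(0.20, 3, 10)
# vs. 13 nodes all on-demand at 13 * 0.20 = 2.60/hr
```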
What is Amazon QuickSight and what is it used for?
Serverless, cloud-native business intelligence (BI) service.
Creates interactive dashboards, visualizations, and reports.
Scales automatically to thousands of users.
Uses SPICE (in-memory engine) for fast performance and parallel queries.
Integrates with S3, Athena, Redshift, RDS, Salesforce, and many other data sources.
Supports ML-powered insights like anomaly detection and forecasting.
What data sources can Amazon QuickSight integrate with?
AWS sources: S3, Athena, Redshift, RDS, Aurora, EMR, OpenSearch
External databases: MySQL, PostgreSQL, SQL Server, Snowflake, and more
SaaS apps: Salesforce, ServiceNow, Jira, Adobe Analytics
Supports both direct queries and SPICE in-memory acceleration
Enables unified dashboards across multiple AWS and third-party systems
In Amazon QuickSight, what’s the difference between an Analysis and a Dashboard?
Analysis
Interactive workspace where you build, explore, and edit visuals.
Used by authors to prepare data, create charts, and design layouts.
Dashboard
Published, read-only version of an Analysis.
Shared with viewers for consumption and interaction (filters, drill-downs) but no editing.
Key idea:
Analyses are for building; dashboards are for sharing.
What is AWS Glue and what does it do?
Serverless data integration and ETL service.
Automatically discovers, catalogs, and prepares data using the Glue Data Catalog.
Can run ETL jobs in Python or Scala to clean and transform data at scale.
Includes crawlers to infer schemas from S3, JDBC sources, and more.
Integrates with Athena, Redshift, EMR, and data lakes.
Supports visual tools like Glue Studio and DataBrew.
How can AWS Glue help convert data into Parquet format and why is this useful?
Glue can run ETL jobs that read raw data (CSV, JSON, logs, etc.) and convert it to Parquet.
Parquet is a columnar, compressed format that:
Reduces storage cost
Speeds up analytics by reducing data scanned
Works efficiently with Athena, Redshift Spectrum, EMR, and Spark
Glue crawlers can update the schema in the Glue Data Catalog after conversion.
Ideal for building optimized data lake storage in S3.
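Why columnar helps can be shown with a toy pivot. Parquet does this (plus compression and encoding), so a query touching one column reads only that column's bytes:

```python
# Row-oriented records, as they might arrive in CSV or JSON:
rows = [
    {"user": "a", "country": "DE", "amount": 10},
    {"user": "b", "country": "FR", "amount": 25},
    {"user": "c", "country": "DE", "amount": 7},
]

def to_columnar(rows):
    """Pivot row-oriented records into a column store: one list per column."""
    return {key: [r[key] for r in rows] for key in rows[0]}

cols = to_columnar(rows)
# A query like SELECT sum(amount) needs only cols["amount"],
# never the user or country bytes:
total = sum(cols["amount"])  # 42
```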
What is the AWS Glue Data Catalog and what is it used for?
A central metadata repository for storing table schemas and dataset definitions.
Tracks databases, tables, partitions, and data locations in S3 or other sources.
Used by Athena, Redshift Spectrum, EMR, and Glue ETL jobs for consistent schema management.
Glue crawlers can automatically discover and update metadata.
Acts as the data lake catalog, enabling query engines to understand your data.
What key things should you know about AWS Glue at a high level?
Glue is a serverless ETL and data integration service.
The Glue Data Catalog stores metadata for Athena, Redshift Spectrum, and EMR.
Crawlers automatically detect schema and create/update catalog tables.
Glue jobs (Python/Scala) run on Apache Spark for distributed processing.
Supports job scheduling, workflows, and dependency management.
Integrates tightly with S3-based data lakes and other AWS analytics services.
What is AWS Lake Formation and what does it do?
A service that simplifies building a secure, well-governed data lake on AWS.
Helps ingest, catalog, clean, and organize data in S3.
Provides fine-grained access control down to database, table, and column levels.
Integrates with Athena, Redshift Spectrum, EMR, and Glue for unified permissions.
Automates common tasks:
Setting up storage locations
Managing metadata
Enforcing security and governance
Ensures consistent, centralized permissions management across analytics services.
What is Amazon Managed Service for Apache Flink and what is it used for?
Fully managed service for running Apache Flink applications on AWS.
Processes streaming data in real time with low latency.
Integrates with Kinesis Data Streams, MSK (Kafka), Kinesis Data Firehose, and S3.
Automatically handles scaling, failover, monitoring, and checkpointing.
Ideal for:
Real-time analytics
Streaming ETL
Event-driven applications
Continuous data processing pipelines
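The kind of computation Flink runs can be sketched as a tumbling-window count: each event falls into exactly one fixed-size window, and counts are kept per key. A toy pure-Python version (real Flink adds state, checkpoints, and event-time handling):

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds=60):
    """Toy tumbling-window count: bucket each (timestamp, key) event
    into a fixed window and count occurrences per (window, key)."""
    windows = defaultdict(int)
    for ts, key in events:
        window_start = ts - ts % window_seconds
        windows[(window_start, key)] += 1
    return dict(windows)

events = [(5, "click"), (30, "click"), (65, "click"), (70, "view")]
counts = tumbling_counts(events)
# {(0, 'click'): 2, (60, 'click'): 1, (60, 'view'): 1}
```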
What is Amazon MSK and what is it used for?
Fully managed service for running Apache Kafka clusters on AWS.
Handles provisioning, patching, scaling, monitoring, and recovery.
Provides high availability across multiple AZs.
Integrates with producers/consumers using native Kafka APIs (no code changes).
Used for:
Real-time data streaming
Event pipelines
Log ingestion
Stream processing with Flink, Spark, Lambda
What is Apache Kafka at a high level?
A distributed streaming platform for ingesting and processing real-time data.
Organizes data into topics, which are split into partitions for scalability.
Producers write messages, consumers read messages independently.
Provides high throughput, low latency, and fault tolerance.
Stores data durably, allowing consumers to read at their own pace.
Ideal for event streaming, log aggregation, real-time analytics, and data pipelines.
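Partitioning by key is what makes Kafka scale while preserving per-key order. An illustrative partitioner (Kafka's default producer actually uses murmur2; `crc32` stands in here):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Illustrative key-based partitioner: the same key always maps to
    the same partition, so events for one key stay in order."""
    return zlib.crc32(key) % num_partitions

p1 = partition_for(b"order-42", 6)
p2 = partition_for(b"order-42", 6)
# p1 == p2: every event for order-42 lands in one partition, in order
```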
How do Kinesis Data Streams and Amazon MSK differ?
Kinesis Data Streams
Fully managed, AWS-native streaming service
No servers or clusters to manage
Scales automatically with shards
Producers/consumers use Kinesis APIs
Best for simple, fully-managed streaming workloads
Amazon MSK
Fully managed Apache Kafka
Uses native Kafka APIs (no code changes for Kafka apps)
You manage some cluster configuration choices
Best for teams already using Kafka or needing Kafka’s ecosystem
Key difference:
Kinesis is AWS-native and simpler; MSK is managed Kafka with full Kafka compatibility.