Analytics Flashcards

Leverage AWS analytics tools to interpret data and support business intelligence and decision-making. (10 cards)

3
Q

A company provides a REST-based interface to an application that allows a partner company to send data in near-real time. The application then processes the data that is received and stores it for later analysis. The application runs on Amazon EC2 instances.

The partner company has received many 503 Service Unavailable errors when sending data to the application. When spikes in data volume occur, the compute capacity reaches its limit and is unable to process requests.

Which design should a Solutions Architect implement to improve scalability?

  1. Use Amazon API Gateway in front of the existing application. Create a usage plan with a quota limit for the partner company.
  2. Use Amazon Kinesis Data Streams to ingest the data. Process the data using AWS Lambda functions.
  3. Use Amazon SQS to ingest the data. Configure the EC2 instances to process messages from the SQS queue.
  4. Use Amazon SNS to ingest the data and trigger AWS Lambda functions to process the data in near-real time.
A

2. Use Amazon Kinesis Data Streams to ingest the data. Process the data using AWS Lambda functions.

Amazon Kinesis enables you to ingest, buffer, and process streaming data in real time. Kinesis can handle any amount of streaming data and process data from hundreds of thousands of sources with very low latency, making it an ideal solution for data ingestion.

To ensure the compute layer can scale to process increasing workloads, the EC2 instances should be replaced by AWS Lambda functions. Lambda can scale seamlessly by running multiple executions in parallel.

  • A usage plan with a quota limit restricts the amount of data the application receives, which would cause the partner company to receive more errors, not fewer.
  • Amazon Kinesis Data Streams should be used for near-real time or real-time use cases instead of Amazon SQS.
  • SNS is not a near-real time solution for data ingestion. SNS is used for sending notifications.
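
As a sketch of the ingestion side, a producer could write each partner record to the stream with a Kinesis PutRecord call. The stream name, payload fields, and partition key below are hypothetical; the helper only builds the request parameters that boto3 would send.

```python
import json

def build_put_record(stream_name, payload, partition_key):
    """Build the parameters for a Kinesis PutRecord call.

    With boto3 these would be sent as:
        boto3.client("kinesis").put_record(**params)
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(payload).encode("utf-8"),  # Kinesis expects bytes
        "PartitionKey": partition_key,  # determines which shard receives the record
    }

# Hypothetical partner payload and stream name.
params = build_put_record("partner-ingest", {"deviceId": "d1", "reading": 42}, "d1")
print(params["PartitionKey"])
```

In a real deployment the Lambda consumer is attached to the stream via an event source mapping, and Kinesis invokes it with batches of records.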

4
Q

A genetics research firm processes DNA sequencing data for multiple clients. The raw data is stored in relational databases provided by each client. The company must extract the data, apply unique transformation algorithms for each client, and store the processed results in Amazon S3.

Due to the sensitivity of the data, the company must encrypt it both during processing and at rest in Amazon S3. Each client must have their own encryption keys to meet compliance requirements. The company also wants to minimize operational overhead while implementing this solution.

Which solution will meet these requirements with the LEAST operational effort?

  1. Use AWS Glue to create a single ETL pipeline for all clients. Configure the pipeline to tag each client’s data and use server-side encryption with AWS KMS keys (SSE-KMS) to encrypt data based on client-specific keys before storing it in Amazon S3.
  2. Deploy an Amazon EMR cluster for each client with a client-specific Hadoop configuration. Use client-side encryption (CSE) to encrypt data with customer-managed root keys during transformations and upload the results to S3.
  3. Use AWS Glue to create individual ETL jobs for each client. Attach a security configuration that uses client-specific AWS KMS keys for server-side encryption (SSE-KMS) during processing and storage in S3.
  4. Deploy a centralized Amazon EMR cluster to process data for all clients. Encrypt the data in transit using TLS certificates for each client and store the data in Amazon S3 using server-side encryption with Amazon S3 managed keys (SSE-S3).
A

3. Use AWS Glue to create individual ETL jobs for each client. Attach a security configuration that uses client-specific AWS KMS keys for server-side encryption (SSE-KMS) during processing and storage in S3.

AWS Glue simplifies ETL workflows and supports attaching security configurations to enforce client-specific encryption with KMS keys. This approach minimizes operational effort by automating the process while meeting encryption requirements.

  • A single ETL pipeline complicates the tagging and encryption logic, increasing the risk of errors. Separate ETL jobs for each client provide a cleaner and more scalable solution.
  • Deploying separate EMR clusters for each client significantly increases operational overhead and costs. AWS Glue is a more efficient solution for this workload.
  • TLS encrypts data only in transit, not during processing. Additionally, SSE-S3 does not meet the compliance requirement of using client-specific encryption keys.
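
As a hedged sketch, each client's ETL job could reference a Glue security configuration built like the one below. The client name, key ARN, and configuration name are hypothetical; the dictionary mirrors the shape expected by boto3's `create_security_configuration` call.

```python
def glue_security_configuration(client_name, kms_key_arn):
    """Build per-client AWS Glue security configuration parameters (a sketch).

    Would be created once per client with:
        boto3.client("glue").create_security_configuration(**cfg)
    and then attached to that client's ETL job.
    """
    return {
        "Name": f"{client_name}-sse-kms",
        "EncryptionConfiguration": {
            "S3Encryption": [
                # SSE-KMS with the client's own key covers data written to S3
                {"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}
            ],
            # Job logs can be encrypted with the same client key
            "CloudWatchEncryption": {
                "CloudWatchEncryptionMode": "SSE-KMS",
                "KmsKeyArn": kms_key_arn,
            },
        },
    }

# Hypothetical client and key ARN.
cfg = glue_security_configuration("client-a", "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE")
```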

5
Q

A research organization wants to set up an Amazon EMR cluster for multiple departments to run their big data analytics jobs. The organization needs to ensure that each department’s workloads can access only the specific AWS services required for their analysis. Additionally, the organization wants to block access to Instance Metadata Service Version 2 (IMDSv2) on the EMR cluster’s underlying EC2 instances.

Which solution will meet these requirements?

  1. Configure VPC interface endpoints for each AWS service that the departments require. Route traffic from the big data workloads through these VPC endpoints.
  2. Use EMR runtime roles to enforce granular permissions for each department’s workloads. Configure the EMR cluster to use these roles when submitting jobs.
  3. Assign unique EC2 IAM instance profiles to each team’s workloads. Configure the instance profiles with the specific permissions needed for each department.
  4. Create an EMR security configuration that disables access to the Instance Metadata Service. Use this security configuration with application-specific IAM roles to submit the workloads.
A

2. Use EMR runtime roles to enforce granular permissions for each department’s workloads. Configure the EMR cluster to use these roles when submitting jobs.

EMR runtime roles allow fine-grained access control for individual workloads without exposing permissions at the instance level. Runtime roles are scoped specifically to applications, reducing the risk of unnecessary access.

  • While VPC interface endpoints can restrict network access, they do not enforce permissions at the application level or prevent access to IMDSv2.
  • Instance profiles provide permissions at the instance level rather than the workload level. This does not ensure workload isolation or restrict access to IMDSv2.
  • EMR security configurations do not directly provide granular access control for individual workloads. While security configurations can restrict metadata access, they cannot enforce permissions per workload.
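
A rough sketch of submitting a step with a runtime role is shown below. The cluster ID, role ARN, and script location are hypothetical; the dictionary follows the shape of the EMR AddJobFlowSteps API, which accepts an `ExecutionRoleArn` for runtime roles.

```python
def dept_spark_step(cluster_id, runtime_role_arn, script_uri):
    """Build AddJobFlowSteps parameters that attach a per-department runtime role.

    Submitted with:
        boto3.client("emr").add_job_flow_steps(**params)
    """
    return {
        "JobFlowId": cluster_id,
        "ExecutionRoleArn": runtime_role_arn,  # runtime role scoped to this workload
        "Steps": [{
            "Name": "dept-analytics",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_uri],
            },
        }],
    }

# Hypothetical cluster, department role, and job script.
params = dept_spark_step("j-EXAMPLE", "arn:aws:iam::111122223333:role/dept-a-runtime",
                         "s3://example-bucket/job.py")
```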

6
Q

A media company operates an on-premises analytics platform to collect streaming data from video playback devices. The platform provides near real-time insights into user engagement and content performance. The company wants to migrate the platform to AWS and use an AWS-native solution for data ingestion, storage, search, and visualization.

Which solution will meet these requirements?

  1. Use Amazon EC2 instances to ingest and process the data streams into Amazon S3 buckets for storage. Use AWS Glue to catalog the data and Amazon Athena to perform searches. Use Amazon QuickSight to create visualizations.
  2. Use Amazon Kinesis Data Streams to ingest the data streams and process the data with AWS Lambda. Store the data in Amazon OpenSearch Service for search and analysis. Use Amazon Managed Grafana to create visual dashboards.
  3. Use Amazon EMR to process the data streams and store the data in Amazon DynamoDB. Use DynamoDB queries for searching and Amazon CloudWatch to create graphical dashboards.
  4. Use Amazon MSK (Managed Streaming for Apache Kafka) to ingest the data streams. Store the data in Amazon Redshift for analysis. Use Redshift Spectrum for advanced querying and Amazon QuickSight to create visual dashboards.
A

2. Use Amazon Kinesis Data Streams to ingest the data streams and process the data with AWS Lambda. Store the data in Amazon OpenSearch Service for search and analysis. Use Amazon Managed Grafana to create visual dashboards.

Kinesis Data Streams is designed for real-time data ingestion. OpenSearch Service supports full-text search and analytics, while Managed Grafana provides dynamic dashboards for visualization.

  • EC2-based ingestion increases operational overhead and does not leverage AWS-native streaming capabilities like Kinesis. Athena is not ideal for near real-time analytics.
  • EMR is primarily used for large-scale data processing, not real-time streaming. Additionally, DynamoDB is not a search engine and is not well-suited for the type of search and analysis required in this use case.
  • While MSK is a suitable option for data ingestion, Redshift is designed for batch analytics, not real-time streaming analytics.
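
To illustrate the processing step, a Lambda consumer might transform Kinesis records into an OpenSearch `_bulk` request body like this. The index name and event fields are hypothetical; the decoding logic follows the standard Kinesis event shape.

```python
import base64
import json

def kinesis_event_to_bulk(event, index):
    """Convert a Kinesis-triggered Lambda event into an OpenSearch _bulk body."""
    lines = []
    for record in event["Records"]:
        # Kinesis delivers record data base64-encoded
        doc = json.loads(base64.b64decode(record["kinesis"]["data"]))
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document line
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

# Hypothetical event shaped like a real Kinesis trigger payload.
encoded = base64.b64encode(json.dumps({"videoId": "v1", "event": "play"}).encode()).decode()
sample_event = {"Records": [{"kinesis": {"data": encoded}}]}
body = kinesis_event_to_bulk(sample_event, "playback-events")
print(body)
```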

7
Q

A traffic law enforcement company is building a solution with thousands of edge devices that collectively generate 1 TB of status alerts each day. These devices send vehicle information and number plate data whenever a red-light violation is detected. Each entry is around 2 KB in size. A solutions architect needs to implement a solution to ingest and store the alerts for future analysis.

The company wants a highly available solution. However, the company needs to minimize costs and does not want to manage additional infrastructure. Additionally, the company wants to keep 14 days of data available for immediate analysis and archive any data older than 14 days.

What is the MOST operationally efficient solution that meets these requirements?

  1. Create an Amazon Kinesis Data Firehose delivery stream to ingest the alerts. Configure the Kinesis Data Firehose stream to deliver the alerts to an Amazon S3 bucket. Set up an S3 Lifecycle configuration to transition data to Amazon S3 Glacier after 14 days.
  2. Launch Amazon EC2 instances across two Availability Zones and place them behind an Elastic Load Balancer to ingest the alerts. Create a script on the EC2 instances that will store the alerts in an Amazon S3 bucket. Set up an S3 Lifecycle configuration to transition data to Amazon S3 Glacier after 14 days.
  3. Create an Amazon Kinesis Data Firehose delivery stream to ingest the alerts. Configure the Kinesis Data Firehose stream to deliver the alerts to an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster. Set up the Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster to take manual snapshots every day and delete data from the cluster that is older than 14 days.
  4. Create an Amazon Simple Queue Service (Amazon SQS) standard queue to ingest the alerts and set the message retention period to 14 days. Configure consumers to poll the SQS queue, check the age of the message, and analyze the message data as needed. If the message is 14 days old, the consumer should copy the message to an Amazon S3 bucket and delete the message from the SQS queue.
A

1. Create an Amazon Kinesis Data Firehose delivery stream to ingest the alerts. Configure the Kinesis Data Firehose stream to deliver the alerts to an Amazon S3 bucket. Set up an S3 Lifecycle configuration to transition data to Amazon S3 Glacier after 14 days.

Data ingestion is a good use case for Kinesis Data Firehose since it is fully managed, scales automatically, and can handle the volumes required. An S3 Lifecycle configuration is also an appropriate fit for the retention requirement, archiving data to Glacier after 14 days.

  • Launching EC2 instances behind a load balancer means provisioning and managing infrastructure, which the company wants to avoid.
  • Delivering to an Amazon OpenSearch Service cluster would mean managing the cluster and daily manual snapshots, and because the requirement is to archive data rather than delete it, S3 with a Lifecycle configuration is a better fit.
  • With an SQS queue you must run processing components that add and retrieve messages, which means additional infrastructure to manage. With Kinesis Data Firehose the data is loaded straight into the destination without any additional infrastructure.
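
The archival half of the answer can be sketched as an S3 Lifecycle configuration. The bucket name in the comment and the rule ID are hypothetical; the rule transitions every object to Glacier after 14 days.

```python
def glacier_lifecycle(days):
    """Build an S3 Lifecycle configuration transitioning objects to Glacier.

    Applied with:
        boto3.client("s3").put_bucket_lifecycle_configuration(
            Bucket="alerts-bucket", LifecycleConfiguration=cfg)
    """
    return {
        "Rules": [{
            "ID": f"archive-after-{days}-days",
            "Status": "Enabled",
            "Filter": {},  # an empty filter applies the rule to every object
            "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
        }]
    }

cfg = glacier_lifecycle(14)
```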

8
Q

A global logistics company collects shipment tracking information, which updates every few seconds. The company wants to perform real-time analysis on these updates to monitor shipment progress and predict delays, and then ingest the data into its Amazon S3-based data lake.

Which solution will fulfill these requirements with the MOST operational efficiency?

  1. Use Amazon Kinesis Data Streams for data ingestion and AWS Lambda for real-time data analysis.
  2. Use Amazon Kinesis Data Firehose for data ingestion and Amazon Managed Service for Apache Flink for real-time analysis.
  3. Use AWS Direct Connect for data ingestion and Amazon Athena for real-time analysis.
  4. Use Amazon SQS for data ingestion and Amazon EMR for real-time analysis.
A

2. Use Amazon Kinesis Data Firehose for data ingestion and Amazon Managed Service for Apache Flink for real-time analysis.

Amazon Kinesis Data Firehose is ideal for ingesting high-velocity data into AWS, like the shipment tracking data in this scenario. It can capture, transform, and load streaming data into data lakes on S3. Amazon Managed Service for Apache Flink can then analyze this data in real-time, making this the most operationally efficient solution.

  • Kinesis Data Streams can handle real-time data ingestion and Lambda can perform real-time processing, but this approach requires managing the stream consumers (like AWS Lambda) and ensuring they are scaled properly. This may not be the most operationally efficient solution.
  • AWS Direct Connect is a networking service primarily for establishing dedicated network connections from on-premises to AWS, not typically used for high-velocity data ingestion. Amazon Athena is more suitable for ad-hoc querying on S3 data, not real-time analysis.
  • Amazon SQS is capable of handling high-throughput workloads, but it’s more suited to decoupling and scaling microservices, distributed systems, and serverless applications. Amazon EMR is a managed cluster platform that simplifies running big data frameworks, but it is not suitable for real-time data analysis.
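
As a sketch of the ingestion side, a producer could batch tracking events into a Firehose PutRecordBatch request like this. The delivery stream name and event fields are hypothetical.

```python
import json

def firehose_batch(stream_name, events):
    """Build PutRecordBatch parameters for Kinesis Data Firehose.

    Sent with:
        boto3.client("firehose").put_record_batch(**params)
    """
    return {
        "DeliveryStreamName": stream_name,
        # newline-delimited JSON so the S3 objects Firehose writes are line-oriented
        "Records": [{"Data": (json.dumps(e) + "\n").encode()} for e in events],
    }

# Hypothetical shipment-tracking events.
params = firehose_batch("shipment-tracking",
                        [{"shipmentId": "s1", "lat": 51.5, "lon": -0.1}])
```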

Reference:
Amazon Managed Service for Apache Flink

9
Q

A company needs to analyze and troubleshoot IAM permissions-related Access Denied errors and Unauthorized errors. AWS CloudTrail is enabled in the company’s account.

Which solution will meet these requirements with the LEAST effort?

  1. Create a custom script and execute it against CloudTrail logs to find errors using AWS Batch.
  2. Search CloudTrail logs with Amazon Redshift. Create a dashboard to identify the errors.
  3. Write custom scripts to query CloudTrail logs using AWS Glue.
  4. Search CloudTrail logs with Amazon QuickSight. Create a dashboard to identify the errors.
A

4. Search CloudTrail logs with Amazon QuickSight. Create a dashboard to identify the errors.

CloudTrail logs are stored natively in an S3 bucket, which can then be easily integrated with Amazon QuickSight. Amazon QuickSight is a data visualization tool that can surface IAM permissions-related Access Denied errors and Unauthorized errors.

  • Writing custom scripts is inevitably more effort than using the native connection between AWS CloudTrail and Amazon QuickSight.
  • Amazon Redshift would not be a simple way of achieving this outcome.
  • AWS Batch requires configuring compute environments to run jobs, and combined with writing custom scripts this significantly increases the effort involved.

Reference:
Logging QuickSight information with AWS CloudTrail

10
Q

A company is in the process of improving its security posture and wants to analyze and rectify a high volume of failed login attempts and unauthorized activities being logged in AWS CloudTrail.

What is the most efficient solution to help the company identify these security events with the LEAST amount of operational effort?

  1. Leverage AWS Lambda to trigger on CloudTrail log updates and use a custom script to scan for failed logins and unauthorized actions.
  2. Utilize AWS Data Pipeline to regularly extract CloudTrail logs and use a custom script to identify the required security events.
  3. Use Amazon Athena to directly query CloudTrail logs for failed logins and unauthorized activities.
  4. Implement Amazon Elasticsearch Service with Kibana to visualize the CloudTrail logs and manually search for these events.
A

3. Use Amazon Athena to directly query CloudTrail logs for failed logins and unauthorized activities.

Amazon Athena can directly query data from S3 (where CloudTrail logs are stored) using standard SQL, making it a powerful and efficient tool for analyzing these logs. You don’t need to manage any infrastructure or write custom scripts, and you can quickly write and run queries to identify the required security events.

  • While Lambda functions can be triggered based on CloudTrail log updates and could theoretically be used to scan for security events, this would require substantial setup and ongoing maintenance of the script. It’s not the most efficient choice.
  • This solution could work, but the operational overhead of managing the extraction process and maintaining a custom script for analysis is not minimal.
  • While Elasticsearch and Kibana provide powerful search and visualization capabilities, respectively, they require a fair amount of setup and management. This option would provide more in-depth analysis and real-time monitoring, but it wouldn’t be the most efficient way to simply identify the security events mentioned.
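
As an illustrative sketch, an Athena query over a CloudTrail table (assumed here to be named `cloudtrail_logs`) might filter on the `errorcode` field. The exact error code strings vary by service, so the list below is indicative rather than exhaustive.

```python
def cloudtrail_error_sql(table):
    """Build SQL that surfaces Access Denied / Unauthorized events
    from a CloudTrail table registered in Athena.

    Run with:
        boto3.client("athena").start_query_execution(
            QueryString=sql, ResultConfiguration={"OutputLocation": "s3://..."})
    """
    return (
        "SELECT eventtime, useridentity.arn, eventsource, eventname, errorcode "
        f"FROM {table} "
        "WHERE errorcode IN ('AccessDenied', 'UnauthorizedOperation') "
        "ORDER BY eventtime DESC LIMIT 100"
    )

sql = cloudtrail_error_sql("cloudtrail_logs")
print(sql)
```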

Reference:
Query AWS CloudTrail logs

11
Q

A reporting team receives files each day in an Amazon S3 bucket. The reporting team manually reviews and copies the files from this initial S3 bucket to an analysis S3 bucket each day at the same time to use with Amazon QuickSight. Additional teams are starting to send more files in larger sizes to the initial S3 bucket.

The reporting team wants to move the files automatically to the analysis S3 bucket as the files enter the initial S3 bucket. The reporting team also wants to use AWS Lambda functions to run pattern-matching code on the copied data. In addition, the reporting team wants to send the data files to a pipeline in Amazon SageMaker Pipelines.

What should a solutions architect do to meet these requirements with the LEAST operational overhead?

  1. Create a Lambda function to copy the files to the analysis S3 bucket. Create an S3 event notification for the analysis S3 bucket. Configure Lambda and SageMaker Pipelines as destinations of the event notification. Configure s3:ObjectCreated:Put as the event type.
  2. Create a Lambda function to copy the files to the analysis S3 bucket. Configure the analysis S3 bucket to send event notifications to Amazon EventBridge. Configure an ObjectCreated rule in EventBridge. Configure Lambda and SageMaker Pipelines as targets for the rule.
  3. Configure S3 replication between the S3 buckets. Create an S3 event notification for the analysis S3 bucket. Configure Lambda and SageMaker Pipelines as destinations of the event notification. Configure s3:ObjectCreated:Put as the event type.
  4. Configure S3 replication between the S3 buckets. Configure the analysis S3 bucket to send event notifications to Amazon EventBridge. Configure an ObjectCreated rule in EventBridge. Configure Lambda and SageMaker Pipelines as targets for the rule.
A

4. Configure S3 replication between the S3 buckets. Configure the analysis S3 bucket to send event notifications to Amazon EventBridge. Configure an ObjectCreated rule in EventBridge. Configure Lambda and SageMaker Pipelines as targets for the rule.

With Amazon S3 you can configure Same-Region Replication (SRR) to automatically copy files from one bucket to another as they are added to the source bucket. S3 event notifications can also be configured to trigger event-driven responses when changes happen in an Amazon S3 bucket.

Amazon SageMaker Pipelines, the first purpose-built continuous integration and continuous deployment (CI/CD) service for machine learning (ML), is supported as a target for routing events in Amazon EventBridge. This enables customers to trigger the execution of a SageMaker model building pipeline based on any event in their event bus, or on a schedule, by selecting the pipeline as the target in Amazon EventBridge.

For example, customers can set up EventBridge to trigger the execution of the SageMaker model building pipeline when a new file with the training data set is uploaded to an Amazon S3 bucket or when the SageMaker Model Monitor indicates a deviation in model quality through alarms in Amazon CloudWatch metrics. Customers can also create rules in Amazon EventBridge that trigger the pipeline execution on an automated schedule.

  • The option that uses a Lambda function to copy the files and EventBridge for notifications is close, but writing and maintaining a copy function requires more effort; S3 replication is an out-of-the-box feature and more efficient.
  • The option that uses a Lambda function with S3 event notifications likewise involves manual setup of Lambda that S3 replication avoids.
  • The option that combines S3 replication with S3 event notifications could work, but an S3 event notification cannot be directly integrated with a SageMaker pipeline; EventBridge is required for that integration.
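
The EventBridge half of the answer can be sketched as an event pattern plus rule targets. The bucket and rule names are hypothetical; note that the bucket must have EventBridge notifications enabled for Object Created events to reach the bus.

```python
import json

def s3_object_created_pattern(bucket):
    """Build an EventBridge event pattern matching Object Created events
    for one S3 bucket.

    Used with:
        events = boto3.client("events")
        events.put_rule(Name="analysis-objects", EventPattern=pattern)
        events.put_targets(...)  # the Lambda function and SageMaker pipeline
    """
    return json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket]}},
    })

pattern = s3_object_created_pattern("analysis-bucket")
print(pattern)
```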

Reference:
Using dynamic Amazon S3 event handling with Amazon EventBridge

12
Q

A Solutions Architect is designing the messaging and streaming layers of a serverless application. The messaging layer will manage communications between components and the streaming layer will manage real-time analysis and processing of streaming data.
The Architect needs to select the most appropriate AWS services for these functions.

Which services should be used for the messaging and streaming layers?

(Select TWO.)

  1. Use Amazon Kinesis for collecting, processing and analyzing real-time streaming data
  2. Use Amazon SWF for providing a fully managed messaging service
  3. Use Amazon SNS for providing a fully managed messaging service
  4. Use Amazon EMR for collecting, processing and analyzing real-time streaming data
  5. Use AWS CloudTrail for collecting, processing and analyzing real-time streaming data
A

1. Use Amazon Kinesis for collecting, processing and analyzing real-time streaming data

3. Use Amazon SNS for providing a fully managed messaging service

Amazon Kinesis makes it easy to collect, process, and analyze real-time streaming data. With Amazon Kinesis Data Analytics, you can run standard SQL or build entire streaming applications using SQL.

Amazon Simple Notification Service (Amazon SNS) provides a fully managed messaging service for pub/sub patterns using asynchronous event notifications and mobile push notifications for microservices, distributed systems, and serverless applications.

  • Amazon Simple Workflow Service (SWF) is used for coordinating and executing tasks in workflows, not for sending messages.
  • Amazon EMR (Elastic MapReduce) runs on EC2 instances, so it is not serverless.
  • AWS CloudTrail is used for recording API activity in your account.
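
As a small illustration of the messaging layer, a publisher could send an SNS message with attributes that subscribers can filter on. The topic ARN, message fields, and attribute name are hypothetical.

```python
import json

def sns_publish_params(topic_arn, message):
    """Build Publish parameters for an SNS topic (the pub/sub messaging layer).

    Sent with:
        boto3.client("sns").publish(**params)
    """
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(message),
        "MessageAttributes": {
            # message attributes let subscriptions filter server-side
            "eventType": {
                "DataType": "String",
                "StringValue": message.get("type", "unknown"),
            },
        },
    }

params = sns_publish_params("arn:aws:sns:us-east-1:111122223333:app-events",
                            {"type": "order_created", "orderId": "o-1"})
```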

References:
* Amazon Kinesis
* Amazon Simple Notification Service

Save time with AWS Cheat Sheets
* Amazon Kinesis
* AWS Application Integration Services
