clone Flashcards

(60 cards)

1
Q

Data cube

A

A multidimensional matrix representing high-dimensional space to show how data attributes are arranged.
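To make the definition concrete, here is a minimal sketch using NumPy (the dimension names — product, region, month — are illustrative assumptions, not part of the original card): a 3-D array serves as the cube, and collapsing axes aggregates dimensions.

```python
import numpy as np

# Hypothetical 3-D data cube: sales indexed by (product, region, month).
# Shape: 2 products x 3 regions x 4 months.
cube = np.arange(24).reshape(2, 3, 4)

# Aggregating away the region and month dimensions leaves one
# total per product -- the cube arranged along a single attribute.
per_product = cube.sum(axis=(1, 2))
print(per_product.tolist())  # [66, 210]
```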

2
Q

Columnar storage

A

Stores data by columns instead of rows to provide faster analytics and better compression.
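The difference can be sketched in plain Python (the table contents are made up): the same table stored row-wise versus column-wise, where an analytic aggregate only needs to touch one column in the columnar layout.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "city": "Cairo", "sales": 100},
    {"id": 2, "city": "Giza",  "sales": 250},
    {"id": 3, "city": "Luxor", "sales": 175},
]

# Column-oriented layout: each attribute is stored contiguously.
columns = {
    "id":    [1, 2, 3],
    "city":  ["Cairo", "Giza", "Luxor"],
    "sales": [100, 250, 175],
}

# An analytic query (total sales) scans only the "sales" column,
# instead of reading every field of every row.
total = sum(columns["sales"])
print(total)  # 525
```

Contiguous same-typed values are also what makes the better compression possible (e.g., run-length or dictionary encoding per column).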

3
Q

Data lake

A

A central storage for holding raw data (structured and unstructured) in its original format.

4
Q

Data warehouse

A

A central database that stores cleaned and structured data, optimized for analysis and reporting.

5
Q

Roll-up

A

An operation that aggregates data along a dimension, combining values into a coarser granularity (e.g., rolling daily sales up to monthly totals).

6
Q

Dicing

A

Performs a multidimensional cutting that cuts a range of more than one dimension, resulting in a subcube

7
Q

Slicing

A

Selects a single value along one dimension, filtering out the rest of the cube to focus analysis on a particular attribute.

8
Q

Drill down

A

The reverse of roll-up. It subdivides information to a finer granularity, zooming into more detail.

9
Q

Pivot

A

Rotates the view of the data cube without changing the data, allowing the user to change the analytical viewpoint.
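The OLAP operations above can be sketched on a NumPy cube (the axis meanings — year, quarter, region — are illustrative assumptions): a slice fixes one dimension, a dice takes ranges on several, and a roll-up aggregates one away.

```python
import numpy as np

# Hypothetical cube indexed by (year, quarter, region): 2 x 4 x 3.
cube = np.arange(24).reshape(2, 4, 3)

# Slice: fix one dimension to a single value (year 0) -> 2-D result.
slice_ = cube[0, :, :]

# Dice: cut ranges on more than one dimension -> a subcube.
dice = cube[:, 1:3, 0:2]

# Roll-up: aggregate away the quarter dimension -> coarser granularity.
rollup = cube.sum(axis=1)

print(slice_.shape, dice.shape, rollup.shape)
```

Drill-down is the inverse direction: going back from `rollup` to the full `cube` to see the per-quarter detail again.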

10
Q

Multidimensional data cube

A

Uses multidimensional arrays to store data; it is faster and more efficient than the relational approach.

11
Q

Relational data cube

A

Uses relational tables to store data; it is slower compared to multidimensional cubes.

12
Q

Columnar data in AWS: Amazon Redshift

A

Stores data in columnar format to speed up analytics.

13
Q

Columnar data in AWS: Amazon Athena

A

Queries data stored in columnar formats like Parquet and ORC directly on S3.

14
Q

Columnar data in AWS: AWS Glue

A

Supports ETL (Extract, Transform, Load) jobs that read and write columnar formats.

15
Q

Columnar data in AWS: Amazon S3 Select

A

Reads specific columns from Parquet or ORC files.

16
Q

Graph processing

A

The computational process of analyzing data structured as a graph (vertices and edges) to extract insights, such as finding the shortest path or influential users.
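One such computation, shortest path, can be sketched as a breadth-first search over an adjacency list in pure Python (the graph of users and "follows" edges is made up for illustration):

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Return the shortest path from start to goal via BFS, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Vertices are users; edges are "follows" relationships.
graph = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan"],
    "dan": [],
}
print(shortest_path(graph, "ann", "dan"))  # ['ann', 'bob', 'dan']
```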

17
Q

Graph databases

A

A type of NoSQL database that uses graph theory to store, map, and query relationships.

18
Q

AWS SageMaker

A

A fully managed service that simplifies building, training, and deploying machine learning models.

19
Q

Google Cloud AutoML

A

Used for training and developing custom machine learning models with minimal ML expertise.

20
Q

Google Cloud Speech-to-Text

A

A speech recognition service for transcribing speech to text, supporting 120 languages.

21
Q

Google Cloud Vision AI

A

Used to create machine learning models for computer vision that detect text, objects, and other content in images.

22
Q

Microsoft Azure Machine Learning

A

Used to create and deploy machine learning models on the cloud.

23
Q

Microsoft Azure Databricks

A

Provides Apache Spark-based analytics.

24
Q

Microsoft Azure Bot Service

A

Provides smart, intelligent, and scalable bot services.

25
Draw a flowchart showing how Amazon SageMaker works. Explain it briefly.
Events from AWS Lambda and Amazon CloudWatch trigger an event handler, which launches model training and model deployment in Amazon SageMaker; when a training job finishes, its output is stored in an Amazon S3 bucket.

+--------------------+
|   Amazon Lambda    |
+---------+----------+
          |
          v
+------------------+      +-------------+
|      Amazon      |----->|    Event    |
|    CloudWatch    |      |   Handler   |
+------------------+      +------+------+
                                 |
                  +--------------+--------------+
                  |                             |
                  v                             v
           Model Training              Model Deployment
                  |                             |
                  v                             v
     +--------------------------------------+
     |           Amazon SageMaker           |
     +------------------+-------------------+
                        |
                        v
            Finish Model Training Job
                        |
                        v
              +------------------+
              |    Amazon S3     |
              |      Bucket      |
              +------------------+
26
Explain each step in the AWS SageMaker workflow.
- Data preparation: Collecting, cleaning, and transforming data into the appropriate format.
- Model building: Using pre-built algorithms/frameworks or custom algorithms to build the model.
- Model training: Training the model using the prepared data, with options for distributed training.
- Model optimization: Fine-tuning hyperparameters and optimizing the architecture for performance.
- Model deployment: Deploying the model to endpoints for use in production.
- Model monitoring: Tracking performance metrics and detecting anomalies in real time.
- Model management: Managing the model over time, including updates and retraining.
27
Explain four approaches used to make AWS SageMaker secure.
* IAM integration: Uses AWS Identity and Access Management (IAM) policies to maintain security for data stored in S3 (data lake).
* Encryption: Offers optional encryption for models in transit and at rest using AWS Key Management Service (KMS).
* Secure connections: All API requests are transmitted over Secure Sockets Layer (SSL) connections.
* VPC deployment: Can be deployed within an Amazon Virtual Private Cloud (VPC) for greater control over data flow.
28
SageMaker provides options for deploying models to various endpoints. Mention three of them.
* Amazon EC2 instances.
* Lambda functions.
* API Gateway.
29
Mention four well-known graph analytics platforms.
* PuppyGraph.
* AWS Neptune.
* Neo4j.
* DataStax.
30
What do we mean by the statement “Neo4j is a highly scalable and native graph database”?
It means Neo4j is designed specifically for storing and processing graph data (native), offering a powerful and flexible data model that allows for efficient querying and analysis of complex, interconnected data.
31
Mention the two components of a graph analytics platform.
* Data storage component: Responsible for storing graph data (e.g., a graph database).
* Analytics engine: Responsible for performing the actual analysis on the graph data (e.g., path analysis).
32
Compare batch ingestion and streaming ingestion:
Batch ingestion:
* Way of ingestion: Collects data over a set period and ingests it all at once (in fixed groups).
* Data size: Efficient for large data volumes.
* Update speed: Used when delay is acceptable (e.g., hourly, daily).
Streaming ingestion:
* Way of ingestion: Data flows in as it happens; each event is sent immediately.
* Data size: Handles continuous updates (individual records/events).
* Update speed: Used when fast updates are needed (low latency, real time).
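The contrast can be sketched in a few lines of Python (the events and the per-event processing are made up for illustration):

```python
events = [1, 2, 3, 4, 5, 6]

# Batch ingestion: accumulate events, then process them in fixed groups.
batch_size = 3
batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
batch_results = [sum(batch) for batch in batches]   # one result per group

# Streaming ingestion: process each event the moment it "arrives".
stream_results = []
for event in events:            # imagine these arriving one at a time
    stream_results.append(event * 2)

print(batch_results)    # [6, 15]
print(stream_results)   # [2, 4, 6, 8, 10, 12]
```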
33
Draw a diagram showing the pipeline of the Amazon Kinesis Data Streams framework and explain how it works. Mention clearly the difference between data streams, shards, data record, data blob, and partition key.
How it works:
1. Producers push data records into a Kinesis stream.
2. Data is stored in shards for a retention period (24 hours by default).
3. Consumers (apps or AWS services) pull and process records in real time.

Definitions:
* Data stream: A container for data records.
* Shard: The unit of capacity in a stream; each shard handles a fixed number of reads/writes.
* Data record: The smallest unit in a stream (up to 1 MB), consisting of a data blob and a partition key.
* Data blob: The actual data payload within the record.
* Partition key: Determines which shard a record goes to.

+----------------+              +----------------+
| Event Producer |              | Event Consumer |
+----------------+              +----------------+
        \                              /
   Push Messages                 Pull Messages
          \                          /
           v                        ^
+-------------------------------------+
|        Kinesis Data Streams         |
|-------------------------------------|
|  Stream 1  |  Stream 2  |  Stream 3 |
+-------------------------------------+
                  ^
                  |
+------------------------------------+
|      Anatomy of a Data Stream      |
|------------------------------------|
| Shard 1 | █ █ █ █ █ □ □ □ □ □      |
| Shard 2 | █ █ □ □ □ □ □ □ □ □      |
| Shard 3 | █ □ □ □ □ □ □ □ □ □      |
+------------------------------------+
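The partition-key-to-shard mapping can be sketched with the standard library (this mimics Kinesis's documented scheme of hashing the key with MD5 into a 128-bit hash-key space split evenly among shards; the key names are made up):

```python
import hashlib

def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index, Kinesis-style:
    treat the MD5 hash of the key as a 128-bit integer and give
    each shard an equal slice of that hash-key space."""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return h * num_shards // 2 ** 128

# Records with the same partition key always land in the same shard,
# which is what preserves per-key ordering.
assert shard_for("sensor-42", 3) == shard_for("sensor-42", 3)
for key in ["sensor-1", "sensor-2", "sensor-3"]:
    print(key, "-> shard", shard_for(key, 3))
```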
34
Draw a diagram showing the pipeline of the Apache Kafka framework and explain how it works. Mention clearly the difference between the following elements of the framework: topic, partition, brokers, cluster, and zookeeper.
How it works:
1. Producers publish messages to a topic.
2. Kafka stores messages in partitions on brokers.
3. Consumers subscribe to topics and process messages in order.

Definitions:
* Topic: A named category for messages where data is published.
* Partition: A subdivision of a topic that enables parallelism and scalability.
* Brokers: The Kafka servers that store topic partitions.
* Cluster: Multiple brokers working together for fault tolerance and scalability.
* Zookeeper: Manages cluster metadata and leader election.

+----------------+              +----------------+
| Event Producer |              | Event Consumer |
+----------------+              +----------------+
        \                              /
   Push Messages                 Pull Messages
          \                          /
           v                        ^
+-------------------------------------+
|            Kafka Cluster            |
|-------------------------------------|
|  Topic 1   |  Topic 2   |  Topic 3  |
+-------------------------------------+
                  ^
                  |
+------------------------------------+
|      Anatomy of a Kafka Topic      |
|------------------------------------|
| Partition 0 | █ █ █ █ □ □ □ □ □ □  |
| Partition 1 | █ █ □ □ □ □ □ □ □ □  |
| Partition 2 | █ □ □ □ □ □ □ □ □ □  |
+------------------------------------+
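A toy model of topic/partition behavior in pure Python (real Kafka's default partitioner hashes keys with murmur2; the plain `hash`-mod here is only an illustration, and the class and keys are made up):

```python
from collections import defaultdict

class MiniTopic:
    """A toy Kafka topic: one append-only message log per partition."""

    def __init__(self, num_partitions: int):
        self.partitions = defaultdict(list)
        self.num_partitions = num_partitions

    def produce(self, key: str, value: str) -> int:
        # Same key -> same partition, which preserves per-key ordering.
        partition = hash(key) % self.num_partitions
        self.partitions[partition].append(value)
        return partition

    def consume(self, partition: int):
        # Messages are read back in the order they were appended.
        return list(self.partitions[partition])

topic = MiniTopic(num_partitions=3)
p = topic.produce("order-7", "created")
topic.produce("order-7", "paid")          # same key, so same partition
print(topic.consume(p))  # ['created', 'paid']
```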
35
Mention where the job is done in the case of ETL and ELT:
Extract:
* ETL: From sources (DBs, APIs, files).
* ELT: From sources.
Load:
* ETL: Into the target system (data warehouse) after transformation.
* ELT: Raw data is loaded directly into the target system.
Transform:
* ETL: Done in an ETL engine (before loading).
* ELT: Done inside the target system using its processing power.
36
True or false: Airbyte is an open-source tool that supports customizable pipelines.
(T)
37
True or false: Sqoop is used to move data from a database to HDFS.
(T)
38
True or false: Informatica PowerCenter is an enterprise tool with governance features.
(T)
39
True or false: AWS Glue is a managed service on AWS.
(T)
40
True or false: Airbyte provides many source connectors.
(T)
41
True or false: Talend Open Studio is an open-source option.
(T)
42
True or false: Matillion is used only for on-premise systems.
(F)
43
True or false: Apache NiFi works with Hadoop environments.
(T)
44
True or false: Talend Enterprise targets enterprise needs.
(T)
45
True or false: AWS Glue is the best choice when you need an open-source tool.
(F)
46
Mention suitable batch ingestion tools for the following use case: There is a demand for open-source and customizable pipelines.
Airbyte, Apache NiFi, Talend (Open Studio).
47
Mention suitable batch ingestion tools for the following use case: You are using Hadoop or big data stack software in your project.
Sqoop (database → HDFS), NiFi.
48
Mention suitable batch ingestion tools for the following use case: There is a need for enterprise features, data governance, and reliability.
Informatica PowerCenter, Talend (enterprise), Matillion.
49
Mention suitable batch ingestion tools for the following use case: You are using the AWS cloud in your project and you want a managed service.
AWS Glue, Matillion.
50
Mention suitable batch ingestion tools for the following use case: In your project you need many source connectors and flexibility.
Airbyte.
51
Describe how ETL jobs operate in the following scenarios:
1) Sales data aggregation
* Extract: Take daily sales data from store databases.
* Transform: Clean data, unify currency, calculate totals per store/product.
* Load: Save the aggregated results into a data warehouse (Redshift/Snowflake).

2) IoT sensor data
* Extract: Collect hourly readings from IoT devices.
* Transform: Remove outliers, convert units, compute averages.
* Load: Store data in a time-series DB or cloud warehouse.

3) Healthcare data pipeline
* Extract: Get patient records from hospital systems.
* Transform: Remove personal info, standardize medical units, group by diagnosis.
* Load: Insert cleaned data into a research database.
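The first scenario can be sketched end to end in a few lines of Python (the rows, the fixed exchange rate, and the dict standing in for the warehouse are all illustrative assumptions):

```python
from collections import defaultdict

# Extract: daily sales pulled from store databases (hypothetical rows).
raw = [
    {"store": "A", "amount": "100", "currency": "USD"},
    {"store": "A", "amount": "50",  "currency": "EUR"},
    {"store": "B", "amount": "200", "currency": "USD"},
]

# Transform: clean types, unify currency, and total per store.
EUR_TO_USD = 1.1  # illustrative fixed rate, not a real quote
totals = defaultdict(float)
for row in raw:
    amount = float(row["amount"])
    if row["currency"] == "EUR":
        amount *= EUR_TO_USD
    totals[row["store"]] += amount

# Load: write the aggregates to the warehouse (a dict stands in for it).
warehouse = {store: round(total, 2) for store, total in totals.items()}
print(warehouse)  # {'A': 155.0, 'B': 200.0}
```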
52
Provide a definition of virtualization within cloud computing.
Virtualization is a technology that allows you to use one physical computer as if it were many by running multiple virtual machines on it. In cloud computing, it is used to split one big server into smaller, isolated virtual servers, allowing resources to be used efficiently so businesses only pay for what they need without buying extra hardware.
53
Define containerization in the context of cloud computing.
Containerization is the process of packing an application along with all its dependencies (libraries, configuration files) into a single, isolated unit called a "container". This ensures that the application runs consistently and efficiently in any environment (whether on a laptop or a cloud server) because it is independent of the host operating system.
54
Describe the roles of a hypervisor in virtualization:
* Resource allocation: The hypervisor controls the virtual machines' use of physical resources.
* Isolation: It creates and runs isolated virtual machines (VMs).
* Management: It serves as an intermediary between the physical computer and the virtual machines.
* Security: By isolating VMs, it ensures that if one VM fails, the others are unaffected.
55
What distinguishes Type 1 hypervisors from Type 2 hypervisors?
* Type 1 (bare-metal): Installed directly onto the computer hardware, with no operating system in between. It is highly efficient, as it has direct access to resources.
* Type 2 (hosted): Runs on top of an installed operating system (like Windows or macOS). It is used when you need to run more than one OS on a single machine.
56
Provide real-world examples of the following: Application Network Desktop Storage Server Data
* Application virtualization: Microsoft Azure (lets people use applications without installing them locally).
* Network virtualization: Google Cloud (allows companies to create networks using software).
* Desktop virtualization: Amazon WorkSpaces or Google Cloud (GCP) Virtual Desktops.
* Storage virtualization: Amazon S3 (combines storage into a single system).
* Server virtualization: VMware vSphere, Microsoft Hyper-V, or KVM.
* Data virtualization: Solutions from companies like Oracle and IBM.
57
Explain with an example how containers work in the context of cloud computing.
How it works: Containers virtualize the operating system of a server, packaging code and dependencies into a standard unit.
Example: Using Docker, you can quickly deploy applications into any environment (such as AWS).
58
Mention five ways to run containers on AWS.
1. Amazon Elastic Container Service (ECS): Highly scalable container management.
2. AWS Fargate: Runs containers without managing servers/infrastructure.
3. Amazon Elastic Kubernetes Service (EKS): Runs Kubernetes on AWS.
4. Amazon Elastic Container Registry (ECR): Stores and manages Docker container images.
5. AWS Batch: Runs batch processing workloads using Docker containers.
59
Compare containers and virtual machines:
1. Architecture
* Containers: Share the host OS kernel; isolated user spaces make them lightweight.
* Virtual machines: A hypervisor runs on the host; each VM includes a full guest OS with virtualized hardware.
2. Boot time
* Containers: Much less (seconds), as they do not need to boot a full OS.
* Virtual machines: Longer, because the full OS needs to be initialized.
3. Isolation
* Containers: Process-level isolation (less strong than VMs).
* Virtual machines: Very good isolation, because each VM is a complete system with its own OS.
4. Resource usage
* Containers: Consume fewer resources (only the necessary binaries/libraries).
* Virtual machines: Very high, as full OS overhead is incurred for each instance.
60
Write the steps needed to deploy an application on AWS using Docker.
1. Build: Create the containerized application using the Docker environment.
2. Store: Use Amazon Elastic Container Registry (ECR) to store and manage the Docker container images securely.
3. Deploy: Deploy the application from the local Docker environment to Amazon ECS.
4. Run: Use AWS Fargate to run the containers without provisioning or managing servers.