What process takes the most time in producing timely insights?
A disproportionate amount of time (70%) is spent on data preparation:
Define Data Lakes
Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured.
Purpose of Data Lakes
Aims to solve two problems
Avoid information silos. Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and license reduction.
Big Data projects require a large amount of varied information. The information is so varied that it’s not clear what it is when it is received, and constraining it in something as structured as a data warehouse or relational database management system (RDBMS) constrains future analysis.
Different teams need different data.
The data lake aims to provide the appropriate data access to the business in a cost effective manner that also protects and governs data.
Benefits of Data Lakes
The 4 categories of systems and the data they produce
Systems of Record - Context data related to the business transactions of the organization (OLTP)
Systems of Engagement - Big data about the activity of the individual using a mobile device
Systems of Automation - Big data from sensors monitoring an asset or location (IoT, Industrial IoT)
Systems of Insight - Analytics based on historical data collected from multiple sources
Data Lakes vs. Data Marts
Data Mart
Data Lake
Data Warehouse vs. Data Lake
8 differences
1) For the most part, a Data Warehouse is set up to support the BI reports that form a large part of the requirements the business gives to BI projects; a Data Lake enables people to do their own ad-hoc analyses. (ad-hoc: “created or done, for a particular purpose, as necessary”)
4) A Data Warehouse's schema is known in advance, through design, rather than being discovered as the data is loaded. Similarly, Data Warehouses usually run on fixed servers in a data centre (although Cloud is becoming popular), whereas Data Lakes may be largely in the Cloud, with computing power called upon as needed.
7) Data Lakes hold big data and are used by data scientists.

Data Lake vs. Data Swamp
data swamp: lots of unknowns!
unclear data…
A ‘good’ data lake has no open questions.
3 Elements of Data Lakes

Simplified Data Lake Architecture
On the left, we see data coming in from Systems of Record, Systems of Engagement, and Systems of Automation.
Various data buckets or repositories.
Less structured data can be used by Data Scientists for analysis; sometimes they will transform it as well.
Some of the data will be loaded into the Data Warehouse for use by Business Analysts.
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
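As a rough illustration of these ideas, the sketch below smooths an equally spaced series with a trailing moving average and makes a naive forecast from previously observed values. The closing values and window size are invented for the example, not real market data.

```python
# Minimal time series sketch: trailing moving average + naive forecast.

def moving_average(series, window):
    """Return the trailing moving average for each full window."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

def naive_forecast(series, window):
    """Forecast the next point as the mean of the last `window` points."""
    return sum(series[-window:]) / window

# Equally spaced daily closing values (hypothetical).
closes = [100.0, 102.0, 101.0, 104.0, 106.0, 105.0]

smoothed = moving_average(closes, window=3)  # first value: (100+102+101)/3 = 101.0
forecast = naive_forecast(closes, window=3)  # (104+106+105)/3 = 105.0
```

A real deployment would use a dedicated library and a proper forecasting model; the point here is only that time-ordered, equally spaced data supports this style of analysis.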
Some of the data will be loaded into an in-memory database such as SAP HANA for use by Business Analysts to do real-time or streaming analytics
The really important message is the emphasis on Governance, enabled by Metadata, which keeps a record of:

Users supported by (IBM’s) Data Lake
Analytics teams – Analysts, Data Scientists, etc.
Information Curator – a person who is responsible for creating, maintaining, and correcting any errors in the description of the information store in the governance catalog.
Governance, Risk and Compliance Team – responsible for Information Governance
Line of Business Teams – the business users (business analysts, managers etc.)
Data Lake Operations – the IT Run Organisation that keeps the Data Lake running and providing service to the users

The subsystems inside (IBM’s) Data Lake
Enterprise IT Data Exchange: about getting data into the Data Lake
Self-Service Access has various viewpoints:
2 types of self-service access
Catalogue: Much of the operation of the data reservoir is centered on a catalog of the information that the data reservoir is aware of. The catalog is populated using a process that is called curation.
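A catalog entry produced by curation can be pictured as a small record describing the data and who is accountable for it. This is a hypothetical sketch; the field names are illustrative and do not reflect IBM's actual catalog schema.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str          # business name of the information store
    description: str   # what the data means, written by the curator
    owner: str         # information owner accountable for accuracy
    curator: str       # person maintaining this description
    classifications: list = field(default_factory=list)  # governance tags

catalog = {}

def curate(entry: CatalogEntry):
    """Register (or update) an entry in the data reservoir's catalog."""
    catalog[entry.name] = entry

# Example curation of one (invented) information store:
curate(CatalogEntry(
    name="customer_transactions",
    description="Card transactions from the OLTP system of record",
    owner="Head of Payments",
    curator="Jane Doe",
    classifications=["personal-data", "system-of-record"],
))
```

The catalog then becomes the place where self-service users look up what data exists and what it means.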

The subsystems inside (IBM’s) Data Lake - Part II
Self service access for data scientists allows them to quickly develop analytics, often using a “sandbox” system
Accuracy in the catalog builds trust with the users
The business users can use self service to do “what-if” experiments and analysis
New sources of data can be imported to be analyzed

The subsystems inside (IBM’s) Data Lake - Part III
View from the user community - fraud
Data Lakes are useful for fraud investigation and protection development.
The business users in the fraud team can use self service to investigate cases of fraud
They would use data in the Data Lake to detect (perhaps using stream analytics) fraud
The data scientists would develop new models for fraud detection
The compliance team can use the data in the Lake to report their compliance to their regulators. (A regulator is a public authority or government agency responsible for exercising autonomous authority over some area of human activity in a regulatory or supervisory capacity, e.g. the Information Commissioner’s Office or the Financial Conduct Authority.)
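The stream-analytics style of fraud detection mentioned above can be sketched very simply: keep a running average per customer and flag transactions that deviate sharply from it. The thresholds, customer IDs, and amounts below are invented for illustration; real fraud models are far richer.

```python
from collections import defaultdict

# Per-customer running state, updated as transactions stream in.
state = defaultdict(lambda: {"count": 0, "mean": 0.0})

def score(customer, amount, factor=3.0):
    """Return True if `amount` looks anomalous for this customer."""
    s = state[customer]
    # Flag only once we have a little history to compare against.
    suspicious = s["count"] >= 3 and amount > factor * s["mean"]
    # Update the running mean incrementally, as a stream processor would.
    s["count"] += 1
    s["mean"] += (amount - s["mean"]) / s["count"]
    return suspicious

stream = [("c1", 20.0), ("c1", 25.0), ("c1", 22.0), ("c1", 500.0)]
alerts = [(c, a) for c, a in stream if score(c, a)]  # the 500.0 is flagged
```

The data scientists' job is to replace this naive rule with better models; the business users investigate whatever gets flagged.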

Data Lakes have catalogs (like a library)
Key things: governance, lineage, metadata
Information curation - the creation of the description of the data in the data lake’s catalog.
The information owner is responsible for the accuracy of this definition. However, for enterprise IT systems, the information owner is often a senior person who delegates to an information curator. The information curator understands what that data is and can describe it in an entry in the catalog.

Data Lineage
The Lifecycle of Data
Done through metadata management
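One way to picture lineage metadata is as a log of "this dataset was produced from these sources by this process", which can then be walked backwards to trace a dataset's lifecycle. This is an illustrative sketch; the dataset and process names are invented.

```python
# Hypothetical lineage log kept as metadata.
lineage = []

def record_lineage(target, sources, process):
    """Record that `target` was produced from `sources` by `process`."""
    lineage.append({"target": target, "sources": sources, "process": process})

record_lineage("dw.sales_summary", ["lake.raw_sales"], "nightly ETL")
record_lineage("lake.raw_sales", ["oltp.orders"], "data exchange ingest")

def trace(dataset):
    """Walk back through the lineage records to find all upstream sources."""
    upstream = []
    for rec in lineage:
        if rec["target"] == dataset:
            for src in rec["sources"]:
                upstream.append(src)
                upstream.extend(trace(src))
    return upstream

# trace("dw.sales_summary") walks back to the original OLTP source.
```

Real metadata tools store much more (timestamps, schemas, owners), but the trace-backwards idea is the same.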

Data Governance
Governance ensures proper management and use of information.
Diagram shows

10 Considerations for a well-managed and governed data lake

How information governance provides the mechanism for building trust
The Information broker is the runtime server environment for running the integration processes that move data into and out of the data reservoir and between components within the reservoir. It typically includes an extract, transform, and load (ETL) engine for moving around data.
The Code hub is used primarily to facilitate transcoding of data coming into the reservoir and data feeds flowing out. Additionally, to support analytics, the reference data can map the canonical forms to strings to make it easier for the analytics users and their activities.
Staging areas are used to manage movement of data into, out of, and around the data reservoir, and to provide appropriate decoupling between systems. The implementation can include database tables, directories within Hadoop, message queues, or similar structures.
The operational governance hub provides dashboards and reports for reviewing and managing the operation of the data reservoir. Typically it is used by the following groups:
– Information owners and data stewards wanting to understand the data quality issues in the data they are responsible for that have been discovered by the data reservoir.
– Security officers interested in the types and levels of security and data protection issues that have been raised.
– The data reservoir operations team wanting to understand the overall usage and performance of the data reservoir.
Monitor - Like any piece of infrastructure, it is important to understand how the data reservoir is performing. Are there hotspots? Are you getting more usage than you expected? How are you managing your storage? The data reservoir has many monitor components deployed that record the activity in the data reservoir along with its availability, functionality, and performance. The management of any alerts that the monitors raise can be resolved using workflow.
Workflow - Successful use of a data reservoir depends on various processes involving systems, users, and administrators. For example, provisioning new data into the data reservoir might involve an information curator defining the catalog entry to describe and classify the data. An information owner must approve the classifications, and an integration developer must create the data ingestion process. Workflow coordinates the work of these people.
Guards are controls within the reservoir to enforce restrictions on access to data and related protection mechanisms. These guards can include ensuring the requesting user is authorized, data masking being applied, or certain rows of data being filtered out.
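The three guard mechanisms named above (authorization, masking, row filtering) can be sketched together in a few lines. The roles, the EU row filter, and the card-number data are invented examples, not any product's actual policy.

```python
def mask(value):
    """Mask all but the last four characters of a sensitive value."""
    return "*" * (len(value) - 4) + value[-4:]

def guard(rows, user_role):
    """Authorize the requester, filter restricted rows, mask sensitive columns."""
    if user_role not in ("analyst", "data-scientist"):
        raise PermissionError("user is not authorized for this data")
    result = []
    for row in rows:
        if row["region"] != "EU":  # row-level filter (illustrative rule)
            result.append({**row, "card": mask(row["card"])})
    return result

rows = [
    {"card": "4111111111111111", "region": "US"},
    {"card": "5500000000000004", "region": "EU"},
]
visible = guard(rows, "analyst")  # EU row filtered out, card number masked
```

In a real reservoir these controls sit in the access path itself, driven by the classifications in the catalog rather than hard-coded rules.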

Organisations expect information governance to deliver…
Establishing Information Governance Policies
These are called the governance principles since they underpin all other information governance decisions.

Governance Rules
Defined for each classification for each situation
Policies translate to rules
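The "rules defined for each classification for each situation" idea can be pictured as a lookup keyed on (classification, situation). The classifications, situations, and rule texts below are invented examples of how a policy might translate into concrete rules.

```python
# Hypothetical governance rules, keyed by (classification, situation).
rules = {
    ("personal-data", "export"): "deny unless anonymised",
    ("personal-data", "internal-analytics"): "mask direct identifiers",
    ("public-data", "export"): "allow",
}

def rule_for(classification, situation):
    """Look up the governance rule for a classification in a given situation."""
    return rules.get((classification, situation), "refer to governance team")
```

The safe default for an unmapped combination is escalation rather than access, which is one way a policy principle carries through to every rule.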

Data Classifications
Classification is at the heart of information governance.
Key Requirements
Business Classifications
Role Classifications
Resource Classifications
Activity Classifications
Semantic Mapping
Data Privacy Classifications

Information Governance in the broader context:
Obligations and Delegations
Export controls/restrictions …
Governance should help you report your compliance with laws and regulations
