What process takes the most time in producing timely insights?
A disproportionate amount of time (70%) is spent on data preparation:
Define Data Lakes
Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured.
Purpose of Data Lakes
Aims to solve two problems
Avoid information silos. Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and license reduction.
Big Data projects require a large amount of varied information. The information is so varied that it’s not clear what it is when it is received, and constraining it in something as structured as a data warehouse or relational database management system (RDBMS) constrains future analysis.
Different teams need different data.
The data lake aims to provide the appropriate data access to the business in a cost effective manner that also protects and governs data.
Benefits of Data Lakes
The 4 categories of systems and the data they produce
Systems of Record - Context data related to the business transactions of the organization (OLTP)
Systems of Engagement - Big data about the activity of the individual using a mobile device
Systems of Automation - Big data from sensors monitoring an asset or location (IoT, Industrial IoT)
Systems of Insight - Analytics based on historical data collected from multiple sources
Data Lakes vs. Data Marts
Data Mart
Data Lake
Data Warehouse vs. Data Lake
8 differences
1) For the most part, a Data Warehouse is set up to support the BI reports that form a large part of the requirements the business gives to BI projects; a Data Lake enables people to do their own ad-hoc analyses. (ad-hoc: “created or done, for a particular purpose, as necessary”)
4) A Data Warehouse's schema is known in advance, through design, rather than being discovered as the data is loaded. Similarly, Data Warehouses usually run on fixed servers in a data centre (although Cloud is becoming popular), whereas Data Lakes may be largely in the Cloud, with computing power called upon as needed.
7) Data Lakes hold big data and are used by data scientists.

Data Lake vs. Data Swamp
data swamp: lots of unknowns!
unclear data…
A ‘good’ data lake has no open questions.
3 Elements of Data Lakes

Simplified Data Lake Architecture
On the left, we see data coming in from Systems of Record, Systems of Engagement, and Systems of Automation.
Various data buckets or repositories.
Less structured data can be used by Data Scientists for analysis; sometimes they will transform it as well.
Some of the data will be loaded into the Data Warehouse for use by Business Analysts.
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
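As a rough illustration of these ideas, the sketch below smooths an equally spaced series with a trailing moving average and makes a naive forecast from previously observed values. The closing values and window size are invented for the example, not real market data.

```python
# Minimal time series sketch: trailing moving average + naive forecast.

def moving_average(series, window):
    """Return the trailing moving average for each full window."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

def naive_forecast(series, window):
    """Forecast the next point as the mean of the last `window` points."""
    return sum(series[-window:]) / window

# Equally spaced daily closing values (hypothetical).
closes = [100.0, 102.0, 101.0, 104.0, 106.0, 105.0]

smoothed = moving_average(closes, window=3)  # first value: (100+102+101)/3 = 101.0
forecast = naive_forecast(closes, window=3)  # (104+106+105)/3 = 105.0
```

A real deployment would use a dedicated library and a proper forecasting model; the point here is only that time-ordered, equally spaced data supports this style of analysis.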
Some of the data will be loaded into an in-memory database such as SAP HANA for use by Business Analysts to do real-time or streaming analytics
The really important message is the emphasis on Governance, enabled by Metadata, which keeps a record of:

Users supported by (IBM’s) Data Lake
Analytics teams – Analysts, Data Scientists, etc.
Information Curator – a person who is responsible for creating, maintaining, and correcting any errors in the description of the information store in the governance catalog.
Governance, Risk and Compliance Team – responsible for Information Governance
Line of Business Teams – the business users (business analysts, managers etc.)
Data Lake Operations – the IT Run Organisation that keeps the Data Lake running and providing service to the users

The subsystems inside (IBM’s) Data Lake
Enterprise IT Data Exchange: about getting data into the Data Lake
Self-Service Access has various viewpoints:
2 types of self-service access
Catalogue: Much of the operation of the data reservoir is centered on a catalog of the information that the data reservoir is aware of. The catalog is populated using a process that is called curation.
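A catalog entry produced by curation can be pictured as a small record describing the data and who is accountable for it. This is a hypothetical sketch; the field names are illustrative and do not reflect IBM's actual catalog schema.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str          # business name of the information store
    description: str   # what the data means, written by the curator
    owner: str         # information owner accountable for accuracy
    curator: str       # person maintaining this description
    classifications: list = field(default_factory=list)  # governance tags

catalog = {}

def curate(entry: CatalogEntry):
    """Register (or update) an entry in the data reservoir's catalog."""
    catalog[entry.name] = entry

# Example curation of one (invented) information store:
curate(CatalogEntry(
    name="customer_transactions",
    description="Card transactions from the OLTP system of record",
    owner="Head of Payments",
    curator="Jane Doe",
    classifications=["personal-data", "system-of-record"],
))
```

The catalog then becomes the place where self-service users look up what data exists and what it means.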

The subsystems inside (IBM’s) Data Lake - Part II
Self service access for data scientists allows them to quickly develop analytics, often using a “sandbox” system
Accuracy in the catalog builds trust with the users
The business users can use self service to do “what-if” experiments and analysis
New sources of data can be imported to be analyzed

The subsystems inside (IBM’s) Data Lake - Part III
View from the user community - fraud
Data Lakes are useful for fraud investigation and protection development.
The business users in the fraud team can use self service to investigate cases of fraud
They would use data in the Data Lake to detect (perhaps using stream analytics) fraud
The data scientists would develop new models for fraud detection
The compliance team can use the data in the Lake to report their compliance to their regulators. (A regulator is a public authority or government agency responsible for exercising autonomous authority over some area of human activity in a regulatory or supervisory capacity, e.g. the Information Commissioner’s Office or the Financial Conduct Authority.)
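The stream-analytics style of fraud detection mentioned above can be sketched very simply: keep a running average per customer and flag transactions that deviate sharply from it. The thresholds, customer IDs, and amounts below are invented for illustration; real fraud models are far richer.

```python
from collections import defaultdict

# Per-customer running state, updated as transactions stream in.
state = defaultdict(lambda: {"count": 0, "mean": 0.0})

def score(customer, amount, factor=3.0):
    """Return True if `amount` looks anomalous for this customer."""
    s = state[customer]
    # Flag only once we have a little history to compare against.
    suspicious = s["count"] >= 3 and amount > factor * s["mean"]
    # Update the running mean incrementally, as a stream processor would.
    s["count"] += 1
    s["mean"] += (amount - s["mean"]) / s["count"]
    return suspicious

stream = [("c1", 20.0), ("c1", 25.0), ("c1", 22.0), ("c1", 500.0)]
alerts = [(c, a) for c, a in stream if score(c, a)]  # the 500.0 is flagged
```

The data scientists' job is to replace this naive rule with better models; the business users investigate whatever gets flagged.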

Data Lakes have catalogs (like a library)
Key things: governance, lineage, metadata
Information curation - the creation of the description of the data in the data lake’s catalog.
The information owner is responsible for the accuracy of this definition. However, for enterprise IT systems, the information owner is often a senior person who delegates to an information curator. The information curator understands what that data is and can describe it in an entry in the catalog.

Data Lineage
The Lifecycle of Data
Done through metadata management
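One way to picture lineage metadata is as a log of "this dataset was produced from these sources by this process", which can then be walked backwards to trace a dataset's lifecycle. This is an illustrative sketch; the dataset and process names are invented.

```python
# Hypothetical lineage log kept as metadata.
lineage = []

def record_lineage(target, sources, process):
    """Record that `target` was produced from `sources` by `process`."""
    lineage.append({"target": target, "sources": sources, "process": process})

record_lineage("dw.sales_summary", ["lake.raw_sales"], "nightly ETL")
record_lineage("lake.raw_sales", ["oltp.orders"], "data exchange ingest")

def trace(dataset):
    """Walk back through the lineage records to find all upstream sources."""
    upstream = []
    for rec in lineage:
        if rec["target"] == dataset:
            for src in rec["sources"]:
                upstream.append(src)
                upstream.extend(trace(src))
    return upstream

# trace("dw.sales_summary") walks back to the original OLTP source.
```

Real metadata tools store much more (timestamps, schemas, owners), but the trace-backwards idea is the same.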

Data Governance
Governance ensures proper management and use of information.
Diagram shows

10 Considerations for a well-managed and governed data lake

How information governance provides the mechanism for building trust
The Information broker is the runtime server environment for running the integration processes that move data into and out of the data reservoir and between components within the reservoir. It typically includes an extract, transform, and load (ETL) engine for moving around data.
The Code hub is used primarily to facilitate transcoding of data coming into the reservoir and data feeds flowing out. Additionally, to support analytics, the reference data can map the canonical forms to strings to make it easier for the analytics users and their activities.
Staging areas are used to manage movement of data into, out of, and around the data reservoir, and to provide appropriate decoupling between systems. The implementation can include database tables, directories within Hadoop, message queues, or similar structures.
The operational governance hub provides dashboards and reports for reviewing and managing the operation of the data reservoir. Typically it is used by the following groups:
– Information owners and data stewards wanting to understand the data quality issues in the data they are responsible for that have been discovered by the data reservoir.
– Security officers interested in the types and levels of security and data protection issues that have been raised.
– The data reservoir operations team wanting to understand the overall usage and performance of the data reservoir.
Monitor - Like any piece of infrastructure, it is important to understand how the data reservoir is performing. Are there hotspots? Are you getting more usage than you expected? How are you managing your storage? The data reservoir has many monitor components deployed that record the activity in the data reservoir along with its availability, functionality, and performance. The management of any alerts that the monitors raise can be resolved using workflow.
Workflow - Successful use of a data reservoir depends on various processes involving systems, users, and administrators. For example, provisioning new data into the data reservoir might involve an information curator defining the catalog entry to describe and classify the data. An information owner must approve the classifications, and an integration developer must create the data ingestion process. Workflow coordinates the work of these people.
Guards are controls within the reservoir to enforce restrictions on access to data and related protection mechanisms. These guards can include ensuring the requesting user is authorized, data masking being applied, or certain rows of data being filtered out.
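The three guard mechanisms named above (authorization, masking, row filtering) can be sketched together in a few lines. The roles, the EU row filter, and the card-number data are invented examples, not any product's actual policy.

```python
def mask(value):
    """Mask all but the last four characters of a sensitive value."""
    return "*" * (len(value) - 4) + value[-4:]

def guard(rows, user_role):
    """Authorize the requester, filter restricted rows, mask sensitive columns."""
    if user_role not in ("analyst", "data-scientist"):
        raise PermissionError("user is not authorized for this data")
    result = []
    for row in rows:
        if row["region"] != "EU":  # row-level filter (illustrative rule)
            result.append({**row, "card": mask(row["card"])})
    return result

rows = [
    {"card": "4111111111111111", "region": "US"},
    {"card": "5500000000000004", "region": "EU"},
]
visible = guard(rows, "analyst")  # EU row filtered out, card number masked
```

In a real reservoir these controls sit in the access path itself, driven by the classifications in the catalog rather than hard-coded rules.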

Organisations expect information governance to deliver…
Establishing Information Governance Policies
These are called the governance principles since they underpin all other information governance decisions.

Governance Rules
Defined for each classification for each situation
Policies translate to rules
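The "rules defined for each classification for each situation" idea can be pictured as a lookup keyed on (classification, situation). The classifications, situations, and rule texts below are invented examples of how a policy might translate into concrete rules.

```python
# Hypothetical governance rules, keyed by (classification, situation).
rules = {
    ("personal-data", "export"): "deny unless anonymised",
    ("personal-data", "internal-analytics"): "mask direct identifiers",
    ("public-data", "export"): "allow",
}

def rule_for(classification, situation):
    """Look up the governance rule for a classification in a given situation."""
    return rules.get((classification, situation), "refer to governance team")
```

The safe default for an unmapped combination is escalation rather than access, which is one way a policy principle carries through to every rule.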

Data Classifications
Classification is at the heart of information governance.
Key Requirements
Business Classifications
Role Classifications
Resource Classifications
Activity Classifications
Semantic Mapping
Data Privacy Classifications

Information Governance in the broader context:
Obligations and Delegations
Export controls/restrictions …
Governance should help you report your compliance with laws and regulations
