Whats the article about?
data warehouses are created to provide dedicated source of data to support decision-making applications data repositories play important role in understanding customer behaviour, connecting trading partners along supply chain, supporting performance management systems etc
* aim of paper: tutorial on data warehousing that introduces newest concepts in field
What are the 4 data warehouse characteristics?
Data warehouse: collection of data created to support decision making; provides data infrastructure & thus eliminates failures that are due to lack of quality
–> ‘single version of truth’
4 DATA WAREHOUSE CHARACTERISTICS
1) Subject oriented: organised around specific subjects (sales, customer, products)
2) Integrated: data collected from multiple systems (internal/external) and integrated around subjected (eg in the end data around one customer identifier)
3) Time variant: maintains historical data
4) Non-volatile: users cannot change or update data; updates happen through IT controlled load processes
What are data marts?
Data mart: similar to data warehouse but stores data only for limited number of subject areas (smaller in scope than data warehouses)
Independent data mart: built directly from source systems; often ‘point solution’ that solves immediate problem but creates new ones in long run when firm tries to implement enterprise-wide data infrastructure (eg data warehouse)
Dependent data mart: created with data drawn from data warehouse; provides copy of data extracted from data warehouse to give users customised view of specific data; are preferred over independent ones as they come from organisation-wide source - ‘single version of truth’ is maintained
What are operational data stores?
Operational data store (ODS): consolidates data from multiple source systems & provides near real-time, integrated view of volatile, current (max 60 day old) data; aim to provide integrated data for organisational purposes; often used to avoid ERP implementation
Whats the architecture for data warehousing?
Architecture for data warehousing: includes the component parts and the relationships among the parts –>Data sources, ETL software, data stores, data analysis tools & applications, users
Whats the data mart strategy?
Data mart strategy: ‘start small, think big approach’ begins with specific business need for data and grows after time
* initial data mart contains data for only limited number of subject areas, draws from limited sources
* recognising that data mart will grow: needs to be planned and exercised beforehand as mart will grow if it is successful
–>if not: difficult to integrate data logically & physically
* advantages: due to limited scope initially developed quickly, at low cost, with less financial risk
* disadvantages: difficult to successfully grow a data mart while integrating new subject areas, data, users, applications along the way
Whats the enterprise data warehouse approach?
Data warehouse approach: traditional approach t warehousing; does not preclude creation of data marts but mart are created after warehouse is built & pull data from warehouse, not from source systems (dependent data)
* advantages: faster system response time as marts + simpler data view as marts are dependent; results in integrated data warehouse containing many subject areas for multiple users
* disadvantages: risk that it is never completed or fails to meet end users needs
What is ETL?
ETL: data extraction, transformation, loading
–> takes fata from source systems, prepares it for decision-support purposes & places it in target data base; ‘plumbing’ – dirty, complex, time-consuming, expensive
What are the two types of data extraction methods?
Custom-write data extraction programs: firm writes own ETL software when: wanting to avoid costs of purchasing, knows source systems well, understand complexities
Purchasing commercial ETL software: available from major database vendors & firms specialising in ETL software; allows firms to specify source systems, indicate tables and columns to be used, move data to specified targets & automate process
What is meant by data transformation?
Data transformation: several ways - cleansing the data, integrating the data, other transformations
DATA CLEANSING
Dirty data: result of poor data quality practice which little can be done about the data that is already there
* sources of dirty data: dummy values - inappropriate values entered; absence of data; cryptic data; multipurpose fields; contradicting data; violations of business rules; non-unique identifiers
* solutions to data cleansing: 1) rely on basic cleaning capabilities of ETL software 2) custom-write cleansing routines 3) use special-purpose data cleansing software
What is meant by data loading?
What is meta data?
Metadata: data about the data; needed by both IT personnel & users who access data
* IT personnel: data sources and targets, database, table, column names; refresh schedules etc
* data users: attribute definitions, report tools available, report distribution info, held desk info
* meta data has not received much attention yet: uncertainty about what metadata should be stored & lack of methods to share it across vendors’ products
* developments improving ability to create metadata: 1) attempts to create standards for metadata models 2) use of application program interfaces APIs
What are the three different types of information users?
Information producers: like analysts, create info for own and for others use
Information consumers: like executives, consume info that is prepared for them
1) Tourists: do not know what they want; look at overview of things with little deep data analysis; design of database is not issue as they do not do much with it; need metadata
2) Explorers: have idea what they are looking for but do not know how to find ‘priceless gem’ in data; biggest problem is vast amount of data; require metadata to get started
3) Farmers: know what they want, where and how to get it; find ‘flakes of gold’ in data rather than ‘gem’; use the same data regularly and thus do not rely on metdata
Whats the conclusion?