What are the FOUR dimensions of Big Data?
Volume, Velocity, Variety, and Value (the first three are transformed into the fourth; see the definition below)
What characterizes big data/how is it defined?
Big Data is an information asset characterized by such high Volume, Velocity, and Variety that it requires specific technology and analytical methods for its transformation into Value
Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making
What are the drivers behind big data?
What is NoSQL?
“Not only SQL” –> alternative models for data management beyond the relational (SQL) model
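To make the contrast with SQL concrete, here is a minimal sketch of a key-value store, one common NoSQL data model. The class and the example keys/values are hypothetical, not from any particular NoSQL product:

```python
# Minimal sketch of a key-value store, one common NoSQL data model.
# Unlike a SQL table, there is no fixed schema: each value can be any structure.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:1", {"name": "Ada", "tags": ["admin"]})   # nested, schema-free value
store.put("user:2", {"name": "Bob", "email": "b@x.org"})  # different fields are fine
```

Note how the two records have different fields, which a relational table with a fixed schema would not allow without NULL columns or schema changes.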
What is MPP? and how is it related to Big Data?
Massively Parallel Processing
–> a computing architecture in which many processors work on different parts of a task simultaneously, using high-bandwidth networks and massive I/O devices
RELATION TO BD:
- Big Data applies the same idea more cheaply: it couples clusters of commodity hardware components with open-source tools and technology
What five aspects will a corporation considering incorporating Big Data need to consider?
• Feasibility: Is the enterprise aligned in a way that allows for new and emerging technologies to be brought into
the organization?
• Reasonability: will the resource requirements exceed the capability of the existing or planned environment?
• Value: do the results warrant the investment?
• Integrability: any constraints or impediments within the organization from a technical, social, or political
perspective?
• Sustainability: are costs associated with maintenance, configuration, skills maintenance, and adjustments to the
level of agility sustainable?
Name the 7 types of people needed for implementing Big Data?
1) Business evangelist –> understands the current limitations of the existing technology infrastructure
2) Technical evangelist –> understands the emerging technology and the science behind it
3) Business analyst –> engages the business process owners and identifies measures to quantify value
4) Big Data application architect –> experienced in high-performance computing
5) Application developer –> identifies the technical resources with the right set of skills for programming
6) Program manager –> experienced in project management
7) Data scientist –> experienced in coding and statistics/AI
What is the Big Data framework? and what key components does it consist of?
Overall picture of the Big Data landscape, consists of:
What is API?
Application Programming Interface
A set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact. APIs are also used when programming graphical user interface (GUI) components
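As a small illustration of "specifying how components interact", Python's standard-library `json` module exposes an API: callers rely on the documented names, parameters, and return types of `dumps` and `loads` without knowing their implementation. The example record is hypothetical:

```python
import json

# The json module's API is a contract: dumps() and loads() have fixed
# names, parameters, and return types, so any component can use them
# without knowing how the encoding is implemented internally.
record = {"sensor": "t1", "reading": 21.5}
encoded = json.dumps(record)   # API call: Python object -> JSON string
decoded = json.loads(encoded)  # API call: JSON string -> Python object
```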
Which is better: row- or column-oriented data?
Column-oriented data (for analytical workloads), since storing each column separately reduces latency
- Access performance: row-oriented storage is not good for many simultaneous queries (as opposed to column-oriented)
- Speed of aggregation: much faster in column-oriented data
- Suitability for compression: column-oriented data is better suited for compression, decreasing storage needs
- Data load speed: faster in column-oriented storage; since each column is stored separately, you can load in parallel using multiple threads
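The two layouts can be sketched side by side on a toy table (hypothetical column names). Aggregating one column in the columnar layout touches only that column's contiguous list, while the row layout must walk every record:

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "price": 10.0, "qty": 2},
    {"id": 2, "price": 5.0,  "qty": 4},
    {"id": 3, "price": 7.5,  "qty": 1},
]

# Column-oriented layout of the same table: each column stored contiguously.
columns = {
    "id":    [1, 2, 3],
    "price": [10.0, 5.0, 7.5],
    "qty":   [2, 4, 1],
}

# Aggregation: the columnar version reads only the "price" values,
# while the row version must visit every whole record.
row_total = sum(r["price"] for r in rows)
col_total = sum(columns["price"])
```

Both give the same answer; the difference is how much data each layout must touch, which is what drives the aggregation-speed and compression advantages listed above.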
Hardware versus software?
Go to slide 36 and 37 and discuss
Name the four tools and techniques?
Processing capability
- often provided by several interconnected nodes whose processors can run tasks simultaneously (MULTITHREADING)
Storage of data
Memory
- holds the data currently being processed on the node
Network
- Communication infrastructure between the nodes
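The multithreading mentioned under processing capability can be sketched in a few lines: several threads each sum one chunk of the input and merge their partial results into a shared total. This is an illustrative toy, not a cluster framework; the lock protects the shared variable from concurrent updates:

```python
import threading

def threaded_sum(numbers, n_threads=4):
    """Sum a list by splitting it into chunks processed by separate threads."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)       # each thread computes its own partial sum
        with lock:                 # the lock serializes updates to the shared total
            total += partial

    size = max(1, len(numbers) // n_threads)
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                   # wait for all threads to finish
    return total
```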
What types of architectural clusters exist? And what are the two OVERALL types?
Slide 42 and 43:
OVERALL: centralized and decentralized
What does the general architecture distinguish between? and what are their roles?
Management of computing resources
- oversees the pool of processing nodes, assigns tasks and monitors activity
Management of data/storage
- oversees the data storage pool and distributes datasets across the collection of storage resources
What are the three important layers of Hadoop?
HDFS (storage), YARN (resource management), and MapReduce (processing)
What are the main function of HDFS?
HDFS distributes large files in blocks across the nodes of a cluster and replicates each block on several nodes, providing fault-tolerant storage on commodity hardware
What are the four advantages of using HDFS?
1) decreasing the cost of specialty large-scale storage systems
2) providing the ability to rely on commodity components
3) enabling the ability to deploy using cloud-based services
4) reducing system management costs
What is MapReduce?
A programming model for processing large datasets in parallel across a cluster of nodes
What are the two steps in MapReduce?
Map: Describes the computation analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs
Reduce: the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results
Example: count the number of occurrences of a word in a corpus:
key: is the word
value: is the number of times the word is counted
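The word-count example above can be sketched in-process: the map phase emits an intermediate (word, 1) pair per word, and the reduce phase combines all pairs that share the same key. This is a single-machine illustration of the model, not Hadoop's distributed implementation:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (key, value) pair for every word,
    # where the key is the word and the value is a count of 1.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: combine all values that share the same intermediate key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

corpus = ["big data is big", "data is data"]
intermediate = [pair for doc in corpus for pair in map_phase(doc)]
result = reduce_phase(intermediate)
```

In a real cluster the map calls run in parallel on different nodes, and pairs with the same key are shuffled to the same reducer before the reduce step.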
What is parallelization?
the act of designing a computer program or system to process data in parallel. Normally, computer programs compute data serially: they solve one problem, and then the next, then the next. If a computer program or system is parallelized, it breaks a problem down into smaller pieces that can each independently be solved at the same time by discrete computing resources
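The definition above can be sketched with Python's standard-library `multiprocessing` module: a trivially splittable problem (squaring many numbers) is broken into pieces that worker processes solve at the same time. The function names are illustrative:

```python
from multiprocessing import Pool

def square(x):
    # One independent piece of the overall problem.
    return x * x

def parallel_squares(numbers, workers=4):
    # Distribute the pieces across a pool of worker processes;
    # Pool.map collects the results back in order.
    with Pool(workers) as pool:
        return pool.map(square, numbers)

if __name__ == "__main__":
    print(parallel_squares([1, 2, 3, 4]))
```

Squaring numbers is "embarrassingly parallel" because no piece depends on another; problems with dependencies between pieces parallelize less cleanly.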
What are the four use cases for big data?
Counting; document indexing, filtering, aggregation
Scanning; sorting, text analysis, pattern recognition
Modeling; analysis and prediction
Storing; rapid access to stored large datasets
What is data mining?
The art and science of discovering knowledge, insights and patterns in data
It helps recognize the hidden value in data
Describe the typical process of data mining?
Data input –> data consolidation –> data cleaning –> data transformation –> data reduction –> well-formed data
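The consolidation, cleaning, and transformation steps of the pipeline above can be sketched on a toy dataset (the field names and values are hypothetical):

```python
# Raw input: strings, a duplicate, and a record with a missing value.
raw = [
    {"age": "34", "income": "55000"},
    {"age": "34", "income": "55000"},   # duplicate to be removed
    {"age": None, "income": "48000"},   # missing value to be dropped
    {"age": "29", "income": "61000"},
]

# Data consolidation + cleaning: drop duplicates and incomplete records.
seen, cleaned = set(), []
for rec in raw:
    key = (rec["age"], rec["income"])
    if rec["age"] is not None and key not in seen:
        seen.add(key)
        cleaned.append(rec)

# Data transformation: convert string fields into numeric types,
# yielding well-formed data ready for analysis.
well_formed = [{"age": int(r["age"]), "income": int(r["income"])} for r in cleaned]
```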
In terms of data mining, what does ETL stand for?
Extract, transform, load
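A minimal ETL sketch using only the standard library, with a hypothetical in-memory CSV source and target list standing in for a real database and warehouse:

```python
import csv
import io

# Extract: read records from a source (here, an in-memory CSV string).
source = "id,amount\n1,10.5\n2,3.25\n"
rows = list(csv.DictReader(io.StringIO(source)))

# Transform: parse string fields into typed values.
transformed = [
    {"id": int(r["id"]), "amount": float(r["amount"])}
    for r in rows
]

# Load: write the transformed records into the target store
# (a plain list standing in for a data warehouse).
warehouse = []
warehouse.extend(transformed)
```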