What is ‘Big Data’?
‘Big Data’ is a catch-all term for data that won’t fit the usual containers.
What are the three defining features of Big Data known as ‘the three Vs’?
Volume, Velocity and Variety.
What does ‘Volume’ refer to in Big Data?
There is too much data for it all to fit on a conventional hard drive or even a server.
What does ‘Velocity’ refer to in Big Data?
Data on the servers is created and modified rapidly, requiring responses within milliseconds.
What does ‘Variety’ refer to in Big Data?
The data consists of many different types, including binary and multimedia files.
Why is the unstructured nature of Big Data challenging?
It makes it difficult to analyse the data using conventional databases that require a structured format.
What is a primary requirement for processing Big Data stored over multiple servers?
The processing must be distributed across more than one machine.
What programming paradigm is particularly suited for distributed processing of Big Data?
Functional programming.
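What characteristics of functional programming help when writing distributed code?
Immutable data structures, statelessness (no side effects) and higher-order functions such as map, filter and reduce.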
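A minimal Python sketch of why a functional style suits distributed processing. The word lists and the function names `count_words` and `merge_counts` are made up for illustration; each chunk stands in for the part of a dataset held on one server. Because both functions are pure and never modify their inputs, each chunk could in principle be processed on a different machine and the partial results combined afterwards.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each inner list represents the data held on one server.
chunks = [
    ["big", "data", "big"],
    ["data", "velocity"],
    ["big", "variety"],
]


def count_words(chunk):
    """Pure function: builds a new Counter and never modifies its input."""
    return Counter(chunk)


def merge_counts(left, right):
    """Pure function: returns a new Counter instead of updating an argument."""
    return left + right


# map: every chunk is counted independently of the others.
partial_counts = list(map(count_words, chunks))

# reduce: the partial results are combined into a single total.
total = reduce(merge_counts, partial_counts, Counter())

print(total["big"])   # 3
print(total["data"])  # 2
```

Here everything runs on one machine; the point is only that no step depends on shared, changing state, which is what makes the work easy to split up.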
What is the fact-based model for representing data?
A way of storing each piece of information as a fact that is immutable and includes a timestamp.
What does it mean that facts in the fact-based model are immutable?
Facts never change once created and cannot be overwritten.
How does the fact-based model reduce data loss?
It prevents accidental data loss due to human error by not allowing overwriting of facts.
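The fact-based model can be sketched in Python as an append-only list of immutable records. This is an illustration under assumptions, not a real Big Data store: the `Fact` class, the `record` and `current_value` helpers, and the integer timestamps are all hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)  # frozen=True makes every Fact immutable
class Fact:
    entity: str
    attribute: str
    value: str
    timestamp: int  # when the fact was recorded (hypothetical integer clock)


# Append-only store: facts are added, never edited or removed.
facts = []


def record(entity, attribute, value, timestamp):
    """Add a new fact; earlier facts are left untouched."""
    facts.append(Fact(entity, attribute, value, timestamp))


def current_value(entity, attribute):
    """Return the value from the most recent matching fact, or None."""
    matching = [f for f in facts if f.entity == entity and f.attribute == attribute]
    if not matching:
        return None
    return max(matching, key=lambda f: f.timestamp).value


record("alice", "city", "Leeds", 1)
record("alice", "city", "York", 2)  # a move is a new fact, not an overwrite

print(current_value("alice", "city"))  # York
print(len(facts))                      # 2, the older fact is still stored

try:
    facts[0].value = "Hull"  # attempting to change an existing fact
except FrozenInstanceError:
    print("facts cannot be overwritten")
```

The timestamp is what lets the newest fact be picked out while the full history is kept.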
What is a graph schema used for in Big Data?
To graphically represent the structure of a dataset using nodes and edges.
In a graph schema, what do nodes represent?
Entities that can contain properties.
What do edges represent in a graph schema?
Relationships between entities, labelled with a brief description.
Are timestamps commonly included in graph schema diagrams?
No, timestamps are rarely included; it is assumed that each node contains the most recent information.
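As a rough illustration, a graph schema can be written down in Python as a dictionary of nodes (entities with properties) and a list of labelled edges (relationships). The people, club, properties and the `related` helper below are invented for the example.

```python
# Nodes: each entity maps to a dictionary of its properties.
nodes = {
    "Alice": {"type": "Person", "age": 17},
    "Bob": {"type": "Person", "age": 18},
    "Chess Club": {"type": "Club", "room": "B12"},
}

# Edges: (from entity, relationship label, to entity).
edges = [
    ("Alice", "friend of", "Bob"),
    ("Alice", "member of", "Chess Club"),
    ("Bob", "member of", "Chess Club"),
]


def related(entity, label):
    """Return the entities reached from `entity` along edges with this label."""
    return [dst for src, lbl, dst in edges if src == entity and lbl == label]


print(related("Alice", "member of"))  # ['Chess Club']
print(nodes["Chess Club"]["room"])    # B12
```

No timestamps appear here, matching the convention that each node is assumed to hold the most recent information.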
Fill in the blank: The processing associated with using Big Data must be split across multiple _______.
machines
True or False: Conventional databases are well-suited for storing Big Data.
False
What is an alternative method to represent properties in a graph schema?
Listing an entity’s properties inside rectangles joined to entities with a dashed line.
Examples of Big Data volume
Hundreds of terabytes
Large medical datasets for diagnosis
Gene sequencing
Predicting disease outbreaks
Results of large-scale scientific experiments
Examples of Big Data variety
Cannot be represented in a table or by a relational database
Email messages
Videos
Images
Website content
Facial recognition
Examples of Big Data velocity
Thousands of items to process per second
Data must be processed as it is received; it cannot be batched and processed later
Card payment fraud detection
Recommendation systems
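To illustrate processing data as it is received rather than in a later batch, here is a small Python sketch using a generator. The payment records, the `flag_suspicious` function and the fixed limit of 1000 are all assumptions made up for the example; real fraud detection is far more involved.

```python
def flag_suspicious(payments, limit=1000):
    """Yield each payment over the limit as soon as it is seen.

    The rule (amount > limit) is a hypothetical stand-in for a real check.
    """
    for payment in payments:
        if payment["amount"] > limit:
            yield payment


# Hypothetical stream of incoming card payments.
incoming = [
    {"card": "A", "amount": 25},
    {"card": "B", "amount": 4200},
    {"card": "C", "amount": 60},
    {"card": "D", "amount": 1500},
]

# Each payment is examined one at a time, in arrival order,
# without waiting for the whole stream to be collected first.
flagged = [payment["card"] for payment in flag_suspicious(incoming)]

print(flagged)  # ['B', 'D']
```

Because the generator handles one item at a time, it would work the same way on an endless stream, where there is never a complete batch to wait for.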
Explain some of the challenges that Big Data brings with it and the approaches that can be taken to overcome these, in relation to programming and hardware.
Challenges:
* Data cannot be stored on one server.
* Not possible to process data quickly enough with one computer.
* Data cannot be represented in a relational database.
* Some unstructured data are difficult to analyse.
How to overcome these:
* Distributed database systems, with storage and processing distributed across multiple servers.
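One common way to spread data across several servers is to choose a server from a hash of each record's key. The Python sketch below simulates this on a single machine, as an assumption-laden illustration rather than how any particular database works: the three "servers" are plain dictionaries, and `SERVER_COUNT`, `server_for`, `store` and `lookup` are hypothetical names.

```python
import zlib

SERVER_COUNT = 3  # hypothetical number of servers

# Each dictionary stands in for the storage on one server.
servers = [dict() for _ in range(SERVER_COUNT)]


def server_for(key):
    """Pick a server index from a hash of the key.

    zlib.crc32 is used because it gives the same result on every run.
    """
    return zlib.crc32(key.encode("utf-8")) % SERVER_COUNT


def store(key, value):
    """Save the value on whichever server the key hashes to."""
    servers[server_for(key)][key] = value


def lookup(key):
    """Fetch the value by asking only the server that should hold it."""
    return servers[server_for(key)].get(key)


for name in ["alice", "bob", "carol", "dave", "erin"]:
    store(name, f"profile of {name}")

print(lookup("carol"))                       # profile of carol
print(sum(len(server) for server in servers))  # 5
```

No single server holds the whole dataset, and a lookup only needs to contact one of them.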