How can we classify data?
structured, semi-structured, or unstructured
What is structured data?
Structured data is data that adheres to a fixed schema, so all of the data has the same fields or properties.
What is semi-structured data?
Semi-structured data is information that has some structure, but which allows for some variation between entity instances.
What are some common formats of semi-structured data?
JSON
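As a minimal sketch of the "variation between entity instances" idea, the following parses two JSON records of the same kind that carry different fields. The customer records and their field names are hypothetical, chosen only for illustration.

```python
import json

# Two hypothetical customer records: same kind of entity, different fields.
raw = """
[
  {"id": 1, "name": "Ana", "email": "ana@example.com"},
  {"id": 2, "name": "Ben", "phones": ["555-0100", "555-0101"]}
]
"""

customers = json.loads(raw)

# Fields present in every record versus fields that vary between records.
common = set(customers[0]).intersection(*customers[1:])
all_fields = set().union(*customers)
optional = all_fields - common

print(sorted(common))
print(sorted(optional))
```

A structured (fixed-schema) store would require every record to carry the same fields; here only `id` and `name` are shared.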
What is unstructured data?
Not all data is structured or even semi-structured. For example, documents, images, audio and video data, and binary files might not have a specific structure. This kind of data is referred to as unstructured data.
What are some examples of unstructured data?
Documents, images, audio and video data, and binary files
What two categories of data store are there?
File stores and databases
What should one consider when choosing between a file store and a database?
What are delimited text files?
Data is often stored in plain text format with specific field delimiters and row terminators. The most common format for delimited data is comma-separated values (CSV) in which fields are separated by commas, and rows are terminated by a carriage return / new line.
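A short sketch of reading and writing delimited text with Python's standard `csv` module. The sample data is hypothetical; it uses a comma as the field delimiter and carriage return / new line as the row terminator, and includes one quoted field that itself contains a comma.

```python
import csv
import io

# Hypothetical CSV text: comma field delimiter, CRLF row terminator.
text = 'id,name,city\r\n1,Ana,Lisbon\r\n2,Ben,"Oslo, Norway"\r\n'

# newline="" leaves row terminators untouched so the csv module handles them.
reader = csv.DictReader(io.StringIO(text, newline=""))
rows = list(reader)

# Write the same rows back out; the writer's default terminator is also CRLF.
out = io.StringIO(newline="")
writer = csv.DictWriter(out, fieldnames=["id", "name", "city"])
writer.writeheader()
writer.writerows(rows)

print(rows[1]["city"])
```

Every value comes back as a string, because a delimited text file carries no type information of its own.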
What types of files can one store in a file store?
What are some popular optimized file formats?
Avro, ORC, and Parquet
What is Avro?
Avro is a row-based format created by Apache. An Avro file begins with a header, stored as JSON, that describes the structure of the records in the file. The records themselves are stored as binary data. An application uses the information in the header to parse the binary data and extract the fields it contains. Avro is a good format for compressing data and minimizing storage and network bandwidth requirements.
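The idea of a JSON header that tells a reader how to parse binary data can be illustrated with a toy format built on the standard library. This is not the real Avro encoding: the schema layout, the `fmt` codes, and the `encode` and `decode` helpers are all assumptions made for this sketch.

```python
import json
import struct

# Hypothetical schema: field names plus struct format codes (not Avro types).
schema = {"fields": [{"name": "id", "fmt": "i"}, {"name": "score", "fmt": "d"}]}
records = [{"id": 1, "score": 9.5}, {"id": 2, "score": 7.25}]

def encode(schema, records):
    """Write a JSON header followed by row-by-row binary records."""
    header = json.dumps(schema).encode("utf-8")
    fmt = "<" + "".join(f["fmt"] for f in schema["fields"])
    names = [f["name"] for f in schema["fields"]]
    body = b"".join(struct.pack(fmt, *(r[n] for n in names)) for r in records)
    # A 4-byte length prefix tells the reader where the JSON header ends.
    return struct.pack("<I", len(header)) + header + body

def decode(blob):
    """Read the JSON header, then use it to parse the binary records."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    schema = json.loads(blob[4:4 + header_len].decode("utf-8"))
    fmt = "<" + "".join(f["fmt"] for f in schema["fields"])
    names = [f["name"] for f in schema["fields"]]
    body = blob[4 + header_len:]
    return [dict(zip(names, values)) for values in struct.iter_unpack(fmt, body)]

blob = encode(schema, records)
decoded = decode(blob)
print(decoded)
```

Because field names live only in the header, each binary record costs 12 bytes here, which hints at why this style of format is compact.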
What is ORC?
ORC (Optimized Row Columnar format) organizes data into columns rather than rows. It was developed by Hortonworks for optimizing read and write operations in Apache Hive (Hive is a data warehouse system that supports fast data summarization and querying over large datasets). An ORC file contains stripes of data. Each stripe holds a group of rows, with the values stored column by column. A stripe contains index data, the row data, and a stripe footer. Statistical information (count, sum, max, min, and so on) for each column is recorded in the file so that readers can skip data that cannot match a query.
What is Parquet?
Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.
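The row-group idea can be sketched in plain Python. This is a toy in-memory layout, not the Parquet file format: the `to_row_groups` and `read_column` helpers, the sample rows, and the use of per-column min/max as metadata are assumptions made for illustration.

```python
rows = [
    {"id": 1, "city": "Lisbon", "amount": 10.0},
    {"id": 2, "city": "Oslo", "amount": 25.0},
    {"id": 3, "city": "Lima", "amount": 7.5},
    {"id": 4, "city": "Oslo", "amount": 40.0},
]

def to_row_groups(rows, group_size):
    """Split rows into groups; inside each group keep one list per column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        # Per-column min/max act as the metadata a reader can consult.
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups

def read_column(groups, column, lo, hi, key="id"):
    """Return values of one column for rows whose key lies in [lo, hi]."""
    result = []
    for group in groups:
        key_min, key_max = group["stats"][key]
        if key_max < lo or key_min > hi:
            continue  # metadata shows this group cannot hold matching rows
        for k, v in zip(group["columns"][key], group["columns"][column]):
            if lo <= k <= hi:
                result.append(v)
    return result

groups = to_row_groups(rows, group_size=2)
amounts = read_column(groups, "amount", lo=3, hi=4)
print(amounts)
```

The read touches only the `id` and `amount` lists of the second group; the first group is skipped on metadata alone, and the `city` column is never read.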
What optimized file format should we use to compress data and minimize storage and network bandwidth requirements?
Avro
What optimized file format should we use to optimize read and write operations in Apache Hive?
ORC
What optimized file format should we use that specializes in storing and processing nested data types efficiently?
Parquet
What is normalization of data?
The elimination of duplicate data values
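A minimal sketch of the idea, assuming a hypothetical flat list of orders in which customer details are repeated on every row. Splitting it into a customer table and an order table stores each customer once, and orders refer to the customer by ID.

```python
# Hypothetical denormalized data: customer details repeat on every order.
orders_flat = [
    {"order_id": 1, "customer": "Ana", "customer_city": "Lisbon", "item": "Pen"},
    {"order_id": 2, "customer": "Ana", "customer_city": "Lisbon", "item": "Ink"},
    {"order_id": 3, "customer": "Ben", "customer_city": "Oslo", "item": "Pad"},
]

customers = {}  # (name, city) -> customer_id
orders = []
for row in orders_flat:
    key = (row["customer"], row["customer_city"])
    # Reuse the existing ID if this customer was seen before.
    customer_id = customers.setdefault(key, len(customers) + 1)
    orders.append(
        {"order_id": row["order_id"], "customer_id": customer_id, "item": row["item"]}
    )

customer_table = [
    {"customer_id": cid, "name": name, "city": city}
    for (name, city), cid in customers.items()
]

print(customer_table)
print(orders)
```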
What types of non-relational databases are there?
Key-value, document, column family, and graph databases
What is the key-value type of non-relational database?
A key-value database stores records that each consist of a unique key and an associated value, which can be in any format.
What is the document type of non-relational database?
A document database is a specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query).
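The difference between the key-value and document models can be sketched with plain dictionaries. This is an in-memory illustration only, not any real database API; the keys, values, and the `find_by_field` helper are hypothetical.

```python
import json

# Key-value store: values are opaque to the store; lookup is by key only.
kv_store = {}
kv_store["user:1"] = b"\x00\x01binary-or-any-format"
kv_store["user:2"] = "plain text value"

# Document store: each value is a JSON document the store can look inside.
doc_store = {
    "user:1": json.dumps({"name": "Ana", "city": "Lisbon"}),
    "user:2": json.dumps({"name": "Ben", "city": "Oslo", "tags": ["vip"]}),
}

def find_by_field(store, field, value):
    """Return keys of documents whose parsed JSON has field == value."""
    return [k for k, raw in store.items() if json.loads(raw).get(field) == value]

oslo_keys = find_by_field(doc_store, "city", "Oslo")
print(oslo_keys)
```

A real document database would index fields rather than parse every document on each query; the scan here only shows that the value's contents are queryable.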
What is the column family type of non-relational database?
A column family database stores tabular data comprising rows and columns, but the columns can be divided into groups known as column families. Each column family holds a set of columns that are logically related.
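One way to picture column families is as a nested mapping from row key to family to column. The table contents, the family names `identity` and `contact`, and the `read_family` helper are hypothetical and exist only for this sketch.

```python
# row key -> column family -> column -> value
table = {
    "cust-001": {
        "identity": {"first_name": "Ana", "last_name": "Silva"},
        "contact": {"email": "ana@example.com", "phone": "555-0100"},
    },
    "cust-002": {
        "identity": {"first_name": "Ben", "last_name": "Berg"},
        "contact": {"email": "ben@example.com"},  # no phone column for this row
    },
}

def read_family(table, row_key, family):
    """Fetch one column family for a row without touching the others."""
    return table[row_key].get(family, {})

emails = {key: read_family(table, key, "contact").get("email") for key in table}
print(emails)
```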
What is the graph type of non-relational database?
A graph database stores entities as nodes, with links that define the relationships between them.
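Nodes and links can be sketched with a dictionary of nodes and a list of labeled edges. The node names, relationship labels, and the `neighbors` and `reachable` helpers are hypothetical; a real graph database would expose its own query language.

```python
from collections import deque

# Entities are nodes; each edge is (source, relationship, destination).
nodes = {
    "ana": {"type": "employee"},
    "ben": {"type": "employee"},
    "sales": {"type": "department"},
}
edges = [("ana", "reports_to", "ben"), ("ben", "works_in", "sales")]

def neighbors(node):
    """Return (relationship, destination) pairs for links leaving a node."""
    return [(rel, dst) for src, rel, dst in edges if src == node]

def reachable(start):
    """Breadth-first traversal following links outward from a start node."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for _, dst in neighbors(current):
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen

print(neighbors("ana"))
print(sorted(reachable("ana")))
```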
What is Online Transactional Processing (OLTP)?
OLTP solutions rely on a database system in which data storage is optimized for both read and write operations. This supports transactional workloads in which data records are created, retrieved, updated, and deleted (often referred to as CRUD operations).
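The CRUD operations can be sketched with Python's built-in `sqlite3` module and an in-memory database. The `accounts` table and its rows are hypothetical; the point is only to show create, retrieve, update, and delete running inside transactions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")

# Create: "with conn" commits on success and rolls back on an exception.
with conn:
    conn.execute("INSERT INTO accounts (id, owner, balance) VALUES (?, ?, ?)", (1, "Ana", 100.0))
    conn.execute("INSERT INTO accounts (id, owner, balance) VALUES (?, ?, ?)", (2, "Ben", 50.0))

# Retrieve
balance = conn.execute("SELECT balance FROM accounts WHERE id = ?", (1,)).fetchone()[0]

# Update: both changes succeed or fail together as one transaction.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")

# Delete
with conn:
    conn.execute("DELETE FROM accounts WHERE id = ?", (2,))

remaining = conn.execute("SELECT id, owner, balance FROM accounts").fetchall()
print(remaining)
```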