Normalisation
Normalisation is the process of structuring the data in a database to reduce redundancy and improve data integrity.
First Normal Form
Requirements: There are no repeating groups of data. The data is atomic: each field contains one value only.
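As an illustration (the student/course data here is made up, not from the notes), a minimal Python sketch of bringing a repeating group into 1NF by giving each value its own row:

```python
# Unnormalised: the "courses" field holds a repeating group (not atomic).
unnormalised = [
    {"student_id": 1, "name": "Amy", "courses": "Maths, Physics"},
    {"student_id": 2, "name": "Ben", "courses": "Art"},
]

# 1NF: one atomic value per field, so each course gets its own row.
first_nf = [
    {"student_id": row["student_id"], "name": row["name"], "course": course.strip()}
    for row in unnormalised
    for course in row["courses"].split(",")
]

for row in first_nf:
    print(row)
```

After the split, every field holds exactly one value, so individual courses can be searched and sorted like any other column.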
Second Normal Form
Requirements: Meets the requirements of 1NF. There are no partial dependencies: every non-key attribute depends on the entire primary key, not just part of it.
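A small sketch of removing a partial dependency (the enrolment table and its composite key are hypothetical): student_name depends only on student_id, which is just part of the (student_id, course_id) key, so it is moved to its own table.

```python
# Hypothetical table with composite primary key (student_id, course_id).
# student_name depends on student_id alone — a partial dependency.
enrolments = [
    {"student_id": 1, "course_id": "C1", "student_name": "Amy", "grade": "A"},
    {"student_id": 1, "course_id": "C2", "student_name": "Amy", "grade": "B"},
    {"student_id": 2, "course_id": "C1", "student_name": "Ben", "grade": "C"},
]

# 2NF: split into two tables so each name is stored once, not per enrolment.
students = {row["student_id"]: row["student_name"] for row in enrolments}
grades = [
    {"student_id": r["student_id"], "course_id": r["course_id"], "grade": r["grade"]}
    for r in enrolments
]

print(students)
```

The redundancy is gone: changing a student's name now means updating one row, not every enrolment.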
Third Normal Form
Requirements: Meets the requirements of 2NF. There are no transitive dependencies: non-key attributes depend only on the primary key, not on other non-key attributes.
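A sketch of removing a transitive dependency (the employee/department data is invented for illustration): dept_name depends on dept_id rather than directly on the primary key emp_id, so it moves to a separate departments table.

```python
# dept_name depends on dept_id, not directly on the key emp_id —
# a transitive dependency (emp_id -> dept_id -> dept_name).
employees = [
    {"emp_id": 1, "name": "Amy", "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 2, "name": "Ben", "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 3, "name": "Cal", "dept_id": 20, "dept_name": "HR"},
]

# 3NF: each department name is stored once, keyed by dept_id.
departments = {row["dept_id"]: row["dept_name"] for row in employees}
employees_3nf = [
    {"emp_id": r["emp_id"], "name": r["name"], "dept_id": r["dept_id"]}
    for r in employees
]

print(departments)
```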
SQL
Structured Query Language (SQL) is the language used to query databases. A query language allows users to retrieve and manipulate data in a database. The SELECT statement is used to output or display data.
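A sketch of that statement in action, using Python's built-in sqlite3 module (the Student table and its rows are made up for illustration):

```python
import sqlite3

# An in-memory database with a small, invented Student table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT, form TEXT)")
conn.executemany(
    "INSERT INTO Student (id, name, form) VALUES (?, ?, ?)",
    [(1, "Amy", "10A"), (2, "Ben", "10B")],
)

# SELECT retrieves (outputs/displays) data; WHERE filters which rows come back.
rows = conn.execute("SELECT name FROM Student WHERE form = '10A'").fetchall()
print(rows)  # [('Amy',)]
```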
Big Data
Extremely large and complex datasets that traditional databases are unable to store and process within acceptable time frames.
The 5 V’s
Volume: Huge amounts of data.
Example: Facebook generates 4 petabytes/day.
Velocity: Data is created at high speed.
Example: Social media posts in real-time.
Variety: Different types of data – structured, unstructured, semi-structured.
Example: Text, images, videos.
Veracity: Data quality and accuracy.
Example: Inconsistent or missing values in surveys.
Value: Useful information extracted from data.
Example: Targeted advertising from customer data.
Data Warehousing
A central repository where data from multiple sources is stored in an organised way for reporting and analysis.
Why Data Warehousing is needed
It combines data from many different sources into one consistent store, keeps historical data for trend analysis, and allows complex queries to run without slowing down day-to-day operational databases.
Data Mining
The retrieval and analysis of large sets of data in data warehouses to identify trends and patterns, e.g. identifying marketing opportunities.
Structured Data
Data that is organised in a fixed format, usually in rows and columns.
Characteristics: Stored in tables. Easy to search, sort, and analyse.
E.g. Student records, Bank transactions.
Unstructured Data
Data that does not have a predefined format or organised structure.
Characteristics: Not stored in a traditional table format; difficult to analyse using simple tools and requires advanced tools.
E.g. emails, social media posts, images, videos and PDF files.
Data Mining Techniques
Classification: Sorting data into categories.
Clustering: Grouping similar data together.
Association: Finding relationships between data items.
Examples: Market basket analysis: Customers who buy bread often buy butter.
Detecting fraudulent credit card transactions.
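The market-basket example above can be sketched by counting how often pairs of items appear in the same basket (the transactions here are invented):

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets for a market-basket (association) sketch.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # bread & butter appear together most often
```

The strongest association found (bread with butter) is exactly the kind of pattern a shop could use for product placement or promotions.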
Predictive Analysis
A subset of data mining used to make predictions about future events based on historical behaviour. E.g. weather/economic forecasting.
Techniques used for Predictive Analysis
Predictive Score - Assigns a probability for the likelihood that something, such as a customer, will behave a certain way. It is used to predict behaviour and assess risk over a wide variety of disciplines.
Statistical modelling - Using statistical techniques to build models to predict what might happen in the future.
Machine learning algorithms - AI techniques that learn patterns from historical data in order to make predictions.
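As a toy sketch of the statistical-modelling technique above (the monthly figures are made up), a least-squares straight line is fitted to past values and used to predict the next one:

```python
# Made-up historical values (e.g. monthly sales).
history = [10.0, 12.0, 14.0, 16.0]
xs = list(range(len(history)))

# Fit y = a*x + b by ordinary least squares.
n = len(history)
mean_x = sum(xs) / n
mean_y = sum(history) / n
slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
slope_den = sum((x - mean_x) ** 2 for x in xs)
a = slope_num / slope_den
b = mean_y - a * mean_x

# Predict the next period from the fitted trend.
forecast = a * n + b
print(forecast)  # 18.0 for this perfectly linear history
```

Real predictive models are far richer than a straight line, but the principle is the same: build a model from historical behaviour, then extrapolate.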
Distributed Systems/Processing
Distributed Systems/Processing is a technique for carrying out a large computing task by sharing the processing between computers in different locations.
How Distributed Systems/Processing Works
Each computer runs its own programs and has its own store of data, but will share data with other computers.
Computers in various locations will be linked in a wide-area network.
Each computer will have the software necessary to carry out database operations on records and to display any associated information/images.
Records will be held locally, but additional records may be held centrally when needed.
Staff/users may access and update information at any of the locations by means of the network.
The overall system may provide summary management data.
The system will be able to inform users of updates and any actions needed.
Advantages of Distributed Systems/Processing
Faster processing - Tasks are split into smaller parts and processed at the same time by multiple machines to complete the work faster.
Fault tolerance - if one machine fails, others continue.
Distribution of Data
When data is distributed, the database is physically stored in multiple locations. Each site holds part or all of the data.
Distribution of Processing
When processing is distributed, the computations and database operations are executed across several computers rather than on a single machine. This means: Each node can process requests. Workload is shared among multiple machines. Tasks can run in parallel.
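The chunk-and-combine idea can be mimicked on a single machine with Python's concurrent.futures; in a real distributed system each chunk would be sent to a different computer rather than a different worker thread:

```python
from concurrent.futures import ThreadPoolExecutor

# One large task (summing numbers) split into smaller parts.
data = list(range(1_000))
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

# Each worker processes its own chunk in parallel;
# partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

total = sum(partial_sums)
print(total)  # 499500, the same answer as summing the data in one go
```

The workload is shared, the pieces run in parallel, and if one worker's chunk failed it could be reassigned, which is the essence of the fault tolerance mentioned above.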