Data Fellowship Content Flashcards

(51 cards)

1
Q

The components of the data analytics life-cycle

A
  1. Plan
  2. Data Preparation
  3. Analysis
  4. Modelling
  5. Refine & Compare
  6. Communicate & Implement
2
Q

GDPR

A

General Data Protection Regulation.

A legal framework that sets guidelines for the collection and processing of personal information from individuals who live in and outside of the European Union (EU).

Its aim is to give consumers control over their own personal data.

3
Q

RDBMS (Relational Database Management System)

A

A type of database that stores and provides access to data points that are related to one another.

Based on the relational model, an intuitive, straightforward way of representing data in tables. Each row in the table is a record with a unique ID called the key. The columns of the table hold attributes of the data, and each record usually has a value for each attribute, making it easy to establish the relationships among data points.

4
Q

Personally Identifiable Information (PII)

A

Information that, when used alone or with other relevant data, can identify an individual.

May contain direct identifiers (e.g., passport information) that can identify a person uniquely, or quasi-identifiers (e.g., race) that can be combined with other quasi-identifiers (e.g., date of birth) to successfully recognise an individual.

5
Q

Modelling

A

The stage where models such as algorithms and predictive analysis are built.

6
Q

Refine & Compare

A

The stage that reflects on the proposed models and solutions, considers alternatives, and iterates current work into one optimised solution.

7
Q

Lawfulness, fairness and transparency

A

One of the principles of GDPR.

Gathering data and processing it with a valid legal basis, for example getting user consent to process their data in a certain way.

Your processing of data is in the best interest of the person the data is about, and the scope of the processing can be reasonably expected by the person.

You clearly communicate what, how, and why you process data to those whose data you process. This should be done in a way that enables data subjects to easily understand what you are doing with their data.

8
Q

Purpose Limitation

A

One of the principles of GDPR.

Data must be collected for specified, explicit, and legitimate purposes.

9
Q

Data minimisation

A

One of the principles of GDPR.

Only collect data that is adequate, relevant, and limited to what is necessary.

10
Q

Accuracy

A

One of the principles of GDPR.

Having data records that represent the current truth. Records must be kept up to date and correct, and the data processor must take reasonable measures to ensure that.

11
Q

Storage Limitation

A

One of the principles of GDPR.

If personal data is no longer required it must be deleted. Exceptions when data can be kept for longer include data for scientific purposes or in the interest of the public (e.g. criminal records).

12
Q

Accountability

A

One of the principles of GDPR.

Taking responsibility for your data processing. The data controller and/or processor must be responsible for proper processing of personal data and compliance with the rules of GDPR.

13
Q

Integrity and confidentiality

A

One of the principles of GDPR.

Personal data must be processed or stored in a manner that ensures its security. This includes protection against unauthorized or unlawful processing and accidental loss, destruction or damage.

It must not be made available or disclosed to unauthorized individuals, entities or processes.

14
Q

Administrative Data

A

Information created when people interact with services, collated by organisations.

It is used to help with the operational services of an organisation.

Examples from Multiverse may include attendance data and OTJ % (off-the-job) hours data, which are collected and used to track apprentice progress and to evidence compliance processes.

15
Q

Structured Data

A

Data that can be organized and formatted in a way that is easy for computers to read, organize, and understand, and that can be inserted into a database in a seamless fashion.

16
Q

Unstructured data

A

Data that cannot be stored in a traditional relational database or RDBMS. Text and multimedia are two common types of unstructured content. Many business documents are unstructured, as are email messages, videos, photos, webpages, and audio files.

17
Q

Inner join

A

The most common type of join; includes rows in the query result only when the joined field matches records in both tables.

18
Q

Left Join

A

The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.

19
Q

Right Join

A

The RIGHT JOIN keyword returns all rows from the right table (table2), with the matching rows in the left table (table1). The result is NULL in the left side when there is no match.

20
Q

Cartesian Join

A

Links table data so each record in the first table is matched with each individual record in the second table. Also called a Cartesian product or cross join.
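The join types above (inner, left, right, and Cartesian/cross) can be sketched with pandas, whose merge function mirrors SQL join semantics; the tables and column names below are invented for illustration:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cara"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [70, 85, 90]})

inner = left.merge(right, on="id", how="inner")    # only ids in both tables: 2, 3
left_j = left.merge(right, on="id", how="left")    # all left rows; score is NaN for id 1
right_j = left.merge(right, on="id", how="right")  # all right rows; name is NaN for id 4
cross = left.merge(right, how="cross")             # every pairing: 3 x 3 = 9 rows
```

An unmatched side comes back as NaN, which is the pandas analogue of the NULLs described in the left- and right-join cards.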

21
Q

Schema

A

Defines how data is organised within a relational database, inclusive of logical constraints such as table names, fields, data types, and the relationships between these entities.

22
Q

Descriptive Analytics

A

The use of data to understand past and current business performance and make informed decisions.

23
Q

Predictive Analytics

A

This type of analytics involves analyzing historical data and using statistical and machine-learning techniques to make predictions or forecasts about future events or outcomes.

It identifies patterns and relationships in data to generate probabilistic predictions about what is likely to happen.

Predictive analytics answers questions like “What is likely to happen in the future?” or “What will be the impact of a specific action?” It helps organizations anticipate future trends, identify potential risks or opportunities, and make proactive decisions.

Example in the workplace: Forecasting customer demand for a product based on historical sales data, market trends, and external factors like seasonality or economic indicators.

24
Q

Prescriptive Analytics

A

This type of analytics goes beyond descriptive analytics by providing recommendations and suggestions on what actions to take based on the analysis of data and various possible scenarios.

It leverages advanced techniques, such as optimization algorithms, machine learning, and simulation models, to generate actionable insights. Prescriptive analytics answers questions like “What should we do?” or “What is the best course of action?” It helps in making informed decisions and optimizing outcomes by considering constraints, objectives, and potential risks.

Example in the workplace: Optimizing supply chain operations by recommending the most efficient routes for deliveries, considering factors like traffic, cost, and delivery time.

25
Q

Pandas

A

Short for "panel data". A software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
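A minimal sketch of what pandas provides (the column names and figures are invented):

```python
import pandas as pd

# A small numerical table: monthly unit sales.
sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "units": [120, 95, 143]})

# Column-wise operations work without explicit loops.
sales["cumulative"] = sales["units"].cumsum()  # running total per month
total = int(sales["units"].sum())
```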
26
Q

Dependent variable

A

The measurable effect, outcome, or response in which the research is interested.
27
Q

Independent variable

A

The variable you manipulate or vary in an experimental study to explore its effects. It is not influenced by any other variables in the study.
28
Q

Boolean Data

A

A data type with only two possible values, usually true or false.
29
Q

Logistic Regression

A

Often used for classification and predictive analytics. Estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables.
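A minimal sketch of the underlying idea: a logistic model passes a linear combination of the independent variables through the sigmoid function to get a probability between 0 and 1. The intercept and coefficient below are invented, not fitted:

```python
import math

def sigmoid(z):
    # Squashes any real number into the (0, 1) probability range.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted model: intercept -1.5, coefficient 0.8 per unit of x.
intercept, coef = -1.5, 0.8

def predict_proba(x):
    return sigmoid(intercept + coef * x)

p = predict_proba(3.0)  # probability of the event when x = 3
voted = p >= 0.5        # classify with a 0.5 threshold
```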
30
Q

Linear regression

A

Used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable.
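A simple linear regression can be fitted by hand with the closed-form least-squares formulas; this sketch uses invented data lying roughly on y = 2x:

```python
# Ordinary least-squares fit of y = slope * x + intercept.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.1, 7.9]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

prediction = slope * 5.0 + intercept  # extrapolate to x = 5
```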
31
Q

Inferential statistics

A

Numerical methods used to determine whether research data support a hypothesis or whether results were due to chance.
32
Q

Null hypothesis

A

A prediction that there is no difference between groups or conditions; a statement or idea that can be falsified, or proved wrong.
33
Q

Coefficient

A

The change in the dependent variable based on the change in one of the independent variables, when all other independent variables remain constant.
34
Q

Research Data

A

Any information that has been collected, observed, generated or created to validate original research findings. Examples of research data can include:
  - Results of machine learning models
  - Documents and spreadsheets
  - Artefacts and specimen samples
  - Models, algorithms, and scripts

The ways that this applies to Multiverse are:
  - Machine learning models for different uses, for example the apprentice risk model
  - Qualitative UX research done by the product and learning teams
  - Psychometrics research for hiring and apprentice applications
35
Q

Open Data

A

A subset of public data that is freely available to the public, typically without any restrictions on access, use, or redistribution. This data is often provided in standardized, machine-readable formats, making it easily accessible and reusable by individuals or organizations. The goal is to promote transparency, accountability, and collaboration.

Examples of open data in the workplace can include government datasets on demographics, economic indicators, or geographic information that are made freely available for public use and analysis.
36
Q

Public Data

A

Any information that is available to the general public. It can include a wide range of data from different sources, including government agencies, public institutions, or other entities. This data can be collected for various purposes and may include demographic information, statistical data, environmental records, or public reports.

In the workplace, public data can be used for market research, policy analysis, or to inform decision-making processes.
37
Q

Principles of User Experience

A

Considerations to ensure data products are accessible and easily usable by all. Key features of these principles include:
  - Font readability
  - Accessibility (for example, colour blindness)
  - Layout and clarity of dashboards
38
Q

p-value

A

The probability level which forms the basis for deciding whether results are statistically significant (not due to chance).
39
Q

Alternate hypothesis

A

The hypothesis to be considered as an alternative to the null hypothesis. It will only be accepted if there is significant evidence to suggest that the null hypothesis is not correct.
40
Q

Autoregressive model

A

A regression model in which a regression relationship based on past time series values is used to predict future time series values. It takes autocorrelation into account.
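A minimal sketch of an AR(1) model, where each value is predicted from the previous one and each forecast feeds the next step (the coefficients here are invented, not fitted):

```python
# AR(1): y_t = c + phi * y_{t-1}.
c, phi = 10.0, 0.5      # hypothetical fitted intercept and AR coefficient
last_observed = 40.0    # most recent value in the series

forecasts = []
prev = last_observed
for _ in range(3):      # forecast three steps ahead
    nxt = c + phi * prev
    forecasts.append(nxt)
    prev = nxt
```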
41
Q

Database

A

An organised collection of data that can be easily accessed, managed, and updated. Within construction data management, databases ensure that model revisions, metadata, and validation results are structured for reliable querying and analysis.
42
Q

Relational Database

A

Organises data into tabular form with defined relationships between tables, typically through primary and foreign keys.
43
Q

Non-relational or NoSQL database

A

Stores data in formats such as document-based structures. These are better suited to unstructured or semi-structured data.
44
Q

Data Warehouse

A

A large, central repository used to store data from multiple sources for analytics and reporting. While this project uses Power BI as the front-end visualisation tool rather than a dedicated warehouse, it mimics a warehouse approach by integrating multiple datasets into a unified analytical model.
45
Q

Data Lake

A

Differs from a warehouse in that it stores raw, unstructured data in its native format, allowing greater flexibility.
46
Q

Normalisation

A

Normalising a dataset involves restructuring it so that redundant or unstructured information is removed, resulting in a clearer, more efficient, and logically organised data model.
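As a sketch (with invented columns), normalisation might split a table that repeats customer details on every order row into separate customers and orders tables:

```python
import pandas as pd

# Denormalised: the customer's city is repeated on every order row.
orders_raw = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 2],
    "customer_city": ["Leeds", "Leeds", "York"],
    "amount": [25.0, 40.0, 15.0],
})

# Normalised: customer attributes live once per customer, keyed by customer_id...
customers = orders_raw[["customer_id", "customer_city"]].drop_duplicates()
# ...and orders keep only the foreign key back to the customers table.
orders = orders_raw[["order_id", "customer_id", "amount"]]
```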
47
Q

ETL

A

Extract, Transform, Load. The ETL process underpins the flow of data from source to analysis. In this project, data from Solibri was extracted to Excel (tabular format), transformed using Power Query for cleaning and schema alignment, and loaded into Power BI's relational model.
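The three ETL stages can be sketched generically with pandas; this is an illustration with invented data, not the actual Solibri/Power BI pipeline:

```python
import io
import pandas as pd

# Extract: read raw tabular data (an in-memory CSV stands in for an exported file).
raw_csv = "component,status\nWall-01,PASS\nDoor-02,fail\nWall-02,PASS\n"
df = pd.read_csv(io.StringIO(raw_csv))

# Transform: clean values and align them to the target schema.
df["status"] = df["status"].str.upper()
df["is_compliant"] = df["status"] == "PASS"

# Load: write the cleaned table to its destination
# (a real pipeline might load into a database or a BI model).
clean_csv = df.to_csv(index=False)
```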
48
Q

Primary Key

A

A unique identifier for each record in a table. The primary key ensures that there are no duplicate values and that, therefore, each row can be referenced reliably.
49
Q

Foreign Key

A

A field in one table that refers to a primary key in another table. This relationship ensures that tables can be linked without redundancy.
50
Q

Composite Key

A

A composite key uses two or more columns together to identify a unique record. In the case of this project, combining Building Number and Revision into a single field created a unique identifier for each model version.
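A sketch of the approach described above, using invented Building Number and Revision values: neither column is unique on its own, but the combined field is.

```python
import pandas as pd

models = pd.DataFrame({
    "building_number": ["B1", "B1", "B2"],
    "revision": ["R1", "R2", "R1"],
})

# Combine the two columns into a single composite-key field.
models["model_key"] = models["building_number"] + "-" + models["revision"]

is_unique = models["model_key"].is_unique
```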
51
Q

Standard deviation

A

Measures how spread out compliance percentages are from the mean, indicating whether performance is consistent or variable.
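A quick illustration with Python's statistics module (the compliance percentages are invented): both sets below have the same mean of 90, but very different spread.

```python
import statistics

consistent = [89, 90, 91, 90]   # tightly clustered around the mean
variable = [70, 100, 95, 95]    # same mean, much wider spread

spread_a = statistics.pstdev(consistent)  # population standard deviation
spread_b = statistics.pstdev(variable)
```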