Data Fellowship Content Flashcards

(51 cards)

1
Q

The components of the data analytics life-cycle

A
  1. Plan
  2. Data Preparation
  3. Analysis
  4. Modelling
  5. Refine & Compare
  6. Communicate & Implement
2
Q

GDPR

A

General Data Protection Regulation.

A legal framework that sets guidelines for the collection and processing of personal information from individuals who live in and outside of the European Union (EU).

Its aim is to give consumers control over their own personal data.

3
Q

RDBMS (Relational Database Management System)

A

A type of database that stores and provides access to data points that are related to one another.

Based on the relational model, an intuitive, straightforward way of representing data in tables. Each row in the table is a record with a unique ID called the key. The columns of the table hold attributes of the data, and each record usually has a value for each attribute, making it easy to establish the relationships among data points.

4
Q

Personally Identifiable Information (PII)

A

Information that, when used alone or with other relevant data, can identify an individual.

May contain direct identifiers (e.g., passport information) that can identify a person uniquely, or quasi-identifiers (e.g., race) that can be combined with other quasi-identifiers (e.g., date of birth) to successfully recognise an individual.

5
Q

Modelling

A

The stage where models such as algorithms and predictive analysis are built.

6
Q

Refine & Compare

A

The stage that reflects on the proposed models and solutions, considers alternatives, and iterates current work into one optimised solution.

7
Q

Lawfulness, fairness and transparency

A

One of the principles of GDPR.

Gathering data and processing it with a valid legal basis, for example getting user consent to process their data in a certain way.

Your processing of data is in the best interest of the person the data is about, and the scope of the processing can be reasonably expected by the person.

You clearly communicate what, how, and why you process data to those whose data you process. This should be done in a way that enables data subjects to easily understand what you are doing with their data.

8
Q

Purpose Limitation

A

One of the principles of GDPR.

Data must be collected for specified, explicit, and legitimate purposes.

9
Q

Data minimisation

A

One of the principles of GDPR.

Only collect data that is adequate, relevant, and limited to what is necessary.

10
Q

Accuracy

A

One of the principles of GDPR.

Having data records that represent the current truth. Records must be kept up to date and correct, and the data processor must take reasonable measures to ensure that.

11
Q

Storage Limitation

A

One of the principles of GDPR.

If personal data is no longer required it must be deleted. Exceptions when data can be kept for longer include data for scientific purposes or in the interest of the public (e.g. criminal records).

12
Q

Accountability

A

One of the principles of GDPR.

Taking responsibility for your data processing. The data controller and/or processor must be responsible for proper processing of personal data and compliance with the rules of GDPR.

13
Q

Integrity and confidentiality

A

One of the principles of GDPR.

Personal data must be processed or stored in a manner that ensures its security. This includes protection against unauthorized or unlawful processing and accidental loss, destruction or damage.

It must not be made available or disclosed to unauthorized individuals, entities or processes.

14
Q

Administrative Data

A

Information created when people interact with services, collated by organisations.

It is used to help with the operational services of an organisation.

Examples from Multiverse may include attendance data and OTJ % (off-the-job) hours data, which are collected and used to track apprentice progress and to evidence compliance processes.

15
Q

Structured Data

A

Data that can be organized and formatted in a way that is easy for computers to read, organize, and understand, and that can be inserted into a database in a seamless fashion.

16
Q

Unstructured data

A

Data that cannot be stored in a traditional relational database or RDBMS. Text and multimedia are two common types of unstructured content. Many business documents are unstructured, as are email messages, videos, photos, webpages, and audio files.

17
Q

Inner join

A

The most common type of join; includes rows in the query result only when the joined field matches records in both tables.

18
Q

Left Join

A

The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.

19
Q

Right Join

A

The RIGHT JOIN keyword returns all rows from the right table (table2), with the matching rows in the left table (table1). The result is NULL in the left side when there is no match.

20
Q

Cartesian Join

A

Links table data so each record in the first table is matched with each individual record in the second table. Also called a Cartesian product or cross join.
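The join types above (inner, left, right, and Cartesian/cross) can be sketched with pandas, whose merge function mirrors SQL join semantics; the tables and column names below are invented for illustration:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cara"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [70, 85, 90]})

inner = left.merge(right, on="id", how="inner")    # only ids in both tables: 2, 3
left_j = left.merge(right, on="id", how="left")    # all left rows; score is NaN for id 1
right_j = left.merge(right, on="id", how="right")  # all right rows; name is NaN for id 4
cross = left.merge(right, how="cross")             # every pairing: 3 x 3 = 9 rows
```

An unmatched side comes back as NaN, which is the pandas analogue of the NULLs described in the left- and right-join cards.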

21
Q

Schema

A

Defines how data is organised within a relational database, inclusive of logical constraints such as table names, fields, data types, and the relationships between these entities.

22
Q

Descriptive Analytics

A

The use of data to understand past and current business performance and make informed decisions.

23
Q

Predictive Analytics

A

This type of analytics involves analyzing historical data and using statistical and machine-learning techniques to make predictions or forecasts about future events or outcomes.

It identifies patterns and relationships in data to generate probabilistic predictions about what is likely to happen.

Predictive analytics answers questions like “What is likely to happen in the future?” or “What will be the impact of a specific action?” It helps organizations anticipate future trends, identify potential risks or opportunities, and make proactive decisions.

Example in the workplace: Forecasting customer demand for a product based on historical sales data, market trends, and external factors like seasonality or economic indicators.

24
Q

Prescriptive Analytics

A

This type of analytics goes beyond descriptive analytics by providing recommendations and suggestions on what actions to take based on the analysis of data and various possible scenarios.

It leverages advanced techniques, such as optimization algorithms, machine learning, and simulation models, to generate actionable insights. Prescriptive analytics answers questions like “What should we do?” or “What is the best course of action?” It helps in making informed decisions and optimizing outcomes by considering constraints, objectives, and potential risks.

Example in the workplace: Optimizing supply chain operations by recommending the most efficient routes for deliveries, considering factors like traffic, cost, and delivery time.

25
Q

Pandas

A

Short for "panel data". A software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
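A minimal sketch of what pandas provides (the column names and figures are invented):

```python
import pandas as pd

# A small numerical table: monthly unit sales.
sales = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "units": [120, 95, 143]})

# Column-wise operations work without explicit loops.
sales["cumulative"] = sales["units"].cumsum()  # running total per month
total = int(sales["units"].sum())
```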
26
Q

Dependent variable

A

The measurable effect, outcome, or response in which the research is interested.
27
Q

Independent variable

A

The variable you manipulate or vary in an experimental study to explore its effects. It is not influenced by any other variables in the study.
28
Q

Boolean Data

A

A data type with only two possible values, usually true or false.
29
Q

Logistic Regression

A

Often used for classification and predictive analytics. Estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables.
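A minimal sketch of the underlying idea: a logistic model passes a linear combination of the independent variables through the sigmoid function to get a probability between 0 and 1. The intercept and coefficient below are invented, not fitted:

```python
import math

def sigmoid(z):
    # Squashes any real number into the (0, 1) probability range.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted model: intercept -1.5, coefficient 0.8 per unit of x.
intercept, coef = -1.5, 0.8

def predict_proba(x):
    return sigmoid(intercept + coef * x)

p = predict_proba(3.0)  # probability of the event when x = 3
voted = p >= 0.5        # classify with a 0.5 threshold
```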
30
Q

Linear regression

A

Used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable.
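A simple linear regression can be fitted by hand with the closed-form least-squares formulas; this sketch uses invented data lying roughly on y = 2x:

```python
# Ordinary least-squares fit of y = slope * x + intercept.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.1, 7.9]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

prediction = slope * 5.0 + intercept  # extrapolate to x = 5
```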
31
Q

Inferential statistics

A

Numerical methods used to determine whether research data support a hypothesis or whether results were due to chance.
32
Q

Null hypothesis

A

A prediction that there is no difference between groups or conditions; a statement or idea that can be falsified, or proved wrong.
33
Q

Coefficient

A

The change in the dependent variable based on the change in one of the independent variables, when all other independent variables remain constant.
34
Q

Research Data

A

Any information that has been collected, observed, generated or created to validate original research findings. Examples of research data can include:
  - Results of machine learning models
  - Documents and spreadsheets
  - Artefacts and specimen samples
  - Models, algorithms, and scripts

The ways that this applies to Multiverse are:
  - Machine learning models for different uses, for example the apprentice risk model
  - Qualitative UX research done by the product and learning teams
  - Psychometrics research for hiring and apprentice applications
35
Q

Open Data

A

A subset of public data that is freely available to the public, typically without any restrictions on access, use, or redistribution. This data is often provided in standardized, machine-readable formats, making it easily accessible and reusable by individuals or organizations. The goal is to promote transparency, accountability, and collaboration.

Examples of open data in the workplace can include government datasets on demographics, economic indicators, or geographic information that are made freely available for public use and analysis.
36
Q

Public Data

A

Any information that is available to the general public. It can include a wide range of data from different sources, including government agencies, public institutions, or other entities. This data can be collected for various purposes and may include demographic information, statistical data, environmental records, or public reports.

In the workplace, public data can be used for market research, policy analysis, or to inform decision-making processes.
37
Q

Principles of User Experience

A

Considerations to ensure data products are accessible and easily usable by all. Key features of these principles include:
  - Font readability
  - Accessibility (for example, colour blindness)
  - Layout and clarity of dashboards
38
Q

p-value

A

The probability level which forms the basis for deciding whether results are statistically significant (not due to chance).
39
Q

Alternate hypothesis

A

The hypothesis to be considered as an alternative to the null hypothesis. It will only be accepted if there is significant evidence to suggest that the null hypothesis is not correct.
40
Q

Autoregressive model

A

A regression model in which a regression relationship based on past time series values is used to predict future time series values. It takes autocorrelation into account.
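A minimal sketch of an AR(1) model, where each value is predicted from the previous one and each forecast feeds the next step (the coefficients here are invented, not fitted):

```python
# AR(1): y_t = c + phi * y_{t-1}.
c, phi = 10.0, 0.5      # hypothetical fitted intercept and AR coefficient
last_observed = 40.0    # most recent value in the series

forecasts = []
prev = last_observed
for _ in range(3):      # forecast three steps ahead
    nxt = c + phi * prev
    forecasts.append(nxt)
    prev = nxt
```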
41
Q

Database

A

An organised collection of data that can be easily accessed, managed, and updated. Within construction data management, databases ensure that model revisions, metadata, and validation results are structured for reliable querying and analysis.
42
Q

Relational Database

A

Organises data into tabular form with defined relationships between tables, typically through primary and foreign keys.
43
Q

Non-relational or NoSQL database

A

Stores data in formats such as document-based structures. These are better suited to unstructured or semi-structured data.
44
Q

Data Warehouse

A

A large, central repository used to store data from multiple sources for analytics and reporting. While this project uses Power BI as the front-end visualisation tool rather than a dedicated warehouse, it mimics a warehouse approach by integrating multiple datasets into a unified analytical model.
45
Q

Data Lake

A

Differs from a warehouse in that it stores raw, unstructured data in its native format, allowing greater flexibility.
46
Q

Normalisation

A

Normalising a dataset involves restructuring it so that redundant or unstructured information is removed, resulting in a clearer, more efficient, and logically organised data model.
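As a sketch (with invented columns), normalisation might split a table that repeats customer details on every order row into separate customers and orders tables:

```python
import pandas as pd

# Denormalised: the customer's city is repeated on every order row.
orders_raw = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 2],
    "customer_city": ["Leeds", "Leeds", "York"],
    "amount": [25.0, 40.0, 15.0],
})

# Normalised: customer attributes live once per customer, keyed by customer_id...
customers = orders_raw[["customer_id", "customer_city"]].drop_duplicates()
# ...and orders keep only the foreign key back to the customers table.
orders = orders_raw[["order_id", "customer_id", "amount"]]
```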
47
Q

ETL

A

Extract, Transform, Load. The ETL process underpins the flow of data from source to analysis. In this project, data from Solibri was extracted to Excel (tabular format), transformed using Power Query for cleaning and schema alignment, and loaded into Power BI's relational model.
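The three ETL stages can be sketched generically with pandas; this is an illustration with invented data, not the actual Solibri/Power BI pipeline:

```python
import io
import pandas as pd

# Extract: read raw tabular data (an in-memory CSV stands in for an exported file).
raw_csv = "component,status\nWall-01,PASS\nDoor-02,fail\nWall-02,PASS\n"
df = pd.read_csv(io.StringIO(raw_csv))

# Transform: clean values and align them to the target schema.
df["status"] = df["status"].str.upper()
df["is_compliant"] = df["status"] == "PASS"

# Load: write the cleaned table to its destination
# (a real pipeline might load into a database or a BI model).
clean_csv = df.to_csv(index=False)
```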
48
Q

Primary Key

A

A unique identifier for each record in a table. The primary key ensures that there are no duplicate values and that, therefore, each row can be referenced reliably.
49
Q

Foreign Key

A

A field in one table that refers to a primary key in another table. This relationship ensures that tables can be linked without redundancy.
50
Q

Composite Key

A

A composite key uses two or more columns together to identify a unique record. In the case of this project, combining Building Number and Revision into a single field created a unique identifier for each model version.
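A sketch of the approach described above, using invented Building Number and Revision values: neither column is unique on its own, but the combined field is.

```python
import pandas as pd

models = pd.DataFrame({
    "building_number": ["B1", "B1", "B2"],
    "revision": ["R1", "R2", "R1"],
})

# Combine the two columns into a single composite-key field.
models["model_key"] = models["building_number"] + "-" + models["revision"]

is_unique = models["model_key"].is_unique
```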
51
Q

Standard deviation

A

Measures how spread out compliance percentages are from the mean, indicating whether performance is consistent or variable.
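A quick illustration with Python's statistics module (the compliance percentages are invented): both sets below have the same mean of 90, but very different spread.

```python
import statistics

consistent = [89, 90, 91, 90]   # tightly clustered around the mean
variable = [70, 100, 95, 95]    # same mean, much wider spread

spread_a = statistics.pstdev(consistent)  # population standard deviation
spread_b = statistics.pstdev(variable)
```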