Lecture 11 – Issues Flashcards

Question 1

Q

What is meant when talking about “human-in-the-loop” in data science?

Answer

A

Automated systems may speed up processes, but humans are better at understanding context and should be involved in designing, understanding, and reviewing the data science process.

Question 2

Q

Name different ways bias can be introduced into a data science project

Answer

A

Bias of design
- are the variables appropriate for all situations being modelled?
- are assumptions made about the stakeholders who the data relates to?

Bias of data
- regional
- undertested in varied contexts
- gender
- ethnicity/race

Question 3

Q

Explain data management and data governance

Question 4

Q

What are the stages of the CSIRO research data lifecycle?

Question 5

Q

Explain the Capability Maturity Model

Answer

A

Good management happens all through the data lifecycle
4 key process areas:
➡ Data acquisition, processing and quality assurance
Goal: Reliably capture and describe scientific data in a way that facilitates preservation and reuse

➡ Data description and representation
Goal: Create quality metadata for data discovery, preservation, and provenance functions

➡ Data dissemination
Goal: Design and implement interfaces for users to obtain and interact with data

➡ Repository services/preservation
Goal: Preserve collected data for long-term use

Good data governance uses a good management system
➡ A mature system manages data all through the data lifecycle and throughout all projects.

Question 6

Q

What is linked data?

Question 7

Q

How does semantic web work?

Question 8

Q

Name a format for linked (open) data and explain it

Question 9

Q

Explain ethics in data science

Answer

A

Ethics is the moral handling of data, e.g. don’t sell private data to scammers

People have rights (privacy, access, erasure, etc.)

Companies have rights (ownership, confidentiality, intellectual property, copyright)

Confidentiality vs. privacy:
privacy: I decide what happens with my data
confidentiality: my data is actually handled the way I decided

Companies and governments build business models on data
➡ data as a valuable asset
➡ data as a valuable product

Breaking it down:
What can you do?
What should you do?
How can you make sure the right things are done?

Question 10

Q

Surveillance

Australian government

Answer

A

My.gov.au gives the public access to their own data
➡ Greater dependency on online interfaces
➡ Less pen and paper data processing
➡ More automation of processing
➡ Cf. RoboDebt, Census

It is less clear what access each government can have to the data

(Australian) Data retention laws
* “require some telecommunications service providers to retain specific telecommunications data (the data set) relating to the services they offer for at least 2 years”

➡ Who talks to whom on the phone & when
➡ Who emails whom & when
➡ The IP address

What doesn’t it include?
➡ information about telecommunications content or web browsing history
Who has access to the data without a warrant?
➡ 20 intelligence agencies, criminal law enforcement agencies, ATO, ASIC and ACCC
➡ Civil litigation exemption

Question 11

Q

Data retention laws - issues

Answer

A

Rights vs functionality
* Change in responsibilities
➡ Change in processes and technology in response

Where do automation and AI fit?
➡ Where is the responsibility and accountability?
➡ Snowden and the NSA surveillance

Question 12

Q

AI veracity

Can you trust the analysis?

Answer

A

Various factors can affect the “accuracy” of any analysis
➡ Data quality
➡ Choice of analysis
➡ Design of analysis
➡ Choice of data

It is easy for the modelling to misrepresent what the data is supposed to reflect.
➡ Even statistical analysis can be biased!

Question 13

Q

AI veracity

What is meant by ‘bias of design’?

Answer

A

Not all bias is in the numbers
* Bias can also be in how you have designed the research
➡ Are the variables appropriate for all situations being modelled?
➡ Are assumptions made about the stakeholders who the data relates to?
➡ Are assumptions being made about the context of the data?

Question 14

Q

AI veracity

What is meant by ‘bias of data’?

Answer

A

Sometimes the data used to train an ML system is biased, regardless of its volume
➡ Narrow
➡ Regional
➡ Undertested in varied contexts

A biased system may discriminate in its results, for instance by
➡ gender
➡ ethnic associations
➡ generalities
A biased system may not be as accurate in its results for unfamiliar contexts and subjects

e.g. Google:
Shows ads for high-paying jobs to men more than to women
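
One way to look for this kind of bias is to compare how accurate a system is for each group separately. The sketch below is only an illustration: the function name accuracy_by_group, the group labels and the records are all made up for this example and do not come from the lecture.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return the share of correct predictions for each group.

    records is an iterable of (group, predicted, actual) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        if predicted == actual:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical predictions: (group, predicted label, actual label)
records = [
    ("group_x", 1, 1), ("group_x", 0, 0), ("group_x", 1, 1), ("group_x", 0, 0),
    ("group_y", 1, 0), ("group_y", 0, 0), ("group_y", 0, 1), ("group_y", 1, 1),
]

per_group = accuracy_by_group(records)
print(per_group)
```

A large gap between groups suggests the training data may not have covered one of them well, which is a prompt to investigate rather than proof of discrimination.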

Question 15

Q

Sampling

What do you have to look out for when sampling populations?

Answer

A

When collecting data for processing, it has to be relevant

➡ Can you get all data relating to the scenario you are modelling?
➡ Can you only get a random sample of data?
The sample data has to be representative of the population being modelled

Observe the population before you make any unqualified assumptions
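
A rough check of representativeness is to compare the share of each category in the sample with its share in the population. The sketch below assumes you know the population breakdown; the helper names, the "region" field and the numbers are hypothetical.

```python
from collections import Counter


def category_shares(records, key):
    """Return the proportion of records falling in each category of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def representation_gaps(population, sample, key):
    """Sample share minus population share, per population category.

    Positive values mean the category is over-represented in the sample.
    """
    pop_shares = category_shares(population, key)
    sample_shares = category_shares(sample, key)
    return {
        category: sample_shares.get(category, 0.0) - share
        for category, share in pop_shares.items()
    }


# Made-up data: the population is 60% urban, the sample is 90% urban
population = [{"region": "urban"}] * 60 + [{"region": "rural"}] * 40
sample = [{"region": "urban"}] * 18 + [{"region": "rural"}] * 2

gaps = representation_gaps(population, sample, "region")
print(gaps)
```

Here the urban category is over-represented by roughly 30 percentage points, so conclusions drawn from this sample would lean towards urban cases.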

Question 16

Q

A/B testing and significance testing

Answer

A

Blind experiments or A/B testing may be used to show whether there is a relationship between variables
The experimental scenario needs to be divided into:
➡ A: Sample is subject to the known variable
➡ B: Sample is not subject to the known variable (the Control set)

Must test the statistical significance
➡ p-value: the chance (0 to 1) behind your “surprise” => how likely you would be to get results at least this extreme if the variable being tested had no effect
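
As an illustration of significance testing for an A/B experiment, the sketch below estimates a p-value with a permutation test. This is one possible technique, not necessarily the one used in the lecture; the function name, the outcome data and the number of permutations are assumptions made for the example.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Group labels are shuffled repeatedly to see how often chance alone
    produces a difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance so floating-point ties count as "as extreme"
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up outcomes (1 = success, 0 = no success)
group_a = [1] * 30 + [0] * 70  # A: subject to the known variable
group_b = [1] * 10 + [0] * 90  # B: control set

p_value = permutation_p_value(group_a, group_b)
print(p_value)
```

A small p-value means a difference this large would rarely appear if the variable had no effect. The threshold for "small" (often 0.05) is a convention chosen before the experiment, not something the test itself provides.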

(16 cards)