What is meant when talking about “human-in-the-loop” in data science?
Automated systems may speed up processes, but humans are better at understanding context and should be involved in designing, understanding, and reviewing the data science process.
Name different ways bias can be introduced into a data science project
Bias of design
- are the variables appropriate for all situations being modelled?
- assumptions made about the stakeholders whom the data relates to
Bias of data
- regional
- undertested in varied contexts
- gender
- ethnicity/race
Explain data management and data governance
What are the stages of the CSIRO research data lifecycle?
Explain the Capability Maturity Model
➡ Data description and representation
Goal: Create quality metadata for data discovery, preservation, and provenance functions
➡ Data dissemination
Goal: Design and implement interfaces for users to obtain and interact with data
➡ Repository services/preservation
Goal: Preserve collected data for long-term use
What is linked data?
How does semantic web work?
Name a format for linked (open) data and explain it
Explain ethics in data science
Ethics is the moral handling of data –> e.g. don’t sell private data to scammers
People have rights (privacy, access, erasure, etc.)
Companies have rights (ownership, confidentiality, intellectual property, copyright)
confidentiality vs. privacy:
privacy: I decide what happens with my data
confidentiality: my data is kept and handled as I decided
Companies and governments build business models on data
–> data as a valuable asset
–> data as a valuable product
Breaking it down:
What can you do?
What should you do?
How can you make sure the right things are done?
Surveillance
Australian government
My.gov.au gives the public access to their own data
➡ Greater dependency on online interfaces
➡ Less pen and paper data processing
➡ More automation of processing
➡ Cf. RoboDebt, Census
(Australian) Data retention laws
* “require some telecommunications service providers to retain specific telecommunications data (the data set) relating to the services they offer for at least 2 years”
➡ Who talks to whom on the phone & when
➡ Who emails whom & when
➡ The IP address
Data retention laws - issues
Rights vs functionality
* Change in responsibilities
➡ Change in processes and technology in response
AI veracity
Can you trust the analysis?
Various factors can affect the “accuracy” of any analysis
➡ Data quality
➡ Choice of analysis
➡ Design of analysis
➡ Choice of data
What is meant by ‘bias of design’?
Not all bias is in the numbers
* Bias can also be in how you have designed the research
➡ Are the variables appropriate for all situations being modelled?
➡ Are assumptions made about the stakeholders who the data relates to?
➡ Are assumptions being made about the context of the data?
What is meant by ‘bias of data’?
Sometimes the data used to train a ML system is biased, regardless of its volume
➡ Narrow
➡ Regional
➡ Undertested in varied contexts
e.g. Google: shows ads for high-paying jobs to men more than to women
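A quick way to look for this kind of skew is to compare an outcome rate across groups. The sketch below is only an illustration: the `records` list, the field names `gender` and `ad_shown`, and the helper `group_rates` are all made up for this example and do not come from any real dataset.

```python
from collections import Counter

def group_rates(records, group_key, outcome_key):
    """Return {group: share of records in that group with a truthy outcome}."""
    totals = Counter()
    positives = Counter()
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[outcome_key]:
            positives[group] += 1
    return {g: positives[g] / totals[g] for g in totals}

# Hypothetical toy data: was the ad shown to each user?
records = [
    {"gender": "m", "ad_shown": True},
    {"gender": "m", "ad_shown": True},
    {"gender": "m", "ad_shown": False},
    {"gender": "m", "ad_shown": True},
    {"gender": "f", "ad_shown": True},
    {"gender": "f", "ad_shown": False},
    {"gender": "f", "ad_shown": False},
    {"gender": "f", "ad_shown": False},
]

rates = group_rates(records, "gender", "ad_shown")
print(rates)  # a large gap between groups is a prompt to investigate, not proof of bias
```

A gap in rates only flags where to look; whether it reflects bias depends on the context and design questions above.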
Sampling
What do you have to look out for when sampling populations?
When collecting data for processing, it has to be relevant
➡ Can you get all data relating to the scenario you are modelling?
➡ Can you only get a random sample of data?
The sample data has to be representative of the population being modelled
Observe the population before you make any unqualified assumptions
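One rough check of representativeness is to compare group proportions in the sample against the population. This is a minimal sketch, assuming the population's group labels are known; the `urban`/`regional` labels, the 700/300 split, and the helper names are invented for the example.

```python
import random
from collections import Counter

def proportions(items):
    """Return {label: share of items carrying that label}."""
    counts = Counter(items)
    total = len(items)
    return {label: count / total for label, count in counts.items()}

def max_proportion_gap(population, sample):
    """Largest absolute difference in group share between population and sample."""
    pop = proportions(population)
    smp = proportions(sample)
    return max(abs(pop[label] - smp.get(label, 0.0)) for label in pop)

rng = random.Random(0)  # fixed seed so the example is repeatable
population = ["urban"] * 700 + ["regional"] * 300  # hypothetical population
sample = rng.sample(population, 100)               # simple random sample

gap = max_proportion_gap(population, sample)
print(proportions(sample), gap)
```

A small gap suggests the sample mirrors the population on that one attribute; it says nothing about attributes you did not check.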
A/B testing and significance testing
You must test for statistical significance
➡ p-value: the probability (0 to 1) behind your “surprise” => how likely you would be to get results at least as extreme as those observed if there were no real effect (i.e. under the null hypothesis)
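As an illustration, the p-value for an A/B test on conversion rates can be sketched with a two-sided two-proportion z-test. This is one common choice, not the only valid test; the function name and the conversion counts below are made up, and the normal approximation assumes reasonably large groups with a pooled rate strictly between 0 and 1.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a |z| at least this large if the null hypothesis holds.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: variant A converts 100/1000, variant B converts 130/1000.
p = two_proportion_p_value(100, 1000, 130, 1000)
print(round(p, 4))
```

A small p-value (commonly compared against a threshold such as 0.05 chosen before the test) means the observed difference would be surprising if the two variants really performed the same.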