How did the term “Big Data” come about?
What are the 4 big challenges associated with big data?
▶ Volume (size of the dataset)
▶ Velocity (rate at which the data arrives)
▶ Data quality (missingness or errors)
▶ Sampling frame (data is often collected opportunistically rather than by design, so it may not be representative)
What is a genome?
How are these represented?
How many bases does a human genome consist of? And over how many chromosome pairs?
What is DNA sequencing?
What is the problem with DNA sequencing?
What is an SNP in DNA sequencing? What further complications arise from this?
What are big n and big p with regard to Big Data in statistical machine learning?
What are the 4 main issues that arise from Big Data from this perspective?
Nowadays, what will be the case with the models we look at?
Computation Issues: CPU times?
Computation Issues: Moore's Law, CPUs and RAM?
Moore's Law: the number of transistors on integrated circuit chips doubles roughly every 24 months
Computation issues: What are some solutions?
What are the algorithmic issues?
Statistical issues: Generally what are the challenges when n and p increase?
||X − Y||^2 follows a (scaled) chi-squared distribution (for independent Gaussian X and Y)
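A quick sanity check of the distance claim, under the assumption (mine, not stated on the card) that X and Y are independent standard normal vectors in R^p: each coordinate difference is N(0, 2), so ||X − Y||^2 / 2 is chi-squared with p degrees of freedom and E[||X − Y||^2] = 2p. A minimal simulation sketch:

```python
import random

def mean_sq_distance(p, trials=20000, seed=0):
    """Estimate E[||X - Y||^2] for independent standard normal X, Y in R^p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each coordinate of X - Y is N(0, 2); sum the squared coordinates.
        total += sum((rng.gauss(0, 1) - rng.gauss(0, 1)) ** 2 for _ in range(p))
    return total / trials

# ||X - Y||^2 / 2 ~ chi-squared(p), so the estimates should be close to 2p.
for p in (1, 5, 20):
    print(p, round(mean_sq_distance(p), 2))
```

The point for high dimensions is that squared distances concentrate around 2p, so all pairs of points look roughly equally far apart.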
How close is the nearest neighbour in high dimensions?
In order to capture 1% of the training data with p = 100 you are already looking at a hypercube with edge length 0.95 of the side of the p-dimensional unit hypercube –> does this still qualify as a neighbourhood of your data?
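The 0.95 figure on this card comes from a one-line calculation: for uniform data on the unit hypercube, a sub-hypercube covering a fraction r of the data needs edge length r^(1/p). A short sketch (function name is mine):

```python
def edge_length(r, p):
    """Edge length of a sub-hypercube covering a fraction r of the
    p-dimensional unit hypercube (uniform data): r ** (1/p)."""
    return r ** (1.0 / p)

# For r = 1% of the data, the required edge grows rapidly with p:
for p in (1, 2, 10, 100):
    print(p, round(edge_length(0.01, p), 3))
# At p = 100 the edge is ~0.955 of the whole cube's side,
# so the "neighbourhood" is nearly the entire input space.
```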
How big should n be to control the bias and the variance?
Specifically, if we fix the edge length of the hypercube at ε = 0.1 and require the neighbourhood hypercube to contain at least 10 points on average?
Basically, big p destroys everything –> even with p = 100, n = 10^101 is not a feasible amount of training data to obtain
* Thus you cannot obtain a “statistical guarantee” when p is big, as n scales exponentially with p
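The 10^101 figure above can be verified exactly: with uniform data on the unit hypercube, a neighbourhood cube of edge ε contains n·ε^p points on average, so requiring at least k points forces n ≥ k/ε^p. A sketch using exact rational arithmetic (function name is mine):

```python
import math
from fractions import Fraction

def required_n(eps, p, k=10):
    """Smallest n such that n * eps**p >= k, i.e. a neighbourhood
    hypercube of edge eps holds at least k points on average."""
    return math.ceil(k / eps ** p)

# eps = 1/10, p = 100, k = 10 points per neighbourhood:
n = required_n(Fraction(1, 10), 100)
print(n == 10 ** 101)  # True: n must scale exponentially in p
```

Using `Fraction(1, 10)` instead of the float `0.1` keeps the arithmetic exact, which matters here because 0.1**100 is far below float precision guarantees.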
What are the ethical issues associated with big data?