Chapter 3.1: Big Data - Issues arising from Big Data Flashcards

(15 cards)

1
Q

How did the term “Big Data” come about?

A
2
Q

What are the 4 big challenges associated with big data?

A

▶ Volume (size of the dataset)
▶ Velocity (rate at which the data arrives)
▶ Data quality (missingness or errors)
▶ Sampling frame (data is often collected)

3
Q

What is a genome?

How are these represented?

How many bases does a human genome consist of? And over how many chromosome pairs?

What is DNA sequencing?

A
4
Q

What is the problem with DNA sequencing?

A
5
Q

What is an SNP in DNA sequencing? What further complications arise from this?

A
6
Q

What are big n and big p with regard to Big Data in statistical machine learning?

What are the 4 main issues that arise from Big Data from this perspective?

Nowadays, what will be the case with the models we look at?

A
7
Q

Computation Issues: CPU times?

A
8
Q

Computation Issues: Moore’s Law, CPUs and RAM?

A

Moore’s Law: the number of transistors on integrated circuit chips doubles roughly every 24 months

  • Clock rates actually flattened out around the 2000s (even though Moore’s law still showed evidence of the doubling up until 2018) –> physical limitations, such as the need for adequate cooling to maintain computational efficiency; smaller chips also have a quantum tunneling issue.
  • To overcome this issue, people have started to add more cores to the processor
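
To get a feel for what "doubling every 24 months" implies, here is a minimal sketch. The function name `transistor_growth_factor` is made up for illustration, and the calculation assumes the idealised doubling rule holds exactly, which real chips only follow approximately.

```python
def transistor_growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplicative growth in transistor count after `years`,
    assuming an exact doubling every `doubling_period_years` (idealised)."""
    return 2.0 ** (years / doubling_period_years)


# Ten years is five doubling periods, so the count grows by 2**5.
print(transistor_growth_factor(10))  # 32.0
```

Under this idealised rule, two decades of doubling gives a factor of roughly a thousand, which is why flat clock rates over the same period stand out.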
9
Q

Computation issues: What are some solutions?

A
10
Q

What are the algorithmic issues?

A
  • For a lot of statistical machine learning problems you need to design, or draw up, new algorithms to fit or solve your machine learning problem
11
Q

Statistical issues: Generally what are the challenges when n and p increase?

A
12
Q
A
  • Note: 5e1 means 5 times the unit vector e1 = (1,0,0,0,0).

X-Y is a Chi-squared distribution

13
Q

How close is the neighbour for:

  • p = 2
  • p = 5
  • p = 10
  • p = 100
A

In order to capture 1% of the training data for p = 100 you’re already looking at a hypercube with edge length of about 0.95 inside a p-dimensional unit hypercube –> does this still qualify as a neighbourhood of your data?

  • So when p is big, the edge length needs to be very large
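
The edge lengths for each p can be checked with a short calculation. This is a minimal sketch that assumes the training points are spread uniformly over the unit hypercube (an assumption, not stated on the card), so a sub-cube capturing a fraction r of the data needs edge length r^(1/p). The function name `edge_length` is made up for illustration.

```python
def edge_length(fraction: float, p: int) -> float:
    """Edge length of a sub-cube that holds `fraction` of the volume of the
    p-dimensional unit hypercube (assumes uniformly distributed data)."""
    return fraction ** (1.0 / p)


# Edge needed to capture 1% of the data for the dimensions on the card.
for p in (2, 5, 10, 100):
    print(f"p = {p:>3}: edge = {edge_length(0.01, p):.3f}")
# p =   2: edge = 0.100
# p =   5: edge = 0.398
# p =  10: edge = 0.631
# p = 100: edge = 0.955
```

Even at p = 10 the "local" neighbourhood already spans more than 60% of each axis.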
14
Q

How big should n be to control the bias/control the variance?

Specifically, what if we fix the edge length of the hypercube at E = 0.1 and ensure the neighbourhood hypercube contains at least 10 points on average?

A

Basically big p destroys everything –> even with p = 100, 10^101 is not a feasible amount of training data to obtain

  • Thus you cannot obtain a “statistical guarantee” when p is big, as n scales exponentially with p

  • This is a core problem for statisticians
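
The 10^101 figure can be reproduced as follows. This is a minimal sketch assuming uniformly distributed data in the unit hypercube, so a neighbourhood of edge E holds n * E^p points on average and we need n >= 10 / E^p. It works on a log10 scale to avoid floating-point trouble with such large numbers; the function name `log10_required_n` is made up for illustration.

```python
import math


def log10_required_n(edge: float, p: int, points_in_box: int = 10) -> float:
    """log10 of the sample size n needed so that a sub-cube of the given edge
    length holds `points_in_box` points on average (assumes uniform data)."""
    return math.log10(points_in_box) - p * math.log10(edge)


for p in (2, 5, 10, 100):
    print(f"p = {p:>3}: n is about 10^{log10_required_n(0.1, p):.0f}")
# p =   2: n is about 10^3
# p =   5: n is about 10^6
# p =  10: n is about 10^11
# p = 100: n is about 10^101
```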
15
Q

What are the ethical issues associated with big data?

A
  • Demographic parity –> across certain demographic features you need to ensure that the conditional probability of a positive outcome (e.g. hiring, +ve) versus a negative outcome (not hiring, -ve) is roughly equal across groups. –> Controversial, but it is one of the ways considered for enforcing fairness.
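
One way to check demographic parity on a set of decisions is to compare the rate of positive outcomes per group. The sketch below is illustrative only: the function names `positive_rates` and `parity_gap`, the group labels and the toy data are all made up, and a gap of zero is just one possible way to operationalise "roughly equal".

```python
from collections import defaultdict


def positive_rates(decisions, groups):
    """Estimate P(decision = +ve | group) for each group from paired lists."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for decision, group in zip(decisions, groups):
        counts[group][0] += int(decision)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def parity_gap(decisions, groups):
    """Largest difference in positive rate between any two groups."""
    rates = positive_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical hiring decisions (1 = hired) for two made-up groups.
decisions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(positive_rates(decisions, groups))  # {'a': 0.75, 'b': 0.25}
print(parity_gap(decisions, groups))      # 0.5
```

A gap near zero indicates the positive-outcome rate is similar across groups; here group "a" is hired three times as often as group "b", so parity is clearly violated.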