How did the term “Big Data” come about?
What are the 4 big challenges associated with big data?
▶ Volume (size of the dataset)
▶ Velocity (rate at which the data arrives)
▶ Data quality (missingness or errors)
▶ Sampling frame (data is often collected opportunistically rather than by design, so it may not be representative)
What is a genome?
How are these represented?
How many bases does a human genome consist of? And over how many chromosome pairs?
What is DNA sequencing?
What is the problem with DNA sequencing?
What is an SNP in DNA sequencing? What further complications arise from this?
What are big n and big p with regard to Big Data in statistical machine learning?
What are the 4 main issues that arise from Big Data from this perspective?
Nowadays, what will be the case with the models we look at?
Computation Issues: CPU times?
Computation Issues: Moore's Law, CPUs and RAM?
Moore's Law: the number of transistors on integrated circuit chips doubles roughly every 24 months
Computation issues: What are some solutions?
What are the algorithmic issues?
Statistical issues: Generally what are the challenges when n and p increase?
||X − Y||^2 follows a (scaled) chi-squared distribution (for independent Gaussian X and Y)
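A quick sanity check of the distance claim, under the assumption (mine, not stated on the card) that X and Y are independent standard normal vectors in R^p: each coordinate difference is N(0, 2), so ||X − Y||^2 / 2 is chi-squared with p degrees of freedom and E[||X − Y||^2] = 2p. A minimal simulation sketch:

```python
import random

def mean_sq_distance(p, trials=20000, seed=0):
    """Estimate E[||X - Y||^2] for independent standard normal X, Y in R^p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each coordinate of X - Y is N(0, 2); sum the squared coordinates.
        total += sum((rng.gauss(0, 1) - rng.gauss(0, 1)) ** 2 for _ in range(p))
    return total / trials

# ||X - Y||^2 / 2 ~ chi-squared(p), so the estimates should be close to 2p.
for p in (1, 5, 20):
    print(p, round(mean_sq_distance(p), 2))
```

The point for high dimensions is that squared distances concentrate around 2p, so all pairs of points look roughly equally far apart.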
How close is the nearest neighbour in high dimensions?
In order to capture 1% of the training data with p = 100 you are already looking at a hypercube with edge length 0.95 of the side of the p-dimensional unit hypercube –> does this still qualify as a neighbourhood of your data?
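The 0.95 figure on this card comes from a one-line calculation: for uniform data on the unit hypercube, a sub-hypercube covering a fraction r of the data needs edge length r^(1/p). A short sketch (function name is mine):

```python
def edge_length(r, p):
    """Edge length of a sub-hypercube covering a fraction r of the
    p-dimensional unit hypercube (uniform data): r ** (1/p)."""
    return r ** (1.0 / p)

# For r = 1% of the data, the required edge grows rapidly with p:
for p in (1, 2, 10, 100):
    print(p, round(edge_length(0.01, p), 3))
# At p = 100 the edge is ~0.955 of the whole cube's side,
# so the "neighbourhood" is nearly the entire input space.
```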
How big should n be to control the bias and the variance?
Specifically, if we fix the edge length of the hypercube at ε = 0.1 and require the neighbourhood hypercube to contain at least 10 points on average?
Basically, big p destroys everything –> even with p = 100, n = 10^101 is not a feasible amount of training data to obtain
* Thus you cannot obtain a “statistical guarantee” when p is big, as n scales exponentially with p
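The 10^101 figure above can be verified exactly: with uniform data on the unit hypercube, a neighbourhood cube of edge ε contains n·ε^p points on average, so requiring at least k points forces n ≥ k/ε^p. A sketch using exact rational arithmetic (function name is mine):

```python
import math
from fractions import Fraction

def required_n(eps, p, k=10):
    """Smallest n such that n * eps**p >= k, i.e. a neighbourhood
    hypercube of edge eps holds at least k points on average."""
    return math.ceil(k / eps ** p)

# eps = 1/10, p = 100, k = 10 points per neighbourhood:
n = required_n(Fraction(1, 10), 100)
print(n == 10 ** 101)  # True: n must scale exponentially in p
```

Using `Fraction(1, 10)` instead of the float `0.1` keeps the arithmetic exact, which matters here because 0.1**100 is far below float precision guarantees.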
What are the ethical issues associated with big data?