what is big data?
refers to data sets too large or complex to process using traditional data processing methods
- large volumes of data, often comprising multiple data types
- substantial variation within the data, which makes it complex to analyse
- integrative analysis of different types of big data reveals interactions between variables
who analyses big data?
are big data experiments hypothesis-based or hypothesis-generating?
they are unbiased and hypothesis-generating
- they have huge power for discovery
- no need to choose and exclude markers in advance
where can big data be generated from?
what are OMICs in big data?
how can microscopy be used to generate big data?
what big data can microscopy generate?
how can big data on human physiology/health be generated?
what knowledge does big data contribute to biology?
what is transcriptomics?
the study of gene expression at the level of mRNA
- used to determine the effect of a condition on the expression of every gene in the tissue, organ, cell type or developmental stage of interest
comparisons may be:
- wildtype vs mutant
- treated vs untreated
- untreated vs environmental change
what is an experimental strategy in transcriptomics? what steps does it involve?
identify genes exhibiting differential expression in the compared cell types
what plot can be used to display big data on transcriptomics?
volcano plot
- each dot represents a gene
- the x-axis shows fold-change: how much a gene's expression increases or decreases (typically plotted as log2 fold-change)
- the y-axis shows the statistical significance of the difference in gene expression (typically plotted as -log10 of the p-value)
- red dots = downregulated genes
- green dots = upregulated genes
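How the two axes of a volcano plot are derived can be sketched in Python. This is a minimal sketch, not a full differential-expression analysis: the gene names, expression values and p-values are invented for illustration, the p-values are assumed to come from an earlier statistical test, and the cut-offs (at least two-fold change, p < 0.05) are common conventions rather than values from these notes.

```python
import math

# Hypothetical example data: mean expression in control and treated samples,
# plus a p-value assumed to come from a prior statistical test.
genes = {
    "geneA": {"control": 10.0, "treated": 80.0, "p": 0.0001},
    "geneB": {"control": 50.0, "treated": 5.0, "p": 0.002},
    "geneC": {"control": 20.0, "treated": 22.0, "p": 0.6},
}


def volcano_point(control, treated, p):
    """Return (x, y) for one gene: log2 fold-change and -log10 p-value."""
    x = math.log2(treated / control)
    y = -math.log10(p)
    return x, y


def classify(x, y, min_abs_log2fc=1.0, max_p=0.05):
    """Label a gene as up, down or unchanged (thresholds are assumptions)."""
    if y < -math.log10(max_p) or abs(x) < min_abs_log2fc:
        return "unchanged"
    return "up" if x > 0 else "down"


# One (x, y) dot per gene, then a label that would decide its colour.
points = {name: volcano_point(**vals) for name, vals in genes.items()}
labels = {name: classify(x, y) for name, (x, y) in points.items()}
```

Genes far to the left or right and high on the plot are the ones labelled "down" or "up"; everything near the bottom or centre is "unchanged".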
what methods can help to interpret the consequences of gene expression changes?
gene ontology and biological pathway algorithms:
- These algorithms can be run on the data to interpret consequences of gene expression changes
- Differentially expressed genes are fed into algorithms which extract information from databases about the functions of those genes and summarise it
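One common way such algorithms summarise function is an over-representation test: is a given annotation found among the differentially expressed genes more often than expected by chance? The sketch below uses a hypergeometric test for this. It is an illustration under assumptions, not the method of any particular tool; the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genome, n_annotated, n_de, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_genome    : total number of genes measured
    n_annotated : genes annotated with the term in the database
    n_de        : differentially expressed genes
    n_overlap   : differentially expressed genes that carry the term
    """
    total = comb(n_genome, n_de)
    upper = min(n_annotated, n_de)
    tail = sum(
        comb(n_annotated, k) * comb(n_genome - n_annotated, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes, 200 carry the term, 100 are
# differentially expressed, so about 1 overlapping gene is expected by chance.
p_enriched = enrichment_p_value(20000, 200, 100, 10)  # 10 observed
p_expected = enrichment_p_value(20000, 200, 100, 1)   # 1 observed
```

A very small p-value (as for 10 observed against about 1 expected) suggests the term is over-represented among the differentially expressed genes.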
how can the transcriptome of 100-10,000s of individual cells be collected?
single cell RNA-seq:
1. Dissect tissue, treat with enzymes
2. Single cell suspension – contains a mixture of cell types from tissue
3. Prepare libraries and sequence the transcriptome of every cell
what plot can be used to display the transcriptome of thousands of individual cells? what do these plots give insights into?
UMAP plots:
- Each dot is a cell
- Close = similar, far away = more different
- Each colour marks ‘clusters’ of similar cells
Potential insights into:
- Which genes are expressed by particular cells
- Cell type-specific gene expression changes
- Cell lineage/differentiation trajectories
- Tissue composition changes
how can genetic causes of disease/disease-associated genes be identified?
Genome-Wide Association Studies (GWAS) can identify genes affecting disease risk:
- humans have ~3x10^7 single nucleotide polymorphisms (SNPs) distributed randomly across the genome
- some people may have a different nucleotide in a certain position compared to others
GWAS identify SNP alleles that are found more frequently in patients (cases) than in healthy individuals (controls)
- high scoring SNPs are thus associated with the disease and may play causative roles in the disease process
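The case vs control comparison for a single SNP can be sketched as a chi-square test on allele counts. This is a simplified illustration with invented counts; real GWAS pipelines typically use regression models with covariates, and the function name here is hypothetical.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 allele-count table."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the chi-square tail probability
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: the alternative allele is more common in cases.
chi2_assoc, p_assoc = allelic_chi_square(600, 1400, 400, 1600)
# Hypothetical counts: identical allele frequencies in cases and controls.
chi2_null, p_null = allelic_chi_square(500, 1500, 500, 1500)
```

Repeating this test for every SNP gives one p-value per SNP; the smallest p-values mark the "high scoring" SNPs.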
how are GWAS results presented?
Manhattan plots:
- these map DNA sequence variants associated with a disease at genome-scale
- strong disease-associated SNPs are outliers
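How the coordinates of a Manhattan plot are built can be sketched as follows: each SNP is placed along the x-axis by its genomic position and on the y-axis by -log10 of its p-value, so strongly associated SNPs stand out as tall outliers. The SNPs, chromosome lengths and function name are hypothetical, and the 5e-8 cut-off is the conventional genome-wide significance threshold, an assumption not stated in these notes.

```python
import math

GENOME_WIDE_P = 5e-8  # conventional threshold (assumption, not from the notes)

# Hypothetical GWAS results: (chromosome, position in base pairs, p-value)
snps = [
    (1, 1_000_000, 0.2),
    (1, 5_000_000, 3e-9),
    (2, 2_000_000, 0.01),
    (2, 7_500_000, 1e-12),
]
# Hypothetical chromosome lengths, used to lay chromosomes end to end.
chrom_lengths = {1: 10_000_000, 2: 8_000_000}


def manhattan_points(snps, chrom_lengths):
    """Return (x, y, chromosome) per SNP: genome-wide position, -log10 p."""
    offsets, running = {}, 0
    for chrom in sorted(chrom_lengths):
        offsets[chrom] = running
        running += chrom_lengths[chrom]
    return [(offsets[c] + pos, -math.log10(p), c) for c, pos, p in snps]


points = manhattan_points(snps, chrom_lengths)
threshold_y = -math.log10(GENOME_WIDE_P)
# SNPs rising above the threshold line are the outliers on the plot.
hits = [(c, x) for x, y, c in points if y > threshold_y]
```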
why must we be careful when interpreting disease-associated SNPs?
further investigation is required to understand SNP disease-association
what can combining GWAS results with gene expression data achieve?
Big data integration reveals and refines insights into the biological process
what is an example of a population-scale big data project?
The 100,000 genomes project:
- Whole genome sequencing (WGS) to improve diagnosis of rare diseases and cancer care in the NHS through personalised medicine
- Data available to researchers
- 100,000 Genomes: 16.1% of rare disease patients received a molecular diagnosis
what is the UK biobank?
a prospective cohort study of 500,000 UK adults aged 40-69 at recruitment:
- Monitored over time: years/decades
- An integrated database for population-scale studies of health and disease, combining genetics, deep phenotyping, and electronic medical records:
- Demographic / socioeconomic
- Electronic health records (NHS)
- Physical activity monitoring
- Anatomical, Physiological, Biochemical, Genomic
Doctors can then use these past records to help with diagnosis – identify biomarkers of disease
why is big data important in the social gradient of health?
There is a social gradient in health, affecting Total and Healthy Life Expectancy:
- In England, poor neighbourhoods have a greater burden of ill-health than wealthy ones
- COVID-19 has had a proportionally higher impact on the most deprived areas of England
Big data is essential to understand how genetic predispositions, environmental exposures and social factors lead to disease