Module 2: Exploring the Genome Flashcards

(64 cards)

1
Q

What is a permutation?

A

A specific ordered arrangement of items; in our case, the bases of a DNA sequence. E.g., codons: AGC and TAG are different permutations.

2
Q

How many distinct permutations can a stretch of sequence 2 bases long have?

A

There are 4 possible bases (A, G, C, T) at each position, and the stretch is n = 2 bases long. Therefore, 4^n = # of permutations.

So 4^2 = 16 different permutations in a stretch of 2 bases.
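
As a quick check of the 4^n rule, here is a minimal sketch that both computes the count and enumerates the sequences. The function names are illustrative, not from the course material.

```python
from itertools import product

BASES = "AGCT"  # the 4 possible bases at each position


def count_permutations(n: int) -> int:
    """Number of distinct sequences of length n over 4 bases: 4^n."""
    return len(BASES) ** n


def list_permutations(n: int) -> list[str]:
    """Enumerate every sequence of length n (only practical for small n)."""
    return ["".join(p) for p in product(BASES, repeat=n)]


dinucleotides = list_permutations(2)
print(count_permutations(2))  # 16
print(len(dinucleotides))     # 16
print(dinucleotides[:4])      # ['AA', 'AG', 'AC', 'AT']
```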

3
Q

How do we determine the number of sites of n bases in a genome of size m?

A

if n = 1000
and m = 5.0x10^6

number of sites = m - (n-1)
(5.0x10^6) - (1000-1)
= 4,999,001 ≈ 4.999x10^6

There are 4.999x10^6 sites of n = 1000 in a 5.0x10^6 base genome

Note: this assumes single-stranded DNA. For double-stranded DNA, double the answer.
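
The site count can be sketched as a small function. The name `number_of_sites` and the `double_stranded` flag are assumptions made for illustration.

```python
def number_of_sites(genome_size: int, n: int, double_stranded: bool = False) -> int:
    """Number of positions where a stretch of n bases can start: m - (n - 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n > genome_size:
        return 0  # the stretch is longer than the genome
    sites = genome_size - (n - 1)
    # Per the note on this card, double the count for double-stranded DNA.
    return 2 * sites if double_stranded else sites


print(number_of_sites(5_000_000, 1000))                        # 4999001
print(number_of_sites(5_000_000, 1000, double_stranded=True))  # 9998002
```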

4
Q

How long does a DNA sequence have to be for us to expect it to be unique within a given genome?

A

We can expect a sequence to be unique in a genome when the number of permutations is greater than the number of sites of n bases in the genome.

5
Q

How do we calculate n when the genome size is known?

A

A bacterial genome is approx. 8x10^6 bases. When is the number of permutations greater than the length of the genome? Set 8x10^6 = 4^n, then solve for n using logs: n = log(8x10^6)/log(4) ≈ 11.5, so round up to n = 12.
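
The log calculation can be sketched as follows. Like the card, it compares 4^n to the genome length rather than the exact number of sites, and the function name is hypothetical.

```python
import math


def min_unique_length(genome_size: int) -> int:
    """Smallest whole number n for which 4^n exceeds the genome size."""
    n = 1
    while 4 ** n <= genome_size:
        n += 1
    return n


genome_size = 8_000_000  # approximate bacterial genome size from the card

# Solve 4^n = genome_size with logs, then round up to a whole number of bases.
exact_n = math.log(genome_size) / math.log(4)
print(round(exact_n, 2))               # 11.47
print(min_unique_length(genome_size))  # 12
```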

6
Q

What are Universal Bits?

A

Universal bits refer to measuring DNA sequence information in bits of information, rather than just counting base pairs. The information content at any position in DNA is 2 bits, because each position holds 1 of 4 possible bases (log2(4) = 2).
Ex. GGCCC is 5 bases long, so 2(5) = 10 bits.
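
A minimal sketch of the bit count, assuming every position is equally likely to hold any of the 4 bases. The function name is illustrative.

```python
import math


def sequence_bits(sequence: str, alphabet_size: int = 4) -> float:
    """Information content in bits: length x log2(number of possible bases)."""
    return len(sequence) * math.log2(alphabet_size)


print(sequence_bits("GGCCC"))  # 10.0
```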

7
Q

Information Theory: Minimum number of bits to find our target object

A

The minimum number of bits we need to find our target object among N objects is given by log2(N). Ex. If N = 8, our minimum is 3 bits, so we need to measure in 3-bit increments. (If only 1 bit is present when we need 3, other objects will match along with our selected target, and we will not be able to identify the target accurately.)
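
The log2(N) rule can be sketched as below. The second function is an added illustration: it assumes each bit halves the set of candidates evenly, which is a simplification. Both function names are hypothetical.

```python
def min_bits(n_objects: int) -> int:
    """Smallest whole number of bits that can distinguish n_objects items."""
    if n_objects < 1:
        raise ValueError("need at least one object")
    # Integer equivalent of ceil(log2(n_objects)), without float rounding.
    return (n_objects - 1).bit_length()


def candidates_remaining(n_objects: int, bits: int) -> int:
    """Objects still matching after `bits` bits, assuming each bit halves the set."""
    return max(1, n_objects >> bits)


print(min_bits(8))                 # 3
print(candidates_remaining(8, 1))  # 4 objects still match with only 1 bit
print(candidates_remaining(8, 3))  # 1 object left: the target
```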

8
Q

What is a BLAST Search?

A

BLAST stands for Basic Local Alignment Search Tool. BLAST is a method used to compare DNA or protein sequences against a database to find similar sequences.

9
Q

How does BLAST Work?

A

You give BLAST a DNA or protein sequence and it searches a database to find similar sequences, homologous genes, evolutionary relatives, or possible functions. BLAST looks for local regions of similarity, not full-sequence matches.

10
Q

BLAST: Query Strand

A

Your sequence of interest, the query, is broken into shorter segments. These segments are then searched against the database to find potential matches.
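
The idea of breaking the query into segments and looking them up can be sketched as follows. This is a toy illustration only: real BLAST also scores and extends the hits, which is omitted here, and the function names, sequences, and word size are made up for the example.

```python
def query_words(query: str, word_size: int) -> list[str]:
    """Break the query into overlapping segments of length word_size."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def find_word_hits(query: str, subject: str, word_size: int) -> list[tuple[int, int]]:
    """Return (query_position, subject_position) pairs for exact segment matches."""
    # Index every segment of the database sequence by its start positions.
    index: dict[str, list[int]] = {}
    for pos in range(len(subject) - word_size + 1):
        index.setdefault(subject[pos:pos + word_size], []).append(pos)
    hits = []
    for q_pos, word in enumerate(query_words(query, word_size)):
        for s_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


print(query_words("ACGTAC", 4))                 # ['ACGT', 'CGTA', 'GTAC']
print(find_word_hits("ACGTAC", "TTACGTAA", 4))  # [(0, 2), (1, 3)]
```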

11
Q

What are the Two Main Statistics a BLAST Search Gives?

A

Score and E-value.
- Alignment score = the higher the score, the better the match
- E-value = the number of matches with a score at least this good that would be expected by chance (the lower the E-value, the better)

12
Q

What are the 4 Methods of DNA Sequencing?

A

Sanger (works through chain termination by dideoxynucleotides)
Illumina (sequencing by synthesis; uses fluorescence and imaging)
Nanopore (measures the change in current as nucleotides pass through a pore)
PacBio (circularizes targets and uses DNA pol to track which bases are added with fluorescence)

13
Q

Illumina Sequencing

A

Illumina uses fluorescently labelled nucleotides that block further elongation. After each labelled nucleotide is added, an image is taken, the label is washed off, and the cycle repeats.

14
Q

Step 1 and 2 of Illumina Sequencing

A
  1. Library Preparation
    - Genomic DNA is fragmented into small pieces (mechanically or via tagmentation: DNA is cut and tagged by transposomes)
    - Adapter sequences are then added to both ends of each fragment; they carry primer binding sites and flow cell binding regions (and sometimes barcode regions if multiple samples are being sequenced)
  2. Cluster Generation (Bridge Amplification)
    - DNA fragments are denatured into single strands, which bind complementary oligos on the flow cell
    - DNA pol then synthesizes a new strand and the original strand is washed away
    - The new strand bends over and its adapter hybridizes to the second type of oligo
    - DNA pol synthesizes a complementary strand and the double strand is denatured into two copies attached to oligos
    - This repeats many times, forming a cluster of identical copies (amplifies fluorescence)
    - Reverse strands are then cleaved and washed away
15
Q

Steps 3 and 4 of Illumina Sequencing

A
  3. Sequencing by Synthesis
    - Primer binds to the adapter region
    - Modified dNTPs are added; they carry fluorescent labels and 3’ reversible blockers
    - Only one nucleotide is added per cycle
    - After each cycle, the fluorescent dye and blocker are chemically removed and the next cycle begins
    - Number of cycles = read length
  4. Index Reads (if Multiplexing)
    - After read 1, the read product is washed away and the index 1 primer is added
    - The index 1 read is generated and the index 2 read follows
    - This allows for post-sequencing separation of pooled samples
16
Q

Steps 5 and 6 of Illumina Sequencing

A
  5. Paired-End Sequencing
    - After read 1, templates are regenerated and the forward strands are washed away
    - Reverse strands remain and the read 2 primer is added, so sequencing repeats from the opposite end
    - Forward and reverse reads can be paired during analysis
  6. Data Output and Analysis
    - Hundreds of millions to billions of reads
    - Each cluster comes from one original DNA fragment
    - Analysis separates reads by index, pairs forward and reverse reads, aligns reads to a reference genome, and identifies variants (SNPs, insertions, etc.)
17
Q

Pacific Bioscience Sequencing

A

Method that uses many tiny wells, each of which immobilizes a DNA polymerase and its target DNA. The target is circularized, and as each base is added it fluoresces, which is detected in the well.

18
Q

Steps 1 and 2 of PacBio Sequencing

A
  1. Library preparation
    - DNA (or cDNA from RNA) is isolated
    - Special hairpin adaptors are ligated to both ends which creates a circular DNA molecule
    - Allows the DNA pol to sequence the forward and reverse strand (multiple times)
  2. Single-Molecule Immobilization
    - The library is loaded into a SMRT cell, which has thousands of tiny wells called ZMWs (zero-mode waveguides)
    - ZMWs are illuminated and have a detection window at the bottom
19
Q

Steps 3, 4 and 5 of PacBio Sequencing

A
  3. Real-Time Sequencing
    - Each nucleotide has a fluorescent label attached to the phosphate and emits light when incorporated
    - When the light is detected, the fluorescent tag is cleaved off and the polymerase continues
  4. Circular Sequencing Advantage
    - DNA is circular, so the polymerase can go around the circle, sequence forward and reverse strands, and repeat multiple passes
    - This improves accuracy
  5. Sequencing Modes
    - Circular Consensus Sequencing (CCS): Polymerase goes around multiple times and reads the same molecule repeatedly, so accuracy is very high. Produces shorter read lengths
    - Continuous Long Read: Polymerase reads a long fragment once and produces extremely long reads. Less accurate than CCS but useful for structural variation detection
20
Q

Nanopore Sequencing

A

Method of DNA sequencing which measures the change in current as nucleotides pass through a nanopore.

21
Q

Steps of Nanopore Sequencing

A
  • Single-stranded RNA or DNA is threaded through the nanopore
  • Disruption of the ionic current is measured in signal trace (each nucleotide disrupts the current in a unique and distinguishable way)
22
Q

Comparison of 4 Methodologies

A
  • Sanger is less efficient for many reads (1 at a time) but it is high quality/accurate. Very cheap.
  • Illumina is very efficient and highly accurate. Uses two reads for each fragment (forward and reverse) and you can get billions of reads per run. Very expensive.
  • PacBio is efficient for long strands. More accurate than Illumina but less costly
  • Nanopore is good for long strands of DNA/RNA molecules and can have millions of reads but reduced accuracy. Similar pricing to PacBio.
23
Q

What is DNA Sequencing and DNA Sequence Alignment?

A

DNA Sequencing: The process of determining the exact order of nucleotides in a DNA molecule

DNA Sequence Alignment: The process of aligning two or more DNA sequences to identify regions that are similar or different

24
Q

What is a Contig?

A

A contiguous DNA sequence assembled from a collection of overlapping sequences. Ideally, the contigs together span the whole genome.

25
How do we fragment the genome?
Randomly. Millions and millions of fragments will be produced which will be sampled through DNA sequencing.
26
What are Gaps and why do they occur?
Gaps are stretches of bases missing between our contigs (when compared to a reference genome), where there is no overlap between fragments. Gaps occur because of insufficient coverage.
27
What is Coverage? (and Coverage Formula)
Coverage in DNA sequencing refers to how many times a given base (position) in the genome has been sequenced by different reads. It’s basically a measure of how well a region of DNA is supported by sequencing data. Formula: Coverage (m) = (number of reads x read length)/(genome length)
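
The coverage formula can be sketched as below. The read count, read length, and genome size in the example are hypothetical values chosen for illustration, as are the function names.

```python
import math


def coverage(number_of_reads: int, read_length: int, genome_length: int) -> float:
    """Average coverage m = (number of reads x read length) / genome length."""
    return number_of_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Rearranged formula: reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Hypothetical run: 1,000,000 reads of 150 bases over a 5,000,000 base genome.
print(coverage(1_000_000, 150, 5_000_000))  # 30.0
print(reads_needed(30, 150, 5_000_000))     # 1000000
```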
28
What is Genome Assembly?
Genome Assembly is the process of reconstructing a complete genome sequence by combining overlapping sequencing reads
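
Combining overlapping reads can be sketched as follows. This is a toy example with made-up reads that are already in the right order and orientation; real assemblers must also handle sequencing errors, repeats, and both strands. The function names and the minimum overlap are assumptions.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads through their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    return left + right[size:] if size else None


reads = ["ATGCGTAC", "GTACCTGA", "CTGATTAG"]  # made-up overlapping reads
contig = reads[0]
for read in reads[1:]:
    contig = merge(contig, read)
print(contig)  # ATGCGTACCTGATTAG
```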
29
What is a Scaffold?
A scaffold is an ordered and oriented set of contigs linked together using paired-end or long-read information, with gaps still present between contigs.
30
Why is high coverage important in genome sequencing?
High coverage improves accuracy, reduces gaps, and allows better assembly of contigs
31
What is the Poisson Distribution?
A Poisson distribution is a discrete probability distribution. It gives the probability of an event happening a certain number of times (k) within a given interval of time or space. The Poisson distribution has only one parameter, λ (lambda), which is the mean number of events.
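
For reference, the Poisson probability of seeing exactly k events when the mean is λ is e^(-λ) λ^k / k!. A minimal sketch, with an illustrative function name and λ = 2 as an example value:

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the mean number of events is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Example with a mean of 2 events per interval.
for k in range(3):
    print(k, round(poisson_pmf(k, 2.0), 4))
# 0 0.1353
# 1 0.2707
# 2 0.2707
```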
32
Why is the Poisson Distribution used in DNA Sequencing?
It models the probability of how many times a given base is sequenced when reads are randomly sampled
33
What is the biggest problem predicted by the Poisson Distribution in genome sequencing?
The 0-Coverage class - Regions that are never sequenced, creating gaps
34
What formula gives the proportion of the genome with zero coverage?
Proportion of genome with no sequence = e^(-m), where m = average coverage
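
The e^(-m) formula is the Poisson probability of k = 0 reads at a position. A minimal sketch, with hypothetical function names and a 5,000,000 base genome chosen only as an example:

```python
import math


def fraction_uncovered(mean_coverage: float) -> float:
    """Proportion of the genome with zero coverage: e^(-m)."""
    return math.exp(-mean_coverage)


def expected_uncovered_bases(mean_coverage: float, genome_length: int) -> float:
    """Expected number of bases never sequenced at average coverage m."""
    return genome_length * fraction_uncovered(mean_coverage)


print(round(fraction_uncovered(1), 4))                    # 0.3679
print(round(fraction_uncovered(5), 4))                    # 0.0067
print(round(expected_uncovered_bases(10, 5_000_000)))     # 227
```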
35
Why is low coverage insufficient for genome assembly?
Low coverage results in many unsequenced regions, creating gaps and fragmented contigs
36
Why does random sampling lead to uneven coverage?
Because sequencing fragments are randomly selected, some regions are sequenced many times while others may not be sequenced at all
37
What determines Contig length?
The amount of overlap between reads and the presence of repetitive or low complexity DNA
38
Why do repetitive DNA sequences cause assembly problems?
Reads from repetitive regions look identical, so the assembler cannot determine their correct genome position.
39
Why are unique junctions important in genome assembly?
Unique junction sequences allow repetitive regions to be placed correctly in the genome.
40
What is low complexity DNA?
DNA regions with repetitive or limited nucleotide diversity, such as repeats of one or two bases (AAACCCC or ACACACA)
41
Why does low complexity DNA cause gaps in assembly?
It reduces alignment specificity, making it difficult to determine correct sequence placement
42
Why does Alignment become harder when base diversity is low?
Because many genome regions look identical, reducing ability to uniquely match reads
43
Why can high coverage still produce many contigs?
Because repetitive and low complexity DNA prevents proper assembly, even when coverage is sufficient.
44
What information does scaffolding provide that contigs alone do not?
The relative order, orientation, and approximate spacing of contigs
45
How are scaffolds assembled?
Using paired-end read information to link contigs together
46
What are paired-end reads?
Sequencing reads obtained from both ends of the same DNA fragment
47
Why are paired end reads useful?
They provide distance and orientation information between contigs
48
How do paired-end reads determine contig order?
If two reads map to different contigs but are known to come from the same fragment, those contigs must be adjacent.
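
The linking logic can be sketched by counting read pairs whose two reads land on different contigs. The contig names and mappings below are made-up illustration data, and the function name is hypothetical.

```python
from collections import Counter


def contig_links(pair_mappings: list[tuple[str, str]]) -> Counter:
    """Count read pairs whose two reads map to different contigs."""
    links: Counter = Counter()
    for contig_a, contig_b in pair_mappings:
        if contig_a != contig_b:
            # Sort the names so (c1, c2) and (c2, c1) count as the same link.
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return links


# Each tuple: (contig hit by read 1, contig hit by read 2) for one fragment.
pairs = [("c1", "c1"), ("c1", "c2"), ("c2", "c1"), ("c2", "c3")]
print(contig_links(pairs))  # Counter({('c1', 'c2'): 2, ('c2', 'c3'): 1})
```

Pairs that support the same link more than once give more confidence that the two contigs are neighbours.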
49
Why can contigs be flipped during assembly?
Because sequencing does not initially indicate which strand or orientation the read came from
50
Why are short reads problematic for repetitive DNA?
Short reads may match multiple genome locations, making placement ambiguous
51
How do long reads help genome assembly?
Long reads span repetitive regions and extend into unique sequences, allowing correct placement
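
The difference between short and long reads in a repeat can be sketched with exact string matching. The genome and reads below are made up for illustration, and the function name is hypothetical.

```python
def find_positions(read: str, genome: str) -> list[int]:
    """All start positions (0-based) where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # allow overlapping matches
    return positions


genome = "AAGTCACACACAGGTT"      # made-up genome with a CA repeat in the middle
short_read = "CACA"              # lies entirely inside the repeat
long_read = "GTCACACACAGG"       # spans the repeat into unique sequence

print(find_positions(short_read, genome))  # [4, 6, 8]
print(find_positions(long_read, genome))   # [2]
```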
52
Name two long-read sequencing technologies
Nanopore and PacBio
53
Why combine short and long-read sequencing?
Short reads provide high accuracy and coverage while long reads resolve repetitive regions and gaps. This is called hybrid assembly
54
What is a reference genome?
A complete assembled genome used as a standard for comparison
55
Why align reads to a reference genome?
To identify mutations, insertions, deletions, and sequence variation
56
What is a SNP?
A single nucleotide polymorphism - a single base difference between genomes
57
How are deletions identified in alignment?
A base appears in reference but not in the reads
58
How are insertions identified in alignment?
A base appears in the reads but not in the reference
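
SNPs, deletions, and insertions can be read off a gapped pairwise alignment. The sketch below assumes the alignment is already computed and uses "-" for gaps; the sequences and function name are hypothetical.

```python
def classify_differences(ref_aligned: str, read_aligned: str) -> list[tuple[str, int, str]]:
    """List (type, 0-based reference position, detail) for each difference."""
    if len(ref_aligned) != len(read_aligned):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    ref_pos = 0  # position in the ungapped reference
    for ref_base, read_base in zip(ref_aligned, read_aligned):
        if ref_base == "-":
            # Base present in the read but not in the reference.
            variants.append(("insertion", ref_pos, read_base))
        elif read_base == "-":
            # Base present in the reference but not in the read.
            variants.append(("deletion", ref_pos, ref_base))
        elif ref_base != read_base:
            variants.append(("SNP", ref_pos, f"{ref_base}>{read_base}"))
        if ref_base != "-":
            ref_pos += 1
    return variants


print(classify_differences("ACGT-ACGT", "ACTTGAC-T"))
# [('SNP', 2, 'G>T'), ('insertion', 4, 'G'), ('deletion', 6, 'G')]
```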
59
Heterozygous vs Homozygous alleles
Heterozygous: two different alleles at the same genomic position. Homozygous: two copies of the same allele at that position.
60
Why doesn't genome assembly produce one contig per chromosome, even with high coverage?
Even with high coverage, assembly breaks due to repetitive DNA, low complexity DNA, and assembly ambiguity. These prevent the assembler from connecting sequences, breaking each chromosome into multiple contigs.
61
Why might the assembled genome size be larger than expected?
Assembly errors, duplicated regions, or unresolved repeats
62
Why sequence mutant strains?
To identify genetic mutations responsible for phenotype changes
63
How can sequencing help drug development?
By identifying mutations involved with diseases
64