Module 2: Exploring the Genome Flashcards by ava coonfer

What is a permutation?

An arrangement of items in a specific order. In our case, the items are the bases of a DNA sequence. E.g. codons (AGC and TAG are different permutations).

How many distinct permutations can a stretch of sequence 2 bases long have?

There are 4 possible bases (A, G, C, T) at each position, and the stretch is n = 2 bases long. Therefore, # of permutations = 4^n.

So 4^2 = 16 different permutations in a stretch of 2 bases.
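
The 4^n count can be checked by listing every sequence. This is a minimal sketch; the function names are made up for illustration and are not from the course.

```python
from itertools import product

BASES = "AGCT"

def count_permutations(n: int) -> int:
    """Number of distinct sequences of length n over four bases: 4^n."""
    return len(BASES) ** n

def list_permutations(n: int) -> list[str]:
    """Every sequence of length n, built one position at a time."""
    return ["".join(p) for p in product(BASES, repeat=n)]

dimers = list_permutations(2)
print(count_permutations(2), len(dimers))  # both are 16
print(dimers)
```

Listing is only practical for small n, since the count grows by a factor of 4 with every added base.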

How do we determine the number of sites of n bases in a genome of size m?

Example: n = 1000 and m = 5.0x10^6

number of sites = m - (n - 1)
= (5.0x10^6) - (1000 - 1)
= 4,999,001 ≈ 4.999x10^6

There are about 4.999x10^6 sites of n = 1000 in a 5.0x10^6 base genome.

Note: this assumes single-stranded DNA. For double-stranded DNA, double the answer.
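
The same calculation as a small function. This is a sketch; the name `number_of_sites` and the `double_stranded` flag are illustrative choices, not course notation.

```python
def number_of_sites(m: int, n: int, double_stranded: bool = False) -> int:
    """Number of positions where a window of n bases fits in a genome of m bases."""
    if n > m:
        return 0  # the window is longer than the genome
    sites = m - (n - 1)
    # Each strand offers its own set of sites, so double-stranded DNA doubles the count.
    return 2 * sites if double_stranded else sites

print(number_of_sites(5_000_000, 1000))                        # 4999001
print(number_of_sites(5_000_000, 1000, double_stranded=True))  # 9998002
```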

How long does a DNA sequence have to be for us to expect it to be unique within a given genome?

We can expect a sequence to be unique in a genome when the number of permutations is greater than the number of sites of n bases in the genome.

How do we calculate n when the genome size is known?

A bacterial genome is approx. 8x10^6 bases. When is the number of permutations greater than the length of the genome? Set 4^n = 8x10^6, then solve for n using logs: n = log(8x10^6) / log(4). Round up to the next whole number of bases.
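
A worked version of the log calculation, as a sketch. Like the card, it uses the genome length as a stand-in for the number of sites; the function name is hypothetical.

```python
import math

def min_unique_length(genome_size: int) -> int:
    """Smallest n for which 4^n exceeds genome_size."""
    n = 1
    while 4 ** n <= genome_size:
        n += 1
    return n

genome = 8_000_000
print(math.log(genome, 4))        # about 11.47, so round up
print(min_unique_length(genome))  # 12
```

Checking both sides: 4^11 = 4,194,304 is smaller than 8x10^6, while 4^12 = 16,777,216 is larger, so a 12-base sequence is the shortest we would expect to be unique.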

What are Universal Bits?

Universal bits refer to measuring DNA sequence information in bits of information, rather than just counting base pairs. The information content at any position in DNA is 2 bits, because each position holds 1 of 4 possible bases and log2(4) = 2.
Ex. GGCCC is 5 bases long, so it carries 2 x 5 = 10 bits.

Information Theory: Minimum amount of bits to find our target object

The minimum number of bits we need to find our target object is given by log2(N), where N is the number of objects we are choosing from. Ex. If N = 8, the minimum is log2(8) = 3 bits, so we need at least 3 bits of information. (If we have only 1 bit when we need 3, the search will match additional objects besides our target, and we will not be able to identify the target accurately.)
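
A short sketch tying the log2(N) rule to DNA, assuming 2 bits per base. The function names are illustrative only.

```python
BITS_PER_BASE = 2  # log2(4): one of four bases at each position

def sequence_bits(seq: str) -> int:
    """Information content of a DNA sequence in bits."""
    return BITS_PER_BASE * len(seq)

def min_bits(n_objects: int) -> int:
    """Smallest whole number of bits that can distinguish n_objects items (log2 N, rounded up)."""
    return (n_objects - 1).bit_length()

def min_bases(n_sites: int) -> int:
    """Smallest number of bases whose bits cover min_bits(n_sites)."""
    return -(-min_bits(n_sites) // BITS_PER_BASE)  # ceiling division

print(sequence_bits("GGCCC"))  # 10
print(min_bits(8))             # 3
print(min_bases(8_000_000))    # 12
```

The last line gives the same 12 bases obtained by solving 4^n = 8x10^6, since 23 bits rounds up to 12 bases at 2 bits each.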

What is a BLAST Search?

BLAST stands for Basic Local Alignment Search Tool. BLAST is a method used to compare DNA or protein sequences against a database to find similar sequences.

How does BLAST Work?

You give BLAST a DNA or protein sequence and it searches a database to find similar sequences, homologous genes, evolutionary relatives, or possible functions. BLAST looks for local regions of similarity, not full sequence matches.

BLAST: Query Strand

You break your sequence of interest, the Query, into shorter segments. These segments are then searched across the database where potential matches will be found.
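
The idea of breaking the query into short segments and looking them up can be sketched as below. This is only a toy version of the seeding step: real BLAST also extends and scores each hit, and the sequences, word size, and function names here are made up for illustration.

```python
from collections import defaultdict

def words(seq: str, w: int) -> list[tuple[int, str]]:
    """All overlapping words of length w, each with its start position."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]

def build_index(database: dict[str, str], w: int) -> dict:
    """Map every word in every database sequence to where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in words(seq, w):
            index[word].append((name, pos))
    return index

def seed_hits(query: str, index: dict, w: int) -> list[tuple[int, str, int]]:
    """Exact word matches as (query position, database sequence, database position)."""
    hits = []
    for qpos, word in words(query, w):
        for name, dpos in index.get(word, []):
            hits.append((qpos, name, dpos))
    return hits

database = {"seqA": "TTGACCGTAAGC", "seqB": "GGGGACCGTTTT"}
index = build_index(database, w=4)
hits = seed_hits("GACCGTA", index, w=4)
for hit in hits:
    print(hit)
```

In this toy data, seqA matches all four query words while seqB matches only the first three, which is the kind of local similarity the search is looking for.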

What are the Two Main Statistics BLAST Search Gives?

Score and E-value.
- Alignment score = the higher the score, the better the match
- E-value = the number of matches with a score at least this good that we would expect by chance in a database of this size (the lower the E-value, the better)

What are the 4 Methods of DNA Sequencing?

Sanger (functions through chain termination and di-deoxy nucleotides)
Illumina (DNA sequencing by synthesis, uses fluorescence and photo imaging)
Nanopore (measures the change in current as nucleotides pass through a pore)
PacBio (circularizes targets and uses DNA pol with fluorescent nucleotides to track which bases are added)

Illumina Sequencing

Illumina uses fluorescently labelled nucleotides that reversibly block further elongation. An image is taken after each fluorescent nucleotide is added, the dye and blocker are then removed, and the cycle repeats.

Step 1 and 2 of Illumina Sequencing

Library Preparation
- Genomic DNA is fragmented into small pieces (mechanically or via tagmentation: transposomes cut and tag the DNA)
- Adapter sequences are then added to both ends of each fragment, carrying primer binding sites and flow cell binding regions (and sometimes barcode regions if multiple samples are being sequenced)
Cluster Generation (Bridge Amplification)
- DNA fragments are denatured into single strands, which bind complementary oligos on the flow cell
- DNA pol then synthesizes a new strand and the original strand is washed away
- The new strand bends over and its adapter hybridizes to the second type of oligo
- DNA pol synthesizes a complementary strand, and the double strand is denatured into two copies attached to oligos
- This repeats many times, forming a cluster of identical copies (amplifies the fluorescent signal)
- The reverse strands are then cleaved and washed away

Steps 3 and 4 of Illumina Sequencing

Sequencing by Synthesis
- Primer binds to the adaptor region
- Modified dNTPs are added and have fluorescent labels and 3’ reversible blockers
- Only one nucleotide is added per cycle
- After cycle, fluorescent dye and blocker are chemically removed and next cycle begins
- Number of cycles = read length
Index Reads (if Multiplexing)
- After read 1, read product is washed away and index 1 primer is added
- Index read is generated and the index 2 read follows
- this allows for post-sequencing separation of pooled samples
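
Separating pooled samples by their index reads can be sketched as a simple lookup. The index sequences, sample names, and function name below are invented for illustration, and real demultiplexing tools typically also tolerate mismatches in the index.

```python
from collections import defaultdict

def demultiplex(reads: list[tuple[str, str]], sample_by_index: dict[str, str]) -> dict:
    """Group (index sequence, read sequence) pairs by the sample that owns each index."""
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        # Reads whose index is not recognized are kept apart rather than guessed.
        sample = sample_by_index.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)

sample_by_index = {"ACGT": "sample_1", "TTAG": "sample_2"}  # hypothetical barcodes
reads = [("ACGT", "GGCATT"), ("TTAG", "CCGTAA"), ("ACGT", "ATTGCA"), ("GGGG", "TTTTTT")]
print(demultiplex(reads, sample_by_index))
```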

Steps 5 and 6 of Illumina Sequencing

Paired-End Sequencing
- After read 1, templates are regenerated and the forward strands are washed away.
- Reverse strands remain and the read 2 primer is added, so sequencing repeats from the opposite end
- Forward and reverse reads can be paired during analysis
Data Output and Analysis
- Hundreds of millions to billions of reads
- Each cluster derives from one original DNA fragment
- Analysis separates reads by index, pairs forward and reverse reads, aligns reads to reference genome, and identifies variants (SNPs, insertions, etc.)

Pacific Bioscience Sequencing

Method which uses many tiny wells, each holding an immobilized DNA polymerase and its target DNA. The target is circularized, and each base fluoresces as it is added, which is tracked in the microwells.

Steps 1 and 2 of PacBio Sequencing

Library preparation
- DNA (or cDNA from RNA) is isolated
- Special hairpin adaptors are ligated to both ends which creates a circular DNA molecule
- Allows the DNA pol to sequence the forward and reverse strand (multiple times)
Single-Molecule Immobilization
- The library is loaded into a SMRT cell, which has thousands of tiny wells called ZMWs (Zero-Mode Waveguides)
- ZMWs are illuminated and have a detection window at the bottom

Steps 3, 4 and 5 of PacBio Sequencing

Real-Time Sequencing
- Each nucleotide carries a fluorescent label attached to its phosphate and emits light while it is being incorporated
- The fluorescent tag is cleaved off with the phosphate as the base is incorporated, and the polymerase continues
Circular Sequencing Advantage
- DNA is circular so polymerase can go around the circle, sequence forward and reverse strands, and repeat multiple passes
- This improves accuracy
Sequencing Modes
- Circular Consensus Sequencing (CCS): Polymerase goes around multiple times and reads the same molecule repeatedly, so accuracy is very high. Produces shorter read lengths
- Continuous Long Read: Polymerase reads long fragment once and produces extremely long reads. Less accurate than CCS but useful for structural variation detection

Nanopore Sequencing

Method of DNA sequencing which measures the change in ionic current as nucleotides pass through a nanopore.

Steps of Nanopore Sequencing

Single stranded RNA or DNA is threaded through the nanopore
Disruption of the ionic current is measured in a signal trace (each nucleotide disrupts the current in a unique and distinguishable way)

Comparison of 4 Methodologies

Sanger is less efficient for many reads (1 at a time) but it is high quality/accurate. Very cheap.
Illumina is very efficient and highly accurate. Uses two reads for each fragment (forward and reverse) and you can get billions of reads per run. Very expensive.
PacBio is efficient for long strands. More accurate than Illumina but less costly.
Nanopore is good for long strands of DNA/RNA and can produce millions of reads, but with reduced accuracy. Similar pricing to PacBio.

What is DNA Sequencing and DNA Sequence Alignment?

DNA Sequencing: The process of determining the exact order of nucleotides in a DNA molecule

DNA Sequence Alignment: The process of aligning two or more DNA sequences to identify regions that are similar or different

What is a Contig?

A continuous DNA sequence from a collection of overlapping sequences which will hopefully represent/span the whole genome.

How do we fragment the genome?

Randomly. Millions and millions of fragments will be produced which will be sampled through DNA sequencing.

What are Gaps and why do they occur?

Gaps exist between our contigs (when compared to a reference genome) where bases are missing because no fragments overlap. Gaps occur because of insufficient coverage.

What is Coverage? (and Coverage Formula)

Coverage in DNA sequencing refers to how many times a given base (position) in the genome has been sequenced by different reads. It’s basically a measure of how well a region of DNA is supported by sequencing data. Formula: Coverage (m) = (number of reads x read length)/(genome length)
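
The coverage formula as a small function. The read count, read length, and genome size in the example are illustrative numbers, not values from the course.

```python
def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average coverage m = (number of reads x read length) / genome length."""
    return num_reads * read_length / genome_length

# Illustrative run: 1,000,000 reads of 150 bases over a 5,000,000 base genome.
m = coverage(1_000_000, 150, 5_000_000)
print(m)  # 30.0, i.e. each base is sequenced 30 times on average
```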

What is Genome Assembly?

Genome Assembly is the process of reconstructing a complete genome sequence by combining overlapping sequencing reads
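
Combining overlapping reads can be sketched with a suffix/prefix overlap check. This is a toy example that assumes error-free reads given in the correct order; real assemblers handle errors, orientation, and many more reads. The names and the minimum overlap of 3 are illustrative.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a: str, b: str, min_len: int = 3):
    """Join a and b across their overlap, or return None if they do not overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None

reads = ["ATGGCCT", "GCCTTAAG", "TAAGCGA"]
contig = reads[0]
for read in reads[1:]:
    contig = merge(contig, read)
print(contig)  # ATGGCCTTAAGCGA
```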

What is a Scaffold?

A scaffold is an ordered and oriented set of contigs linked together using paired-end or long-read information, with gaps still present between contigs.

Why is high coverage important in genome sequencing?

High coverage improves accuracy, reduces gaps, and allows better assembly of contigs

What is the Poisson Distribution?

A Poisson distribution is a discrete probability distribution. It gives the probability of an event happening a certain number of times (k) within a given interval of time or space. The Poisson distribution has only one parameter, λ (lambda), which is the mean number of events.

Why is the Poisson Distribution used in DNA Sequencing?

It models the probability of how many times a given base is sequenced when reads are randomly sampled

What is the biggest problem predicted by the Poisson Distribution in genome sequencing?

The 0-Coverage class - Regions that are never sequenced, creating gaps

What formula gives the proportion of the genome with zero coverage?

Proportion of genome with no sequence = e^(-m) Where m = average coverage
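
A sketch of the Poisson calculation, with λ set to the average coverage m. The function names and the 5,000,000 base genome used in the loop are illustrative assumptions.

```python
import math

def poisson_pmf(k: int, m: float) -> float:
    """Probability that a base is sequenced exactly k times when average coverage is m."""
    return math.exp(-m) * m ** k / math.factorial(k)

def zero_coverage_fraction(m: float) -> float:
    """Proportion of the genome expected to have no reads at all: e^(-m)."""
    return math.exp(-m)

genome_length = 5_000_000  # illustrative genome size
for m in (1, 5, 10):
    missing = zero_coverage_fraction(m) * genome_length
    print(f"m = {m}: {zero_coverage_fraction(m):.6f} of genome, about {missing:.0f} bases unsequenced")
```

At m = 1 roughly 37% of the genome is expected to be missed, which is why low coverage leaves so many gaps.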

Why is low coverage insufficient for genome assembly?

Low coverage results in many unsequenced regions, creating gaps and fragmented contigs

Why does random sampling lead to uneven coverage?

Because sequencing fragments are randomly selected, some regions are sequenced many times while others may not be sequenced at all

What determines Contig length?

The amount of overlap between reads and the presence of repetitive or low complexity DNA

Why do repetitive DNA sequences cause assembly problems?

Reads from repetitive regions look identical, so the assembler cannot determine their correct genome position.

Why are unique junctions important in genome assembly?

Unique junction sequences allow repetitive regions to be placed correctly in the genome.

What is low complexity DNA?

DNA regions with repetitive or limited nucleotide diversity, such as repeats of one or two bases (AAACCCC or ACACACA)

Why does low complexity DNA cause gaps in assembly?

It reduces alignment specificity, making it difficult to determine correct sequence placement

Why does Alignment become harder when base diversity is low?

Because many genome regions look identical, reducing ability to uniquely match reads

Why can high coverage still produce many contigs?

Because repetitive and low complexity DNA prevents proper assembly, even when coverage is sufficient.

What information does scaffolding provide that contigs alone do not?

The relative order, orientation, and approximate spacing of contigs

How are scaffolds assembled?

Using paired-end reads information to link contigs together

What are paired-end reads?

Sequencing reads obtained from both ends of the same DNA fragment

Why are paired end reads useful?

They provide distance and orientation information between contigs

How do paired-end reads determine contig order?

If two reads map to different contigs but are known to come from the same fragment, those contigs must lie close together in the genome, within about one fragment length of each other.
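
One way to picture this is to count how many read pairs connect each pair of contigs. This is a simplified sketch: the input format (one pair of contig names per read pair) and the function name are assumptions, and real scaffolders also use read orientation and fragment size.

```python
from collections import Counter

def contig_links(pair_mappings: list[tuple[str, str]]) -> Counter:
    """Count read pairs whose two reads map to different contigs."""
    links = Counter()
    for contig_a, contig_b in pair_mappings:
        if contig_a != contig_b:
            # Sort so that (c1, c2) and (c2, c1) count as the same link.
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return links

# Hypothetical mappings: where read 1 and read 2 of each pair landed.
pairs = [("c1", "c1"), ("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c1", "c2")]
links = contig_links(pairs)
print(links.most_common())
```

Contig pairs supported by many read pairs are the strongest candidates for being neighbours in a scaffold.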

Why can contigs be flipped during assembly?

Because sequencing does not initially indicate which strand or orientation the read came from

Why are short reads problematic for repetitive DNA?

Short reads may match multiple genome locations, making placement ambiguous

How do long reads help genome assembly?

Long reads span repetitive regions and extend into unique sequences, allowing correct placement
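
A toy illustration of why read length matters, assuming error-free reads and exact matching. The genome below is invented: it contains the repeat CAGCAG twice, each copy flanked by unique sequence.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Every start position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions

genome = "TTGA" + "CAGCAG" + "ATCC" + "CAGCAG" + "GGTA"

short_read = "CAGCAG"     # lies entirely inside the repeat
long_read = "GACAGCAGAT"  # spans the repeat plus unique flanks

print(find_all(genome, short_read))  # [4, 14]: two possible placements
print(find_all(genome, long_read))   # [2]: one placement
```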

Name two long-read sequencing technologies

Nanopore and PacBio

Why combine short and long-read sequencing?

Short reads provide high accuracy and coverage while long reads resolve repetitive regions and gaps. This is called hybrid assembly

What is a reference genome?

A complete assembled genome used as a standard for comparison

Why align reads to a reference genome?

To identify mutations, insertions, deletions, and sequence variation

What is a SNP?

A single nucleotide polymorphism - a single base difference between genomes

How are deletions identified in alignment?

A base appears in the reference but not in the reads

How are insertions identified in alignment?

A base appears in the reads but not in the reference
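
SNPs, deletions, and insertions can all be read off a gapped alignment. This is a minimal sketch that assumes the two sequences are already aligned to equal length with "-" marking gaps; the function name and the example alignment are made up.

```python
def call_variants(ref_aln: str, read_aln: str) -> list[tuple[str, int, str]]:
    """Compare two aligned strings column by column and report the differences."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    ref_pos = 0  # position in the ungapped reference
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-":
            # Base in the read but not in the reference, placed before ref_pos.
            variants.append(("insertion", ref_pos, read_base))
        elif read_base == "-":
            # Base in the reference but not in the read.
            variants.append(("deletion", ref_pos, ref_base))
        elif ref_base != read_base:
            variants.append(("SNP", ref_pos, f"{ref_base}>{read_base}"))
        if ref_base != "-":
            ref_pos += 1
    return variants

print(call_variants("ACGT-ACGT", "ACCTGA-GT"))
# [('SNP', 2, 'G>C'), ('insertion', 4, 'G'), ('deletion', 5, 'C')]
```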

Heterozygous vs Homozygous alleles

Heterozygous: two different alleles at the same genomic position. Homozygous: two copies of the same allele at the same genomic position.

Why doesn't genome assembly produce one contig per chromosome, even with high coverage?

Even with high coverage, assembly breaks due to repetitive DNA, low complexity DNA, and assembly ambiguity. This prevents the assembler from connecting sequences, breaking the chromosome into multiple contigs.

Why might the assembled genome size be larger than expected?

Assembly errors, duplicated regions, or unresolved repeats

Why sequence mutant strains?

To identify genetic mutations responsible for phenotype changes

How can sequencing help drug development?

By identifying mutations involved with diseases

(64 cards)