Miscellaneous Flashcards

(151 cards)

1
Q

ipSAE

A

ipSAE (interaction prediction Score from Aligned Errors): an interface-focused confidence score for complexes computed from predicted aligned errors (PAE); higher is better.

2
Q

pLDDT

A

pLDDT (predicted Local Distance Difference Test): per-residue confidence score (0–100); higher means the local geometry is more reliable.

3
Q

Which metrics can be used to assess predicted protein structures?

A

Per-residue/local: pLDDT, lDDT.
Global/fold: pTM, TM-score, GDT_TS, RMSD.
Complex/interface: ipTM, DockQ, (ipSAE), interface RMSD.
Model quality/physics: clashes, Ramachandran/MolProbity scores.
Relative placement: PAE heatmap.

4
Q

How does AlphaFold model protein complexes?

A

AlphaFold-Multimer predicts all chains jointly: it concatenates chain sequences with chain breaks, builds/uses paired MSAs for interacting partners, lets attention/triangle updates operate across chains, and ranks models with multimer confidence (e.g., ipTM + pTM).

5
Q

How does RoseTTAFold differ from AlphaFold?

A

RoseTTAFold uses a 3-track network (1D sequence, 2D pair/distance map, and 3D coordinates) with information exchange between tracks (incl. SE(3)-equivariant 3D updates). AlphaFold2’s trunk is the Evoformer (MSA + pair) followed by a separate structure module; AlphaFold2 achieved higher CASP14 accuracy, while RoseTTAFold is a different, more explicitly 3D-aware three-track design.

6
Q

What modules does AlphaFold have?

A

AlphaFold2: input embedder, Evoformer trunk (MSA + pair representations), Structure Module, recycling, and confidence heads. AlphaFold3: input embedder, MSA module, Pairformer trunk, Diffusion Module, and confidence heads.

7
Q

What does the Pairformer do in AlphaFold?

A

In AlphaFold3, the Pairformer is the main trunk that updates sequence/pair representations (and conditioning features) with attention + triangle-style operations, producing interaction-aware features that guide the downstream diffusion/structure generation.

8
Q

What does the Evoformer do in AlphaFold?

A

In AlphaFold2, the Evoformer iteratively updates two representations—MSA (evolutionary info) and pair (residue–residue relations)—using attention and triangle updates, producing features used by the Structure Module to place atoms.

9
Q

How does OpenFold differ from AlphaFold?

A

OpenFold is a trainable, fully open-source PyTorch reimplementation/retraining of AlphaFold2. It reproduces the AF2 architecture closely but provides training code, configurable pipelines, and engineering changes/optimizations for research and different hardware.

10
Q

How does AlphaFold3 differ from AlphaFold2?

A

AlphaFold3 uses a diffusion-based all-atom generative approach and can jointly model complexes beyond proteins (DNA/RNA, small molecules/ligands, ions, modified residues). AF2 mainly targets proteins (and protein–protein complexes via Multimer) with an Evoformer + Structure Module pipeline.

11
Q

How does AlphaFold2 differ from AlphaFold1?

A

AlphaFold2 is end-to-end: it uses attention-based Evoformer + a learned structure module (with recycling) to directly output 3D coordinates and confidence. AlphaFold1 relied more on predicting distance/angle distributions and using separate structure-building/optimization steps with stronger template/fragment-style components.

12
Q

How does Boltz differ from AlphaFold?

A

Boltz (Boltz-1/2) is an open-source family of diffusion-based biomolecular interaction models aimed at AlphaFold3-like complex prediction (proteins with ligands/nucleic acids, etc.). AlphaFold3 is the DeepMind/Isomorphic model; Boltz emphasizes openness (weights/code) and (in Boltz-2) affinity prediction.

13
Q

How does Boltz-2 differ from Boltz-1?

A

Boltz-2 extends Boltz-1 with explicit binding-affinity prediction (e.g., protein–ligand), broader multimodal performance improvements, and updated training/engineering; Boltz-1 focused primarily on high-accuracy complex structure prediction.

14
Q

How much did it cost to train AlphaFold, Boltz and OpenFold?

A

Exact $ costs are generally not disclosed; common reported compute: AlphaFold2 used ~128 TPUv3 cores and took ~11 days to converge; OpenFold reported training on 128 A100 GPUs in ~8+ days. Boltz-1/2 training hardware/time is less consistently public; Boltz-2 training is reported as enabled by Recursion’s BioHive-2 supercomputer (exact cost depends on pricing/ownership).

15
Q

What kinds of library construction methods are there?

A

Common NGS library types: (1) shotgun/fragmentation + adapter ligation, (2) amplicon-PCR libraries, (3) tagmentation-based (e.g., Nextera), (4) hybrid-capture/enrichment libraries, (5) long-read ligation/rapid kits. For synthetic DNA/protein libraries: Gibson/Golden Gate/restriction-ligation assembly, and display libraries (phage/yeast/mRNA) for variants.

16
Q

What loss function was used in training AlphaFold?

A

Primary structural loss: FAPE (Frame Aligned Point Error). Plus auxiliary losses such as distogram cross-entropy, masked-MSA prediction, torsion/angle losses, structural violation/clash losses, and confidence-head losses (pLDDT/PAE).

17
Q

What are some key PyMOL commands?

A

load/fetch, remove, select, show/hide (cartoon, sticks, spheres, surface), color, spectrum, zoom/orient, center, align/super, rms/rms_cur, distance (dist), label, save, png.

18
Q

What is a SELECT statement in SQL?

A

A query that retrieves data from one or more tables/views: SELECT <columns/expressions> FROM <table> with optional WHERE, GROUP BY, HAVING, ORDER BY, LIMIT.
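A minimal runnable sketch of such a query, using Python's sqlite3 and a hypothetical `proteins` table:

```python
import sqlite3

# Build a tiny in-memory table (illustrative data), then SELECT with
# WHERE / ORDER BY / LIMIT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proteins (name TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO proteins VALUES (?, ?)",
    [("ubiquitin", 76), ("lysozyme", 129), ("myoglobin", 153)],
)

rows = conn.execute(
    "SELECT name, length FROM proteins "
    "WHERE length > 100 ORDER BY length DESC LIMIT 2"
).fetchall()
print(rows)  # [('myoglobin', 153), ('lysozyme', 129)]
conn.close()
```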

19
Q

Which joins are there in SQL?

A

INNER JOIN, LEFT (OUTER) JOIN, RIGHT (OUTER) JOIN, FULL (OUTER) JOIN, CROSS JOIN; plus SELF JOIN (joining a table to itself).
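A sketch contrasting INNER vs LEFT JOIN on two hypothetical tables (sqlite3; note SQLite only added RIGHT/FULL OUTER JOIN in version 3.39):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes (id INTEGER, symbol TEXT);
CREATE TABLE hits  (gene_id INTEGER, score REAL);
INSERT INTO genes VALUES (1, 'TP53'), (2, 'BRCA1'), (3, 'EGFR');
INSERT INTO hits  VALUES (1, 0.9), (3, 0.4);
""")

# INNER JOIN keeps only matching rows; LEFT JOIN keeps all left rows,
# filling missing right-side columns with NULL.
inner = conn.execute(
    "SELECT g.symbol, h.score FROM genes g "
    "JOIN hits h ON h.gene_id = g.id ORDER BY g.id"
).fetchall()
left = conn.execute(
    "SELECT g.symbol, h.score FROM genes g "
    "LEFT JOIN hits h ON h.gene_id = g.id ORDER BY g.id"
).fetchall()
print(inner)  # [('TP53', 0.9), ('EGFR', 0.4)]
print(left)   # [('TP53', 0.9), ('BRCA1', None), ('EGFR', 0.4)]
conn.close()
```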

20
Q

How are the attentions calculated?

A

Scaled dot-product attention: scores = QKᵀ/√dₖ (+ mask), weights = softmax(scores), output = weights·V. Multi-head attention repeats this in parallel heads then concatenates and projects.
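The formula above, as a single-head numpy sketch (shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """weights = softmax(Q·Kᵀ/√d_k + mask); output = weights·V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)        # (..., L_q, L_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)             # masked positions → ~-inf
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(2, 5, 8)) for _ in range(3))  # (batch, length, d_k)
out, w = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention would run this per head on sliced projections, then concatenate and apply an output projection.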

21
Q

What are the dimensions of matrix multiplication?

A

If A is (m×n) and B is (n×p), then C = A·B is (m×p). The inner dimensions (n) must match.

22
Q

What is the complexity of matrix multiplication?

A

Naïve multiplication is O(m·n·p). For square n×n it’s O(n³); faster algorithms exist (e.g., Strassen ~O(n^{2.81})) but are less common in practice.

23
Q

What is small o() notation?

A

f(n) = o(g(n)) means f grows strictly slower than g: limₙ→∞ f(n)/g(n) = 0.

24
Q

What is large O() notation?

A

f(n) = O(g(n)) means f is asymptotically bounded above by g up to a constant: ∃c,n₀ s.t. f(n) ≤ c·g(n) for n≥n₀.

25
How do you calculate an MSA?
1) Find homologs (BLAST/HHblits/JackHMMER against sequence DBs). 2) Build an initial alignment (progressive/HMM-based). 3) Iteratively refine/realign; optionally weight sequences and trim/filter. Output is an aligned matrix of sequences with gaps.
26
What alignment algorithms are there?
Pairwise: Needleman–Wunsch (global), Smith–Waterman (local), Gotoh (affine gaps). Heuristics: BLAST. Profile/HMM: HMMER, HHsearch/HHalign. Multiple alignment: Clustal Omega, MAFFT, MUSCLE, T-Coffee; plus iterative refinement and consistency methods.
27
Why can we not calculate MSAs for antibodies?
Antibodies are generated by V(D)J recombination and somatic hypermutation, so each sequence (especially CDR3) is often unique with few true homologs; the resulting “MSA” is shallow/poorly defined and doesn’t provide the evolutionary constraints typical protein MSAs do.
28
What experimental methods can be used to determine antibody specificity?
ELISA; Western blot; flow cytometry; immunofluorescence/IHC; immunoprecipitation; protein/peptide microarrays; competition/epitope binning; knockout/knockdown controls; cross-reactivity panels; peptide scanning/alanine scanning for epitope mapping.
29
What experimental methods can be used to determine antibody affinity?
SPR (Biacore), BLI (Octet), ITC, microscale thermophoresis (MST), KinExA, equilibrium binding titrations (e.g., flow/ELISA-based) to estimate Kd and kinetics.
30
How does flow cytometry work?
Cells/particles are labeled (often with fluorescent antibodies) and pass single-file through a laser; detectors measure forward scatter (size), side scatter (granularity), and fluorescence channels; gating/compensation yields populations and marker expression.
31
How does the MSA Transformer work?
It takes an MSA as a 2D input (sequences × positions) and applies axial attention: row attention within sequences and column attention across sequences at each site. Trained with masked-language modeling on MSAs, it learns embeddings and attention patterns that correlate with contacts/structure.
32
What is special about AbLang?
AbLang is an antibody-specific language model trained on large antibody repertoires (heavy or light chains). It captures antibody-specific patterns (framework/CDRs) and can impute missing residues and produce useful embeddings better than general protein LMs for antibody-centric tasks.
33
Which tokenizer algorithms are there?
Character/byte tokenization; WordPiece; BPE (incl. byte-level BPE); Unigram language-model tokenizer (SentencePiece); whitespace/regex tokenizers; hybrid approaches (e.g., sentencepiece with byte fallback).
34
How can positional embeddings be calculated?
Absolute: learned position embeddings or fixed sinusoidal. Relative: learned relative bias/embeddings (e.g., Shaw), ALiBi. Rotary/phase: RoPE (rotary positional embeddings) and RoPE variants/scaling.
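The fixed sinusoidal variant can be computed directly (a sketch of the classic PE[p, 2i] = sin(p/10000^(2i/d)), PE[p, 2i+1] = cos(·) scheme):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed absolute positional encodings (Transformer-style sinusoids)."""
    pos = np.arange(seq_len)[:, None]                  # (L, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (L, d/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dims: sine
    pe[:, 1::2] = np.cos(angles)   # odd dims: cosine
    return pe

pe = sinusoidal_positions(seq_len=16, d_model=8)
# row 0 is [0, 1, 0, 1, ...] since sin(0)=0, cos(0)=1
```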
35
What are methods to extend the context length in transformer models?
Efficient attention (FlashAttention, sparse/windowed, block-sparse), linear/approx attention, recurrence/memory (Transformer-XL), retrieval augmentation (RAG), RoPE scaling/YaRN/NTK scaling, ALiBi, chunking + sliding window, KV-cache optimizations, and sequence compression/summarization.
36
How does a random forest model work?
An ensemble of decision trees trained on bootstrap-resampled data with random feature subsampling at each split; predictions are averaged (regression) or majority-voted (classification) to reduce variance.
37
What is the decision criterion in random forests?
For classification: maximize information gain / reduce impurity (Gini or entropy). For regression: minimize MSE/variance of targets in child nodes.
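The Gini criterion for classification can be sketched in a few lines (toy labels, illustrative split):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - Σ p_k²; 0 for a pure node, max for a uniform mix."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, left, right):
    """Impurity reduction from a candidate split (what the tree maximizes)."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = ["A"] * 5 + ["B"] * 5
perfect = gini_gain(parent, ["A"] * 5, ["B"] * 5)                       # pure children
mixed = gini_gain(parent, ["A", "A", "A", "B", "B"],
                  ["A", "A", "B", "B", "B"])                            # barely informative
```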
38
What is gradient boosting?
An additive ensemble method that builds weak learners (often trees) sequentially, each new learner fitting the residuals/negative gradient of the current model (e.g., XGBoost/LightGBM/CatBoost).
39
What are some clustering methods?
k-means, hierarchical (agglomerative), DBSCAN/HDBSCAN, Gaussian mixture models (EM), spectral clustering, mean-shift, affinity propagation.
40
What is performance@K?
A top‑K evaluation: measure how good the first K ranked predictions are (e.g., precision@K, recall@K, hit-rate@K, NDCG@K).
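For instance, precision@K as a small function (hypothetical document IDs):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are in the relevant set."""
    top_k = ranked_ids[:k]
    return sum(1 for item in top_k if item in relevant_ids) / k

ranked = ["d3", "d1", "d7", "d2", "d9"]   # model ranking, best first
relevant = {"d1", "d2", "d4"}             # ground-truth relevant set
p3 = precision_at_k(ranked, relevant, 3)  # top-3 = d3, d1, d7 → 1/3
```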
41
How can you create a model with PyTorch?
Define an nn.Module (layers + forward), choose a loss, optimizer, and DataLoader; run a training loop: forward → loss → backward → optimizer.step() with eval on a validation set.
42
How does JAX differ from PyTorch?
JAX is more functional and emphasizes composable transforms (grad, jit, vmap, pmap) compiled with XLA; arrays are typically immutable and you often write pure functions. PyTorch is imperative/eager by default (with optional torch.compile) and uses nn.Module-centric APIs.
43
What is data parallel?
Replicate the full model on each device, split the batch across devices, compute gradients locally, then all-reduce gradients to keep parameters in sync.
44
What is pipeline parallel?
Split the model layers into stages across devices and run microbatches through the stages in a pipeline (often with 1F1B scheduling) to increase throughput and fit larger models.
45
What is model parallel?
Partition the model’s parameters across devices (e.g., tensor/“sharded” weights) so a single layer’s computation is split across GPUs.
46
What is 3D parallel?
Combining data parallel + tensor/model parallel + pipeline parallel (often called 3D parallelism) to scale to very large models efficiently.
47
How would you parallelize a large model across GPUs?
Typical recipe: tensor parallel within a node (shard big matmuls), pipeline parallel across nodes (split layers), and data parallel across replicas; add optimizer/gradient sharding (ZeRO/FSDP) and activation checkpointing to reduce memory.
48
Which operations are needed to implement pipeline parallel from scratch?
Partition layers into stages; microbatch the input; send/recv activations between stages; run forward passes; run backward passes with gradient send/recv; accumulate gradients over microbatches; apply optimizer step; optionally checkpoint/recompute activations and manage pipeline scheduling (e.g., 1F1B).
49
What are the dimensions of keys, queries and values?
Common layout: Q,K,V are (B, H, L, d) where B=batch, H=#heads, L=sequence length, d=head_dim (so d_model = H·d). Attention weights are (B, H, L, L).
50
What are model embeddings?
Learned dense vector representations of discrete inputs (tokens, residues, etc.) produced by an embedding matrix; can also refer to intermediate hidden states used as representations for downstream tasks.
51
How can we get embeddings from a diffusion model?
Take intermediate activations from the denoiser/UNet (or diffusion transformer) as features; or use the latent representation (e.g., VAE latent in latent diffusion) as an embedding; optionally average‑pool across timesteps or use the final denoised latent.
52
How does LoRA fine-tuning work?
Freeze the base weights W and learn a low-rank update ΔW = (α/r)·B·A (rank r). During training only A and B are updated; at inference ΔW is added (or merged) into W.
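A minimal numpy sketch of the low-rank update (dimensions and init are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 32, 4, 8

W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection (rank r)
B = np.zeros((d_out, r))                 # trainable up-projection, init 0 so ΔW starts at 0

def lora_forward(x):
    """y = W·x + (α/r)·B·(A·x); only A and B would receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
assert np.allclose(lora_forward(x), W @ x)  # ΔW = 0 before any training
trainable = A.size + B.size                 # 384 params vs W.size = 2048
```

At inference, ΔW = (α/r)·B·A can be merged into W so there is no extra latency.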
53
What parameter-efficient fine-tuning methods are there?
Adapters, LoRA/QLoRA, prefix tuning, prompt tuning, P-tuning, IA³, BitFit (bias-only), partial layer fine-tuning/linear probing, and low-rank/soft prompts.
54
What is the difference between SFT and RL?
SFT (supervised fine-tuning) learns from labeled targets (e.g., next-token or instruction-response pairs). RL optimizes a policy to maximize a reward signal (often from human/AI preferences) using RL algorithms (policy gradients, PPO, etc.).
55
What is Q-learning?
An off-policy RL method that learns action-value Q(s,a): Q ← (1−α)Q + α[r + γ·max_a' Q(s',a')]. The greedy policy picks argmax_a Q(s,a).
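A tabular sketch of that update on a toy chain MDP (states 0–4, reward 1 on reaching state 4; ties broken toward "right" just to keep the sketch simple):

```python
import random

random.seed(0)
GOAL = 4                                   # chain 0-1-2-3-4; terminal reward at 4
Q = [[0.0, 0.0] for _ in range(GOAL + 1)]  # Q[state][action]; 0 = left, 1 = right
alpha, gamma, eps = 0.5, 0.9, 0.1

def greedy(s):
    return 0 if Q[s][0] > Q[s][1] else 1   # ties → right

for _ in range(300):                       # epsilon-greedy episodes
    s = 0
    while s != GOAL:
        a = random.randrange(2) if random.random() < eps else greedy(s)
        s_next = min(max(s + (1 if a == 1 else -1), 0), GOAL)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning: Q(s,a) += α[r + γ·max_a' Q(s',a') − Q(s,a)]
        target = r + gamma * (0.0 if s_next == GOAL else max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
```

After training, the greedy policy moves right in every state, and values decay with distance from the goal (Q(3,right) ≈ 1, Q(0,right) ≈ γ³).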
58
How can you prevent overfitting in general?
More data; proper train/val split; regularization (L2/weight decay), dropout; early stopping; data augmentation; simpler model; ensembling; cross-validation; noise injection.
59
How can you prevent overfitting in language models?
Deduplicate/clean data; dropout + weight decay; early stopping on validation perplexity; smaller model or fewer steps; label smoothing; augmentation/noising; mixout; curriculum; regularize with constraints (e.g., KL to base) during tuning; evaluate on held-out domains.
60
What is dropout?
A regularization technique that randomly zeroes a fraction of activations (or weights) during training so the network can’t rely on any single pathway; scaled appropriately at inference.
61
What are skip connections?
Residual connections that add a block’s input to its output (x + f(x)), improving gradient flow and enabling deeper networks.
62
What are the most important hyperparameters in language models?
Learning rate (and schedule/warmup), batch size, sequence length/context, model size (layers/width/heads), optimizer settings (βs, ε), weight decay, dropout, gradient clipping, training steps, and tokenization/vocab.
63
How do diffusion models work?
They define a forward process that gradually adds noise to data, then learn a reverse denoising model that removes noise step-by-step to generate samples from noise (often by predicting noise/score).
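The forward (noising) process has a closed form, sketched here with a DDPM-style linear schedule (numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)  # ᾱ_t = Π_s (1 − β_s)

def q_sample(x0, t):
    """Closed-form forward noising: x_t = √ᾱ_t·x0 + √(1−ᾱ_t)·ε."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=(64,))
x_early, x_late = q_sample(x0, 10), q_sample(x0, T - 1)
# early t barely perturbs x0; by t ≈ T the sample is nearly pure noise
```

Training then teaches a network to invert this, typically by predicting ε from (x_t, t).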
64
Why do diffusion models use a stepwise process?
The reverse distribution is hard to model in one jump; a Markov chain (or discretized SDE/ODE) of small denoising steps makes training stable and generation controllable, gradually refining coarse structure into detail.
65
What are the most important hyperparameters in diffusion models?
#steps/timesteps, noise schedule (β/α), model architecture/capacity, learning rate, guidance scale (if used), sampler (DDPM/DDIM/ODE), batch size, latent size/resolution, and conditioning dropout.
66
What diffusion language models exist?
Discrete/continuous text diffusion examples include D3PM (discrete diffusion), Diffusion-LM, SeqDiffuSeq (diffusion for seq2seq), and masked/iterative denoisers inspired by diffusion (e.g., MaskGIT-style).
67
What are alpha and r in LoRA fine-tuning?
r (rank) is the low-rank dimension of the update matrices; α (alpha) is a scaling hyperparameter. The effective update is typically scaled by α/r.
68
How would you host a webapp on AWS?
Static sites: S3 + CloudFront (+ Route53 + ACM). Dynamic apps: ECS/Fargate or Elastic Beanstalk or EC2 behind an ALB; serverless: Lambda + API Gateway. Add CI/CD (CodePipeline/GitHub Actions), logging (CloudWatch), secrets (SSM/Secrets Manager), and a database (RDS/DynamoDB) if needed.
69
What are Autoencoders?
Neural networks trained to reconstruct inputs via an encoder → latent code → decoder; used for dimensionality reduction, denoising, and representation learning.
70
What are VAEs?
Variational Autoencoders: probabilistic autoencoders where the encoder outputs a distribution (μ,σ) over latents; trained with reconstruction loss + KL divergence to a prior, enabling sampling via the reparameterization trick.
71
How do you calculate the standard deviation?
Population: σ = sqrt( (1/n)·Σ(xᵢ−μ)² ). Sample: s = sqrt( (1/(n−1))·Σ(xᵢ−x̄)² ).
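Both formulas, checked against the stdlib on a classic toy dataset:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean μ = 5

n = len(data)
mu = sum(data) / n
sigma = (sum((x - mu) ** 2 for x in data) / n) ** 0.5        # population (1/n)
s = (sum((x - mu) ** 2 for x in data) / (n - 1)) ** 0.5      # sample (1/(n−1))

assert abs(sigma - statistics.pstdev(data)) < 1e-12
assert abs(s - statistics.stdev(data)) < 1e-12
print(sigma)  # 2.0
```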
72
What is accuracy?
(TP + TN) / (TP + TN + FP + FN).
73
What is precision?
TP / (TP + FP).
74
What is recall?
TP / (TP + FN) (also called sensitivity/TPR).
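The three definitions above, computed from toy confusion-matrix counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Toy example: 8 TP, 2 FP, 85 TN, 5 FN out of 100 predictions
m = classification_metrics(tp=8, fp=2, tn=85, fn=5)
# accuracy = 93/100, precision = 8/10, recall = 8/13
```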
75
What is the precision-accuracy curve?
Usually called the precision–recall (PR) curve: precision vs recall as you vary the classification threshold; area under it is Average Precision (AP).
76
What is the ROC curve?
Plot of TPR (recall) vs FPR (FP/(FP+TN)) as the threshold varies; AUC summarizes ranking quality.
77
What is overfitting?
When a model fits training data (including noise) too closely and fails to generalize, leading to a large train–test performance gap.
78
What is the bitter lesson?
Rich Sutton’s observation: methods that scale with compute and data (general learning + search) tend to outperform clever human-engineered, domain-specific tricks in the long run.
79
What is the curse of dimensionality?
As dimensions grow, space becomes sparse: distances concentrate, nearest neighbors get far away, and the amount of data needed to cover the space grows exponentially.
80
What is the variance ... tradeoff?
Bias–variance tradeoff: increasing model complexity often lowers bias but raises variance; the goal is to minimize total generalization error.
81
How do you calculate the covariance?
Sample covariance: cov(X,Y) = (1/(n−1))·Σ(xᵢ−x̄)(yᵢ−ȳ). Population uses 1/n.
82
What is n-fold cross-validation?
Split data into n folds; train on n−1 folds and validate on the held-out fold; repeat n times and average metrics.
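A minimal index-splitting sketch (no shuffling or stratification, which real pipelines usually add):

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one val fold."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(kfold_indices(n_samples=10, n_folds=3))
# fold sizes 4, 3, 3; every index appears in exactly one validation fold
```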
83
What are benefits and drawbacks of n-fold or leave-one-out cross-validation?
Benefits: uses data efficiently; more stable estimate than a single split. Drawbacks: n trainings (expensive); folds may still be correlated. LOOCV (n = dataset size) has very low bias but high variance and is especially computationally expensive.
84
What’s the difference between PAE, pLDDT, pTM, and ipTM?
pLDDT: per‑residue local confidence (0–100). PAE: predicted aligned error between residue pairs (useful for domain/interface placement). pTM: predicted TM-score for overall fold (single-chain/global). ipTM: predicted TM-score focused on inter-chain interface quality (complex ranking).
85
How do you interpret a PAE heatmap for domain movement vs interface confidence?
Low PAE blocks along the diagonal indicate confident local/domain structure; low PAE between two domains/chains indicates confident relative placement. High off-diagonal PAE suggests uncertain domain orientation or flexible/alternative arrangements; interfaces with low PAE at contacting regions are more trustworthy.
86
What’s the difference between RMSD, TM-score, and lDDT (when is each misleading)?
RMSD is distance-based and sensitive to outliers/size and domain motions; TM-score is length-normalized and less sensitive to local errors, better for overall topology; lDDT is local distance agreement, robust to rigid-body domain shifts. RMSD can look bad for correct multi-domain models; TM-score can hide local errors; lDDT can be high even if domain orientation is wrong.
87
What’s the difference between model ranking vs model confidence in AF(-Multimer)?
Ranking is which model AF thinks is best among its outputs (e.g., by ipTM+pTM). Confidence is the predicted correctness level (pLDDT/PAE/ipTM). A top-ranked model can still be low-confidence if all candidates are uncertain.
88
What are common failure modes of AF/complex prediction?
Disordered regions or alternative conformations; incorrect domain orientations (hinges); wrong stoichiometry/assembly; weak/transient interfaces; missing ligands/cofactors/PTMs; induced-fit changes; membrane/low-homology targets; antibody CDR flexibility and antigen-dependent rearrangements.
89
What is recycling in AlphaFold and why does it help?
Recycling feeds the model’s own predicted representations/coordinates back into the trunk for additional refinement iterations. It helps correct errors iteratively, improving long-range consistency and interfaces.
90
What is template usage and when do templates help/hurt?
Templates provide structural priors from known related structures. They help when homologous structures exist (especially for low-MSA targets) and can guide domain arrangement; they can hurt if the template is too distant, wrong conformation, or biases the model away from the true structure.
91
What is MSA depth/effective sequences (Neff) and how does it affect accuracy?
MSA depth is the number of aligned homologs; Neff is the diversity‑weighted effective count. Higher Neff usually provides stronger coevolution signals and improves accuracy; shallow/low‑Neff MSAs often yield lower confidence and more errors.
92
What’s the difference between global docking and template-based docking?
Global docking searches relative orientations/positions of partners without assuming a known complex; template-based docking uses a known similar complex/interface as a starting point/constraint.
93
What is conformational selection vs induced fit (and why it matters for docking)?
Conformational selection: partners already sample binding-competent conformations and binding selects them. Induced fit: binding triggers conformational change. Docking is harder with large induced-fit changes because a single rigid structure may not represent the bound state.
94
What is DockQ and what does it measure?
DockQ is a composite score for protein–protein docking quality combining interface RMSD, ligand RMSD, and fraction of native contacts; it correlates with CAPRI quality categories.
95
What are common interface features and how to compute them?
Buried surface area (BSA), hydrogen bonds, salt bridges, hydrophobic contacts, shape complementarity. Compute via tools like PDBePISA/FreeSASA for BSA, and contact/H-bond analysis via PyMOL, Biopython, or dedicated interface analyzers.
96
What experimental data can be used to restrain modeling?
Cryo‑EM density maps, SAXS profiles, crosslinking mass spec (XL‑MS), NMR restraints (NOEs/RDCs), mutagenesis/epitope mapping, FRET distances, hydrogen–deuterium exchange (HDX‑MS), and co-evolution/paired MSAs.
97
What’s the difference between homologs vs orthologs vs paralogs?
Homologs share common ancestry (umbrella term). Orthologs diverged via speciation (often similar function). Paralogs diverged via gene duplication (may change function).
98
What is a profile HMM and why is it good for remote homology?
A profile Hidden Markov Model represents position-specific residue and gap probabilities from an MSA. It captures conservation patterns and indels better than pairwise alignment, improving detection of distant homologs.
99
What are gap penalties (linear vs affine) and why do they matter?
Gap penalties discourage insertions/deletions. Linear charges per gap character; affine uses gap_open + gap_extend, reflecting biology (few long gaps preferred over many short ones). They affect alignment sensitivity/specificity.
100
What is sequence weighting in MSAs and why do we do it?
Weighting down-weights redundant/closely related sequences so diverse sequences contribute more. It reduces phylogenetic bias and improves downstream statistics like coevolution and profiles.
101
What are paired MSAs and when are they useful (vs dangerous)?
Paired MSAs match sequences of two interacting proteins from the same organism to capture inter-protein coevolution. Useful for stable conserved interactions; dangerous if pairing is wrong (paralogs, promiscuous interactions), which can mislead models.
102
What are CDRs and framework regions, and common numbering schemes (IMGT/Kabat/Chothia)?
Framework regions (FR1–FR4) form the antibody scaffold; CDR1–CDR3 are hypervariable loops that often dominate binding. IMGT/Kabat/Chothia are conventions for assigning residue numbers/loop boundaries; they differ slightly in definitions, especially around CDRs.
103
What is V(D)J recombination and somatic hypermutation?
V(D)J recombination assembles variable regions from V, D (heavy only), and J gene segments to create initial diversity. Somatic hypermutation introduces point mutations in activated B cells, followed by selection to increase affinity.
104
What is clonal expansion and how do repertoires form?
After antigen activation, B cells with productive receptors proliferate (clonal expansion) and diversify via mutation; the repertoire is the population of B-cell receptor/antibody sequences shaped by recombination, selection, and exposure history.
105
What’s the difference between affinity and avidity?
Affinity is the strength of a single binding site interaction (often Kd). Avidity is the overall functional binding strength from multivalent interactions (e.g., IgG bivalent binding), which can be much stronger than affinity alone.
106
What is epitope binning and how is it measured (SPR/BLI competition)?
Epitope binning groups antibodies by whether they compete for the same/overlapping epitope. In SPR/BLI, one antibody captures antigen, then a second antibody is flowed; reduced binding indicates competition (same bin).
107
What’s the difference between neutralization and binding?
Binding means the antibody recognizes the antigen; neutralization means it blocks biological function (e.g., viral entry), often requiring binding to specific functional epitopes and sufficient potency.
108
What assays detect polyreactivity/off-target binding?
HEp‑2 cell staining, polyspecificity reagent (PSR) assays, binding to dsDNA/LPS/insulin panels, protein microarrays, tissue cross-reactivity, and off-target panels via ELISA/SPR/BLI/flow.
109
What is the difference between self-attention and cross-attention?
Self-attention uses Q,K,V from the same sequence (intra-sequence context). Cross-attention uses queries from one sequence (e.g., decoder) and keys/values from another (e.g., encoder or conditioning input).
110
What does the softmax temperature do?
Temperature scales logits before softmax. Higher temperature (T>1) makes distributions flatter (more uncertainty); lower (T<1) makes them sharper (more confident).
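A quick demonstration of the flattening/sharpening effect:

```python
import math

def softmax_with_temperature(logits, T):
    """softmax(logits / T): T > 1 flattens, T < 1 sharpens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, T=0.5)  # peaked: top prob ≈ 0.86
flat = softmax_with_temperature(logits, T=2.0)   # flattened: top prob ≈ 0.50
```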
111
Why does attention use √d scaling?
Dot products grow in magnitude with dimension d, which can push softmax into saturation and harm gradients. Dividing by √d keeps scores in a reasonable range for stable training.
112
What are KV caches and how do they speed up decoding?
In autoregressive generation, keys/values from previous tokens are stored (cached) so each new token only computes attention against cached K,V instead of recomputing for the whole prefix, reducing per-step cost.
113
What are pre-norm vs post-norm transformers?
Pre-norm applies LayerNorm before the sublayer (x + f(LN(x))); post-norm applies LayerNorm after the residual (LN(x + f(x))). Pre-norm generally stabilizes deep training.
114
What is masked language modeling vs causal language modeling?
MLM predicts masked tokens using bidirectional context (e.g., BERT). Causal LM predicts next tokens left-to-right with a causal mask (e.g., GPT).
115
What are common LR schedules (cosine, linear warmup, one-cycle) and why?
Warmup prevents early instability; cosine decay gradually lowers LR for convergence; one-cycle increases then decreases LR to speed training and improve generalization. Schedules balance fast learning early with fine-tuning late.
116
What is gradient clipping and when do you need it?
Clipping caps gradient norm/value to prevent exploding gradients, especially in RNNs, very deep nets, or large-batch/unstable training regimes.
117
What is label smoothing and when is it helpful/harmful?
It replaces hard one-hot targets with a slightly smoothed distribution to reduce overconfidence and improve calibration. It can harm tasks needing exact probabilities or when data is already noisy/low-signal.
118
What’s the difference between calibration and accuracy (ECE, reliability diagrams)?
Accuracy measures correctness; calibration measures whether predicted probabilities match true frequencies. ECE summarizes miscalibration; reliability diagrams plot predicted confidence vs observed accuracy.
119
What is confusion matrix and derived metrics (F1, MCC, balanced accuracy)?
Confusion matrix counts TP/FP/TN/FN. F1 = harmonic mean of precision and recall; MCC measures correlation between predictions and labels (robust to imbalance); balanced accuracy averages recall across classes.
120
What is class imbalance and how do you handle it?
When classes have very different frequencies. Handle via reweighting, resampling (over/under/SMOTE), appropriate metrics (PR-AUC), threshold tuning, focal loss, and collecting more minority data.
121
What is ZeRO / FSDP and what does it shard?
ZeRO (and PyTorch FSDP) shard training state across devices. Depending on stage/config they shard optimizer states, gradients, and parameters, reducing per-GPU memory while using collectives (all-gather/reduce-scatter).
122
What is activation checkpointing (trade compute for memory)?
Instead of storing all activations for backprop, you save a subset and recompute others during backward. It reduces memory at the cost of extra forward compute.
123
What communication ops dominate training (all-reduce, all-gather, reduce-scatter)?
All-reduce aggregates gradients across replicas (data parallel). All-gather collects sharded tensors (FSDP/tensor parallel). Reduce-scatter sums and distributes shards (often paired with all-gather for efficient sharded training).
124
What’s the difference between throughput and latency?
Throughput is samples/tokens processed per second (rate). Latency is time to produce one response/output (delay). Optimizations often trade one for the other.
125
What is mixed precision (FP16/BF16) and what can go wrong?
Using lower-precision dtypes speeds training and reduces memory. Issues include overflow/underflow and instability; mitigations include loss scaling, BF16 use, careful normalization, and keeping some ops in FP32.
126
What’s the difference between DDPM, DDIM, and ODE samplers?
DDPM uses stochastic reverse diffusion (many steps). DDIM is a deterministic (or less stochastic) variant enabling fewer steps with similar quality. ODE samplers integrate a probability-flow ODE (deterministic) and can use adaptive step solvers.
127
What is classifier-free guidance?
A conditioning technique that trains with random condition dropout; at sampling, combine conditional and unconditional predictions and scale their difference to steer samples toward the condition without an external classifier.
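The sampling-time combination can be sketched in a few lines (function name and scalar inputs are illustrative; in practice these are noise-prediction tensors):

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start from the unconditional prediction and
    move toward (scale=1) or past (scale>1) the conditional one."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]
```

Scale 0 recovers the unconditional model, 1 the conditional model, and values above 1 amplify the conditioning signal.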
128
What is a noise schedule and why does it matter?
It defines how much noise is added per timestep (β/α schedule). It affects training signal distribution and sampling quality/speed; good schedules improve stability and reduce required steps.
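A sketch of the common DDPM-style linear β schedule and the cumulative ᾱ (alpha-bar) it induces; the default endpoints follow the widely used DDPM setup:

```python
def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly interpolate betas over T steps and accumulate
    alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1)
             for t in range(T)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars
```

ᾱ starts near 1 (almost clean data) and decays toward 0 (almost pure noise) by the final step.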
129
What does it mean to predict ε (noise) vs x0 vs v?
Different parameterizations of the denoiser target: predict added noise ε, the clean data x0, or a v-parameterization combining both. They change loss scaling and can improve stability/quality.
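The three targets are related algebraically; a scalar sketch of the standard identities (x_t = √ᾱ·x0 + √(1−ᾱ)·ε, v = √ᾱ·ε − √(1−ᾱ)·x0):

```python
import math

def targets(x0, eps, alpha_bar):
    """Forward-diffusion sample x_t, the v-target, and x0 recovered
    from a (here exact) eps prediction, for one scalar value."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1 - alpha_bar)
    x_t = a * x0 + s * eps             # noisy sample at this timestep
    v = a * eps - s * x0               # v mixes eps and x0
    x0_from_eps = (x_t - s * eps) / a  # invert the forward process
    return x_t, v, x0_from_eps
```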
130
How do you compute RMSD properly (alignment first)?
Select comparable atoms (e.g., Cα), superpose structures (least-squares alignment) to remove rigid-body differences, then compute RMSD on the aligned coordinates; for multi-domain proteins consider domain-wise RMSD.
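Once the superposition is done, the RMSD itself is a one-liner; this sketch assumes the coordinate lists are already matched and superposed (the Kabsch fit is not shown):

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation over matched (x, y, z) atom coordinates."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))
```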
131
How do you color by B-factor / confidence (pLDDT) in PyMOL?
Store pLDDT in the B-factor field (as AlphaFold PDBs do) and color with the 'spectrum' command, e.g. 'spectrum b, blue_white_red, minimum=50, maximum=100', or just 'spectrum b' for defaults.
132
How do you identify interface residues and measure distances/contacts?
Define selections for each chain, find residues within a cutoff (e.g., within 4–5 Å) of the other chain using 'byres' and 'within', then use 'distance' for specific pairs and count contacts via selection sizes or scripts.
133
What’s the difference between cartoon, surface, and sticks views (when to use each)?
Cartoon shows secondary structure/backbone topology; surface shows solvent-accessible envelope and binding pockets/interfaces; sticks shows detailed side-chain/ligand interactions (often combined with cartoon).
134
What are GROUP BY and HAVING (difference from WHERE)?
WHERE filters rows before aggregation; GROUP BY forms groups for aggregates; HAVING filters groups after aggregation (e.g., HAVING COUNT(*) > 10).
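A runnable illustration using Python's built-in sqlite3 (table name and data are made up): WHERE runs before grouping, HAVING after aggregation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("ada", 10.0), ("ada", 5.0), ("bob", 7.0)])
rows = con.execute("""
    SELECT customer, COUNT(*) AS n, SUM(amount) AS total
    FROM orders
    WHERE amount > 0          -- row filter, applied before grouping
    GROUP BY customer
    HAVING COUNT(*) > 1       -- group filter, applied after aggregation
""").fetchall()
print(rows)   # only 'ada' has more than one order
```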
135
What are aggregations and window functions (OVER/PARTITION BY)?
Aggregations (COUNT/SUM/AVG/MIN/MAX) summarize groups. Window functions compute per-row values over a window defined by OVER (PARTITION BY/ORDER BY), e.g., running totals or ranks, without collapsing rows.
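A running-total example with sqlite3 (requires SQLite >= 3.25 for window functions; table and data are made up). Note every input row survives, unlike GROUP BY:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer TEXT, day INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("ada", 1, 10.0), ("ada", 2, 5.0), ("bob", 1, 7.0)])
rows = con.execute("""
    SELECT customer, day, amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY day) AS running
    FROM sales
    ORDER BY customer, day
""").fetchall()
print(rows)   # per-customer running totals, one output row per input row
```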
136
What are indexes and when do they help/hurt?
Indexes speed lookups/joins/orderings by maintaining auxiliary data structures (e.g., B-trees). They can hurt by slowing writes/inserts/updates and consuming storage; poor indexes can increase query planner overhead.
137
When did the CASP competition start and when did AlphaFold participate?
The Critical Assessment of Structure Prediction (CASP) started in 1994 (CASP1). AlphaFold 1 entered in 2018 (CASP13); AlphaFold 2 entered in 2020 (CASP14).
138
What is RMSD?
Root-mean-square deviation of distances between corresponding atoms (all atoms or only C-alpha); range [0, infinity), lower is better; requires a global superposition.
139
What is the TM-score?
Mean of 1/(1 + (d_i/d0)^2) over corresponding C-alpha atoms, where d_i is the distance after superposition and d0 is a length-dependent distance scale; range (0, 1], higher is better; requires a global superposition.
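As a sketch, assuming one fixed superposition and the standard length-dependent d0 (the real TM-score search also optimizes the superposition):

```python
import math

def tm_score(distances, l_target):
    """TM-score from per-residue C-alpha distances (Angstrom):
    mean of 1 / (1 + (d_i / d0)^2), with d0 = 1.24*(L-15)^(1/3) - 1.8."""
    d0 = 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target
```

A perfect match (all distances 0) scores 1; large deviations push each term toward 0.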
140
What is GDT-TS/HA?
Mean percentage of C-alpha atoms that fall under 4 distance thresholds. GDT-TS: 1, 2, 4, 8 Å; GDT-HA: 0.5, 1, 2, 4 Å. Requires superposition (for each threshold the optimal superposition is chosen separately).
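A simplified sketch: real GDT re-superposes for each threshold, whereas this assumes one fixed superposition and just averages the per-threshold fractions:

```python
def gdt_ts(distances):
    """GDT-TS from C-alpha deviations (Angstrom): average over the
    thresholds 1, 2, 4, 8 of the fraction of residues within each cutoff."""
    n = len(distances)
    fractions = [sum(d <= t for d in distances) / n for t in (1, 2, 4, 8)]
    return 100.0 * sum(fractions) / len(fractions)
```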
141
What is one angstrom equivalent to?
The angstrom is a unit of length equal to 10⁻¹⁰ m; that is, one ten-billionth of a metre, a hundred-millionth of a centimetre, 0.1 nanometre, or 100 picometres.
142
What is LDDT?
lDDT: mean fraction of preserved all-atom distances under 4 tolerance thresholds (0.5, 1, 2, 4 Å), considering only atom pairs within a 15 Å inclusion radius in the reference structure; superposition-free.
143
What is Gibbs free energy?
ΔG = ΔH − TΔS
- ΔH: enthalpy change
- T: temperature
- ΔS: entropy change
Binding/folding is favorable if ΔG < 0.
144
What is Enthalpy?
Enthalpy is the sum of a thermodynamic system's internal energy and the product of its pressure and volume: H(S,p)=U+pV
145
What contributes to enthalpy in proteins?
ΔH (enthalpy): "bonding/interaction energy"
- hydrogen bonds
- electrostatics (salt bridges)
- dispersion/van der Waals
These generally favor more ordered, well-packed structures and make ΔH more negative (stabilizing).
146
What contributes to entropy in proteins?
ΔS (entropy): “number of accessible microstates” - protein conformational entropy (flexibility) - solvent entropy (especially water around hydrophobics) Entropy can favor disorder for the protein itself (more conformations), but favor folding via the hydrophobic effect because burying hydrophobics can increase solvent entropy (water becomes less ordered).
147
Which components are considered in molecular forcefields of proteins?
- all bonds (stretching)
- all angles (bending)
- all torsion (dihedral) angles
- all non-bonded pairs (van der Waals)
- all partial charges (electrostatics)
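A toy sketch of how such terms add up (harmonic bonds plus non-bonded Lennard-Jones and Coulomb; all parameters and the energy constant are illustrative, not from any real force field):

```python
def toy_energy(bonds, pairs):
    """Toy force-field energy: sum of bonded and non-bonded terms.
    bonds: (r, r0, k) triples; pairs: (r, eps, sigma, q1, q2) tuples."""
    e = 0.0
    for r, r0, k in bonds:                 # bond stretching: k * (r - r0)^2
        e += k * (r - r0) ** 2
    for r, eps, sigma, q1, q2 in pairs:    # non-bonded pair terms
        sr6 = (sigma / r) ** 6
        e += 4 * eps * (sr6 ** 2 - sr6)    # Lennard-Jones 12-6
        e += 332.0 * q1 * q2 / r           # Coulomb (kcal/mol, Angstrom units)
    return e
```

Angle and torsion terms would be added analogously (harmonic angles, periodic cosine torsions).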
148
What is the Rosetta energy?
Rosetta scores a structure as a weighted sum of individual energy terms. Some terms are physics inspired (e.g. electrostatics) and others are knowledge based (e.g. torsion preferences)
149
What are rotamers?
Side chains can rotate around single bonds (the χ/chi dihedral angles: χ1, χ2, …), but often adopt a few discrete preferred angles: ~60° (g+), ~180° (t), or ~300°/−60° (g−).
150
Which angles describe the side chain rotation?
The side-chain χ (chi) dihedral angles: χ1, χ2, χ3, χ4 (and up to χ5, depending on the residue).
151