Computer Vision Flashcards

(236 cards)

1
Q

What is a digital image?

A
  • A grid (matrix) of intensity values
  • Each cell = one pixel

A digital image is literally a matrix of numbers.

2
Q

In grayscale images, how many values are there per pixel?

A

One value per pixel

Commonly stored as 1 byte (8 bits), where 0 = black and 255 = white.

3
Q

A grayscale image can be treated as a function: f(x,y)→______?

A

intensity

Here, x,y = pixel location and f(x,y) = brightness at that pixel.

4
Q

How can a colour image be represented?

A
  • Three grayscale images: Red (R), Green (G), Blue (B)
  • One 3D vector per pixel: (R,G,B)

Each pixel represents colour intensity in all three channels.

5
Q

In Python, how are images stored?

A

As arrays (matrices)

For an image im of size N × M × 3, im[y, x, 0] → red value, im[y, x, 1] → green value, im[y, x, 2] → blue value.
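This indexing can be sketched with NumPy (the array contents here are purely illustrative):

```python
import numpy as np

# A hypothetical 4 x 5 RGB image: shape (N, M, 3) = (rows, columns, channels)
im = np.zeros((4, 5, 3), dtype=np.uint8)
im[2, 3] = (200, 100, 50)  # set the pixel at row y=2, column x=3

r = im[2, 3, 0]  # red value
g = im[2, 3, 1]  # green value
b = im[2, 3, 2]  # blue value
```

Note the row-first convention: `im[y, x]`, not `im[x, y]`.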

6
Q

What is the purpose of image filtering?

A
  • Remove noise
  • Smooth images
  • Detect edges or contours
  • Preprocess for feature detection
  • Enhance or sharpen images
  • Fundamental to Convolutional Neural Networks (CNNs)

Filtering creates a new image where each output pixel depends on a local neighbourhood of the input image.

7
Q

What does the sliding window idea in filtering involve?

A
  • A small window (e.g. 3×3, 5×5)
  • Moves across the image pixel by pixel
  • Computes a value from the neighbourhood at each position

This is the basis of convolution and cross-correlation.

8
Q

Define cross-correlation in image processing.

A
  • For each pixel, take a dot product of:
    • Kernel values
    • Corresponding neighbourhood in the image

The kernel is applied as-is (not flipped).

9
Q

What is the difference between convolution and cross-correlation?

A

Convolution flips the kernel horizontally and vertically before applying it; cross-correlation applies the kernel as-is

Convolution is commutative and associative.
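A minimal NumPy sketch of the relationship (convolution is just cross-correlation with a flipped kernel; the helper names are illustrative):

```python
import numpy as np

def cross_correlate(img, k):
    """Valid-mode 2D cross-correlation: slide the kernel as-is."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * k)
    return out

def convolve(img, k):
    """Convolution = cross-correlation with the kernel flipped both ways."""
    return cross_correlate(img, k[::-1, ::-1])

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[0.0, 1.0], [2.0, 3.0]])  # asymmetric, so the two differ
a = cross_correlate(img, k)
b = convolve(img, k)
```

For a symmetric kernel (e.g. a Gaussian) the flip changes nothing, which is why the two terms are often used interchangeably.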

10
Q

What is linear filtering?

A
  • Replace each pixel with a weighted sum of its neighbours
  • Weights defined by the kernel (filter or mask)

This includes mean filters, Gaussian filters, and sharpening filters.

11
Q

What is a mean filter?

A
  • Kernel filled with equal values (e.g. all 1s)
  • Output pixel = average of neighbourhood

It smooths the image, reduces noise, but blurs edges.

12
Q

What does padding do in image processing?

A

Adds extra border pixels (often zeros)

Padding lets the kernel be applied at border pixels and controls the output image size.

13
Q

What is stride in the context of image filtering?

A

Step size of the sliding window

A stride of 1 visits every pixel and preserves resolution; a larger stride skips positions and produces a smaller output.
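The effect of padding and stride on output size follows the standard formula, sketched here as a small helper (the function name is illustrative):

```python
def output_size(n, k, p=0, s=1):
    """Spatial output size for input size n, kernel size k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

# A 7x7 input with a 3x3 kernel:
no_pad = output_size(7, 3)            # no padding, stride 1 -> 5
same = output_size(7, 3, p=1)         # "same" padding, stride 1 -> 7
strided = output_size(7, 3, p=1, s=2) # stride 2 halves the resolution -> 4
```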

14
Q

Name some linear filter examples.

A
  • Identity filter → image unchanged
  • Shift filter → image moves
  • Mean filter → blur
  • Sharpening filter → enhance edges

Sharpening emphasizes differences between pixels.

15
Q

What is the process of sharpening via detail extraction?

A
  • Blur the image
  • Subtract blurred image from original → detail
  • Add scaled detail back

This is also called a high-pass filter.

16
Q

What are ringing artifacts?

A

Oscillations near sharp edges caused by using a box (mean) filter

This occurs because the box filter has sharp cut-offs in frequency space.

17
Q

Describe a Gaussian filter.

A
  • Uses a kernel shaped like a Gaussian bell curve
  • Nearby pixels weighted more than distant ones

It smooths images naturally and preserves edges better than mean filters.
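A minimal sketch of building such a kernel in NumPy (size and σ here are arbitrary examples):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """A size x size Gaussian kernel, normalised to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

k = gaussian_kernel(5, sigma=1.0)
# The centre weight is the largest; weights fall off smoothly with distance.
```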

18
Q

When is Gaussian filtering preferred over mean filtering?

A

When the image contains sharp edges and smooth noise reduction without strong artifacts is required

Mean filter is simpler but causes ringing and edge blurring.

19
Q

What is a median filter?

A

Replaces each pixel with the median of its neighbourhood

It is robust to outliers and preserves edges better than mean or Gaussian filters.

20
Q

What is the purpose of thresholding in image processing?

A

Convert image to binary

Rule: If pixel ≥ threshold → white, else → black.
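The rule is a one-liner in NumPy (the tiny image and threshold here are illustrative):

```python
import numpy as np

img = np.array([[10, 200],
                [130, 90]], dtype=np.uint8)
t = 128

# pixel >= threshold -> white (255), else -> black (0)
binary = np.where(img >= t, 255, 0).astype(np.uint8)
# -> [[  0, 255],
#     [255,   0]]
```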

21
Q

Summarize the key points about images and filtering.

A
  • Images = matrices of intensity values
  • Sliding window underpins filtering
  • Cross-correlation vs convolution (kernel flipping)
  • Linear filters: Mean, Gaussian, Sharpening
  • Gaussian preferred over mean for edges
  • Non-linear filters: Median, Thresholding

These concepts are fundamental in image processing.

22
Q

What is the core idea of Computer Vision?

A

Every image tells a story

Computer Vision aims to understand that story automatically.

23
Q

What is the main goal of Computer Vision?

A

Extract meaning from pixels

This includes understanding geometric shape, identifying objects and people, and interpreting scenes.

24
Q

What do humans see in images?

A
  • Objects
  • People
  • Actions
  • Context

Computers see a grid of numbers (pixel intensities).

25
Q

What do computers see in images?

A

A grid of numbers (pixel intensities)

Computer vision bridges the gap between raw pixel values and high-level understanding.

26
Q

True or false: Vision is one of the hardest problems in AI.

A

TRUE

Humans are exceptionally good at recognizing meaning from noisy, incomplete images.

27
Q

When did Computer Vision begin?

A

1966

It started as a summer undergraduate project at MIT.

28
Q

Who asked a student to connect a television camera to a computer in 1966?

A

Marvin Minsky

He won the Turing Award in 1969.

29
Q

What fields does Computer Vision overlap with?

A
  • Artificial Intelligence
  • Image Processing
  • Machine Learning
  • Robotics
  • Cognitive Science
  • Neuroscience
  • Computer Graphics

It started as part of AI but is now a large independent field.

30
Q

Name a major development in Computer Vision.

A
  • Edge detection
  • 3D reconstruction
  • Stereo vision
  • Feature-based methods
  • Learning-based approaches
  • Deep learning revolution (2010s)

Modern successes include 3D body scanning and face recognition.
31
Q

What does scene understanding extract?

A
  • Semantic information
  • Geometric information

This allows systems to reason about outdoor vs indoor scenes and city vs countryside.

32
Q

What can Computer Vision do regarding 3D shape?

A

Reconstruct 3D shape from images or video

Applications include robotics and augmented reality.

33
Q

What does Optical Character Recognition (OCR) allow computers to do?

A

Read printed or handwritten text from images

Examples include digit recognition and license plate recognition.

34
Q

What can modern cameras do in terms of face detection and recognition?

A
  • Detect faces in real time
  • Identify or verify individuals

Detection is not the same as recognition.

35
Q

What is an example of biometric identification?

A

Using face, iris, or fingerprints

The Afghan Girl was identified years later using iris patterns.

36
Q

What benefits does vision-based authentication provide?

A
  • Faster
  • More secure
  • Harder to steal than passwords

Examples include fingerprint scanners and face unlock.

37
Q

What does image enhancement in computational photography improve?

A
  • Removing noise
  • Increasing resolution
  • Filling missing regions
  • Enhancing low-light photos
  • Simulating depth of field

Used heavily in smartphone cameras and photo editing software.

38
Q

What is an example of fine-grained recognition?

A

Bird identification (e.g. Merlin Bird ID)

This task requires attention to subtle visual details.

39
Q

What is shape and motion capture used for?

A
  • Tracking facial expressions
  • Animating realistic characters
  • Creating digital doubles

Used in movies like The Matrix.

40
Q

What can Computer Vision do in terms of image synthesis?

A
  • Generate new images
  • Translate styles
  • Transform objects

This includes style transfer and image-to-image translation.
41
Q

What is essential for autonomous cars?

A
  • Detecting lanes
  • Recognizing pedestrians
  • Understanding traffic signs
  • Interpreting dynamic scenes

Self-driving cars rely heavily on cameras and computer vision.

42
Q

What can be reconstructed from photo collections?

A

Full 3D models of cities or landmarks

Examples include Rome and Venice.

43
Q

What does recognition mean in Computer Vision?

A
  • Identifying what objects are present
  • Locating them with bounding boxes
  • Assigning labels

Modern systems can recognize multiple objects simultaneously.

44
Q

What is the goal of image retrieval?

A

Find images similar to a query image or sketch

Applications include search engines and digital libraries.

45
Q

Name a major challenge in Computer Vision.

A
  • Viewpoint variation
  • Illumination
  • Scale

These challenges affect how objects are perceived.

46
Q

What are additional challenges in Computer Vision?

A
  • Intra-class variation
  • Background clutter
  • Motion
  • Occlusion
  • Local ambiguity

These factors complicate recognition tasks.

47
Q

What are some visual cues that vision systems exploit?

A
  • Colour
  • Texture
  • Shape
  • Motion
  • Context
  • Depth

Combining cues improves robustness.

48
Q

What are current challenges in Computer Vision & ML?

A
  • Learning from fewer labels
  • Low-shot learning
  • Semi/self/weakly supervised learning
  • Continual learning
  • Domain adaptation

Active research areas include autonomous driving and fine-grained recognition.
49
Q

Why study Computer Vision?

A
  • Images and videos are everywhere
  • Vision problems are high impact
  • Field is growing rapidly
  • Huge industry demand

Conference scale (e.g. CVPR) includes thousands of paper submissions.

50
Q

What is the main goal of edge detection?

A
  • Convert a 2D image into a set of curves
  • Extract salient features
  • Represent structure more compactly than pixels

Edges highlight boundaries and shape in a scene.

51
Q

What causes edges in images?

A
  • Depth discontinuity
  • Surface colour discontinuity
  • Illumination discontinuity
  • Surface normal discontinuity

Different physical causes can produce visually similar edges.

52
Q

An edge is characterized by a location of:

A

Rapid change in image intensity

Edges correspond to extrema (peaks) of the first derivative.

53
Q

To compute derivatives in a digital image, we can:

A
  • Reconstruct a continuous image, then differentiate
  • Use discrete derivatives (finite differences)

In practice, linear filters are used to approximate derivatives.

54
Q

The image gradient measures:

A
  • Direction of intensity change
  • Strength of intensity change

The gradient points perpendicular to the edge.

55
Q

What does the Sobel operator do?

A

Approximates image derivatives

Common approximation of image derivatives; the standard Sobel kernels omit the 1/8 scaling factor.
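The standard Sobel kernels, applied to one pixel of a vertical step edge, can be sketched as follows (the tiny test image is illustrative):

```python
import numpy as np

# Standard Sobel kernels (horizontal and vertical derivative estimates)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def grad_at(img, y, x):
    """Apply both kernels at one pixel; return (gx, gy, magnitude)."""
    w = img[y - 1:y + 2, x - 1:x + 2]
    gx = np.sum(w * sobel_x)
    gy = np.sum(w * sobel_y)
    return gx, gy, np.hypot(gx, gy)

# Vertical step edge: left half dark, right half bright
img = np.zeros((5, 5))
img[:, 3:] = 1.0
gx, gy, mag = grad_at(img, 2, 2)
# gx is large, gy is zero: the gradient points across the vertical edge.
```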
56
Q

The Sobel operator can be seen as an approximation of:

A

Gaussian smoothing + derivative

This improves robustness to noise.

57
Q

What are the derivatives of a Gaussian used for?

A
  • Smoothing (noise reduction)
  • Detecting edges
  • Detecting zero-crossings

These derivatives form the basis of many edge detectors.

58
Q

What is the problem with using gradient magnitude alone?

A

Produces thick edges

Non-maximal suppression is needed to identify the true edge.

59
Q

What is the process of non-maximal suppression?

A
  • Compute gradient magnitude and direction
  • Keep a pixel only if it is a local maximum

Result: thin (1-pixel wide) edges.

60
Q

After non-maximal suppression, what issues may still remain?

A
  • Some noise remains
  • Not all edges are equally important

This leads to the need for double thresholding.

61
Q

Define double thresholding.

A
  • High threshold (T)
  • Low threshold (t)

Three cases: strong edge, weak edge, not an edge.

62
Q

In edge linking, what are the rules for weak edges?

A
  • Strong edges are always edges
  • Weak edges are edges only if connected to strong edges

Connectivity is checked in a local neighbourhood.

63
Q

What are the steps in the Canny edge detector pipeline?

A
  • Gaussian smoothing (σ)
  • Gradient computation
  • Non-maximal suppression
  • Double thresholding
  • Edge linking (hysteresis)

It is one of the most widely used edge detectors.

64
Q

What does the parameter σ control in the Canny edge detector?

A

Scale of edge detection

Small σ detects fine details; large σ detects large-scale edges.
65
Q

Name two similar gradient filters to the Sobel operator.

A
  • Prewitt filter
  • Roberts filter

Sobel is preferred due to better smoothing and improved noise robustness.

66
Q

What does LoG stand for in edge detection?

A

Laplacian of Gaussian

It detects edges via zero-crossings.

67
Q

What is the procedure for zero-crossing edge detection?

A
  • Compute the LoG response S(x)
  • Look for adjacent pixels where the sign of S(x) changes

This produces thin, precise edge elements (edgels).

68
Q

Summarize the key points of edge detection.

A
  • Edges = rapid intensity changes
  • Derivatives reveal edges
  • Sobel approximates Gaussian derivatives
  • Canny detector includes smoothing, gradient, non-max suppression, double thresholding, edge linking

LoG detects edges via zero-crossings; Prewitt & Roberts are alternative filters.
69
Q

What is a corner in image processing?

A
  • An image region where two or more edges intersect
  • Highly distinctive
  • Easier to localise than edges
  • Less ambiguous than flat regions
  • Ideal feature points for many vision tasks

Corners provide reliable information for tracking and matching.

70
Q

What is the aperture problem?

A
  • Viewing motion through a small window
  • An edge alone does not reveal the full motion direction
  • Motion is ambiguous

Corners solve this by providing unambiguous motion information.

71
Q

Name the types of invariance that make corners robust.

A
  • Geometric invariance: translation, rotation, scale (approximately)
  • Photometric invariance: brightness changes, exposure changes

This robustness allows corners to perform well across different viewpoints and lighting conditions.

72
Q

List some real applications of corner detection.

A
  • Panorama stitching
  • 3D reconstruction
  • Photo tourism
  • Image matching

The key idea is to find and match the same corners in different images.

73
Q

What are the three stages of corner-based systems?

A
  • Detection: find distinctive keypoints (corners)
  • Description: extract a feature vector around each keypoint
  • Matching: compare feature vectors

This pipeline underpins algorithms like SIFT, SURF, and ORB.

74
Q

What happens when a small image window is shifted in a flat region?

A

Little or no change in any direction

This is in contrast to corners, where a shift causes a significant change.

75
Q

What does SSD stand for in image processing?

A

Sum of Squared Differences

SSD quantifies change when a window is shifted.

76
Q

What is the Small Displacement Assumption?

A

Assumes the shift (u, v) is small

This allows the use of first-order Taylor expansion for approximations.

77
Q

What does the Second Moment Matrix (Auto-Correlation Matrix) capture?

A

Intensity change in all directions

It is central to corner detection.

78
Q

What does the quadratic form of the Second Moment Matrix define?

A

An ellipse

The shape is determined by the eigenvalues of the matrix.

79
Q

What do eigenvalues indicate in corner detection?

A
  • Flat region: both eigenvalues small
  • Edge: one eigenvalue large, one small
  • Corner: both eigenvalues large

This classification helps in identifying the nature of the region.

80
Q

What is the Harris Corner Response Function?

A

R = det(H) - k(trace(H))^2

This function helps to identify corners based on the determinant and trace of the Second Moment Matrix.
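A minimal sketch of the response function, evaluated on three hand-built second-moment matrices whose eigenvalues match the three cases above (the diagonal matrices are illustrative, not real image measurements):

```python
import numpy as np

def harris_response(H, k=0.04):
    """R = det(H) - k * trace(H)^2 for a 2x2 second-moment matrix H."""
    return np.linalg.det(H) - k * np.trace(H) ** 2

flat   = np.diag([0.01, 0.01])  # both eigenvalues small  -> |R| small
edge   = np.diag([10.0, 0.01])  # one large, one small    -> R negative
corner = np.diag([10.0, 10.0])  # both large              -> R large positive

# Only the corner-like matrix produces a large positive response.
```

Using diagonal matrices makes the eigenvalues explicit: they are simply the diagonal entries.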
81
Q

What is the purpose of Gaussian smoothing in corner detection?

A

To reduce noise in the image

It uses a Gaussian window to weight pixels, making the results more stable.

82
Q

What is non-maximal suppression?

A

Keeps only local maxima of the corner response

This ensures one clean point per corner and avoids clusters of detections.

83
Q

True or false: Corners are useful for matching, tracking, and reconstruction.

A

TRUE

Corners are distinctive and unambiguous features in image processing.

84
Q

What does the Auto-Correlation Matrix capture?

A

Directional change

It is a key component in corner detection algorithms.

85
Q

What are the main components of every local feature method?

A
  • Detection
  • Description
  • Matching

Each stage plays a crucial role in identifying and comparing features.

86
Q

Define feature robustness.

A

Good features must be robust to changes in the image

Types of robustness include geometric and photometric.

87
Q

What types of robustness are there?

A
  • Geometric robustness
  • Photometric robustness

Geometric robustness includes rotation and scale; photometric robustness includes intensity changes and lighting variations.

88
Q

What is the difference between invariance and equivariance?

A
  • Invariance: feature locations do not change under transformation
  • Equivariance: feature locations change in a predictable way

Invariance is desired for photometric changes, while equivariance is desired for geometric changes.

89
Q

The Harris Detector is equivariant to which transformations?

A
  • Translation
  • Rotation

Harris maintains consistent corner locations under these transformations.

90
Q

What is the goal of scale-invariant detection?

A

Find the scale at which a feature best matches the image structure

This involves searching for local maxima of a response function.
91
Q

What is the Laplacian of Gaussian (LoG) used for?

A

Blob detection

It finds maxima and minima in both space and scale.

92
Q

What are the steps involved in the SIFT descriptor?

A
  • Take a neighbourhood around the keypoint
  • Divide it into 4 × 4 subregions
  • Build an 8-bin orientation histogram for each

The final descriptor size is 128-dimensional.

93
Q

What distinguishes SURF from SIFT?

A
  • Uses integral images
  • Approximates Hessian-based detection

SURF is designed to be faster than SIFT.

94
Q

What does BRIEF use for feature matching?

A

Binary strings

It avoids computing full descriptors and results in very fast matching.

95
Q

What is the purpose of HoG?

A

Mainly used for object detection

It involves computing gradient magnitude and direction.

96
Q

What is the nearest neighbour distance ratio (NNDR)?

A

Ratio of distances to the nearest and second-nearest neighbours

It helps interpret the quality of matches.

97
Q

True or false: The Harris Detector is invariant to scale.

A

FALSE

Harris is not invariant to scale and can misclassify points.

98
Q

What is the Gaussian pyramid used for?

A

Detect features across pyramid levels

It allows for handling different scales without increasing the window size.

99
Q

What is the final descriptor size for HoG in human detection?

A

3780

This is based on a 128 × 64 window.

100
Q

What are the distance metrics commonly used in feature matching?

A
  • Euclidean distance
  • Cosine similarity

These metrics help in comparing feature vectors.
101
Q

What is the trade-off when choosing the ratio threshold in feature matching?

A
  • Lower threshold → fewer false positives
  • Higher threshold → more matches, more errors

The threshold depends on the specific application.

102
Q

What does the term scale-space refer to?

A

A method to analyze features at multiple scales

It is crucial for detecting features that may appear at different sizes.

103
Q

What is the significance of orientation estimation in feature detection?

A

Achieves rotation invariance

It involves computing gradients and building a histogram of orientations.

104
Q

What is the goal of feature matching?

A

Reliably match the same physical points across different images

This is essential for various computer vision applications.

105
Q

What does BoVW stand for in image classification?

A

Bag of Visual Words

This model adapts ideas from text retrieval to images.
106
Q

The goal of image classification is to assign a label to an image, such as _______.

A

apple

Other examples include pear, cow, dog.

107
Q

What are the main steps in the training phase of the image classification pipeline?

A
  • Extract image features
  • Train a classifier

This phase uses training images and labels.

108
Q

In the testing phase, what is the output after applying the learned classifier?

A

Predicted label

The input is a new image.

109
Q

In text retrieval, documents are represented as frequencies of words, ignoring word order. This is called the _______.

A

Bag of Words

It counts how many times each word appears.

110
Q

What is the concept of treating local image features like visual words called?

A

Bag of Visual Words (BoVW)

This approach counts how often each visual word appears.

111
Q

What is the first step in feature extraction for image classification?

A

Detect local features

Examples include corners and blobs.

112
Q

The goal of creating a visual vocabulary is to represent local descriptors through what method?

A

Clustering (typically k-means)

Each cluster centre represents a visual word.

113
Q

What does vector quantisation involve in the context of image features?

A

Assigning each descriptor to the nearest visual word

This converts image features into visual words.

114
Q

Each image is represented as a histogram of visual word frequencies, which is a characteristic of the _______.

A

BoVW representation

This representation is fixed length and independent of the number of detected features.
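The quantise-and-count step can be sketched in a few lines of NumPy (the toy vocabulary and descriptors are illustrative; a real vocabulary would come from k-means over training descriptors):

```python
import numpy as np

def bovw_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest visual word, then count."""
    # Pairwise distances: (num_descriptors, num_words)
    d = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)  # vector quantisation
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()  # normalised word frequencies (fixed length)

vocab = np.array([[0.0, 0.0], [10.0, 10.0]])  # two toy "visual words"
desc = np.array([[0.1, 0.2], [9.5, 10.1], [0.3, 0.1], [10.2, 9.9]])
h = bovw_histogram(desc, vocab)  # two descriptors land in each word
```

The histogram length equals the vocabulary size regardless of how many descriptors the image produced.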
115
Q

What is a limitation of the basic BoVW model?

A

Ignores spatial information

Images with the same objects but different layouts can look identical.

116
Q

What is the solution to the limitation of BoVW regarding spatial information?

A

Spatial Pyramids

This method divides the image into regions and computes histograms for each.

117
Q

In Spatial Pyramid Matching, what is the final representation composed of?

A

Concatenation of all histograms

This preserves coarse spatial layout.

118
Q

What are the two main classifiers used after images are converted to vectors?

A
  • Nearest Neighbour (NN)
  • Support Vector Machines (SVM)

These classifiers are applied to the image vectors.

119
Q

True or false: The Nearest Neighbour classifier is sensitive to noise.

A

TRUE

It requires a distance or similarity function and does not scale well.

120
Q

What does the Maximum Margin Classifier (SVM) aim to find?

A

A hyperplane that maximises the margin between classes

Key concepts include support vectors and margin width.

121
Q

What is the goal when dealing with non-linear data in SVMs?

A

Map data to a higher-dimensional space

This allows for linear separability.

122
Q

What is the kernel trick in SVMs?

A

Define a kernel without explicitly computing the transformation

This enables efficient non-linear classification.

123
Q

Common kernels used in SVMs include _______.

A
  • Polynomial
  • RBF (Gaussian)

These kernels help in defining decision boundaries.

124
Q

In summary, BoVW adapts ideas from text retrieval to images by representing them as _______.

A

Histograms of visual words

Spatial pyramids add spatial layout information.
125
Q

What does Structure from Motion (SfM) aim to recover?

A
  • 3D structure
  • Camera motion

SfM uses multiple views of the same scene to reconstruct 3D shape and camera motion.

126
Q

What are the two main problems combined in SfM?

A
  • Recovering 3D structure
  • Estimating camera motion

These problems are interrelated and essential for reconstructing scenes from images.

127
Q

What are the intrinsic parameters of a camera?

A
  • Focal length
  • Principal point
  • Pixel scaling

These parameters define how a camera captures images.

128
Q

What are the extrinsic parameters of a camera?

A
  • 3D rotation R
  • 3D translation t

These parameters determine the camera's position and orientation in 3D space.

129
Q

Why are multiple views necessary in SfM?

A
  • Different camera positions provide different projections
  • Depth can be recovered using geometry

A single image loses depth information, making multiple views essential.

130
Q

What is the goal of triangulation in SfM?

A

Compute the 3D coordinates of a point

This is done using projections of the same 3D point in two or more images.

131
Q

What does camera pose estimation involve?

A

Estimating the camera parameters: rotation and translation

This process recovers how the camera moved between images.

132
Q

What is the objective of SfM?

A

Minimising reprojection error

This involves aligning projected 3D points with observed image points.

133
Q

What is reprojection error?

A

The distance between the observed image point and the image point predicted from the 3D model

Minimising this error is crucial for accurate reconstruction.

134
Q

What is the difference between perspective and orthographic projection?

A
  • Perspective: far objects appear smaller
  • Orthographic: all objects appear the same scale

Many SfM methods start with orthographic projection for simplicity.

135
Q

What is the measurement matrix W in SfM?

A
  • Size: 2F × P
  • Stacks all image coordinates from all frames

This matrix contains all observed data necessary for reconstruction.

136
Q

What does the factorisation method in SfM involve?

A
  • Build the measurement matrix W
  • Enforce the rank constraint
  • Recover motion and shape

This method simplifies the problem of estimating motion and structure.

137
Q

What is the rank of a matrix?

A

The number of linearly independent rows/columns

The rank provides insights into the structure of the matrix and its factorisation.

138
Q

What is the Singular Value Decomposition (SVD) of a matrix?

A

W = UΣV^T

This decomposition is used to analyze the structure of the measurement matrix.

139
Q

What is the purpose of truncated SVD in SfM?

A
  • Keep only the top 3 singular values
  • Discard the rest

This helps in simplifying the factorisation process.
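A minimal NumPy sketch of rank-3 truncation, using a synthetic rank-3 "measurement-like" matrix (the sizes F=4 frames, P=6 points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# A rank-3 matrix of shape 2F x P (F=4 frames, P=6 points):
# product of a 8x3 and a 3x6 factor, mimicking motion x shape
W = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 6))

U, s, Vt = np.linalg.svd(W)
# Truncate: keep only the top 3 singular values, discard the rest
W3 = U[:, :3] @ np.diag(s[:3]) @ Vt[:3, :]

# Because rank(W) = 3, the truncation loses (numerically) nothing
err = np.linalg.norm(W - W3)
```

With noisy real measurements the trailing singular values are small but nonzero, and truncation enforces the rank-3 constraint.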
140
Q

What is a key trick to simplify equations in SfM?

A
  • Centre the 3D points around the origin
  • Centre the image points in each frame

This removes the translation term, making calculations easier.

141
Q

What is the main purpose of neural networks in computer vision?

A

To address limitations of hand-crafted features

Neural networks learn representations from data instead of relying on manually designed features.

142
Q

Name the core components of neural networks.

A
  • Perceptrons
  • Multi-Layer Perceptrons (MLPs)
  • Convolutions
  • Activation functions
  • Dropout and normalisation

These components are essential for building and training neural networks.

143
Q

What is the traditional vision pipeline?

A

Image → Hand-crafted Features → Classifier → Labels

This pipeline relies on manually designed features, which can be limiting.

144
Q

List some challenges in computer vision.

A
  • Scale changes
  • Viewpoint variation
  • Illumination changes
  • Motion
  • Background clutter
  • Occlusion

Hand-crafted features struggle to handle these challenges reliably.

145
Q

What does the new pipeline for learning representations involve?

A

Image → Learnable Parameters → Output → Loss → Labels

This approach allows neural networks to learn features and classifiers jointly.

146
Q

What is LeNet and its significance?

A

An early CNN with roughly 60,000 parameters, trained on MNIST

Important historically but limited by small datasets and compute.

147
Q

Describe the ImageNet dataset.

A
  • 20,000+ categories
  • ~14 million images
  • Based on the WordNet hierarchy

ImageNet enabled deep learning to scale significantly.

148
Q

What are the three factors that contributed to the success of deep learning?

A
  • Better algorithms
  • Massive datasets
  • Powerful computation (GPUs)

All three factors improved together, leading to advancements in deep learning.

149
Q

What was the impact of AlexNet in 2012?

A
  • ~60 million parameters
  • Trained on ImageNet
  • Won ILSVRC 2012 by a large margin

Triggered the modern deep learning boom.

150
Q

What are the key milestones in the timeline of deep learning?

A
  • 1958 – Perceptron (Rosenblatt)
  • 1980s – Backpropagation
  • 1990s – CNNs (LeCun)
  • 2006 – Autoencoders
  • 2012 – AlexNet
  • 2015 – Deep learning dominates vision
  • 2020 – Vision Transformers

Shows slow progress followed by a sudden explosion in the field.
151
True or false: **Learned representations** outperform manual ones in computer vision.
TRUE ## Footnote ImageNet results show that learned representations have better performance than traditional methods.
152
What is a **perceptron**?
* Linear classifier * Uses weight vector w and bias b ## Footnote The output rule determines the classification based on the linear combination of inputs.
153
What is the limitation of a **single perceptron**?
Cannot model non-linear relationships ## Footnote Fails on real-world data that requires fitting curved functions.
154
What is a **Multi-Layer Perceptron (MLP)**?
Stacks multiple perceptrons where outputs of one layer feed into the next ## Footnote Key idea is that complex functions can be composed of simple functions.
155
What are **hidden layers** and **hidden units**?
* Hidden layers: layers between input and output * Hidden units: neurons in hidden layers ## Footnote Without non-linearity, multiple layers collapse into one linear function.
156
Why are **non-linear activation functions** important in MLPs?
They allow the network to model complex patterns ## Footnote Without them, the network remains linear and cannot approximate real-world non-linear functions.
157
What are the two common **activation functions** mentioned?
* Sigmoid * ReLU ## Footnote Sigmoid is smooth but can saturate; ReLU is simple and efficient.
158
What is **convolution** in the context of CNNs?
Core operation that applies a kernel across the image ## Footnote Benefits include local connectivity, parameter sharing, and translation equivariance.
159
What are the key parameters in **convolution settings**?
* Padding * Stride ## Footnote These parameters control the output size and step size during convolution.
160
What is the purpose of **dropout** in neural networks?
Prevents overfitting by randomly disabling hidden units during training ## Footnote It forces robustness and reduces co-adaptation.
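A minimal sketch of (inverted) dropout, assuming drop probability `p`; survivors are rescaled so the expected activation is unchanged and test time becomes the identity:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero each unit with probability p during training,
    scale survivors by 1/(1-p) so the expected activation is unchanged.
    At test time this is the identity."""
    if not training:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

x = np.ones(8)
print(dropout(x, p=0.5))                  # mix of 0.0 and 2.0 entries
print(dropout(x, p=0.5, training=False))  # unchanged at test time
```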
161
What does **normalisation** achieve in neural networks?
* Zero mean * Unit variance ## Footnote Benefits include reducing internal covariate shift, speeding up training, and improving stability.
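The zero-mean, unit-variance transform is a one-step standardisation (sketch with an illustrative `eps` for numerical safety):

```python
import numpy as np

def standardise(x, eps=1e-5):
    # Shift to zero mean, scale to unit variance
    return (x - x.mean()) / np.sqrt(x.var() + eps)

x = np.array([10.0, 20.0, 30.0, 40.0])
z = standardise(x)
print(z.mean(), z.var())  # ≈ 0.0, ≈ 1.0
```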
162
What is a **loss function**?
Measures how wrong the model is ## Footnote It guides optimisation and allows for model comparison.
163
What are the three standard **dataset splits**?
* Training set * Validation set * Test set ## Footnote Each split serves a different purpose in the model training and evaluation process.
164
What is the **goal of optimisation** in neural networks?
Minimise loss ## Footnote Achieved using methods like gradient descent.
165
What does **softmax** do?
Converts raw scores into probabilities ## Footnote Used for multi-class classification ensuring probabilities sum to 1.
166
What is **Negative Log-Likelihood (NLL)** loss?
Penalises incorrect confident predictions ## Footnote Commonly used with softmax output.
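Softmax and NLL are usually paired; a small sketch shows the "confident but wrong" penalty in action:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, exponentiate, normalise
    e = np.exp(scores - scores.max())
    return e / e.sum()

def nll(probs, target):
    # Negative log-likelihood of the correct class
    return -np.log(probs[target])

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p.sum())    # probabilities sum to 1
print(nll(p, 0))  # small loss: class 0 has the highest probability
print(nll(p, 2))  # large loss: the true class was given low probability
```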
167
What characterises a **convex function**?
The line segment between any two points on the graph lies on or above the graph ## Footnote Convex functions are easy to optimise: any local minimum is also the global minimum.
168
What are the characteristics of **non-convex optimisation** in neural networks?
* Non-convex loss surfaces * Many local minima * No guarantees of global optimum ## Footnote Despite challenges, gradient-based methods work well in practice.
169
What is the **core goal** of neural networks learning using backpropagation?
Compute how changing each weight affects the loss ## Footnote This is essential for training deep neural networks effectively.
170
In a deep neural network, the training process involves adjusting weights so the **loss decreases**. What is the key question that arises?
How do we know which weights to change? ## Footnote The answer lies in computing gradients using backpropagation.
171
What does a **computation graph** represent?
* Operations as nodes * Data flowing between them ## Footnote This structure makes gradient calculation systematic.
172
During the **forward pass**, what is computed step by step?
Outputs ## Footnote This process moves from input to loss.
173
What is the goal of the **backward pass** in backpropagation?
Compute derivatives of the loss with respect to every weight and input ## Footnote This process moves from loss back to input.
174
What key rule is used in backpropagation?
Chain rule ## Footnote This rule is fundamental for calculating gradients.
175
Define **upstream gradient** in the context of backpropagation.
Gradient coming from later nodes ## Footnote It tells the current node how its output influences the loss.
176
What is a **local gradient**?
Derivative of the current operation ## Footnote It is essential for calculating downstream gradients.
177
What is the relationship between **upstream**, **local**, and **downstream gradients**?
Downstream = Upstream × Local ## Footnote This pattern is crucial during backpropagation.
178
How does the **Add Operator** affect gradient flow?
Gradient is copied to both inputs ## Footnote This is because both inputs affect the output equally.
179
What happens to gradients in the **Multiply Operator**?
Gradients are multiplied by the opposite input ## Footnote This is often referred to as the 'swap multiplier'.
180
In the context of backpropagation, what does the **Max Operator** do?
Gradient only flows to the largest input ## Footnote It acts like a router for gradient flow.
181
What is the function of the **Copy Operator** in gradient flow?
Gradients from branches are added together ## Footnote This ensures that all influences are accounted for.
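The add and multiply rules from the cards above can be checked by hand on a tiny expression, f = (x + y) × z (values chosen for illustration):

```python
# Tiny hand-rolled backprop for f = (x + y) * z
x, y, z = 2.0, -1.0, 4.0

# Forward pass
s = x + y  # add node
f = s * z  # multiply node

# Backward pass (gradient of f with respect to itself is 1)
df = 1.0
# Multiply: "swap" the inputs; gradient w.r.t. s is z, w.r.t. z is s
ds = df * z
dz = df * s
# Add: copy the gradient to both inputs
dx = ds
dy = ds

print(dx, dy, dz)  # 4.0 4.0 1.0, matching df/dx = z, df/dy = z, df/dz = x + y
```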
182
What are the two main steps of **autograd systems** during backpropagation?
* Save inputs during forward pass * Multiply by local gradient during backward pass ## Footnote This process helps in efficiently calculating gradients.
183
What types of data structures do real networks use in backpropagation?
* Vectors * Matrices * Tensors ## Footnote These structures are necessary because weights and activations are often multidimensional.
184
What is a **regular derivative**?
Single input → single output ## Footnote It describes the relationship between one input and one output.
185
Define **gradient** in the context of derivatives.
Vector of derivatives ## Footnote It tells how output changes with each input element.
186
What is a **Jacobian**?
Matrix of derivatives ## Footnote It is used when there are multiple inputs and multiple outputs.
187
What is the process for backpropagation with vectors?
1. Receive upstream gradient 2. Multiply by local Jacobian 3. Produce downstream gradient ## Footnote This often results in matrix-vector multiplication.
188
In backpropagation with matrices or tensors, how are gradients propagated?
Multiplying gradients by transposed matrices ## Footnote This is essential for passing gradients through layers.
189
What is the key result for **matrix multiplication** during backpropagation?
* ∂ℒ/∂A = (∂ℒ/∂U) × Bᵀ * ∂ℒ/∂B = Aᵀ × (∂ℒ/∂U) ## Footnote This shows how to compute gradients for both matrices involved.
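Both rules can be verified numerically in NumPy; here the loss is taken to be the sum of the output, so the upstream gradient is all ones (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
U = A @ B
# Suppose the loss is L = sum(U), so the upstream gradient dL/dU is all ones
dU = np.ones_like(U)

# Analytic gradients from the rules above
dA = dU @ B.T
dB = A.T @ dU

# Finite-difference check on one element of A
eps = 1e-6
A2 = A.copy(); A2[0, 0] += eps
numeric = ((A2 @ B).sum() - U.sum()) / eps
print(np.isclose(numeric, dA[0, 0]))  # True
```

Note that `dA` has the shape of `A` and `dB` the shape of `B`, which is exactly what the transposes guarantee.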
190
Why does the **transpose** appear in gradients during matrix multiplication?
Each element of the output depends on a row of A and a column of B ## Footnote The gradient of one element equals the dot product between the corresponding row and column.
191
What is the benefit of using backpropagation efficiently?
Gradients are reused and computation graph stores intermediate values ## Footnote This makes complex networks manageable.
192
What is the memory hook for the **forward pass**?
Compute outputs ## Footnote This is a simple way to remember the function of the forward pass.
193
What is the memory hook for the **backward pass**?
Send gradients back ## Footnote This captures the essence of what happens during the backward pass.
194
What is the memory hook for the **chain rule**?
Multiply gradients ## Footnote This highlights the fundamental operation in backpropagation.
195
What is the memory hook for the **Add Operator**?
Copy gradient ## Footnote This helps recall how gradients flow through addition.
196
What is the memory hook for the **Multiply Operator**?
Swap inputs ## Footnote This reminds us of how gradients are affected by multiplication.
197
What is the memory hook for the **Max Operator**?
Route gradient ## Footnote This indicates how gradients are directed in max operations.
198
What is the memory hook for **matrices** in backpropagation?
Use transpose ## Footnote This is crucial for understanding how gradients are calculated in matrix operations.
199
What is the memory hook for **autograd**?
Automatic backprop ## Footnote This emphasises the convenience provided by autograd systems.
200
What are the **three distinct layer types** of a convolutional neural network (CNN)?
* Convolution layers * Pooling layers * Normalisation layers ## Footnote A CNN also includes activation functions (ReLU) and fully connected layers.
201
What is the purpose of **padding** in convolutional layers?
To keep spatial size the same ## Footnote Common choice for padding: P = (K - 1) / 2.
202
What does the **stride** parameter control in convolutional layers?
How far the kernel moves ## Footnote Stride = 1 for normal sliding, Stride = 2 for downsampling.
203
What is a **receptive field** in the context of CNNs?
Area of input affecting one output value ## Footnote Each output depends on a K × K region.
204
What do first-layer **filters** in a CNN typically learn?
* Edges * Colour contrasts * Texture patterns ## Footnote Similar to derivative-of-Gaussian and Gabor filters.
205
What is the output size formula for a convolution layer given input dimensions B × C_in × H × W?
B × C_out × H' × W' ## Footnote Where H' = (H - K + 2P)/S + 1 and W' = (W - K + 2P)/S + 1.
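The spatial part of the formula is easy to wrap as a helper (name is illustrative):

```python
def conv_out(size, k, p=0, s=1):
    """Output spatial size of a convolution: (W - K + 2P) / S + 1."""
    return (size - k + 2 * p) // s + 1

# A 3x3 kernel with padding 1 keeps the size; stride 2 roughly halves it
print(conv_out(32, k=3, p=1, s=1))  # 32
print(conv_out(32, k=3, p=1, s=2))  # 16
```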
206
What is the **main idea** behind convolutional neural networks (CNNs)?
Preserve spatial structure and reduce parameters ## Footnote Compared to fully connected networks.
207
What are the two types of **pooling** mentioned?
* Max Pooling * Average Pooling ## Footnote Pooling reduces spatial size and introduces invariance to small shifts.
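Both pooling types can be sketched with a reshape trick for the common 2 × 2, stride-2 case (single-channel toy input, names illustrative):

```python
import numpy as np

def pool2x2(x, mode="max"):
    """2 x 2 pooling with stride 2 on a single-channel H x W array."""
    h, w = x.shape
    # Group the array into non-overlapping 2 x 2 blocks, then reduce each block
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
print(pool2x2(x, "max"))  # [[4. 8.] [4. 1.]]
print(pool2x2(x, "avg"))  # [[2.5 6.5] [1.  1. ]]
```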
208
True or false: **Batch Normalisation** (BN) helps in faster training and allows higher learning rates.
TRUE ## Footnote BN normalises layer outputs to zero mean and unit variance.
209
What is the typical pattern for stacking **convolution layers**?
Conv → ReLU → Conv → ReLU → … ## Footnote Each layer increases abstraction and learns more complex features.
210
What is the **output** of a convolution layer with 6 filters of size 3 × 7 × 7?
6 × 26 × 26 ## Footnote Each filter detects a different feature.
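The stated 26 × 26 is consistent with a 3 × 32 × 32 input, stride 1 and no padding (the input size is an assumption, not stated in the card):

```python
# (W - K + 2P)/S + 1 with W = 32, K = 7, P = 0, S = 1
out = (32 - 7 + 2 * 0) // 1 + 1
print(out)  # 26, so six filters give a 6 x 26 x 26 output
```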
211
What is the **function** of adaptive pooling in CNNs?
Produce a fixed output size regardless of input size ## Footnote The framework computes the required kernel size and stride automatically.
212
What does the **general convolution formula** output dimensions depend on?
* Input dimensions * Kernel size * Padding * Stride ## Footnote Output dimensions are calculated using specific formulas.
213
What is the **importance** of the kernel in a convolution layer?
Must have the same number of channels as the input ## Footnote Each filter's dot product then spans the full depth of the input at every spatial location.
214
What is the **output** of a convolution layer with no padding?
Output shrinks each layer ## Footnote This can be mitigated by adding zero-padding.
215
What does **batch normalisation** (BN) do during test time?
Uses running averages of μ and σ ## Footnote Becomes a simple linear operation that can be fused with the convolution layer.
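Because μ and σ are fixed at test time, BN collapses to an affine map y = a·x + b, which is why it can be fused into the preceding layer (a sketch with illustrative parameter values):

```python
import numpy as np

# Test-time BN: y = gamma * (x - mu) / sqrt(var + eps) + beta
gamma, beta, mu, var, eps = 2.0, 0.5, 1.0, 4.0, 1e-5

# Rewrite as a single affine map y = a * x + b
a = gamma / np.sqrt(var + eps)
b = beta - a * mu

x = np.array([0.0, 1.0, 3.0])
bn = gamma * (x - mu) / np.sqrt(var + eps) + beta
fused = a * x + b
print(np.allclose(bn, fused))  # True
```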
216
What is the **role** of pooling in CNNs?
* Reduce spatial size * Introduce invariance to small shifts * Reduce computation ## Footnote Pooling has no learnable parameters.
217
What is the **output size** of a pooling layer with kernel size and stride defined?
C × H' × W' ## Footnote H' = (H − K)/S + 1 and W' = (W − K)/S + 1; the channel count is unchanged.
218
What are the **components of a CNN**?
* Convolution layers * Pooling layers * Non-linearity (ReLU) * Normalisation * MLP (Fully connected layers) ## Footnote General pattern: Conv → ReLU → Pool → (repeat) → Flatten → FC
219
What is the **input size** for AlexNet?
3 × 227 × 227 ## Footnote AlexNet was the winner of the ImageNet competition in 2012.
220
What is the **output size formula** for a convolution layer?
W′ = (W − K + 2P)/S + 1 ## Footnote This formula calculates the output dimensions based on input size, kernel size, padding, and stride.
221
What is the output size of the **first convolution layer** in AlexNet?
96 × 55 × 55 ## Footnote Calculated using the input size and parameters of the conv1 layer.
222
How much **memory** does the first convolution layer in AlexNet consume?
1134KB ## Footnote Memory is calculated based on the number of output elements and their size.
223
What is the total number of **parameters** in the first convolution layer of AlexNet?
34,944 ## Footnote This includes weights and biases for the layer.
224
What is the **FLOP calculation** for the first convolution layer in AlexNet?
105,705,600 ≈ 106 MFLOPs ## Footnote Convolution layers dominate computation in CNNs.
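The conv1 figures from the cards above can be reproduced in a few lines; this assumes stride 4 with no padding (consistent with the 55 × 55 output) and a FLOP convention of one multiply-add per weight including the bias:

```python
# AlexNet conv1: input 3 x 227 x 227, 96 filters of 11 x 11, stride 4, no padding
C_in, W_in = 3, 227
C_out, K, S, P = 96, 11, 4, 0

W_out = (W_in - K + 2 * P) // S + 1                 # spatial output size
memory_kb = C_out * W_out * W_out * 4 / 1024        # float32 outputs, 4 bytes each
params = C_out * (C_in * K * K + 1)                 # weights plus one bias per filter
flops = C_out * W_out * W_out * (C_in * K * K + 1)  # one multiply-add per weight

print(W_out, memory_kb, params, flops)
# 55  1134.375  34944  105705600
```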
225
What is the output size of the **pooling layer (pool1)** in AlexNet?
96 × 27 × 27 ## Footnote Pooling layers have no learnable parameters and very small FLOPs.
226
List the layers in the **full AlexNet architecture**.
* conv1 * pool1 * conv2 * pool2 * conv3 * conv4 * conv5 * pool5 * flatten → 9216 * fc6 → 4096 * fc7 → 4096 * fc8 → 1000 classes ## Footnote Flatten size is calculated as 256×6×6=9216.
227
Where do the **most costs** occur in AlexNet?
* Most memory → early convolution layers * Most parameters → fully connected layers * Most FLOPs → convolution layers ## Footnote Important points for exam preparation.
228
What do the **early layers** of CNNs learn according to Zeiler & Fergus?
* Edges * Colours ## Footnote Mid layers learn textures and parts; deep layers learn objects and shapes.
229
What is the design philosophy of **VGG**?
Deeper networks with regular structure ## Footnote VGG emphasises simplicity and depth.
230
What are the **design rules** for VGG?
* All conv layers: 3 × 3 kernel, stride 1, pad 1 * All max pooling: 2 × 2 kernel, stride 2 * After each pooling: double number of channels ## Footnote VGG-19 includes extra conv layers in stages 4 and 5.
231
What are the **four key innovations** of GoogLeNet (Inception V1)?
* Stem Network * Inception Module * Global Average Pooling * Auxiliary Classifiers ## Footnote These innovations aim for efficiency in terms of parameters, memory, and FLOPs.
232
What is the **core idea** of Residual Networks (ResNet)?
Instead of learning a mapping H(X) directly, let each block learn the residual F(X) and output F(X) + X ## Footnote Shortcut connections make identity mappings easy, which eases optimisation of very deep networks.
233
What is the structure of a **Residual Block**?
* Conv 3×3 * ReLU * Conv 3×3 * + skip connection ## Footnote Output is F(X)+X, facilitating identity mapping.
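A toy sketch of the residual idea, with F built from two linear maps and a ReLU standing in for the 3 × 3 convolutions (an assumption for brevity, not the real block):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, w1, w2):
    """Toy residual block: F(x) = W2 . relu(W1 . x), output F(x) + x.
    (Real ResNet blocks use 3 x 3 convolutions in place of W1, W2.)"""
    return w2 @ relu(w1 @ x) + x  # the "+ x" is the skip connection

# With zero weights F(x) = 0 and the block reduces to the identity;
# this easy fallback is what makes very deep ResNets optimisable
x = np.array([1.0, -2.0, 3.0])
w = np.zeros((3, 3))
print(residual_block(x, w, w))  # [ 1. -2.  3.]
```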
234
What distinguishes a **Bottleneck Block** from a Basic Block in ResNet?
* Basic Block: Two 3×3 convolutions * Bottleneck Block: 1×1 → 3×3 → 1×1 ## Footnote Bottleneck blocks reduce computation and enable deeper networks.
235
What is the **architecture comparison** summary for AlexNet, VGG, GoogLeNet, and ResNet?
* AlexNet / VGG: [Conv + Pool + ReLU] × N, Flatten, [FC + ReLU] × N, FC * GoogLeNet: Inception modules, Global average pooling, No big FC layers * ResNet: Residual blocks, Identity shortcuts, Global average pooling ## Footnote Each architecture has unique features that define its structure and efficiency.
236
What are the **ADHD-Friendly Memory Hooks** for CNN architectures?
* AlexNet = first big CNN * VGG = deep + simple 3×3 * GoogLeNet = inception + efficient * ResNet = skip connections * Memory heavy = early conv * Parameters heavy = FC layers * FLOPs heavy = conv layers * Residual = F(X) + X ## Footnote These hooks help in remembering key concepts related to CNN architectures.