Computer Vision Flashcards

(236 cards)

1
Q

What is a digital image?

A
  • A grid (matrix) of intensity values
  • Each cell = one pixel

A digital image is literally a matrix of numbers.

2
Q

In grayscale images, how many values are there per pixel?

A

One value per pixel

Commonly stored as 1 byte (8 bits), where 0 = black and 255 = white.

3
Q

A grayscale image can be treated as a function: f(x,y)→______?

A

intensity

Here, x,y = pixel location and f(x,y) = brightness at that pixel.

4
Q

How can a colour image be represented?

A
  • Three grayscale images: Red (R), Green (G), Blue (B)
  • One 3D vector per pixel: (R,G,B)

Each pixel represents colour intensity in all three channels.

5
Q

In Python, how are images stored?

A

As arrays (matrices)

For an image im of size N × M × 3, im[y, x, 0] → red value, im[y, x, 1] → green value, im[y, x, 2] → blue value.
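This indexing can be sketched with NumPy (the array contents here are purely illustrative):

```python
import numpy as np

# A hypothetical 4 x 5 RGB image: shape (N, M, 3) = (rows, columns, channels)
im = np.zeros((4, 5, 3), dtype=np.uint8)
im[2, 3] = (200, 100, 50)  # set the pixel at row y=2, column x=3

r = im[2, 3, 0]  # red value
g = im[2, 3, 1]  # green value
b = im[2, 3, 2]  # blue value
```

Note the row-first convention: `im[y, x]`, not `im[x, y]`.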

6
Q

What is the purpose of image filtering?

A
  • Remove noise
  • Smooth images
  • Detect edges or contours
  • Preprocess for feature detection
  • Enhance or sharpen images
  • Fundamental to Convolutional Neural Networks (CNNs)

Filtering creates a new image where each output pixel depends on a local neighbourhood of the input image.

7
Q

What does the sliding window idea in filtering involve?

A
  • A small window (e.g. 3×3, 5×5)
  • Moves across the image pixel by pixel
  • Computes a value from the neighbourhood at each position

This is the basis of convolution and cross-correlation.

8
Q

Define cross-correlation in image processing.

A
  • For each pixel, take a dot product of:
    • Kernel values
    • Corresponding neighbourhood in the image

The kernel is applied as-is (not flipped).

9
Q

What is the difference between convolution and cross-correlation?

A

Convolution flips the kernel horizontally and vertically before applying it; cross-correlation applies the kernel as-is

Convolution is commutative and associative.
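A minimal NumPy sketch of the relationship (convolution is just cross-correlation with a flipped kernel; the helper names are illustrative):

```python
import numpy as np

def cross_correlate(img, k):
    """Valid-mode 2D cross-correlation: slide the kernel as-is."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * k)
    return out

def convolve(img, k):
    """Convolution = cross-correlation with the kernel flipped both ways."""
    return cross_correlate(img, k[::-1, ::-1])

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[0.0, 1.0], [2.0, 3.0]])  # asymmetric, so the two differ
a = cross_correlate(img, k)
b = convolve(img, k)
```

For a symmetric kernel (e.g. a Gaussian) the flip changes nothing, which is why the two terms are often used interchangeably.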

10
Q

What is linear filtering?

A
  • Replace each pixel with a weighted sum of its neighbours
  • Weights defined by the kernel (filter or mask)

This includes mean filters, Gaussian filters, and sharpening filters.

11
Q

What is a mean filter?

A
  • Kernel filled with equal values (e.g. all 1s)
  • Output pixel = average of neighbourhood

It smooths the image, reduces noise, but blurs edges.

12
Q

What does padding do in image processing?

A

Adds extra border pixels (often zeros)

Padding lets the kernel be applied at border pixels and controls the output image size.

13
Q

What is stride in the context of image filtering?

A

Step size of the sliding window

A stride of 1 visits every pixel and preserves resolution; a larger stride skips positions and produces a smaller output.
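The effect of padding and stride on output size follows the standard formula, sketched here as a small helper (the function name is illustrative):

```python
def output_size(n, k, p=0, s=1):
    """Spatial output size for input size n, kernel size k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

# A 7x7 input with a 3x3 kernel:
no_pad = output_size(7, 3)            # no padding, stride 1 -> 5
same = output_size(7, 3, p=1)         # "same" padding, stride 1 -> 7
strided = output_size(7, 3, p=1, s=2) # stride 2 halves the resolution -> 4
```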

14
Q

Name some linear filter examples.

A
  • Identity filter → image unchanged
  • Shift filter → image moves
  • Mean filter → blur
  • Sharpening filter → enhance edges

Sharpening emphasizes differences between pixels.

15
Q

What is the process of sharpening via detail extraction?

A
  • Blur the image
  • Subtract blurred image from original → detail
  • Add scaled detail back

This is also called a high-pass filter.

16
Q

What are ringing artifacts?

A

Oscillations near sharp edges caused by using a box (mean) filter

This occurs because the box filter has sharp cut-offs in frequency space.

17
Q

Describe a Gaussian filter.

A
  • Uses a kernel shaped like a Gaussian bell curve
  • Nearby pixels weighted more than distant ones

It smooths images naturally and preserves edges better than mean filters.
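A minimal sketch of building such a kernel in NumPy (size and σ here are arbitrary examples):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """A size x size Gaussian kernel, normalised to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

k = gaussian_kernel(5, sigma=1.0)
# The centre weight is the largest; weights fall off smoothly with distance.
```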

18
Q

When is Gaussian filtering preferred over mean filtering?

A

When the image contains sharp edges and smooth noise reduction without strong artifacts is required

Mean filter is simpler but causes ringing and edge blurring.

19
Q

What is a median filter?

A

Replaces each pixel with the median of its neighbourhood

It is robust to outliers and preserves edges better than mean or Gaussian filters.

20
Q

What is the purpose of thresholding in image processing?

A

Convert image to binary

Rule: If pixel ≥ threshold → white, else → black.
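The rule is a one-liner in NumPy (the tiny image and threshold here are illustrative):

```python
import numpy as np

img = np.array([[10, 200],
                [130, 90]], dtype=np.uint8)
t = 128

# pixel >= threshold -> white (255), else -> black (0)
binary = np.where(img >= t, 255, 0).astype(np.uint8)
# -> [[  0, 255],
#     [255,   0]]
```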

21
Q

Summarize the key points about images and filtering.

A
  • Images = matrices of intensity values
  • Sliding window underpins filtering
  • Cross-correlation vs convolution (kernel flipping)
  • Linear filters: Mean, Gaussian, Sharpening
  • Gaussian preferred over mean for edges
  • Non-linear filters: Median, Thresholding

These concepts are fundamental in image processing.

22
Q

What is the core idea of Computer Vision?

A

Every image tells a story

Computer Vision aims to understand that story automatically.

23
Q

What is the main goal of Computer Vision?

A

Extract meaning from pixels

This includes understanding geometric shape, identifying objects and people, and interpreting scenes.

24
Q

What do humans see in images?

A
  • Objects
  • People
  • Actions
  • Context

Computers see a grid of numbers (pixel intensities).

25
Q

What do computers see in images?

A

A grid of numbers (pixel intensities)

Computer vision bridges the gap between raw pixel values and high-level understanding.

26
Q

True or false: Vision is one of the hardest problems in AI.

A

TRUE

Humans are exceptionally good at recognizing meaning from noisy, incomplete images.

27
Q

When did Computer Vision begin?

A

1966

It started as a summer undergraduate project at MIT.

28
Q

Who asked a student to connect a television camera to a computer in 1966?

A

Marvin Minsky

He won the Turing Award in 1969.

29
Q

What fields does Computer Vision overlap with?

A
  • Artificial Intelligence
  • Image Processing
  • Machine Learning
  • Robotics
  • Cognitive Science
  • Neuroscience
  • Computer Graphics

It started as part of AI but is now a large independent field.

30
Q

Name a major development in Computer Vision.

A
  • Edge detection
  • 3D reconstruction
  • Stereo vision
  • Feature-based methods
  • Learning-based approaches
  • Deep learning revolution (2010s)

Modern successes include 3D body scanning and face recognition.
31
Q

What does scene understanding extract?

A
  • Semantic information
  • Geometric information

This allows systems to reason about outdoor vs indoor scenes and city vs countryside.

32
Q

What can Computer Vision do regarding 3D shape?

A

Reconstruct 3D shape from images or video

Applications include robotics and augmented reality.

33
Q

What does Optical Character Recognition (OCR) allow computers to do?

A

Read printed or handwritten text from images

Examples include digit recognition and license plate recognition.

34
Q

What can modern cameras do in terms of face detection and recognition?

A
  • Detect faces in real time
  • Identify or verify individuals

Detection is not the same as recognition.

35
Q

What is an example of biometric identification?

A

Using face, iris, or fingerprints

The Afghan Girl was identified years later using iris patterns.

36
Q

What benefits does vision-based authentication provide?

A
  • Faster
  • More secure
  • Harder to steal than passwords

Examples include fingerprint scanners and face unlock.

37
Q

What does image enhancement in computational photography improve?

A
  • Removing noise
  • Increasing resolution
  • Filling missing regions
  • Enhancing low-light photos
  • Simulating depth of field

Used heavily in smartphone cameras and photo editing software.

38
Q

What is an example of fine-grained recognition?

A

Bird identification (e.g. Merlin Bird ID)

This task requires attention to subtle visual details.

39
Q

What is shape and motion capture used for?

A
  • Tracking facial expressions
  • Animating realistic characters
  • Creating digital doubles

Used in movies like The Matrix.

40
Q

What can Computer Vision do in terms of image synthesis?

A
  • Generate new images
  • Translate styles
  • Transform objects

This includes style transfer and image-to-image translation.
41
Q

What is essential for autonomous cars?

A
  • Detecting lanes
  • Recognizing pedestrians
  • Understanding traffic signs
  • Interpreting dynamic scenes

Self-driving cars rely heavily on cameras and computer vision.

42
Q

What can be reconstructed from photo collections?

A

Full 3D models of cities or landmarks

Examples include Rome and Venice.

43
Q

What does recognition mean in Computer Vision?

A
  • Identifying what objects are present
  • Locating them with bounding boxes
  • Assigning labels

Modern systems can recognize multiple objects simultaneously.

44
Q

What is the goal of image retrieval?

A

Find images similar to a query image or sketch

Applications include search engines and digital libraries.

45
Q

Name a major challenge in Computer Vision.

A
  • Viewpoint variation
  • Illumination
  • Scale

These challenges affect how objects are perceived.

46
Q

What are additional challenges in Computer Vision?

A
  • Intra-class variation
  • Background clutter
  • Motion
  • Occlusion
  • Local ambiguity

These factors complicate recognition tasks.

47
Q

What are some visual cues that vision systems exploit?

A
  • Colour
  • Texture
  • Shape
  • Motion
  • Context
  • Depth

Combining cues improves robustness.

48
Q

What are current challenges in Computer Vision & ML?

A
  • Learning from fewer labels
  • Low-shot learning
  • Semi/self/weakly supervised learning
  • Continual learning
  • Domain adaptation

Active research areas include autonomous driving and fine-grained recognition.
49
Q

Why study Computer Vision?

A
  • Images and videos are everywhere
  • Vision problems are high impact
  • Field is growing rapidly
  • Huge industry demand

Conference scale (e.g. CVPR) includes thousands of paper submissions.

50
Q

What is the main goal of edge detection?

A
  • Convert a 2D image into a set of curves
  • Extract salient features
  • Represent structure more compactly than pixels

Edges highlight boundaries and shape in a scene.

51
Q

What causes edges in images?

A
  • Depth discontinuity
  • Surface colour discontinuity
  • Illumination discontinuity
  • Surface normal discontinuity

Different physical causes can produce visually similar edges.

52
Q

An edge is characterized by a location of:

A

Rapid change in image intensity

Edges correspond to extrema (peaks) of the first derivative.

53
Q

To compute derivatives in a digital image, we can:

A
  • Reconstruct a continuous image, then differentiate
  • Use discrete derivatives (finite differences)

In practice, linear filters are used to approximate derivatives.

54
Q

The image gradient measures:

A
  • Direction of intensity change
  • Strength of intensity change

The gradient points perpendicular to the edge.

55
Q

What does the Sobel operator do?

A

Approximates image derivatives

Common approximation of image derivatives; the standard Sobel kernels omit the 1/8 scaling factor.
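The standard Sobel kernels, applied to one pixel of a vertical step edge, can be sketched as follows (the tiny test image is illustrative):

```python
import numpy as np

# Standard Sobel kernels (horizontal and vertical derivative estimates)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def grad_at(img, y, x):
    """Apply both kernels at one pixel; return (gx, gy, magnitude)."""
    w = img[y - 1:y + 2, x - 1:x + 2]
    gx = np.sum(w * sobel_x)
    gy = np.sum(w * sobel_y)
    return gx, gy, np.hypot(gx, gy)

# Vertical step edge: left half dark, right half bright
img = np.zeros((5, 5))
img[:, 3:] = 1.0
gx, gy, mag = grad_at(img, 2, 2)
# gx is large, gy is zero: the gradient points across the vertical edge.
```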
56
Q

The Sobel operator can be seen as an approximation of:

A

Gaussian smoothing + derivative

This improves robustness to noise.

57
Q

What are the derivatives of a Gaussian used for?

A
  • Smoothing (noise reduction)
  • Detecting edges
  • Detecting zero-crossings

These derivatives form the basis of many edge detectors.

58
Q

What is the problem with using gradient magnitude alone?

A

Produces thick edges

Non-maximal suppression is needed to identify the true edge.

59
Q

What is the process of non-maximal suppression?

A
  • Compute gradient magnitude and direction
  • Keep a pixel only if it is a local maximum

Result: thin (1-pixel wide) edges.

60
Q

After non-maximal suppression, what issues may still remain?

A
  • Some noise remains
  • Not all edges are equally important

This leads to the need for double thresholding.

61
Q

Define double thresholding.

A
  • High threshold (T)
  • Low threshold (t)

Three cases: strong edge, weak edge, not an edge.

62
Q

In edge linking, what are the rules for weak edges?

A
  • Strong edges are always edges
  • Weak edges are edges only if connected to strong edges

Connectivity is checked in a local neighbourhood.

63
Q

What are the steps in the Canny edge detector pipeline?

A
  • Gaussian smoothing (σ)
  • Gradient computation
  • Non-maximal suppression
  • Double thresholding
  • Edge linking (hysteresis)

It is one of the most widely used edge detectors.

64
Q

What does the parameter σ control in the Canny edge detector?

A

Scale of edge detection

Small σ detects fine details; large σ detects large-scale edges.
65
Q

Name two similar gradient filters to the Sobel operator.

A
  • Prewitt filter
  • Roberts filter

Sobel is preferred due to better smoothing and improved noise robustness.

66
Q

What does LoG stand for in edge detection?

A

Laplacian of Gaussian

It detects edges via zero-crossings.

67
Q

What is the procedure for zero-crossing edge detection?

A
  • Compute the LoG response S(x)
  • Look for adjacent pixels where the sign of S(x) changes

This produces thin, precise edge elements (edgels).

68
Q

Summarize the key points of edge detection.

A
  • Edges = rapid intensity changes
  • Derivatives reveal edges
  • Sobel approximates Gaussian derivatives
  • Canny detector includes smoothing, gradient, non-max suppression, double thresholding, edge linking

LoG detects edges via zero-crossings; Prewitt & Roberts are alternative filters.
69
Q

What is a corner in image processing?

A
  • An image region where two or more edges intersect
  • Highly distinctive
  • Easier to localise than edges
  • Less ambiguous than flat regions
  • Ideal feature points for many vision tasks

Corners provide reliable information for tracking and matching.

70
Q

What is the aperture problem?

A
  • Viewing motion through a small window
  • An edge alone does not reveal the full motion direction
  • Motion is ambiguous

Corners solve this by providing unambiguous motion information.

71
Q

Name the types of invariance that make corners robust.

A
  • Geometric invariance: translation, rotation, scale (approximately)
  • Photometric invariance: brightness changes, exposure changes

This robustness allows corners to perform well across different viewpoints and lighting conditions.

72
Q

List some real applications of corner detection.

A
  • Panorama stitching
  • 3D reconstruction
  • Photo tourism
  • Image matching

The key idea is to find and match the same corners in different images.

73
Q

What are the three stages of corner-based systems?

A
  • Detection: find distinctive keypoints (corners)
  • Description: extract a feature vector around each keypoint
  • Matching: compare feature vectors

This pipeline underpins algorithms like SIFT, SURF, and ORB.

74
Q

What happens when a small image window is shifted in a flat region?

A

Little or no change in any direction

This is in contrast to corners, where a shift causes a significant change.

75
Q

What does SSD stand for in image processing?

A

Sum of Squared Differences

SSD quantifies change when a window is shifted.

76
Q

What is the Small Displacement Assumption?

A

Assumes the shift (u, v) is small

This allows the use of first-order Taylor expansion for approximations.

77
Q

What does the Second Moment Matrix (Auto-Correlation Matrix) capture?

A

Intensity change in all directions

It is central to corner detection.

78
Q

What does the quadratic form of the Second Moment Matrix define?

A

An ellipse

The shape is determined by the eigenvalues of the matrix.

79
Q

What do eigenvalues indicate in corner detection?

A
  • Flat region: both eigenvalues small
  • Edge: one eigenvalue large, one small
  • Corner: both eigenvalues large

This classification helps in identifying the nature of the region.

80
Q

What is the Harris Corner Response Function?

A

R = det(H) - k(trace(H))^2

This function helps to identify corners based on the determinant and trace of the Second Moment Matrix.
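A minimal sketch of the response function, evaluated on three hand-built second-moment matrices whose eigenvalues match the three cases above (the diagonal matrices are illustrative, not real image measurements):

```python
import numpy as np

def harris_response(H, k=0.04):
    """R = det(H) - k * trace(H)^2 for a 2x2 second-moment matrix H."""
    return np.linalg.det(H) - k * np.trace(H) ** 2

flat   = np.diag([0.01, 0.01])  # both eigenvalues small  -> |R| small
edge   = np.diag([10.0, 0.01])  # one large, one small    -> R negative
corner = np.diag([10.0, 10.0])  # both large              -> R large positive

# Only the corner-like matrix produces a large positive response.
```

Using diagonal matrices makes the eigenvalues explicit: they are simply the diagonal entries.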
81
Q

What is the purpose of Gaussian smoothing in corner detection?

A

To reduce noise in the image

It uses a Gaussian window to weight pixels, making the results more stable.

82
Q

What is non-maximal suppression?

A

Keeps only local maxima of the corner response

This ensures one clean point per corner and avoids clusters of detections.

83
Q

True or false: Corners are useful for matching, tracking, and reconstruction.

A

TRUE

Corners are distinctive and unambiguous features in image processing.

84
Q

What does the Auto-Correlation Matrix capture?

A

Directional change

It is a key component in corner detection algorithms.

85
Q

What are the main components of every local feature method?

A
  • Detection
  • Description
  • Matching

Each stage plays a crucial role in identifying and comparing features.

86
Q

Define feature robustness.

A

Good features must be robust to changes in the image

Types of robustness include geometric and photometric.

87
Q

What types of robustness are there?

A
  • Geometric robustness
  • Photometric robustness

Geometric robustness includes rotation and scale; photometric robustness includes intensity changes and lighting variations.

88
Q

What is the difference between invariance and equivariance?

A
  • Invariance: feature locations do not change under transformation
  • Equivariance: feature locations change in a predictable way

Invariance is desired for photometric changes, while equivariance is desired for geometric changes.

89
Q

The Harris Detector is equivariant to which transformations?

A
  • Translation
  • Rotation

Harris maintains consistent corner locations under these transformations.

90
Q

What is the goal of scale-invariant detection?

A

Find the scale at which a feature best matches the image structure

This involves searching for local maxima of a response function.
91
Q

What is the Laplacian of Gaussian (LoG) used for?

A

Blob detection

It finds maxima and minima in both space and scale.

92
Q

What are the steps involved in the SIFT descriptor?

A
  • Take a neighbourhood around the keypoint
  • Divide it into 4 × 4 subregions
  • Build an 8-bin orientation histogram for each

The final descriptor size is 128-dimensional.

93
Q

What distinguishes SURF from SIFT?

A
  • Uses integral images
  • Approximates Hessian-based detection

SURF is designed to be faster than SIFT.

94
Q

What does BRIEF use for feature matching?

A

Binary strings

It avoids computing full descriptors and results in very fast matching.

95
Q

What is the purpose of HoG?

A

Mainly used for object detection

It involves computing gradient magnitude and direction.

96
Q

What is the nearest neighbour distance ratio (NNDR)?

A

Ratio of distances to the nearest and second-nearest neighbours

It helps interpret the quality of matches.

97
Q

True or false: The Harris Detector is invariant to scale.

A

FALSE

Harris is not invariant to scale and can misclassify points.

98
Q

What is the Gaussian pyramid used for?

A

Detect features across pyramid levels

It allows for handling different scales without increasing the window size.

99
Q

What is the final descriptor size for HoG in human detection?

A

3780

This is based on a 128 × 64 window.

100
Q

What are the distance metrics commonly used in feature matching?

A
  • Euclidean distance
  • Cosine similarity

These metrics help in comparing feature vectors.
101
Q

What is the trade-off when choosing the ratio threshold in feature matching?

A
  • Lower threshold → fewer false positives
  • Higher threshold → more matches, more errors

The threshold depends on the specific application.

102
Q

What does the term scale-space refer to?

A

A method to analyze features at multiple scales

It is crucial for detecting features that may appear at different sizes.

103
Q

What is the significance of orientation estimation in feature detection?

A

Achieves rotation invariance

It involves computing gradients and building a histogram of orientations.

104
Q

What is the goal of feature matching?

A

Reliably match the same physical points across different images

This is essential for various computer vision applications.

105
Q

What does BoVW stand for in image classification?

A

Bag of Visual Words

This model adapts ideas from text retrieval to images.
106
Q

The goal of image classification is to assign a label to an image, such as _______.

A

apple

Other examples include pear, cow, dog.

107
Q

What are the main steps in the training phase of the image classification pipeline?

A
  • Extract image features
  • Train a classifier

This phase uses training images and labels.

108
Q

In the testing phase, what is the output after applying the learned classifier?

A

Predicted label

The input is a new image.

109
Q

In text retrieval, documents are represented as frequencies of words, ignoring word order. This is called the _______.

A

Bag of Words

It counts how many times each word appears.

110
Q

What is the concept of treating local image features like visual words called?

A

Bag of Visual Words (BoVW)

This approach counts how often each visual word appears.

111
Q

What is the first step in feature extraction for image classification?

A

Detect local features

Examples include corners and blobs.

112
Q

The goal of creating a visual vocabulary is to represent local descriptors through what method?

A

Clustering (typically k-means)

Each cluster centre represents a visual word.

113
Q

What does vector quantisation involve in the context of image features?

A

Assigning each descriptor to the nearest visual word

This converts image features into visual words.

114
Q

Each image is represented as a histogram of visual word frequencies, which is a characteristic of the _______.

A

BoVW representation

This representation is fixed length and independent of the number of detected features.
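The quantise-and-count step can be sketched in a few lines of NumPy (the toy vocabulary and descriptors are illustrative; a real vocabulary would come from k-means over training descriptors):

```python
import numpy as np

def bovw_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest visual word, then count."""
    # Pairwise distances: (num_descriptors, num_words)
    d = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)  # vector quantisation
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()  # normalised word frequencies (fixed length)

vocab = np.array([[0.0, 0.0], [10.0, 10.0]])  # two toy "visual words"
desc = np.array([[0.1, 0.2], [9.5, 10.1], [0.3, 0.1], [10.2, 9.9]])
h = bovw_histogram(desc, vocab)  # two descriptors land in each word
```

The histogram length equals the vocabulary size regardless of how many descriptors the image produced.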
115
Q

What is a limitation of the basic BoVW model?

A

Ignores spatial information

Images with the same objects but different layouts can look identical.

116
Q

What is the solution to the limitation of BoVW regarding spatial information?

A

Spatial Pyramids

This method divides the image into regions and computes histograms for each.

117
Q

In Spatial Pyramid Matching, what is the final representation composed of?

A

Concatenation of all histograms

This preserves coarse spatial layout.

118
Q

What are the two main classifiers used after images are converted to vectors?

A
  • Nearest Neighbour (NN)
  • Support Vector Machines (SVM)

These classifiers are applied to the image vectors.

119
Q

True or false: The Nearest Neighbour classifier is sensitive to noise.

A

TRUE

It requires a distance or similarity function and does not scale well.

120
Q

What does the Maximum Margin Classifier (SVM) aim to find?

A

A hyperplane that maximises the margin between classes

Key concepts include support vectors and margin width.

121
Q

What is the goal when dealing with non-linear data in SVMs?

A

Map data to a higher-dimensional space

This allows for linear separability.

122
Q

What is the kernel trick in SVMs?

A

Define a kernel without explicitly computing the transformation

This enables efficient non-linear classification.

123
Q

Common kernels used in SVMs include _______.

A
  • Polynomial
  • RBF (Gaussian)

These kernels help in defining decision boundaries.

124
Q

In summary, BoVW adapts ideas from text retrieval to images by representing them as _______.

A

Histograms of visual words

Spatial pyramids add spatial layout information.
125
Q

What does Structure from Motion (SfM) aim to recover?

A
  • 3D structure
  • Camera motion

SfM uses multiple views of the same scene to reconstruct 3D shape and camera motion.

126
Q

What are the two main problems combined in SfM?

A
  • Recovering 3D structure
  • Estimating camera motion

These problems are interrelated and essential for reconstructing scenes from images.

127
Q

What are the intrinsic parameters of a camera?

A
  • Focal length
  • Principal point
  • Pixel scaling

These parameters define how a camera captures images.

128
Q

What are the extrinsic parameters of a camera?

A
  • 3D rotation R
  • 3D translation t

These parameters determine the camera's position and orientation in 3D space.

129
Q

Why are multiple views necessary in SfM?

A
  • Different camera positions provide different projections
  • Depth can be recovered using geometry

A single image loses depth information, making multiple views essential.

130
Q

What is the goal of triangulation in SfM?

A

Compute the 3D coordinates of a point

This is done using projections of the same 3D point in two or more images.

131
Q

What does camera pose estimation involve?

A

Estimating the camera parameters: rotation and translation

This process recovers how the camera moved between images.

132
Q

What is the objective of SfM?

A

Minimising reprojection error

This involves aligning projected 3D points with observed image points.

133
Q

What is reprojection error?

A

The distance between the observed image point and the image point predicted from the 3D model

Minimising this error is crucial for accurate reconstruction.

134
Q

What is the difference between perspective and orthographic projection?

A
  • Perspective: far objects appear smaller
  • Orthographic: all objects appear the same scale

Many SfM methods start with orthographic projection for simplicity.

135
Q

What is the measurement matrix W in SfM?

A
  • Size: 2F × P
  • Stacks all image coordinates from all frames

This matrix contains all observed data necessary for reconstruction.

136
Q

What does the factorisation method in SfM involve?

A
  • Build the measurement matrix W
  • Enforce the rank constraint
  • Recover motion and shape

This method simplifies the problem of estimating motion and structure.

137
Q

What is the rank of a matrix?

A

The number of linearly independent rows/columns

The rank provides insights into the structure of the matrix and its factorisation.

138
Q

What is the Singular Value Decomposition (SVD) of a matrix?

A

W = UΣV^T

This decomposition is used to analyze the structure of the measurement matrix.

139
Q

What is the purpose of truncated SVD in SfM?

A
  • Keep only the top 3 singular values
  • Discard the rest

This helps in simplifying the factorisation process.
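A minimal NumPy sketch of rank-3 truncation, using a synthetic rank-3 "measurement-like" matrix (the sizes F=4 frames, P=6 points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# A rank-3 matrix of shape 2F x P (F=4 frames, P=6 points):
# product of a 8x3 and a 3x6 factor, mimicking motion x shape
W = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 6))

U, s, Vt = np.linalg.svd(W)
# Truncate: keep only the top 3 singular values, discard the rest
W3 = U[:, :3] @ np.diag(s[:3]) @ Vt[:3, :]

# Because rank(W) = 3, the truncation loses (numerically) nothing
err = np.linalg.norm(W - W3)
```

With noisy real measurements the trailing singular values are small but nonzero, and truncation enforces the rank-3 constraint.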
140
Q

What is a key trick to simplify equations in SfM?

A
  • Centre the 3D points around the origin
  • Centre the image points in each frame

This removes the translation term, making calculations easier.

141
Q

What is the main purpose of neural networks in computer vision?

A

To address limitations of hand-crafted features

Neural networks learn representations from data instead of relying on manually designed features.

142
Q

Name the core components of neural networks.

A
  • Perceptrons
  • Multi-Layer Perceptrons (MLPs)
  • Convolutions
  • Activation functions
  • Dropout and normalisation

These components are essential for building and training neural networks.

143
Q

What is the traditional vision pipeline?

A

Image → Hand-crafted Features → Classifier → Labels

This pipeline relies on manually designed features, which can be limiting.

144
Q

List some challenges in computer vision.

A
  • Scale changes
  • Viewpoint variation
  • Illumination changes
  • Motion
  • Background clutter
  • Occlusion

Hand-crafted features struggle to handle these challenges reliably.

145
Q

What does the new pipeline for learning representations involve?

A

Image → Learnable Parameters → Output → Loss → Labels

This approach allows neural networks to learn features and classifiers jointly.

146
Q

What is LeNet and its significance?

A

An early CNN with roughly 60,000 parameters, trained on MNIST

Important historically but limited by small datasets and compute.

147
Q

Describe the ImageNet dataset.

A
  • 20,000+ categories
  • ~14 million images
  • Based on the WordNet hierarchy

ImageNet enabled deep learning to scale significantly.

148
Q

What are the three factors that contributed to the success of deep learning?

A
  • Better algorithms
  • Massive datasets
  • Powerful computation (GPUs)

All three factors improved together, leading to advancements in deep learning.

149
Q

What was the impact of AlexNet in 2012?

A
  • ~60 million parameters
  • Trained on ImageNet
  • Won ILSVRC 2012 by a large margin

Triggered the modern deep learning boom.

150
Q

What are the key milestones in the timeline of deep learning?

A
  • 1958 – Perceptron (Rosenblatt)
  • 1980s – Backpropagation
  • 1990s – CNNs (LeCun)
  • 2006 – Autoencoders
  • 2012 – AlexNet
  • 2015 – Deep learning dominates vision
  • 2020 – Vision Transformers

Shows slow progress followed by a sudden explosion in the field.
151
True or false: **Learned representations** outperform manual ones in computer vision.
TRUE ## Footnote ImageNet results show that learned representations have better performance than traditional methods.
152
What is a **perceptron**?
* Linear classifier * Uses weight vector w and bias b ## Footnote The output rule determines the classification based on the linear combination of inputs.
153
What is the limitation of a **single perceptron**?
Cannot model non-linear relationships ## Footnote Fails on real-world data that requires fitting curved functions.
154
What is a **Multi-Layer Perceptron (MLP)**?
Stacks multiple perceptrons where outputs of one layer feed into the next ## Footnote Key idea is that complex functions can be composed of simple functions.
155
What are **hidden layers** and **hidden units**?
* Hidden layers: layers between input and output * Hidden units: neurons in hidden layers ## Footnote Without non-linearity, multiple layers collapse into one linear function.
156
Why are **non-linear activation functions** important in MLPs?
They allow the network to model complex patterns ## Footnote Without them, the network remains linear and cannot approximate real-world non-linear functions.
157
What are the two common **activation functions** mentioned?
* Sigmoid * ReLU ## Footnote Sigmoid is smooth but can saturate; ReLU is simple and efficient.
158
What is **convolution** in the context of CNNs?
Core operation that applies a kernel across the image ## Footnote Benefits include local connectivity, parameter sharing, and translation equivariance.
159
What are the key parameters in **convolution settings**?
* Padding * Stride ## Footnote These parameters control the output size and step size during convolution.
160
What is the purpose of **dropout** in neural networks?
Prevents overfitting by randomly disabling hidden units during training ## Footnote It forces robustness and reduces co-adaptation.
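A minimal sketch of (inverted) dropout, assuming drop probability `p`; survivors are rescaled so the expected activation is unchanged and test time becomes the identity:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero each unit with probability p during training,
    scale survivors by 1/(1-p) so the expected activation is unchanged.
    At test time this is the identity."""
    if not training:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

x = np.ones(8)
print(dropout(x, p=0.5))                  # mix of 0.0 and 2.0 entries
print(dropout(x, p=0.5, training=False))  # unchanged at test time
```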
161
What does **normalisation** achieve in neural networks?
* Zero mean * Unit variance ## Footnote Benefits include reducing internal covariate shift, speeding up training, and improving stability.
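The zero-mean, unit-variance transform is a one-step standardisation (sketch with an illustrative `eps` for numerical safety):

```python
import numpy as np

def standardise(x, eps=1e-5):
    # Shift to zero mean, scale to unit variance
    return (x - x.mean()) / np.sqrt(x.var() + eps)

x = np.array([10.0, 20.0, 30.0, 40.0])
z = standardise(x)
print(z.mean(), z.var())  # ≈ 0.0, ≈ 1.0
```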
162
What is a **loss function**?
Measures how wrong the model is ## Footnote It guides optimisation and allows for model comparison.
163
What are the three standard **dataset splits**?
* Training set * Validation set * Test set ## Footnote Each split serves a different purpose in the model training and evaluation process.
164
What is the **goal of optimisation** in neural networks?
Minimise loss ## Footnote Achieved using methods like gradient descent.
165
What does **softmax** do?
Converts raw scores into probabilities ## Footnote Used for multi-class classification ensuring probabilities sum to 1.
166
What is **Negative Log-Likelihood (NLL)** loss?
Penalises incorrect confident predictions ## Footnote Commonly used with softmax output.
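Softmax and NLL are usually paired; a small sketch shows the "confident but wrong" penalty in action:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, exponentiate, normalise
    e = np.exp(scores - scores.max())
    return e / e.sum()

def nll(probs, target):
    # Negative log-likelihood of the correct class
    return -np.log(probs[target])

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p.sum())    # probabilities sum to 1
print(nll(p, 0))  # small loss: class 0 has the highest probability
print(nll(p, 2))  # large loss: the true class was given low probability
```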
167
What characterises a **convex function**?
The line segment between any two points on the graph lies on or above the graph ## Footnote Convex functions are easy to optimise: any local minimum is also the global minimum.
168
What are the characteristics of **non-convex optimisation** in neural networks?
* Non-convex loss surfaces * Many local minima * No guarantees of global optimum ## Footnote Despite challenges, gradient-based methods work well in practice.
169
What is the **core goal** of neural networks learning using backpropagation?
Compute how changing each weight affects the loss ## Footnote This is essential for training deep neural networks effectively.
170
In a deep neural network, the training process involves adjusting weights so the **loss decreases**. What is the key question that arises?
How do we know which weights to change? ## Footnote The answer lies in computing gradients using backpropagation.
171
What does a **computation graph** represent?
* Operations as nodes * Data flowing between them ## Footnote This structure makes gradient calculation systematic.
172
During the **forward pass**, what is computed step by step?
Outputs ## Footnote This process moves from input to loss.
173
What is the goal of the **backward pass** in backpropagation?
Compute derivatives of the loss with respect to every weight and input ## Footnote This process moves from loss back to input.
174
What key rule is used in backpropagation?
Chain rule ## Footnote This rule is fundamental for calculating gradients.
175
Define **upstream gradient** in the context of backpropagation.
Gradient coming from later nodes ## Footnote It tells the current node how its output influences the loss.
176
What is a **local gradient**?
Derivative of the current operation ## Footnote It is essential for calculating downstream gradients.
177
What is the relationship between **upstream**, **local**, and **downstream gradients**?
Downstream = Upstream × Local ## Footnote This pattern is crucial during backpropagation.
178
How does the **Add Operator** affect gradient flow?
Gradient is copied to both inputs ## Footnote This is because both inputs affect the output equally.
179
What happens to gradients in the **Multiply Operator**?
Gradients are multiplied by the opposite input ## Footnote This is often referred to as the 'swap multiplier'.
180
In the context of backpropagation, what does the **Max Operator** do?
Gradient only flows to the largest input ## Footnote It acts like a router for gradient flow.
181
What is the function of the **Copy Operator** in gradient flow?
Gradients from branches are added together ## Footnote This ensures that all influences are accounted for.
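The add and multiply rules from the cards above can be checked by hand on a tiny expression, f = (x + y) × z (values chosen for illustration):

```python
# Tiny hand-rolled backprop for f = (x + y) * z
x, y, z = 2.0, -1.0, 4.0

# Forward pass
s = x + y  # add node
f = s * z  # multiply node

# Backward pass (gradient of f with respect to itself is 1)
df = 1.0
# Multiply: "swap" the inputs; gradient w.r.t. s is z, w.r.t. z is s
ds = df * z
dz = df * s
# Add: copy the gradient to both inputs
dx = ds
dy = ds

print(dx, dy, dz)  # 4.0 4.0 1.0, matching df/dx = z, df/dy = z, df/dz = x + y
```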
182
What are the two main steps of **autograd systems** during backpropagation?
* Save inputs during forward pass * Multiply by local gradient during backward pass ## Footnote This process helps in efficiently calculating gradients.
183
What types of data structures do real networks use in backpropagation?
* Vectors * Matrices * Tensors ## Footnote These structures are necessary because weights and activations are often multidimensional.
184
What is a **regular derivative**?
Single input → single output ## Footnote It describes the relationship between one input and one output.
185
Define **gradient** in the context of derivatives.
Vector of derivatives ## Footnote It tells how output changes with each input element.
186
What is a **Jacobian**?
Matrix of derivatives ## Footnote It is used when there are multiple inputs and multiple outputs.
187
What is the process for backpropagation with vectors?
1. Receive upstream gradient 2. Multiply by local Jacobian 3. Produce downstream gradient ## Footnote This often results in matrix-vector multiplication.
188
In backpropagation with matrices or tensors, how are gradients propagated?
Multiplying gradients by transposed matrices ## Footnote This is essential for passing gradients through layers.
189
What is the key result for **matrix multiplication** during backpropagation?
* ∂ℒ/∂A = (∂ℒ/∂U) × Bᵀ * ∂ℒ/∂B = Aᵀ × (∂ℒ/∂U) ## Footnote This shows how to compute gradients for both matrices involved.
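Both rules can be verified numerically in NumPy; here the loss is taken to be the sum of the output, so the upstream gradient is all ones (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
U = A @ B
# Suppose the loss is L = sum(U), so the upstream gradient dL/dU is all ones
dU = np.ones_like(U)

# Analytic gradients from the rules above
dA = dU @ B.T
dB = A.T @ dU

# Finite-difference check on one element of A
eps = 1e-6
A2 = A.copy(); A2[0, 0] += eps
numeric = ((A2 @ B).sum() - U.sum()) / eps
print(np.isclose(numeric, dA[0, 0]))  # True
```

Note that `dA` has the shape of `A` and `dB` the shape of `B`, which is exactly what the transposes guarantee.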
190
Why does the **transpose** appear in gradients during matrix multiplication?
Each element of the output depends on a row of A and a column of B ## Footnote The gradient of one element equals the dot product between the corresponding row and column.
191
What is the benefit of using backpropagation efficiently?
Gradients are reused and computation graph stores intermediate values ## Footnote This makes complex networks manageable.
192
What is the memory hook for the **forward pass**?
Compute outputs ## Footnote This is a simple way to remember the function of the forward pass.
193
What is the memory hook for the **backward pass**?
Send gradients back ## Footnote This captures the essence of what happens during the backward pass.
194
What is the memory hook for the **chain rule**?
Multiply gradients ## Footnote This highlights the fundamental operation in backpropagation.
195
What is the memory hook for the **Add Operator**?
Copy gradient ## Footnote This helps recall how gradients flow through addition.
196
What is the memory hook for the **Multiply Operator**?
Swap inputs ## Footnote This reminds us of how gradients are affected by multiplication.
197
What is the memory hook for the **Max Operator**?
Route gradient ## Footnote This indicates how gradients are directed in max operations.
198
What is the memory hook for **matrices** in backpropagation?
Use transpose ## Footnote This is crucial for understanding how gradients are calculated in matrix operations.
199
What is the memory hook for **autograd**?
Automatic backprop ## Footnote This emphasises the convenience provided by autograd systems.
200
What are the **three distinct layer types** of a convolutional neural network (CNN)?
* Convolution layers * Pooling layers * Normalisation layers ## Footnote A CNN also includes activation functions (ReLU) and fully connected layers.
201
What is the purpose of **padding** in convolutional layers?
To keep spatial size the same ## Footnote Common choice for padding: P = (K - 1) / 2.
202
What does the **stride** parameter control in convolutional layers?
How far the kernel moves ## Footnote Stride = 1 for normal sliding, Stride = 2 for downsampling.
203
What is a **receptive field** in the context of CNNs?
Area of input affecting one output value ## Footnote Each output depends on a K × K region.
204
What do first-layer **filters** in a CNN typically learn?
* Edges * Colour contrasts * Texture patterns ## Footnote Similar to derivative-of-Gaussian and Gabor filters.
205
What is the output size formula for a convolution layer given input dimensions B × C_in × H × W?
B × C_out × H' × W' ## Footnote Where H' = (H - K + 2P)/S + 1 and W' = (W - K + 2P)/S + 1.
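The spatial part of the formula is easy to wrap as a helper (name is illustrative):

```python
def conv_out(size, k, p=0, s=1):
    """Output spatial size of a convolution: (W - K + 2P) / S + 1."""
    return (size - k + 2 * p) // s + 1

# A 3x3 kernel with padding 1 keeps the size; stride 2 roughly halves it
print(conv_out(32, k=3, p=1, s=1))  # 32
print(conv_out(32, k=3, p=1, s=2))  # 16
```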
206
What is the **main idea** behind convolutional neural networks (CNNs)?
Preserve spatial structure and reduce parameters ## Footnote Compared to fully connected networks.
207
What are the two types of **pooling** mentioned?
* Max Pooling * Average Pooling ## Footnote Pooling reduces spatial size and introduces invariance to small shifts.
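Both pooling types can be sketched with a reshape trick for the common 2 × 2, stride-2 case (single-channel toy input, names illustrative):

```python
import numpy as np

def pool2x2(x, mode="max"):
    """2 x 2 pooling with stride 2 on a single-channel H x W array."""
    h, w = x.shape
    # Group the array into non-overlapping 2 x 2 blocks, then reduce each block
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
print(pool2x2(x, "max"))  # [[4. 8.] [4. 1.]]
print(pool2x2(x, "avg"))  # [[2.5 6.5] [1.  1. ]]
```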
208
True or false: **Batch Normalisation** (BN) helps in faster training and allows higher learning rates.
TRUE ## Footnote BN normalises layer outputs to zero mean and unit variance.
209
What is the typical pattern for stacking **convolution layers**?
Conv → ReLU → Conv → ReLU → … ## Footnote Each layer increases abstraction and learns more complex features.
210
What is the **output** of a convolution layer with 6 filters of size 3 × 7 × 7?
6 × 26 × 26 ## Footnote Each filter detects a different feature.
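The stated 26 × 26 is consistent with a 3 × 32 × 32 input, stride 1 and no padding (the input size is an assumption, not stated in the card):

```python
# (W - K + 2P)/S + 1 with W = 32, K = 7, P = 0, S = 1
out = (32 - 7 + 2 * 0) // 1 + 1
print(out)  # 26, so six filters give a 6 x 26 x 26 output
```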
211
What is the **function** of adaptive pooling in CNNs?
Produce a fixed output size regardless of input size ## Footnote The framework computes the required kernel size and stride automatically.
212
What does the **general convolution formula** output dimensions depend on?
* Input dimensions * Kernel size * Padding * Stride ## Footnote Output dimensions are calculated using specific formulas.
213
What is the **importance** of the kernel in a convolution layer?
Must have the same number of channels as the input ## Footnote Each filter's dot product then spans the full depth of the input at every spatial location.
214
What is the **output** of a convolution layer with no padding?
Output shrinks each layer ## Footnote This can be mitigated by adding zero-padding.
215
What does **batch normalisation** (BN) do during test time?
Uses running averages of μ and σ ## Footnote Becomes a simple linear operation that can be fused with the convolution layer.
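Because μ and σ are fixed at test time, BN collapses to an affine map y = a·x + b, which is why it can be fused into the preceding layer (a sketch with illustrative parameter values):

```python
import numpy as np

# Test-time BN: y = gamma * (x - mu) / sqrt(var + eps) + beta
gamma, beta, mu, var, eps = 2.0, 0.5, 1.0, 4.0, 1e-5

# Rewrite as a single affine map y = a * x + b
a = gamma / np.sqrt(var + eps)
b = beta - a * mu

x = np.array([0.0, 1.0, 3.0])
bn = gamma * (x - mu) / np.sqrt(var + eps) + beta
fused = a * x + b
print(np.allclose(bn, fused))  # True
```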
216
What is the **role** of pooling in CNNs?
* Reduce spatial size * Introduce invariance to small shifts * Reduce computation ## Footnote Pooling has no learnable parameters.
217
What is the **output size** of a pooling layer with kernel size and stride defined?
C × H' × W' ## Footnote H' = (H − K)/S + 1 and W' = (W − K)/S + 1; the channel count is unchanged.
218
What are the **components of a CNN**?
* Convolution layers * Pooling layers * Non-linearity (ReLU) * Normalisation * MLP (Fully connected layers) ## Footnote General pattern: Conv → ReLU → Pool → (repeat) → Flatten → FC
219
What is the **input size** for AlexNet?
3 × 227 × 227 ## Footnote AlexNet was the winner of the ImageNet competition in 2012.
220
What is the **output size formula** for a convolution layer?
W′ = (W − K + 2P)/S + 1 ## Footnote This formula calculates the output dimensions based on input size, kernel size, padding, and stride.
221
What is the output size of the **first convolution layer** in AlexNet?
96 × 55 × 55 ## Footnote Calculated using the input size and parameters of the conv1 layer.
222
How much **memory** does the first convolution layer in AlexNet consume?
1134KB ## Footnote Memory is calculated based on the number of output elements and their size.
223
What is the total number of **parameters** in the first convolution layer of AlexNet?
34,944 ## Footnote This includes weights and biases for the layer.
224
What is the **FLOP calculation** for the first convolution layer in AlexNet?
105,705,600 ≈ 106 MFLOPs ## Footnote Convolution layers dominate computation in CNNs.
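The conv1 figures from the cards above can be reproduced in a few lines; this assumes stride 4 with no padding (consistent with the 55 × 55 output) and a FLOP convention of one multiply-add per weight including the bias:

```python
# AlexNet conv1: input 3 x 227 x 227, 96 filters of 11 x 11, stride 4, no padding
C_in, W_in = 3, 227
C_out, K, S, P = 96, 11, 4, 0

W_out = (W_in - K + 2 * P) // S + 1                 # spatial output size
memory_kb = C_out * W_out * W_out * 4 / 1024        # float32 outputs, 4 bytes each
params = C_out * (C_in * K * K + 1)                 # weights plus one bias per filter
flops = C_out * W_out * W_out * (C_in * K * K + 1)  # one multiply-add per weight

print(W_out, memory_kb, params, flops)
# 55  1134.375  34944  105705600
```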
225
What is the output size of the **pooling layer (pool1)** in AlexNet?
96 × 27 × 27 ## Footnote Pooling layers have no learnable parameters and very small FLOPs.
226
List the layers in the **full AlexNet architecture**.
* conv1 * pool1 * conv2 * pool2 * conv3 * conv4 * conv5 * pool5 * flatten → 9216 * fc6 → 4096 * fc7 → 4096 * fc8 → 1000 classes ## Footnote Flatten size is calculated as 256×6×6=9216.
227
Where do the **most costs** occur in AlexNet?
* Most memory → early convolution layers * Most parameters → fully connected layers * Most FLOPs → convolution layers ## Footnote Important points for exam preparation.
228
What do the **early layers** of CNNs learn according to Zeiler & Fergus?
* Edges * Colours ## Footnote Mid layers learn textures and parts; deep layers learn objects and shapes.
229
What is the design philosophy of **VGG**?
Deeper networks with regular structure ## Footnote VGG emphasises simplicity and depth.
230
What are the **design rules** for VGG?
* All conv layers: 3 × 3 kernel, stride 1, pad 1 * All max pooling: 2 × 2 kernel, stride 2 * After each pooling: double number of channels ## Footnote VGG-19 includes extra conv layers in stages 4 and 5.
231
What are the **four key innovations** of GoogLeNet (Inception V1)?
* Stem Network * Inception Module * Global Average Pooling * Auxiliary Classifiers ## Footnote These innovations aim for efficiency in terms of parameters, memory, and FLOPs.
232
What is the **core idea** of Residual Networks (ResNet)?
Instead of learning a mapping H(X) directly, let each block learn the residual F(X) and output F(X) + X ## Footnote Shortcut connections make identity mappings easy, which eases optimisation of very deep networks.
233
What is the structure of a **Residual Block**?
* Conv 3×3 * ReLU * Conv 3×3 * + skip connection ## Footnote Output is F(X)+X, facilitating identity mapping.
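A toy sketch of the residual idea, with F built from two linear maps and a ReLU standing in for the 3 × 3 convolutions (an assumption for brevity, not the real block):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, w1, w2):
    """Toy residual block: F(x) = W2 . relu(W1 . x), output F(x) + x.
    (Real ResNet blocks use 3 x 3 convolutions in place of W1, W2.)"""
    return w2 @ relu(w1 @ x) + x  # the "+ x" is the skip connection

# With zero weights F(x) = 0 and the block reduces to the identity;
# this easy fallback is what makes very deep ResNets optimisable
x = np.array([1.0, -2.0, 3.0])
w = np.zeros((3, 3))
print(residual_block(x, w, w))  # [ 1. -2.  3.]
```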
234
What distinguishes a **Bottleneck Block** from a Basic Block in ResNet?
* Basic Block: Two 3×3 convolutions * Bottleneck Block: 1×1 → 3×3 → 1×1 ## Footnote Bottleneck blocks reduce computation and enable deeper networks.
235
What is the **architecture comparison** summary for AlexNet, VGG, GoogLeNet, and ResNet?
* AlexNet / VGG: [Conv + Pool + ReLU] × N, Flatten, [FC + ReLU] × N, FC * GoogLeNet: Inception modules, Global average pooling, No big FC layers * ResNet: Residual blocks, Identity shortcuts, Global average pooling ## Footnote Each architecture has unique features that define its structure and efficiency.
236
What are the **ADHD-Friendly Memory Hooks** for CNN architectures?
* AlexNet = first big CNN * VGG = deep + simple 3×3 * GoogLeNet = inception + efficient * ResNet = skip connections * Memory heavy = early conv * Parameters heavy = FC layers * FLOPs heavy = conv layers * Residual = F(X) + X ## Footnote These hooks help in remembering key concepts related to CNN architectures.