Computer Vision
Image understanding, classification, detection
Computer Vision (CV) enables machines to interpret and understand visual information from the world. It spans classical techniques like edge detection and feature extraction to modern deep learning approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Today's multimodal models like GPT-4V and Claude 3 Vision can analyze images alongside text, opening new possibilities for visual reasoning and understanding.
Key Concepts
Convolutional Neural Networks (CNNs)
CNNs are the foundational architecture for image processing. They use learnable filters (kernels) that slide across input images to extract hierarchical features: from edges and textures in early layers to complex objects in deeper layers. Key components include convolutional layers (feature extraction), pooling layers (spatial downsampling), batch normalization (training stability), and fully connected layers (classification). Popular architectures include ResNet (residual connections), EfficientNet (compound scaling), and MobileNet (depthwise separable convolutions for efficiency).
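To make the layer anatomy concrete, a parameter count is a quick sanity check: a convolutional layer holds one kernel-sized weight tensor per output filter, plus one bias per filter. A minimal, dependency-free sketch (the 7×7, 64-filter example mirrors a ResNet-style stem layer):

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Parameters in a 2D conv layer: one kernel_size x kernel_size x in_channels
    weight tensor per output filter, plus one bias term per filter."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    biases = out_channels
    return weights + biases

# A 7x7 convolution over an RGB image with 64 filters:
print(conv_params(7, 3, 64))  # 9472
```

Note how the count is independent of the input's spatial size; that weight sharing is what lets the same filters scan images of any resolution.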
Object Detection Architectures
Object detection localizes and classifies multiple objects in an image. Two-stage detectors like R-CNN, Fast R-CNN, and Faster R-CNN first propose regions of interest, then classify them, achieving higher accuracy at slower inference. Single-stage detectors like YOLO (You Only Look Once) and SSD (Single Shot Detector) predict bounding boxes and classes in one pass, offering real-time performance. Modern variants include YOLOv8/v9/v10, YOLO-NAS, and RT-DETR (real-time DETR with transformers). Key metrics include mAP@0.5, mAP@0.5:0.95, and inference FPS.
Image Segmentation
Segmentation assigns a class to each pixel. Semantic segmentation treats all instances of a class identically (e.g., all 'car' pixels merged), using architectures like U-Net, DeepLabv3+, and SegFormer. Instance segmentation distinguishes individual objects, combining detection with pixel masks; Mask R-CNN is the classic approach, with newer models like YOLOv8-seg and Mask2Former. Panoptic segmentation unifies both, assigning each pixel to either a semantic class or a specific instance identifier.
Vision Transformers (ViT)
Vision Transformers apply the transformer architecture to images by splitting them into patches, linearly embedding each patch, adding position embeddings, and processing through standard transformer encoder blocks. ViTs excel with large-scale pretraining (JFT-300M, ImageNet-21K) and can outperform CNNs on large datasets. Variants include DeiT (data-efficient training), Swin Transformer (hierarchical with shifted windows), and ViT-22B (one of the largest vision models). ViTs lack the built-in inductive biases of CNNs, such as locality and translation equivariance, so they require more data but offer better scalability and global context modeling.
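The patch split determines the transformer's sequence length. A small sketch of that arithmetic (a prepended [CLS] token is assumed, as in the original ViT classification setup):

```python
def vit_sequence_length(image_size: int, patch_size: int, cls_token: bool = True) -> int:
    """Number of tokens a ViT processes: (image_size / patch_size)^2 patches,
    plus an optional [CLS] token prepended for classification."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + (1 if cls_token else 0)

# ViT-B/16 on a 224x224 input: 14x14 = 196 patches, 197 tokens with [CLS]
print(vit_sequence_length(224, 16))  # 197
```

Because self-attention cost grows quadratically in this sequence length, halving the patch size (16 → 8) quadruples the number of patches, which is why larger patches or hierarchical designs like Swin are used at high resolution.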
Transfer Learning and Fine-tuning
Transfer learning leverages pretrained models on large datasets (ImageNet, LAION) for downstream tasks with limited data. Common approaches include feature extraction (freeze backbone, train classifier head), fine-tuning (unfreeze and train with lower learning rate), and progressive unfreezing. Pretrained weights are available via torchvision, timm (PyTorch Image Models), Hugging Face, and TensorFlow Hub. For vision tasks, even small datasets (100-1000 images) can achieve strong results with proper transfer learning and data augmentation.
Multimodal Vision-Language Models
Modern multimodal models combine vision and language understanding. GPT-4V (OpenAI) and Claude 3 Vision (Anthropic) accept image inputs and can perform visual reasoning, OCR, chart interpretation, and visual question answering. Open models like LLaVA, Qwen-VL, and Idefics2 use vision encoders (CLIP, SigLIP) with language model decoders. These models excel at tasks requiring both visual perception and language understanding-describing images, answering questions about visual content, and generating text based on visual inputs. They represent the convergence of NLP and CV into unified architectures.
Feature Extraction and Embeddings
Visual feature extraction converts images into dense vector representations. Classical methods (SIFT, HOG, SURF) extract hand-crafted features based on gradients and keypoints. Deep learning approaches use pretrained CNN backbones (ResNet, EfficientNet) or vision encoders (CLIP, DINO, DINOv2) to produce embeddings, typically 512-2048-dimensional vectors. CLIP embeddings align images and text in a shared space, enabling zero-shot classification via text prompts. DINO/DINOv2 produce self-supervised features useful for downstream tasks without labels. These embeddings power image retrieval and similarity search, and serve as input features to other ML models.
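Zero-shot classification with CLIP reduces to nearest-neighbor search in the shared embedding space. A dependency-free sketch, with toy 3-dimensional vectors standing in for real 512-dimensional CLIP embeddings (the prompt strings and vectors are illustrative only):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, text_embeddings):
    """Pick the text prompt whose embedding is closest to the image embedding."""
    return max(text_embeddings,
               key=lambda label: cosine_similarity(image_embedding, text_embeddings[label]))

# Toy embeddings; a real pipeline would encode the image and each prompt with CLIP.
image = [0.9, 0.1, 0.2]
prompts = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.3],
}
print(zero_shot_classify(image, prompts))  # a photo of a cat
```

The same cosine score is what drives image retrieval: embed the query, then rank catalog embeddings by similarity.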
Data Augmentation and MLOps for Vision
Data augmentation expands training data diversity. Geometric augmentations include random crop, flip, rotation, and scaling. Color augmentations adjust brightness, contrast, saturation, and hue. Advanced techniques include CutMix (paste patches between images), MixUp (blend images and labels), RandAugment (random augmentation policy), and AutoAugment (learned policies). For production, tools like FiftyOne for dataset curation, Weights & Biases for experiment tracking, and ONNX/TensorRT for optimized inference are essential. Model serving options include TorchServe, TensorFlow Serving, and cloud services (AWS SageMaker, Vertex AI).
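The geometric augmentations above are just array transforms; libraries like albumentations and torchvision implement them efficiently on tensors, but a dependency-free sketch on a nested-list "image" makes the geometry concrete:

```python
def horizontal_flip(image):
    """Mirror each row left-to-right (a deterministic RandomHorizontalFlip)."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise: reverse rows, then transpose."""
    return [list(row) for row in zip(*image[::-1])]

img = [[1, 2],
       [3, 4]]
print(horizontal_flip(img))  # [[2, 1], [4, 3]]
print(rotate_90(img))        # [[3, 1], [4, 2]]
```

In real pipelines these transforms are applied randomly per batch, and label-aware variants (flipping bounding boxes along with pixels) are needed for detection tasks.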
Solved Examples
Problem 1:
A convolutional layer has an input image of size 224×224×3. It uses 64 filters of size 7×7 with stride 2 and padding 3. What is the output feature map dimension?
Solution:
Step 1: Identify the given parameters
- Input spatial size (H, W) = 224×224
- Number of filters (output channels) = 64
- Kernel size (K) = 7×7
- Stride (S) = 2
- Padding (P) = 3
Step 2: Apply the convolution output formula
Output spatial size = (Input - Kernel + 2×Padding) / Stride + 1
= (224 - 7 + 2×3) / 2 + 1
= (224 - 7 + 6) / 2 + 1
= floor(223 / 2) + 1 (frameworks apply an integer floor to the division)
= 111 + 1 = 112
Step 3: Determine final output dimension
- Spatial: 112×112
- Channels: 64 (number of filters)
Answer: Output feature map dimension is 112×112×64
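The formula from Step 2 can be checked in code. A small helper with the integer floor that frameworks such as PyTorch and TensorFlow apply:

```python
import math

def conv_output_size(input_size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution: floor((I - K + 2P) / S) + 1."""
    return math.floor((input_size - kernel + 2 * padding) / stride) + 1

# Problem 1: 224x224 input, 7x7 kernel, stride 2, padding 3
print(conv_output_size(224, 7, 2, 3))  # 112
```

Running the same check against a framework's actual layer (e.g. inspecting the output tensor shape) is a good habit, since dilation and rounding conventions can differ.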
Problem 2:
An object detector predicts a bounding box [50, 50, 200, 200] for a car. The ground truth box is [60, 60, 190, 190]. Calculate the IoU. Should this prediction be considered correct at IoU threshold 0.5?
Solution:
Step 1: Identify coordinates (x1, y1, x2, y2 format)
- Predicted box: (50, 50, 200, 200)
- Ground truth: (60, 60, 190, 190)
Step 2: Calculate intersection area
- Intersection x1 = max(50, 60) = 60
- Intersection y1 = max(50, 60) = 60
- Intersection x2 = min(200, 190) = 190
- Intersection y2 = min(200, 190) = 190
- Intersection width = 190 - 60 = 130
- Intersection height = 190 - 60 = 130
- Intersection area = 130 × 130 = 16,900
Step 3: Calculate union area
- Predicted box area = (200-50) × (200-50) = 150 × 150 = 22,500
- Ground truth area = (190-60) × (190-60) = 130 × 130 = 16,900
- Union = Predicted + Ground truth - Intersection
= 22,500 + 16,900 - 16,900 = 22,500
Step 4: Calculate IoU
IoU = Intersection / Union = 16,900 / 22,500 = 0.751
Answer: IoU = 0.751 (75.1%). Since 0.751 > 0.5, this prediction IS considered correct at the 0.5 IoU threshold.
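The same computation as a reusable function, for boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero so non-overlapping boxes yield no intersection area
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(round(iou([50, 50, 200, 200], [60, 60, 190, 190]), 3))  # 0.751
```

The `max(0, ...)` clamp matters: without it, disjoint boxes would produce a negative "intersection" and a nonsense IoU.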
Problem 3:
You have 5,000 labeled images for defect detection in manufacturing. You want to use transfer learning with a pretrained ResNet50. Design a training strategy.
Solution:
Step 1: Split the data
- Training: 4,000 images (80%)
- Validation: 500 images (10%)
- Test: 500 images (10%)
- Apply stratified splitting to maintain class balance
Step 2: Data augmentation strategy
- Use albumentations or torchvision.transforms
- Augmentations: horizontal flip, rotation (±15°), brightness/contrast adjustment
- Consider CutMix or MixUp if classes are imbalanced
- For defect detection, preserve defect integrity: avoid aggressive crops that might remove defects
Step 3: Model modification
- Load ResNet50 with pretrained ImageNet weights (via torchvision or timm)
- Replace final fully connected layer: nn.Linear(2048, num_classes)
- Initialize new classifier weights appropriately
Step 4: Training phases
Phase 1 (Frozen backbone):
- Freeze all ResNet layers except the classifier head
- Train for 5-10 epochs with lr=0.01
- This quickly adapts the classifier to your classes
Phase 2 (Fine-tuning):
- Unfreeze the last 2-3 residual blocks
- Use discriminative learning rates: backbone lr=1e-4, classifier lr=1e-3
- Train for 20-30 epochs with early stopping on validation loss
Step 5: Monitoring and optimization
- Track validation accuracy, F1 score (better for imbalanced data)
- Use learning rate scheduling (cosine annealing or ReduceLROnPlateau)
- Save best checkpoint based on validation F1
Answer: Use two-phase transfer learning: freeze the backbone for initial classifier adaptation, then fine-tune the last layers with discriminative learning rates. This approach balances fast convergence with preservation of learned features, and suits the limited dataset size.
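The two-phase schedule above can be expressed as optimizer parameter groups. A framework-free sketch (the group names `backbone_last_blocks` and `head` are illustrative; in PyTorch, dicts of this shape, with `params` holding actual parameter iterables, are passed directly to an optimizer such as AdamW):

```python
def make_param_groups(phase: str):
    """Learning-rate plan for the two-phase strategy: phase 1 trains only the
    classifier head; phase 2 fine-tunes the last backbone blocks at a 10x
    lower rate than the head."""
    if phase == "frozen_backbone":
        return [{"params": "head", "lr": 1e-2}]
    elif phase == "fine_tune":
        return [
            {"params": "backbone_last_blocks", "lr": 1e-4},
            {"params": "head", "lr": 1e-3},
        ]
    raise ValueError(f"unknown phase: {phase}")

for group in make_param_groups("fine_tune"):
    print(group["params"], group["lr"])
```

Keeping the schedule in one declarative function like this makes it easy to log, review, and reproduce across experiments.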
Problem 4:
Compare using CLIP embeddings versus a fine-tuned CNN classifier for a product image search system with 100,000 catalog images. Recommend the better approach.
Solution:
Step 1: Analyze CLIP embeddings approach
Pros:
- Zero-shot capability: no training data needed
- Unified image-text space: enables natural language queries ('red summer dress')
- Fast deployment: precompute embeddings, use approximate nearest neighbor (ANN) search
- Generalization: handles unseen product categories
Cons:
- Lower accuracy on domain-specific products compared to fine-tuned models
- Fixed embedding size: may not capture fine-grained details (texture, material)
- Compute: embedding generation benefits from a GPU, though per-image inference is fast
Step 2: Analyze fine-tuned CNN approach
Pros:
- Higher accuracy on your specific product categories
- Can optimize for domain-specific features
- Smaller model possible via distillation or efficient architectures
Cons:
- Requires labeled training data (category labels)
- Training time and hyperparameter tuning
- Text queries need separate processing (no unified embedding space)
- Retraining needed for new categories
Step 3: Hybrid recommendation
For product search at this scale:
1. Use CLIP as the primary embedding model:
- Precompute 512-dim CLIP embeddings for all 100,000 images
- Build ANN index using FAISS or ScaNN
- Enable both image-to-image and text-to-image search
2. For critical product categories with high search volume:
- Fine-tune a smaller model (MobileNet, EfficientNet-B0) for reranking
- Use CLIP for retrieval, then rerank top-100 results with fine-tuned model
3. Implementation:
- CLIP ViT-B/32: fast inference, good quality
- FAISS IVF with 256 clusters: balance speed and recall
- Approximate search: <10ms for 100K images
Answer: Recommend CLIP embeddings as the primary approach with optional fine-tuned reranking. This provides immediate deployment capability, natural language search, and good accuracy. The hybrid approach adds fine-tuned reranking for high-traffic categories, combining CLIP's flexibility with domain-specific accuracy. This balances development speed, search quality, and computational efficiency for the 100,000-image catalog.
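The retrieve-then-rerank flow can be sketched without any index library. FAISS/ScaNN replace the exact scan below with an approximate index at scale; the scalar "embeddings" and scoring functions here are toy stand-ins for CLIP similarities and a fine-tuned reranker:

```python
import heapq

def retrieve(query, catalog, k=3, score=None):
    """Return the k catalog items most similar to the query (exact scan;
    an ANN index would approximate this over 100K+ items)."""
    score = score or (lambda q, item: -abs(q - item))  # toy similarity on scalars
    return heapq.nlargest(k, catalog, key=lambda item: score(query, item))

def rerank(query, candidates, fine_score):
    """Re-order a small candidate set with a (more expensive) domain-specific scorer."""
    return sorted(candidates, key=lambda item: fine_score(query, item), reverse=True)

catalog = [2, 8, 5, 1, 9, 4]
top3 = retrieve(5, catalog, k=3)  # cheap first-stage retrieval
print(rerank(5, top3, lambda q, i: -(q - i) ** 2))  # expensive second stage on few items
```

The key property: the expensive scorer only ever sees the small candidate set, so reranking cost stays constant as the catalog grows.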
Tips & Tricks
- When calculating CNN output dimensions, always verify with a quick tensor shape check in code-framework implementations may have subtle rounding differences that affect your calculations.
- For object detection evaluation, understand the difference between mAP@0.5 (Pascal VOC metric) and mAP@0.5:0.95 (COCO metric, averaged across IoU thresholds from 0.5 to 0.95); modern benchmarks use the latter.
- In transfer learning, use a conservatively small learning rate when fine-tuning. A good rule: the backbone LR should be about 10× smaller than the classifier LR to avoid destroying pretrained features.
- When choosing between YOLO and Faster R-CNN, consider: YOLO for real-time applications (>30 FPS needed), Faster R-CNN for maximum accuracy when speed is secondary. Modern YOLO variants have largely closed the accuracy gap.
- For production vision systems, always implement confidence thresholding and non-maximum suppression (NMS). Without NMS, object detectors may produce multiple overlapping predictions for the same object.
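A minimal greedy NMS, sketched in plain Python with boxes in (x1, y1, x2, y2) format (real pipelines use the batched implementations in torchvision or the detector library itself):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop any
    remaining box that overlaps it above the IoU threshold, then repeat."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep  # indices of surviving boxes

boxes = [[50, 50, 200, 200], [60, 60, 190, 190], [300, 300, 400, 400]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

Here the second box (IoU 0.751 with the first, as in Problem 2) is suppressed, while the distant third box survives; confidence thresholding would be applied to `scores` before this step.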
- When using multimodal models (GPT-4V, Claude 3 Vision) for document understanding, pre-process images to appropriate resolution (typically 768-1024px on longest side) to balance token usage and OCR accuracy.