Part 1. Foundations of Convolutional Neural Networks (CNN)

1. The Dimensionality Explosion in Multi-Layer Perceptrons (MLP)

Consider a binary classification task: separating images of cats and dogs. Using a modern camera to capture standard images yields a resolution of roughly 12 Megapixels.

  • Input Size: A single 3-channel RGB image contains $12\text{M} \times 3 = 36\text{M}$ numerical features.
  • MLP Scaling: If we connect this input to a single hidden layer with just 100 hidden units, the weight matrix alone would require: \(36\text{M} \times 100 = 3.6 \times 10^{9} \text{ parameters } (\approx 14\text{GB of VRAM})\)

This model architecture boasts more parameters than the global census of cats and dogs combined. To fix this over-parameterization, we must incorporate two spatial inductive biases: Translation Invariance and Locality.


2. Re-examining the Fully Connected Layer

Let the input $\mathbf{X}$ and the hidden representations $\mathbf{H}$ be structured as 2D spatial matrices instead of flattened vectors. The weights link every input coordinate $(k, l)$ to every output coordinate $(i, j)$, requiring a 4D tensor $\mathbf{W}$:

\[h_{i,j} = \sum_{k,l} W_{i,j,k,l} x_{k,l}\]

By changing indexes with a spatial offset $(a, b) = (k - i, l - j)$, we can rewrite the 4D weight tensor as $\mathbf{V}$ where $V_{i,j,a,b} = W_{i,j,i+a,j+b}$:

\[h_{i,j} = \sum_{a,b} V_{i,j,a,b} x_{i+a,j+b}\]

Principle #1: Translation Invariance

An object’s identity does not change when it shifts across an image. Therefore, the structural response $\mathbf{V}$ should not depend on the absolute spatial output coordinates $(i, j)$. This forces $V_{i,j,a,b} = V_{a,b}$, converting the operation into a standard Cross-Correlation:

\[h_{i,j} = \sum_{a,b} V_{a,b} x_{i+a,j+b}\]

Principle #2: Locality

To evaluate local features at $(i, j)$, we should not process pixels located excessively far away. Thus, for any spatial offset beyond a specific radius $|a| > \Delta$ or $|b| > \Delta$, we set $V_{a,b} = 0$.

Introducing a bias term $u$, the formulation transforms into a localized spatial convolution:

\[H_{i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} V_{a, b} X_{i+a, j+b}\]

Using convolutional constraints, parameter counts plunge from billions to hundreds. However, this relies heavily on the inductive bias matching reality; if an application breaks translation invariance, CNNs may struggle to fit the training data.


3. Spatial Aggregation: Max Pooling

Max Pooling abstracts spatial representation by extracting the maximum activation within a sliding kernel window.

While Max Pooling provides slight translation and deformation tolerance, standard CNN architectures still cannot naturally generalize to massively scaled or rotated targets. To mitigate this structural limitation, aggressive Data Augmentation (random scaling, rotation, cropping) remains mandatory during training pipelines.


Part 2. Object Detection with YOLOv1 and YOLOv3

1. The Bounding Box Paradigm

YOLO splits the input image into an $S \times S$ grid. If an object’s absolute ground-truth center (midpoint) falls inside a specific grid cell, that cell is assigned responsibility for detecting that object.

  • Ground-Truth Target Structure: For each grid cell, the ground-truth vector is defined as: \(\text{label}_{\text{cell}} = [C_1, C_2, \dots, C_{20}, P_c, X, Y, W, H]\) Where $P_c \in {0, 1}$ indicates object presence, and $[X, Y, W, H]$ defines the relative bounding box.
  • Coordinate Mapping: $X, Y \in [0, 1]$ are strictly bounded relative offsets within the specific grid cell boundary. $W, H$ represent structural multipliers normalized relative to the full image dimensions (and can thus exceed $1.0$).

2. Model Implementations via PyTorch

Below is the structured implementation for parsing configurations, manipulating prediction tensors, executing Non-Maximum Suppression (NMS), and building the Darknet backbone network.

2.1 Bounding Box Math & Tensor Utilities

import torch
import torch.nn as nn
import numpy as np
import cv2

def unique(tensor):
    """Extracts unique class elements from a 1D tensor safely."""
    tensor_np = tensor.cpu().numpy()
    unique_np = np.unique(tensor_np)
    unique_tensor = torch.from_numpy(unique_np)
    
    tensor_res = tensor.new(unique_tensor.shape)
    tensor_res.copy_(unique_tensor)
    return tensor_res

def bbox_iou(box1, box2):
    """Calculates intersection-over-union (IoU) scores between two batches of boxes."""
    b1_x1, b1_y1, b1_x2, b1_y2 = box1[:,0], box1[:,1], box1[:,2], box1[:,3]
    b2_x1, b2_y1, b2_x2, b2_y2 = box2[:,0], box2[:,1], box2[:,2], box2[:,3]
    
    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)
    
    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * \
                 torch.clamp(inter_rect_y2 - inter_rect_y1 + 1, min=0)

    b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)
    
    iou = inter_area / (b1_area + b2_area - inter_area)
    return iou

2.2 Prediction Vector Transformation & Post-Processing

def predict_transform(prediction, inp_dim, anchors, num_classes, CUDA=True):
    """Transforms raw network output feature maps into predictable box coordinates."""
    batch_size = prediction.size(0)
    stride = inp_dim // prediction.size(2)
    grid_size = inp_dim // stride
    bbox_attrs = 5 + num_classes
    num_anchors = len(anchors)
    
    prediction = prediction.view(batch_size, bbox_attrs * num_anchors, grid_size * grid_size)
    prediction = prediction.transpose(1, 2).contiguous()
    prediction = prediction.view(batch_size, grid_size * grid_size * num_anchors, bbox_attrs)
    
    anchors = [(a[0]/stride, a[1]/stride) for a in anchors]

    # Map center coordinates and objectness scores through Sigmoid activation
    prediction[:,:,0] = torch.sigmoid(prediction[:,:,0])
    prediction[:,:,1] = torch.sigmoid(prediction[:,:,1])
    prediction[:,:,4] = torch.sigmoid(prediction[:,:,4])
    
    # Calculate absolute grid location offsets
    grid = np.arange(grid_size)
    a, b = np.meshgrid(grid, grid)

    x_offset = torch.FloatTensor(a).view(-1, 1)
    y_offset = torch.FloatTensor(b).view(-1, 1)

    if CUDA:
        x_offset = x_offset.cuda()
        y_offset = y_offset.cuda()

    x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1, num_anchors).view(-1, 2).unsqueeze(0)
    prediction[:,:,:2] += x_y_offset

    # Apply anchor scale dimensions via log-space transforms
    anchors = torch.FloatTensor(anchors)
    if CUDA:
        anchors = anchors.cuda()

    anchors = anchors.repeat(grid_size * grid_size, 1).unsqueeze(0)
    prediction[:,:,2:4] = torch.exp(prediction[:,:,2:4]) * anchors
    
    # Class probability activations
    prediction[:,:,5:5+num_classes] = torch.sigmoid((prediction[:,:,5:5+num_classes]))
    prediction[:,:,:4] *= stride
    
    return prediction

def write_results(prediction, confidence, num_classes, nms_conf=0.4):
    """Filters target predictions via thresholding and Non-Maximum Suppression (NMS)."""
    conf_mask = (prediction[:,:,4] > confidence).float().unsqueeze(2)
    prediction = prediction * conf_mask
    
    box_corner = prediction.new(prediction.shape)
    box_corner[:,:,0] = (prediction[:,:,0] - prediction[:,:,2]/2)
    box_corner[:,:,1] = (prediction[:,:,1] - prediction[:,:,3]/2)
    box_corner[:,:,2] = (prediction[:,:,0] + prediction[:,:,2]/2) 
    box_corner[:,:,3] = (prediction[:,:,1] + prediction[:,:,3]/2)
    prediction[:,:,:4] = box_corner[:,:,:4]
    
    batch_size = prediction.size(0)
    write = False

    for ind in range(batch_size):
        image_pred = prediction[ind]
        
        max_conf, max_conf_score = torch.max(image_pred[:, 5:5+num_classes], 1)
        max_conf = max_conf.float().unsqueeze(1)
        max_conf_score = max_conf_score.float().unsqueeze(1)
        image_pred = torch.cat((image_pred[:, :5], max_conf, max_conf_score), 1)
        
        non_zero_ind = torch.nonzero(image_pred[:, 4])
        try:
            image_pred_ = image_pred[non_zero_ind.squeeze(), :].view(-1, 7)
        except:
            continue
            
        if image_pred_.shape[0] == 0:
            continue       
            
        img_classes = unique(image_pred_[:, -1])
        
        for cls in img_classes:
            cls_mask = image_pred_ * (image_pred_[:, -1] == cls).float().unsqueeze(1)
            class_mask_ind = torch.nonzero(cls_mask[:, -2]).squeeze()
            image_pred_class = image_pred_[class_mask_ind].view(-1, 7)
            
            conf_sort_index = torch.sort(image_pred_class[:, 4], descending=True)[1]
            image_pred_class = image_pred_class[conf_sort_index]
            idx = image_pred_class.size(0)
            
            for i in range(idx):
                try:
                    ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i+1:])
                except (ValueError, IndexError):
                    break
            
                iou_mask = (ious < nms_conf).float().unsqueeze(1)
                image_pred_class[i+1:] *= iou_mask       
            
                non_zero_ind = torch.nonzero(image_pred_class[:, 4]).squeeze()
                image_pred_class = image_pred_class[non_zero_ind].view(-1, 7)
                
            batch_ind = image_pred_class.new(image_pred_class.size(0), 1).fill_(ind)
            seq = batch_ind, image_pred_class
            
            if not write:
                output = torch.cat(seq, 1)
                write = True
            else:
                out = torch.cat(seq, 1)
                output = torch.cat((output, out))
    try:
        return output
    except:
        return 0

2.3 Image Preprocessing Pipeline

def letterbox_image(img, inp_dim):
    """Resizes an image using padding while preserving original aspect ratio."""
    img_w, img_h = img.shape[1], img.shape[0]
    w, h = inp_dim
    new_w = int(img_w * min(w/img_w, h/img_h))
    new_h = int(img_h * min(w/img_w, h/img_h))
    resized_image = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_CUBIC)
    
    canvas = np.full((inp_dim[1], inp_dim[0], 3), 128)
    canvas[(h-new_h)//2:(h-new_h)//2 + new_h, (w-new_w)//2:(w-new_w)//2 + new_w, :] = resized_image
    return canvas

def prep_image(img, inp_dim):
    """Prepares standard OpenCV image matrices into normalized float Torch Tensors."""
    img = letterbox_image(img, (inp_dim, inp_dim))
    img = img[:,:,::-1].transpose((2,0,1)).copy() # BGR to RGB, then HWC to CHW
    img = torch.from_numpy(img).float().div(255.0).unsqueeze(0)
    return img

2.4 Parsing the Darknet Configuration File (.cfg)

def parse_cfg(cfgfile):
    """Parses structural text blocks from Darknet architecture network files."""
    file = open(cfgfile, 'r')
    lines = file.read().split('\n')
    lines = [x for x in lines if len(x) > 0 and x[0] != '#']
    lines = [x.rstrip().lstrip() for x in lines]
    
    block = {}
    blocks = []
    
    for line in lines:
        if line[0] == "[":
            if len(block) != 0:
                blocks.append(block)
                block = {}
            block["type"] = line[1:-1].rstrip()     
        else:
            key, value = line.split("=")
            block[key.rstrip()] = value.lstrip()
    blocks.append(block)
    return blocks


3. Assembling the Darknet Modular Infrastructure

In PyTorch, a custom layer must explicitly detail structural state maps within its sub-classed forward step. However, building boilerplate modules for structural manipulation layers like Route or Shortcut creates unnecessary abstraction overhead.

Instead, we place a dummy module (EmptyLayer) into our generated sequential chain. The conditional tensor slicing, channel concatenation (torch.cat), and residual additions are then handled directly inside the main model definition’s forward method.

class EmptyLayer(nn.Module):
    """A placeholder layer used for structural routing logic (Route & Shortcut)."""
    def __init__(self):
        super(EmptyLayer, self).__init__()

class DetectionLayer(nn.Module):
    """A placeholder detection block initialized with specific scaled anchor tensors."""
    def __init__(self, anchors):
        super(DetectionLayer, self).__init__()
        self.anchors = anchors       

def create_modules(blocks):
    """Generates an evaluation layer sequence from raw block configurations."""
    net_info = blocks[0]     
    module_list = nn.ModuleList()
    prev_filters = 3          # Starts at 3 for standard RGB processing
    output_filters = []
    
    for index, x in enumerate(blocks[1:]):
        module = nn.Sequential()
        
        if (x["type"] == "convolutional"):
            activation = x["activation"]
            try:
                batch_normalize = int(x["batch_normalize"])
                bias = False
            except:
                batch_normalize = 0
                bias = True
        
            filters = int(x["filters"])
            padding = int(x["pad"])
            kernel_size = int(x["size"])
            stride = int(x["stride"])
            pad = (kernel_size - 1) // 2 if padding else 0
        
            conv = nn.Conv2d(prev_filters, filters, kernel_size, stride, pad, bias=bias)
            module.add_module(f"conv_{index}", conv)
        
            if batch_normalize:
                bn = nn.BatchNorm2d(filters)
                module.add_module(f"batch_norm_{index}", bn)
        
            if activation == "leaky":
                activn = nn.LeakyReLU(0.1, inplace=True)
                module.add_module(f"leaky_{index}", activn)
        
        elif (x["type"] == "upsample"):
            upsample = nn.Upsample(scale_factor=2, mode="nearest")
            module.add_module(f"upsample_{index}", upsample)
                
        elif (x["type"] == "route"):
            x["layers"] = x["layers"].split(',')
            start = int(x["layers"][0])
            try:
                end = int(x["layers"][1])
            except:
                end = 0
                
            if start > 0: start = start - index
            if end > 0: end = end - index
            
            route = EmptyLayer()
            module.add_module(f"route_{index}", route)
            
            if end < 0:
                filters = output_filters[index + start] + output_filters[index + end]
            else:
                filters = output_filters[index + start]
    
        elif x["type"] == "shortcut":
            shortcut = EmptyLayer()
            module.add_module(f"shortcut_{index}", shortcut)
            
        elif x["type"] == "yolo":
            mask = [int(mask_idx) for mask_idx in x["mask"].split(",")]
            anchors = [int(a) for a in x["anchors"].split(",")]
            anchors = [(anchors[i], anchors[i+1]) for i in range(0, len(anchors), 2)]
            anchors = [anchors[i] for i in mask]
    
            detection = DetectionLayer(anchors)
            module.add_module(f"Detection_{index}", detection)
                             
        module_list.append(module)
        prev_filters = filters
        output_filters.append(filters)
        
    return (net_info, module_list)

3.1 Overriding Forward Computations (The Darknet Engine)

The main model orchestration loop processes the routing paths and handles the multi-scale outputs of YOLOv3.

class Darknet(nn.Module):
    def __init__(self, cfgfile):
        super(Darknet, self).__init__()
        self.blocks = parse_cfg(cfgfile)
        self.net_info, self.module_list = create_modules(self.blocks)
        
    def forward(self, x, CUDA=True):
        modules = self.blocks[1:]
        outputs = {}  # Caches layer feature maps for Route/Shortcut connections
        write = 0     # Accumulator flag for multi-scale YOLO detections
        
        for i, module in enumerate(self.module_list):        
            module_type = (modules[i]["type"])
            
            if module_type == "convolutional" or module_type == "upsample":
                x = module(x)
                
            elif module_type == "route":
                layers = [int(lyr) for lyr in modules[i]["layers"]]
                if layers[0] > 0: layers[0] = layers[0] - i
                
                if len(layers) == 1:
                    x = outputs[i + layers[0]]
                else:
                    if layers[1] > 0: layers[1] = layers[1] - i
                    map1 = outputs[i + layers[0]]
                    map2 = outputs[i + layers[1]]
                    x = torch.cat((map1, map2), 1)
                    
            elif module_type == "shortcut":
                from_layer = int(modules[i]["from"])
                x = outputs[i - 1] + outputs[i + from_layer]
                
            elif module_type == "yolo":
                anchors = self.module_list[i][0].anchors
                inp_dim = int(self.net_info["width"])
                num_classes = int(modules[i]["classes"])
                
                # Transform feature map into prediction vectors
                x = predict_transform(x, inp_dim, anchors, num_classes, CUDA)
                
                if not write:
                    output = x
                    write = 1
                else:
                    output = torch.cat((output, x), 1)
                    
            outputs[i] = x
            
        return output  # Returns concatenated multi-scale box detection predictions


4. Verification

if __name__ == "__main__":
    # Initialize network model structure
    model = Darknet("cfg/yolov3.cfg")
    print("Darknet modules successfully initiated.")
    
    # Target shape input expectation check
    # Dynamic dimension: Batch Size=1, Channels=3, Height=416, Width=416
    mock_batch = torch.randn(1, 3, 416, 416)
    
    # Disable VRAM acceleration if local device testing environment lacks CUDA
    device_has_cuda = torch.cuda.is_available()
    if device_has_cuda:
        model = model.cuda()
        mock_batch = mock_batch.cuda()
        
    with torch.no_grad():
        predictions = model(mock_batch, CUDA=device_has_cuda)
        
    print(f"Prediction output evaluation tensor shape: {predictions.shape}")
    # Expected output: torch.Size([1, 10647, 85]) 
    # (10647 total anchor box evaluation points for a 416x416 resolution input)

```