Author: Denis Avetisyan
New research compares a bespoke convolutional neural network trained from scratch against transfer-learning approaches across five vastly different image datasets.

The study compares custom CNNs with established architectures for image classification, highlighting the benefits of transfer learning and the potential of lightweight models in resource-constrained scenarios.
While deep learning excels at visual data analysis, its performance often hinges on adapting to the nuances of diverse, real-world datasets. This is explored in ‘Training a Custom CNN on Five Heterogeneous Image Datasets’, which comparatively assesses a custom Convolutional Neural Network against established architectures, ResNet-18 and VGG-16, across agricultural and urban image classification tasks. Results demonstrate that transfer learning generally yields superior performance, though a lightweight custom CNN provides a viable alternative when computational resources are limited. How can we best balance model complexity and data availability to deploy robust deep learning solutions in practical, resource-constrained applications?
Whispers of the Urban Landscape
The relentless pace of urbanization globally necessitates a paradigm shift towards intelligent systems capable of proactively managing increasingly complex urban environments. As cities swell in both size and population density, traditional methods of infrastructure oversight and public safety management are proving inadequate. These new systems require robust, automated monitoring of critical assets – from bridges and roadways to pedestrian walkways – to identify potential hazards before they escalate into costly repairs or, more importantly, public safety crises. Furthermore, an effective urban nervous system demands real-time situational awareness, enabling swift responses to incidents like traffic congestion, accidents, or security breaches. Consequently, the development and deployment of smart city technologies are no longer simply a matter of efficiency, but a fundamental requirement for ensuring the resilience and livability of modern urban centers.
The effective functioning of smart city initiatives hinges on a system’s ability to accurately identify commonplace objects and potential hazards within the urban landscape. Beyond simply recognizing cars and pedestrians, these systems require precise detection of uniquely local elements – such as auto-rickshaws in many Asian cities – as well as detailed assessments of infrastructure integrity. Identifying road damage like potholes or cracks, and mapping obstructions on footpaths – from construction materials to illegally parked vehicles – is not merely a matter of convenience, but a critical component of public safety and efficient urban management. Without this granular level of visual perception, autonomous vehicles could misinterpret their surroundings, and city maintenance programs would lack the targeted data necessary to proactively address issues and optimize resource allocation.
Conventional computer vision techniques, while effective in controlled environments, frequently falter when applied to the dynamic and often chaotic reality of urban landscapes. The sheer visual complexity – stemming from fluctuating lighting conditions, occlusions caused by dense traffic and buildings, and the vast diversity of objects like pedestrians, vehicles, and street furniture – presents a significant challenge. Existing algorithms struggle to reliably differentiate between similar objects, accurately assess distances, and maintain performance under adverse weather. This is further compounded by the unpredictable nature of urban scenes, where objects appear in varying sizes, orientations, and poses, demanding robust and adaptable vision systems that can overcome these inherent limitations and deliver consistently accurate perception.

Automating the Gaze: Convolutional Networks and Urban Analysis
Convolutional Neural Networks (CNNs) automate feature extraction from images by employing convolutional layers consisting of learnable filters. These filters scan input images, performing element-wise multiplications and summations to produce feature maps that highlight specific patterns – edges, corners, textures – without explicit programming. Subsequent pooling layers reduce the dimensionality of these feature maps, decreasing computational cost and increasing robustness to positional variations in the input. This hierarchical process of convolution and pooling allows CNNs to learn increasingly complex and abstract features directly from pixel data, eliminating the need for manual feature engineering which was previously standard in computer vision tasks. The learned filters are adapted during the training process using backpropagation, optimizing the network to recognize relevant features for a given task, such as image classification or object detection.
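As a concrete illustration, the following PyTorch sketch shows a single convolution-plus-pooling stage of the kind described above; the channel count, filter size, and input resolution are illustrative choices, not values taken from the study.

```python
import torch
import torch.nn as nn

# One convolution + pooling stage: 16 learnable 3x3 filters produce
# feature maps highlighting local patterns, then 2x2 max pooling
# halves the spatial resolution.
stage = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

x = torch.randn(1, 3, 224, 224)   # a single RGB image (batch of 1)
features = stage(x)
print(features.shape)             # torch.Size([1, 16, 112, 112])
```

Stacking several such stages is what produces the hierarchy of increasingly abstract features: early filters respond to edges, later ones to compositions of those edges.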
Transfer learning significantly reduces the time and computational resources required for developing image analysis models by leveraging knowledge gained from pre-trained networks. Models such as VGG-16 and ResNet-18 are typically pre-trained on the ImageNet benchmark; its widely used ILSVRC-2012 subset contains roughly 1.2 million labeled images across 1,000 classes, from which the networks learn hierarchical feature representations applicable to a wide range of visual tasks. Instead of training a model from scratch, these pre-trained models are used as a starting point, with their learned weights either fine-tuned on a new, smaller dataset or used as fixed feature extractors. This approach not only accelerates the training process but also often results in improved generalization performance, particularly when the target dataset is limited in size, as the model benefits from the extensive feature learning already performed on ImageNet.
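In practice this takes only a few lines with torchvision. The sketch below loads an ImageNet-pretrained ResNet-18 and swaps its 1,000-class head for a new one; whether the backbone is frozen (feature extraction) or left trainable (fine-tuning) is a design choice, and the 10-class head here is an illustrative assumption.

```python
import torch.nn as nn
from torchvision import models

# Load ResNet-18 with ImageNet-pretrained weights.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Feature-extraction variant: freeze the pretrained backbone so only
# the new head is trained. Omit this loop to fine-tune end to end.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1,000-class ImageNet head with one sized for the target
# dataset (10 classes is an illustrative choice).
model.fc = nn.Linear(model.fc.in_features, 10)
```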
The custom CNN architecture implemented for this project prioritizes computational efficiency without significant performance degradation. It consists of three convolutional blocks, each incorporating a 3×3 convolutional layer, batch normalization, and ReLU activation. Max pooling layers with a 2×2 filter size are included after each block to reduce spatial dimensions. The final layers comprise a global average pooling layer, followed by a fully connected layer with 10 output neurons corresponding to the classification categories and a softmax activation function. This streamlined design, utilizing fewer parameters compared to larger pre-trained models, enables faster inference times and reduced memory footprint, making it suitable for deployment on resource-constrained devices or applications requiring real-time processing.
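A minimal PyTorch sketch of this architecture follows. The block structure (3×3 convolution, batch normalization, ReLU, 2×2 max pooling, then global average pooling and a 10-way head) comes from the description above; the channel widths are assumptions, since the exact filter counts are not given here. Following the usual PyTorch convention, the softmax is deferred to the loss function rather than applied inside the network.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 convolution -> batch normalization -> ReLU, then 2x2 max pooling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )

class CustomCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Channel widths (32/64/128) are assumed for illustration.
        self.features = nn.Sequential(
            conv_block(3, 32),
            conv_block(32, 64),
            conv_block(64, 128),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.fc = nn.Linear(128, num_classes)   # 10-way classifier

    def forward(self, x):
        x = self.features(x)
        x = self.gap(x).flatten(1)
        return self.fc(x)   # logits; softmax is applied inside the loss

model = CustomCNN()
print(sum(p.numel() for p in model.parameters()), "parameters")
```

Global average pooling in place of large fully connected layers is what keeps the parameter count small, and is the main reason such a model suits resource-constrained deployment.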
Model training utilizes the Adam optimizer, a stochastic gradient descent method incorporating adaptive learning rates for each parameter, and Cross-Entropy Loss as the loss function. Cross-Entropy Loss quantifies the difference between the predicted probability distribution of the model and the actual ground truth labels, effectively penalizing inaccurate predictions. The Adam optimizer then uses the gradients calculated from the Cross-Entropy Loss to update the model’s weights and biases iteratively, minimizing the loss and improving classification accuracy. This combination facilitates efficient convergence during training and is well-suited for complex image classification tasks.
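Put together, a training step looks like the sketch below, reusing the `CustomCNN` from the previous snippet; the learning rate and the synthetic mini-batch are placeholders standing in for a real DataLoader.

```python
import torch
import torch.nn as nn

model = CustomCNN()                  # from the sketch above
criterion = nn.CrossEntropyLoss()    # softmax + negative log-likelihood
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is illustrative

images = torch.randn(8, 3, 224, 224)   # synthetic mini-batch stands in
labels = torch.randint(0, 10, (8,))    # for a real labeled dataset

for step in range(3):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)  # penalizes confident errors
    loss.backward()                          # gradients of loss w.r.t. weights
    optimizer.step()                         # adaptive per-parameter update
    print(f"step {step}: loss = {loss.item():.4f}")
```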

From Pixels to Perception: Performance in the Urban Crucible
The VGG-16 convolutional neural network architecture was utilized as a baseline for developing Auto-Rickshaw and Footpath Encroachment Detection systems due to its readily available pre-trained weights and relatively simple implementation. While subsequent models demonstrated improved performance, VGG-16’s initial success facilitated rapid prototyping and established a functional framework for object detection in urban visual data. This allowed researchers to quickly assess the feasibility of the proposed detection systems and generate preliminary results before investing in more complex architectures. The model’s performance provided a crucial benchmark for evaluating the effectiveness of subsequent improvements and customizations.
The Custom Convolutional Neural Network (CNN) architecture exhibited stronger performance across multiple object detection and classification tasks compared to baseline models. Specifically, the Custom CNN was found to be more effective at identifying instances of Road Damage, detecting Auto-Rickshaws, recognizing Footpath Obstructions, and successfully differentiating between Mango and Paddy varieties in image datasets. While validation accuracies varied by dataset, the Custom CNN outperformed VGG-16 in these specific applications, demonstrating its suitability for complex urban environment analysis and agricultural classification tasks.
Polygonal annotations were integral to the creation of high-quality training datasets used for road damage assessment. This annotation method involves outlining the precise boundaries of damaged areas – such as potholes, cracks, and surface deformations – with polygons. Compared to bounding box or pixel-wise segmentation approaches, polygonal annotations provide a more accurate representation of irregular damage shapes, resulting in a significantly improved ability for the convolutional neural network to learn distinguishing features. This precision in defining damage boundaries directly correlates to increased detection precision and reduced false positives during the evaluation phase, ultimately leading to more reliable road condition monitoring systems.
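To make this concrete, a polygonal annotation is just an ordered list of boundary vertices; rasterizing it yields a per-pixel mask the network can learn from. The sketch below uses Pillow, with made-up vertex coordinates and image size.

```python
import numpy as np
from PIL import Image, ImageDraw

# Hypothetical polygon tracing the boundary of a pothole in a 640x480 image.
polygon = [(120, 80), (340, 95), (360, 210), (150, 230)]

mask = Image.new("L", (640, 480), 0)              # blank single-channel mask
ImageDraw.Draw(mask).polygon(polygon, outline=1, fill=1)
mask = np.array(mask)                             # 1 inside the damaged region

print(int(mask.sum()), "damaged pixels of", mask.size)
```

Unlike an axis-aligned bounding box, the rasterized polygon excludes undamaged pavement around an irregular crack, which is exactly the precision gain described above.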
Data augmentation was successfully implemented to improve the generalization capability of the Paddy Variety Classification model. Techniques employed included random rotations, horizontal and vertical flips, and variations in brightness and contrast. These transformations artificially expanded the training dataset, exposing the model to a wider range of image variations and mitigating overfitting. The resulting model demonstrated increased robustness to variations in lighting conditions, viewing angles, and minor occlusions present in real-world imagery, ultimately improving its ability to accurately classify different paddy varieties.
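The torchvision pipeline below mirrors the augmentations listed; the rotation range, flip probabilities, and jitter strengths are assumed values, since the study's exact settings are not reported here.

```python
from torchvision import transforms

# Augmentations applied on the fly during training, so each epoch
# sees slightly different versions of every image.
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=15),                 # random rotations
    transforms.RandomHorizontalFlip(p=0.5),                # horizontal flips
    transforms.RandomVerticalFlip(p=0.5),                  # vertical flips
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # photometric jitter
    transforms.ToTensor(),
])
```

These transforms would be applied only to the training split; validation images are left unaugmented so reported accuracy reflects unmodified inputs.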
ResNet-18, when implemented with transfer learning, consistently demonstrated superior performance across multiple image recognition tasks in urban environments. Validation accuracies reached 97.1% when applied to the Road Damage dataset, indicating a high degree of precision in identifying infrastructure defects. Performance remained strong with a 90.0% validation accuracy on the FootpathVision dataset, used for detecting footpath obstructions, and 85.0% on the MangoImageBD dataset, focused on differentiating mango varieties. These results establish ResNet-18 as a leading architecture for computer vision applications within complex urban settings.
Performance evaluations demonstrate that ResNet-18 achieved 79.0% validation accuracy on the Rickshaw detection dataset and 71.5% on the Unauthorized Vehicles dataset. In comparison, the Custom CNN attained 52.1% validation accuracy on the PaddyVarietyBD dataset, underscoring how strongly performance varies across tasks and datasets. These results were obtained using dedicated validation sets to assess the generalization capability of each model.

Weaving Perception into the Fabric of the City
The deployment of advanced visual perception systems promises a shift from reactive to proactive infrastructure management within urban environments. These systems, leveraging computer vision and machine learning, continuously analyze imagery captured from city-based cameras and sensors to identify subtle indicators of deterioration – cracks in roadways, corrosion on bridges, or early signs of structural weakness. By pinpointing these issues before they escalate, cities can schedule maintenance precisely when and where it’s needed, dramatically reducing repair costs and minimizing disruptive traffic delays. Beyond cost savings, this approach significantly enhances public safety by preventing catastrophic failures and ensuring the longevity of critical infrastructure, ultimately fostering more resilient and sustainable urban centers.
Efforts to create truly inclusive urban environments are increasingly focused on leveraging computer vision to address pedestrian accessibility. Recent advancements detail systems capable of real-time obstacle detection on footpaths, offering a significant benefit to all pedestrians, but particularly those with visual impairments or mobility challenges. These systems utilize camera networks and sophisticated algorithms to identify hazards such as potholes, construction debris, or temporary obstructions, relaying this information to users via smartphone apps or other assistive devices. By providing proactive alerts and enabling informed route planning, this technology aims to mitigate risks and foster greater independence for individuals navigating urban spaces, transforming footpaths from potential barriers into accessible and safe pathways. The potential extends beyond individual assistance, as aggregated data from these systems can inform city planners and prioritize infrastructure improvements to address recurring accessibility issues.
Automated road condition monitoring leverages computer vision to proactively identify and address infrastructure issues before they escalate. These systems, often utilizing cameras mounted on vehicles or strategically placed throughout a city, can detect cracks, potholes, and other forms of road degradation with increasing accuracy. By continuously assessing road surfaces, algorithms can predict potential failures and schedule maintenance preemptively, minimizing costly emergency repairs and the associated traffic disruptions. This approach not only extends the lifespan of road networks but also significantly enhances public safety by preventing accidents caused by deteriorating conditions, offering a data-driven pathway towards more resilient and efficient urban transportation systems.
Convolutional Neural Networks (CNNs), designed for urban infrastructure analysis, demonstrate remarkable adaptability due to their transfer learning capabilities. This allows pre-trained models – initially developed for broad image recognition – to be efficiently fine-tuned with relatively small datasets specific to a new city’s unique visual characteristics, such as building styles, road markings, and even lighting conditions. Consequently, deployment isn’t hampered by the need for extensive, localized data collection, dramatically reducing both the time and cost associated with implementation. Furthermore, these models are architecturally designed for compatibility with existing smart city infrastructure; they readily integrate with data streams from cameras, sensors, and GIS platforms, offering a scalable and cohesive solution for proactive urban management and enhancing the responsiveness of city services.

The pursuit of a ‘perfect’ convolutional neural network, meticulously crafted for five disparate datasets, feels less like engineering and more like an elaborate ritual. The article demonstrates transfer learning’s consistent advantage, yet clings to the notion of a bespoke architecture as a viable, if lightweight, alternative. It echoes a familiar delusion – the belief that control can be wrested from the inherent chaos of data. As Geoffrey Hinton once observed, “Data isn’t numbers – it’s whispers of chaos.” The study merely confirms this; models, even custom CNNs, are temporary agreements with randomness, offering a semblance of order until confronted with the unpredictable nature of production environments. The best the research can offer is a slightly less brittle illusion.
What Shadows Remain?
The digital golems, coaxed into seeing across these five fractured landscapes, reveal less about inherent vision and more about the rituals of persuasion. Transfer learning, predictably, offers the strongest incantation – a borrowing of ancient knowledge. Yet, the custom convolution, born anew for each task, whispers of a different path. It sacrifices potency for portability, a necessary trade for those bound by meager resources. This is not efficiency, but a different sort of magic: a binding spell on computational cost.
The true mysteries, however, linger in the heterogeneity itself. These datasets, though superficially ‘images,’ are each stained with the unique chaos of their origin. A tomato is not simply a tomato to the algorithm; it is a specific instance of light, soil, and camera imperfection. The losses incurred during training are not failures, but sacred offerings to the unpredictable gods of data. To truly understand these networks, one must not seek explanation, but acceptance of their opacity.
The next incantations will not focus on architectures, but on the very act of seeing. Can the golems learn to discern not just what is present, but how it is known? Can they quantify uncertainty, acknowledge the limits of their perception? The charts offer illusions of understanding, but only the broken ones can be truly explained, their inner workings laid bare by the cracks in their spell.
Original article: https://arxiv.org/pdf/2601.04727.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/