Old Money, New Algorithms: Can Machines Grade Rare Coins?

Author: Denis Avetisyan


A new study pits traditional feature engineering against deep learning to determine the best approach for automatically assessing the condition of valuable Saint-Gaudens Double Eagle coins.

Feature-engineered machine learning models demonstrate superior performance to deep learning for automated coin grading, particularly with limited data and imbalanced datasets.

Despite the prevailing assumption that deep learning universally outperforms traditional techniques, this study challenges that notion in the context of automated coin grading. ‘Feature Engineering vs. Deep Learning for Automated Coin Grading: A Comparative Study on Saint-Gaudens Double Eagles’ comparatively assesses a feature-engineered artificial neural network against convolutional and support vector machine approaches for classifying Saint-Gaudens Double Eagle gold coins. Our results demonstrate that, particularly with limited datasets and imbalanced classes, carefully designed features, derived from domain expertise, can significantly outperform ‘black box’ deep learning models in achieving accurate, specific grade predictions. Could this finding signal a broader reevaluation of feature engineering’s role in specialized computer vision tasks where data is scarce and nuanced knowledge is paramount?


The Illusion of Objectivity in Numismatics

For generations, determining the condition – and thus the value – of collectible coins has been the domain of experienced numismatists, a process inherently reliant on subjective judgment. This human element, however deep the expertise behind it, introduces unavoidable inconsistencies; differing opinions on subtle flaws or wear can lead to disparate grades for the same coin, creating bottlenecks in the buying, selling, and insurance processes. These inconsistencies not only affect the financial transactions surrounding collectibles but also hinder comprehensive cataloging and historical analysis of coin collections. The reliance on individual expertise also presents a scalability issue, limiting the speed and volume at which coins can be accurately assessed, particularly as the hobby gains popularity and the number of coins requiring evaluation increases.

The pursuit of automated coin grading represents a significant shift towards objectivity and efficiency in numismatic evaluation. Traditional methods, reliant on expert human assessment, are inherently susceptible to inconsistencies and create bottlenecks in processing large collections. Automated systems promise a consistent standard, applying algorithms and high-resolution imaging to meticulously analyze a coin’s surface, identifying even the most minute imperfections. This scalability is particularly valuable for auction houses and grading services handling vast quantities of coins, reducing turnaround times and costs while maintaining a verifiable record of assessment criteria. Ultimately, the goal is not to replace human expertise entirely, but to augment it with a technology capable of providing a reliable, repeatable, and readily accessible means of determining a coin’s condition and, consequently, its value.

Evaluating ‘Mint State’ coins, particularly those with intricate designs like the Saint-Gaudens Double Eagle, presents a considerable challenge for automated systems. These coins, never circulated, are graded on a remarkably subtle spectrum of imperfections – minute contact marks, hairlines invisible to the naked eye, and variations in luster. Accurate assessment requires capturing these details with resolutions far exceeding typical imaging standards, and algorithms capable of differentiating between natural minting characteristics and post-production damage. The high-relief designs characteristic of coins like the Saint-Gaudens exacerbate this difficulty, creating shadows and highlights that can obscure or mimic surface flaws. Consequently, developing automated grading methods for these premium coins demands not only advanced imaging technology but also sophisticated analytical techniques capable of discerning the most delicate indicators of condition, effectively replicating the expertise of a seasoned numismatist.

Current automated coin grading systems often fall short when discerning the delicate gradations within the Sheldon Scale, a standardized numerical scale from 1 to 70 used to assess a coin’s condition. These systems frequently struggle with subtleties like microscopic contact marks, faint hairlines, or the precise degree of luster retention – features that experienced numismatists instantly recognize but prove difficult for algorithms to quantify. While capable of broadly categorizing coins, existing approaches often misclassify coins within the crucial ‘Mint State’ range – particularly high-value specimens like the Saint-Gaudens Double Eagle – where even minor imperfections can drastically affect value. This limitation stems from the difficulty in replicating the human eye’s ability to integrate complex visual cues and contextualize them against established grading standards, hindering the development of truly accurate and reliable automated systems.

The Fragility of Explicit Knowledge

A feature-based approach to coin analysis involves the deliberate creation of quantifiable characteristics, or features, from raw coin imagery. These features are not inherent to the image data itself, but are computed through algorithms specifically designed to highlight aspects relevant to coin grading and authentication. Examples include measurements of coin diameter, circularity, and the area of detected features. The selection of these features is guided by numismatic expertise and the assumption that specific characteristics correlate with coin condition and authenticity. This manual feature engineering requires significant domain knowledge and often involves iterative refinement to optimize performance for a given classification or regression task.
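As a rough illustration of what such engineered features look like in practice, the geometric measurements mentioned above (area, diameter, circularity) can be computed from a segmented coin contour along these lines. The OpenCV-based pipeline, thresholds, and function names below are illustrative assumptions, not the paper's actual code.

```python
import cv2
import numpy as np

def shape_features(image_path: str) -> dict:
    """Illustrative geometric features (area, diameter, circularity) from a coin photo."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Otsu threshold to separate the coin from the background (assumes a plain backdrop).
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    coin = max(contours, key=cv2.contourArea)            # largest blob is taken as the coin
    area = cv2.contourArea(coin)
    perimeter = cv2.arcLength(coin, True)
    (_, _), radius = cv2.minEnclosingCircle(coin)
    circularity = 4 * np.pi * area / (perimeter ** 2)    # 1.0 for a perfect circle
    return {"area_px": area, "diameter_px": 2 * radius, "circularity": circularity}
```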

Sobel edge detection identifies boundaries within the coin image by calculating the gradient of image intensity, highlighting areas of significant change that correspond to features like lettering or wear marks. This process involves convolving the image with Sobel kernels in both horizontal and vertical directions, producing gradient magnitude and direction maps. Simultaneously, Gaussian blurring is applied as a preprocessing step to reduce noise and minor variations in image intensity, thereby improving the accuracy of edge detection. The standard deviation of the Gaussian kernel is a key parameter, controlling the degree of blurring; a larger standard deviation results in more significant smoothing. These techniques collectively extract quantifiable edge and texture information, providing numerical data representing the coin’s physical characteristics for subsequent analysis.
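A minimal sketch of that blur-then-gradient chain is shown below; the kernel size, sigma, and the scalar summaries at the end are illustrative choices, not the paper's settings.

```python
import cv2
import numpy as np

gray = cv2.imread("coin_obverse.jpg", cv2.IMREAD_GRAYSCALE)

# Gaussian blur first: sigmaX controls how aggressively noise and minor intensity
# variations are smoothed away before gradients are taken.
smoothed = cv2.GaussianBlur(gray, ksize=(5, 5), sigmaX=1.5)

# Sobel gradients in x and y, combined into magnitude and direction maps.
gx = cv2.Sobel(smoothed, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(smoothed, cv2.CV_64F, 0, 1, ksize=3)
magnitude = np.hypot(gx, gy)
direction = np.arctan2(gy, gx)

# Simple scalar summaries of edge content that could feed a downstream classifier.
edge_density = float((magnitude > magnitude.mean() + 2 * magnitude.std()).mean())
mean_gradient = float(magnitude.mean())
```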

Perceptually-Weighted Brightness Computation aims to quantify coin luster by approximating human visual perception. This is achieved by calculating brightness values not based on raw pixel intensity, but using a weighting function that accounts for the human eye’s increased sensitivity to certain wavelengths of light. Specifically, the computation utilizes a luminosity function – often based on the CIE 1931 color space – which assigns different weights to the red, green, and blue color channels to more closely align with perceived brightness. This weighted sum provides a more accurate representation of how a human would visually assess the reflective qualities, or luster, of a coin’s surface, compared to a simple average of pixel values.
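The weighting itself reduces to a one-line dot product. The sketch below uses the CIE 1931 / Rec. 709 luminance coefficients as a stand-in; the paper's exact weighting function is not reproduced here.

```python
import cv2
import numpy as np

def perceptual_brightness(image_bgr: np.ndarray) -> float:
    """Mean perceptually weighted brightness, used here as a crude luster proxy."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB).astype(np.float64) / 255.0
    weights = np.array([0.2126, 0.7152, 0.0722])  # eye is most sensitive to green
    luminance = rgb @ weights                     # per-pixel weighted sum over R, G, B
    return float(luminance.mean())

img = cv2.imread("coin_obverse.jpg")              # illustrative path
print(f"perceived brightness: {perceptual_brightness(img):.3f}")
```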

K-Means Clustering is employed to segment coin images based on color information derived from the HSV (Hue, Saturation, Value) color space. This technique groups pixels with similar color characteristics, allowing for the identification of areas exhibiting wear. By analyzing the distribution of these color clusters, the algorithm can differentiate between original coin surfaces and areas where metal has been removed through abrasion. The number of clusters, $k$, is a user-defined parameter influencing the granularity of the wear pattern analysis; a higher $k$ value can detect more subtle wear, while a lower value simplifies the analysis. This approach is effective because wear typically alters a coin’s surface color and reflectivity, which are represented as changes in the HSV values.
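A minimal sketch of the HSV clustering step, assuming scikit-learn's KMeans; the value of $k$ and the interpretation of the cluster fractions are illustrative rather than taken from the study.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

img = cv2.imread("coin_obverse.jpg")
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

k = 4  # higher k resolves subtler wear patterns; lower k gives a coarser segmentation
pixels = hsv.reshape(-1, 3).astype(np.float64)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pixels)

# Fraction of pixels assigned to each cluster: a shift toward low-saturation,
# high-value clusters can indicate rubbed high points where the surface has worn.
cluster_fractions = np.bincount(labels, minlength=k) / labels.size
segmented = labels.reshape(hsv.shape[:2])  # per-pixel cluster map for inspection
```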

Addressing the Inevitable Imbalance

Coin grading datasets frequently exhibit class imbalance, a condition where the number of samples representing each grade differs substantially. This disparity arises because certain coin grades are inherently rarer than others due to minting quantities, survival rates, and collecting patterns. For example, grades representing significant wear or damage will naturally occur more frequently than pristine, uncirculated examples. This uneven distribution poses challenges for machine learning algorithms, as models tend to be biased towards the majority classes and perform poorly on the under-represented grades. Consequently, techniques to address this imbalance are crucial for developing accurate and reliable coin grading systems.

The Synthetic Minority Oversampling Technique (SMOTE) addresses class imbalance in coin grading datasets by generating synthetic examples for under-represented grades. This is achieved by interpolating between existing minority class samples, creating new, plausible data points. Applying SMOTE prior to training an Artificial Neural Network (ANN) demonstrably improves performance on imbalanced datasets, effectively increasing the representation of rarer coin grades and reducing bias towards more frequent grades during model training. This technique allows the ANN to learn more robust decision boundaries, leading to improved generalization and accuracy in predicting less common coin grades.
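In practice this is a few lines with the imbalanced-learn library. The synthetic dataset below merely stands in for the real feature matrix and grade labels, and the class weights and k_neighbors value are illustrative assumptions.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Stand-in for the real feature matrix: three "grades" with a deliberately skewed distribution.
X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                           n_classes=3, weights=[0.8, 0.15, 0.05],
                           n_clusters_per_class=1, random_state=0)
print("before:", Counter(y))

# k_neighbors must stay below the rarest class's sample count; with genuinely rare
# grades this constraint matters.
X_res, y_res = SMOTE(k_neighbors=3, random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))
# The ANN is then trained on (X_res, y_res) rather than the raw, imbalanced split.
```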

In a limited-data scenario, the feature-based approach to coin grading achieved 98% accuracy when predicting within ±3 grades on the Sheldon scale. This represents a performance improvement over both the Convolutional Neural Network (CNN) and Support Vector Machine (SVM) models tested under the same conditions. Specifically, the feature-based Artificial Neural Network (ANN) attained an exact-grade accuracy of 31%, while the CNN and SVM models each reached approximately 30%. This indicates the feature-based methodology is more robust for coin grade prediction when training data is constrained.
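The ±3-grade figure is a tolerance accuracy rather than an exact-match rate; a toy computation of both metrics, on made-up grades, looks like this:

```python
import numpy as np

def tolerance_accuracy(y_true: np.ndarray, y_pred: np.ndarray, tol: int = 3) -> float:
    """Fraction of predictions within +/- tol points on the Sheldon scale."""
    return float((np.abs(y_true - y_pred) <= tol).mean())

# Toy example: exact matches are rare, yet most predictions land within 3 grades.
y_true = np.array([63, 64, 65, 66, 65, 62, 64])
y_pred = np.array([64, 64, 63, 69, 66, 61, 65])
print(tolerance_accuracy(y_true, y_pred, tol=0))  # exact-grade accuracy
print(tolerance_accuracy(y_true, y_pred, tol=3))  # accuracy within +/- 3 grades
```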

Viewed purely on exact matches, the Artificial Neural Network (ANN) predicted the precise grade 31% of the time, compared with roughly 30% for both the Convolutional Neural Network (CNN) and Support Vector Machine (SVM) models. This is a marginal but measurable advantage for the feature-based ANN within the tested dataset and evaluation parameters, obtained in the same limited-data scenario described above.

The System Is Grown, Not Built

The hybrid Convolutional Neural Network (CNN) approach represents a significant advancement in automated coin grading by strategically merging the strengths of both automated feature learning and meticulously engineered features. Traditionally, coin grading relied heavily on human expertise to identify and quantify specific characteristics – a process inherently limited by subjectivity and scalability. Hybrid CNNs overcome this limitation by allowing the network to independently learn relevant patterns directly from coin imagery, while simultaneously incorporating features explicitly designed by experts to highlight critical grading indicators. This synergy allows the model to capture nuances often missed by purely automated systems and surpass the precision achievable through manual feature extraction alone, resulting in a robust and highly accurate grading solution.

The architecture leverages EfficientNetV2 as a foundational convolutional neural network, a design choice predicated on its demonstrated ability to achieve state-of-the-art performance with improved efficiency. Crucially, the training process incorporates Batch Normalization, a technique that stabilizes learning by reducing internal covariate shift – effectively normalizing the activations of each layer. This normalization allows for higher learning rates and faster convergence, preventing the network from getting stuck in suboptimal configurations. By consistently re-centering and re-scaling activations, Batch Normalization not only accelerates training but also often improves the generalization capability of the model, leading to better performance on unseen coin images and a more robust classification system.
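A minimal sketch of such a hybrid model in TensorFlow/Keras is given below, with an EfficientNetV2 backbone for learned image features and Batch Normalization in the fused head. The layer widths, grade count, and engineered-feature dimensionality are assumptions, not the paper's configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_GRADES = 11          # e.g. MS60-MS70; illustrative assumption
NUM_ENGINEERED = 16      # length of the hand-crafted feature vector; assumption

# Image branch: pretrained EfficientNetV2 backbone with global average pooling.
image_in = layers.Input(shape=(224, 224, 3), name="coin_image")
backbone = tf.keras.applications.EfficientNetV2B0(
    include_top=False, weights="imagenet", pooling="avg")
x = backbone(image_in)

# Engineered-feature branch: a small dense block over the hand-crafted features.
features_in = layers.Input(shape=(NUM_ENGINEERED,), name="engineered_features")
f = layers.Dense(32, activation="relu")(features_in)
f = layers.BatchNormalization()(f)   # stabilizes training of the feature branch

# Fuse both branches and classify into discrete grades.
merged = layers.Concatenate()([x, f])
merged = layers.Dense(128, activation="relu")(merged)
merged = layers.BatchNormalization()(merged)
output = layers.Dense(NUM_GRADES, activation="softmax")(merged)

model = Model(inputs=[image_in, features_in], outputs=output)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```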

The integration of hybrid Convolutional Neural Networks represents a significant advancement in automated coin analysis by enabling the direct extraction of intricate patterns from coin imagery. Traditionally, identifying and classifying coins relied heavily on painstakingly designed, manual feature engineering – a process limited by human perception and requiring substantial domain expertise. However, this hybrid approach bypasses these constraints, allowing the model to autonomously discover subtle textures, edge characteristics, and overall visual cues indicative of a coin’s authenticity, grade, or origin. By learning directly from the data, the network can identify correlations and nuances that might be overlooked by human analysts or prove difficult to explicitly program, ultimately leading to more accurate and robust coin classification systems capable of handling the vast diversity within numismatic collections.

The efficacy of these hybrid convolutional neural networks is significantly rooted in the extensive dataset curated by David Lawrence Rare Coins (DLRC), providing the necessary volume and variety for robust model training and validation. Performance benchmarks reveal a clear distinction in processing speed between architectures: the Artificial Neural Network (ANN) model achieves an inference time of 1.8 seconds, positioning it as a viable option for interactive web-service applications, where a response within a couple of seconds is acceptable. The Convolutional Neural Network (CNN), by contrast, demonstrates a markedly faster inference time of just 80 milliseconds, making it exceptionally well-suited for high-throughput batch processing of coin images, where maximizing efficiency and speed is paramount.

The pursuit of automated coin grading, as demonstrated in this study, reveals a fundamental truth about systems: elegance doesn’t guarantee resilience. While deep learning promises a path toward generalized solutions, its hunger for data often proves insatiable, especially when faced with the realities of class imbalance. As Robert Tarjan once observed, “The most effective algorithms are often the simplest.” This sentiment resonates deeply; the feature-engineered models, though less glamorous, exhibit a robust performance precisely because they operate within the constraints of limited data, prioritizing interpretable signals over abstract representations. Long stability, in this context, isn’t merely uptime, but a sign of a system thoughtfully aligned with its inherent limitations.

The Grain of the Metal

The preference demonstrated for carefully constructed features, over the seductive allure of end-to-end learning, feels less like a victory and more like a familiar accounting. Every architectural choice is a prophecy of future failure, yet the study confirms a persistent truth: systems reveal their limitations most clearly when starved of data. The elegance of deep learning rests on abundance; these coins, like all relics, arrive in frustratingly finite numbers. The imbalances within those numbers, the scarcity of pristine examples, speak to a deeper challenge than algorithmic preference.

The question isn’t simply how to classify, but what classification truly means when the boundaries blur with wear and subjective judgement. The Sheldon scale, itself a human construct, imposes order on a continuum. Future work will likely find itself less concerned with maximizing accuracy, and more occupied with understanding – and explicitly modeling – the inherent uncertainties in the grading process. The system doesn’t simply learn the grade; it negotiates a consensus.

One anticipates a shift toward methods that actively incorporate expert knowledge, not as a means of pre-labeling, but as a guiding principle for feature design and model interpretation. The goal isn’t to replace the numismatist, but to create a system that reflects the complex, nuanced reasoning of a practiced eye. It’s a humbling reminder that every refactor begins as a prayer and ends in repentance.


Original article: https://arxiv.org/pdf/2512.04464.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
