AGVBench: A Reliability-Oriented Benchmark of Data Augmentation for Vein Recognition

Paper Summary

Data augmentation is essential when annotated vein images are limited, but strategies designed for natural images can distort vascular topology and high-frequency identity cues. AGVBench provides a standardized benchmark to measure whether an augmentation improves recognition while preserving deployment-oriented reliability.

30 augmentation strategies across single-image, multi-image, and label-enhancement families

5 public palm-vein and finger-vein datasets

7 CNN, transformer, and vein-specific backbone architectures

6 evaluation dimensions, from accuracy to adversarial robustness

Motivation

Biometric systems are evaluated by more than classification accuracy. A usable vein recognition system must maintain low equal error rate, high true acceptance under strict false acceptance constraints, calibrated confidence, and robustness under sensor noise, occlusion, corruption, and adversarial perturbation.

Topology sensitivity

Vein identity depends on vascular geometry. Spatial transforms that are harmless for natural images can break discriminative vessel structures.

Reliability gap

Augmentations that maximize Top-1 accuracy may still produce overconfident scores or weak adversarial behavior.

Reproducibility

AGVBench aligns datasets, backbones, metrics, and augmentation implementations to make method comparisons repeatable.

Benchmark Design

The benchmark standardizes data splits, preprocessing, training recipes, augmentation modules, and biometric metrics so that augmentation methods can be compared under the same protocol.

Datasets

AGVBench covers SCUT1100, TJU600, VERA220, FV-USM, and SDUMLA-HMT, spanning unconstrained palm-vein imagery, two-session palm-vein acquisition, temporal finger-vein variation, and multi-finger recognition.


Pipeline

The experimental pipeline resizes all images to 224 x 224, trains models from scratch, applies augmentation in a unified training interface, and evaluates both closed-set recognition and biometric verification.


Backbones

The evaluation uses the general-purpose ResNet18, MobileNetV2, ViT-S, and Swin-T, plus the vein-specific FVRASNet, AMPVNet, and StarLKNet-S, all trained from scratch at 224 x 224 resolution.


Augmentations

The method suite includes geometric and photometric transforms, occlusion and quantization operators, policy-based augmentation, MixUp-style mixing, CutMix-style regional mixing, and label regularization.
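To make the multi-image and label-enhancement families concrete, the following is a minimal sketch of MixUp-style mixing and label smoothing in plain Python. It is an illustration under stated assumptions, not the AGVBench implementation: the function names, the nested-list image format, and the defaults `alpha=1.0` and `epsilon=0.1` are all hypothetical choices made for this example.

```python
import random


def one_hot(label, num_classes):
    """Return a one-hot target vector for an integer class label."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def smooth_labels(target, epsilon=0.1):
    """Label smoothing: move epsilon of the probability mass to a uniform prior."""
    num_classes = len(target)
    return [(1.0 - epsilon) * t + epsilon / num_classes for t in target]


def mixup(image_a, target_a, image_b, target_b, alpha=1.0, rng=random):
    """MixUp: blend two images and their targets with one Beta-distributed weight.

    Images are nested lists of pixel values with identical shapes (an
    assumption of this sketch; a real pipeline would operate on tensors).
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mixed_image = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    mixed_target = [
        lam * ta + (1.0 - lam) * tb for ta, tb in zip(target_a, target_b)
    ]
    return mixed_image, mixed_target, lam
```

Because MixUp blends pixels globally instead of moving them, vessel positions in both source images are left where they are, which is one way to read its contrast with spatial transforms. CutMix-style methods would instead paste a region of one image into the other and weight the targets by the pasted area.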


Dataset Summary

The benchmark combines small and large datasets because augmentation behavior changes substantially with data scale, acquisition protocol, and anatomical modality.

Dataset	Modality	Subjects	Total Images	Collection Characteristics
SCUT1100	Palm	550	11,000	Unconstrained dynamic acquisition with out-of-plane rotations and grayscale variations.
TJU600	Palm	300	12,000	Two-session collection with variations in posture, positioning, and illumination.
VERA220	Palm	110	2,200	Open-environment acquisition with minor pose variation and ambient-light sensitivity.
FV-USM	Finger	123	1,476	Two-session finger-vein set for temporal intra-class robustness evaluation.
SDUMLA-HMT	Finger	106	3,816	Multi-finger samples with placement and orientation variability.

Experiments

All experiments are implemented in AGVBench with PyTorch and MMCV. Images are resized to 224 x 224, models are trained from scratch without external pretraining, and accuracy is reported using the median of the last 10 epochs. The main biometric metrics are Top-1 Accuracy, EER, and TAR@FAR=0.0001, denoted below as TPR@FPR=1e-4.
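The reporting rule can be sketched in a few lines. The helper name `reported_accuracy` and the list-of-floats input are assumptions made for illustration; the source only states that the median over the last 10 epochs is reported.

```python
from statistics import median


def reported_accuracy(per_epoch_accuracy, window=10):
    """Median test accuracy over the final `window` epochs of a run."""
    if len(per_epoch_accuracy) < window:
        raise ValueError("need at least `window` epochs of accuracy values")
    return median(per_epoch_accuracy[-window:])
```

Taking the median of the final epochs, as opposed to the single best epoch, makes the reported number less sensitive to one lucky evaluation.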

Setup

Single NVIDIA A100 GPU; unified image resolution; dataset-specific official or session-based train/test splits.

Metrics

Top-1 Accuracy measures recognition; EER and TPR@FPR=1e-4 measure strict biometric verification.
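As a rough illustration of the two verification metrics, the sketch below computes EER and TAR at a fixed FAR from lists of genuine and impostor similarity scores. It assumes that higher scores mean more similar and that a pair is accepted when its score is at or above the threshold. The function names and the exhaustive threshold sweep are choices made for this example, not the benchmark's evaluation code.

```python
def far_frr(genuine, impostor, threshold):
    """False accept rate and false reject rate at one decision threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def candidate_thresholds(genuine, impostor):
    """Every observed score, plus one threshold that rejects all pairs."""
    return sorted(set(genuine) | set(impostor)) + [float("inf")]


def equal_error_rate(genuine, impostor):
    """EER: the error rate at the threshold where FAR and FRR are closest."""
    rates = [far_frr(genuine, impostor, t)
             for t in candidate_thresholds(genuine, impostor)]
    far, frr = min(rates, key=lambda pair: abs(pair[0] - pair[1]))
    return (far + frr) / 2.0


def tar_at_far(genuine, impostor, max_far=1e-4):
    """Highest true accept rate among thresholds whose FAR is <= max_far."""
    best_tar = 0.0
    for t in candidate_thresholds(genuine, impostor):
        far, frr = far_frr(genuine, impostor, t)
        if far <= max_far:
            best_tar = max(best_tar, 1.0 - frr)
    return best_tar
```

With N impostor pairs the FAR can only change in steps of 1/N, so resolving FAR=1e-4 requires at least 10,000 impostor comparisons. This is one reason the strict operating point can still separate methods on datasets where Top-1 accuracy is saturated.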

Reliability Tests

Calibration, corruption, FGSM/PGD adversarial attacks, and occlusion classification are evaluated beyond clean accuracy.

Full Accuracy, EER And TPR@FPR Results

Each table includes all reported methods and all backbone-specific Accuracy, EER, and TPR@FPR=1e-4 results from the paper tables.

ROC Curves

Figure: ROC curves for the five evaluated vein datasets (panel shown: VERA220).

Composed Augmentation Results

Composition studies use ResNet18 on VERA220 and TJU600. All values are percentages; TPR@FPR is reported at FPR=1e-4, and lower EER is better.

Method	VERA Acc.	VERA EER	VERA TPR@FPR	TJU Acc.	TJU EER	TJU TPR@FPR
Vanilla	71.45	5.20	51.00	85.55	1.72	81.23
AutoAugment	80.82	2.55	65.09	88.28	1.59	85.23
AutoAugment + LabelSmoothing	89.73	2.44	78.64	94.97	0.77	93.83
MixUp	95.27	0.91	92.27	93.90	0.84	92.51
MixUp + LabelSmoothing	97.18	0.63	96.36	96.37	0.51	95.33
PuzzleMix	95.55	0.83	93.36	95.25	0.46	94.45
PuzzleMix + LabelSmoothing	97.27	0.65	96.09	96.58	0.38	96.05
AutoAugment + PuzzleMix + LabelSmoothing	98.00	0.56	95.27	96.50	0.45	96.12

Calibration Results

The figure shows reliability diagrams on VERA220 with ResNet18. The table below reports complete ECE scores across VERA220, TJU600, and SCUT1100. Lower ECE is better.
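For reference, a common way to compute ECE is to bin predictions by confidence and average the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below follows that definition with equal-width bins. The bin count of 10 and the function signature are assumptions for illustration, since the source does not state the binning used.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    confidences: top-class probability for each prediction.
    correct:     1 (or True) if that prediction was right, else 0.
    """
    total = len(confidences)
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    count = [0] * num_bins

    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += float(hit)
        count[idx] += 1

    ece = 0.0
    for idx in range(num_bins):
        if count[idx] == 0:
            continue  # empty bins contribute nothing
        accuracy = hit_sum[idx] / count[idx]
        mean_conf = conf_sum[idx] / count[idx]
        ece += (count[idx] / total) * abs(accuracy - mean_conf)
    return ece
```

A reliability diagram plots the same per-bin accuracy against per-bin confidence, so a model whose bars sit below the diagonal is overconfident and will show a correspondingly higher ECE.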

Corruption Robustness

Results are reported on VERA220-C and TJU600-C. Each table reports C1/C2/C3 corruption accuracy for all methods and backbones.

Adversarial Attack Robustness

Results are reported on TJU600 and SCUT1100. Each table reports Top-1 accuracy under FGSM and PGD attacks for all methods and backbones.
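To show what an FGSM perturbation does, the sketch below applies it to a toy logistic-regression classifier whose input gradient can be written by hand. This is only an illustration of the attack rule x_adv = x + epsilon * sign(dL/dx). The toy model, the [0, 1] pixel range, and all names are assumptions, and the benchmark itself attacks deep backbones and would rely on automatic differentiation.

```python
import math


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def logit(weights, bias, x):
    """Linear score of the toy classifier; class 1 is predicted when > 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def loss_gradient_wrt_input(weights, bias, x, y):
    """Gradient of binary cross-entropy w.r.t. the input, for y in {0, 1}.

    With p = sigmoid(logit), dL/dx = (p - y) * w.
    """
    p = 1.0 / (1.0 + math.exp(-logit(weights, bias, x)))
    return [(p - y) * w for w in weights]


def fgsm(weights, bias, x, y, epsilon):
    """One FGSM step: move each input value by epsilon along the gradient sign,
    then clip back to the valid [0, 1] range."""
    grad = loss_gradient_wrt_input(weights, bias, x, y)
    return [min(1.0, max(0.0, xi + epsilon * sign(g)))
            for xi, g in zip(x, grad)]
```

PGD repeats this step several times with a smaller step size and projects the result back into the epsilon-ball around the original input after each step, which generally makes it the stronger of the two attacks.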

Occlusion Robustness

The occlusion experiment randomly masks square regions with ratios from 0% to 50% on VERA220 and TJU600 using ResNet18. Cutting-based methods such as Cutout, CutMix, and PuzzleMix maintain more stable performance under spatial information loss, while many other augmentations degrade quickly.
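The masking step can be sketched as follows. The source does not say whether the ratio refers to area or side length, so this example assumes it is the fraction of image area covered by a single square. The zero fill value, the nested-list image format, and the function name are likewise assumptions for illustration.

```python
import math
import random


def occlude_square(image, ratio, rng=random, fill=0.0):
    """Return a copy of `image` with one random square region masked out.

    ratio: fraction of the image area to cover (0.0 to 0.5 in the experiment).
    """
    height, width = len(image), len(image[0])
    # Side length whose square covers roughly `ratio` of the image area.
    side = min(height, width, round(math.sqrt(ratio * height * width)))
    top = rng.randint(0, height - side)
    left = rng.randint(0, width - side)

    occluded = [row[:] for row in image]  # leave the input untouched
    for r in range(top, top + side):
        for c in range(left, left + side):
            occluded[r][c] = fill
    return occluded
```

Sweeping `ratio` over 0.0, 0.1, and so on up to 0.5 and measuring accuracy at each level gives a degradation curve of the kind described above.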

Main Findings

Accuracy alone is not sufficient for selecting augmentation in biometric systems. AGVBench shows that high recognition scores, calibrated confidence, and attack robustness often point to different augmentation choices.

Mixing methods dominate recognition and verification.

MixUp and PuzzleMix on VERA220 with ResNet18 reach 95.27% and 95.55% Top-1 accuracy, far above the 71.45% vanilla baseline. On SCUT1100, MixUp reduces ResNet18 EER from 0.30% to 0.07% and raises TAR@FAR=0.0001 from 97.30% to 99.63%.

Geometric transforms are risky for vein topology.

Flip, rotation, translation, and related spatial operations often underperform the baseline because vascular identity depends on stable topology rather than object-level semantic invariance.

Reliability exposes unresolved trade-offs.

MixUp-based methods can be poorly calibrated and adversarially fragile; on TJU600 with ResNet18, MixUp drops to 4.87% Top-1 accuracy under PGD. LabelSmoothing is more robust to FGSM and PGD but can also be miscalibrated.

Composed augmentations can be better than any single family.

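AutoAugment + PuzzleMix + LabelSmoothing reaches 98.00% accuracy, 0.56% EER, and 95.27% TAR@FAR=0.0001 on VERA220, and 96.50%, 0.45%, and 96.12% on TJU600. The gain is not uniform across metrics: PuzzleMix + LabelSmoothing is slightly ahead on TJU600 accuracy (96.58%) and EER (0.38%), and MixUp + LabelSmoothing has the higher VERA220 TAR@FAR (96.36%).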

Codebase And Reproducibility

AGVBench is designed as a PyTorch/MMCV-based benchmark. The expected workflow is configuration driven: choose a dataset, backbone, augmentation policy, and evaluation target, then launch the same training and testing entry points across all methods.

# Example workflow
conda create -n agvbench python=3.10
conda activate agvbench
pip install torch torchvision mmcv

# Train one augmentation setting
python tools/train.py configs/vera220/resnet18/mixup.py

# Evaluate recognition, verification, calibration, and robustness
python tools/test.py configs/vera220/resnet18/mixup.py --metrics all

Figure: Quick-start workflow for running AGVBench experiments.

Analysis, Discussion And Conclusion

AGVBench reveals that augmentation for biometric recognition is a multi-objective design problem rather than a single accuracy optimization problem.

Analysis

Multi-image methods such as MixUp, PuzzleMix, and StarMixup usually provide the strongest recognition and verification performance. Their gains are most visible on smaller datasets such as VERA220 and FV-USM, while strict verification metrics remain informative even on saturated datasets such as SCUT1100.

Discussion

Geometric augmentations often degrade vein recognition because they disturb vascular topology. High-accuracy mixup-style methods can also be poorly calibrated and adversarially fragile, showing that no single current method resolves accuracy, calibration, and robustness simultaneously.

Conclusion

AGVBench evaluates 30 augmentation strategies across five datasets, seven architectures, and six evaluation dimensions. The benchmark motivates augmentation methods that jointly optimize recognition accuracy, verification security, calibration, and robustness for real-world vein recognition systems.

Cite

@article{li2026agvbench,
  title  = {AGVBench: A Reliability-Oriented Benchmark of Data Augmentation for Vein Recognition},
  author = {Li, Haiyang and Fu, Yuming and Song, Qun and Liao, Hongchao and Chen, Jing and El-Yacoubi, Mounim A. and Jin, Xin},
  journal = {arXiv},
  year   = {2026}
}