Building a Non-EU/UK Facial Image Dataset with Age Progression Diversity
A 1,205-participant, time-separated face image corpus to strengthen fairness and robustness in computer vision models.
Project Overview
A global technology company building face-centric AI for safety, personalization, and identity experiences sought a Non-EU/UK dataset with time-separated photos to reduce bias and improve model resilience across age, environment, and accessories.
The client partnered with Shaip to collect, curate, and validate a large facial image corpus in which each participant contributes both recent and older photos. The aim was to encode natural age progression while enforcing strict Non-EU/UK provenance and achieving balanced gender and age quotas.
Key Stats
Participants
1,205 (Non-EU/UK only, 50/50 gender ±10–15%)
Age Mix
40% (10–29), 40% (30–49), 20% (50+) ±10–15% tolerance
Coverage
South/Southeast Asia, North & North/East Africa, Singapore, South America
Timeline
19 weeks
Challenges
Sourcing exclusively from Non-EU/UK populations while avoiding travel-origin EU/UK images.
Hitting 1,205 participants with tight gender and age tolerances.
Ensuring every ID provides both recent and historical photos, aligned to age bands.
Enforcing minimum image/face size, variety, and duplication limits without slowing throughput.
Solution
1. Country Panels & Provenance Controls
We established country-level sourcing pods across target regions and trained partners on provenance rules (Non-EU/UK only). Photos were screened for travel-origin risks using metadata cues (year, location markers) plus submitter attestations, reducing EU/UK leakage before QC. This mirrors Shaip’s proven practice of front-loading risk checks to protect downstream throughput.
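The screening step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the `PhotoSubmission` fields, the `EXCLUDED_COUNTRIES` set (deliberately truncated), and the three-way `reject`/`review`/`pass` outcome are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately incomplete exclusion list (ISO 3166-1 alpha-2 codes).
# The real list of EU/UK countries used on the project is not given in the source.
EXCLUDED_COUNTRIES = {"GB", "FR", "DE", "IT", "ES", "NL", "IE"}


@dataclass
class PhotoSubmission:
    photo_id: str
    capture_country: Optional[str]  # location marker from metadata, if present
    attested_non_eu_uk: bool        # submitter attestation


def screen_provenance(photo: PhotoSubmission) -> str:
    """Return 'reject', 'review', or 'pass' for one photo (assumed policy)."""
    if photo.capture_country in EXCLUDED_COUNTRIES:
        return "reject"   # metadata points to an EU/UK capture location
    if not photo.attested_non_eu_uk:
        return "reject"   # no attestation from the submitter
    if photo.capture_country is None:
        return "review"   # no location cue: route to a human check
    return "pass"
```

Routing photos without a location marker to human review, rather than rejecting them outright, is one possible design choice; older photos often carry no usable metadata.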
2. Age Progression Capture Design
Rather than “ask for 20 images,” we designed a two-track submission flow that guided participants to:
- Track A (Recent): photos from the last two years;
- Track B (Historical): older photos aligned to the participant’s age band at submission (e.g., 2–10/15/20-year windows).
The portal nudged users with examples (indoor/outdoor, angles, accessories) to drive variety without over-specifying.
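As a rough sketch of how the two tracks could be assigned from the photo year, assuming year-level granularity only (the function name and the constant are hypothetical; the two-year window for Track A comes from the description above):

```python
from datetime import date

RECENT_WINDOW_YEARS = 2  # Track A: "photos from the last two years"


def assign_track(photo_year: int, submission_date: date) -> str:
    """Return 'A' (recent) or 'B' (historical) based on the photo's age in years."""
    age_years = submission_date.year - photo_year
    if age_years < 0:
        raise ValueError("photo year is after the submission date")
    return "A" if age_years <= RECENT_WINDOW_YEARS else "B"
```

For example, `assign_track(2015, date(2024, 6, 1))` falls into Track B. Checking a Track B photo against the participant's age-band window would need a further rule that the source does not spell out.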
3. Diversity Orchestration & Quota Guardrails
A real-time quota dashboard monitored enrollments by gender, age band, and geography, pausing intake once a stratum reached its planned limit. This prevented late-cycle rework and reflects Shaip’s standard approach of stratified enrollment plus lockouts, used in prior biometric datasets to maintain balanced representation.
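The lockout behaviour can be illustrated with a small in-memory counter. This is a sketch under stated assumptions: the `QuotaGuard` class, the stratum tuple shape, and the per-stratum limits are invented for the example and are not taken from the project.

```python
from collections import Counter


class QuotaGuard:
    """Tracks enrollments per stratum and closes a stratum at its limit."""

    def __init__(self, limits):
        # limits: {(gender, age_band, region): max participants}, hypothetical values
        self.limits = dict(limits)
        self.counts = Counter()

    def is_open(self, stratum) -> bool:
        # Strata without a configured limit are treated as closed.
        return self.counts[stratum] < self.limits.get(stratum, 0)

    def enroll(self, stratum) -> bool:
        """Record one enrollment; return False if the stratum is locked out."""
        if not self.is_open(stratum):
            return False
        self.counts[stratum] += 1
        return True
```

With a limit of 2 for a given stratum, the first two `enroll` calls return `True` and the third returns `False`, which is the point at which intake for that stratum would pause.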
4. Quality Pipeline (Human-in-the-Loop + Automated Pre-Checks)
- Automated gates: face detection plus minimum-size thresholds, basic blur/noise checks, and same-day clustering to flag potential duplicates early.
- Human QA tiers: image-level reviewers validated subject exclusivity (primary participant only), scene/angle variety, and the absence of beautification filters; CQA auditors spot-checked batches prior to acceptance. This multi-layer QA mirrors Shaip’s published biometric data programs.
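Two of the automated gates can be sketched with the standard library alone. Note the simplifications: the face bounding box is assumed to come from an upstream detector that is not shown, `MIN_FACE_PX` is a placeholder because the source does not state the actual threshold, and byte-identical hashing stands in for the clustering step, so it only catches exact copies rather than near-duplicates.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List

MIN_FACE_PX = 128  # hypothetical threshold; the real minimum is not given


def passes_min_face_size(face_width: int, face_height: int,
                         min_px: int = MIN_FACE_PX) -> bool:
    """Accept a detected face only if its shorter side meets the threshold."""
    return min(face_width, face_height) >= min_px


def flag_exact_duplicates(images: Dict[str, bytes]) -> List[List[str]]:
    """Group image IDs whose file bytes are identical (SHA-256 match)."""
    groups = defaultdict(list)
    for image_id, data in images.items():
        groups[hashlib.sha256(data).hexdigest()].append(image_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Catching resized or re-encoded copies would require a perceptual hash or embedding comparison, which is outside the scope of this sketch.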
5. Compliance & Consent
Participants aged 20 and over enrolled with signed consent; participants under 20 were accepted only with guardian consent. We recorded in the metadata whether consent was on file and aligned reviewer checklists to the eligibility and consent fields, ensuring auditability.
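The eligibility rule reduces to a short check. The sketch below assumes that guardian consent alone is sufficient for participants under 20; the source does not say whether the participant's own signature is also required in that case. The function and parameter names are hypothetical.

```python
ADULT_AGE = 20  # from the text: signed consent applies at 20 and over


def is_eligible(age: int, has_signed_consent: bool,
                has_guardian_consent: bool = False) -> bool:
    """Apply the consent rule: self-consent at 20+, guardian consent below."""
    if age >= ADULT_AGE:
        return has_signed_consent
    return has_guardian_consent  # assumption: guardian consent is sufficient
```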
6. Metadata & Traceability
We delivered participant- and image-level metadata (ID linkages, demographics, nationality/residence, year of photo, submission date, etc.) and standardized field names to simplify downstream labeling and evaluation. This follows Shaip’s best practice of rich metadata tagging for biometric datasets.
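One way to picture the participant-to-image linkage is a pair of record types serialized to JSON. The field names below are illustrative only; the delivered schema is not published in this case study.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class ImageRecord:
    image_id: str
    participant_id: str   # links the image back to its participant
    photo_year: int
    submission_date: str  # ISO 8601 date string


@dataclass
class ParticipantRecord:
    participant_id: str
    gender: str
    age_band: str
    nationality: str
    country_of_residence: str
    consent_on_file: bool
    images: List[ImageRecord] = field(default_factory=list)


def to_json(record: ParticipantRecord) -> str:
    """Serialize a participant and its nested images with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping `participant_id` on every image record means an image can be traced back to its participant even when image metadata is exported on its own.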
7. Phased Delivery to De-Risk Scale
An 8-batch plan began with a 10-participant calibration set, followed by controlled scale-up. Client feedback after batch 1 informed rubric tweaks, and volumes then ramped in predictable tranches to reach 1,205 participants in ~19 weeks.
Project Scope
| Dimension | What We Delivered |
|---|---|
| Population | 1,205 Non-EU/UK participants with balanced gender and age bands. |
| Content | ≥20 images per participant: recent + historical to encode age progression; varied scenes, angles, and accessories. |
| Quality Ops | Automated pre-checks + human multi-layer QA (duplication controls; subject exclusivity; filter rejection). |
| Compliance | Non-EU/UK provenance verification; consent governance and eligibility validation. |
| Metadata | Participant + image attributes for traceability and downstream ML evaluation. |
| Delivery | 8 phased batches, starting with calibration, then steady-state delivery to the final target. |
The Outcome
- Balanced, audit-ready corpus: Demographic quotas met within tolerance; Non-EU/UK provenance enforced across all images for compliant training.
- Model-ready variability: Time-separated images, diverse environments/angles, and accessory coverage support robustness testing and bias analysis.
- Operational predictability: Calibration-first rollout plus quota guardrails reduced rework and safeguarded the timeline to the full 1,205-participant target.
- Downstream efficiency: Rich metadata and consistent file hygiene shortened the path to annotation and benchmark construction, following Shaip’s biometric dataset playbooks.
“Shaip turned a complex Non-EU/UK facial dataset brief into a balanced, audit-ready corpus. Their age-progression design and tiered QA gave our CV team clean, diverse data we could trust, without schedule risk.”