Off-the-shelf facial image & video data licensing
Off-the-Shelf Facial Recognition Datasets for AI Model Training
Leveraging ethically sourced, demographically diverse datasets to accelerate AI model training and reduce bias for a leading global technology conglomerate.

Project Overview
The client sought to accelerate AI-driven facial recognition development without undergoing long, costly data collection cycles. To achieve this, they needed ready-to-use datasets that were not only large and diverse, but also ethically sourced and compliant with global data privacy regulations.
Shaip delivered comprehensive datasets with controlled variations in lighting, head poses, occlusions, and emotions, enabling the client’s models to achieve both accuracy and fairness while meeting required ethnic and demographic criteria. Each dataset included detailed metadata, pose annotations, and bounding boxes for emotion recognition, allowing models to be trained and tested in highly diverse, real-world scenarios.

Key Stats
- 7,000+ subjects in the Historical Dataset, with 300,000+ images and 2,000 videos.
- 10,000+ subjects in the Multi-Angle Emotion Dataset.
- 74,880 images in the Lighting Variation Dataset.
- 18,600 images covering six core emotions.
Project Scope
The client required large-scale, ethically sourced, and demographically diverse facial image and video datasets to support the development and training of facial recognition models. These datasets were essential to power use cases in anti-spoofing, identity verification, image matching, and expression analysis systems, ensuring robust and unbiased AI performance in real-world applications.
The scope of the engagement included:
- Delivering curated datasets designed to meet facial recognition use cases like anti-spoofing, identity verification, and expression recognition.
- Providing images and videos with detailed annotations for demographics, head pose, occlusions, lighting type, and emotions.
- Ensuring balanced demographic coverage to reduce systemic bias in training.
- Guaranteeing participant consent and compliance with global data protection and privacy standards.
Sample Dataset Contributions:
- Historical Dataset (~7,000 subjects): 300,000+ images & 2,000 videos with pose and occlusion variations.
- Multi-Angle Emotion Dataset (~10,000 subjects): 15–20 images per subject across angles and emotional states.
- Six Emotions Dataset (~3,100 subjects): 18,600 annotated images covering core human expressions.
- Lighting Variation Dataset (~468 subjects): 74,880 images across nine lighting conditions.
Challenges
The project addressed key challenges common in building robust AI models:
- Preventing over-representation of specific ethnicities or genders to ensure fairness.
- Capturing lighting conditions, facial angles, occlusions, and natural expressions.
- Providing hundreds of thousands of high-resolution images without compromising diversity.
- Meeting stringent global privacy and data protection requirements with full participant consent.
Solution
Shaip implemented a structured approach to ensure dataset quality and relevance:
- Curated balanced datasets with broad ethnic, gender, and age representation.
- Captured multi-angle poses and lighting variations to replicate real-world conditions.
- Added detailed annotations (e.g., head pose, occlusions, emotions) to enrich dataset usability.
- Established strict quality control and compliance workflows to guarantee ethical sourcing and privacy adherence.
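
As an illustration of the balancing step, the sketch below shows one way a team might check a dataset manifest for over-represented groups before training. It is a minimal example under stated assumptions, not Shaip's workflow: the manifest layout, the field names (`gender`, `ethnicity`), the group labels, and the 55% ceiling are all hypothetical.

```python
from collections import Counter

def group_shares(records, field):
    """Return each group's share of the records for one metadata field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def over_represented(shares, max_share):
    """List the groups whose share exceeds the chosen ceiling."""
    return sorted(g for g, s in shares.items() if s > max_share)

# Hypothetical manifest: one record per subject (field names are assumptions).
manifest = (
    [{"gender": "female", "ethnicity": "A"}] * 30
    + [{"gender": "male", "ethnicity": "A"}] * 30
    + [{"gender": "female", "ethnicity": "B"}] * 20
    + [{"gender": "male", "ethnicity": "B"}] * 20
)

gender_shares = group_shares(manifest, "gender")
ethnicity_shares = group_shares(manifest, "ethnicity")

# The 0.55 ceiling is an arbitrary illustrative threshold.
flagged = over_represented(ethnicity_shares, max_share=0.55)

print(gender_shares)
print(ethnicity_shares)
print(flagged)
```

The same two functions can be run per attribute (age band, lighting type, head pose) to see where a dataset needs additional subjects.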
Dataset Portfolio
Dataset | Volume | Demographics / Diversity | Standards / Specs |
---|---|---|---|
Historical Facial Image & Video Dataset (~7,000 Subjects) | 7,000 enrollment images; 300,000+ historical images; 2,000 videos (1 indoor + 1 outdoor each for 1,000 subjects) | Ethnicity: Black (35%), East Asian (42%), South Asian (13%), White (10%); Gender: 50% Male / 50% Female; Age: Adults 18+ (last 10 years) | Video duration: 1–2 min; Head pose variation (P1–P7); 5 occlusion types (O0–O4) |
Facial Image Dataset (~5,000 Subjects) | 35 images per subject; 2,500 Indian, 1,000 Asian, and 1,500 Black subjects | Age: 18–60 years; Balanced gender distribution | No beautification; Varied background & clothing; Min. resolution: 960×1280 |
Multi-Angle Emotion Dataset (~10,000 Subjects – Chinese) | 15–20 images per subject; Poses: Front, Left, Right (30°–60°); Expressions: Smile, open-mouth, sad, serious, neutral | Ethnicity: Chinese; Age: 18–26; Gender: 50/50 split | Resolution: 2160×3840 pixels or higher |
Six Human Emotions Dataset (~3,100 Subjects) | 6 images per subject (different expressions); 18,600 total images | Images by ethnicity: Japanese (9,000), Korean (2,400), Chinese (2,400), Southeast Asian (2,400), South Asian (2,400); Age: 20–65 years | Bounding box annotations for emotions; Plain backgrounds; No hats, glasses, or obstructions |
Lighting Variation Dataset (~468 Indian Subjects) | 160 images per subject; Total: 74,880 images | Age: 20–70; 70% Male | 9 lighting conditions (indoor, outdoor, side light, backlight, neon, etc.) |
Multi-Ethnic Facial Image Dataset (~600 Subjects) | 3,752 total images | Ethnicities: African, Middle Eastern, Native American, South Asian, Southeast Asian; Age: 20–70 years | — |
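
The volumes in the table can be cross-checked with simple arithmetic. The sketch below multiplies subjects by images per subject and compares the result with the stated totals, using only figures from the table. The per-group subject counts for the Historical dataset are an estimate that assumes the ethnicity percentages apply to subjects and that the dataset has exactly 7,000 of them.

```python
# (subjects, images per subject, stated total images), as listed in the table
datasets = {
    "Six Human Emotions": (3_100, 6, 18_600),
    "Lighting Variation": (468, 160, 74_880),
}

# True where subjects x images-per-subject matches the stated total
results = {
    name: subjects * per_subject == total
    for name, (subjects, per_subject, total) in datasets.items()
}

# Six Human Emotions: the per-ethnicity counts should add up to the image total
six_emotion_images = {
    "Japanese": 9_000,
    "Korean": 2_400,
    "Chinese": 2_400,
    "Southeast Asian": 2_400,
    "South Asian": 2_400,
}
six_emotion_sum = sum(six_emotion_images.values())

# Historical dataset: percentages should sum to 100.
# Assumption: the percentages describe subjects, and there are exactly 7,000.
historical_pct = {"Black": 35, "East Asian": 42, "South Asian": 13, "White": 10}
pct_sum = sum(historical_pct.values())
approx_subjects = {g: 7_000 * p // 100 for g, p in historical_pct.items()}

print(results)
print(six_emotion_sum, pct_sum)
print(approx_subjects)
```

Running the same check against a delivered manifest is a quick way to confirm that a received dataset matches its specification.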
Outcome
The collaboration delivered significant business and technical impact:
- Improved Model Accuracy: Enhanced precision and recall for facial recognition models across multiple use cases.
- Bias Reduction: Balanced demographic representation reduced systemic bias in AI outputs.
- Accelerated Development Timelines: Off-the-shelf datasets allowed rapid prototyping and model training without lengthy data collection.
- Regulatory Compliance: All datasets adhered to global privacy standards and included participant consent.
“Shaip’s diverse, ethically sourced datasets gave us the speed, quality, and compliance we needed. With ready-to-use data, we accelerated AI model training and significantly reduced systemic bias.”