Shaip provides the most comprehensive catalog of HIPAA-compliant, physician-annotated medical datasets — purpose-built to train, fine-tune, and evaluate clinical AI models at enterprise scale.
Shaip’s medical datasets are trusted by Fortune 500 companies, top AI research labs, and the world’s fastest-growing health AI startups.
Shaip offers a medical data platform that combines physician-grade annotation, global data collection, and built-in compliance, so your team focuses on building models, not hunting for data.
Access pre-built, immediately licensable datasets across clinical notes, EHR, imaging, dictation, and more. Skip data collection entirely.
Every dataset is annotated and QA'd by licensed medical professionals — not crowdsourced workers. Expect 98%+ inter-annotator agreement.
Need custom data from specific demographics, languages, or geographies? We collect, process, and deliver it — fully compliant and to spec.
All datasets are de-identified, IRB-cleared, and HIPAA/GDPR compliant. Legal agreements and NDAs available from day one.
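Inter-annotator agreement can be measured in more than one way, and the 98%+ figure above does not say which metric it refers to. As a minimal sketch, assuming two annotators label the same items, the Python below computes raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The labels and function names are illustrative only and are not Shaip tooling.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both annotators labelled independently
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical entity labels from two annotators on the same five spans
annotator_1 = ["DRUG", "DX", "DX", "PROC", "DRUG"]
annotator_2 = ["DRUG", "DX", "PROC", "PROC", "DRUG"]
agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
```

Raw agreement is easy to read but can look high when one label dominates, which is why a chance-corrected statistic is often reported alongside it.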
| Gender | Patient Audio Files (Playtime in Hours) | Total No. of Audio Files |
|---|---|---|
| Total | 257,977 | 5,172,766 |
| Male | 58,850 | 2,444,910 |
| Female | 113,406 | 1,290,900 |
| Unknown | 85,721 | 1,436,956 |
| Specialty | Patient Audio Files (Playtime in Hours) | Total No. of Audio Files |
|---|---|---|
| Total | 257,977 | 5,172,766 |
| Accident & emergency | 9 | 359 |
| Allergy and Immunology | 1,152 | 22,202 |
| Anesthesiology | 677 | 22,280 |
| Anesthetics | 1 | 9 |
| APRN | 163 | 1,693 |
| Cardiology | 67,504 | 1,566,721 |
| Cardiothoracic | 17 | 122 |
| Cardiothoracic surgery | 1 | 10 |
| Clinical hematology | 0 | 2 |
| Colon and Rectal Surgery | 7 | 162 |
| Colorectal surgeon | 45 | 984 |
| Critical care medicine | 220 | 4,328 |
| Dentist | 1 | 65 |
| Dermatology | 771 | 23,014 |
| Dietitian and nutritionist | 44 | 736 |
| Emergency medicine | 4,911 | 112,518 |
| Endocrinology | 205 | 7,052 |
| ENT | 7,010 | 175,477 |
| Family medicine | 1,767 | 106,733 |
| Gastroenterology | 1,458 | 40,365 |
| General medicine | 140 | 5,757 |
| General practice | 41 | 1,318 |
| General surgery | 2,038 | 71,744 |
| Gynecology | 3,269 | 103,370 |
| Hand surgery | 2 | 45 |
| Hematology | 258 | 8,125 |
| Hospitalist | 5,931 | 142,529 |
| Infectious disease | 493 | 14,001 |
| Internal medicine | 15,410 | 445,591 |
| Interventional cardiology | 1,511 | 43,035 |
| Maternal-Fetal medicine | 51 | 1,355 |
| Neonatology | 1,045 | 24,760 |
| Nephrology | 735 | 20,334 |
| Neurology | 2,269 | 63,774 |
| Neurosurgery | 462 | 11,990 |
| Nuclear Medicine | 2 | 23 |
| OBGYN | 3,562 | 122,303 |
| Oncology | 2,938 | 82,996 |
| Ophthalmology | 1,316 | 41,047 |
| Optometry | 33 | 1,066 |
| Orthopedics | 5,665 | 164,483 |
| Otolaryngology | 3,433 | 100,811 |
| Pathology | 166 | 4,097 |
| Pediatric pulmonology | 4 | 40 |
| Pediatric specialty | 35 | 682 |
| Pediatric surgery | 2 | 23 |
| Pediatrics | 877 | 9,271 |
| Physical Medicine & Rehabilitation | 1,347 | 23,523 |
| Physical Therapist | 114 | 1,713 |
| Physician Asst. | 6 | 38 |
| Podiatric Surgery | 4 | 24 |
| Podiatry | 473 | 12,296 |
| Primary Care | 651 | 20,120 |
| Psychiatry | 2,120 | 60,381 |
| Pulmonology | 1,290 | 35,290 |
| Radiation oncology | 239 | 6,558 |
| Radiology | 3,345 | 99,641 |
| Rheumatology | 293 | 8,729 |
| SICU | 1 | 25 |
| Speech pathology | 3 | 28 |
| Surgical oncology | 217 | 5,758 |
| Thoracic surgery | 107 | 3,336 |
| Transplant surgery | 61 | 1,535 |
| Urology | 3,170 | 96,934 |
| Upper gastrointestinal surgery | 4 | 58 |
| VASCULAR SURGERY | 19 | 156 |
| Vascular/General | 9 | 268 |
| Wound Care | 15 | 211 |
| Specialty | Approx. No. of Medical Records | Approx. No. of Characters |
|---|---|---|
| Total | 5,172,766 | 11,331,920,127 |
| Pain Medicine | 11 | 35,515 |
| Podiatric Surgery | 24 | 108,258 |
| Plastic surgery – specialty | 183 | 604,359 |
| Physician Asst. | 38 | 127,349 |
| Physical Therapist | 1,713 | 4,681,870 |
| Physical Medicine & Rehabilitation | 23,523 | 57,701,697 |
| Pediatrics | 9,271 | 42,654,058 |
| Pediatric surgery | 23 | 90,525 |
| Pediatric specialty | 682 | 2,063,509 |
| Pediatric pulmonology | 40 | 158,625 |
| Pediatric Dentistry | 420 | 899,253 |
| Pathology | 43,462 | 27,660,828 |
| PANP | 145,960 | 445,332,915 |
| Podiatry | 12,056 | 39,163,411 |
| Pain Management | 30 | 62,650 |
| Otolaryngology | 19,548 | 39,500,098 |
| Osteopathic | 5,566 | 13,679,541 |
| Orthopedic | 145,053 | 277,508,345 |
| Orthopaedics & Sports Medicine | 3,165 | 14,393,798 |
| Oral surgery | 13 | 32,527 |
| Oral & Maxillofacial Surgeon | 8 | 18,733 |
| Ophthalmology | 19,299 | 44,844,680 |
| OPERATIVE CARE | 5 | 13,637 |
| Oncology | 82,300 | 296,370,809 |
| Occupational Therapist | 68 | 238,853 |
| Surgery | 236,788 | 642,735,680 |
| Wound Care | 211 | 582,123 |
| Vascular/General | 268 | 411,007 |
| VASCULAR SURGERY | 156 | 674,129 |
| Urology | 96,934 | 135,527,616 |
| Upper gastrointestinal surgery | 58 | 180,361 |
| Unknown | 748,054 | 1,695,098,900 |
| Trauma & orthopedics | 1,308 | 5,308,512 |
| Transplant | 32 | 128,670 |
| Thoracic surgery | 37 | 153,325 |
| Thoracic medicine | 27 | 164,106 |
| Surgical specialty | 290 | 1,014,789 |
| Surgery Physician Assistant | 3 | 4,315 |
| Occupational medicine | 763 | 3,476,696 |
| Sports Medicine | 49 | 148,200 |
| Speech Therapy | 327 | 981,803 |
| Rheumatology | 124 | 432,080 |
| Resident | 641 | 1,990,867 |
| Rehabilitation | 30,078 | 96,187,590 |
| Radiology | 630,983 | 641,987,812 |
| Pulmonary | 64,368 | 156,629,273 |
| Psychotherapy (specialty) | 229 | 2,961,345 |
| Psychiatry | 70,269 | 351,076,474 |
| PRIMARY CARE ATTENDING | 7 | 27,134 |
| Preventive Medicine | 191 | 435,298 |
| Dental | 1,233 | 2,974,753 |
| General | 313 | 1,377,179 |
| Gastroenterology | 62,158 | 127,938,968 |
| Family Practice | 2,498 | 6,942,820 |
| Family Nurse Practitioner | 9,018 | 18,624,462 |
| Family Medicine | 263,480 | 534,093,592 |
| Endocrinology | 3,212 | 9,107,557 |
| Emergency Room Specialist | 378 | 1,272,557 |
| Emergency | 62,256 | 162,431,343 |
| ED Physician Assistant | 70 | 31,316 |
| Ear, Nose And Throat | 658 | 2,074,977 |
| Diagnostic Radiology | 7,591 | 7,268,441 |
| Dermatology | 3,474 | 6,228,845 |
| General dental practice | 25 | 99,740 |
| Critical Care | 9,645 | 34,213,951 |
| Clinical physiology | 160 | 1,003,807 |
| Clinical hematology | 2 | 7,546 |
| Cardiothoracic surgery | 10 | 55,321 |
| Cardiothoracic | 122 | 706,280 |
| Cardiology | 1,566,721 | 3,209,850,575 |
| APRN | 1,693 | 5,436,558 |
| Anesthetics | 9 | 21,300 |
| Anesthesiology | 22,280 | 48,025,191 |
| Allergy and Immunology | 22,202 | 48,273,220 |
| Accident & emergency | 359 | 723,866 |
| IH-Industrial Health | 945 | 2,757,753 |
| OB/GYN | 42,739 | 114,118,874 |
| Nurse Practitioner – Family | 113 | 281,032 |
| Nurse Practitioner | 432 | 2,719,033 |
| Neurosurgery | 755 | 3,146,223 |
| Neurology | 17,786 | 49,064,199 |
| Neuro/TBI | 1,157 | 5,142,035 |
| Nephrology | 39,821 | 101,422,013 |
| Medicine | 122 | 368,833 |
| Medical oncology | 67 | 487,088 |
| Internal Medicine, Pulmonary Medicine, Critical Care Medicine And Sleep Medicine | 102 | 210,331 |
| Internal Medicine And Nephrology | 111 | 519,283 |
| Internal Medicine | 623,072 | 1,741,486,763 |
| Location | Text Documents |
|---|---|
| NorthEast | 4,473,573 |
| South | 1,801,716 |
| MidWest | 781,701 |
| West | 1,509,109 |
| Major Diagnosis Category | Text Documents |
|---|---|
| Total including everything (Cases with & without MDC category) | 8,566,687 |
| Cases without reimbursement generated (MDC not specified) | 790,697 |
| Outpatient Cases (MDC not specified) | 1,980,606 |
| Cases using a specialty grouper such as 3M (MDC not specified) | 1,619,682 |
| Total with MDC | 4,175,702 |
| Alcohol/Drug Use or Induced Mental Disorders | 48,717 |
| Burns | 444 |
| Eye | 3,549 |
| Male Reproductive System | 9,230 |
| Human Immunodeficiency Virus Infections | 12,422 |
| Myeloproliferative Diseases & Disorders, Poorly Differentiated Neoplasms | 15,620 |
| Factors Influencing Health Status & Other Contacts with Health Services | 21,294 |
| Female Reproductive System | 17,010 |
| Ear, Nose, Mouth & Throat | 22,987 |
| Multiple Significant Trauma | 27,902 |
| Circulatory System | 589,730 |
| Blood, Blood Forming Organs & Immunologic Disorders | 48,990 |
| Injuries, Poisonings & Toxic Effects of Drugs | 64,097 |
| Skin, Subcutaneous Tissue & Breast | 89,577 |
| Hepatobiliary System & Pancreas | 127,172 |
| Endocrine, Nutritional & Metabolic Diseases & Disorders | 142,808 |
| Newborns & Other Neonates with Conditions Originating in the Perinatal Period | 163,605 |
| Pregnancy, Childbirth & the Puerperium | 165,303 |
| Kidney & Urinary Tract | 209,561 |
| Mental Diseases & Disorders | 282,501 |
| Nervous System | 316,243 |
| Digestive System | 346,369 |
| Musculoskeletal System & Connective Tissue | 329,344 |
| Respiratory System | 561,983 |
| Infectious & Parasitic Diseases | 559,244 |
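The summary rows in the Major Diagnosis Category table reconcile as follows: cases with an MDC plus the three "MDC not specified" groups add up to the overall total. A quick check in Python, using only figures from the table (the variable names are illustrative):

```python
# Figures copied from the Major Diagnosis Category table
total_with_mdc = 4_175_702
no_reimbursement = 790_697      # cases without reimbursement generated
outpatient = 1_980_606          # outpatient cases
specialty_grouper = 1_619_682   # cases using a specialty grouper

overall = total_with_mdc + no_reimbursement + outpatient + specialty_grouper
# Matches the "Total including everything" row
assert overall == 8_566_687
```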
Pre-built, immediately licensable datasets — or custom collections built to your exact model specifications. Speak to our team to discuss your requirements.
Patient-physician dialog covering history taking, symptom assessment, treatment planning, and follow-up — across specialties and languages.
SOAP/progress notes and discharge summaries with NER labels for diagnoses, medications, procedures, and other clinical entities.
Structured, linked multi-encounter patient records built for AI models that track disease progression, treatment response, & clinical outcomes over time.
High-fidelity physician dictation recordings with accurate clinical transcriptions — ideal for training ASR and voice-to-EHR systems.
Structured and unstructured EHR data — labs, vitals, medications, problem lists — de-identified, HL7/FHIR-compatible, ready for ML pipelines.
Adverse event reports, clinical trial corpora, drug-drug interaction datasets, and pharmacovigilance data for pharma AI applications.
Verbatim and intelligent transcriptions of medical records with de-identification, clinical entity tagging, and format standardization.
Virtual consultation transcripts, remote monitoring datasets, and teleconsultation audio — annotated for clinical AI in digital health settings.
Whole-slide pathology images, genomics datasets, and biomarker data — annotated by pathologists for oncology AI, drug discovery, and precision medicine.
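To make "HL7/FHIR-compatible, ready for ML pipelines" more concrete, here is a minimal sketch of flattening a single FHIR Observation resource into a tabular row. The sample resource is synthetic, and the output column names are assumptions chosen for illustration; the actual delivery format of Shaip's EHR datasets is not specified on this page.

```python
import json

# Synthetic FHIR Observation (not real patient data)
observation_json = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": {"coding": [{"system": "http://loinc.org",
                       "code": "8867-4", "display": "Heart rate"}]},
  "subject": {"reference": "Patient/example-001"},
  "effectiveDateTime": "2024-01-15T09:30:00Z",
  "valueQuantity": {"value": 72, "unit": "beats/minute"}
}
"""


def flatten_observation(resource):
    """Map one Observation to a flat dict; missing fields become None."""
    if resource.get("resourceType") != "Observation":
        raise ValueError("expected a FHIR Observation resource")
    # Take the first coding entry, or an empty dict if none is present
    coding = (resource.get("code", {}).get("coding") or [{}])[0]
    quantity = resource.get("valueQuantity", {})
    return {
        "patient_ref": resource.get("subject", {}).get("reference"),
        "code_system": coding.get("system"),
        "code": coding.get("code"),
        "display": coding.get("display"),
        "value": quantity.get("value"),
        "unit": quantity.get("unit"),
        "effective": resource.get("effectiveDateTime"),
    }


row = flatten_observation(json.loads(observation_json))
```

Rows like this can be collected into a table for feature engineering, while the original resources are kept for traceability.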
Don’t see exactly what you need in the catalog? Shaip’s full-service healthcare AI data team can collect, de-identify, annotate, & deliver custom datasets at any scale.
Primary source data collection from patients, healthcare providers, and clinical settings — globally, across 40+ languages and 30+ specialties.
Physician-led annotation services covering all clinical data types — text, image, audio, and video — with rigorous multi-layer QA.
HIPAA Safe Harbor and Expert Determination de-identification of clinical text, images, and audio — with full audit trails and legal documentation.
Clinically realistic synthetic datasets generated to augment scarce real-world data — without PHI exposure risk.
Physician dictation, clinical call recordings, and patient-facing voice interaction datasets — transcribed and annotated for clinical ASR models.
Human + AI transcription of clinical audio, video, and handwritten records — formatted for downstream ML pipelines and EHR integration.
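HIPAA Safe Harbor de-identification requires removing 18 categories of identifiers, and production pipelines typically combine pattern rules, NLP models, and human review. The sketch below illustrates only the pattern-rule idea on a synthetic note. The regular expressions and placeholder tags are simplified assumptions and would not be sufficient for compliance on their own.

```python
import re

# Simplified, illustrative patterns; real Safe Harbor coverage is far broader
PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
    ("[DATE]", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]


def redact(text):
    """Replace each matched identifier with its placeholder tag."""
    for tag, pattern in PATTERNS:
        text = pattern.sub(tag, text)
    return text


# Synthetic example note (no real patient information)
note = "Seen on 03/14/2023. Call 555-123-4567 or email jane.doe@example.com."
redacted = redact(note)
# redacted == "Seen on [DATE]. Call [PHONE] or email [EMAIL]."
```

Names, addresses, and free-text identifiers cannot be caught reliably by patterns like these, which is one reason audit trails and expert review matter.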
As healthcare organizations race to deploy clinical LLMs, the bottleneck is always the same: high-quality, domain-specific training data. Shaip provides the full data stack for healthcare AI foundation models.
Curated instruction-response pairs for clinical tasks: diagnosis, triage, drug interaction checking, discharge summary generation, and more.
Human preference data ranked by licensed physicians — essential for aligning medical LLM outputs with clinical standards and safety requirements.
Structured, chunked, and embedded clinical knowledge for RAG pipelines — clinical guidelines, formulary data, ICD/SNOMED reference sets.
Expert-curated test sets and clinical benchmarks to evaluate LLM accuracy, hallucination rates, safety, and clinical utility at scale.
Combined text + imaging + audio for multimodal medical AI — radiology reports paired with DICOM images, physician notes with audio recordings.
Adversarial prompt datasets and clinical edge-case testing to surface failure modes in healthcare LLMs before deployment in clinical settings.
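Instruction-tuning and preference datasets like those described above are commonly exchanged as JSON Lines. The sketch below shows one plausible record shape for each. The field names (`instruction`, `response`, `chosen`, `rejected`, and so on) are assumptions for illustration, not a published Shaip schema, and the clinical text is synthetic.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class InstructionRecord:
    instruction: str
    input: str
    response: str
    specialty: str


@dataclass
class PreferenceRecord:
    prompt: str
    chosen: str         # response the reviewing clinician ranked higher
    rejected: str       # response ranked lower
    reviewer_role: str


def to_jsonl(records):
    """Serialize dataclass records to one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


sft = InstructionRecord(
    instruction="Summarize the discharge instructions for the patient.",
    input="Synthetic note: 58-year-old admitted for chest pain, ruled out for MI.",
    response="You were admitted for chest pain. Tests showed no heart attack.",
    specialty="cardiology",
)
pref = PreferenceRecord(
    prompt="Can I take ibuprofen with warfarin?",
    chosen="Combining them raises bleeding risk; check with your prescriber first.",
    rejected="Yes, that is always fine.",
    reviewer_role="pharmacist",
)
jsonl = to_jsonl([sft, pref])
```

Keeping reviewer metadata next to each preference pair makes it possible to filter or weight examples by who ranked them.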
Not a generic annotation marketplace. Not a crowdsourcing platform. Shaip is the only purpose-built healthcare AI data company with a decade of clinical domain depth.
10+ years of working exclusively in healthcare AI data — we understand clinical terminologies, workflows, and regulatory requirements that generic providers simply don't.
Our annotation workforce consists entirely of licensed physicians, radiologists, nurses, and pharmacists. No crowdworkers. No general contractors. Only clinicians.
HIPAA, GDPR, IRB, ISO 27001, SOC 2 — we own the entire compliance and de-identification process so your legal team can approve datasets on first review.
Start with ready-to-license catalog data, then scale to custom collection without switching vendors. One relationship, one contract, one delivery pipeline.
Our partnership with Shaip has been instrumental in advancing our NLP capabilities within the oncology domain. The professional handling of 10,000 medical records with detailed clinical entity annotations demonstrated strong compliance and operational excellence.
Shaip designed and validated a complete MRI de-identification workflow for our research program. Their privacy-first pipeline secured nearly 100,000 scans for compliant multi-institutional research data sharing.
Shaip curated and annotated de-identified Pediatrics and OB-GYN outpatient records with ICD-10 CM codes delivered through an API pipeline, creating high-quality training datasets for clinical NLP models.
De-identified oncology records for clinical NLP including negation and entity labeling.
MRI scans secured via validated privacy-first de-identification pipeline.
Outpatient Pediatrics & OB-GYN records annotated with ICD-10 CM codes.
No months-long RFP cycles. No ambiguity. A clear, fast path from data requirement to delivery.
Fill out the form, and a Shaip specialist with clinical domain expertise — not a generic SDR — will connect with you to understand your exact requirements.
We scope your data requirements — modality, specialty, volume, annotation schema, language, compliance needs — and deliver a detailed proposal with timeline.
We deliver a pilot batch so your team can validate annotation quality, schema fit, and format before committing to full-scale production. Any adjustments are made before scaling up.
Full dataset delivered securely in your preferred ML format, with compliance documentation. Dedicated customer success manager for ongoing dataset needs.
Don’t let data constraints slow your clinical AI initiative. Talk to a Shaip healthcare data specialist today and get a clear path to the annotated medical data your model needs.