Open Datasets
Discover open source datasets that get you started training ML models
Open Source Datasets To Get You Started with AI/ML Models
The output of your AI and ML models is only as good as the data you use to train them, so the care you apply to aggregating, tagging, and labeling that data matters.
If you are starting a new AI/ML initiative, you will quickly realize that finding high-quality training data is one of the more challenging parts of the project, because high-quality datasets are the fuel that keeps the AI/ML engine running. We have compiled a list of open datasets that are free to use for training your AI/ML models.
| Specialization | Data Type | Dataset Name | Industry / Dept. | Annotation/Use Case | Link |
|---|---|---|---|---|---|
| +NLP | Text | Amazon Reviews | E-commerce | Sentiment Analysis | Link |
| Description | A set of 35 Mn. reviews and ratings spanning 18 years, in plain text with user and product details. | ||||
| +NLP | Text | Wikipedia Links Data | General | | Link |
| Description | More than 4 Mn. articles containing 1.9 Bn. words from Wikipedia. Each article contains hyperlinks for the associated entity. | ||||
| +NLP | Text | Stanford Sentiment Treebank | Entertainment | Sentiment Analysis | Link |
| Description | Sentiment annotations dataset for over 10,000 Rotten Tomatoes movie review sentences. Available at phrase level - each sentence is parsed into sub-phrases by binarizing the parse trees in the Penn Treebank format. | ||||
| +NLP | Text | Twitter US Airline Sentiment | Airline | Sentiment Analysis | Link |
| Description | Tweets from 2015 about US airlines, classified into positive, neutral, and negative sentiment. | ||||
| +CV | Image | ImageNet | General | | Link |
| Description | Dataset with over 14 Mn. images in various file formats mapped to around 21,000 synsets. Each synset is a set of synonyms for a concept, and the associated images depict that concept. 1 Mn. images have bounding boxes and more than 1 Mn. images have SIFT features. | ||||
| +CV | Image | Google’s Open Images | General | | Link |
| Description | A dataset similar to ImageNet with 600 categories. Available in development, validation and training splits. Some images also include bounding boxes and visual relationships. | ||||
| +NLP | Text | Cornell Movie Dialogs | Entertainment | Dialogs | Link |
| Description | A collection of fictional conversations, with metadata of characters and movies. Each row is a dialog between two people, in a question-answer format. | ||||
| +NLP | Text | Yahoo Answers | General | Question Answering | Link |
| Description | A question-answer dataset with questions and answers from Yahoo Answers portal between Apr 2007 and Oct 2007. | ||||
| +NLP | Text | MS MARCO | General | Question Answering | Link |
| Description | A question-answer dataset with annotations from Bing’s web search logs. Each question comes with an answer provided by a user, as well as web passages that contain the answer. | ||||
| +NLP | Text | Natural Questions Dataset | General | Question Answering | Link |
| Description | Released by Google, this dataset contains real user queries and answers from Wikipedia articles. | ||||
| +NLP | Text | DBPedia | General | Knowledge Graph | Link |
| Description | A structured rendering of Wikipedia, with entities and relations extracted as a Knowledge Graph. | ||||
| +NLP | Text | YAGO | General | Knowledge Graph | Link |
| Description | A knowledge graph containing entities and relations from Wikipedia, WordNet, and GeoNames. | ||||
| +NLP | Text | Freebase | General | Knowledge Graph | Link |
| Description | A crowd-sourced knowledge base consisting of entities and relationships, now incorporated into Google’s Knowledge Graph. | ||||
| +NLP | Text | OntoNotes | General | Semantic Role Labeling | Link |
| Description | A corpus with syntactic, semantic, and discourse-level annotations used in the CoNLL shared tasks. | ||||
| +NLP | Text | CoNLL 2003 | General | Named Entity Recognition | Link |
| Description | An English dataset annotated for named entities such as person, organization, and location. | ||||
| +CV | Image | COCO | General | Object Detection | Link |
| Description | Common Objects in Context: a richly annotated dataset for object detection, segmentation, and captioning. | ||||
| +CV | Image | PASCAL VOC | General | Object Detection | Link |
| Description | A benchmark dataset for object detection and segmentation challenges. | ||||
| +CV | Image | Cityscapes | Autonomous Driving | Semantic Segmentation | Link |
| Description | Dataset for urban scene understanding with pixel-level annotations for 30 classes. | ||||
| +CV | Image | MNIST | General | Digit Classification | Link |
| Description | Handwritten digits dataset with 60,000 training and 10,000 test images of 28x28 pixels. | ||||
| +CV | Image | Fashion-MNIST | Retail | Image Classification | Link |
| Description | Dataset of Zalando’s article images in the same format as MNIST, used as a drop-in replacement for benchmarking. | ||||
| +NLP | Audio | LibriSpeech | General | ASR | Link |
| Description | A corpus of read English speech derived from audiobooks, with about 1,000 hours of speech and the associated texts. | ||||
| +NLP | Audio | TED-LIUM | General | ASR | Link |
| Description | Transcribed TED talks with audio and aligned transcriptions for speech recognition research. | ||||
| +NLP | Audio | TIMIT | General | Phoneme Recognition | Link |
| Description | Phonetically transcribed speech of American English speakers, widely used for phoneme recognition tasks. | ||||
| +NLP | Audio | Common Voice | General | ASR | Link |
| Description | A multilingual corpus of read speech contributed by volunteers around the world. | ||||
| +NLP | Audio | VoxCeleb | General | Speaker Recognition | Link |
| Description | A large-scale speaker identification dataset collected from YouTube videos. | ||||
| +NLP | Text | Wikipedia Dump | General | Language Modeling | Link |
| Description | Full text dumps of Wikipedia articles, updated regularly, used for pretraining language models. | ||||
| +NLP | Text | Gigaword | News | Language Modeling | Link |
| Description | A comprehensive archive of newswire text data from multiple news agencies. | ||||
| +NLP | Text | IMDB Reviews | Entertainment | Sentiment Analysis | Link |
| Description | Large movie review dataset for binary sentiment classification. | ||||
| +CV | Video | Kinetics-700 | General | Action Recognition | Link |
| Description | A large-scale, high-quality dataset of YouTube video clips covering 700 human action classes. | ||||
| +CV | Video | UCF101 | General | Action Recognition | Link |
| Description | A dataset of realistic action videos, with 101 action categories. | ||||
| +CV | Video | HMDB51 | General | Action Recognition | Link |
| Description | A large human motion video database with 51 action categories. | ||||
| +CV | Image | LFW (Labeled Faces in the Wild) | General | Face Recognition | Link |
| Description | A database of face photographs designed for studying unconstrained face recognition. | ||||
| +CV | Image | CASIA-WebFace | General | Face Recognition | Link |
| Description | A dataset of roughly 500,000 face images covering about 10,000 identities, used for training deep face recognition models. | ||||
| +NLP | Text | SQuAD | General | Reading Comprehension | Link |
| Description | Stanford Question Answering Dataset: questions posed by crowdworkers on a set of Wikipedia articles. | ||||
| +NLP | Text | NewsQA | News | Reading Comprehension | Link |
| Description | A machine comprehension dataset with questions and answers based on CNN news articles. | ||||
| +NLP | Text | MultiNLI | General | Natural Language Inference | Link |
| Description | A dataset for sentence-pair natural language inference across multiple genres. | ||||
| +NLP | Text | SNLI | General | Natural Language Inference | Link |
| Description | Stanford Natural Language Inference Corpus with sentence pairs labeled as entailment, contradiction, or neutral. | ||||
| +NLP | Text | WikiText | General | Language Modeling | Link |
| Description | A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. | ||||
| +CV | Image | Stanford Cars | Automotive | Fine-grained Classification | Link |
| Description | A dataset of 16,185 images of 196 classes of cars. | ||||
| +CV | Image | Oxford Flowers 102 | Botany | Fine-grained Classification | Link |
| Description | 102 flower categories commonly occurring in the United Kingdom. | ||||
| +CV | Image | CIFAR-10 | General | Image Classification | Link |
| Description | Images of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. | ||||
| +CV | Image | CIFAR-100 | General | Image Classification | Link |
| Description | A dataset similar to CIFAR-10, but with 100 fine-grained classes. | ||||
| +CV | Image | VOC Person Layout | General | Pose Estimation | Link |
| Description | Part of PASCAL VOC focusing on person layout annotations such as head, hands, and feet. | ||||
| +CV | Image | MPII Human Pose | General | Pose Estimation | Link |
| Description | Around 25,000 images containing over 40,000 people with annotated body joints. | ||||
| +NLP | Text | Reuters-21578 | Finance | Text Classification | Link |
| Description | Collection of Reuters newswire articles for text categorization research. | ||||
| +NLP | Text | 20 Newsgroups | General | Text Classification | Link |
| Description | A collection of approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups. | ||||
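
To shortlist datasets from the table above by specialization, data type, or use case, the catalog can be treated as a list of records and filtered. The following is a minimal sketch, not part of the original listing: the `DatasetEntry` class, the `CATALOG` list, and the `find_datasets` helper are hypothetical names introduced here for illustration, and only a handful of rows are copied over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the table (hypothetical structure, mirrors the column headers)."""
    specialization: str  # "NLP" or "CV", as in the table
    data_type: str       # "Text", "Image", "Audio", or "Video"
    name: str
    industry: str
    use_case: str        # empty string where the table lists none


# A few rows copied from the table; extend with the remaining rows as needed.
CATALOG = [
    DatasetEntry("NLP", "Text", "Amazon Reviews", "E-commerce", "Sentiment Analysis"),
    DatasetEntry("NLP", "Text", "IMDB Reviews", "Entertainment", "Sentiment Analysis"),
    DatasetEntry("CV", "Image", "COCO", "General", "Object Detection"),
    DatasetEntry("CV", "Video", "UCF101", "General", "Action Recognition"),
    DatasetEntry("NLP", "Audio", "LibriSpeech", "General", "ASR"),
]


def find_datasets(catalog, **criteria):
    """Return entries whose fields equal every given criterion, ignoring case."""
    def matches(entry):
        return all(
            getattr(entry, field).lower() == value.lower()
            for field, value in criteria.items()
        )
    return [entry for entry in catalog if matches(entry)]


if __name__ == "__main__":
    # List the sentiment-analysis datasets among the copied rows.
    for entry in find_datasets(CATALOG, use_case="Sentiment Analysis"):
        print(entry.name)
```

Keyword arguments keep the filter open-ended: any column can be combined with any other, for example `find_datasets(CATALOG, specialization="CV", data_type="Video")`.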