Open Datasets

Discover open-source datasets to get you started training ML models

Open Source Datasets To Get You Started with AI/ML Models

The output of your AI and ML models is only as good as the data you use to train them, so the care you apply to aggregating, tagging, and identifying that data is important.

If you are starting a new AI/ML initiative, you will quickly realize that finding high-quality training data is one of the more challenging aspects of the project: high-quality datasets are the fuel that keeps the AI/ML engine running. We have compiled a list of open datasets that are free to use for training your AI/ML models.

Specialization: Natural Language Processing (NLP), Computer Vision (CV)

Data Type: Audio, Image, Text, Video

Specialization	Data Type	Dataset Name	Industry / Dept.	Annotation/Use Case	Link
+NLP	Text	Amazon Reviews	E-commerce	Sentiment Analysis	Link
Description	A set of 35 Mn. reviews and ratings from over the last 18 years in plain text, with user and product details.
+NLP	Text	Wikipedia Links Data	General		Link
Description	More than 4 Mn. articles containing 1.9 Bn. words from Wikipedia. Each article contains hyperlinks for the associated entity.
+NLP	Text	Stanford Sentiment Treebank	Entertainment	Sentiment Analysis	Link
Description	Sentiment annotations for over 10,000 sentences from Rotten Tomatoes movie reviews. Labels are available at the phrase level: each sentence is parsed into sub-phrases using binarized parse trees in Penn Treebank format.
+NLP	Text	Twitter US Airline Sentiment	Airline	Sentiment Analysis	Link
Description	Tweets from 2015 about US airlines, classified into positive, neutral, and negative sentiment.
+CV	Image	Imagenet	General		Link
Description	Dataset with over 14 Mn. images in various file formats mapped to around 21,000 synsets. A synset is a set of synonyms naming one concept, which the associated images depict. 1 Mn. images have bounding boxes and more than 1 Mn. images have SIFT features.
+CV	Image	Google’s Open Images	General		Link
Description	A dataset similar to ImageNet with 600 categories. Available in development, validation and training splits. Some images also include bounding boxes and visual relationships.
+NLP	Text	Cornell Movie Dialogs	Entertainment	Dialogs	Link
Description	A collection of fictional conversations, with metadata of characters and movies. Each row is a dialog between two people, in a question-answer format.
Description	A question-answer dataset with questions and answers from Yahoo Answers portal between Apr 2007 and Oct 2007.
+NLP	Text	MS MARCO	General	Question Answering	Link
Description	A question-answer dataset with annotations from Bing’s web search logs. Each question comes with an answer provided by a user, as well as web passages that contain the answer.
+NLP	Text	Natural Questions Dataset	General	Question Answering	Link
Description	Released by Google, this dataset contains real user queries and answers from Wikipedia articles.
+NLP	Text	DBPedia	General	Knowledge Graph	Link
Description	A structured rendering of Wikipedia, with entities and relations extracted as a Knowledge Graph.
+NLP	Text	YAGO	General	Knowledge Graph	Link
Description	A knowledge graph containing entities and relations from Wikipedia, WordNet, and GeoNames.
+NLP	Text	FreeBase	General	Knowledge Graph	Link
Description	A crowd-sourced knowledge base consisting of entities and relationships, now incorporated into Google’s Knowledge Graph.
+NLP	Text	Ontonotes	General	Semantic Role Labeling	Link
Description	A corpus with syntactic, semantic, and discourse-level annotations used in the CoNLL shared tasks.
Description	An English dataset annotated for named entities such as person, organization, and location.
+CV	Image	COCO	General	Object Detection	Link
Description	Common Objects in Context: a richly annotated dataset for object detection, segmentation, and captioning.
+CV	Image	PASCAL VOC	General	Object Detection	Link
Description	A benchmark dataset for object detection and segmentation challenges.
+CV	Image	Cityscapes	Autonomous Driving	Semantic Segmentation	Link
Description	Dataset for urban scene understanding with pixel-level annotations for 30 classes.
+CV	Image	MNIST	General	Digit Classification	Link
Description	Handwritten digits dataset with 60,000 training and 10,000 test images of 28x28 pixels.
+CV	Image	Fashion-MNIST	Retail	Image Classification	Link
Description	Dataset of Zalando’s article images in the same format as MNIST, used as a drop-in replacement for benchmarking.
+NLP	Audio	LibriSpeech	General	ASR	Link
Description	A corpus of read English speech derived from audiobooks, with 1000 hours of speech and associated texts.
+NLP	Audio	TED-LIUM	General	ASR	Link
Description	Transcribed TED talks with audio and aligned transcriptions for speech recognition research.
+NLP	Audio	TIMIT	General	Phoneme Recognition	Link
Description	Phonetically transcribed speech of American English speakers, widely used for phoneme recognition tasks.
+NLP	Audio	Common Voice	General	ASR	Link
Description	A multilingual corpus of read speech contributed by volunteers around the world.
+NLP	Audio	VoxCeleb	General	Speaker Recognition	Link
Description	A large-scale speaker identification dataset collected from YouTube videos.
+NLP	Text	Wikipedia Dump	General	Language Modeling	Link
Description	Full text dumps of Wikipedia articles, updated regularly, used for pretraining language models.
+NLP	Text	Gigaword	News	Language Modeling	Link
Description	A comprehensive archive of newswire text data from multiple news agencies.
+NLP	Text	IMDB Reviews	Entertainment	Sentiment Analysis	Link
Description	Large movie review dataset for binary sentiment classification.
+CV	Video	Kinetics-700	General	Action Recognition	Link
Description	A large-scale, high-quality dataset of YouTube video clips covering 700 human action classes.
+CV	Video	UCF101	General	Action Recognition	Link
Description	A dataset of realistic action videos, with 101 action categories.
+CV	Video	HMDB51	General	Action Recognition	Link
Description	A large human motion video database with 51 action categories.
Description	A database of face photographs designed for studying unconstrained face recognition.
+CV	Image	CASIA-WebFace	General	Face Recognition	Link
Description	A dataset with roughly 500,000 face images of over 10,000 identities for training deep face recognition models.
+NLP	Text	SQuAD	General	Reading Comprehension	Link
Description	Stanford Question Answering Dataset: questions posed by crowdworkers on a set of Wikipedia articles.
Description	A machine comprehension dataset with questions and answers based on CNN news articles.
+NLP	Text	MultiNLI	General	Natural Language Inference	Link
Description	A dataset for sentence-pair natural language inference across multiple genres.
+NLP	Text	SNLI	General	Natural Language Inference	Link
Description	Stanford Natural Language Inference Corpus with sentence pairs labeled as entailment, contradiction, or neutral.
Description	A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
Description	A dataset of 16,185 images of 196 classes of cars.
+CV	Image	Oxford Flowers 102	Botany	Fine-grained Classification	Link
Description	102 flower categories commonly occurring in the United Kingdom.
+CV	Image	CIFAR-10	General	Image Classification	Link
Description	Images of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
+CV	Image	CIFAR-100	General	Image Classification	Link
Description	A dataset similar to CIFAR-10, but with 100 fine-grained classes.
+CV	Image	VOC Person Layout	General	Pose Estimation	Link
Description	Part of PASCAL VOC focusing on person layout annotations such as head, hands, and feet.
+CV	Image	MPII Human Pose	General	Pose Estimation	Link
Description	Around 25,000 images containing over 40,000 people with annotated body joints.
Description	Collection of Reuters newswire articles for text categorization research.
+NLP	Text	20 Newsgroups	General	Text Classification	Link
Description	A collection of 20,000 newsgroup documents partitioned into 20 different newsgroups.
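
The Specialization and Data Type filters above can also be applied to the listing programmatically. The following is a minimal Python sketch, assuming the catalog is available as tab-separated text in the same layout as the rows above (a row starting with `+`, followed by a `Description` line). The names `DatasetEntry`, `parse_catalog`, and `filter_entries` are hypothetical and used only for illustration; the sample rows are copied from the listing.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """One row of the listing (hypothetical structure, for illustration)."""
    specialization: str  # "NLP" or "CV"
    data_type: str       # "Audio", "Image", "Text" or "Video"
    name: str
    industry: str
    use_case: str        # empty for some rows in the listing
    description: str = ""


def parse_catalog(text):
    """Parse '+'-prefixed rows and attach the Description line that follows each."""
    entries = []
    current = None
    for line in text.splitlines():
        fields = line.split("\t")
        if line.startswith("+") and len(fields) >= 5:
            current = DatasetEntry(
                fields[0].lstrip("+"), fields[1], fields[2], fields[3], fields[4]
            )
            entries.append(current)
        elif fields[0] == "Description" and len(fields) > 1:
            # A Description with no preceding row is skipped, not guessed at.
            if current is not None:
                current.description = fields[1]
            current = None
    return entries


def filter_entries(entries, specialization=None, data_type=None):
    """Keep entries matching the given filters; None means 'any'."""
    return [
        e for e in entries
        if (specialization is None or e.specialization == specialization)
        and (data_type is None or e.data_type == data_type)
    ]


# Sample rows copied from the listing, including one Description without a row.
SAMPLE = "\n".join([
    "+NLP\tText\tWikipedia Links Data\tGeneral\t\tLink",
    "Description\tMore than 4 Mn. articles containing 1.9 Bn. words from Wikipedia.",
    "+CV\tImage\tMNIST\tGeneral\tDigit Classification\tLink",
    "Description\tHandwritten digits dataset with 60,000 training and 10,000 test images of 28x28 pixels.",
    "Description\tA dataset of 16,185 images of 196 classes of cars.",
    "+NLP\tAudio\tLibriSpeech\tGeneral\tASR\tLink",
    "Description\tA corpus of read English speech derived from audiobooks.",
    "+CV\tVideo\tUCF101\tGeneral\tAction Recognition\tLink",
    "Description\tA dataset of realistic action videos, with 101 action categories.",
])

if __name__ == "__main__":
    catalog = parse_catalog(SAMPLE)
    for entry in filter_entries(catalog, specialization="CV"):
        print(entry.name, "|", entry.data_type, "|", entry.use_case)
```

Description lines that are not preceded by a dataset row are skipped in this sketch, so incomplete entries never get attached to the wrong dataset.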
