List of datasets for machine-learning research

These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms, computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce.
Many organizations, including governments, publish and share their datasets, often using common metadata formats. The datasets are classified, based on the licenses, into two groups: open data and non-open data.
The datasets from various governmental bodies are presented in the List of open government data sites. The datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as Open API. They are organized into various types and subtypes.

List of sorting used for datasets

Data portals are classified by the type of license they use. Portals built on open-source licenses are known as open data portals and are used by many government organizations and academic institutions.

List of open data portals

Portal-name	License	List of installations of the portal	Typical usages
Comprehensive Knowledge Archive Network	AGPL	https://ckan.github.io/ckan-instances/ https://github.com/sebneu/ckan_instances/blob/master/instances.csv	Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
DKAN	GPL	https://getdkan.org/community	Data repository for government or non-profit organisations, Data Management Solution for Research Institutes
Dataverse	Apache	https://dataverse.org/installations https://dataverse.org/metrics	Data Management Solution for Research Institutes
DSpace	BSD	https://registry.lyrasis.org/	Data Management Solution for Research Institutes
OpenML	BSD	https://www.openml.org/search?type=data&sort=runs&status=active	Data Management Solution to share datasets, algorithms, and experiment results through APIs.
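
Portals built on CKAN expose a JSON "Action API" that can be searched programmatically. The following is a minimal sketch, assuming an instance that serves the standard package_search action; the base URL https://data.example.org is a placeholder rather than a real portal, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_package_search_url(base_url, query, rows=5):
    """Build the URL for CKAN's package_search action."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def search_datasets(base_url, query, rows=5):
    """Query a CKAN portal and return the titles of matching datasets.

    This performs a network request, so it only works against a
    reachable CKAN installation.
    """
    url = build_package_search_url(base_url, query, rows)
    with urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    if not payload.get("success"):
        raise RuntimeError("CKAN reported an unsuccessful request")
    return [pkg.get("title", pkg["name"]) for pkg in payload["result"]["results"]]


# Placeholder base URL (assumption); substitute a real CKAN installation.
url = build_package_search_url("https://data.example.org/", "air quality", rows=3)
print(url)
```

Only the URL is built here; calling search_datasets with the base URL of one of the installations listed above would return the matching dataset titles.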

List of portals suitable for multiple types of applications

Some data portals list a wide variety of dataset subtypes that pertain to many machine learning applications.

Academic Torrents	https://academictorrents.com
Amazon Datasets	https://registry.opendata.aws/
Awesome Public Datasets Collection	https://github.com/awesomedata/awesome-public-datasets
data.world	https://data.world/datasets/machine-learning
Datahub – Core Datasets	https://datahub.io/docs/core-data
DataONE	https://www.dataone.org/
DataPortals	https://dataportals.org/
Datasetlist.com	https://www.datasetlist.com
Global Open Data Index – Open Knowledge Foundation	https://okfn.org/
Google Dataset Search	https://datasetsearch.research.google.com/
Hugging Face	https://huggingface.co/docs/datasets/
IBM's Data Asset Exchange	https://developer.ibm.com/exchanges/data/
Jupyter – Tutorial Data	https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html
Kaggle	https://www.kaggle.com/datasets
Machine learning datasets	https://macgence.com/data-sets-and-cataloges/
Major Smart Cities with Open Data	https://rlist.io/l/major-smart-cities-with-open-data-portals
Microsoft Datasets	https://msropendata.com/datasets
Open Data Inception	https://opendatainception.io/
Opendatasoft	https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en
OpenDOAR	https://v2.sherpa.ac.uk/opendoar/
OpenML	https://www.openml.org/search?type=data
Papers with Code	https://paperswithcode.com/datasets
Penn Machine Learning Benchmarks	https://github.com/EpistasisLab/pmlb/tree/master/datasets
Public APIs	https://github.com/public-apis/public-apis
Registry of Open Access Repositories	http://roar.eprints.org/
REgistry of REsearch Data REpositories	https://www.re3data.org/
UCI Machine Learning Repository	https://archive.ics.uci.edu/
Speech Dataset	https://www.shaip.com/offerings/speech-data-catalog/
Visual Data Discovery	https://visualdata.io/discovery

List of portals suitable for a specific subtype of applications

Data portals suited to a specific subtype of machine learning application are listed in the following sections.

Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.

Signal data

These datasets contain electrical signal information that requires signal processing before further analysis.

Chemical data

Datasets from chemical systems.

OpenReACT-CHON-EFH

OpenReACT-CHON-EFH is a 2025 open-access benchmark for machine-learning interatomic potentials.

**RTP set** – 35,087 stationary-point geometries drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G level.
**IRC set** – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points.
**NMS set** – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations.

The collection underpins the study "Does Hessian Data Improve the Performance of Machine Learning Potentials?" and was used to train and benchmark the machine-learning interatomic potentials reported therein.
The dataset itself is distributed under a CC licence via Figshare.
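
The three kinds of label are related: forces are the negative first derivatives of the energy with respect to the atomic coordinates, and the Hessian is the matrix of second derivatives. The sketch below illustrates that relationship with finite differences on a toy two-coordinate energy function. The function and all names are assumptions made for illustration; they are not part of the dataset or of the method used to compute its labels.

```python
def energy(coords):
    """Toy quadratic energy surface; a stand-in for a quantum-chemistry calculation."""
    x, y = coords
    return x * x + x * y + 2.0 * y * y


def shifted(coords, index, delta):
    """Return a copy of coords with one coordinate displaced by delta."""
    moved = list(coords)
    moved[index] += delta
    return moved


def gradient(f, coords, h=1e-3):
    """Central-difference gradient of f at coords."""
    return [
        (f(shifted(coords, i, h)) - f(shifted(coords, i, -h))) / (2.0 * h)
        for i in range(len(coords))
    ]


def hessian(f, coords, h=1e-3):
    """Central-difference Hessian, obtained by differentiating the gradient."""
    n = len(coords)
    rows = []
    for i in range(n):
        g_plus = gradient(f, shifted(coords, i, h), h)
        g_minus = gradient(f, shifted(coords, i, -h), h)
        rows.append([(g_plus[j] - g_minus[j]) / (2.0 * h) for j in range(n)])
    return rows


geometry = [1.0, -0.5]
forces = [-g for g in gradient(energy, geometry)]  # force = -dE/dx
hess = hessian(energy, geometry)

print([round(v, 6) for v in forces])
print([[round(v, 6) for v in row] for row in hess])
```

For this toy function the analytic forces at (1.0, -0.5) are (-1.5, 1.0) and the Hessian is [[2, 1], [1, 4]], so the numerical values can be checked by hand. For a structure with N atoms, the forces have 3N components and the Hessian is a symmetric 3N by 3N matrix.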

Physical data

Datasets from physical systems.

Other physical

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created	Reference	Creator
Concrete Compressive Strength Dataset	Dataset of concrete properties and compressive strength.	Nine features are given for each sample.	1030	Text	Regression	2007		I. Yeh
Concrete Slump Test Dataset	Concrete slump flow given in terms of properties.	Features of concrete given such as fly ash, water, etc.	103	Text	Regression	2009		I. Yeh
Musk Dataset	Predict if a molecule, given the features, will be a musk or a non-musk.	168 features given for each molecule.	6598	Text	Classification	1994		Arris Pharmaceutical Corp.
Steel Plates Faults Dataset	Steel plates of 7 different types.	27 features given for each sample.	1941	Text	Classification	2010		Semeion Research Center
Noble Metal Monometallic Nanoparticles Datasets	Processing and structural features of monometallic nanoparticles, labels being formation energy.	85-182 features given for each sample.	425 to 4000	CSV	Regression	2017 to 2023		A. Barnard and G. Opletal
Noble Metal Bimetallic Nanoparticles Datasets	Processing and structural features of bimetallic nanoparticles, labels being formation energy.	922 features given for each sample.	138147 to 162770	CSV	Regression	2023		J. Ting et al.
AuPdPt Trimetallic Nanoparticles Dataset	Processing and structural features of AuPdPt nanoparticles, labels being formation energy.	1958 features given for each sample.	48136	CSV	Regression	2023		K. Lu et al.

Biological data

Datasets from biological systems.

Question answering data

This section includes datasets that deal with structured data.

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created	Reference	Creator
DBpedia Neural Question Answering Dataset	A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base.	This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts.	894,499	Question-query pairs	Question Answering	2018		Hartmann, Soru, and Marx et al.
Vietnamese Question Answering Dataset	A large collection of Vietnamese questions for evaluating MRC models.	This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.	23,074	Question-answer pairs	Question Answering	2020		Nguyen et al.
Vietnamese Multiple-Choice Machine Reading Comprehension Corpus	A collection of Vietnamese multiple-choice questions for evaluating MRC models.	This corpus includes 2,783 Vietnamese multiple-choice questions.	2,783	Question-answer pairs	Question Answering/Machine Reading Comprehension	2020		Nguyen et al.
Open-Domain Question Answering Goes Conversational via Question Rewriting	An end-to-end open-domain question answering dataset.	This dataset includes 14,000 conversations with 81,000 question-answer pairs.		Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source	Question Answering	2021		Anantha and Vakulenko et al.
UnifiedQA	Question-answer data	Processed dataset			Question Answering	2020		Khashabi et al.
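
Conversational question answering records that carry a conversation number and a turn number, such as the field list given above for the question-rewriting dataset, can be regrouped into ordered dialogs. The sketch below assumes each record is a dictionary keyed by those field names; the two example records are invented for illustration and are not taken from the dataset, and the actual file layout may differ.

```python
from collections import defaultdict

# Invented example records (assumption), listed out of order on purpose.
records = [
    {"Conversation_no": 1, "Turn_no": 2,
     "Question": "When was it founded?",
     "Rewrite": "When was the example university founded?",
     "Answer": "In 1900."},
    {"Conversation_no": 1, "Turn_no": 1,
     "Question": "Where is the example university?",
     "Rewrite": "Where is the example university?",
     "Answer": "In Exampleville."},
]


def group_conversations(rows):
    """Group records by conversation and sort each group by turn number."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Conversation_no"]].append(row)
    return {
        conv_id: sorted(turns, key=lambda r: r["Turn_no"])
        for conv_id, turns in grouped.items()
    }


conversations = group_conversations(records)
for turn in conversations[1]:
    # The rewrite is the self-contained form of the question.
    print(turn["Turn_no"], turn["Question"], "->", turn["Rewrite"])
```

Keeping both the original question and its rewrite side by side makes it easy to see which turns depend on earlier context.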

Dialog or instruction prompted data

This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created	Reference	Creator
Taskmaster	3 datasets with >55,000 spoken and written task-oriented dialogs in several domains.		13,215 + 17,289 + 23,757 dialogs, in 6 + 7 + 1 task domains.	1 and 2: conversation id, utterances, Instruction id 3: conversation id, utterances, vertical, scenario, instructions.	Do the task.	2019		Byrne and Krishnamoorthi et al.
DrRepair	Labeled dataset for program repair.			Format details are documented by the project.	Do the task.	2020		Yasunaga and Liang
Super-NaturalInstructions	Tasks specified in natural language.		1,616 NLP tasks in 76 task types.	Task definition in natural language instructions; example input/output.	Do the task.	2022		Wang et al.
LAMBADA	Narrative passages where the last word is omitted.				Guess the last word.	2016		Paperno et al.
FLAN	Instruction tuning data, with a mix of zero-shot, few-shot and chain-of-thought templates.				Instruction tuning; do the task.	2021		Wei et al.
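
A last-word prediction task of the kind listed for LAMBADA can be framed as splitting each passage into a context and a one-word target, then scoring predictions by exact match. The sketch below is an illustration under that assumption only: it is not the dataset's official preprocessing, the function names are hypothetical, and the example passage is invented.

```python
import string


def split_last_word(passage):
    """Split a passage into (context, target), where target is the final word."""
    words = passage.split()
    if len(words) < 2:
        raise ValueError("passage needs at least two words")
    target = words[-1].strip(string.punctuation)
    return " ".join(words[:-1]), target


def exact_match_accuracy(predictions, targets):
    """Fraction of predictions equal to the target word, ignoring case."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    hits = sum(p.lower() == t.lower() for p, t in zip(predictions, targets))
    return hits / len(targets)


# Invented example passage (assumption).
passage = "She opened the door and smiled when she saw her old friend Anna."
context, target = split_last_word(passage)
print(context)
print(target)
```

A model would be given the context and asked to produce the target; exact_match_accuracy then compares its guesses with the held-out words.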

Cybersecurity

Dataset name	Brief description	Preprocessing	Instances	Format	Default task	Created	Reference	Creator
MITRE ATT&CK	ATT&CK is a globally accessible knowledge base of adversary tactics and techniques.			Data can be downloaded from two GitHub repositories.				MITRE ATT&CK
CAPEC	Common Attack Pattern Enumeration and Classification.			Data is available for download.				CAPEC
CVE	CVE is a list of publicly disclosed cybersecurity vulnerabilities that is free to search, use, and incorporate into products and services.			Data is available for download.				CVE
CWE	Common Weakness Enumeration data.			Data is available for download.				CWE
MalwareTextDB	Annotated database of malware texts.			Data is available for download.				Kiat et al.
USENIX Security Symposium proceedings	Collection of security proceedings from USENIX Security Symposium – technical sessions from 1995 to 2022.	This data is not pre-processed.						USENIX Security Symposium
APTNotes	Collection of public documents, whitepapers and articles about APT campaigns. All the documents are publicly available data.	This data is not pre-processed.		The project provides a file with links to the data, which is stored in Box. The data files can also be downloaded.				APT Notes
arXiv Cryptography and Security papers	Collection of articles about cybersecurity.	This data is not pre-processed.		All articles are available online.				arXiv
Security eBooks for free	Small collection of security eBooks, and security presentations publicly available.	This data is not pre-processed.
National Cyber Security strategy repository	Repository of worldwide strategy documents about cybersecurity.	This data is not pre-processed.
Cyber Security Natural Language Processing	Data about cybersecurity strategies from more than 75 countries.	Tokenization, meaningless-frequent words removal.						Yanlin Chen, Yunjian Wei, Yifan Yu, Wen Xue, Xianya Qin
APT Reports collection	Sample of APT reports, malware, technology, and intelligence collection	Raw and tokenize data available.		All data is available in this repository.				blackorbird
Offensive Language Identification Dataset				Data is available for download.				Zampieri et al.
Cyber reports from the National Cyber Security Centre		This data is not pre-processed.
APT reports by Kaspersky		This data is not pre-processed.
The cyberwire		This data is not pre-processed.
Databreaches news		This data is not pre-processed.
Cybernews		This data is not pre-processed.
Bleepingcomputer		This data is not pre-processed.
Therecord		This data is not pre-processed.
Hackread		This data is not pre-processed.
Securelist		This data is not pre-processed.
Stucco project	The Stucco project collects data not typically integrated into security systems.	This data is not pre-processed
Farsightsecurity	Website with technical information, reports, and more about security topics.	This data is not pre-processed
Schneier	Website with academic papers about security topics.	This data is not pre-processed
Trendmicro	Website with research, news, and perspectives about security topics.	This data is not pre-processed
The Hacker News	News about cybersecurity topics.	This data is not pre-processed
Krebsonsecurity	Security news and investigation	This data is not pre-processed
Mitre Defend	Matrix of Defend artifacts			json files
Mitre Atlas	Mitre Atlas is a knowledge base of adversary tactics, techniques, and case studies for machine learning systems based on real-world observations.	This data is not pre-processed
Mitre Engage	MITRE Engage is a framework for planning and discussing adversary engagement operations that empowers you to engage your adversaries and achieve your cybersecurity goals.	This data is not pre-processed
Hacking Tutorials		This data is not pre-processed
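
Most entries in this table are raw text, while the Cyber Security Natural Language Processing entry lists tokenization and the removal of frequent, uninformative words as its preprocessing. The sketch below shows one simple way such a step could look. The regular expression, the tiny stop-word list, and the example sentence are all assumptions made for illustration; this is not the pipeline used by the dataset's creators.

```python
import re

# Small illustrative stop-word list (assumption); real pipelines use larger lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text and extract word tokens, keeping internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in stop_words]


# Invented example sentence (assumption).
sentence = "The strategy sets out a national framework for cyber-security."
tokens = remove_stop_words(tokenize(sentence))
print(tokens)
```

The same two functions can be applied unchanged to any of the unprocessed text sources listed in the table before they are used for training.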

Curated repositories of datasets

As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research.

OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
Metatext NLP: A community-maintained web repository (https://metatext.io/datasets) containing nearly 1,000 benchmark datasets. It covers many tasks, from classification to question answering, and various languages, including English, Portuguese, and Arabic.
Appen: Off-the-shelf and open-source datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.