List of datasets for machine-learning research


These datasets are used in machine learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms, computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce.
Many organizations, including governments, publish and share their datasets, often using common metadata formats. The datasets are classified, based on the licenses, into two groups: open data and non-open data.
Datasets from various governmental bodies are presented in the List of open government data sites. The datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as an open API. They are organized into various types and subtypes.

List of sorting used for datasets

Data portals are classified by the type of license under which they are released. Portals based on open-source licenses are known as open data portals and are used by many government organizations and academic institutions.

List of open data portals

| Portal name | License | List of installations of the portal | Typical usages |
|---|---|---|---|
| Comprehensive Knowledge Archive Network (CKAN) | AGPL | https://ckan.github.io/ckan-instances/ ; https://github.com/sebneu/ckan_instances/blob/master/instances.csv | Data repository for government or non-profit organisations; data management solution for research institutes |
| DKAN | GPL | https://getdkan.org/community | Data repository for government or non-profit organisations; data management solution for research institutes |
| Dataverse | Apache | https://dataverse.org/installations ; https://dataverse.org/metrics | Data management solution for research institutes |
| DSpace | BSD | https://registry.lyrasis.org/ | Data management solution for research institutes |
| OpenML | BSD | https://www.openml.org/search?type=data&sort=runs&status=active | Data management solution to share datasets, algorithms, and experiment results through APIs |
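As an illustration of searching a portal through an API, the sketch below targets the CKAN Action API, whose `package_search` endpoint accepts a free-text query (`q`) and a result limit (`rows`). The portal address and the sample response are made up for the example; real portals may sit behind a different base path or return additional fields.

```python
import json
from urllib.parse import urlencode


def build_search_url(portal_base: str, query: str, rows: int = 5) -> str:
    """Build a CKAN Action API package_search URL for a given portal."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_base.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> list:
    """Extract dataset titles (falling back to names) from a package_search response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("portal reported an unsuccessful request")
    results = payload["result"]["results"]
    return [item.get("title", item.get("name", "")) for item in results]


# Hypothetical portal address, used only to show the URL shape.
url = build_search_url("https://demo.example.org/", "air quality", rows=2)

# Hand-written response in the shape CKAN returns; not real data.
sample_response = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "air-quality-2020", "title": "Air Quality 2020"},
            {"name": "air-sensors"},
        ],
    },
})
titles = dataset_titles(sample_response)
```

To query a live portal, the URL can be fetched with any HTTP client and the response body passed to `dataset_titles`.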

List of portals suitable for multiple types of applications

Some data portals list a wide variety of dataset subtypes pertaining to many machine learning applications.
  • Academic Torrents: https://academictorrents.com
  • Amazon Datasets: https://registry.opendata.aws/
  • Awesome Public Datasets Collection: https://github.com/awesomedata/awesome-public-datasets
  • data.world: https://data.world/datasets/machine-learning
  • Datahub – Core Datasets: https://datahub.io/docs/core-data
  • DataONE: https://www.dataone.org/
  • DataPortals: https://dataportals.org/
  • Datasetlist.com: https://www.datasetlist.com
  • Global Open Data Index (Open Knowledge Foundation): https://okfn.org/
  • Google Dataset Search: https://datasetsearch.research.google.com/
  • Hugging Face: https://huggingface.co/docs/datasets/
  • IBM's Data Asset Exchange: https://developer.ibm.com/exchanges/data/
  • Jupyter – Tutorial Data: https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html
  • Kaggle: https://www.kaggle.com/datasets
  • Machine learning datasets: https://macgence.com/data-sets-and-cataloges/
  • Major Smart Cities with Open Data: https://rlist.io/l/major-smart-cities-with-open-data-portals
  • Microsoft Datasets: https://msropendata.com/datasets
  • Open Data Inception: https://opendatainception.io/
  • Opendatasoft: https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en
  • OpenDOAR: https://v2.sherpa.ac.uk/opendoar/
  • OpenML: https://www.openml.org/search?type=data
  • Papers with Code: https://paperswithcode.com/datasets
  • Penn Machine Learning Benchmarks: https://github.com/EpistasisLab/pmlb/tree/master/datasets
  • Public APIs: https://github.com/public-apis/public-apis
  • Registry of Open Access Repositories: http://roar.eprints.org/
  • REgistry of REsearch Data REpositories: https://www.re3data.org/
  • UCI Machine Learning Repository: https://archive.ics.uci.edu/
  • Speech Dataset: https://www.shaip.com/offerings/speech-data-catalog/
  • Visual Data Discovery: https://visualdata.io/discovery

List of portals suitable for a specific subtype of applications

The data portals which are suitable for a specific subtype of machine learning application are listed in the subsequent sections.

Image data

Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

News articles

Messages

Twitter and tweets

Dialogues

Legal

Other text

Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.

Speech

Music

Other sounds

Signal data

Datasets containing electric signal information requiring some sort of signal processing for further analysis.

Electrical

Motion-tracking

Other signals

Chemical data

Datasets from chemical systems.

Chemical Reactions with transition states (TS)

OpenReACT-CHON-EFH

OpenReACT-CHON-EFH is a 2025 open-access benchmark for machine-learning interatomic potentials.
  • **RTP set** – 35,087 stationary-point geometries drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G level.
  • **IRC set** – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points.
  • **NMS set** – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations.
The collection underpins the study "Does Hessian Data Improve the Performance of Machine Learning Potentials?" and was used to train and benchmark the machine-learning interatomic potentials reported therein.
The dataset itself is distributed under a CC licence via Figshare.
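To make the three label types concrete, the following sketch defines a container for one labeled geometry and checks the array shapes implied by the description: for N atoms, the forces form an N × 3 array and the full Hessian a 3N × 3N matrix. The class and field names are hypothetical and do not reflect the dataset's actual file layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledGeometry:
    """One geometry with energy, force, and Hessian labels (hypothetical layout)."""
    symbols: List[str]             # element symbols, one per atom
    positions: List[List[float]]   # N x 3 Cartesian coordinates
    energy: float                  # a single scalar per geometry
    forces: List[List[float]]      # N x 3, one force vector per atom
    hessian: List[List[float]]     # 3N x 3N second derivatives of the energy

    def __post_init__(self) -> None:
        n = len(self.symbols)
        if len(self.positions) != n or any(len(p) != 3 for p in self.positions):
            raise ValueError("positions must have shape N x 3")
        if len(self.forces) != n or any(len(f) != 3 for f in self.forces):
            raise ValueError("forces must have shape N x 3")
        if len(self.hessian) != 3 * n or any(len(r) != 3 * n for r in self.hessian):
            raise ValueError("hessian must have shape 3N x 3N")


def hessian_size(n_atoms: int) -> int:
    """Number of entries in a full Hessian for n_atoms atoms."""
    return (3 * n_atoms) ** 2


# Toy two-atom record with placeholder zeros; not taken from the dataset.
toy = LabeledGeometry(
    symbols=["H", "H"],
    positions=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
    energy=0.0,
    forces=[[0.0, 0.0, 0.0] for _ in range(2)],
    hessian=[[0.0] * 6 for _ in range(6)],
)
```

Because the Hessian has (3N)² entries against 3N force components, Hessian labels dominate the storage cost as molecules grow.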

Physical data

Datasets from physical systems.

High-energy physics

Systems

Astronomy

Earth science

Other physical

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Creator |
|---|---|---|---|---|---|---|---|
| Concrete Compressive Strength Dataset | Dataset of concrete properties and compressive strength. | Nine features are given for each sample. | 1030 | Text | Regression | 2007 | I. Yeh |
| Concrete Slump Test Dataset | Concrete slump flow given in terms of properties. | Features of concrete given such as fly ash, water, etc. | 103 | Text | Regression | 2009 | I. Yeh |
| Musk Dataset | Predict if a molecule, given the features, will be a musk or a non-musk. | 168 features given for each molecule. | 6598 | Text | Classification | 1994 | Arris Pharmaceutical Corp. |
| Steel Plates Faults Dataset | Steel plates of 7 different types. | 27 features given for each sample. | 1941 | Text | Classification | 2010 | Semeion Research Center |
| Noble Metal Monometallic Nanoparticles Datasets | Processing and structural features of monometallic nanoparticles, labels being formation energy. | 85-182 features given for each sample. | 425 to 4000 | CSV | Regression | 2017 to 2023 | A. Barnard and G. Opletal |
| Noble Metal Bimetallic Nanoparticles Datasets | Processing and structural features of bimetallic nanoparticles, labels being formation energy. | 922 features given for each sample. | 138147 to 162770 | CSV | Regression | 2023 | J. Ting et al. |
| AuPdPt Trimetallic Nanoparticles Dataset | Processing and structural features of AuPdPt nanoparticles, labels being formation energy. | 1958 features given for each sample. | 48136 | CSV | Regression | 2023 | K. Lu et al. |

Biological data

Datasets from biological systems.

Human

Animal

Fungi

Plant

Microbe

Drug discovery

Anomaly data

Question answering data

This section includes datasets that deal with structured data.
| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Creator |
|---|---|---|---|---|---|---|---|
| DBpedia Neural Question Answering Dataset | A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base. | This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. | 894,499 | Question-query pairs | Question Answering | 2018 | Hartmann, Soru, and Marx et al. |
| Vietnamese Question Answering Dataset | A large collection of Vietnamese questions for evaluating MRC models. | This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. | 23,074 | Question-answer pairs | Question Answering | 2020 | Nguyen et al. |
| Vietnamese Multiple-Choice Machine Reading Comprehension Corpus | A collection of Vietnamese multiple-choice questions for evaluating MRC models. | This corpus includes 2,783 Vietnamese multiple-choice questions. | 2,783 | Question-answer pairs | Question Answering/Machine Reading Comprehension | 2020 | Nguyen et al. |
| Open-Domain Question Answering Goes Conversational via Question Rewriting | An end-to-end open-domain question answering dataset. | This dataset includes 14,000 conversations with 81,000 question-answer pairs. | | Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source | Question Answering | 2021 | Anantha and Vakulenko et al. |
| UnifiedQA | Question-answer data | Processed dataset | | | Question Answering | 2020 | Khashabi et al. |
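The field list given for the conversational question-rewriting dataset (Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source) suggests flat records that can be regrouped into conversations. The sketch below shows one way to do that; the field names come from the table, while the value types and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_conversation(rows):
    """Group flat question-answer rows into conversations ordered by turn."""
    conversations = defaultdict(list)
    for row in rows:
        conversations[row["Conversation_no"]].append(row)
    return {
        conv_id: sorted(turns, key=lambda r: r["Turn_no"])
        for conv_id, turns in conversations.items()
    }


# Made-up rows that only mimic the field layout; not taken from the dataset.
rows = [
    {"Conversation_no": 1, "Turn_no": 2, "Context": ["What is CKAN?", "A data portal."],
     "Question": "Who uses it?", "Rewrite": "Who uses CKAN?",
     "Answer": "Governments.", "Answer_URL": "", "Conversation_source": "example"},
    {"Conversation_no": 1, "Turn_no": 1, "Context": [],
     "Question": "What is CKAN?", "Rewrite": "What is CKAN?",
     "Answer": "A data portal.", "Answer_URL": "", "Conversation_source": "example"},
    {"Conversation_no": 2, "Turn_no": 1, "Context": [],
     "Question": "What is OpenML?", "Rewrite": "What is OpenML?",
     "Answer": "A web platform.", "Answer_URL": "", "Conversation_source": "example"},
]

grouped = group_by_conversation(rows)
first_conversation_questions = [turn["Question"] for turn in grouped[1]]
```

The Rewrite field holds a self-contained version of each question, so a retrieval system can be evaluated on rewrites without access to the preceding turns.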

Dialog or instruction prompted data

This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.
| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Creator |
|---|---|---|---|---|---|---|---|
| Taskmaster | 3 datasets with more than 54,000 spoken and written task-oriented dialogs in several domains. | | 13,215 + 17,289 + 23,757 dialogs, in 6 + 7 + 1 task domains. | Datasets 1 and 2: conversation id, utterances, instruction id. Dataset 3: conversation id, utterances, vertical, scenario, instructions. | Do the task. | 2019 | Byrne and Krishnamoorthi et al. |
| DrRepair | Labeled dataset for program repair. | | | | Do the task. | 2020 | Michihiro et al. |
| Super-NaturalInstructions | Tasks specified in natural language. | | 1,616 NLP tasks in 76 task types. | Task definition in natural language instructions; example input/output. | Do the task. | 2022 | Wang et al. |
| LAMBADA | Narrative passages where the last word is omitted. | | | | Guess the last word. | 2016 | Paperno et al. |
| FLAN | Instruction tuning data, with a mix of zero-shot, few-shot and chain-of-thought templates. | | | | Instruction tuning; do the task. | 2021 | Wei et al. |
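The LAMBADA task described above (guess the omitted last word of a passage) can be sketched as a split into context and target followed by an exact-match accuracy. The passages and the baseline predictor below are invented for illustration, and whitespace splitting is a simplifying assumption rather than the benchmark's own tokenization.

```python
def split_last_word(passage: str):
    """Split a passage into (context, target), where target is the final word."""
    context, _, target = passage.rstrip().rpartition(" ")
    return context, target


def last_word_accuracy(passages, predict) -> float:
    """Fraction of passages whose final word is predicted exactly."""
    correct = 0
    for passage in passages:
        context, target = split_last_word(passage)
        if predict(context) == target:
            correct += 1
    return correct / len(passages)


def last_capitalized_word(context: str) -> str:
    """Toy baseline: guess the most recent capitalized word in the context."""
    candidates = [word for word in context.split() if word[:1].isupper()]
    return candidates[-1] if candidates else ""


# Invented passages; punctuation is omitted to keep the split trivial.
passages = [
    "Maria looked for her dog Rex all day until she finally found Rex",
    "Tom said the keys were on the table",
]
accuracy = last_word_accuracy(passages, last_capitalized_word)
```

A language model would take the place of `last_capitalized_word`, receiving the context and returning its most likely next word.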

Cybersecurity

| Dataset name | Brief description | Preprocessing | Format | Creator |
|---|---|---|---|---|
| MITRE ATT&CK | ATT&CK is a globally accessible knowledge base of adversary tactics and techniques. | | Available from two GitHub repositories. | MITRE ATT&CK |
| CAPEC | Common Attack Pattern Enumeration and Classification. | | | CAPEC |
| CVE | CVE is a list of publicly disclosed cybersecurity vulnerabilities that is free to search, use, and incorporate into products and services. | | | CVE |
| CWE | Common Weakness Enumeration data. | | | CWE |
| MalwareTextDB | Annotated database of malware texts. | | | Kiat et al. |
| USENIX Security Symposium proceedings | Collection of security proceedings from the USENIX Security Symposium: technical sessions from 1995 to 2022. | This data is not pre-processed. | | USENIX Security Symposium |
| APTNotes | Collection of public documents, whitepapers and articles about APT campaigns. All the documents are publicly available data. | This data is not pre-processed. | The project contains a file with links to the data, which is stored in Box. | APT Notes |
| arXiv Cryptography and Security papers | Collection of articles about cybersecurity. | This data is not pre-processed. | | arXiv |
| Security eBooks for free | Small collection of publicly available security eBooks and security presentations. | This data is not pre-processed. | | |
| National Cyber Security strategy repository | Repository of worldwide strategy documents about cybersecurity. | This data is not pre-processed. | | |
| Cyber Security Natural Language Processing | Data about cybersecurity strategies from more than 75 countries. | Tokenization, meaningless-frequent words removal. | | Yanlin Chen, Yunjian Wei, Yifan Yu, Wen Xue, Xianya Qin |
| APT Reports collection | Sample of APT reports, malware, technology, and intelligence collection. | Raw and tokenized data available. | All data is available in the project repository. | blackorbird |
| Offensive Language Identification Dataset | | | | Zampieri et al. |
| Cyber reports from the National Cyber Security Centre | | This data is not pre-processed. | | |
| APT reports by Kaspersky | | This data is not pre-processed. | | |
| The cyberwire | | This data is not pre-processed. | | |
| Databreaches news | | This data is not pre-processed. | | |
| Cybernews | | This data is not pre-processed. | | |
| Bleepingcomputer | | This data is not pre-processed. | | |
| Therecord | | This data is not pre-processed. | | |
| Hackread | | This data is not pre-processed. | | |
| Securelist | | This data is not pre-processed. | | |
| Stucco project | The Stucco project collects data not typically integrated into security systems. | This data is not pre-processed. | | |
| Farsightsecurity | Website with technical information, reports, and more about security topics. | This data is not pre-processed. | | |
| Schneier | Website with academic papers about security topics. | This data is not pre-processed. | | |
| Trendmicro | Website with research, news, and perspectives about security topics. | This data is not pre-processed. | | |
| The Hacker News | News about cybersecurity topics. | This data is not pre-processed. | | |
| Krebsonsecurity | Security news and investigation. | This data is not pre-processed. | | |
| MITRE Defend | Matrix of Defend artifacts. | | JSON files | |
| MITRE Atlas | MITRE Atlas is a knowledge base of adversary tactics, techniques, and case studies for machine learning systems based on real-world observations. | This data is not pre-processed. | | |
| MITRE Engage | MITRE Engage is a framework for planning and discussing adversary engagement operations that empowers you to engage your adversaries and achieve your cybersecurity goals. | This data is not pre-processed. | | |
| Hacking Tutorials | | This data is not pre-processed. | | |
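Most of the sources above are listed as not pre-processed, while the Cyber Security Natural Language Processing entry mentions tokenization and removal of frequent, meaningless words. A minimal sketch of those two steps follows; the regular expression, the tiny stop-word list, and the sample sentence are illustrative assumptions, not the procedure used by that dataset's authors.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "for"})


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric (optionally hyphenated) tokens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def preprocess(text: str, stop_words=STOP_WORDS) -> list:
    """Tokenize the text and drop tokens that appear in the stop-word list."""
    return [token for token in tokenize(text) if token not in stop_words]


# Invented sentence in the style of a strategy document.
document = "The strategy of the agency is to improve the resilience of critical infrastructure."
tokens = preprocess(document)
```

Keeping hyphenated terms as single tokens is a design choice that preserves compounds such as "zero-day", which matter in security text.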

Climate and sustainability

Code data

Multivariate data

Financial

Weather

Census

Transit

Internet

Games

Other multivariate

Curated repositories of datasets

As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research.
  • OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
  • PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
  • Metatext NLP: A community-maintained web repository (https://metatext.io/datasets) containing nearly 1,000 benchmark datasets. Covers many tasks, from classification to question answering, and many languages, including English, Portuguese, and Arabic.
  • Appen: Off-the-shelf and open-source datasets hosted and maintained by the company. The collection comprises over 250 biological, image, physical, question answering, signal, sound, text, and video resources, applicable to over 25 different use cases.
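As a rough illustration of the Python access mentioned for OpenML and PMLB, the sketch below wraps two third-party loaders behind one hypothetical helper, `load_benchmark`. It assumes the `pmlb` package (`fetch_data`) and scikit-learn (`fetch_openml`) are installed; both download data on first use, so the calls are shown only in comments.

```python
def load_benchmark(name: str, source: str = "pmlb"):
    """Return (X, y) for a named dataset from PMLB or OpenML (hypothetical helper)."""
    if source == "pmlb":
        # Third-party package: pip install pmlb
        from pmlb import fetch_data
        return fetch_data(name, return_X_y=True)
    if source == "openml":
        # Third-party package: pip install scikit-learn
        from sklearn.datasets import fetch_openml
        return fetch_openml(name=name, as_frame=True, return_X_y=True)
    raise ValueError(f"unknown source: {source!r}")


# Example calls (require network access and the packages above):
# X, y = load_benchmark("mushroom")               # from PMLB
# X, y = load_benchmark("iris", source="openml")  # from OpenML via scikit-learn
```

Importing each library inside its branch keeps the helper usable when only one of the two packages is installed.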