List of datasets for machine-learning research


These datasets are used in machine learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms, computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce.
Many organizations, including governments, publish and share their datasets, often using common metadata formats. Datasets are classified by license into two groups: open data and non-open data.
Datasets from various governmental bodies are presented in the List of open government data sites. Datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as Open API. Within a portal, datasets are organized into types and subtypes.

List of sorting used for datasets

Data portals are classified by their type of license. Portals built on open-source licenses are known as open data portals; they are used by many government organizations and academic institutions.

List of open data portals

| Portal name | License | List of installations of the portal | Typical usages |
|---|---|---|---|
| Comprehensive Knowledge Archive Network (CKAN) | AGPL | https://ckan.github.io/ckan-instances/ ; https://github.com/sebneu/ckan_instances/blob/master/instances.csv | Data repository for government or non-profit organisations; data management solution for research institutes |
| DKAN | GPL | https://getdkan.org/community | Data repository for government or non-profit organisations; data management solution for research institutes |
| Dataverse | Apache | https://dataverse.org/installations ; https://dataverse.org/metrics | Data management solution for research institutes |
| DSpace | BSD | https://registry.lyrasis.org/ | Data management solution for research institutes |
| OpenML | BSD | https://www.openml.org/search?type=data&sort=runs&status=active | Data management solution to share datasets, algorithms, and experiment results through APIs |
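As an illustration of searching a portal through its API, the sketch below builds a dataset search request for a CKAN installation and summarizes the JSON reply. It assumes the portal exposes CKAN's Action API `package_search` endpoint. The portal address `demo.example.org`, the helper names `build_search_url` and `summarize_results`, and the sample response are hypothetical and written by hand for this example; they are not taken from any real portal.

```python
import json
from urllib.parse import urlencode


def build_search_url(portal_base, query, rows=5):
    """Build a CKAN Action API package_search URL for a portal."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_base.rstrip('/')}/api/3/action/package_search?{params}"


def summarize_results(response_text):
    """Return (title, [resource URLs]) pairs from a package_search reply."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("portal reported an unsuccessful request")
    summary = []
    for dataset in payload["result"]["results"]:
        urls = [res.get("url") for res in dataset.get("resources", [])]
        summary.append((dataset.get("title", dataset.get("name")), urls))
    return summary


# Hypothetical portal address and hand-written reply, for illustration only.
url = build_search_url("https://demo.example.org/", "air quality", rows=2)
sample = json.dumps({
    "success": True,
    "result": {"count": 1, "results": [
        {"name": "air-quality", "title": "Air quality",
         "resources": [{"url": "https://demo.example.org/aq.csv",
                        "format": "CSV"}]},
    ]},
})
print(url)
print(summarize_results(sample))
```

To query a live installation, the URL could be fetched with `urllib.request.urlopen` and the response body passed to `summarize_results`. Other portal software in the table uses different endpoints, so this sketch applies to CKAN-style portals only.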

List of portals suitable for multiple types of applications

Some data portals list a wide variety of dataset subtypes covering many machine learning applications.

| Portal | URL |
|---|---|
| Academic Torrents | https://academictorrents.com |
| Amazon Datasets | https://registry.opendata.aws/ |
| Awesome Public Datasets Collection | https://github.com/awesomedata/awesome-public-datasets |
| data.world | https://data.world/datasets/machine-learning |
| Datahub – Core Datasets | https://datahub.io/docs/core-data |
| DataONE | https://www.dataone.org/ |
| DataPortals | https://dataportals.org/ |
| Datasetlist.com | https://www.datasetlist.com |
| Global Open Data Index (Open Knowledge Foundation) | https://okfn.org/ |
| Google Dataset Search | https://datasetsearch.research.google.com/ |
| Hugging Face | https://huggingface.co/docs/datasets/ |
| IBM's Data Asset Exchange | https://developer.ibm.com/exchanges/data/ |
| Jupyter – Tutorial Data | https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html |
| Kaggle | https://www.kaggle.com/datasets |
| Machine learning datasets | https://macgence.com/data-sets-and-cataloges/ |
| Major Smart Cities with Open Data | https://rlist.io/l/major-smart-cities-with-open-data-portals |
| Microsoft Datasets | https://msropendata.com/datasets |
| Open Data Inception | https://opendatainception.io/ |
| Opendatasoft | https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en |
| OpenDOAR | https://v2.sherpa.ac.uk/opendoar/ |
| OpenML | https://www.openml.org/search?type=data |
| Papers with Code | https://paperswithcode.com/datasets |
| Penn Machine Learning Benchmarks | https://github.com/EpistasisLab/pmlb/tree/master/datasets |
| Public APIs | https://github.com/public-apis/public-apis |
| Registry of Open Access Repositories | http://roar.eprints.org/ |
| REgistry of REsearch Data REpositories | https://www.re3data.org/ |
| UCI Machine Learning Repository | https://archive.ics.uci.edu/ |
| Speech Dataset | https://www.shaip.com/offerings/speech-data-catalog/ |
| Visual Data Discovery | https://visualdata.io/discovery |

List of portals suitable for a specific subtype of applications

Data portals suited to a specific subtype of machine learning application are listed in the sections that follow.

Image data

Text data

These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

News articles

Messages

Twitter and tweets

Dialogues

Legal

Other text

Sound data

These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.

Speech

Music

Other sounds

Signal data

Datasets containing electric signal information requiring some sort of signal processing for further analysis.

Electrical

Motion-tracking

Other signals

Chemical data

Datasets from chemical systems.

Chemical Reactions with transition states (TS)

OpenReACT-CHON-EFH

OpenReACT-CHON-EFH is a 2025 open-access benchmark for machine-learning interatomic potentials.
  • **RTP set** – 35,087 stationary-point geometries drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G level.
  • **IRC set** – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points.
  • **NMS set** – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations.
The collection underpins the study "Does Hessian Data Improve the Performance of Machine Learning Potentials?" and was used to train and benchmark the machine-learning interatomic potentials reported therein.
The dataset itself is distributed under a CC licence via Figshare.
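The three label types mentioned for the RTP set (energies, atomic forces, and Hessian matrices) are related by differentiation: forces are the negative first derivatives of the energy with respect to atomic coordinates, and the Hessian holds the second derivatives. The sketch below illustrates that relationship on a toy one-dimensional, two-atom harmonic bond using finite differences. The potential, its parameters, and the function names are assumptions made for illustration; they do not reflect the dataset's file format or its density-functional level of theory.

```python
def energy(coords, k=1.0, r0=1.0):
    """Toy harmonic bond energy for two atoms on a line (arbitrary units)."""
    r = coords[1] - coords[0]
    return 0.5 * k * (r - r0) ** 2


def forces(coords, h=1e-3):
    """Forces as the negative energy gradient, by central differences."""
    out = []
    for i in range(len(coords)):
        plus = list(coords)
        minus = list(coords)
        plus[i] += h
        minus[i] -= h
        out.append(-(energy(plus) - energy(minus)) / (2 * h))
    return out


def hessian(coords, h=1e-3):
    """Matrix of second energy derivatives, by central differences."""
    n = len(coords)

    def shifted(i, j, di, dj):
        # Energy after displacing coordinate i by di and coordinate j by dj.
        c = list(coords)
        c[i] += di
        c[j] += dj
        return energy(c)

    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            hess[i][j] = (shifted(i, j, h, h) - shifted(i, j, h, -h)
                          - shifted(i, j, -h, h) + shifted(i, j, -h, -h)) / (4 * h * h)
    return hess


# A stretched bond: one energy, one force per coordinate, an n-by-n Hessian.
geometry = [0.0, 1.2]
label = {
    "energy": energy(geometry),
    "forces": forces(geometry),
    "hessian": hessian(geometry),
}
print(label)
```

For a real molecule with N atoms in three dimensions, the same structure would give one energy, 3N force components, and a 3N × 3N Hessian per geometry, which is why Hessian labels are much larger than energy or force labels.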

Physical data

Datasets from physical systems.

High-energy physics

Systems

Astronomy

Earth science

Other physical

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Creator |
|---|---|---|---|---|---|---|---|
| Concrete Compressive Strength Dataset | Dataset of concrete properties and compressive strength. | Nine features are given for each sample. | 1030 | Text | Regression | 2007 | I. Yeh |
| Concrete Slump Test Dataset | Concrete slump flow given in terms of properties. | Features of concrete given such as fly ash, water, etc. | 103 | Text | Regression | 2009 | I. Yeh |
| Musk Dataset | Predict if a molecule, given the features, will be a musk or a non-musk. | 168 features given for each molecule. | 6598 | Text | Classification | 1994 | Arris Pharmaceutical Corp. |
| Steel Plates Faults Dataset | Steel plates of 7 different types. | 27 features given for each sample. | 1941 | Text | Classification | 2010 | Semeion Research Center |
| Noble Metal Monometallic Nanoparticles Datasets | Processing and structural features of monometallic nanoparticles, labels being formation energy. | 85-182 features given for each sample. | 425 to 4000 | CSV | Regression | 2017 to 2023 | A. Barnard and G. Opletal |
| Noble Metal Bimetallic Nanoparticles Datasets | Processing and structural features of bimetallic nanoparticles, labels being formation energy. | 922 features given for each sample. | 138147 to 162770 | CSV | Regression | 2023 | J. Ting et al. |
| AuPdPt Trimetallic Nanoparticles Dataset | Processing and structural features of AuPdPt nanoparticles, labels being formation energy. | 1958 features given for each sample. | 48136 | CSV | Regression | 2023 | K. Lu et al. |

Biological data

Datasets from biological systems.

Human

Animal

Fungi

Plant

Microbe

Drug discovery

Anomaly data

Question answering data

This section includes datasets that deal with structured data.

| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Creator |
|---|---|---|---|---|---|---|---|
| DBpedia Neural Question Answering Dataset | A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base. | This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. | 894,499 | Question-query pairs | Question Answering | 2018 | Hartmann, Soru, and Marx et al. |
| Vietnamese Question Answering Dataset | A large collection of Vietnamese questions for evaluating MRC models. | This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. | 23,074 | Question-answer pairs | Question Answering | 2020 | Nguyen et al. |
| Vietnamese Multiple-Choice Machine Reading Comprehension Corpus | A collection of Vietnamese multiple-choice questions for evaluating MRC models. | This corpus includes 2,783 Vietnamese multiple-choice questions. | 2,783 | Question-answer pairs | Question Answering/Machine Reading Comprehension | 2020 | Nguyen et al. |
| Open-Domain Question Answering Goes Conversational via Question Rewriting | An end-to-end open-domain question answering dataset. | This dataset includes 14,000 conversations with 81,000 question-answer pairs. | | Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source | Question Answering | 2021 | Anantha and Vakulenko et al. |
| UnifiedQA | Question-answer data | Processed dataset | | | Question Answering | 2020 | Khashabi et al. |
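The field list given for the conversational question-rewriting dataset (Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source) suggests one record per conversation turn. The sketch below models such a record and groups turns back into ordered conversations. The field types, the `Turn` class, the `group_by_conversation` helper, and all values are assumptions and placeholders for illustration; they are not drawn from the dataset itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Turn:
    # Field names follow the table above; the types are assumed.
    Context: list
    Question: str
    Rewrite: str
    Answer: str
    Answer_URL: str
    Conversation_no: int
    Turn_no: int
    Conversation_source: str


def group_by_conversation(turns):
    """Map each Conversation_no to its turns, ordered by Turn_no."""
    grouped = defaultdict(list)
    for turn in turns:
        grouped[turn.Conversation_no].append(turn)
    return {cid: sorted(items, key=lambda t: t.Turn_no)
            for cid, items in grouped.items()}


# Placeholder records, deliberately out of order.
turns = [
    Turn(["placeholder q1", "placeholder a1"], "placeholder q2",
         "placeholder rewrite 2", "placeholder a2",
         "https://example.org/b", 1, 2, "placeholder"),
    Turn([], "placeholder q1", "placeholder rewrite 1", "placeholder a1",
         "https://example.org/a", 1, 1, "placeholder"),
    Turn([], "placeholder q3", "placeholder rewrite 3", "placeholder a3",
         "https://example.org/c", 2, 1, "placeholder"),
]
conversations = group_by_conversation(turns)
for number, items in sorted(conversations.items()):
    print(number, [t.Turn_no for t in items])
```

Grouping by conversation and sorting by turn keeps each rewritten question next to the history it depends on, which is one plausible way to prepare such records for training or evaluation.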