List of datasets for machine-learning research
These datasets are used in machine learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms, computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce.
Many organizations, including governments, publish and share their datasets, often using common metadata formats. Datasets are classified by license into two groups: open data and non-open data.
The datasets from various governmental bodies are presented in List of open government data sites. These datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as open APIs. They are made available in various types and subtypes.
List of sorting used for datasets
Data portals are classified by their type of license. Portals built on open-source licenses are known as open data portals and are used by many government organizations and academic institutions.
List of open data portals
| Portal-name | License | List of installations of the portal | Typical usages |
| Comprehensive Knowledge Archive Network | AGPL | https://ckan.github.io/ckan-instances/ https://github.com/sebneu/ckan_instances/blob/master/instances.csv | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes |
| DKAN | GPL | https://getdkan.org/community | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes |
| Dataverse | Apache | https://dataverse.org/installations https://dataverse.org/metrics | Data Management Solution for Research Institutes |
| DSpace | BSD | https://registry.lyrasis.org/ | Data Management Solution for Research Institutes |
| OpenML | BSD | https://www.openml.org/search?type=data&sort=runs&status=active | Data Management Solution to share datasets, algorithms, and experiment results through APIs. |
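As a rough illustration of API-based access, the sketch below queries a CKAN-based portal through CKAN's Action API `package_search` endpoint. The helper names and the portal URL in the usage note are placeholders chosen for this example, and a given installation may restrict or version its API differently, so treat this as a sketch under those assumptions rather than a reference client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(portal_url: str, query: str, rows: int = 5) -> str:
    """Build a CKAN Action API `package_search` URL for the given portal."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_url.rstrip('/')}/api/3/action/package_search?{params}"


def summarize_results(payload: dict) -> list[tuple[str, list[str]]]:
    """Return (dataset title, resource formats) pairs from a search response."""
    if not payload.get("success"):
        raise ValueError("portal reported an unsuccessful request")
    summary = []
    for dataset in payload["result"]["results"]:
        # Each dataset lists its downloadable files as "resources".
        formats = [res.get("format", "") for res in dataset.get("resources", [])]
        summary.append((dataset.get("title", dataset["name"]), formats))
    return summary


def search_portal(portal_url: str, query: str, rows: int = 5) -> list[tuple[str, list[str]]]:
    """Query a live portal; requires network access (helper name is hypothetical)."""
    with urlopen(build_search_url(portal_url, query, rows), timeout=30) as response:
        return summarize_results(json.load(response))
```

Calling `search_portal("https://example.org", "air quality")` against a real CKAN installation (the URL here is a placeholder) would return dataset titles together with the file formats each dataset offers.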
List of portals suitable for multiple types of applications
Some data portals list a wide variety of dataset subtypes pertaining to many machine learning applications.
| Academic Torrents | https://academictorrents.com |
| Amazon Datasets | https://registry.opendata.aws/ |
| Awesome Public Datasets Collection | https://github.com/awesomedata/awesome-public-datasets |
| data.world | https://data.world/datasets/machine-learning |
| Datahub – Core Datasets | https://datahub.io/docs/core-data |
| DataONE | https://www.dataone.org/ |
| DataPortals | https://dataportals.org/ |
| Datasetlist.com | https://www.datasetlist.com |
| Global Open Data Index – Open Knowledge Foundation | https://okfn.org/ |
| Google Dataset Search | https://datasetsearch.research.google.com/ |
| Hugging Face | https://huggingface.co/docs/datasets/ |
| IBM's Data Asset Exchange | https://developer.ibm.com/exchanges/data/ |
| Jupyter – Tutorial Data | https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html |
| Kaggle | https://www.kaggle.com/datasets |
| Machine learning datasets | https://macgence.com/data-sets-and-cataloges/ |
| Major Smart Cities with Open Data | https://rlist.io/l/major-smart-cities-with-open-data-portals |
| Microsoft Datasets | https://msropendata.com/datasets |
| Open Data Inception | https://opendatainception.io/ |
| Opendatasoft | https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en |
| OpenDOAR | https://v2.sherpa.ac.uk/opendoar/ |
| OpenML | https://www.openml.org/search?type=data |
| Papers with Code | https://paperswithcode.com/datasets |
| Penn Machine Learning Benchmarks | https://github.com/EpistasisLab/pmlb/tree/master/datasets |
| Public APIs | https://github.com/public-apis/public-apis |
| Registry of Open Access Repositories | http://roar.eprints.org/ |
| REgistry of REsearch Data REpositories | https://www.re3data.org/ |
| UCI Machine Learning Repository | https://archive.ics.uci.edu/ |
| Speech Dataset | https://www.shaip.com/offerings/speech-data-catalog/ |
| Visual Data Discovery | https://visualdata.io/discovery |
List of portals suitable for a specific subtype of applications
The data portals which are suitable for a specific subtype of machine learning application are listed in the subsequent sections.
Image data
Text data
These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.
Reviews
News articles
Messages
Twitter and tweets
Dialogues
Legal
Other text
Sound data
These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.
Speech
Music
Other sounds
Signal data
Datasets containing electric signal information requiring some sort of signal processing for further analysis.
Electrical
Motion-tracking
Other signals
Chemical data
Datasets from physical systems.
Chemical Reactions with transition states (TS)
OpenReACT-CHON-EFH
OpenReACT-CHON-EFH is a 2025 open-access benchmark for machine-learning interatomic potentials.
- **RTP set** – 35,087 stationary-point geometries drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G level.
- **IRC set** – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points.
- **NMS set** – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations.
The dataset itself is distributed under a CC licence via Figshare.
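To make the labels concrete, the sketch below models one labeled geometry and checks that its arrays have mutually consistent shapes: N atoms imply N×3 positions and forces and a 3N×3N Hessian. The class and field names are hypothetical and do not reflect the file layout of the published dataset; only the subset sizes are taken from the description above.

```python
from dataclasses import dataclass

# Subset sizes as stated in the dataset description above.
SET_SIZES = {"RTP": 35_087, "IRC": 34_248, "NMS": 62_527}


@dataclass
class LabeledGeometry:
    """Hypothetical container for one geometry (not the published schema)."""
    symbols: list[str]             # one element symbol per atom
    positions: list[list[float]]   # N rows of (x, y, z)
    energy: float
    forces: list[list[float]]      # N rows of 3 components
    hessian: list[list[float]]     # 3N rows of 3N entries

    def validate(self) -> None:
        """Raise ValueError if array shapes disagree with the atom count."""
        n = len(self.symbols)
        for name, rows in (("positions", self.positions), ("forces", self.forces)):
            if len(rows) != n or any(len(row) != 3 for row in rows):
                raise ValueError(f"{name} must have shape ({n}, 3)")
        size = 3 * n
        if len(self.hessian) != size or any(len(row) != size for row in self.hessian):
            raise ValueError(f"hessian must have shape ({size}, {size})")


total_geometries = sum(SET_SIZES.values())  # structures across the three subsets
```

A shape check of this kind is a cheap first test when loading force and Hessian labels, since a mismatched atom count tends to surface only later as a confusing training error.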
Physical data
Datasets from physical systems.
High-energy physics
Systems
Astronomy
Earth science
Other physical
| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Reference | Creator |
| Concrete Compressive Strength Dataset | Dataset of concrete properties and compressive strength. | Nine features are given for each sample. | 1030 | Text | Regression | 2007 | | I. Yeh |
| Concrete Slump Test Dataset | Concrete slump flow given in terms of properties. | Features of concrete given such as fly ash, water, etc. | 103 | Text | Regression | 2009 | | I. Yeh |
| Musk Dataset | Predict if a molecule, given the features, will be a musk or a non-musk. | 168 features given for each molecule. | 6598 | Text | Classification | 1994 | | Arris Pharmaceutical Corp. |
| Steel Plates Faults Dataset | Steel plates of 7 different types. | 27 features given for each sample. | 1941 | Text | Classification | 2010 | | Semeion Research Center |
| Noble Metal Monometallic Nanoparticles Datasets | Processing and structural features of monometallic nanoparticles, labels being formation energy. | 85-182 features given for each sample. | 425 to 4000 | CSV | Regression | 2017 to 2023 | | A. Barnard and G. Opletal |
| Noble Metal Bimetallic Nanoparticles Datasets | Processing and structural features of bimetallic nanoparticles, labels being formation energy. | 922 features given for each sample. | 138147 to 162770 | CSV | Regression | 2023 | | J. Ting et al. |
| AuPdPt Trimetallic Nanoparticles Dataset | Processing and structural features of AuPdPt nanoparticles, labels being formation energy. | 1958 features given for each sample. | 48136 | CSV | Regression | 2023 | | K. Lu et al. |
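For the CSV-format regression datasets in this table, a typical first step is to separate the feature columns from the label column. The following is a minimal sketch using only the Python standard library. The function name and the column names in the usage note are invented for illustration, and the sketch assumes every column is numeric, which may not hold for the actual files.

```python
import csv
import io


def split_features_and_label(csv_text: str, label_column: str):
    """Split CSV text into a feature matrix and a label vector.

    Assumes a header row and purely numeric columns (an assumption,
    not a property guaranteed by any dataset listed here).
    """
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Remove the label from the row; the rest are features.
        labels.append(float(row.pop(label_column)))
        features.append([float(value) for value in row.values()])
    return features, labels
```

With a file whose label column is named, say, `formation_energy` (a hypothetical name), `split_features_and_label(text, "formation_energy")` returns one list of feature values per sample and a parallel list of labels that can be passed to a regression model.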
Biological data
Datasets from biological systems.
Human
Animal
Fungi
Plant
Microbe
Drug discovery
Anomaly data
Question answering data
This section includes datasets that deal with structured data.
| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Reference | Creator |
| DBpedia Neural Question Answering Dataset | A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base. | This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. | 894,499 | Question-query pairs | Question Answering | 2018 | | Hartmann, Soru, and Marx et al. |
| Vietnamese Question Answering Dataset | A large collection of Vietnamese questions for evaluating MRC models. | This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. | 23,074 | Question-answer pairs | Question Answering | 2020 | | Nguyen et al. |
| Vietnamese Multiple-Choice Machine Reading Comprehension Corpus | A collection of Vietnamese multiple-choice questions for evaluating MRC models. | This corpus includes 2,783 Vietnamese multiple-choice questions. | 2,783 | Question-answer pairs | Question Answering/Machine Reading Comprehension | 2020 | | Nguyen et al. |
| Open-Domain Question Answering Goes Conversational via Question Rewriting | An end-to-end open-domain question answering dataset. | This dataset includes 14,000 conversations with 81,000 question-answer pairs. | | Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source | Question Answering | 2021 | | Anantha and Vakulenko et al. |
| UnifiedQA | Question-answer data | Processed dataset | | | Question Answering | 2020 | | Khashabi et al. |