List of datasets for machine-learning research
These datasets are used in machine learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms, computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce.
Many organizations, including governments, publish and share their datasets, often using common metadata formats. Datasets are classified by license into two groups: open data and non-open data.
Datasets from various government bodies are listed in List of open government data sites. Datasets are hosted on open data portals, where they can be searched, deposited, and accessed through interfaces such as an open API. They are made available under various types and subtypes.
List of sorting used for datasets
Data portals are classified by their type of license. Portals built on open-source licensed software are known as open data portals and are used by many government organizations and academic institutions.
List of open data portals
| Portal-name | License | List of installations of the portal | Typical usages |
| Comprehensive Knowledge Archive Network | AGPL | https://ckan.github.io/ckan-instances/ https://github.com/sebneu/ckan_instances/blob/master/instances.csv | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes |
| DKAN | GPL | https://getdkan.org/community | Data repository for government or non-profit organisations, Data Management Solution for Research Institutes |
| Dataverse | Apache | https://dataverse.org/installations https://dataverse.org/metrics | Data Management Solution for Research Institutes |
| DSpace | BSD | https://registry.lyrasis.org/ | Data Management Solution for Research Institutes |
| OpenML | BSD | https://www.openml.org/search?type=data&sort=runs&status=active | Data Management Solution to share datasets, algorithms, and experiment results through APIs. |
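Portals built on CKAN (Comprehensive Knowledge Archive Network) expose a JSON "Action API". The following is a minimal sketch of composing a dataset search request and reading the response envelope, assuming the standard `package_search` action under `/api/3/action/`. The portal address and the sample response are hypothetical, and no network request is made.

```python
import json
from urllib.parse import urlencode


def ckan_search_url(base_url, query, rows=5):
    """Build a CKAN Action API package_search URL (no request is sent)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    # Fall back to the machine-readable name when a title is missing.
    return [pkg.get("title", pkg.get("name")) for pkg in payload["result"]["results"]]


# Hypothetical portal address and hand-written sample response, for illustration only.
url = ckan_search_url("https://demo.example.org/", "air quality", rows=2)
sample = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "air-quality-2020", "title": "Air quality 2020"}],
    },
})
titles = dataset_titles(sample)
```

In practice the URL would be fetched with an HTTP client and the response body passed to `dataset_titles`; individual installations may restrict or extend the API.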
List of portals suitable for multiple types of applications
The data portal sometimes lists a wide variety of subtypes of datasets pertaining to many machine learning applications.
| Academic Torrents | https://academictorrents.com |
| Amazon Datasets | https://registry.opendata.aws/ |
| Awesome Public Datasets Collection | https://github.com/awesomedata/awesome-public-datasets |
| data.world | https://data.world/datasets/machine-learning |
| Datahub – Core Datasets | https://datahub.io/docs/core-data |
| DataONE | https://www.dataone.org/ |
| DataPortals | https://dataportals.org/ |
| Datasetlist.com | https://www.datasetlist.com |
| Global Open Data Index – Open Knowledge Foundation | https://okfn.org/ |
| Google Dataset Search | https://datasetsearch.research.google.com/ |
| Hugging Face | https://huggingface.co/docs/datasets/ |
| IBM's Data Asset Exchange | https://developer.ibm.com/exchanges/data/ |
| Jupyter – Tutorial Data | https://jupyter-tutorial.readthedocs.io/en/latest/data-processing/opendata.html |
| Kaggle | https://www.kaggle.com/datasets |
| Machine learning datasets | https://macgence.com/data-sets-and-cataloges/ |
| Major Smart Cities with Open Data | https://rlist.io/l/major-smart-cities-with-open-data-portals |
| Microsoft Datasets | https://msropendata.com/datasets |
| Open Data Inception | https://opendatainception.io/ |
| Opendatasoft | https://data.opendatasoft.com/explore/dataset/open-data-sources%40public/table/?sort=code_en |
| OpenDOAR | https://v2.sherpa.ac.uk/opendoar/ |
| OpenML | https://www.openml.org/search?type=data |
| Papers with Code | https://paperswithcode.com/datasets |
| Penn Machine Learning Benchmarks | https://github.com/EpistasisLab/pmlb/tree/master/datasets |
| Public APIs | https://github.com/public-apis/public-apis |
| Registry of Open Access Repositories | http://roar.eprints.org/ |
| REgistry of REsearch Data REpositories | https://www.re3data.org/ |
| UCI Machine Learning Repository | https://archive.ics.uci.edu/ |
| Speech Dataset | https://www.shaip.com/offerings/speech-data-catalog/ |
| Visual Data Discovery | https://visualdata.io/discovery |
List of portals suitable for a specific subtype of applications
The data portals which are suitable for a specific subtype of machine learning application are listed in the subsequent sections.
Image data
Text data
These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.
Reviews
News articles
Messages
Twitter and tweets
Dialogues
Legal
Other text
Sound data
These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis.
Speech
Music
Other sounds
Signal data
These datasets contain electrical signal information that requires some form of signal processing for further analysis.
Electrical
Motion-tracking
Other signals
Chemical data
Datasets from physical systems.
Chemical Reactions with transition states (TS)
OpenReACT-CHON-EFH
OpenReACT-CHON-EFH is a 2025 open-access benchmark for machine-learning interatomic potentials.
- **RTP set** – 35,087 stationary-point geometries drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G level.
- **IRC set** – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points.
- **NMS set** – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations.
The dataset itself is distributed under a CC licence via Figshare.
Physical data
Datasets from physical systems.
High-energy physics
Systems
Astronomy
Earth science
Other physical
| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Reference | Creator |
| Concrete Compressive Strength Dataset | Dataset of concrete properties and compressive strength. | Nine features are given for each sample. | 1030 | Text | Regression | 2007 | | I. Yeh |
| Concrete Slump Test Dataset | Concrete slump flow given in terms of properties. | Features of concrete given such as fly ash, water, etc. | 103 | Text | Regression | 2009 | | I. Yeh |
| Musk Dataset | Predict if a molecule, given the features, will be a musk or a non-musk. | 168 features given for each molecule. | 6598 | Text | Classification | 1994 | | Arris Pharmaceutical Corp. |
| Steel Plates Faults Dataset | Steel plates of 7 different types. | 27 features given for each sample. | 1941 | Text | Classification | 2010 | | Semeion Research Center |
| Noble Metal Monometallic Nanoparticles Datasets | Processing and structural features of monometallic nanoparticles, labels being formation energy. | 85-182 features given for each sample. | 425 to 4000 | CSV | Regression | 2017 to 2023 | | A. Barnard and G. Opletal |
| Noble Metal Bimetallic Nanoparticles Datasets | Processing and structural features of bimetallic nanoparticles, labels being formation energy. | 922 features given for each sample. | 138147 to 162770 | CSV | Regression | 2023 | | J. Ting et al. |
| AuPdPt Trimetallic Nanoparticles Dataset | Processing and structural features of AuPdPt nanoparticles, labels being formation energy. | 1958 features given for each sample. | 48136 | CSV | Regression | 2023 | | K. Lu et al. |
Biological data
Datasets from biological systems.
Human
Animal
Fungi
Plant
Microbe
Drug discovery
Anomaly data
Question answering data
This section includes datasets that deal with structured data.
| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Reference | Creator |
| DBpedia Neural Question Answering Dataset | A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base. | This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. | 894,499 | Question-query pairs | Question Answering | 2018 | | Hartmann, Soru, and Marx et al. |
| Vietnamese Question Answering Dataset | A large collection of Vietnamese questions for evaluating MRC models. | This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. | 23,074 | Question-answer pairs | Question Answering | 2020 | | Nguyen et al. |
| Vietnamese Multiple-Choice Machine Reading Comprehension Corpus | A collection of Vietnamese multiple-choice questions for evaluating MRC models. | This corpus includes 2,783 Vietnamese multiple-choice questions. | 2,783 | Question-answer pairs | Question Answering/Machine Reading Comprehension | 2020 | | Nguyen et al. |
| Open-Domain Question Answering Goes Conversational via Question Rewriting | An end-to-end open-domain question answering. | This dataset includes 14,000 conversations with 81,000 question-answer pairs. | | Context, Question, Rewrite, Answer, Answer_URL, Conversation_no, Turn_no, Conversation_source. Further details are provided in the and respective . | Question Answering | 2021 | | Anantha and Vakulenko et al. |
| UnifiedQA | Question-answer data | Processed dataset | | | Question Answering | 2020 | | Khashabi et al. |
Dialog or instruction prompted data
This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.
| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Reference | Creator |
| Taskmaster | 3 datasets with >55,000 spoken and written task-oriented dialogs in several domains. | | 13,215 + 17,289 + 23,757 dialogs, in 6 + 7 + 1 task domains. | 1 and 2: conversation id, utterances, instruction id; 3: conversation id, utterances, vertical, scenario, instructions. | Do the task. | 2019 | | Byrne and Krishnamoorthi et al. |
| DrRepair | Labeled dataset for program repair. | | | Check format details in the . | Do the task. | 2020 | | Michihiro et al. |
| Super-NaturalInstructions | Tasks specified in natural language. | | 1,616 NLP tasks in 76 task types. | Task definition in natural language instructions; example input/output. | Do the task. | 2022 | | Wang et al. |
| LAMBADA | Narrative passages where the last word is omitted. | | | | Guess the last word. | 2016 | | Paperno et al. |
| FLAN | Instruction tuning data, with a mix of zero-shot, few-shot and chain-of-thought templates. | | | | Instruction tuning; do the task. | 2021 | | Wei et al. |
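The multi-turn structure described in this section, in which a "user" and an "agent" alternate, can be sketched as a small data structure. The class names, field names, and example dialog below are hypothetical and do not reflect the actual schema of any dataset in the table.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Turn:
    speaker: str  # either "user" or "agent"
    text: str


@dataclass
class Dialog:
    dialog_id: str
    turns: List[Turn] = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        """Append one turn, rejecting unknown speaker roles."""
        if speaker not in ("user", "agent"):
            raise ValueError(f"unknown speaker: {speaker!r}")
        self.turns.append(Turn(speaker, text))

    def to_pairs(self) -> List[Tuple[str, str]]:
        """Pair each user request with the agent reply that directly follows it."""
        pairs = []
        for prev, nxt in zip(self.turns, self.turns[1:]):
            if prev.speaker == "user" and nxt.speaker == "agent":
                pairs.append((prev.text, nxt.text))
        return pairs


# Invented example dialog, for illustration only.
dialog = Dialog("demo-1")
dialog.add("user", "Book a table for two at 7 pm.")
dialog.add("agent", "Which restaurant would you like?")
dialog.add("user", "The Italian place downtown.")
dialog.add("agent", "Done, your table is booked.")
pairs = dialog.to_pairs()
```

Flattening a dialog into request-reply pairs is one common way to prepare such data for supervised training; keeping the full turn history as context is another.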
Cybersecurity
| Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created | Reference | Creator |
| MITRE ATT&CK | ATT&CK is a globally accessible knowledge base of adversary tactics and techniques. | Data can be downloaded from these two GitHub repositories: and | MITRE ATT&CK | |||||
| CAPEC | Common Attack Pattern Enumeration and Classification | Data can be downloaded from: | CAPEC | |||||
| CVE | CVE is a list of publicly disclosed cybersecurity vulnerabilities that is free to search, use, and incorporate into products and services. | Data can be downloaded from: | CVE | |||||
| CWE | Common Weakness Enumeration data. | Data can be downloaded from: | CWE | |||||
| MalwareTextDB | Annotated database of malware texts. | The contains the data to download. | Kiat et al. | |||||
| USENIX Security Symposium proceedings | Collection of security proceedings from USENIX Security Symposium – technical sessions from 1995 to 2022. | This data is not pre-processed. | | USENIX Security Symposium | ||||
| APTNotes | Collection of public documents, whitepapers and articles about APT campaigns. All the documents are publicly available data. | This data is not pre-processed. | The of the project contains a file with links to the data stored in box. Data files can also be downloaded . | APT Notes | ||||
| arXiv Cryptography and Security papers | Collection of articles about cybersecurity | This data is not pre-processed. | All articles available . | arXiv | ||||
| Security eBooks for free | Small collection of security eBooks, and security presentations publicly available. | This data is not pre-processed. | ||||||
| National Cyber Security strategy repository | Repository of worldwide strategy documents about cybersecurity. | This data is not pre-processed. | ||||||
| Cyber Security Natural Language Processing | Data about cybersecurity strategies from more than 75 countries. | Tokenization, meaningless-frequent words removal. | Yanlin Chen, Yunjian Wei, Yifan Yu, Wen Xue, Xianya Qin | |||||
| APT Reports collection | Sample of APT reports, malware, technology, and intelligence collection | Raw and tokenized data available. | All data is available in this repository. | blackorbird | ||||
| Offensive Language Identification Dataset | Data available in the . Data is also available . | Zampieri et al. | ||||||
| Cyber reports from the National Cyber Security Centre | This data is not pre-processed. | | ||||||
| APT reports by Kaspersky | This data is not pre-processed. | |||||||
| The cyberwire | This data is not pre-processed. | | ||||||
| Databreaches news | This data is not pre-processed. | | ||||||
| Cybernews | This data is not pre-processed. | | ||||||
| Bleepingcomputer | This data is not pre-processed. | |||||||
| Therecord | This data is not pre-processed. | |||||||
| Hackread | This data is not pre-processed. | |||||||
| Securelist | This data is not pre-processed. | | ||||||
| Stucco project | The Stucco project collects data not typically integrated into security systems. | This data is not pre-processed | ||||||
| Farsightsecurity | Website with technical information, reports, and more about security topics. | This data is not pre-processed. | | |||||
| Schneier | Website with academic papers about security topics. | This data is not pre-processed. | | |||||
| Trendmicro | Website with research, news, and perspectives about security topics. | This data is not pre-processed. | | |||||
| The Hacker News | News about cybersecurity topics. | This data is not pre-processed. | | |||||
| Krebsonsecurity | Security news and investigation | This data is not pre-processed | ||||||
| Mitre Defend | Matrix of Defend artifacts | json files | ||||||
| Mitre Atlas | Mitre Atlas is a knowledge base of adversary tactics, techniques, and case studies for machine learning systems based on real-world observations. | This data is not pre-processed | ||||||
| Mitre Engage | MITRE Engage is a framework for planning and discussing adversary engagement operations. | This data is not pre-processed | ||||||
| Hacking Tutorials | This data is not pre-processed |
Climate and sustainability
Code data
Multivariate data
Financial
Weather
Census
Transit
Internet
Games
Other multivariate
Curated repositories of datasets
As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research.
- OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
- PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
- Metatext NLP (https://metatext.io/datasets): A community-maintained web repository containing nearly 1,000 benchmark datasets. Covers tasks ranging from classification to question answering, in languages including English, Portuguese, and Arabic.
- Appen: Off-the-shelf and open-source datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to more than 25 use cases.
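As an illustration of the Python access that OpenML and PMLB provide, the sketch below wraps two loaders and a small summary helper. It assumes the third-party packages scikit-learn (for `fetch_openml`) and `pmlb` (for `fetch_data`) are installed and that a network connection is available when the loaders are called. The wrapper names, the `summarize` helper, and the dataset names in the comments are illustrative assumptions, not part of either project.

```python
from collections import Counter


def load_openml_dataset(name, version=1):
    """Fetch a dataset from OpenML by name via scikit-learn (needs network access)."""
    from sklearn.datasets import fetch_openml  # imported lazily: optional dependency
    bunch = fetch_openml(name=name, version=version, as_frame=False)
    return bunch.data, bunch.target


def load_pmlb_dataset(name):
    """Fetch a benchmark dataset from PMLB as (X, y) arrays (needs network access)."""
    from pmlb import fetch_data  # imported lazily: optional dependency
    return fetch_data(name, return_X_y=True)


def summarize(X, y):
    """Report sample count, feature count, and label frequencies for row-wise data."""
    rows = [list(row) for row in X]
    return {
        "n_samples": len(rows),
        "n_features": len(rows[0]) if rows else 0,
        "labels": dict(Counter(y)),
    }


# Offline toy data so that the helper can be tried without any download.
toy_summary = summarize([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]], ["a", "a", "b"])

# With network access, usage might look like this (dataset names are assumptions):
#   X, y = load_openml_dataset("iris")
#   X, y = load_pmlb_dataset("mushroom")
#   print(summarize(X, y))
```

Returning plain arrays from both loaders keeps downstream code independent of which repository supplied the data.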