Datasets for Education and Research

UC Irvine Machine Learning Repository

The UC Irvine Machine Learning Repository is a well-respected resource for researchers, educators, and data enthusiasts alike. It offers a wide range of high-quality, curated datasets spanning topics from healthcare and biology to finance, text, and image classification. Designed for use in machine learning and AI research, these datasets are ideal for teaching, benchmarking algorithms, and exploring data-driven insights. There are 678 datasets available for download. Each dataset page lists the number of columns and rows, the domain, whether any data are missing, prior publications, and a variable table.

Kaggle Datasets

Kaggle is primarily a data competition website, but it also hosts one of the largest online collections of publicly available datasets, covering a broad array of domains such as healthcare, finance, sports, images, and text. Users can download, explore, and share over 500,000 datasets contributed by a global community of data scientists and organizations. These datasets are frequently updated and vary in complexity, making them suitable for everything from beginner analysis to advanced machine learning projects.

OpenML

OpenML provides a large, openly accessible platform of machine learning datasets from a wide range of domains. All datasets are uniformly formatted, include rich metadata, and support automated processing, making them suitable for research, teaching, and experimentation. OpenML primarily focuses on tabular data but also supports images and other data types. Users can filter datasets by size, type, and other properties, and access them directly through a web interface or programmatically using APIs in Python, R, and other languages. The platform is free and encourages community sharing, enabling seamless integration with common machine learning tools and reproducible research workflows.
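
As an illustration of the programmatic access mentioned above, the following is a minimal sketch that uses scikit-learn's `fetch_openml` function, one of several possible Python routes to OpenML. The helper name `load_openml_frame` is made up for this example, and the sketch assumes scikit-learn and pandas are installed and that a network connection is available when the function is called.

```python
def load_openml_frame(name: str, version: int = 1):
    """Download an OpenML dataset by name and return it as a pandas DataFrame.

    `load_openml_frame` is a hypothetical helper written for this example;
    the actual download is done by scikit-learn's `fetch_openml`.
    """
    # Imported inside the function so the module loads even where
    # scikit-learn is not installed; the import runs on first call.
    from sklearn.datasets import fetch_openml

    # as_frame=True asks for pandas objects instead of NumPy arrays.
    bunch = fetch_openml(name=name, version=version, as_frame=True)

    # `frame` holds the feature columns and the target column together.
    return bunch.frame
```

Calling `load_openml_frame("iris")` would download the well-known iris dataset, chosen here only because it is small. Pinning the version number keeps the result reproducible if a newer version of the dataset is uploaded later.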

Google Dataset Search

Google Dataset Search is a specialized search engine that enables users to find datasets from thousands of repositories across the web using simple keyword searches. It indexes tens of millions of datasets covering a wide range of subjects, including science, government, social sciences, and healthcare. The tool relies on standard metadata vocabularies such as schema.org to aggregate and organize datasets from different sources, making it easier to locate and access datasets regardless of where they are hosted. Users can filter results by topic, file format, license, and update date, and are directed to the original repository for downloading or exploring the data.
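
To make the role of schema.org metadata more concrete, here is a minimal sketch of the kind of JSON-LD description a dataset page might embed so that a crawler can pick it up. The property names (`name`, `description`, `url`, `license`, `distribution`, `encodingFormat`, `contentUrl`) come from the schema.org `Dataset` and `DataDownload` types. The helper `dataset_jsonld` and all example values, including the example.org URLs, are hypothetical.

```python
import json


def dataset_jsonld(name, description, url, license_url, download_url,
                   encoding_format="text/csv"):
    """Build a schema.org Dataset description as a plain dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        # Each distribution entry describes one downloadable file.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


# Hypothetical dataset used purely for illustration.
record = dataset_jsonld(
    name="Example City Air Quality Readings",
    description="Hourly air quality measurements from a fictional sensor network.",
    url="https://example.org/datasets/air-quality",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/datasets/air-quality.csv",
)

# On a web page, this JSON is typically placed inside a
# <script type="application/ld+json"> element.
markup = json.dumps(record, indent=2)
print(markup)
```

A description like this is published by the hosting repository rather than by the search engine, which is why results link back to the original source.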

Data.gov

Data.gov is the official U.S. government open data portal, aggregating over 300,000 datasets from federal, state, and local agencies. It covers a broad spectrum of topics, including healthcare, transportation, climate, and public safety.

European Union Open Data Portal

The portal aggregates public datasets from EU institutions across government, environment, economy, and more.

Harvard Dataverse

The Harvard Dataverse Repository is a free data repository open to all researchers from any discipline, both inside and outside of the Harvard community, where you can share, archive, cite, access, and explore research data. Each individual Dataverse collection is a customizable collection of datasets (or a virtual repository) for organizing, managing, and showcasing datasets. The repository hosts more than 3,000 datasets.

You can open your data to the general public, or restrict access and define customizable terms of use. When you publish your data, you automatically get a standard data citation with a Digital Object Identifier (DOI), and your metadata is open and findable via search engines, even when the data are restricted.
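
Because a DOI is a persistent identifier, it can be turned into a stable link through the public resolver at https://doi.org/. The sketch below shows one way to do that. The function name `doi_to_url` and the DOI in the usage note are placeholders for illustration, not a real dataset.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Convert a DOI such as 'doi:10.xxxx/yyyy' into a resolver URL."""
    doi = doi.strip()
    # Citations often write the identifier with a leading "doi:" label.
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Percent-encode unusual characters but keep the "/" separators.
    return DOI_RESOLVER + quote(doi, safe="/")
```

For example, `doi_to_url("doi:10.7910/DVN/EXAMPLE")` produces a link under https://doi.org/ that would redirect to the dataset's landing page if the identifier were real.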

Zenodo

Developed by CERN and OpenAIRE, Zenodo is a multidisciplinary, open-access repository supporting datasets up to 50 GB. It is widely used in science and academia.

GitHub Awesome Public Datasets

This curated list on GitHub collects links to high-quality, topic-centric public datasets across 36 domains, such as biology, economics, and education.