Create ml_ai_datasets.md

Omar Santos 2023-09-04 23:45:37 -04:00 committed by GitHub
parent 8b47729a3d
commit a222f076e5


# Datasets for AI / ML Research
1. **UCI Machine Learning Repository**: A collection of databases, domain theories, and data generators widely used by the machine learning community.
Website: [UCI ML Repository](https://archive.ics.uci.edu/ml/index.php)
2. **Kaggle Datasets**: Offers a wide variety of datasets in different domains including economics, biology, computer vision, and natural language processing.
Website: [Kaggle](https://www.kaggle.com/datasets)
3. **AWS Public Datasets**: Amazon Web Services offers a variety of public datasets that anyone can access.
Website: [AWS Public Datasets](https://registry.opendata.aws/)
4. **Google Dataset Search**: A tool that enables the discovery of datasets stored across the web.
Website: [Google Dataset Search](https://datasetsearch.research.google.com/)
5. **Microsoft Research Open Data**: A collection of free datasets from Microsoft Research to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences.
Website: [Microsoft Research Open Data](https://msropendata.com/)
6. **OpenML**: An online platform for collaborative machine learning where you can easily share data, models, and experiments.
Website: [OpenML](https://www.openml.org/)
7. **Data.gov**: The home of the U.S. Government's open data, providing data, tools, and resources.
Website: [Data.gov](https://www.data.gov/)
8. **EU Open Data Portal**: Provides access to an expanding range of data from the European Union institutions and other EU bodies.
Website: [EU Open Data Portal](https://data.europa.eu/euodp/en/home)
9. **Awesome Public Datasets on GitHub**: A curated, topic-organized list of high-quality open datasets.
GitHub Repository: [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)
10. **World Bank Open Data**: Free and open access to global development data.
Website: [World Bank Open Data](https://data.worldbank.org/)
11. **CERN Open Data Portal**: Provides access to data generated by the Large Hadron Collider and other CERN experiments.
Website: [CERN Open Data Portal](http://opendata.cern.ch/)
12. **National Aeronautics and Space Administration (NASA)**: Offers a wide range of datasets related to space and Earth sciences.
Website: [NASA](https://data.nasa.gov/)
13. **NOAA Data Sets**: Provides access to national and global data on climate, weather, oceans, and coasts.
Website: [NOAA](https://www.noaa.gov/data)
14. **ImageNet**: A dataset of more than 14 million labeled images organized into roughly 22,000 categories (WordNet synsets).
Website: [ImageNet](http://www.image-net.org/)
15. **COCO (Common Objects in Context)**: A dataset of roughly 330,000 images of everyday objects in complex scenes, annotated for object detection, segmentation, and captioning.
Website: [COCO Dataset](https://cocodataset.org/)
16. **Wikipedia: List of datasets for machine-learning research**: A Wikipedia article providing a comprehensive list of datasets for machine-learning research.
Website: [Wikipedia List](https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research)
17. **Natural Earth Data**: Offers free vector and raster map data at various scales.
Website: [Natural Earth Data](https://www.naturalearthdata.com/)
18. **Reddit Datasets**: A subreddit where members of the Reddit community share and request datasets.
Website: [Reddit Datasets](https://www.reddit.com/r/datasets/)
19. **Quandl** (now Nasdaq Data Link): Provides financial, economic, and alternative datasets.
Website: [Quandl](https://www.quandl.com/)
20. **Stanford Large Network Dataset Collection**: A collection of large network datasets including social networks, web graphs, etc.
Website: [Stanford Network Analysis Project](http://snap.stanford.edu/data/index.html)

These sources cover a wide range of domains. Explore them based on your specific requirements and interests in machine learning.
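
Many of the sources above distribute data as CSV files, so a quick first look at a download is often worthwhile before using it. The following is a minimal sketch, not part of any source's official tooling: the `profile_csv` helper, the `SAMPLE` data, and the set of strings treated as missing values are all assumptions made for illustration. Adjust them to match the file you actually download.

```python
import csv
import io
from collections import Counter


def profile_csv(text, missing_tokens=("", "NA", "NaN", "?")):
    """Return the row count plus per-column missing and distinct value counts.

    `missing_tokens` is an assumption: check the dataset's documentation
    for the markers it really uses for missing values.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = Counter()
    distinct = {name: set() for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            # Short rows yield None for absent fields; treat them as empty.
            value = (row.get(name) or "").strip()
            if value in missing_tokens:
                missing[name] += 1
            else:
                distinct[name].add(value)
    return {
        "rows": rows,
        "columns": list(columns),
        "missing": {name: missing[name] for name in columns},
        "distinct": {name: len(distinct[name]) for name in columns},
    }


# Hypothetical sample standing in for a file downloaded from one of the sources.
SAMPLE = """sepal_length,sepal_width,species
5.1,3.5,setosa
4.9,?,setosa
6.3,3.3,virginica
,2.8,versicolor
"""

if __name__ == "__main__":
    print(profile_csv(SAMPLE))
```

For a real file, read its contents with `open(path, newline="", encoding="utf-8").read()` and pass the result to `profile_csv`. For large datasets, a streaming reader or a library such as pandas is likely a better fit than loading the whole file into memory.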