The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).
This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Besides regression modelling, an interesting exercise is to set an arbitrary cutoff on the dependent variable (wine quality), e.g. classifying wines that score 7 or higher as 'good/1' and the remainder as 'not good/0'. This lets you practise hyperparameter tuning on e.g. decision tree algorithms, using the ROC curve and the AUC value to compare models. Without any feature engineering or overfitting, you should be able to get an AUC of .88 (without even using a random forest algorithm).
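The cutoff and the AUC measure described above can be sketched in plain Python. This is only an illustration under stated assumptions: the quality scores and predicted scores below are made-up values, not rows from the dataset, and the helper names `binarise` and `roc_auc` are hypothetical. In practice the predicted scores would come from a fitted model such as a decision tree.

```python
# Minimal sketch: binarise wine quality at a cutoff, then compute ROC AUC.
# All values below are illustrative, not taken from the wine dataset.

def binarise(quality_scores, cutoff=7):
    """Label a wine 1 ('good') if its quality is >= cutoff, else 0 ('not good')."""
    return [1 if q >= cutoff else 0 for q in quality_scores]


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    example gets the higher score; tied pairs count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical quality scores and model outputs for eight wines.
quality = [5, 6, 7, 4, 8, 6, 5, 7]
predicted_scores = [0.20, 0.35, 0.80, 0.10, 0.90, 0.60, 0.30, 0.55]

labels = binarise(quality)           # wines scoring 7 or higher become class 1
auc = roc_auc(labels, predicted_scores)
print(labels, round(auc, 3))
```

The pairwise definition is quadratic in the number of rows, which is fine for a toy example; for the full dataset a library routine would be the more usual choice.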
Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Fires in Spain
This model predicts one out of three Iris species, based on petal length and width and sepal length and width.
The data is taken from The UCI Machine Learning Repository and created by R.A. Fisher.
Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
Predicting incidence of thyroid disease based on demographic data and the outcomes of several medical tests.
New York Air Quality Measurements
Enigma EU government farm subsidies
This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
Predict crop yield based on crop, country, and weather conditions. Data taken from here
- FID: country codes used in extraction scripts for matching crop area maps, crop calendars, and climate data to countries
- Year: 1961--2008
- Crop: maize, rice, wheat, soy, barley, or sorghum
- Country: country name
- FAO_code: country code used by FAO
- precCRU, tMinCRU, tMaxCRU, tAvgCRU: total growing season precipitation, average minimum and maximum daily growing season temperature, and average growing season temperature from the CRU TS 2.1 historical climate dataset. These data span 1961--2002.
- precUDel, tAvgUDel: total growing season precipitation and average growing season temperature from the University of Delaware data set. These data span 1961--2008.
- Yield: yield (hg/ha) of a given crop. Here, this is just Production/Area.
- Production: quantity (tonnes) of a given crop produced in a country over the course of a year. Data from FAO.
- Area: area (ha) planted for a given crop in a country during the course of a year.
- Region: we use "region" here to mean a country's yield quartile for a given crop, relative to other countries.
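The Yield and Region fields can be illustrated with a short Python sketch. It rests on assumptions that the description does not confirm: that turning tonnes per hectare into hg/ha uses the factor 10,000 (1 tonne = 10,000 hg), and that quartile boundaries come from `statistics.quantiles` with its default method. The country labels and figures are made up, and the function names are hypothetical.

```python
import statistics

HG_PER_TONNE = 10_000  # 1 tonne = 1,000 kg = 10,000 hectograms


def yield_hg_per_ha(production_tonnes, area_ha):
    """Yield as Production/Area, expressed in hg/ha (unit conversion is an assumption)."""
    return production_tonnes * HG_PER_TONNE / area_ha


def yield_quartile(value, all_values):
    """Return 1 (lowest) to 4 (highest): the quartile of `value` among `all_values`."""
    q1, q2, q3 = statistics.quantiles(all_values, n=4)
    if value <= q1:
        return 1
    if value <= q2:
        return 2
    if value <= q3:
        return 3
    return 4


# Hypothetical (production in tonnes, area in ha) for one crop in one year.
records = {
    "A": (100, 100), "B": (200, 100), "C": (300, 100), "D": (400, 100),
    "E": (500, 100), "F": (600, 100), "G": (700, 100), "H": (800, 100),
}

yields = {c: yield_hg_per_ha(p, a) for c, (p, a) in records.items()}
regions = {c: yield_quartile(y, list(yields.values())) for c, y in yields.items()}
print(regions)
```

Quartiles here are computed within a single crop, matching the "for a given crop, relative to other countries" wording; how the original data handles ties or years is not stated.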