Over 1.5 billion pounds of pumpkin are grown annually in the United States. Where are they sold, and for how much?
This dataset contains the prices at which pumpkins were sold at terminal markets in selected U.S. cities. Prices are differentiated by the commodity's growing origin, variety, size, package, and grade.
It covers terminal market prices for different pumpkin crops in 13 cities in the United States from September 24, 2016 to September 30, 2017:
- Atlanta, GA
- Baltimore, MD
- Boston, MA
- Chicago, IL
- Columbia, SC
- Dallas, TX
- Detroit, MI
- Los Angeles, CA
- Miami, FL
- New York, NY
- Philadelphia, PA
- San Francisco, CA
- Saint Louis, MO
Data for each city includes the following columns (although not all information is available for every city):
- Commodity Name: Always pumpkin, since this is a pumpkin-only dataset
- City Name: City where the pumpkin was sold
- Sub Variety
- Grade: In the US, usually only canned pumpkin is graded
- Date: Date of sale (rounded up to the nearest Saturday)
- Low Price
- High Price
- Mostly Low
- Mostly High
- Origin: Where the pumpkins were grown
- Origin District
- Item Size
- Unit of Sale
- Repack: Whether the pumpkin has been repackaged before sale
- Trans Mode
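As a minimal sketch of how the columns listed above might be used, the following Python snippet averages the midpoint of Low Price and High Price per city and shows what rounding a date up to a Saturday means. The sample rows are invented for illustration, the function names are hypothetical, and using the midpoint as a summary price is an assumption rather than anything the dataset prescribes.

```python
import csv
import io
from collections import defaultdict
from datetime import date, timedelta

# Invented rows for illustration only; not values from the real dataset.
SAMPLE_CSV = """City Name,Low Price,High Price
BOSTON,150,170
BOSTON,160,180
CHICAGO,120,
"""


def round_up_to_saturday(d: date) -> date:
    """Return d itself if it is a Saturday, otherwise the next Saturday."""
    # date.weekday() numbers Monday as 0, so Saturday is 5.
    return d + timedelta(days=(5 - d.weekday()) % 7)


def mean_midpoint_by_city(rows):
    """Average the (low + high) / 2 midpoint per city, skipping incomplete rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        low, high = row.get("Low Price"), row.get("High Price")
        if not low or not high:  # not every field is available for every city
            continue
        city = row["City Name"]
        sums[city] += (float(low) + float(high)) / 2
        counts[city] += 1
    return {city: sums[city] / counts[city] for city in sums}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
midpoints = mean_midpoint_by_city(rows)
```

Rows with a missing price are skipped rather than filled in, which keeps the averages honest at the cost of dropping some cities entirely.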
This dataset is based on Specialty Crops Terminal Markets Standard Reports distributed by the United States Department of Agriculture. This data is in the public domain.
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).
This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Beyond regression modelling, an interesting exercise is to set an arbitrary cutoff for the dependent variable (wine quality), e.g. classifying wines scoring 7 or higher as 'good/1' and the remainder as 'not good/0'. This lets you practice hyperparameter tuning on, e.g., decision tree algorithms while looking at the ROC curve and the AUC value. Without any feature engineering or overfitting, you should be able to get an AUC of .88 (without even using a random forest algorithm).
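The cutoff idea described above can be sketched in plain Python. This is only an illustration: the quality scores and model outputs below are invented toy values, the function names are hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) definition rather than with any particular library.

```python
def binarize_quality(scores, cutoff=7):
    """Label a wine 1 ('good') if its quality is >= cutoff, else 0 ('not good')."""
    return [1 if s >= cutoff else 0 for s in scores]


def roc_auc(labels, predicted_scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [p for y, p in zip(labels, predicted_scores) if y == 1]
    negatives = [p for y, p in zip(labels, predicted_scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy values for illustration only; not taken from the wine dataset.
quality = [5, 6, 7, 8, 4, 7]
model_scores = [0.2, 0.8, 0.7, 0.9, 0.1, 0.5]  # hypothetical classifier outputs

labels = binarize_quality(quality)
auc = roc_auc(labels, model_scores)
```

The pairwise loop is quadratic in the number of examples, which is fine for a toy check; for the full dataset a rank-based implementation from an established library would be the more practical choice.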
Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
Predicting incidence of thyroid disease based on demographic data and the outcomes of several medical tests.
Fires in Spain
This model predicts one out of three Iris species, based on petal length and width and sepal length and width.
The data is taken from the UCI Machine Learning Repository and was created by R.A. Fisher.
Fisher, R.A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
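The Iris description above does not say which kind of model is used, so the following is only a sketch of one simple possibility: a nearest-centroid classifier over the four measurements (sepal length, sepal width, petal length, petal width). The technique is swapped in for illustration, the function names are hypothetical, and the handful of training rows are illustrative values rather than the full 150-instance dataset.

```python
from collections import defaultdict

# Illustrative measurements in cm: (sepal length, sepal width, petal length, petal width).
TRAINING = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def fit_centroids(samples):
    """Compute the per-class mean of each feature."""
    grouped = defaultdict(list)
    for features, species in samples:
        grouped[species].append(features)
    return {
        species: tuple(sum(column) / len(column) for column in zip(*rows))
        for species, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the species whose centroid is closest in squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda species: squared_distance(centroids[species]))


centroids = fit_centroids(TRAINING)
small_flower = predict(centroids, (5.0, 3.4, 1.5, 0.2))
large_flower = predict(centroids, (6.5, 3.0, 5.8, 2.2))
```

A nearest-centroid rule has no tunable parameters, which makes it a convenient baseline before trying anything more elaborate on the three classes.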
New York Air Quality Measurements
Enigma EU government farm subsidies