{ "data_id": "43695", "name": "Red-Wine-Quality", "exact_name": "Red-Wine-Quality", "version": 1, "version_label": "v1.0", "description": "Context\nThe two datasets are related to red and white variants of the Portuguese \"Vinho Verde\" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). \nThese datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). \n\nThis dataset is also available from the UCI machine learning repository, https:\/\/archive.ics.uci.edu\/ml\/datasets\/wine+quality , I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallowed me from doing so, I will take this down if requested.)\nContent\nFor more information, read [Cortez et al., 2009].\nInput variables (based on physicochemical tests):\n1 - fixed acidity \n2 - volatile acidity \n3 - citric acid \n4 - residual sugar \n5 - chlorides \n6 - free sulfur dioxide \n7 - total sulfur dioxide \n8 - density \n9 - pH \n10 - sulphates \n11 - alcohol \nOutput variable (based on sensory data): \n12 - quality (score between 0 and 10) \nTips\nAn interesting thing to do, aside from regression modelling, is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. 7 or higher classified as 'good\/1' and the remainder as 'not good\/0'.\nThis allows you to practice hyperparameter tuning on e.g. 
decision tree algorithms looking at the ROC curve and the AUC value.\nWithout doing any kind of feature engineering or overfitting you should be able to get an AUC of 0.88 (without even using a random forest algorithm).\nKNIME is a great tool (GUI) that can be used for this.\n1 - File Reader (for csv) to Linear Correlation node and to Interactive Histogram for basic EDA.\n2 - File Reader to 'Rule Engine Node' to turn the 10-point scale into a dichotomous variable (good wine and the rest); the code to put in the rule engine is something like this:\n\n$quality$ > 6.5 => \"good\"\nTRUE => \"bad\" \n3 - Rule Engine Node output to input of Column Filter node to filter out your original 10-point feature (this prevents leakage)\n4 - Column Filter Node output to input of Partitioning Node (your standard train\/test split, e.g. 75\/25, choose 'random' or 'stratified')\n5 - Partitioning Node train data split output to input of Decision Tree Learner node\n6 - Partitioning Node test data split output to input of Decision Tree Predictor Node\n7 - Decision Tree Learner Node output to input of Decision Tree Predictor Node\n8 - Decision Tree Predictor Node output to input of ROC Node (here you can evaluate your model based on the AUC value)\n\nInspiration\nUse machine learning to determine which physicochemical properties make a wine 'good'!\nAcknowledgements\nThis dataset is also available from the UCI machine learning repository, https:\/\/archive.ics.uci.edu\/ml\/datasets\/wine+quality , I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallowed me from doing so, I will take this down at first request. I am not the owner of this dataset.)\nPlease include this citation if you plan to use this database: \nP. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. \nModeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.\nRelevant publication\nP. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. 
Reis. Modeling wine preferences by data mining from physicochemical properties. \nIn Decision Support Systems, Elsevier, 47(4):547-553, 2009.", "format": "arff", "uploader": " Stewart", "uploader_id": 30123, "visibility": "public", "creator": null, "contributor": null, "date": "2022-03-24 07:14:23", "update_comment": null, "last_update": "2022-03-24 07:14:23", "licence": "Database: Open Database, Contents: Database Contents", "status": "active", "error_message": null, "url": "https:\/\/www.openml.org\/data\/download\/22102520\/dataset", "default_target_attribute": null, "row_id_attribute": null, "ignore_attribute": null, "runs": 0, "suggest": { "input": [ "Red-Wine-Quality", "Context The two datasets are related to red and white variants of the Portuguese \"Vinho Verde\" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. 
there are much more norma " ], "weight": 5 }, "qualities": { "NumberOfInstances": 1599, "NumberOfFeatures": 12, "NumberOfClasses": null, "NumberOfMissingValues": 0, "NumberOfInstancesWithMissingValues": 0, "NumberOfNumericFeatures": 12, "NumberOfSymbolicFeatures": 0, "PercentageOfSymbolicFeatures": 0, "AutoCorrelation": null, "PercentageOfNumericFeatures": 100, "PercentageOfMissingValues": 0, "PercentageOfInstancesWithMissingValues": 0, "PercentageOfBinaryFeatures": 0, "NumberOfBinaryFeatures": 0, "MinorityClassSize": null, "MinorityClassPercentage": null, "MajorityClassSize": null, "MajorityClassPercentage": null, "Dimensionality": 0.0075046904315197 }, "tags": [], "features": [ { "name": "fixed_acidity", "index": "0", "type": "numeric", "distinct": "96", "missing": "0", "min": "5", "max": "16", "mean": "8", "stdev": "2" }, { "name": "volatile_acidity", "index": "1", "type": "numeric", "distinct": "143", "missing": "0", "min": "0", "max": "2", "mean": "1", "stdev": "0" }, { "name": "citric_acid", "index": "2", "type": "numeric", "distinct": "80", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "residual_sugar", "index": "3", "type": "numeric", "distinct": "91", "missing": "0", "min": "1", "max": "16", "mean": "3", "stdev": "1" }, { "name": "chlorides", "index": "4", "type": "numeric", "distinct": "153", "missing": "0", "min": "0", "max": "1", "mean": "0", "stdev": "0" }, { "name": "free_sulfur_dioxide", "index": "5", "type": "numeric", "distinct": "60", "missing": "0", "min": "1", "max": "72", "mean": "16", "stdev": "10" }, { "name": "total_sulfur_dioxide", "index": "6", "type": "numeric", "distinct": "144", "missing": "0", "min": "6", "max": "289", "mean": "46", "stdev": "33" }, { "name": "density", "index": "7", "type": "numeric", "distinct": "436", "missing": "0", "min": "1", "max": "1", "mean": "1", "stdev": "0" }, { "name": "pH", "index": "8", "type": "numeric", "distinct": "89", "missing": "0", "min": "3", "max": "4", "mean": 
"3", "stdev": "0" }, { "name": "sulphates", "index": "9", "type": "numeric", "distinct": "96", "missing": "0", "min": "0", "max": "2", "mean": "1", "stdev": "0" }, { "name": "alcohol", "index": "10", "type": "numeric", "distinct": "65", "missing": "0", "min": "8", "max": "15", "mean": "10", "stdev": "1" }, { "name": "quality", "index": "11", "type": "numeric", "distinct": "6", "missing": "0", "min": "3", "max": "8", "mean": "6", "stdev": "1" } ], "nr_of_issues": 0, "nr_of_downvotes": 0, "nr_of_likes": 0, "nr_of_downloads": 0, "total_downloads": 0, "reach": 0, "reuse": 0, "impact_of_reuse": 0, "reach_of_reuse": 0, "impact": 0 }