{ "data_id": "44095", "name": "spambase_reproduced", "exact_name": "spambase_reproduced", "version": 1, "version_label": "reproduced", "description": "**Author**: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt \n\n**Source**: [UCI](https:\/\/archive.ics.uci.edu\/ml\/datasets\/spambase) \n\n**Please cite**: [UCI](https:\/\/archive.ics.uci.edu\/ml\/citation_policy.html)\n\n\n\nSPAM E-mail Database \n\nThe \"spam\" concept is diverse: advertisements for products\/websites, make money fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.\n\n \n\nFor background on spam: \n\nCranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of the ACM, 41(8):74-83, 1998. \n\n\n\n### Attribute Information: \n\nThe last column denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. \n\n\n\nFor the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes: \n\n\n\n48 continuous real [0,100] attributes of type \n\nword_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) \/ total number of words in e-mail. 
A \"word\" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.\n\n \n\n6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) \/ total characters in e-mail\n\n \n\n1 continuous real [1,...] attribute of type capital_run_length_average\n\n = average length of uninterrupted sequences of capital letters\n\n \n\n1 continuous integer [1,...] attribute of type capital_run_length_longest\n\n = length of longest uninterrupted sequence of capital letters\n\n \n\n1 continuous integer [1,...] attribute of type capital_run_length_total\n\n = sum of length of uninterrupted sequences of capital letters\n\n = total number of capital letters in the e-mail\n\n \n\n1 nominal {0,1} class attribute of type spam\n\n = denotes whether the e-mail was considered spam (1) or not (0), \n\n i.e. unsolicited commercial e-mail.", "format": "arff", "uploader": " Perry", "uploader_id": 30495, "visibility": "public", "creator": "\"Mark Hopkins\",\"Erik Reeber\",\"George Forman\",\"Jaap Suermondt\",\"Hewlett-Packard Labs\"", "contributor": null, "date": "2022-06-21 23:31:44", "update_comment": null, "last_update": "2022-06-21 23:31:44", "licence": "Public", "status": "active", "error_message": null, "url": "https:\/\/old.openml.org\/data\/download\/22103191\/dataset", "default_target_attribute": "class", "row_id_attribute": null, "ignore_attribute": null, "runs": 0, "suggest": { "input": [ "spambase_reproduced", "SPAM E-mail Database The \"spam\" concept is diverse: advertisements for products\/websites, make money fast schemes, chain letters, pornography... Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. 
These are useful when constructing a personalized spam filter. One would either have to blind such non-sp " ], "weight": 5 }, "qualities": { "NumberOfInstances": 4601, "NumberOfFeatures": 58, "NumberOfClasses": 2, "NumberOfMissingValues": 0, "NumberOfInstancesWithMissingValues": 0, "NumberOfNumericFeatures": 57, "NumberOfSymbolicFeatures": 1, "PercentageOfSymbolicFeatures": 1.7241379310344827, "AutoCorrelation": 0.9997826086956522, "PercentageOfNumericFeatures": 98.27586206896551, "PercentageOfMissingValues": 0, "PercentageOfInstancesWithMissingValues": 0, "PercentageOfBinaryFeatures": 1.7241379310344827, "NumberOfBinaryFeatures": 1, "MinorityClassSize": 1813, "MinorityClassPercentage": 39.404477287546186, "MajorityClassSize": 2788, "MajorityClassPercentage": 60.59552271245382, "Dimensionality": 0.012605955227124538 }, "tags": [], "features": [ { "name": "class", "index": "57", "type": "nominal", "distinct": "2", "missing": "0", "target": "1", "distr": [ [ "0", "1" ], [ [ "2788", "0" ], [ "0", "1813" ] ] ] }, { "name": "word_freq_telnet", "index": "30", "type": "numeric", "distinct": "128", "missing": "0", "min": "0", "max": "13", "mean": "0", "stdev": "0" }, { "name": "word_freq_labs", "index": "29", "type": "numeric", "distinct": "179", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "word_freq_857", "index": "31", "type": "numeric", "distinct": "106", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_data", "index": "32", "type": "numeric", "distinct": "184", "missing": "0", "min": "0", "max": "18", "mean": "0", "stdev": "1" }, { "name": "word_freq_415", "index": "33", "type": "numeric", "distinct": "110", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_85", "index": "34", "type": "numeric", "distinct": "177", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "1" }, { "name": "word_freq_technology", "index": "35", "type": "numeric", 
"distinct": "159", "missing": "0", "min": "0", "max": "8", "mean": "0", "stdev": "0" }, { "name": "word_freq_1999", "index": "36", "type": "numeric", "distinct": "188", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_parts", "index": "37", "type": "numeric", "distinct": "53", "missing": "0", "min": "0", "max": "8", "mean": "0", "stdev": "0" }, { "name": "word_freq_pm", "index": "38", "type": "numeric", "distinct": "163", "missing": "0", "min": "0", "max": "11", "mean": "0", "stdev": "0" }, { "name": "word_freq_direct", "index": "39", "type": "numeric", "distinct": "125", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_cs", "index": "40", "type": "numeric", "distinct": "108", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_meeting", "index": "41", "type": "numeric", "distinct": "186", "missing": "0", "min": "0", "max": "14", "mean": "0", "stdev": "1" }, { "name": "word_freq_original", "index": "42", "type": "numeric", "distinct": "136", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "word_freq_project", "index": "43", "type": "numeric", "distinct": "160", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "1" }, { "name": "word_freq_re", "index": "44", "type": "numeric", "distinct": "230", "missing": "0", "min": "0", "max": "21", "mean": "0", "stdev": "1" }, { "name": "word_freq_edu", "index": "45", "type": "numeric", "distinct": "227", "missing": "0", "min": "0", "max": "22", "mean": "0", "stdev": "1" }, { "name": "word_freq_table", "index": "46", "type": "numeric", "distinct": "38", "missing": "0", "min": "0", "max": "2", "mean": "0", "stdev": "0" }, { "name": "word_freq_conference", "index": "47", "type": "numeric", "distinct": "106", "missing": "0", "min": "0", "max": "10", "mean": "0", "stdev": "0" }, { "name": "char_freq_%3B", "index": "48", "type": "numeric", "distinct": "313", "missing": 
"0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "char_freq_%28", "index": "49", "type": "numeric", "distinct": "641", "missing": "0", "min": "0", "max": "10", "mean": "0", "stdev": "0" }, { "name": "char_freq_%5B", "index": "50", "type": "numeric", "distinct": "225", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "char_freq_%21", "index": "51", "type": "numeric", "distinct": "964", "missing": "0", "min": "0", "max": "32", "mean": "0", "stdev": "1" }, { "name": "char_freq_%24", "index": "52", "type": "numeric", "distinct": "504", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "char_freq_%23", "index": "53", "type": "numeric", "distinct": "316", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "0" }, { "name": "capital_run_length_average", "index": "54", "type": "numeric", "distinct": "2161", "missing": "0", "min": "1", "max": "1103", "mean": "5", "stdev": "32" }, { "name": "capital_run_length_longest", "index": "55", "type": "numeric", "distinct": "271", "missing": "0", "min": "1", "max": "9989", "mean": "52", "stdev": "195" }, { "name": "capital_run_length_total", "index": "56", "type": "numeric", "distinct": "919", "missing": "0", "min": "1", "max": "15841", "mean": "283", "stdev": "606" }, { "name": "word_freq_free", "index": "15", "type": "numeric", "distinct": "253", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "1" }, { "name": "word_freq_address", "index": "1", "type": "numeric", "distinct": "171", "missing": "0", "min": "0", "max": "14", "mean": "0", "stdev": "1" }, { "name": "word_freq_all", "index": "2", "type": "numeric", "distinct": "214", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "1" }, { "name": "word_freq_3d", "index": "3", "type": "numeric", "distinct": "43", "missing": "0", "min": "0", "max": "43", "mean": "0", "stdev": "1" }, { "name": "word_freq_our", "index": "4", "type": "numeric", "distinct": "255", "missing": 
"0", "min": "0", "max": "10", "mean": "0", "stdev": "1" }, { "name": "word_freq_over", "index": "5", "type": "numeric", "distinct": "141", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "word_freq_remove", "index": "6", "type": "numeric", "distinct": "173", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_internet", "index": "7", "type": "numeric", "distinct": "170", "missing": "0", "min": "0", "max": "11", "mean": "0", "stdev": "0" }, { "name": "word_freq_order", "index": "8", "type": "numeric", "distinct": "144", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_mail", "index": "9", "type": "numeric", "distinct": "245", "missing": "0", "min": "0", "max": "18", "mean": "0", "stdev": "1" }, { "name": "word_freq_receive", "index": "10", "type": "numeric", "distinct": "113", "missing": "0", "min": "0", "max": "3", "mean": "0", "stdev": "0" }, { "name": "word_freq_will", "index": "11", "type": "numeric", "distinct": "316", "missing": "0", "min": "0", "max": "10", "mean": "1", "stdev": "1" }, { "name": "word_freq_people", "index": "12", "type": "numeric", "distinct": "158", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "word_freq_report", "index": "13", "type": "numeric", "distinct": "133", "missing": "0", "min": "0", "max": "10", "mean": "0", "stdev": "0" }, { "name": "word_freq_addresses", "index": "14", "type": "numeric", "distinct": "118", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "word_freq_make", "index": "0", "type": "numeric", "distinct": "142", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_business", "index": "16", "type": "numeric", "distinct": "197", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_email", "index": "17", "type": "numeric", "distinct": "229", "missing": "0", "min": "0", "max": 
"9", "mean": "0", "stdev": "1" }, { "name": "word_freq_you", "index": "18", "type": "numeric", "distinct": "575", "missing": "0", "min": "0", "max": "19", "mean": "2", "stdev": "2" }, { "name": "word_freq_credit", "index": "19", "type": "numeric", "distinct": "148", "missing": "0", "min": "0", "max": "18", "mean": "0", "stdev": "1" }, { "name": "word_freq_your", "index": "20", "type": "numeric", "distinct": "401", "missing": "0", "min": "0", "max": "11", "mean": "1", "stdev": "1" }, { "name": "word_freq_font", "index": "21", "type": "numeric", "distinct": "99", "missing": "0", "min": "0", "max": "17", "mean": "0", "stdev": "1" }, { "name": "word_freq_000", "index": "22", "type": "numeric", "distinct": "164", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_money", "index": "23", "type": "numeric", "distinct": "143", "missing": "0", "min": "0", "max": "13", "mean": "0", "stdev": "0" }, { "name": "word_freq_hp", "index": "24", "type": "numeric", "distinct": "395", "missing": "0", "min": "0", "max": "21", "mean": "1", "stdev": "2" }, { "name": "word_freq_hpl", "index": "25", "type": "numeric", "distinct": "281", "missing": "0", "min": "0", "max": "17", "mean": "0", "stdev": "1" }, { "name": "word_freq_george", "index": "26", "type": "numeric", "distinct": "240", "missing": "0", "min": "0", "max": "33", "mean": "1", "stdev": "3" }, { "name": "word_freq_650", "index": "27", "type": "numeric", "distinct": "200", "missing": "0", "min": "0", "max": "9", "mean": "0", "stdev": "1" }, { "name": "word_freq_lab", "index": "28", "type": "numeric", "distinct": "156", "missing": "0", "min": "0", "max": "14", "mean": "0", "stdev": "1" } ], "nr_of_issues": 0, "nr_of_downvotes": 0, "nr_of_likes": 0, "nr_of_downloads": 0, "total_downloads": 0, "reach": 0, "reuse": 0, "impact_of_reuse": 0, "reach_of_reuse": 0, "impact": 0 }