Census-(Augmented)

Status: active. Format: ARFF. License: CC0 (Public Domain). Visibility: public. Uploaded 23-03-2022 by Mark Murphy.


This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over 50K a year.

Description of fnlwgt (final weight)

The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are:

1. A single cell estimate of the population 16+ for each state.
2. Controls for Hispanic Origin by age and sex.
3. Controls by Race, age and sex.

We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population.

People with similar demographic characteristics should have similar weights. There is one important caveat to this statement: since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within a state.

Relevant papers

Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.

NEW: CTGAN was used to generate more data.
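The "raking" step mentioned in the fnlwgt description can be illustrated with a small sketch of iterative proportional fitting. This is not the Census Bureau's weighting program: the function name `rake`, the 2x2 table, and the control totals are all hypothetical, and the sketch uses two sets of controls rather than the three listed above. It only shows the idea of repeatedly rescaling weighted tallies until each set of control totals is matched.

```python
def rake(table, row_controls, col_controls, passes=6):
    """Iterative proportional fitting on a 2-D table of weighted tallies.

    Each pass rescales the rows to match row_controls, then the columns
    to match col_controls. All names and inputs here are illustrative.
    """
    weights = [row[:] for row in table]  # work on a copy
    for _ in range(passes):
        # Scale every row so its total equals the row control.
        for i, target in enumerate(row_controls):
            total = sum(weights[i])
            if total > 0:
                factor = target / total
                weights[i] = [w * factor for w in weights[i]]
        # Scale every column so its total equals the column control.
        for j, target in enumerate(col_controls):
            total = sum(row[j] for row in weights)
            if total > 0:
                factor = target / total
                for row in weights:
                    row[j] *= factor
    return weights


# Hypothetical toy input: weighted sample tallies and two sets of controls.
sample_tallies = [[30.0, 20.0], [25.0, 25.0]]
row_controls = [60.0, 40.0]   # e.g. totals for two age groups (made up)
col_controls = [55.0, 45.0]   # e.g. totals for two sex groups (made up)

raked = rake(sample_tallies, row_controls, col_controls)
row_sums = [sum(row) for row in raked]
col_sums = [sum(row[j] for row in raked) for j in range(2)]
```

After six passes the row and column sums of `raked` both sit very close to their controls, which is the sense in which the procedure "comes back to all the controls".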

15 features

income (target), string: 2 unique values, 0 missing
age, numeric: 997820 unique values, 0 missing
workclass, string: 9 unique values, 0 missing
fnlwgt, numeric: 997695 unique values, 0 missing
education, string: 16 unique values, 0 missing
education-num, numeric: 997784 unique values, 0 missing
marital-status, string: 7 unique values, 0 missing
occupation, string: 15 unique values, 0 missing
relationship, string: 6 unique values, 0 missing
race, string: 5 unique values, 0 missing
sex, string: 2 unique values, 0 missing
capital-gain, numeric: 990143 unique values, 0 missing
capital-loss, numeric: 987380 unique values, 0 missing
hours-per-week, numeric: 995059 unique values, 0 missing
native-country, string: 42 unique values, 0 missing
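A minimal sketch of how the 15 columns listed above might be typed when reading records. The column names and types come from the feature list; the helper `parse_record` and the example record are hypothetical, and parsing numeric columns as floats is an assumption based on their near-unique value counts (which suggest continuous synthetic values rather than integers).

```python
# Column names and types as given in the feature list.
NUMERIC_FEATURES = [
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
]
STRING_FEATURES = [
    "workclass", "education", "marital-status", "occupation",
    "relationship", "race", "sex", "native-country",
]
TARGET = "income"
ALL_COLUMNS = [TARGET] + NUMERIC_FEATURES + STRING_FEATURES


def parse_record(raw):
    """Cast a dict of raw field values to typed values (hypothetical helper)."""
    missing = [name for name in ALL_COLUMNS if name not in raw]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    record = {}
    for name in ALL_COLUMNS:
        value = raw[name]
        if name in NUMERIC_FEATURES:
            record[name] = float(value)        # assumption: continuous values
        else:
            record[name] = str(value).strip()  # string columns kept as text
    return record


# Made-up record for illustration only; not taken from the dataset.
example = {
    "income": "<=50K", "age": "38.7", "fnlwgt": "215646.2",
    "education-num": "9.1", "capital-gain": "0.4", "capital-loss": "0.0",
    "hours-per-week": "40.3", "workclass": "Private", "education": "HS-grad",
    "marital-status": "Divorced", "occupation": "Sales",
    "relationship": "Not-in-family", "race": "White", "sex": "Male",
    "native-country": "United-States",
}
parsed = parse_record(example)
```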

19 properties

Number of instances (rows) of the dataset: 1000000
Number of attributes (columns) of the dataset: 15
Number of distinct values of the target attribute (if it is nominal): 2
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 6
Number of nominal attributes: 0
Percentage of nominal attributes: 0
Average class difference between consecutive instances: 1
Percentage of numeric attributes: 40
Percentage of missing values: 0
Percentage of instances having missing values: 0
Percentage of binary attributes: 0
Number of binary attributes: 0
Number of instances belonging to the least frequent class: 315334
Percentage of instances belonging to the least frequent class: 31.53
Number of instances belonging to the most frequent class: 684666
Percentage of instances belonging to the most frequent class: 68.47
Number of attributes divided by the number of instances: 0
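The derived properties above follow from the raw counts by simple arithmetic. The sketch below recomputes them; the variable names are illustrative, and the majority-class baseline is an added interpretation (the accuracy of always predicting the most frequent class), not a figure reported on the page.

```python
# Raw counts as reported in the properties list.
n_instances = 1_000_000
n_attributes = 15
n_numeric = 6
n_least_frequent = 315_334
n_most_frequent = 684_666

# Class shares, in percent, rounded as on the page.
pct_least = round(100 * n_least_frequent / n_instances, 2)
pct_most = round(100 * n_most_frequent / n_instances, 2)

# Share of numeric attributes, in percent.
pct_numeric = round(100 * n_numeric / n_attributes, 2)

# Attributes per instance; tiny, so it shows as 0 when rounded.
dimensionality = n_attributes / n_instances

# Accuracy of always predicting the most frequent class (interpretation,
# not a property reported by the page).
majority_baseline = n_most_frequent / n_instances
```

Any classifier evaluated on this data would need to beat the majority-class baseline to be useful.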

0 tasks