porto-seguro_seed_1_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active; format: ARFF; publicly available; visibility: public; uploaded 17-11-2022 by David Wilson


Subsampling of the dataset porto-seguro (42742) with seed=1, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Sample classes without replacement, weighted by value counts
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the rows where one of the selected classes is present
        idxs = y.index[y.isin(selected_classes)]
        x = x.loc[idxs]
        y = y.loc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify on the target when requested
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name] if stratified else None,
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
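The row-sampling step in the source code above uses scikit-learn's `train_test_split` with `stratify` set to the target, taking the "test" split as the subsample. The following is a minimal sketch of that one step on synthetic data. The helper name `stratified_subsample`, the toy frame, and its column names are assumptions made for illustration, not part of the original code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_subsample(
    data: pd.DataFrame, target: str, nrows: int, seed: int
) -> pd.DataFrame:
    """Return `nrows` rows of `data`, keeping the class proportions of `target`."""
    # The "test" part of the split is used as the subsample.
    _, subset = train_test_split(
        data,
        test_size=nrows,
        stratify=data[target],
        shuffle=True,
        random_state=seed,
    )
    return subset


# Hypothetical stand-in data: 20,000 rows, 4% of them in the positive class.
toy = pd.DataFrame(
    {
        "feature": range(20_000),
        "target": [1] * 800 + [0] * 19_200,
    }
)

subset = stratified_subsample(toy, "target", nrows=2_000, seed=1)
print(len(subset))
print(toy["target"].mean(), subset["target"].mean())
```

Because the split is stratified, the share of the minority class in the subsample should stay close to its share in the full data, which is why a strongly imbalanced target remains imbalanced after subsampling.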

58 features

| Feature | Type | Unique values | Missing |
|---|---|---|---|
| target (target) | nominal | 2 | 0 |
| ps_car_10_cat | nominal | 2 | 0 |
| ps_car_09_cat | nominal | 5 | 3 |
| ps_car_11_cat | nominal | 104 | 0 |
| ps_car_11 | numeric | 4 | 0 |
| ps_car_12 | numeric | 56 | 0 |
| ps_car_13 | numeric | 1847 | 0 |
| ps_car_14 | numeric | 348 | 140 |
| ps_car_15 | numeric | 15 | 0 |
| ps_calc_01 | numeric | 10 | 0 |
| ps_calc_02 | numeric | 10 | 0 |
| ps_calc_03 | numeric | 10 | 0 |
| ps_calc_04 | numeric | 6 | 0 |
| ps_calc_05 | numeric | 7 | 0 |
| ps_calc_06 | numeric | 8 | 0 |
| ps_calc_07 | numeric | 9 | 0 |
| ps_calc_08 | numeric | 9 | 0 |
| ps_calc_09 | numeric | 7 | 0 |
| ps_calc_10 | numeric | 20 | 0 |
| ps_calc_11 | numeric | 16 | 0 |
| ps_calc_12 | numeric | 8 | 0 |
| ps_calc_13 | numeric | 11 | 0 |
| ps_calc_14 | numeric | 18 | 0 |
| ps_calc_15_bin | nominal | 2 | 0 |
| ps_calc_16_bin | nominal | 2 | 0 |
| ps_calc_17_bin | nominal | 2 | 0 |
| ps_calc_18_bin | nominal | 2 | 0 |
| ps_calc_19_bin | nominal | 2 | 0 |
| ps_calc_20_bin | nominal | 2 | 0 |
| ps_ind_16_bin | nominal | 2 | 0 |
| ps_ind_02_cat | nominal | 4 | 2 |
| ps_ind_03 | numeric | 12 | 0 |
| ps_ind_04_cat | nominal | 2 | 0 |
| ps_ind_05_cat | nominal | 7 | 18 |
| ps_ind_06_bin | nominal | 2 | 0 |
| ps_ind_07_bin | nominal | 2 | 0 |
| ps_ind_08_bin | nominal | 2 | 0 |
| ps_ind_09_bin | nominal | 2 | 0 |
| ps_ind_10_bin | nominal | 1 | 0 |
| ps_ind_11_bin | nominal | 1 | 0 |
| ps_ind_12_bin | nominal | 2 | 0 |
| ps_ind_13_bin | nominal | 2 | 0 |
| ps_ind_14 | numeric | 2 | 0 |
| ps_ind_15 | numeric | 14 | 0 |
| ps_ind_01 | numeric | 8 | 0 |
| ps_ind_17_bin | nominal | 2 | 0 |
| ps_ind_18_bin | nominal | 2 | 0 |
| ps_reg_01 | numeric | 10 | 0 |
| ps_reg_02 | numeric | 19 | 0 |
| ps_reg_03 | numeric | 1144 | 347 |
| ps_car_01_cat | nominal | 12 | 0 |
| ps_car_02_cat | nominal | 2 | 0 |
| ps_car_03_cat | nominal | 2 | 1361 |
| ps_car_04_cat | nominal | 9 | 0 |
| ps_car_05_cat | nominal | 2 | 866 |
| ps_car_06_cat | nominal | 18 | 0 |
| ps_car_07_cat | nominal | 2 | 38 |
| ps_car_08_cat | nominal | 2 | 0 |
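A per-feature summary like the one above (type, unique values, missing count) can be approximated locally with pandas. The sketch below is an illustration only: the helper `summarize_features` and the toy columns are hypothetical, and it assumes that unique values are counted without missing entries and that any non-numeric dtype counts as nominal, which may differ from how OpenML computes these figures.

```python
import pandas as pd


def summarize_features(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column as type, number of unique values, and missing count."""
    rows = []
    for name in df.columns:
        col = df[name]
        # Assumption: numeric dtypes are "numeric", everything else is "nominal".
        kind = "numeric" if pd.api.types.is_numeric_dtype(col) else "nominal"
        rows.append(
            {
                "feature": name,
                "type": kind,
                "unique_values": int(col.nunique(dropna=True)),
                "missing": int(col.isna().sum()),
            }
        )
    return pd.DataFrame(rows)


# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame(
    {
        "toy_num": [0.1, 0.2, None, 0.2],
        "toy_cat": pd.Categorical(["a", "b", "a", None]),
    }
)

summary = summarize_features(toy)
print(summary)
```

Summing the `missing` column of such a summary gives the total number of missing values in the frame.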

19 properties

| Property | Value |
|---|---|
| Number of instances (rows) of the dataset. | 2000 |
| Number of attributes (columns) of the dataset. | 58 |
| Number of distinct values of the target attribute (if it is nominal). | 2 |
| Number of missing values in the dataset. | 2775 |
| Number of instances with at least one value missing. | 1545 |
| Number of numeric attributes. | 26 |
| Number of nominal attributes. | 32 |
| Percentage of nominal attributes. | 55.17 |
| Average class difference between consecutive instances. | 0.93 |
| Percentage of numeric attributes. | 44.83 |
| Percentage of missing values. | 2.39 |
| Percentage of instances having missing values. | 77.25 |
| Percentage of binary attributes. | 41.38 |
| Number of binary attributes. | 24 |
| Number of instances belonging to the least frequent class. | 73 |
| Percentage of instances belonging to the least frequent class. | 3.65 |
| Number of instances belonging to the most frequent class. | 1927 |
| Percentage of instances belonging to the most frequent class. | 96.35 |
| Number of attributes divided by the number of instances. | 0.03 |
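The percentage and ratio properties follow arithmetically from the counts listed above. The short sketch below recomputes them as a consistency check. The variable and key names are chosen here for illustration, and the assumption that the missing-value percentage is taken over all cells (rows times columns) is inferred from the reported numbers.

```python
# Counts as reported in the properties above.
n_rows = 2000
n_cols = 58
n_missing_values = 2775
n_rows_with_missing = 1545
n_numeric = 26
n_nominal = 32
n_binary = 24
minority_count = 73
majority_count = 1927

derived = {
    "pct_nominal": 100 * n_nominal / n_cols,
    "pct_numeric": 100 * n_numeric / n_cols,
    "pct_binary": 100 * n_binary / n_cols,
    # Assumption: the denominator is the total number of cells.
    "pct_missing_values": 100 * n_missing_values / (n_rows * n_cols),
    "pct_rows_with_missing": 100 * n_rows_with_missing / n_rows,
    "pct_minority_class": 100 * minority_count / n_rows,
    "pct_majority_class": 100 * majority_count / n_rows,
    "attributes_per_instance": n_cols / n_rows,
}

for name, value in derived.items():
    print(f"{name}: {value:.2f}")
```

The two class counts also add up to the number of instances, as expected for a binary target.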

0 tasks