amazon-commerce-reviews_seed_4_nrows_2000_nclasses_10_ncols_100_stratify_True

Status: active. Format: ARFF. Visibility: public (publicly available). Uploaded 17-11-2022 by David Wilson.
Subsampling of the dataset amazon-commerce-reviews (1457) with seed=4, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        # NOTE: the mask is built from `classes`, not `selected_classes`,
        # so every row passes this filter and all classes are kept.
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
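The class-selection step in the listing builds its row mask with `y.isin(classes)` rather than `y.isin(selected_classes)`. The following dependency-free sketch illustrates, on toy labels, how the two masks differ. Everything in it is an assumption for illustration: the `select_classes` helper, the `class_<i>` label names, and the use of the standard `random` module as a stand-in for `rng.choice(..., replace=False, p=...)` (it does not reproduce NumPy's random stream).

```python
import random
from collections import Counter


def select_classes(labels, nclasses_max, seed):
    """Hypothetical helper: draw up to nclasses_max distinct classes,
    one at a time, each draw weighted by the class frequency."""
    rng = random.Random(seed)
    remaining = dict(Counter(labels))
    chosen = []
    while remaining and len(chosen) < nclasses_max:
        names = list(remaining)
        weights = [remaining[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # without replacement
    return chosen


# Toy labels: 50 classes with 30 rows each (1500 rows in total).
labels = [f"class_{i}" for i in range(50) for _ in range(30)]
selected = set(select_classes(labels, nclasses_max=10, seed=4))

# Mask built from every class: nothing is filtered out.
kept_with_all = [y for y in labels if y in set(labels)]
# Mask built from the sampled classes only.
kept_with_selected = [y for y in labels if y in selected]

print(len(set(kept_with_all)), len(kept_with_all))            # 50 1500
print(len(set(kept_with_selected)), len(kept_with_selected))  # 10 300
```

On this toy input the first mask keeps all 50 classes, which matches the 50 unique target values reported for this dataset even though nclasses=10 was requested.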

101 features

Class (target, nominal): 50 unique values, 0 missing
V268 (numeric): 8 unique values, 0 missing
V347 (numeric): 5 unique values, 0 missing
V420 (numeric): 6 unique values, 0 missing
V600 (numeric): 4 unique values, 0 missing
V713 (numeric): 3 unique values, 0 missing
V788 (numeric): 3 unique values, 0 missing
V896 (numeric): 3 unique values, 0 missing
V972 (numeric): 6 unique values, 0 missing
V1149 (numeric): 3 unique values, 0 missing
V1276 (numeric): 3 unique values, 0 missing
V1325 (numeric): 3 unique values, 0 missing
V1382 (numeric): 3 unique values, 0 missing
V1530 (numeric): 4 unique values, 0 missing
V1639 (numeric): 4 unique values, 0 missing
V1703 (numeric): 4 unique values, 0 missing
V1706 (numeric): 3 unique values, 0 missing
V1763 (numeric): 4 unique values, 0 missing
V1791 (numeric): 2 unique values, 0 missing
V1902 (numeric): 2 unique values, 0 missing
V1997 (numeric): 2 unique values, 0 missing
V2062 (numeric): 3 unique values, 0 missing
V2162 (numeric): 2 unique values, 0 missing
V2208 (numeric): 2 unique values, 0 missing
V2225 (numeric): 2 unique values, 0 missing
V2665 (numeric): 2 unique values, 0 missing
V2758 (numeric): 2 unique values, 0 missing
V2954 (numeric): 3 unique values, 0 missing
V3016 (numeric): 3 unique values, 0 missing
V3044 (numeric): 4 unique values, 0 missing
V3287 (numeric): 2 unique values, 0 missing
V3349 (numeric): 3 unique values, 0 missing
V3478 (numeric): 2 unique values, 0 missing
V3480 (numeric): 2 unique values, 0 missing
V3608 (numeric): 5 unique values, 0 missing
V3618 (numeric): 4 unique values, 0 missing
V3662 (numeric): 4 unique values, 0 missing
V3665 (numeric): 4 unique values, 0 missing
V3759 (numeric): 3 unique values, 0 missing
V3816 (numeric): 3 unique values, 0 missing
V4052 (numeric): 3 unique values, 0 missing
V4265 (numeric): 2 unique values, 0 missing
V4295 (numeric): 3 unique values, 0 missing
V4585 (numeric): 3 unique values, 0 missing
V4603 (numeric): 3 unique values, 0 missing
V4726 (numeric): 2 unique values, 0 missing
V4741 (numeric): 3 unique values, 0 missing
V4775 (numeric): 4 unique values, 0 missing
V4895 (numeric): 2 unique values, 0 missing
V4903 (numeric): 3 unique values, 0 missing
V4944 (numeric): 2 unique values, 0 missing
V4951 (numeric): 3 unique values, 0 missing
V4958 (numeric): 3 unique values, 0 missing
V4970 (numeric): 2 unique values, 0 missing
V5176 (numeric): 3 unique values, 0 missing
V5191 (numeric): 3 unique values, 0 missing
V5311 (numeric): 2 unique values, 0 missing
V5315 (numeric): 3 unique values, 0 missing
V5363 (numeric): 3 unique values, 0 missing
V5366 (numeric): 3 unique values, 0 missing
V5687 (numeric): 3 unique values, 0 missing
V5727 (numeric): 2 unique values, 0 missing
V5790 (numeric): 2 unique values, 0 missing
V5845 (numeric): 2 unique values, 0 missing
V6040 (numeric): 2 unique values, 0 missing
V6376 (numeric): 2 unique values, 0 missing
V6593 (numeric): 38 unique values, 0 missing
V6608 (numeric): 32 unique values, 0 missing
V6729 (numeric): 16 unique values, 0 missing
V6900 (numeric): 11 unique values, 0 missing
V6994 (numeric): 9 unique values, 0 missing
V7263 (numeric): 12 unique values, 0 missing
V7382 (numeric): 7 unique values, 0 missing
V7679 (numeric): 7 unique values, 0 missing
V7817 (numeric): 9 unique values, 0 missing
V8025 (numeric): 7 unique values, 0 missing
V8067 (numeric): 8 unique values, 0 missing
V8195 (numeric): 6 unique values, 0 missing
V8499 (numeric): 6 unique values, 0 missing
V8504 (numeric): 8 unique values, 0 missing
V8783 (numeric): 7 unique values, 0 missing
V8823 (numeric): 8 unique values, 0 missing
V8891 (numeric): 7 unique values, 0 missing
V8934 (numeric): 9 unique values, 0 missing
V8941 (numeric): 9 unique values, 0 missing
V9138 (numeric): 6 unique values, 0 missing
V9213 (numeric): 5 unique values, 0 missing
V9216 (numeric): 7 unique values, 0 missing
V9254 (numeric): 5 unique values, 0 missing
V9341 (numeric): 5 unique values, 0 missing
V9357 (numeric): 6 unique values, 0 missing
V9525 (numeric): 9 unique values, 0 missing
V9557 (numeric): 5 unique values, 0 missing
V9568 (numeric): 10 unique values, 0 missing
V9602 (numeric): 4 unique values, 0 missing
V9607 (numeric): 4 unique values, 0 missing
V9710 (numeric): 6 unique values, 0 missing
V9723 (numeric): 5 unique values, 0 missing
V9753 (numeric): 4 unique values, 0 missing
V9917 (numeric): 6 unique values, 0 missing
V9933 (numeric): 4 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 1500
Number of attributes (columns) of the dataset: 101
Number of distinct values of the target attribute (if it is nominal): 50
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 100
Number of nominal attributes: 1
Percentage of nominal attributes: 0.99
Average class difference between consecutive instances: 0.97
Percentage of numeric attributes: 99.01
Percentage of missing values: 0
Percentage of instances having missing values: 0
Percentage of binary attributes: 0
Number of binary attributes: 0
Number of instances belonging to the least frequent class: 30
Percentage of instances belonging to the least frequent class: 2
Number of instances belonging to the most frequent class: 30
Percentage of instances belonging to the most frequent class: 2
Number of attributes divided by the number of instances: 0.07
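Several of the properties above are derived from the raw counts, so they can be recomputed as a quick consistency check. The sketch below does only that arithmetic. The variable names are illustrative, and rounding to two decimals is an assumption about how the page displays the values.

```python
# Raw counts as reported in the property list.
n_rows = 1500
n_cols = 101
n_nominal = 1
n_numeric = 100
n_classes = 50
least_frequent = 30
most_frequent = 30

pct_nominal = round(100 * n_nominal / n_cols, 2)        # 0.99
pct_numeric = round(100 * n_numeric / n_cols, 2)        # 99.01
pct_least = round(100 * least_frequent / n_rows, 2)     # 2.0
pct_most = round(100 * most_frequent / n_rows, 2)       # 2.0
attrs_per_instance = round(n_cols / n_rows, 2)          # 0.07

# With the smallest and largest class both at 30 rows, the target is
# perfectly balanced: 50 classes * 30 rows = 1500 rows.
is_balanced = least_frequent == most_frequent and n_classes * least_frequent == n_rows

print(pct_nominal, pct_numeric, pct_least, pct_most, attrs_per_instance, is_balanced)
```

The row count of 1500 is below the requested nrows=2000, so the row-subsampling branch of the generating code (`len(x) > nrows_max`) would not have been taken for this dataset.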

0 tasks