amazon-commerce-reviews_seed_1_nrows_2000_nclasses_10_ncols_100_stratify_True

active | ARFF | Publicly available | Visibility: public | Uploaded 17-11-2022 by David Wilson
Subsampling of the dataset amazon-commerce-reviews (1457) with seed=1 args.nrows=2000 args.ncols=100 args.nclasses=10 args.no_stratify=True

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
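The class-selection step in the source code above draws `selected_classes` but then filters with `y.isin(classes)`, which keeps every class. That would be consistent with the target on this page still having 50 unique values although nclasses=10 was requested. The following is a minimal sketch of that step, assuming the intent was to filter on the sampled classes. `select_classes` is a hypothetical helper written for illustration and is not part of the original source; the toy labels are made up to mirror the class counts listed on this page.

```python
import numpy as np
import pandas as pd


def select_classes(y: pd.Series, nclasses_max: int, seed: int) -> pd.Series:
    """Keep only rows whose label is among `nclasses_max` sampled classes."""
    rng = np.random.default_rng(seed)
    vcs = y.value_counts()
    if len(vcs) <= nclasses_max:
        return y
    # Sample classes without replacement, weighted by class frequency.
    # Classes and probabilities both come from `vcs`, so they stay aligned.
    selected = rng.choice(
        vcs.index.to_numpy(),
        size=nclasses_max,
        replace=False,
        p=(vcs / vcs.sum()).to_numpy(),
    )
    # Assumption: filter on the sampled classes, not on the full class list.
    return y[y.isin(selected)]


# Toy labels: 50 classes with 30 rows each (illustrative, not the real data).
y = pd.Series(np.repeat(np.arange(50), 30), name="Class")
y_small = select_classes(y, nclasses_max=10, seed=1)
print(y_small.nunique(), len(y_small))  # 10 300
```

With boolean-mask indexing as above, the matching feature rows could be taken with `x[y.isin(selected)]`, which avoids mixing index labels with positional `iloc` lookups.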

101 features

Class (target): nominal, 50 unique values, 0 missing
V197: numeric, 4 unique values, 0 missing
V395: numeric, 5 unique values, 0 missing
V540: numeric, 3 unique values, 0 missing
V616: numeric, 3 unique values, 0 missing
V621: numeric, 8 unique values, 0 missing
V689: numeric, 3 unique values, 0 missing
V815: numeric, 3 unique values, 0 missing
V915: numeric, 3 unique values, 0 missing
V1152: numeric, 3 unique values, 0 missing
V1161: numeric, 5 unique values, 0 missing
V1228: numeric, 5 unique values, 0 missing
V1233: numeric, 3 unique values, 0 missing
V1329: numeric, 3 unique values, 0 missing
V1476: numeric, 3 unique values, 0 missing
V1597: numeric, 3 unique values, 0 missing
V1911: numeric, 3 unique values, 0 missing
V1914: numeric, 3 unique values, 0 missing
V2018: numeric, 3 unique values, 0 missing
V2141: numeric, 4 unique values, 0 missing
V2199: numeric, 4 unique values, 0 missing
V2593: numeric, 2 unique values, 0 missing
V2602: numeric, 3 unique values, 0 missing
V2741: numeric, 2 unique values, 0 missing
V2751: numeric, 3 unique values, 0 missing
V2783: numeric, 3 unique values, 0 missing
V2912: numeric, 3 unique values, 0 missing
V2913: numeric, 2 unique values, 0 missing
V3005: numeric, 2 unique values, 0 missing
V3105: numeric, 2 unique values, 0 missing
V3206: numeric, 2 unique values, 0 missing
V3266: numeric, 3 unique values, 0 missing
V3435: numeric, 2 unique values, 0 missing
V3576: numeric, 6 unique values, 0 missing
V3610: numeric, 5 unique values, 0 missing
V3665: numeric, 4 unique values, 0 missing
V3800: numeric, 3 unique values, 0 missing
V3997: numeric, 3 unique values, 0 missing
V4186: numeric, 2 unique values, 0 missing
V4222: numeric, 2 unique values, 0 missing
V4253: numeric, 2 unique values, 0 missing
V4484: numeric, 3 unique values, 0 missing
V4495: numeric, 3 unique values, 0 missing
V4531: numeric, 2 unique values, 0 missing
V4553: numeric, 3 unique values, 0 missing
V4574: numeric, 3 unique values, 0 missing
V4719: numeric, 3 unique values, 0 missing
V4790: numeric, 3 unique values, 0 missing
V4816: numeric, 2 unique values, 0 missing
V4945: numeric, 2 unique values, 0 missing
V4981: numeric, 2 unique values, 0 missing
V5080: numeric, 3 unique values, 0 missing
V5095: numeric, 2 unique values, 0 missing
V5130: numeric, 2 unique values, 0 missing
V5262: numeric, 3 unique values, 0 missing
V5329: numeric, 2 unique values, 0 missing
V5376: numeric, 3 unique values, 0 missing
V5439: numeric, 2 unique values, 0 missing
V5545: numeric, 2 unique values, 0 missing
V5798: numeric, 2 unique values, 0 missing
V5909: numeric, 2 unique values, 0 missing
V6099: numeric, 3 unique values, 0 missing
V6174: numeric, 3 unique values, 0 missing
V6200: numeric, 3 unique values, 0 missing
V6388: numeric, 2 unique values, 0 missing
V6672: numeric, 21 unique values, 0 missing
V6718: numeric, 18 unique values, 0 missing
V6820: numeric, 14 unique values, 0 missing
V7152: numeric, 9 unique values, 0 missing
V7198: numeric, 8 unique values, 0 missing
V7212: numeric, 11 unique values, 0 missing
V7437: numeric, 9 unique values, 0 missing
V7444: numeric, 9 unique values, 0 missing
V7511: numeric, 10 unique values, 0 missing
V7581: numeric, 10 unique values, 0 missing
V7671: numeric, 8 unique values, 0 missing
V7689: numeric, 9 unique values, 0 missing
V7725: numeric, 11 unique values, 0 missing
V7811: numeric, 11 unique values, 0 missing
V7857: numeric, 6 unique values, 0 missing
V8013: numeric, 10 unique values, 0 missing
V8017: numeric, 8 unique values, 0 missing
V8097: numeric, 9 unique values, 0 missing
V8179: numeric, 13 unique values, 0 missing
V8296: numeric, 6 unique values, 0 missing
V8372: numeric, 6 unique values, 0 missing
V8396: numeric, 6 unique values, 0 missing
V8494: numeric, 6 unique values, 0 missing
V8546: numeric, 8 unique values, 0 missing
V8608: numeric, 9 unique values, 0 missing
V8682: numeric, 5 unique values, 0 missing
V8719: numeric, 7 unique values, 0 missing
V8762: numeric, 9 unique values, 0 missing
V8962: numeric, 8 unique values, 0 missing
V9128: numeric, 5 unique values, 0 missing
V9188: numeric, 5 unique values, 0 missing
V9548: numeric, 7 unique values, 0 missing
V9640: numeric, 4 unique values, 0 missing
V9683: numeric, 4 unique values, 0 missing
V9735: numeric, 4 unique values, 0 missing
V9783: numeric, 7 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 1500
Number of attributes (columns) of the dataset: 101
Number of distinct values of the target attribute (if it is nominal): 50
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 100
Number of nominal attributes: 1
Percentage of nominal attributes: 0.99
Average class difference between consecutive instances: 0.97
Percentage of numeric attributes: 99.01
Percentage of missing values: 0
Percentage of instances having missing values: 0
Percentage of binary attributes: 0
Number of binary attributes: 0
Number of instances belonging to the least frequent class: 30
Percentage of instances belonging to the least frequent class: 2
Number of instances belonging to the most frequent class: 30
Percentage of instances belonging to the most frequent class: 2
Number of attributes divided by the number of instances: 0.07
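The derived properties above follow from the raw counts by simple arithmetic. As a quick sanity check, the sketch below recomputes them from the counts listed on this page. The variable names are illustrative, and rounding to two decimals is an assumption about how the page displays the values.

```python
# Raw counts copied from the property list on this page.
n_rows = 1500
n_cols = 101
n_numeric = 100
n_nominal = 1
n_classes = 50
class_size = 30  # least and most frequent class have the same size

# Derived properties, rounded to two decimals (assumed display precision).
pct_nominal = round(100 * n_nominal / n_cols, 2)
pct_numeric = round(100 * n_numeric / n_cols, 2)
pct_class = round(100 * class_size / n_rows, 2)
dimensionality = round(n_cols / n_rows, 2)

# With equal minimum and maximum class sizes, the classes are balanced,
# so the class count times the class size should equal the row count.
is_balanced = n_classes * class_size == n_rows

print(pct_nominal, pct_numeric, pct_class, dimensionality, is_balanced)
# 0.99 99.01 2.0 0.07 True
```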

0 tasks
