pc4_seed_0_nrows_2000_nclasses_10_ncols_100_stratify_True


active, ARFF, Publicly available, Visibility: public, Uploaded 17-11-2022 by David Wilson


Subsampling of the dataset pc4 (1049) with seed=0, args.nrows=2000, args.ncols=100, args.nclasses=10, args.no_stratify=True.

Generated with the following source code:

```python
def subsample(
    self,
    seed: int,
    nrows_max: int = 2_000,
    ncols_max: int = 100,
    nclasses_max: int = 10,
    stratified: bool = True,
) -> Dataset:
    rng = np.random.default_rng(seed)

    x = self.x
    y = self.y

    # Uniformly sample
    classes = y.unique()
    if len(classes) > nclasses_max:
        vcs = y.value_counts()
        selected_classes = rng.choice(
            classes,
            size=nclasses_max,
            replace=False,
            p=vcs / sum(vcs),
        )

        # Select the indices where one of these classes is present
        idxs = y.index[y.isin(classes)]
        x = x.iloc[idxs]
        y = y.iloc[idxs]

    # Uniformly sample columns if required
    if len(x.columns) > ncols_max:
        columns_idxs = rng.choice(
            list(range(len(x.columns))), size=ncols_max, replace=False
        )
        sorted_column_idxs = sorted(columns_idxs)
        selected_columns = list(x.columns[sorted_column_idxs])
        x = x[selected_columns]
    else:
        sorted_column_idxs = list(range(len(x.columns)))

    if len(x) > nrows_max:
        # Stratify accordingly
        target_name = y.name
        data = pd.concat((x, y), axis="columns")
        _, subset = train_test_split(
            data,
            test_size=nrows_max,
            stratify=data[target_name],
            shuffle=True,
            random_state=seed,
        )
        x = subset.drop(target_name, axis="columns")
        y = subset[target_name]

    # We need to convert categorical columns to string for openml
    categorical_mask = [self.categorical_mask[i] for i in sorted_column_idxs]
    columns = list(x.columns)

    return Dataset(
        # Technically this is not the same but it's where it was derived from
        dataset=self.dataset,
        x=x,
        y=y,
        categorical_mask=categorical_mask,
        columns=columns,
    )
```
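The row-subsampling step in the source code relies on scikit-learn's `train_test_split` with `stratify=...` to keep class proportions roughly intact. As a rough illustration of that idea only, the sketch below uses a standard-library stand-in (largest-remainder allocation plus `random.sample`); it is not the code that produced this dataset, and the function name `stratified_subsample` and the toy labels `"majority"`/`"minority"` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_max, seed=0):
    """Return sorted row indices of a class-proportional sample of at most n_max rows."""
    if len(labels) <= n_max:
        # Nothing to cut, mirroring the `len(x) > nrows_max` guard in the source.
        return list(range(len(labels)))

    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Largest-remainder allocation: floor each class quota, then give the
    # leftover rows to the classes with the largest fractional parts.
    exact = {c: len(rows) * n_max / len(labels) for c, rows in by_class.items()}
    quota = {c: int(v) for c, v in exact.items()}
    leftover = n_max - sum(quota.values())
    ranked = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in ranked[:leftover]:
        quota[c] += 1

    chosen = []
    for c, rows in by_class.items():
        chosen.extend(rng.sample(rows, quota[c]))
    return sorted(chosen)


# Toy labels with the class counts listed for this dataset (1280 vs 178 rows).
labels = ["majority"] * 1280 + ["minority"] * 178
subset = stratified_subsample(labels, n_max=500, seed=0)
minority_rows = sum(1 for i in subset if labels[i] == "minority")
```

With 1458 rows and the default `nrows_max` of 2000, the guard returns every row unchanged, which is consistent with the instance count reported for this dataset.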

38 features

c (target): nominal, 2 unique values, 0 missing
HALSTEAD_ERROR_EST: numeric, 120 unique values, 0 missing
HALSTEAD_EFFORT: numeric, 1165 unique values, 0 missing
HALSTEAD_LENGTH: numeric, 336 unique values, 0 missing
HALSTEAD_LEVEL: numeric, 40 unique values, 0 missing
HALSTEAD_PROG_TIME: numeric, 1159 unique values, 0 missing
HALSTEAD_VOLUME: numeric, 941 unique values, 0 missing
MAINTENANCE_SEVERITY: numeric, 74 unique values, 0 missing
MODIFIED_CONDITION_COUNT: numeric, 28 unique values, 0 missing
MULTIPLE_CONDITION_COUNT: numeric, 40 unique values, 0 missing
NODE_COUNT: numeric, 89 unique values, 0 missing
NORMALIZED_CYLOMATIC_COMPLEXITY: numeric, 67 unique values, 0 missing
NUM_OPERANDS: numeric, 184 unique values, 0 missing
NUM_OPERATORS: numeric, 245 unique values, 0 missing
NUM_UNIQUE_OPERANDS: numeric, 71 unique values, 0 missing
NUM_UNIQUE_OPERATORS: numeric, 38 unique values, 0 missing
NUMBER_OF_LINES: numeric, 171 unique values, 0 missing
PERCENT_COMMENTS: numeric, 394 unique values, 0 missing
LOC_TOTAL: numeric, 116 unique values, 0 missing
DESIGN_COMPLEXITY: numeric, 31 unique values, 0 missing
BRANCH_COUNT: numeric, 61 unique values, 0 missing
CALL_PAIRS: numeric, 22 unique values, 0 missing
LOC_CODE_AND_COMMENT: numeric, 36 unique values, 0 missing
LOC_COMMENTS: numeric, 57 unique values, 0 missing
CONDITION_COUNT: numeric, 41 unique values, 0 missing
CYCLOMATIC_COMPLEXITY: numeric, 43 unique values, 0 missing
CYCLOMATIC_DENSITY: numeric, 70 unique values, 0 missing
DECISION_COUNT: numeric, 23 unique values, 0 missing
DECISION_DENSITY: numeric, 5 unique values, 0 missing
LOC_BLANK: numeric, 54 unique values, 0 missing
DESIGN_DENSITY: numeric, 76 unique values, 0 missing
EDGE_COUNT: numeric, 105 unique values, 0 missing
ESSENTIAL_COMPLEXITY: numeric, 25 unique values, 0 missing
ESSENTIAL_DENSITY: numeric, 2 unique values, 0 missing
LOC_EXECUTABLE: numeric, 107 unique values, 0 missing
PARAMETER_COUNT: numeric, 8 unique values, 0 missing
HALSTEAD_CONTENT: numeric, 1021 unique values, 0 missing
HALSTEAD_DIFFICULTY: numeric, 708 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 1458
Number of attributes (columns) of the dataset: 38
Number of distinct values of the target attribute (if it is nominal): 2
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 37
Number of nominal attributes: 1
Percentage of nominal attributes: 2.63
Average class difference between consecutive instances: 0.78
Percentage of numeric attributes: 97.37
Percentage of missing values: 0
Percentage of instances having missing values: 0
Percentage of binary attributes: 2.63
Number of binary attributes: 1
Number of instances belonging to the least frequent class: 178
Percentage of instances belonging to the least frequent class: 12.21
Number of instances belonging to the most frequent class: 1280
Percentage of instances belonging to the most frequent class: 87.79
Number of attributes divided by the number of instances: 0.03
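
The percentage and ratio properties follow directly from the raw counts listed here. A minimal sketch of that arithmetic, assuming values are rounded to two decimals (the helper name `derived_properties` is hypothetical and not part of OpenML):

```python
def derived_properties(n_rows, n_cols, n_nominal, n_numeric, minority, majority):
    """Recompute the percentage/ratio properties from the raw counts."""
    return {
        "pct_nominal": round(100 * n_nominal / n_cols, 2),
        "pct_numeric": round(100 * n_numeric / n_cols, 2),
        "pct_minority_class": round(100 * minority / n_rows, 2),
        "pct_majority_class": round(100 * majority / n_rows, 2),
        "attrs_per_instance": round(n_cols / n_rows, 2),
    }


# Raw counts as reported for this dataset.
props = derived_properties(
    n_rows=1458, n_cols=38, n_nominal=1, n_numeric=37, minority=178, majority=1280
)
```

The two class counts sum to the instance count (178 + 1280 = 1458), so the class percentages are complementary.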

0 tasks
