nip.code_validation.data.CodeValidationDataset#
- class nip.code_validation.data.CodeValidationDataset(hyper_params: HyperParameters, settings: ExperimentSettings, protocol_handler: ProtocolHandler, split: Literal['train', 'test', 'validation'] = 'train')[source]#
Base class for the code validation datasets.
Works with HuggingFace datasets.
The dataset should have the following columns:
- “question”: The question text.
- “solution”: The solution text.
- “y”: The label: 1 for correct solutions, 0 for buggy solutions.
In addition, each datapoint should receive a “prover_stance” field: the verdict that the prover should argue for in single-prover settings, under the appropriate hyper-parameters. This can be computed from a hash of the solution text.
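To illustrate the expected schema, here is a minimal sketch. The column names (“question”, “solution”, “y”) come from the documentation above; the specific hashing scheme and the example data are assumptions for illustration, not the library's actual implementation.

```python
import hashlib

# Hypothetical datapoints following the documented schema.
examples = [
    {"question": "Does this function sort a list?",
     "solution": "def f(xs): return sorted(xs)",
     "y": 1},
    {"question": "Does this function sort a list?",
     "solution": "def f(xs): return xs",
     "y": 0},
]

def prover_stance(solution: str) -> int:
    """Derive a deterministic verdict (0 or 1) from a hash of the solution text.

    This mirrors the idea described in the docs ("computed from the hash of
    the solution text"); the exact scheme here is an assumption.
    """
    digest = hashlib.sha256(solution.encode("utf-8")).digest()
    return digest[0] % 2

for example in examples:
    example["prover_stance"] = prover_stance(example["solution"])
```

Because the stance is derived from a hash, the same solution text always yields the same stance, independent of dataset ordering or random seeds.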
Methods Summary
- __getitem__(index)
- __init__(hyper_params, settings, ...[, split])
- __len__()
- __repr__(): Return repr(self).
- _load_raw_dataset(): Load the dataset.
- _process_data(raw_dataset): Process the dataset.
- _reduce_dataset_size(dataset): Reduce the size of a dataset if necessary.
Attributes
- dataset_filepath_name: The name of the dataset file.
- instance_keys: The keys specifying the input instance.
- keys: The keys (field names) in the dataset.
- max_test_size: The maximum size of the test set.
- max_train_size: The maximum size of the training set.
- processed_dir: The path to the directory containing the processed data.
- raw_dir: The path to the directory containing the raw data.
- reduce_shuffle_seed: The seed used to shuffle the dataset before reducing its size.
- split_dir: The name of the folder containing the split data.
- validation_proportion: The proportion of the training set to use for validation.
Methods
- __getitem__(index: Any) → NestedArrayDict[source]#
- __init__(hyper_params: HyperParameters, settings: ExperimentSettings, protocol_handler: ProtocolHandler, split: Literal['train', 'test', 'validation'] = 'train')[source]#
- abstract _load_raw_dataset() → Dataset[source]#
Load the dataset.
- Returns:
raw_data (HuggingFaceDataset) – The unprocessed dataset.
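Subclasses must implement this abstract method to return the unprocessed dataset. The following sketch shows the override pattern; `CodeValidationDatasetStub` is a stand-in base class so the sketch runs standalone, and the subclass name and data are hypothetical. In practice you would subclass `nip.code_validation.data.CodeValidationDataset` and return a HuggingFace `datasets.Dataset`.

```python
class CodeValidationDatasetStub:
    """Stand-in for CodeValidationDataset, for illustration only."""

class ToyCodeValidationDataset(CodeValidationDatasetStub):  # hypothetical name
    def _load_raw_dataset(self):
        # In the real implementation this would return a HuggingFace Dataset
        # with the documented columns, e.g. via datasets.load_dataset(...).
        # A list of dicts with the same columns stands in here.
        return [
            {"question": "Is this addition correct?",
             "solution": "def add(a, b): return a + b",
             "y": 1},
        ]

raw_data = ToyCodeValidationDataset()._load_raw_dataset()
```

The base class then handles processing (`_process_data`) and size reduction (`_reduce_dataset_size`) of whatever the subclass loads.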