nip.code_validation.data.CodeValidationDataset#
- class nip.code_validation.data.CodeValidationDataset(hyper_params: HyperParameters, settings: ExperimentSettings, protocol_handler: ProtocolHandler, split: Literal['train', 'test', 'validation'] = 'train')[source]#
Base class for the code validation datasets.
Works with HuggingFace datasets.
The dataset should have the following columns:
- “question”: The question text.
- “solution”: The solution text.
- “y”: The label: 1 for correct solutions and 0 for buggy solutions.
In addition, in single-prover settings under the appropriate hyper-parameters, each datapoint receives a “prover_stance”: the verdict that the prover should argue for. This can be computed from the (hash of the) solution text.
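The column schema and the hash-derived “prover_stance” described above can be sketched as follows. The helper name `compute_prover_stance` and the choice of SHA-256 are illustrative assumptions, not the library's actual implementation:

```python
import hashlib


def compute_prover_stance(solution_text: str) -> int:
    """Hypothetical helper: derive a stance (0 or 1) from the hash of the
    solution text, so the stance is deterministic across runs."""
    digest = hashlib.sha256(solution_text.encode("utf-8")).digest()
    return digest[0] % 2


# Datapoints with the columns described above.
datapoints = [
    {"question": "Does my_sort sort ascending?",
     "solution": "def my_sort(xs): return sorted(xs)",
     "y": 1},
    {"question": "Does my_sort sort ascending?",
     "solution": "def my_sort(xs): return xs",
     "y": 0},
]
for point in datapoints:
    point["prover_stance"] = compute_prover_stance(point["solution"])
```

Because the stance depends only on the solution text, it needs no extra storage and is stable across dataset reloads.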
Methods Summary
__getitem__(index)
__init__(hyper_params, settings, ...[, split])
__len__()
__repr__()
Return repr(self).
_load_raw_dataset()
Load the dataset.
_process_data(raw_dataset)
Process the dataset.
_reduce_dataset_size(dataset)
Reduce the size of a dataset if necessary.
Attributes
dataset_filepath_name
The name of the dataset file.
instance_keys
The keys specifying the input instance.
keys
The keys (field names) in the dataset.
max_test_size
The maximum size of the test set.
max_train_size
The maximum size of the training set.
processed_dir
The path to the directory containing the processed data.
raw_dir
The path to the directory containing the raw data.
reduce_shuffle_seed
The seed used to shuffle the dataset before reducing its size.
split_dir
The name of the folder containing the split data.
validation_proportion
The proportion of the training set to use for validation.
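The size-reduction attributes above (`max_train_size`, `max_test_size`, `reduce_shuffle_seed`) suggest the following behaviour for `_reduce_dataset_size`. This is a sketch over plain lists under those assumptions, not the actual implementation, and the function name `reduce_dataset_size` is illustrative:

```python
import random


def reduce_dataset_size(rows, max_size, shuffle_seed):
    """Sketch: shuffle with a fixed seed, then truncate, so the
    subsample is deterministic for a given seed."""
    if max_size is None or len(rows) <= max_size:
        return rows
    rows = list(rows)
    random.Random(shuffle_seed).shuffle(rows)
    return rows[:max_size]


rows = [{"question": f"q{i}", "solution": f"s{i}", "y": i % 2}
        for i in range(10)]
subset = reduce_dataset_size(rows, max_size=4, shuffle_seed=0)
```

Shuffling before truncating avoids always keeping the first rows of the raw dataset, while the fixed seed keeps the subsample reproducible.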
Methods
- __getitem__(index: Any) NestedArrayDict [source]#
- __init__(hyper_params: HyperParameters, settings: ExperimentSettings, protocol_handler: ProtocolHandler, split: Literal['train', 'test', 'validation'] = 'train')[source]#
- abstract _load_raw_dataset() Dataset [source]#
Load the dataset.
- Returns:
raw_data (HuggingFaceDataset) – The unprocessed dataset.
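Since `_load_raw_dataset` is abstract, a concrete subclass only needs to supply it. The following structural sketch uses a simplified stand-in base class, with a list of row dicts in place of a real HuggingFace `Dataset`; all names here are hypothetical:

```python
from abc import ABC, abstractmethod


class CodeValidationDatasetSketch(ABC):
    """Simplified stand-in for the base class, showing the override pattern."""

    @abstractmethod
    def _load_raw_dataset(self):
        """Load the unprocessed dataset."""


class ToyCodeValidationDataset(CodeValidationDatasetSketch):
    def _load_raw_dataset(self):
        # A real subclass would return a HuggingFace Dataset with the
        # "question", "solution" and "y" columns; row dicts stand in here.
        return [
            {"question": "Does my_max return the largest element?",
             "solution": "def my_max(xs): return sorted(xs)[-1]",
             "y": 1},
        ]


raw = ToyCodeValidationDataset()._load_raw_dataset()
```

The base class then handles processing and size reduction uniformly, so each new dataset source only has to implement this one loading method.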