create_cv_dataset.py

Generate a Buggy APPS dataset.

Our Buggy APPS dataset is based on the APPS dataset [HBK+21], which consists of problem statements and code solutions. Buggy APPS augments it with buggy code solutions, each generated by asking a large language model to introduce a subtle bug into a correct solution.
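The augmentation step can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the `BuggyRecord` field names, the `augment` helper, and `toy_bug`, which stands in for the language model call with a simple string edit. The script's actual schema and prompting are not described on this page.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class BuggyRecord:
    # Hypothetical field names; the real dataset schema may differ.
    problem: str
    correct_solution: str
    buggy_solution: str


def augment(
    pairs: Iterable[Tuple[str, str]],
    introduce_bug: Callable[[str, str], str],
) -> List[BuggyRecord]:
    """Pair each correct solution with a buggy variant from `introduce_bug`."""
    return [
        BuggyRecord(problem, solution, introduce_bug(problem, solution))
        for problem, solution in pairs
    ]


def toy_bug(problem: str, solution: str) -> str:
    # Stand-in for the language model: weakens the first `<=` to `<`,
    # producing an off-by-one error. The solution is only treated as text.
    return solution.replace("<=", "<", 1)


records = augment(
    [("Print 1..n", "i = 1\nwhile i <= n:\n    print(i)\n    i += 1")],
    toy_bug,
)
```

Passing the bug generator in as a callable keeps the sketch independent of any particular model API; a real generator would replace `toy_bug`.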

scripts/create_cv_dataset.py

Generate a Buggy APPS dataset

usage: scripts/create_cv_dataset.py [-h] [--split SPLIT] [--num_data NUM_DATA]
                                    [--save_after SAVE_AFTER]
-h, --help

show this help message and exit

--split <split>

Whether to draw problems from the train or test split of the APPS dataset

--num_data <num_data>

How many problems the dataset should contain (per split per difficulty level)

--save_after <save_after>

The number of problems to add between saves (and possible pushes) of the dataset
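The following sketch shows how a command-line interface with these three options could be declared with argparse, and how a `--save_after` interval might translate into save points. The option names and help strings come from the usage above. The types, the defaults, the `choices` restriction, and the `save_points` helper are assumptions, and the sketch treats the problem count as a single total rather than per split and difficulty level.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="scripts/create_cv_dataset.py",
        description="Generate a Buggy APPS dataset",
    )
    # Types, defaults and choices are assumptions, not the script's values.
    parser.add_argument(
        "--split", choices=["train", "test"], default="train",
        help="Whether to draw problems from the train or test split "
             "of the APPS dataset",
    )
    parser.add_argument(
        "--num_data", type=int, default=10,
        help="How many problems the dataset should contain "
             "(per split per difficulty level)",
    )
    parser.add_argument(
        "--save_after", type=int, default=5,
        help="The number of problems added after which to save "
             "(and possibly push) the dataset",
    )
    return parser


def save_points(total: int, save_after: int) -> List[int]:
    """Counts of added problems at which a periodic save would trigger."""
    return [n for n in range(1, total + 1) if n % save_after == 0]


args = build_parser().parse_args(
    ["--split", "test", "--num_data", "6", "--save_after", "2"]
)
points = save_points(args.num_data, args.save_after)
```

With these example arguments, `points` is `[2, 4, 6]`. Whether the real script also saves once at the end when the total is not a multiple of the interval is not stated here.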