This is the repo for ReCode (arXiv), providing a comprehensive evaluation of the practical robustness of code generation models such as CodeGen, InCoder, and GPT-J. Specifically, this benchmark provides over 30 general perturbations on docstrings, function names, and code. The perturbations are carefully selected and implemented so that the perturbed datasets remain natural and semantically close to the original, unperturbed datasets. All perturbations are generated automatically, making them easy to use and customize. With these perturbations, users can obtain a comprehensive analysis of a model's robustness.
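To give a flavor of what a docstring perturbation looks like, below is a minimal Python sketch that injects character-level noise (adjacent-letter swaps) into a docstring. It is only an illustration of the kind of transformation the benchmark applies; `perturb_docstring` and its parameters are hypothetical and not part of the actual ReCode implementation.

```python
import random

def perturb_docstring(docstring: str, swap_prob: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent letters in a docstring (illustrative, hypothetical
    character-level perturbation; not the benchmark's actual code)."""
    rng = random.Random(seed)
    chars = list(docstring)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so swaps do not cascade
        else:
            i += 1
    return "".join(chars)

print(perturb_docstring("Return the sum of two integers a and b."))
```

A good perturbation keeps the prompt readable to a human while testing whether the model's output changes, which is why small, semantics-preserving edits like this are preferred over heavy rewrites.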
Our benchmark is general with regard to datasets and models. Given the perturbed datasets, users can evaluate any public or custom code generation model with the default inference pipeline provided by our benchmark. Users can also bring their own datasets and models for robustness evaluation by configuring config.json and the inference script evaluate-public-models/run_eval_models.sh.
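As a rough sketch of that workflow, the snippet below edits config.json and then launches the provided inference script. The keys shown (`model`, `dataset`) are assumptions for illustration only; consult config.json in the repo for the actual schema.

```python
import json
import subprocess

# Load the existing benchmark configuration.
with open("config.json") as f:
    config = json.load(f)

# Hypothetical keys -- the real config.json may use different field names.
config["model"] = "my-org/my-code-model"
config["dataset"] = "dataset-release/perturbed-finalized"

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

# Run the default inference script shipped with the benchmark.
subprocess.run(["bash", "evaluate-public-models/run_eval_models.sh"], check=True)
```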
After model evaluation on the perturbed datasets is complete, we provide an overall robustness analysis for the evaluated models, so that users can easily compare across models and become aware of potential practical robustness problems.
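One simple way to summarize such an analysis is the relative drop in pass@1 between the nominal and perturbed datasets, as in the sketch below. The numbers and the `relative_drop` helper are illustrative only and are not necessarily the exact metrics reported by the benchmark's analysis scripts.

```python
def relative_drop(nominal_pass_at_1: float, perturbed_pass_at_1: float) -> float:
    """Relative degradation of pass@1 under perturbation (higher = less robust)."""
    return (nominal_pass_at_1 - perturbed_pass_at_1) / nominal_pass_at_1

# Hypothetical results for two models on a docstring perturbation.
results = {
    "model-a": {"nominal": 0.30, "perturbed": 0.21},
    "model-b": {"nominal": 0.25, "perturbed": 0.22},
}

for name, r in results.items():
    drop = relative_drop(r["nominal"], r["perturbed"])
    print(f"{name}: pass@1 {r['nominal']:.2f} -> {r['perturbed']:.2f} "
          f"(relative drop {drop:.0%})")
```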
Lastly, we release a standard version of the perturbed HumanEval and MBPP datasets under dataset-release/perturbed-finalized for general robustness evaluation and for comparing against models proposed in future work.