Multi-lingual evaluation of code generation models
2023
We present HumanEvalX and MBXP, execution-based code completion benchmarks in 10+ programming languages. These datasets are generated by our conversion framework, which transpiles the prompts and test cases of the original datasets (HumanEval and MBPP) into corresponding data in a target language. With these benchmarks, we can evaluate code generation models in a multilingual fashion and, in particular, study the generalization ability of language models to out-of-domain languages, the advantages of multilingual models over monolingual ones, the ability of few-shot prompting to teach a model a new language, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping and obtain synthetic canonical solutions in several languages. These solutions can be used for code-related evaluations beyond function completion, such as insertion-based generation, summarization, or code translation.
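To make the conversion concrete, here is a minimal sketch of transpiling a Python-style prompt (signature plus docstring) and an `assert`-based test case into a Java prompt and assertion. The function names (`convert_prompt`, `convert_test_case`), the integer-only type mapping, and the exact Java output format are assumptions for illustration, not the paper's actual framework.

```python
# Minimal sketch of prompt/test-case conversion in the spirit of the
# conversion framework described above. Names, type mapping, and output
# format are illustrative assumptions, not the authors' implementation.
import re


def to_java_literal(value: str) -> str:
    """Map simple Python literals to Java literals (booleans only differ here)."""
    if value in ("True", "False"):
        return value.lower()
    return value  # ints, doubles, and string literals pass through unchanged


def convert_test_case(assert_line: str) -> str:
    """Turn a Python `assert f(args) == expected` into a Java assertion."""
    match = re.match(r"assert\s+(\w+)\((.*)\)\s*==\s*(.+)", assert_line.strip())
    if match is None:
        raise ValueError(f"unsupported test case: {assert_line}")
    func, args, expected = match.groups()
    java_args = ", ".join(to_java_literal(a.strip()) for a in args.split(","))
    return (f"if (!java.util.Objects.equals(Solution.{func}({java_args}), "
            f"{to_java_literal(expected.strip())})) throw new AssertionError();")


def convert_prompt(signature: str, docstring: str) -> str:
    """Emit an (intentionally incomplete) Java function header for the model to finish."""
    match = re.match(r"def\s+(\w+)\((.*)\):", signature.strip())
    if match is None:
        raise ValueError(f"unsupported signature: {signature}")
    func, params = match.groups()
    # Assume integer parameters for this sketch; a real converter infers types.
    java_params = ", ".join(f"int {p.strip()}" for p in params.split(",") if p.strip())
    comment = "\n".join(f"    // {line}" for line in docstring.splitlines())
    return f"class Solution {{\n{comment}\n    static int {func}({java_params}) {{"


if __name__ == "__main__":
    print(convert_prompt("def add(a, b):", "Return the sum of a and b."))
    print(convert_test_case("assert add(1, 2) == 3"))
```

At evaluation time, a model's completion is appended to the converted prompt, and the result is compiled and executed against the converted test cases; a sample counts as correct only if every assertion passes, which is what makes the benchmarks execution-based.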