Automatic identification of coding best practices can scale the development of code and application analyzers. We present Doc2BP, a deep learning tool to identify coding best practices in software documentation. Natural language descriptions are mapped to an informative embedding space, optimized under the dual objectives of binary and few-shot classification. The binary objective powers general classification into known best practice categories using a deep learning classifier. The few-shot objective facilitates example-based classification into novel categories by matching embeddings against user-provided examples at run time, without retraining the underlying model.
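The example-based matching described above can be illustrated with a minimal sketch. It assumes a nearest-prototype scheme with cosine similarity, which is one common way to do few-shot classification over embeddings; the abstract does not say which matching rule Doc2BP uses. The category names, the three-dimensional vectors, and the helper functions are all hypothetical stand-ins for real sentence embeddings and user-provided examples.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_prototypes(support):
    """Average the example embeddings of each category into one prototype."""
    grouped = defaultdict(list)
    for label, vec in support:
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify_few_shot(query, prototypes):
    """Return the category whose prototype is most similar to the query."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Hypothetical user-provided examples: (category, embedding) pairs.
# In practice the embeddings would come from a trained encoder.
support = [
    ("nullable_return", [0.9, 0.1, 0.0]),
    ("nullable_return", [0.8, 0.2, 0.1]),
    ("metric_api", [0.1, 0.9, 0.2]),
    ("metric_api", [0.0, 0.8, 0.3]),
]

prototypes = build_prototypes(support)
label = classify_few_shot([0.85, 0.15, 0.05], prototypes)
print(label)
```

Adding a new category in this sketch only requires appending examples to `support` and rebuilding the prototypes; no model weights change, which mirrors the "without retraining" property the abstract describes.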
We analyze the effects of manually and synthetically labeled examples, context, and cross-domain information. We have applied Doc2BP to Java, Python, AWS Java SDK, and AWS CloudFormation documentation. Compared with prior work, which primarily leverages keyword heuristics, and with our own part-of-speech pattern baselines, we obtain a 3-5% F1 score improvement for Java and Python, and 15-20% for AWS Java SDK and AWS CloudFormation. Experiments with four few-shot use cases show promising results (5-shot accuracy of 99%+ for Java NullPointerException and AWS Java metrics, 65% for AWS CloudFormation numerics, and 35% for Python best practices).
Doc2BP has contributed new rules and improved specifications in Amazon’s code and application analyzers: (a) 500+ new checks in cfn-lint, an open-source AWS CloudFormation linter, (b) over 97% automated coverage of metrics APIs and related practices in Amazon DevOps Guru, (c) support for nullable AWS APIs in Amazon CodeGuru’s Java NullPointerException (NPE) detector, (d) 200+ new best practices for Java, Python, and respective AWS SDKs in Amazon CodeGuru, and (e) 2% reduction in false positives in Amazon CodeGuru’s Java resource leak detector.
Learning-based identification of coding best practices from software documentation
2022