Identifying correct and complete taint specifications is critical for detecting vulnerabilities in the ever-changing landscape of software security, yet an automated, scalable, and practical solution remains elusive. In this paper, we report our semi-automated scheme for inferring and maintaining taint specifications at industrial scale. A knowledge graph serves as the core engine for representing the ongoing accumulation of knowledge in the domain of software security: how different functional behaviors in programs relate to one another and manifest in the varying contexts of security vulnerabilities and their defenses. Taint analysis rules are then mapped onto nodes in the knowledge graph to achieve the desired security enforcement. We begin by mining a corpus of existing code analysis tools and code examples from the wild to discover candidate taint specifications, followed by human-in-the-loop labeling to assign concrete APIs to nodes in the knowledge graph. To grow the knowledge graph continuously, we propose a novel inference algorithm that uses a multi-view active machine learning approach to characterize taint-relevant APIs via collective matrix factorization, which combines different aspects of an API's usage patterns with its naming. The resulting API embeddings are then used as features in a tree-based classifier to expand taint specifications, starting from a small list of well-known APIs (seeds). Finally, tooling built around the generated taint specifications enables their automatic and uniform deployment across an ensemble of security analysis tools. With the proposed technology, we expand the configurable taint rules used in AWS CodeGuru Reviewer, improving its detection capabilities by covering novel security scenarios while maintaining a high acceptance rate for its reported findings.
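To make the collective matrix factorization step more concrete, the following is a minimal sketch, not the paper's implementation. It assumes two toy binary views of the same four APIs (a usage-pattern view and a naming-token view) and fits a shared API embedding with plain stochastic gradient descent. The API names, feature columns, rank, learning rate, and regularization strength are all hypothetical choices made for illustration.

```python
import math
import random


def collective_mf(usage, naming, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit usage ~ U*V^T and naming ~ U*W^T with one shared API factor U.

    Illustrative only: plain SGD over every matrix entry, squared error,
    L2 regularization. Returns the API embedding U and the per-epoch loss.
    """
    rng = random.Random(seed)

    def init(rows):
        return [[rng.uniform(-0.5, 0.5) for _ in range(rank)] for _ in range(rows)]

    U = init(len(usage))  # one embedding row per API, shared by both views
    views = [(usage, init(len(usage[0]))), (naming, init(len(naming[0])))]
    history = []
    for _ in range(epochs):
        loss = 0.0
        for matrix, F in views:  # F is the view-specific feature factor
            for i, row in enumerate(matrix):
                for j, target in enumerate(row):
                    err = target - sum(U[i][k] * F[j][k] for k in range(rank))
                    loss += err * err
                    for k in range(rank):
                        u, f = U[i][k], F[j][k]
                        U[i][k] += lr * (err * f - reg * u)
                        F[j][k] += lr * (err * u - reg * f)
        history.append(loss)
    return U, history


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical APIs: two source-like and two sink-like.
apis = ["request.getParameter", "request.getHeader",
        "statement.executeQuery", "statement.executeUpdate"]
# Hypothetical usage view: [reads external input, receives query string, result is parsed]
usage = [[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 0]]
# Hypothetical naming view: name contains the token [get, request, execute]
naming = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]

embedding, history = collective_mf(usage, naming)
print("squared error, first vs. last epoch:", history[0], history[-1])
print("similarity of the two source-like APIs:", cosine(embedding[0], embedding[1]))
print("similarity of a source-like and a sink-like API:", cosine(embedding[0], embedding[2]))
```

In a pipeline like the one described above, each row of `embedding` would presumably serve as the feature vector for one API, and a tree-based classifier trained on the labeled seed APIs would score the remaining ones. That classification step is omitted here to keep the sketch dependency-free.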