Dependency-based neural representations for classifying lines of programs

How can ML models be used to represent lines of programs? We design a dependency-graph-based neural representation of programs and use it to classify whether a given line of code contains a vulnerability.

Tags: papers, ml-for-plse, program-representation

arXiv-2020

Dependency-Based Neural Representations for Classifying Lines of Programs

Srikant, Shashank, Lesimple, Nicolas, and O’Reilly, Una-May

arXiv preprint arXiv:2004.10166, 2020

Abstract

We investigate the problem of classifying a line of program as containing a vulnerability or not using machine learning. Such a line-level classification task calls for a program representation which goes beyond reasoning from the tokens present in the line. We seek a distributed representation in a latent feature space which can capture the control and data dependencies of tokens appearing on a line of program, while also ensuring lines of similar meaning have similar features.

We present a neural architecture, Vulcan, that successfully demonstrates both these requirements. It extracts contextual information about tokens in a line and inputs them as Abstract Syntax Tree (AST) paths to a bi-directional LSTM with an attention mechanism. It concurrently represents the meanings of tokens in a line by recursively embedding the lines where they are most recently defined.
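To make the idea of "the lines where they are most recently defined" concrete, here is a minimal sketch of how such definition links could be recovered and followed recursively. It is not Vulcan's pipeline: it uses Python's standard `ast` module as a stand-in parser, assumes straight-line code (no branches or loops), and the names `last_definitions`, `dependency_closure`, and `SOURCE` are hypothetical, chosen only for this illustration. The neural part (AST paths fed to a bi-directional LSTM with attention) is deliberately left out.

```python
import ast

# Toy program, assumed for illustration only.
SOURCE = "n = 10\nbuf = [0] * n\ni = n + 1\nbuf[i] = 1\n"


def last_definitions(source):
    """Map each line to {variable read on it: line of its most recent assignment}.

    Assumption: straight-line code, so "most recent" means the closest
    earlier line that assigns the name.
    """
    tree = ast.parse(source)
    reads, writes = {}, {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            bucket = writes if isinstance(node.ctx, ast.Store) else reads
            bucket.setdefault(node.lineno, set()).add(node.id)

    last_def = {}  # variable name -> line of its latest assignment so far
    deps = {}
    for line in sorted(set(reads) | set(writes)):
        # Resolve reads before recording this line's own writes,
        # so that `x = x + 1` points back to the earlier definition of x.
        deps[line] = {
            name: last_def[name]
            for name in sorted(reads.get(line, ()))
            if name in last_def
        }
        for name in writes.get(line, ()):
            last_def[name] = line
    return deps


def dependency_closure(deps, line):
    """Follow definition links recursively; return every line `line` depends on."""
    visited = set()
    stack = [line]
    while stack:
        current = stack.pop()
        for def_line in deps.get(current, {}).values():
            if def_line not in visited:
                visited.add(def_line)
                stack.append(def_line)
    return sorted(visited)


deps = last_definitions(SOURCE)
print(deps[4])                      # definition lines for tokens read on line 4
print(dependency_closure(deps, 4))  # all lines that line 4 transitively depends on
```

In this toy program, line 4 (`buf[i] = 1`) reads `buf` and `i`, which were last defined on lines 2 and 3, and both of those depend on line 1. A model in the spirit of the abstract would embed those defining lines and combine them into the representation of line 4, rather than relying on the tokens of line 4 alone.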

In our experiments, Vulcan compares favorably with a state-of-the-art classifier, which requires significant preprocessing of programs, suggesting the utility of using deep learning to model program dependence information.

Links

PDF

PDF - Local copy