Efficient and Flexible Discovery of PHP Application Vulnerabilities

Introduction

The most popular and widely deployed language for Web applications is undoubtedly PHP, powering more than 80% of the top ten million websites [29], including widely used platforms such as Facebook, Wikipedia, Flickr, and WordPress, and contributing to almost 140,000 open-source projects on GitHub [38]. Yet from a security standpoint, the language is poorly designed: It typically yields a large attack surface (e.g., every PHP script on a server can potentially be used as an entry point by an attacker) and bears inconsistently designed functions with often surprising side effects [22], all of which a programmer must be aware of and keep in mind while developing a PHP application. As a result of its confusing and inconsistent APIs, PHP is particularly prone to programming mistakes that may lead to Web application vulnerabilities such as SQL injections and Cross-Site Scripting. Combined with its prevalence on the Web, PHP therefore constitutes a prime target for automated security analyses to assist developers in avoiding critical mistakes and consequently improve the overall security of applications on the Web. Indeed, a considerable amount of research has been dedicated to identifying vulnerable information flows in a machine-assisted manner [15, 16, 4, 5]. These approaches successfully identify different types of PHP vulnerabilities in Web applications. However, they have only been evaluated in a controlled environment of about half a dozen projects. It is therefore unclear how scalable they are and how well they perform in the much less controlled setting of very large sets of arbitrary PHP projects (see Section 7 on related work for details). In addition, these approaches are hardly customizable, in the sense that they cannot be configured to look for different kinds of vulnerabilities.
The research question of how to detect PHP application vulnerabilities at large scale in an efficient manner, whilst maintaining an acceptable precision and the ability to customize the detection process as needed, has received significantly less attention so far. Yet this question is crucial to address, given the rapidly increasing number of Web applications.
Our Contributions. We propose a highly scalable and flexible approach for analyzing PHP applications that may consist of millions of lines of code. To this end, we leverage the recently proposed concept of code property graphs [35]: These graphs constitute a canonical representation of code incorporating a program’s syntax, control flow, and data dependencies in a single graph structure, which we further enrich with call edges to allow for interprocedural analysis. These graphs are then stored in a graph database that lays the foundation for efficient and easily programmable graph traversals amenable to identifying flaws in program code. As we show in this paper, this approach is well-suited to discover vulnerabilities in high-level, dynamic scripting languages such as PHP at a large scale. In addition, it is highly flexible: The bulk work of generating code property graphs and importing them into a database is done in a fully automated manner. Subsequently, an analyst can write traversals to query the database as desired so as to find various kinds of vulnerabilities: For instance, one may look for common code patterns or for specific flows from given types of attacker-controlled sources to given security-critical function calls that are not appropriately sanitized; which sources, sinks, and sanitizers are to be considered may be easily specified and adapted as needed. We show how to model typical Web application vulnerabilities using such graph traversals that can be efficiently run by the database backend. We evaluate our approach on a set of 1,854 open-source PHP projects on GitHub.
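To illustrate what a configurable source-to-sink traversal might look like, the following is a minimal sketch in Python. It is not the implementation described in this paper: it assumes a toy data-dependence graph held in plain dictionaries instead of a graph database, and the names `NODES`, `DATA_FLOW`, `SOURCES`, `SINKS`, `SANITIZERS`, and `find_unsanitized_flows`, as well as the PHP statements stored in the nodes, are hypothetical.

```python
# Hypothetical toy graph: each node is one PHP statement. "uses" lists the
# superglobals and functions the statement touches (an assumption made here
# to keep the sketch small; a real graph would expose this via AST nodes).
NODES = {
    "n1": {"code": "$id = $_GET['id'];", "uses": {"$_GET"}},
    "n2": {"code": "$q = \"SELECT * FROM users WHERE id = $id\";", "uses": set()},
    "n3": {"code": "mysql_query($q);", "uses": {"mysql_query"}},
    "n4": {"code": "$name = $_POST['name'];", "uses": {"$_POST"}},
    "n5": {"code": "$safe = mysql_real_escape_string($name);",
           "uses": {"mysql_real_escape_string"}},
    "n6": {"code": "mysql_query(\"SELECT * FROM users WHERE name = '$safe'\");",
           "uses": {"mysql_query"}},
}

# Data-dependence edges: a -> b means statement b uses a value defined at a.
DATA_FLOW = {"n1": ["n2"], "n2": ["n3"], "n4": ["n5"], "n5": ["n6"]}

# The analyst-supplied configuration: what counts as source, sink, sanitizer.
SOURCES = {"$_GET", "$_POST"}
SINKS = {"mysql_query"}
SANITIZERS = {"mysql_real_escape_string"}


def find_unsanitized_flows(nodes, data_flow, sources, sinks, sanitizers):
    """Return every data-flow path from a source to a sink that does not
    pass through a sanitizer."""
    flows = []
    for start, attrs in nodes.items():
        if not attrs["uses"] & sources:
            continue  # not an attacker-controlled source
        stack = [[start]]
        while stack:
            path = stack.pop()
            uses = nodes[path[-1]]["uses"]
            if uses & sanitizers:
                continue  # the value is sanitized on this path; stop here
            if uses & sinks:
                flows.append(path)  # unsanitized flow reaches a sink
                continue
            for successor in data_flow.get(path[-1], []):
                if successor not in path:  # avoid cycles
                    stack.append(path + [successor])
    return flows


if __name__ == "__main__":
    for flow in find_unsanitized_flows(NODES, DATA_FLOW, SOURCES, SINKS, SANITIZERS):
        print(" -> ".join(flow))  # prints: n1 -> n2 -> n3
```

In this sketch, looking for a different class of vulnerability only requires passing different source, sink, and sanitizer sets; the traversal itself stays unchanged.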
Our three main contributions are as follows:
• Introduction of PHP code property graphs. We are the first to employ the concept of code property graphs for a high-level, dynamic scripting language such as PHP. We implement code property graphs for PHP using static analysis techniques and additionally augment them with call edges to allow for interprocedural analysis. These graphs are stored in a graph database that can subsequently be used for complex queries. The generation of these graphs is fully automated, that is, all that users have to do to implement their own interprocedural analyses is to write such queries. We make our implementation publicly available to facilitate independent research.
• Modeling Web application vulnerabilities. We show that code property graphs can be used to find typical Web application vulnerabilities by modeling such flaws as graph traversals, i.e., fully programmable algorithms that travel along the graph to find specific patterns. These patterns are undesired flows from attacker-controlled input to security-critical function calls without appropriate sanitization routines. We detail such patterns precisely for attacks targeting both server and client, such as SQL injections, command injections, code injections, arbitrary file accesses, cross-site scripting, and session fixation. While these graph traversals demonstrate the feasibility of our technique, we emphasize that more traversals may easily be written by PHP application developers and analysts to detect other kinds of vulnerabilities or patterns in program code.
• Large-scale evaluation. To evaluate the efficacy of our approach, we report on a large-scale analysis of 1,854 popular PHP projects on GitHub totaling almost 80 million lines of code. In our analysis, we find that our approach scales well to the size of the analyzed code.
In total, we found 78 SQL injection vulnerabilities, 6 command injection vulnerabilities, 105 code injection vulnerabilities, 6 vulnerabilities allowing an attacker to access arbitrary files on the server, and one session fixation vulnerability. XSS vulnerabilities are very common and our tool generated a considerable number of reports in our large-scale evaluation for this class of attack. We inspected only a small sample (under 2%) of these reports and found 26 XSS vulnerabilities.
Paper Outline. The remainder of this paper is organized as follows: In Section 2, we discuss the technical background of our work, covering core concepts like ASTs, CFGs, PDGs, and call graphs. In Section 3, we present a conceptual overview of our approach, follow up with the necessary techniques to represent and query PHP code property graphs in a graph database, and discuss how typical classes of vulnerabilities can be modeled using traversals. Subsequently, Section 4 presents the implementation of our approach, while Section 5 presents the evaluation of our large-scale study. Following this, Section 6 discusses our technique, Section 7 presents related work, and Section 8 concludes.

Code Property Graphs

Our work builds on the concept of code property graphs, a joint representation of a program’s syntax, control flow, and data flow, first introduced by Yamaguchi et al. [35] to discover vulnerabilities in C code. The key idea of this approach is to merge classic program representations into a so-called code property graph, which makes it possible to mine code for patterns via graph traversals. In particular, syntactical properties of code are derived from abstract syntax trees, control flow from control flow graphs, and finally, data flow from program dependence graphs. In addition, we enrich the resulting structure with call graphs so as to enable interprocedural analysis. In this section, we briefly review these concepts to provide the reader with the technical background required for the remainder of the paper. We consider the PHP code listing shown in Figure 1 as a running example. For the sake of illustration, it suffers from a trivial SQL injection vulnerability. Using the techniques presented in this paper, this vulnerability can be easily discovered.
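The idea of merging several program representations into one structure can be sketched as a single set of nodes connected by labeled edges, one label per representation. The Python sketch below is an illustration under stated assumptions, not the data model used in this paper: the class `CodePropertyGraph`, the edge labels `"AST"`, `"CFG"`, and `"PDG"`, the node identifiers, and the two-statement PHP snippet encoded in it are all hypothetical, and the snippet is not the listing of Figure 1.

```python
from dataclasses import dataclass, field


@dataclass
class CodePropertyGraph:
    """Toy property graph: one node set shared by all representations,
    with each edge labeled by the representation it comes from."""
    nodes: dict = field(default_factory=dict)  # node id -> properties
    edges: list = field(default_factory=list)  # (source, label, target)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def out(self, node_id, label):
        """Follow outgoing edges of one kind from a node."""
        return [d for s, l, d in self.edges if s == node_id and l == label]


# Hypothetical snippet:  $x = $_GET['x'];  echo $x;
cpg = CodePropertyGraph()
cpg.add_node("s1", type="assign", code="$x = $_GET['x'];")
cpg.add_node("s1.lhs", type="var", code="$x")
cpg.add_node("s1.rhs", type="dim", code="$_GET['x']")
cpg.add_node("s2", type="echo", code="echo $x;")
cpg.add_node("s2.arg", type="var", code="$x")

cpg.add_edge("s1", "AST", "s1.lhs")   # syntax: children of the assignment
cpg.add_edge("s1", "AST", "s1.rhs")
cpg.add_edge("s2", "AST", "s2.arg")   # syntax: argument of echo
cpg.add_edge("s1", "CFG", "s2")       # control flow: s2 executes after s1
cpg.add_edge("s1", "PDG", "s2")       # data flow: s2 uses $x defined at s1

# A query mixing representations: echo statements (a syntactic property)
# that depend on data defined at s1 (a data-flow property).
echoed = [n for n in cpg.out("s1", "PDG") if cpg.nodes[n]["type"] == "echo"]

if __name__ == "__main__":
    print(cpg.out("s1", "AST"))  # ['s1.lhs', 's1.rhs']
    print(echoed)                # ['s2']
```

Because all edge kinds share the same nodes, a single query can combine syntactic, control-flow, and data-flow conditions; a call-graph edge kind could be added in the same way to cross function boundaries.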