Supporting Analysis of SQL Queries in PHP AiR

INTRODUCTION PHP, one of the most popular languages for open-source development,1 is commonly used to build dynamic websites, where the information displayed on the page is based on user inputs and data stored in a (usually) relational database. Access to this data is through a variety of database libraries, either provided with the PHP language or available as separate downloads. Commonly-used libraries include the original MySQL library,2 the MySQL Improved (MySQLi) library,3 and the PHP Data Objects (PDO) library.4 Although these libraries have different APIs, all perform queries in a similar way: a query string is built using a combination of string literals— representing static parts of the query—and PHP expressions— representing dynamic parts of the query, such as the actual values used in a WHERE clause of a SELECT statement. As part of our current research, we are exploring how developers create queries that use the original MySQL library [1]. This library, available since version 2 of the PHP language, provides a procedural interface for querying MySQL databases. Queries are executed with the mysql_query function by passing a query string to the function, and features such as prepared statements are not supported. Even though the original MySQL library is now deprecated in PHP 5 releases from 5.5.0 and is no longer supported in PHP 7, a significant body of code still uses this library, including commonly-used systems such as the WordPress blogging platform. The ultimate goals of this research are to help developers better understand queries they find in code, to enable empirical research on how database queries are used in PHP code, and to renovate existing systems to allow the use of more modern and more secure database libraries such as MySQLi and PDO. Our approach, described below, is to statically extract a model representing the queries that can be passed to a specific occurrence of a query function such as mysql_query. This model can be used to better understand how the query is built in the code, and can also be used to statically generate the variants of the query, built based on different program paths taken at runtime (e.g., a query may have different WHERE clauses, based on different branches of a conditional). These variants can then be parsed using an extended MySQL parser that works with incomplete queries, where dynamic parts of the query are replaced with “holes”. The resulting MySQL ASTs can then be used to get more accurate information about which parts of the query are given statically and which parts (e.g., parameters, column names) are provided at runtime. The rest of the paper is organized as follows. In Section II we provide an overview of the process used to extract query models and to parse the partial queries yielded by a model. Section III and Section IV then provide detailed explanations: Section III explains how query models are extracted from PHP code and how these models are turned into a set of possible queries, while Section IV explains how queries with holes representing dynamically-provided information are parsed and represented as MySQL ASTs. We then briefly discuss related work in Section V, while Section VI discusses possible future work and concludes. The work described below is mainly implemented in the Rascal meta-programming language [2], [3] using the PHP Analysis in Rascal (PHP AiR) framework [4], [5]. The analysis and parsing code described in this paper is available online.5

OVERVIEW Figure 1 provides an overview of the process used to build a model and extract and parse the possible queries used in a database library call at a specific program point. The first step in the process is to parse the PHP scripts that make up the system being analyzed. This is done using our fork6 of an open-source PHP parser7, written in PHP. This generates a value of Rascal type System, which represents the parsed system and includes ASTs for each of the parsed PHP scripts. Using these ASTs, and the location of a call to a database query API function such as mysql_query (part of the original MySQL API), the Model Builder statically extracts a model representing the actual SQL queries that could be passed to this function, using a process described in Section III. This model differentiates between static and dynamic parts of the queries, provides links from names used in the query to the values that these names represent, and includes information about the conditions under which certain parts of a query are present. To get each of the possible queries represented by the model, the Query Yields Generator builds a representation— here called a yield—of each query, with each yield reflecting a specific query executed based on a specific program path from the start of the surrounding context (e.g., the function containing the API call) to the API call. As with the model, each yield includes a combination of static and dynamic pieces. To parse each of these queries, we use our fork8 of the opensource MySQL parser9 released as part of phpMyAdmin,10 a web-based tool for managing MySQL databases. This parser has been modified to properly parse queries with “holes” representing the dynamic parts of the query. The result of running the parser on each yield is to generate a Rascal AST for each query. At this time only queries using the MySQL dialect of SQL are supported, but similar parsers could be used for other dialects. More details on how the yields of the model are parsed, and on how each parsed query is represented in PHP and in Rascal, are presented in Section IV. Our current corpus for testing the model extraction and parsing processes uses a total of 17 systems with 749,451 lines of PHP code and 4,788 PHP files, as counted by the cloc11 tool. In total, these systems contain 2,607 calls to mysql_query. We plan to extend this to additional systems in the future.