Variable Feature Usage Patterns in PHP

INTRODUCTION

PHP is an imperative, object-oriented language focused on server-side application development. As of April 2015, it ranks 7th on the TIOBE programming community index,¹ and is used by 82 percent of all websites whose server-side language can be determined.² Designed to allow for the rapid construction of websites, PHP includes a number of dynamic language features used to simplify code, provide reflective capabilities, and support deferring configuration decisions to runtime.

An example of one such feature is variable variables. Instead of giving the name of a variable directly, it is given by an expression which should evaluate to a string containing the name of the variable. The actual variable to be accessed is then determined at runtime, based on the result of evaluating the expression providing the name. This provides a lightweight form of aliasing, and is often used to allow the same block of code to be applied to a number of different variables. At the same time, this makes it challenging to provide the precise static analysis algorithms needed to support more advanced program analysis tasks and developer tools: without further analysis, a variable variable could refer to any variable in scope, including (if used in a global declaration) global variables.

In prior work [1] we showed that many occurrences of variable variables could actually be resolved to a limited set of names statically, just by inspecting the code. Many occurrences also fell into a small number of standard usage patterns. However, this prior work was not automated, but instead was based on reviewing how each variable variable was actually used in the program. The lack of automation makes it difficult to use these results in other analyses and tools, or to update them to take account of new systems or new releases of the analyzed systems.
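
To make the variable-variable mechanism described above concrete, the following is a minimal sketch rather than code taken from the corpus. The variable names ($host, $port, $user) and the environment variable SETTING_NAME are hypothetical, chosen only for illustration. The loop shows the idiom of applying one block of code to several variables, where the set of possible names can be read off the literal array; the final lines show a use whose target depends on runtime input.

```php
<?php
// Hypothetical configuration values; the names are illustrative only.
$host = 'localhost';
$port = '8080';
$user = 'admin';

// The same block is applied to several variables: $$name reads or writes
// the variable whose name is the string currently held in $name.
// The names come from a literal array, so an inspection of the code can
// resolve $$name to exactly the set {host, port, user}.
foreach (['host', 'port', 'user'] as $name) {
    $$name = strtoupper($$name);
}

echo "$host $port $user\n"; // prints "LOCALHOST 8080 ADMIN"

// Here the name comes from the environment (an assumed, hypothetical
// setting), so without further analysis $$key could refer to any
// variable in scope.
$key = getenv('SETTING_NAME') ?: 'host';
echo isset($$key) ? $$key : '(undefined)', "\n";
```

In the first use, the candidate names are visible in the code itself; in the second, they are not, which is the distinction the rest of this section relies on.
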
This prior work also focused just on variable variables, ignoring patterns of use of similar features for specifying the names of functions, methods, properties, and classes at runtime. Below, we refer to these generally as variable features, and specifically as, e.g., variable functions or variable methods.

The main contributions presented in this paper are as follows. First, taking advantage of our prior work, we have developed a number of patterns for detecting idiomatic uses of variable features in PHP programs and for resolving these uses to a set of possible names. Written using the Rascal programming language [2], [3], these patterns work over the ASTs of the PHP scripts, with more advanced patterns also taking advantage of control flow graphs and lightweight analysis algorithms for detecting value flow and reachability. Each of these patterns works generally across all variable features, instead of being designed specifically to recognize variable variables.

Second, to empirically determine how often these patterns actually occur in practice, we have applied them across a corpus of 20 open-source PHP systems. This corpus, made up of 31,624 files and 3,725,904 lines of PHP, includes a number of popular frameworks and systems including WordPress, Joomla, MediaWiki, and Symfony. Results are reported both in total and for groupings of similar patterns, providing insight into how each of the systems in the corpus uses variable features. Several anti-patterns, indicating that an occurrence is most likely unresolvable statically, are also presented; their effectiveness is measured by comparing detected occurrences with occurrences that are actually resolved using the patterns.

The rest of the paper is organized as follows. In Section II, we discuss PHP variable features in more depth, describing the various types of variable features available in the language and showing examples of how they are used.
Section III then describes the corpus, tools, and research method applied to conduct this analysis. Following this, Section IV describes the patterns developed to detect idiomatic occurrences of variable features where these features can be statically resolved to precise sets of names, as well as the anti-patterns used to detect when this is most likely not possible. To determine how often these patterns and anti-patterns occur in actual open-source software, Section V presents the results of evaluating them on the corpus mentioned above. Finally, Section VI describes related work, and Section VII concludes.

All software used in this paper, including the corpus used for the validation, is available for download at https://github.com/cwi-swat/php-analysis.
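
For readers unfamiliar with the other variable features named in this introduction (variable functions, methods, properties, and classes), the following minimal sketch shows one use of each. It is an illustration under stated assumptions, not code from the corpus: the Greeter class and all variable names are hypothetical.

```php
<?php
// Hypothetical class used only for illustration.
class Greeter
{
    public $greeting = 'Hello';

    public function greet(string $who): string
    {
        return "{$this->greeting}, $who";
    }
}

// In each case the name is supplied as a string and looked up at runtime.
$function = 'strrev';    // name of a function
$class    = 'Greeter';   // name of a class
$property = 'greeting';  // name of a property
$method   = 'greet';     // name of a method

$reversed = $function('abc');          // variable function: strrev('abc')
$object   = new $class();              // variable class: new Greeter()
$object->$property = 'Hi';             // variable property: $object->greeting
$message  = $object->$method('world'); // variable method: $object->greet(...)

echo $reversed, "\n"; // prints "cba"
echo $message, "\n";  // prints "Hi, world"
```

Because every name here is assigned from a string literal immediately before use, each occurrence could be resolved by inspecting the code; the difficulty arises when such strings are computed or come from outside the script.
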