Auto-Locating and Fix-Propagating for HTML Validation Errors to PHP Server-side Code

INTRODUCTION

Web applications have become critical infrastructure in our society. The World Wide Web Consortium (W3C) has developed several standards to ensure the development of high-quality and reliable Web applications [1]. An important quality criterion for a Web application is Markup Validity [2], which defines the validity of a Web document in HTML and other client-side markup languages according to their corresponding grammar, vocabulary, and syntactic rules. Although modern Web browsers parse even malformed HTML pages very well, some software defects in Web applications are not easily caught because of the client-server and dynamic nature of Web content. Checking for HTML validation errors can help in finding and fixing bugs in Web development. In a survey conducted by W3C [3], a majority of Web professionals stated that validation errors are the first thing they check whenever they run into a Web styling or scripting bug. Creating Web pages according to a widely accepted standard also makes them easier to maintain and evolve, even if the maintenance and evolution are performed by different developers [3].

Recognizing the importance of markup validity for Web pages, several organizations and individuals have produced automatic Web page validation tools (also called HTML validators). Some HTML validators (e.g., Tidy [4]) also provide automatic support for fixing markup errors, converting an HTML page into a well-formed one that conforms to HTML grammar and syntax. However, such auto-fixing tools work well only on static HTML pages and do not address several challenges in current Web development. The first challenge is that in a Web application, a client-side HTML page is often dynamically generated from server-side code, which is written in different languages. For example, the server code may be written in PHP, ASP, Perl, SQL, etc., while a client-side page is in HTML, JavaScript, CSS, and so on.
The generated HTML code is embedded within string literals or the values of variables in the server code. Moreover, those values are scattered across multiple locations in server pages. For example, producing a single HTML table can involve multiple variables and string constants in different functions of the server code. Importantly, because the server code dynamically produces different client pages depending on run-time conditions, when a validation error is found and reported in a Web page (e.g., via Tidy), it is challenging for developers to manually map the buggy location(s) back to their source(s) in the server-side code.

We propose PhpSync, an auto-locating and fix-propagating tool for HTML validation errors in PHP-based Web applications. Given an HTML page produced by a PHP server page, PhpSync uses Tidy, an HTML validating/correcting tool, to find any validation errors on the HTML page. If errors are detected, PhpSync takes the fixes that Tidy applies to the given HTML page and propagates them to the corresponding location(s) in the PHP code. In cases where Tidy cannot provide a fix, the auto-locating function in PhpSync helps developers quickly locate the buggy locations in the PHP code that correspond to the buggy HTML locations found by Tidy.

PhpSync does not require the input that produced the erroneous page. The dynamic nature of a Web application is addressed by our symbolic execution algorithm, which symbolically executes the given PHP program to create a single tree-based representation, called a D-model, that approximates its possible HTML client page outputs. Each D-model represents a symbolic, string-based value that results from the symbolic execution of one or more PHP expressions. The D-model for an entire PHP server page or function is composed of the D-models resulting from the intermediate computations performed during the symbolic execution of the expressions in that page or function.
Symbols in a D-model represent users’ inputs, data retrieved from databases, or unresolved values. A node in a D-model represents either 1) a determined value (e.g., a string literal), 2) a non-determined data value (e.g., a user’s input), 3) a concatenation operation, 4) a selection operation, or 5) a repetition operation on other nodes/values. This allows PhpSync to model the multi-valued and scattered server-side data and the multiple versions of client-side code generated from the server code.

Another fundamental technique in PhpSync is CSMap, an algorithm that maps any text in the given HTML page produced by the given PHP program to the corresponding PHP code location by mapping that text to the node(s) of the corresponding D-model. Our fix-propagating algorithm then derives the changes that Tidy makes to the given HTML page and propagates them to the corresponding locations in the PHP code via the established client-to-server mappings. CSMap is generic and can be used in other applications, such as locating the PHP code responsible for other types of errors found in an HTML page.

Our empirical evaluation on real-world Web applications shows that PhpSync achieves on average 96.7% accuracy in locating the corresponding locations in PHP code from client pages, and 95% accuracy in propagating fixes to server code. The key contributions of this paper include:
1) PhpSync, an auto-locating and fix-propagating tool for HTML validation errors in PHP-based Web applications;
2) CSMap, a mapping algorithm from an HTML page (produced by a PHP page) to the corresponding PHP locations;
3) an empirical evaluation on several real-world Web applications to show PhpSync’s correctness and efficiency.

Section II presents a motivating example. Section III discusses our representation model. The associated algorithms are described in Sections IV and V. Section VI presents our evaluation. Related work is discussed in Section VII. Conclusions appear last.
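
To make the five D-model node kinds and the idea of a client-to-server mapping more concrete, the following is a minimal Python sketch. It is not PhpSync's implementation and not the CSMap algorithm itself, which are described in Sections III to V. The class names, the `match` and `locate` functions, the naive backtracking strategy, and the `page.php:N` locations are all assumptions made for illustration only. The hand-built model stands in for what symbolic execution of a hypothetical PHP page might produce.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

# (start, end, PHP location or None) for one matched piece of the output
Span = Tuple[int, int, Optional[str]]

@dataclass(frozen=True)
class Literal:            # 1) determined value, e.g. a PHP string literal
    text: str
    loc: str              # hypothetical "file:line" of that literal

@dataclass(frozen=True)
class Symbol:             # 2) non-determined value, e.g. user input
    name: str

@dataclass(frozen=True)
class Concat:             # 3) concatenation of child nodes
    parts: tuple

@dataclass(frozen=True)
class Select:             # 4) selection: exactly one option is output
    options: tuple

@dataclass(frozen=True)
class Repeat:             # 5) repetition: body is output zero or more times
    body: object

def match(node, text: str, pos: int) -> Iterator[Tuple[int, List[Span]]]:
    """Yield every (end, spans) such that node can produce text[pos:end]."""
    if isinstance(node, Literal):
        if text.startswith(node.text, pos):
            end = pos + len(node.text)
            yield end, [(pos, end, node.loc)]
    elif isinstance(node, Symbol):
        # A symbol may stand for any string; try the shortest candidate first.
        for end in range(pos, len(text) + 1):
            yield end, [(pos, end, None)]
    elif isinstance(node, Concat):
        yield from match_seq(node.parts, text, pos)
    elif isinstance(node, Select):
        for option in node.options:
            yield from match(option, text, pos)
    elif isinstance(node, Repeat):
        yield pos, []                          # zero iterations
        for mid, spans in match(node.body, text, pos):
            if mid > pos:                      # each iteration must consume text
                for end, rest in match(node, text, mid):
                    yield end, spans + rest

def match_seq(parts, text: str, pos: int) -> Iterator[Tuple[int, List[Span]]]:
    """Match a sequence of nodes one after another, with backtracking."""
    if not parts:
        yield pos, []
        return
    for mid, spans in match(parts[0], text, pos):
        for end, rest in match_seq(parts[1:], text, mid):
            yield end, spans + rest

def locate(model, html: str, offset: int) -> Optional[str]:
    """Return the PHP location of the literal that produced html[offset].

    Returns None when the character comes from a Symbol (no fixed source text).
    """
    for end, spans in match(model, html, 0):
        if end == len(html):                   # first full match of the page
            for start, stop, loc in spans:
                if start <= offset < stop:
                    return loc
            return None
    raise ValueError("page is not an output of this model")

# Hand-built model of a hypothetical page.php: a table header, a loop that
# prints rows, an if/else that prints one of two paragraphs, a table footer.
model = Concat((
    Literal("<table>", "page.php:3"),
    Repeat(Concat((
        Literal("<tr><td>", "page.php:5"),
        Symbol("row"),
        Literal("</td>", "page.php:5"),
    ))),
    Select((
        Literal("<p>admin</p>", "page.php:8"),
        Literal("<p>guest", "page.php:10"),    # unclosed <p> on this branch
    )),
    Literal("</table>", "page.php:12"),
))

html = "<table><tr><td>a</td><tr><td>b</td><p>guest</table>"
print(locate(model, html, html.index("<p>guest")))  # page.php:10
```

In this toy setting, an error that a validator reports at the position of `<p>guest` in the client page is traced back to the single literal on the `else` branch, even though the page also contains text from a loop and from unknown row data. The backtracking matcher is exponential in the worst case and is meant only to show the shape of the problem.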