Maintenance Patterns of Large-Scale PHP Web Applications

INTRODUCTION

Various anecdotal sources in computer science have long claimed that scripting languages, such as those employed in LAMP (Linux-Apache-MySQL-Perl/Python/PHP), are not suitable for proper, professional software engineering [10], despite their tremendous popularity [2]. In other words, proponents of traditional compiled languages such as Java and C++ claim that software projects based on scripting languages lack the architectural properties that allow systematic, effortless and viable maintenance. Such criticism of scripting languages is seldom documented in scientific papers, although the academic community usually tends to reject the change in programming practices brought about by scripting [10]. This skepticism is also reflected in the fact that, in most academic institutions around the world, computing curricula do not rely on scripting or dynamic languages for their CS101 course. The number of empirical studies in the software engineering community on projects built with typeless scripting languages is also significantly smaller than the number on projects built with strongly typed ‘system programming’ languages. On the other hand, evidence suggests that scripting languages enhance programmer productivity [12]. Prechelt [14] presented results according to which implementing a program in a scripting language such as Perl, Python, Rexx or Tcl took about half the time required to implement the same functionality in C, C++ or Java. The adoption of scripting languages by software practitioners is also reflected in their increased penetration into open-source development. On SourceForge, PHP ranks third by project count with 33,259 projects, behind Java (52,234 projects) and C++ (42,081 projects) and ahead of C (31,194 projects).
In this paper we present an empirical study of five large open-source web applications implemented in the popular scripting language PHP. The study investigates the evolution of these web applications with respect to their maturity, their quality and their adoption of the object-oriented paradigm.

We have examined several aspects of software evolution that might provide hints as to whether good practices in development and management have been followed. Dead or unused code in any software system is a burden that consumes resources and threatens maintainability. We examine the presence and survivability of unused code as a means of detecting architectural changes in the history of the examined systems. In scripting languages a major source of unused code is the employment of third-party libraries, which at the same time is an accepted good practice in software development and a possible indication of maturity [1], [2]. In this context, we investigated the amount of library code being used over time in each system. Another factor implying software maturity is the stability of the corresponding APIs; we have therefore also examined six classes of possible API changes. Moreover, we investigated the migration of the analyzed projects to the object-oriented paradigm as well as the evolution of their complexity.

The rest of the paper is organized as follows. In Section II we introduce the web applications that have been analyzed, while in Section III we discuss issues and challenges related to the analysis of the examined versions. Results on each of the investigated aspects of software evolution are presented and discussed in Section IV. Threats to validity are summarized in Section V. Related work on the analysis of software systems and previous work on scripting languages are presented in Section VI. Finally, we conclude in Section VII.

APPLICATIONS

The software systems used in the case study have been selected according to the following criteria:

- They should be well-known projects with an established reputation in the open-source community.
- They should have started in a PHP version prior to version 5. The reason for this choice is that, although object orientation was introduced in PHP 4, support prior to version 5 was quite limited (e.g. there were no scope modifiers).
- They should have their code available on GitHub.
- They should have at least 30 unique tags on GitHub.

Our major concern was to select acknowledged projects with a long history, a large number of committers and an even larger number of users. According to Samoladas et al. [15], the majority of open-source projects are abandoned after a short time period, rendering them inappropriate for systematic analysis of programming and maintenance habits. The case study has been conducted on the following five open-source projects implemented in PHP:

1) WordPress. The most popular blogging software; it has a vast community of both contributors and active users.
2) Drupal. One of the most advanced content management systems (CMS). It is also characterized by a large and active community.
3) phpBB. One of the most widely used forum software packages.
4) MantisBT. Probably the most popular bug tracking application written in PHP.
5) phpMyAdmin. The well-known MySQL administration tool.

The abovementioned software systems are to a large extent community driven and could be characterized as the founding projects of web application development (considering PHP as the programming language). They have set the standards and powered most of the web content created in the last decade. The fact that the examined projects have an enormous code base and numerous user plug-ins that depend on them implies that backward compatibility should never be broken. Owing to the projects' long existence, many versions are available. Table I shows some statistics about the selected projects.
Cumulatively, we have studied 390 official releases, amounting to 50 years of software evolution.
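The measurable selection criteria listed above can be expressed as a simple filter. The following is only an illustrative sketch in Python, not the procedure used in the study: the `Project` record, its field names and the two candidate entries are hypothetical placeholders, and the reputation criterion is left out because it is a qualitative judgment rather than a number.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Hypothetical record describing a candidate project."""
    name: str
    first_php_major: int  # major PHP version the project started on
    on_github: bool       # source code available on GitHub
    unique_tags: int      # number of unique tags on GitHub


def meets_criteria(project: Project, min_tags: int = 30) -> bool:
    """Apply the three measurable criteria from the list above."""
    return (
        project.first_php_major < 5
        and project.on_github
        and project.unique_tags >= min_tags
    )


# Placeholder data for illustration only; these are not real projects.
candidates = [
    Project("project_a", first_php_major=4, on_github=True, unique_tags=120),
    Project("project_b", first_php_major=5, on_github=True, unique_tags=45),
]
selected = [p.name for p in candidates if meets_criteria(p)]
```

Here `project_b` is rejected because it started on PHP 5, even though it satisfies the other two conditions.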

RELATED WORK

Software evolution is one of the most studied areas in software engineering, dating back to the 1970s, when M. Lehman laid down the first principles of software evolution [7], which gradually evolved into eight laws. The validity of Lehman’s laws in various contexts has been studied by several researchers. Recently, Xie et al. [21] studied the software evolution of seven open-source projects implemented in C. McCabe’s cyclomatic complexity number (CCN) and lines of code (LOC) were used to investigate the validity of the second and the sixth law, respectively. Both laws were validated. Our findings for PHP projects agree with these conclusions for C projects.

Survival analysis has been employed by Sentas et al. [17] as a tool to predict the duration of software projects. In a similar manner, Samoladas et al. [15] employed the Kaplan-Meier estimator to predict the duration of open-source projects. Scanniello [16] applied the Kaplan-Meier estimator to Java open-source projects to study the effect of dead code on the evolution of projects. The results show that high rates of unused code were detected in most of the projects in that study.

Regarding the use of libraries, Heinemann et al. [5] studied the extent of software reuse in Java open-source software. The authors distinguished between black-box and white-box reuse, a distinction that does not apply to scripting languages, and quantified the extent of reuse by measuring the bytecode of the JAR files used. They showed that in most cases over 50% of the code size originates from third-party libraries. Mockus [11] investigated large-scale code reuse in open-source projects by identifying components that are reused among several projects. However, Mockus’ work quantifies how often code entities are reused, rather than the actual third-party code. Based on these results, code reuse is a common practice in open-source projects, a fact which is confirmed by the findings of our study.
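For readers unfamiliar with the Kaplan-Meier estimator mentioned above, it estimates the probability S(t) that a subject survives beyond time t as the product, over all event times t_i <= t, of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i is the number of subjects still at risk just before t_i. The Python sketch below is a minimal illustration under our own assumptions and is not the implementation used in any of the cited studies: the function name `kaplan_meier` and the toy data are hypothetical, with durations standing for the number of releases a code unit survived and `observed` marking whether its removal was actually seen (otherwise the observation is censored).

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for each time t at which at least one event occurred.

    durations: time until the event or until observation stopped
    observed:  True if the event was seen, False if the subject was censored
    """
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # events plus censored subjects per time point
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction that survives t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Everyone leaving at t (event or censored) drops out of the risk set.
        at_risk -= exits[t]
    return curve


# Toy data for illustration only (not taken from any study).
toy_durations = [1, 2, 2, 3, 4]
toy_observed = [True, True, False, True, False]
toy_curve = kaplan_meier(toy_durations, toy_observed)
```

Censored subjects never lower the survival estimate directly; they only shrink the risk set for later time points, which is what distinguishes this estimator from a plain fraction of survivors.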