A Study of the Relationships between Source Code Metrics and Attractiveness in Free Software Projects

Paulo Meirelles, Carlos Santos Jr., João Miranda, Fabio Kon
FLOSS Competence Center, Institute of Mathematics and Statistics
University of São Paulo, Brazil (CCSL-IME/USP)
{paulormm,denner,joaomm,fabio.kon}@ime.usp.br

Antonio Terceiro, Christina Chavez
Department of Computer Science
Federal University of Bahia, Brazil (DCC-UFBA)
{terceiro,flach}@dcc.ufba.br

Abstract: A significant number of Free Software projects have been widely used and considered successful. However, there is an even larger number of them that cannot overcome the initial steps towards building an active community of users and developers. In this study, we investigated whether there are relationships between source code metrics and attractiveness, i.e., the ability of a project to attract users and developers. To verify these relationships, we analyzed 6,773 Free Software projects from the SourceForge.net repository. The results indicated that attractiveness is indeed correlated with some source code metrics. This suggests that measurable attributes of the project source code somehow affect the decision to contribute to and adopt a Free Software project. The findings described in this paper show that it is relevant for project leaders to monitor source code quality, particularly a few objective metrics, since these can have a positive influence on projects' chances of forming a community of contributors and users around their software, enabling further enhancement in quality.

I. INTRODUCTION

The adoption of Free and Open Source Software¹ has significantly increased in the last decades to the point of becoming influential to the global economy [1]. Although Free Software has emerged as a movement supported by volunteer developers, many large companies are now involved in it [2], [3]. According to a Forrester Consulting survey, which compared large companies in Europe and North America [4], the usage of Free Software is currently widespread in the back-end, middleware, office productivity tools, and business applications software categories. Moreover, this survey states that 92% of the senior business and IT executives say that Free Software products have met and, in some cases, exceeded their quality expectations.

¹ In this study, we consider the terms Free Software and Open Source Software (OSS) equivalent.

This satisfaction and quality are usually achieved thanks to the collaboration of a large user and developer community that reports failures, fixes bugs, and adds features. In fact, the Free Software development model is said to offer two main advantages: the potential for peer review and the possibility of attracting developers from different parts of the world [5]. Hence, an important issue for Free Software projects is to attract volunteers [6].

However, not all Free Software projects reach success and high quality [5]. The number of inactive projects is undoubtedly higher than the number of active projects. To illustrate this scenario, consider the data extracted in November 2009 from SourceForge.net, one of the most popular Free Software repositories. Out of its 201,494 projects, only 60,642 had more than one release, 40,228 had been downloaded more than once, and 23,754 had more than one member. Finally, only 12,141 projects matched all these criteria simultaneously. This may indicate that no more than about 6% of the projects on SourceForge.net (12,141 of 201,494) are able to have a healthy community of users and developers benefiting from a Bazaar style of development [7].

Santos Jr. et al. [8] defined a theoretical model for attractiveness as a crucial construct for Free Software projects, proposing its (i) typical origins (e.g., license type and intended audience); (ii) indicators (e.g., number of members and downloads); (iii) consequences (e.g., levels of activity and time for task completion). They suggested that the success of any project depends on its level of attractiveness to potential contributors and users. Based on this model, our study explored some of the factors that may enable projects to build a community by attracting users and developers. Specifically, our focus rests on objective factors: we investigated whether attractiveness can also be influenced by two measurable source code attributes, structural complexity and size.

Although source code metrics have been proposed since the 1970s [9], their potential use as guidelines for software development has not been fully explored yet [10]. In particular, we have observed that many Free Software projects do not practice source code quality evaluation and have no tools available to do so. This lack of systematic code evaluation leaves a lot of room for improvement in Free Software projects' processes and practices [5].

In general, despite the high importance of source code in the Free Software community, called the "show me the code" culture, source code metrics are often not perceived as an indicator of quality. To address this apparent contradiction, we argue, theoretically, that source code metrics are related to project attractiveness and, thus, influence its success. To verify these ideas empirically, we analyzed 6,773 projects
written in the C language from SourceForge.net. We show that, considering such a sample, one structural complexity metric and two size metrics play an important role in explaining attractiveness, here represented by the number of downloads and the number of project members.

The remainder of this paper is organized as follows: Section II presents the theoretical foundations of source code metrics and attractiveness. Section III presents our hypotheses and shows the selection criteria and definition for the variables used in our study. Section IV evaluates the hypotheses and discusses their results. Section V reviews related work. Conclusions and future research directions are discussed in Section VI.

II. THEORETICAL BACKGROUND

In this section we present the definitions of the selected source code metrics (Section II-A) and the attractiveness concept, including its proxies (Section II-B). These were chosen to build a statistical model that represents the relationships proposed.

A. Source code metrics

For our subset of source code metrics, we used the "module" concept as a general term for the different types of structures used in software development. Therefore, it stands for classes, abstract data types and source files. Specifically in this study, we consider a C source file as a "module".

Similarly, we generalized the concept of "method" to "function", denoting a portion of source code that performs a specific task.

The most commonly used metric to measure software size is Lines of Code (LOC), which indicates the number of non-blank and non-comment source code lines. Using the LOC metric as a basis for comparison between projects requires the projects to be written in the same programming language [11].

On the other hand, Number of Modules is another useful size indicator, which is somewhat less influenced by programming languages and line-level coding styles. It can therefore be used to compare projects written in different languages [10].

When considering characteristics such as maintainability, flexibility, comprehension effort, and source code quality in general, one has to take into account not only the size metrics described above but also structural metrics, such as the ones that follow.

Number of Functions (NOF) is used to measure module size in terms of the operations it implements. This metric is used to help identify the reuse potential of a module. In general, modules with a large number of functions are more difficult to reuse because they tend to be less cohesive [12]. Hence, a module should not have an excessive number of functions [13].

Number of Public Variables (NPV) and Number of Public Functions (NPF) are metrics related to module encapsulation. They measure the potential communication among modules [14]. Since good programming practices recommend that module variables be manipulated only via accessor functions [13], module variables should be private, which means that the optimal value for NPV is zero. The number of public functions in a module represents the size of the "module interface". Functions are directly related to the operations provided by their module. High values for this metric indicate that a module has a lot of functions and, probably, many responsibilities, conflicting with good programming practices [13].

Cohesion is a measure of the diversity of "topics" that a module implements. High cohesion indicates that a module focuses on a single aspect of the system, while low cohesion indicates that it deals with several different aspects. In terms of understanding, modification and maintainability, highly cohesive modules should be sought. A metric commonly used for cohesion is Lack of Cohesion of Methods (LCOM), originally proposed by Chidamber and Kemerer [15]. High LCOM values indicate low cohesion, while low LCOM values indicate high cohesion.

The first LCOM definition, called LCOM1, corresponds to the number of function pairs of a module that do not manipulate any module variable in common. Since this definition has received several criticisms and revision proposals, in this study we used the revised definition by Hitz and Montazeri, known as LCOM4 [16]. In order to calculate the LCOM4 of a module, it is necessary to build an undirected graph in which the nodes are its functions and variables. For every function, there should be an edge between it and every other function or variable it uses. The LCOM4 value is the number of connected components of this graph.

Coupling is a measure of how one module is connected to other modules in the project. High coupling indicates a greater difficulty in changing the modules of the system, since a change in one module may have an impact on all other modules that are coupled to it. In other words, if coupling is high, the software tends to be less flexible, more difficult to adapt, and more difficult to understand. A straightforward way to measure coupling is Coupling Between Objects (CBO), also proposed by Chidamber and Kemerer. CBO measures how many modules are used by the one being analyzed [15].

The more complex a piece of software, the more challenging it is to change and evolve it. Coupling and cohesion have been described and discussed in many works as essential indicators of structural complexity [17]. Moreover, it is widely known that to build high-quality and flexible software, it is advisable to seek low coupling and high cohesion [18].

In fact, Darcy et al. [17] showed that, individually, neither coupling nor cohesion is related to software maintenance effort. Both must be considered together. When combined, the product of coupling and cohesion as a metric is positively correlated with the maintenance effort.

In conclusion, seven source code metrics, selected according to the criteria shown in Section III, were used in the statistical model for our study. In particular, we use the product of coupling (CBO) and cohesion (LCOM4) as our metric of structural complexity (SC) [17].
Fig. 1. Attractiveness research model, adapted from Santos Jr. et al. [8].

B. Attractiveness

Attractiveness is the capacity of bringing users and developers to a project. A Free Software project is attractive to the extent that it appeals to potential users and developers, who will later use the software and, ultimately, participate in tasks to improve the project [8].

In our study, the concept of attractiveness and its proxies are based on the attractiveness model of Santos Jr. et al. [8], one of our previous works. In that work, we presented a research model for attractiveness, shown in Figure 1. We now intend to expand and adapt it to our hypotheses about source code attributes and attractiveness, as also emphasized in Figure 1. This model specifies project characteristics that influence its attractiveness, and the consequences of attractiveness (e.g., levels of activity, efficiency, likelihood of task completion, time for task completion, and quality perception) [8].

Originally, the project characteristics proposed were:
• License type under which a program is available, such as GPL, BSD, and Mozilla Public. The license influences the use and distribution of a project, and defines the rules for creating derivative works, regulating what can and cannot be done with the source code [8]. These restrictions impact people's motivations to use and develop Free Software.
• Intended audience is the type of users (e.g., beginner, advanced) and members (e.g., system administrators, Java programmers) a project aims at. Audience can influence the number of potential developers because it expands or shrinks the target population size [19]. Moreover, specific types of users attract specific members, which defines their expertise and likelihood to contribute [8].
• Type of project refers to the specific area to which a project is related, such as genealogy, payroll, browsing, games, scientific, etc. [20]. This represents the application domain and influences attractiveness for marketing reasons normally known as niches. Some niches have more volunteer labor (and user base) available than others. Moreover, some application domains have more competing (similar) projects than others, making it harder for a project to stand out as a viable option for use [19].
• Development status (life-cycle status) refers to the currently available versions of the software. This could be, for example, testing, beta, stable, production, or mature. This status can influence a developer's decision to join and contribute to a project, as well as a user's decision to adopt its software [8]. It can also affect members' motivation to work towards releasing a new version, affecting productivity rates [21], [6].

We propose to insert source code attributes as a project characteristic in this theoretical cause and effect model of Free Software attractiveness, shown in Figure 1. To achieve this, we defined a simple (intermediary) attractiveness model to observe the individual influence of source code attributes on Free Software attractiveness, using the subset of variables highlighted in Figure 1. In short, we propose a new element that can partially explain attractiveness. We did not deal with the consequences of attractiveness when we added source code attributes as an influence. However, we expect that they would work in the same causal chain manner as shown in Figure 1.

Before a Free Software project can receive failure reports, bug fixes, and new features, it should be attractive to volunteers, who normally first use and join the project, providing contributions later. Over time, these contributions affect the number of downloads and bring more members, creating a positive feedback loop. Thus, we estimated attractiveness based on two of its empirical indicators:
1) number of downloads as registered for SourceForge.net
projects, which represents the number of people interested in using the software.
2) number of members as registered for SourceForge.net projects, which represents the number of contributors to the project.

One should note that the numbers of downloads and members at SourceForge.net are proxies for the actual numbers of users and developers of the project, respectively. This study explored a large number of projects and applied the same criteria uniformly to all of them. Although we recognize that using these proxies is a limitation, we did not find in previous work better proxies that could represent more faithfully the number of users and number of developers in a large sample of projects.

The meaning of "success" for Free Software projects was discussed from different perspectives in previous research: (i) source code modularity [22]; (ii) number of lines of code generated [23]; (iii) velocity of closing bugs [6]; (iv) ability of a project to advance through development phases (e.g., from alpha to beta to stable) [21], [20]; (v) number of downloads [24]; (vi) number of members [25].

In our understanding, these measures, when used individually, do not fully indicate a successful Free Software project; but, when analyzed together, they offer the means to help achieve success, or keep it [8]. Additionally, the vast majority of collaboration in a Free Software project centers on its source code, which is the most important "artifact" generated and managed by and for its community. Therefore, we propose to insert in the attractiveness model the source code attributes, obtained from source code metrics, as one of the origins of attractiveness. In summary, some source code characteristics can lead to more contributions to a project, which may attract more users and developers. Our hypotheses are based on this idea.

III. RESEARCH DESIGN

Initially, this section presents a source code analysis tool under development by our group called Analizo (Section III-A). It was used to calculate source code metrics for 6,773 Free Software projects from SourceForge.net. Later, Section III-B presents the criteria used for sample and data collection. Finally, Section III-C shows the multiple regression model defined to test our hypotheses, which are discussed in Section III-D.

A. The Analizo Tool

Analizo (softwarelivre.org/mezuro/analizo) is a multi-language source code analysis tool. Its architecture was designed to support source code parsing in different languages and to report useful information about it.

A basic requirement of our source code analysis tool was the ability to analyze source code written in multiple languages. Most existing tools use object code to extract data, making it impossible to process projects that do not compile due to failures in either the source code or in its dependencies [26]. In addition, tools based on object code are not capable of analyzing features present only in the source, such as comments. To avoid these problems, Analizo is designed to extract the information directly from source code files.

Analizo uses Doxyparse (softwarelivre.org/mezuro/doxyparse), a multi-language source code parser based on Doxygen's internals, to parse the source code. This feature provides Analizo with the potential ability to parse all the languages supported by Doxygen (doxygen.org). So far, however, it has been tested only with C, C++, and Java source code, supporting the computation of fourteen metrics:
• Afferent Connections per Class (ACC),
• Coupling between Objects (CBO),
• Coupling Factor (COF),
• Depth of Inheritance Tree (DIT),
• Lack of Cohesion on Methods/Functions (LCOM4),
• Lines of Code (LOC),
• Lines per Method/Function (AMZ_Size),
• Number of Attributes/Variables (NOV),
• Number of Children per Class (NOC),
• Number of Methods/Functions (NOF),
• Number of Classes/Modules (NM),
• Number of Public Attributes/Variables (NPV),
• Number of Public Methods/Functions (NPF),
• Response for Class (RFC).

The correctness of the metrics computation was evaluated by comparing the results provided by Analizo with those of other existing tools such as CCCC (cccc.sourceforge.net), Cscope (cscope.sourceforge.net), Eclipse-Metrics (metrics.sourceforge.net), and Macxim/Spago4Q (qualipso.dscpi.uninsubria.it/macxim).

B. Sample and Data Collection

SourceForge.net shares its data to support Free Software researchers. In this study, we used the data available in a database managed by the University of Notre Dame (nd.edu/~oss/Data/data.html) and another one provided by the FLOSSMole project (flossmole.org). We accessed these databases in November 2009 and collected data about all the projects that matched the following criteria:
• Source code written in the C language. While the vast majority of Free Software applications are written in C [27], a large amount of research work focuses its analyses on projects written in Java (e.g., the related work reported in Section V). Given this disparity between the actual Free Software ecosystem and the research that addresses it, and our previous experience with analysis of Free Software written in C [28], we chose to focus the analysis in this work on such projects as well. This is our first study relating source code metrics and attractiveness, and in the future we plan to include other programming
TABLE I
DESCRIPTIVE STATISTICS
(Minimum, Maximum, Mean and Std. Dev. are raw values; Log Mean and Log Std. Dev. are computed on the logarithm of each variable.)

Metric                                    Minimum         Maximum         Mean      Std. Dev.  Log Mean  Log Std. Dev.
(Average) Coupling Between Objects         0.0015          711.50         2.26           9.04      0.35           0.98
(Average) Lack of Cohesion on Methods      0.0004          262.00         4.77          12.00      1.01           1.09
(Average) Structural Complexity                 0        4,940.00        15.79         114.69      1.37           1.57
(Total) Number of Modules                       1        7,177.00        74.98         276.54      3.08           1.39
(Total) Lines of Code                          11    2,983,103.00    17,722.23      91,614.70      8.28           1.58
(Total) Number of Public Variables              1      516,034.00       994.80       8,850.44      4.91           1.80
(Total) Number of Functions                     1       99,468.00       612.54       2,987.28      4.92           1.63
(Total) Number of Public Functions              1       99,468.00       642.12       3,025.94      5.02           1.58
(Total) Number of Members                       1          288.00         2.90           6.19      0.59           0.79
(Total) Number of Downloads                     6  941,498,760.00   956,674.26  17,760,732.37      8.20           2.66
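
The logarithm columns of Table I reflect a transformation applied because the raw variables are strongly skewed. The effect can be sketched as follows. The download counts below are invented for illustration and are not taken from the study's data set, the natural logarithm is an assumption (the base is not stated in this section), and `skewness` is a plain moment-based estimate rather than the authors' statistical package.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical download counts spanning several orders of magnitude.
downloads = [6, 40, 150, 900, 5_000, 60_000, 2_000_000]
log_downloads = [math.log(v) for v in downloads]  # natural log (assumption)

raw_skew = skewness(downloads)
log_skew = skewness(log_downloads)
print(f"raw skewness: {raw_skew:.2f}, log skewness: {log_skew:.2f}")
```

On data like this, a single very large project dominates the raw distribution, while the log-transformed values are spread far more evenly, which is the property that makes them better suited to linear regression.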

languages (e.g., C++, Java), as well as trying to identify similarities and discrepancies among projects written in different languages.
• More than one download. Projects with no downloads are probably either non-development projects, or projects that have just started, or are other special cases.

This provided us with a list of 11,433 projects. After this preliminary sampling, the following steps were automatically executed by scripts developed by our group to perform data collection:
1) Download the code of all the projects. This resulted in the source code for 10,128 projects, since some of them had no available files (empty "files" section in the SourceForge.net project pages);
2) Run Analizo sequentially for all projects and store the computed metrics in a single database. The metrics were successfully computed for only 6,773 projects, because (i) some downloaded files did not contain source code (e.g., binary-only downloads), (ii) the source code was not written in C (the project was incorrectly classified as being written in C), or (iii) some files could not be processed by Analizo due to severe errors in the source code (e.g., syntax errors);
3) Cross-join the two datasets. Finally, the two datasets (the SourceForge.net data available from the University of Notre Dame and FLOSSMole on the one side, and the source code metrics calculated by Analizo on the other side) were cross-joined so that we could perform the needed statistical analysis.

Table I summarizes our sample, and the complete data set used for this study is available on the Web (ccsl.ime.usp.br/mangue/data). Section III-C discusses in detail how we selected the variables presented in Table I. This table shows natural values of minimum, maximum, arithmetic mean, and standard deviation for each variable, indicating the characteristics of our sample.

We analyzed our selected variables in their natural form (Raw) to verify their distribution, which is presented in the first part of Table I. We observed that the skewness and kurtosis of the distributions showed high values, indicating non-normality [29]. Because of this non-normality, we transformed the variables to a logarithm scale for linearization, which reduced the skewness and kurtosis values and made the variables suitable for multiple regression [20]. The arithmetic mean and standard deviation of the logarithm values can be seen in the second part of Table I.

C. Variables

Among the fourteen source code metrics that Analizo provided, we selected seven for our initial analysis: total LOC, total NM, total NOF, total NPV, total NPF, average LCOM4, and average CBO. To be able to apply to the procedural paradigm of the C language some metrics that are widely used in the literature, such as NOF, NPV, NPF, LCOM4, and CBO, we assumed a mapping of the object concepts of "class" and "method" to the C concepts of "source file" and "function".

In this first study, we limited the scope of metrics used in order to reach a simple yet comprehensive model relating source code and attractiveness. Thus, the ACC, AMZ_Size, COF, DIT, NOC, NOV, and RFC metrics were left out of the scope of this study.

Nevertheless, in the first stage of our statistical analysis, the LOC, NOF, NPV and NPF metrics showed a high correlation with each other, according to Pearson's parametric correlations, as shown by the bold numbers in Table II. Highly correlated variables are likely representations of the same attribute, making it unnecessary to use more than one. Since all these metrics represent a similar concept, we selected one of them, LOC, for our statistical analysis to reduce multicollinearity.

We have also analyzed the Spearman and Kendall non-parametric correlations (see Table III and Table IV, respectively), given that some of our variables are not normally distributed. In our analysis, we observed that after transforming our variables into their logarithmic form, Pearson correlations performed just as well as the non-parametric indices. Thus, we chose the Pearson parametric correlation because it is the most commonly used form of correlation index, and it also provides the basis for the regression analysis we performed later. That way, we could maintain consistency, since multiple regression techniques are based on parametric indices [29].

According to the analysis described above, we ended up considering LOC and NM as size metrics. Theoretically, the more LOC, the more NM. However, it is possible to have
TABLE II
PARAMETRIC CORRELATIONS: PEARSON
Variable CBO LCOM4 SC NM LOC NPV NOF NPF Mbrs DLs
Coupling Between Objects - 0.141 0.723 0.380 0.608 0.423 0.434 0.492 0.113 0.129
Lack of Cohesion on Methods 0.141 - 0.786 0.019 0.472 0.102 0.311 0.361 0.080 0.107
Structural Complexity 0.723 0.786 - 0.254 0.666 0.338 0.493 0.564 0.127 0.156
Number of Modules 0.308 0.019 0.254 - 0.799 0.730 0.815 0.827 0.311 0.344
Lines of Code 0.608 0.472 0.666 0.799 - 0.872 0.923 0.927 0.328 0.410
Number of Public Variables 0.423 0.102 0.338 0.730 0.872 - 0.756 0.761 0.303 0.386
Number of Functions 0.434 0.311 0.493 0.815 0.923 0.756 - 0.886 0.320 0.380
Number of Public Functions 0.492 0.361 0.564 0.827 0.927 0.761 0.886 - 0.308 0.365
Number of Members 0.113 0.080 0.127 0.311 0.328 0.303 0.320 0.308 - 0.676
Number of Downloads 0.129 0.107 0.156 0.344 0.410 0.386 0.380 0.365 0.676 -

TABLE III
NON-PARAMETRIC CORRELATIONS: SPEARMAN
Variable CBO LCOM4 SC NM LOC NPV NOF NPF Mbrs DLs
Coupling Between Objects - 0.340 0.803 0.473 0.662 0.523 0.518 0.566 0.156 0.169
Lack of Cohesion on Methods 0.340 - 0.773 0.213 0.490 0.348 0.478 0.516 0.129 0.162
Structural Complexity 0.803 0.773 - 0.370 0.685 0.478 0.571 0.631 0.164 0.196
Number of Modules 0.473 0.213 0.370 - 0.793 0.718 0.818 0.828 0.284 0.320
Lines of Code 0.662 0.490 0.685 0.793 - 0.863 0.918 0.922 0.307 0.392
Number of Public Variables 0.523 0.348 0.478 0.718 0.863 - 0.758 0.765 0.280 0.363
Number of Functions 0.518 0.478 0.571 0.818 0.918 0.758 - 0.895 0.300 0.362
Number of Public Functions 0.566 0.516 0.631 0.828 0.922 0.765 0.895 - 0.288 0.347
Number of Members 0.156 0.129 0.164 0.284 0.307 0.280 0.300 0.288 - 0.598
Number of Downloads 0.169 0.162 0.196 0.320 0.392 0.363 0.362 0.347 0.598 -

TABLE IV
NON-PARAMETRIC CORRELATIONS: KENDALL
Variable CBO LCOM4 SC NM LOC NPV NOF NPF Mbrs DLs
Coupling Between Objects - 0.244 0.650 0.341 0.483 0.377 0.373 0.410 0.118 0.114
Lack of Cohesion on Methods 0.244 - 0.597 0.148 0.341 0.240 0.333 0.362 0.097 0.109
Structural Complexity 0.650 0.597 - 0.262 0.497 0.345 0.413 0.460 0.124 0.132
Number of Modules 0.341 0.148 0.262 - 0.605 0.546 0.641 0.648 0.217 0.219
Lines of Code 0.483 0.341 0.497 0.605 - 0.660 0.763 0.771 0.233 0.270
Number of Public Variables 0.377 0.240 0.345 0.546 0.660 - 0.588 0.596 0.213 0.249
Number of Functions 0.373 0.333 0.413 0.641 0.763 0.588 - 0.864 0.228 0.248
Number of Public Functions 0.410 0.362 0.460 0.648 0.771 0.596 0.864 - 0.219 0.237
Number of Members 0.118 0.097 0.124 0.217 0.233 0.213 0.228 0.219 - 0.471
Number of Downloads 0.114 0.109 0.132 0.219 0.270 0.249 0.248 0.237 0.471 -
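
The correlation indices reported in Tables II and III, and the screening of highly correlated size metrics, can be illustrated with a short sketch. It implements Pearson and Spearman only (Kendall is omitted), uses invented project data, and takes 0.85 as a hypothetical numeric reading of the "approximately 0.9 or higher" criterion. The four `loc_correlations` values are the Lines of Code entries of Table II.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical projects (not from the study): lines of code and downloads.
loc = [120, 900, 4_000, 15_000, 80_000, 600_000]
downloads = [30, 10, 800, 2_500, 400_000, 90_000]
log_loc = [math.log(v) for v in loc]
log_downloads = [math.log(v) for v in downloads]

print("Pearson, raw:", round(pearson(loc, downloads), 3))
print("Pearson, log:", round(pearson(log_loc, log_downloads), 3))
print("Spearman    :", round(spearman(loc, downloads), 3))

# Screening step: which metrics are redundant with LOC?
loc_correlations = {"NM": 0.799, "NPV": 0.872, "NOF": 0.923, "NPF": 0.927}
CUTOFF = 0.85  # assumption, not a value given by the authors
redundant_with_loc = sorted(
    name for name, r in loc_correlations.items() if r >= CUTOFF
)
print(redundant_with_loc)  # ['NOF', 'NPF', 'NPV']
```

Because the logarithm is monotonic, Spearman's index is identical on raw and log-transformed values, whereas Pearson's index changes. This is one way to see why the parametric and non-parametric indices are expected to agree more closely once the variables are in logarithmic form.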

more lines of code without having more modules, by adding code into existing modules. Also, a project could have more modules but keep the same number of lines of code when refactoring is applied. Furthermore, we found that number of modules (NM) was not highly correlated with the other metrics, given our criterion that a correlation is high when the Pearson correlation value is approximately 0.9 or higher, as emphasized in Table II.

In conclusion, LOC and NM were collected since they measure different kinds of size, more or less influenced by programming languages and coding styles, respectively. Finally, to obtain the value of our structural complexity metric (SC), explained in Section II, CBO and LCOM4 were multiplied. These three metrics did not show high correlations with other metrics. As expected, SC showed a positive correlation with both CBO and LCOM4. However, it was not as high as the others were, because CBO and LCOM4 had a low correlation with each other. This means that SC, statistically, represents different attributes when compared to CBO and LCOM4, thus endorsing the theory that CBO and LCOM4 together offer different information.

In summary, our multiple regression model ended up with the following variables:
• Independent variables (source code metrics)
  – Structural Complexity (SC), the product of the CBO and LCOM4 metrics;
  – Lines of Code (LOC), the sum of lines of code in all modules of the project;
  – Number of Modules (NM), the total number of modules of the project.
• Dependent variables (attractiveness)
  – Number of Downloads: a proxy for the number of users of the project;
  – Number of Members: a proxy for the number of developers in the project.

The model developed in this study revolves around attractiveness, aiming at the explanation of its causes. We defined a multiple regression model that has attractiveness as its dependent variable. It was measured through two indicators: number of downloads and number of members. Thus, we have two different regressions, one for each attractiveness indicator.
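
A regression of this shape, with SC, LOC and NM as predictors of one attractiveness indicator, can be sketched with ordinary least squares on standardized variables, which yields standardized coefficients directly. This is a generic illustration and not the authors' statistical procedure. All data are synthetic, the response is generated from assumed coefficients (-0.3, 0.9, 0.2) so that the fit can be checked, and the helper names are hypothetical.

```python
import math


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def standardize(values):
    """Convert a sample to z-scores."""
    mean = sum(values) / len(values)
    sd = sample_sd(values)
    return [(v - mean) / sd for v in values]


def solve(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def standardized_betas(predictors, response):
    """Least-squares fit on z-scores; no intercept is needed."""
    zs = [standardize(col) for col in predictors]
    zy = standardize(response)
    xtx = [[sum(p * q for p, q in zip(zi, zj)) for zj in zs] for zi in zs]
    xty = [sum(p * q for p, q in zip(zi, zy)) for zi in zs]
    return solve(xtx, xty)


# Synthetic log-scale metrics for eight hypothetical projects.
log_sc = [0.2, 1.5, 0.9, 2.4, 1.1, 3.0, 0.5, 1.9]
log_loc = [6.1, 7.4, 8.0, 8.3, 9.2, 9.9, 10.5, 11.8]
log_nm = [1.2, 2.9, 2.1, 3.8, 3.0, 4.4, 4.1, 5.6]
predictors = [log_sc, log_loc, log_nm]

# Response generated from assumed coefficients, purely for illustration.
log_downloads = [
    1.0 - 0.3 * s + 0.9 * l + 0.2 * m
    for s, l, m in zip(log_sc, log_loc, log_nm)
]

betas = standardized_betas(predictors, log_downloads)
for name, beta in zip(["SC", "LOC", "NM"], betas):
    print(f"Std. beta for {name}: {beta:.3f}")
```

The second regression would repeat the same call with the (log) number of members as the response. Standardized coefficients are unit-free, so their magnitudes can be compared across predictors measured on different scales.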
They are the variables explained by the source code attributes Table V summarizes the regression results based on the
proposed in our hypotheses. Consequently, the SC, LOC and Pearson’s correlation values. These statistical results indicated
NM metrics, which represent the source code attributes, are a linear dependency between our source code metrics and
the independent variables – the influencers of attractiveness. each attractiveness variable. In this table, β is a coefficient
that indicates the size of the influence of each metric on each
D. Research Hypotheses
attractiveness indicator.
In this first study about the relationships between source As we can see in Table V, lines of code is more strongly
code metrics and attractiveness, we investigated whether two correlated to downloads and members than structural com-
attributes – structural complexity and size – obtained via four plexity and number of modules, according to the standardized
source code metrics might influence the attractiveness of Free beta (Std. β). Standardized betas are calculated to perform
Software projects. Thereby, we can later observe whether comparisons between variables that are measured using dif-
these attributes influence people’s perception of quality as ferent scales (e.g., lines of code and structural complexity).
consequence of attractiveness. According to the metrics chosen One cannot compare regular beta coefficients without first
to represent structural complexity and size, we formulated standardizing them.
three hypotheses: Moreover, structural complexity has a negative correlation
H1 – Free Software projects with higher structural com- with attractiveness, as expected. Noteworthy is that the T-
plexity have lower attractiveness. The higher the software test and P (probability) values represent whether a source
complexity, the more difficult it is to understand its source code metric is a statistically significant predictor or influencer
code for maintenance and evolution purposes. This leads to an of attractiveness indicators. For downloads, the number of
increase in the maintenance effort, and makes it more difficult modules is not significant because its P-value is greater than
to attract new members and users for the project. Over time, 0.05. Finally, in the last line of Table V, R-squared values
with less members and users, the project may lose its ability indicate the percentage of attractiveness (users and developers)
to add new features and fix bugs and, consequently, its ability variance that this set of source code metrics is capable of
to evolve and meet the user’s changing requirements. explaining. So, roughly speaking, an R-squared of 20 percent
H2 – Free Software projects with more lines of code have indicates that a set of predictors can explain 20 percent of a
higher attractiveness. To some extent, lines of code reflect the dependent variable. We obtained the following equations:
amount of features of the project and the amount of work that
have been put into it. Therefore, projects with more lines of downloads = 1.551 − 0.286 × log(SC)
code will usually attract more users (since they have more +0.856 × log(LOC) + 0.008 × log(N M )
features) and developers – since they offer more opportunities
members = −0.668 − 0.033 × log(SC)
for contribution. +0.126 × log(LOC) + 0.087 × log(N M )
H3 – Free Software projects with a higher number of
modules have higher attractiveness. The number of modules Each equation has one R-value. The coefficient (β) of each
may indicate the project size and the possibility of working in variable is the size of influence that one of the source code
parallel in independent modules. More modules may indicate metrics (the independent variable) has on the attractiveness –
a concern with good design and better modularization, which the dependent variable. So, one unit change in an indepen-
facilitates contributions. This attracts more members, who can dent variable generates a β-size influence on the dependent
write more features and fix more bugs, which would then variable, on average.
attract more users. The R-value represents the amount of the dependent vari-
ables that can be explained by that set of independent vari-
IV. H YPOTHESES T ESTING ables. In our analysis, the R-value indicated that source code
We specified a multiple regression model to explain the metrics explain 18% (R2 = 0.180) of the number of down-
relationships between the selected source code metrics and loads and 12% (R2 = 0.121) of the number of members. These
attractiveness in Free Software projects. Before running this are significant values for the social context that an adoption
model, we analyzed and applied statistical techniques on the or volunteering of a Free Software projects are involved.
descriptive statistical values of our dataset presented in Table I, A. Hypothesis 1
discussed in Section III-B. With the results in hand, we The data analysis supports our first hypothesis – Free
selected the variables of our regression model according to Software projects with higher structural complexity have lower
our scope definition, the analysis of the Pearson parametric attractiveness. In fact, structural complexity has a negative
correlations and Spearman and Kendal non-parametric corre- influence on attractiveness. When related to downloads, it
lations, shown in Table II, Table III, and Table IV respectively, presents a -0.286 β coefficient and p < 0.001. This means that
and presented in detail in Section III-C. Finally, with the structural complexity has an statistically significant impact on
linearized values of SC, NM, and LOC (independent variables) user interest.
and number of downloads and number of members (dependent In the Free Software context, structural complexity may
variables), we tested our hypotheses according to our statistical indicate the difficulty to make improvements to the software,
multiple regression model compound for these variables. such as new features and bug fixes. So, most users may loose
TABLE V
EQUATIONS AND PEARSON CORRELATIONS

                             |             Downloads              |              Members
Metric                       |   β      Std. β   T-value  P-value |   β      Std. β   T-value  P-value
(Constant)                   |  1.551    -         6.12   <0.001  | -0.668    -        -8.47   <0.001
Structural Complexity (log)  | -0.286   -0.150    -8.616  <0.001  | -0.033   -0.058    -3.238   0.001
Lines of Code (log)          |  0.856    0.506    18.624  <0.001  |  0.126    0.249     8.846  <0.001
Number of Modules (log)      |  0.008    0.004     0.186   0.852  |  0.087    0.148     6.625  <0.001
R                            |  0.425                             |  0.348
R2                           |  0.180                             |  0.121
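
As an illustration of how the coefficients in Table V combine, the following Python sketch evaluates the two fitted equations for a hypothetical project. It is only a sketch under stated assumptions: the logarithm base and the exact form of the linearized dependent variables are not given in this section, so the natural logarithm is assumed and the outputs are read as values on the model's linearized scale. The function name and the example metric values are hypothetical.

```python
import math

# Coefficients (beta) copied from Table V.
DOWNLOADS = {"const": 1.551, "sc": -0.286, "loc": 0.856, "nm": 0.008}
MEMBERS = {"const": -0.668, "sc": -0.033, "loc": 0.126, "nm": 0.087}


def predict(coef, sc, loc, nm, log=math.log):
    """Evaluate one regression equation on the model's linearized scale.

    Assumption: natural logarithm; the base is not stated in the text.
    """
    return (coef["const"]
            + coef["sc"] * log(sc)
            + coef["loc"] * log(loc)
            + coef["nm"] * log(nm))


# Hypothetical project: SC = 20, 50,000 lines of code, 120 modules.
base = predict(DOWNLOADS, sc=20, loc=50_000, nm=120)
# Same project with twice the lines of code, everything else fixed.
bigger = predict(DOWNLOADS, sc=20, loc=100_000, nm=120)
# Same project with twice the structural complexity.
more_complex = predict(DOWNLOADS, sc=40, loc=50_000, nm=120)

print(f"downloads, baseline:   {base:.3f}")
print(f"effect of double LOC:  {bigger - base:+.3f}")        # 0.856 * log(2)
print(f"effect of double SC:   {more_complex - base:+.3f}")  # -0.286 * log(2)
print(f"members, baseline:     {predict(MEMBERS, 20, 50_000, 120):.3f}")
```

Because each predictor enters through its logarithm, doubling a metric shifts the prediction by β × log(2) regardless of the starting value, which is why the sketch reports differences rather than only absolute values.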
interest in the software because another project may have a greater capacity to meet their evolving needs. Therefore, a smaller number of users, generating fewer reports, could lead to fewer bug fixes and new features, which in turn could lead to fewer users in the future.

When related to members, structural complexity presents a β of −0.033 with p = 0.001, indicating that developers avoid joining projects with high structural complexity. More complex source code is more difficult to understand and, consequently, to change. This may prevent new developers from joining the project. With fewer members, the community around a project is less active.

B. Hypothesis 2

The second hypothesis – Free Software projects with more lines of code have higher attractiveness – is also supported by our data. Lines of code has a positive influence on attractiveness.

For downloads, this metric has β = 0.856, with p < 0.001. In this context, lines of code can be an indication of the number of software features and the amount of work that has been put into the project so far. The more features available, the more users will become interested in the project. This may make the software more famous and more useful, attracting new members and users.

In addition, lines of code in relation to number of members indicated that developers are interested in larger projects. The β coefficient of this metric for members is 0.126 (p < 0.001). Therefore, for both downloads and members, lines of code is the metric with the highest influence because it is associated with software features and project size.

C. Hypothesis 3

The most interesting results were related to our third hypothesis – Free Software projects with a higher number of modules have higher attractiveness. For downloads, the data does not support the hypothesis: the high p-value (p = 0.852) does not allow us to claim that the number of modules has any influence on the number of downloads. For members, on the other hand, the hypothesis is confirmed: the number of modules influences the number of members with β = 0.087 and p < 0.001, which is statistically significant.

Both lines of code and number of modules are metrics that represent software size. The fact that both influence the number of members, but only lines of code influences the number of downloads, makes us wonder whether they represent different characteristics of software size. In this context, lines of code is probably related to the number of features in the project, which helps to attract both users and developers to the project.

However, when lines of code is kept constant, different values in the number of modules represent different ways of organizing these features over different modules. A higher number of modules thus indicates a higher modularity, which makes it easier for developers to work on the project and requires less coordination effort. For users, on the other hand, it probably does not matter whether the software is modular or not; they are only interested in the provided features.

Finally, collaborators in Free Software projects often start participating in the project as users, attracted by the software features. After that, those users who have the potential to become developers may begin to contribute code. While a high number of lines of code (and thus of features) is enough to attract users, project leaders should pay attention to source code quality. To turn users into developers, the project has to provide source code that is easy to understand and modify by keeping structural complexity as low as possible and modularity at a good level.

V. RELATED WORK

Large Free Software projects such as Debian GNU/Linux, GNOME, and KDE have invested in the creation of dedicated teams for quality assurance. These efforts involve everything from removing bugs and obsolete components to the definition of standards and strategies to prevent bugs and improve quality [5]. However, most projects do not have the resources to maintain a dedicated quality team.

Michlmayr et al. [5] performed a study on quality assurance problems in Free Software such as unsupported code, configuration management, security updates, users not knowing how to report bugs, the difficulty in attracting volunteers, lack of documentation, and problems with coordination and communication. None of these problems, however, is related to the quality of the source code per se.

Barkmann et al. [30] analyzed 146 Free Software projects written in Java, identifying the correlation between a set of object-oriented metrics and their theoretical ideal values. However, in their work the values of source code metrics were not associated with problems or attractiveness of Free Software projects.

Stamelos et al. [31] presented empirical results on the relationship between the size of application components and
the delivered quality measured as user satisfaction. Quality characteristics of 100 applications written for GNU/Linux were compared to industrial standards. The results indicated that the so-called structural quality (e.g., component size) of an application is related to user satisfaction.

Midha [32] analyzed 450 projects from SourceForge.net and verified that high values of McCabe's Cyclomatic Complexity and Halstead's Effort (complexity metrics) are positively correlated with the number of bugs and with the time needed to fix bugs. These metrics were also found to be negatively correlated with contributions from new developers, i.e., more complex code is less likely to attract new developers. However, Midha's study used complexity metrics measured at the subroutine level, while in our study we use complexity metrics at the module level.

Capra et al. [33] have shown that open governance is associated with higher software design quality in a study with 75 Free Software projects. They defined software design quality in terms of 5 object-oriented metrics, of which only CBO is used in our study. An open governance structure together with the lack of formal management and strict deadlines enables developers to enhance software design to have a high-quality product, since they do not suffer pressure to release the software [33]. Moreover, a better software design fosters a more open governance by allowing developers to work in independent modules without the need for explicit coordination activities. However, Capra's study has not addressed the issue of attractiveness.

Barbagallo et al. [34] analyzed 56 Free Software projects, studying the relationship between software design quality and project success. They defined success in terms of downloads, page views and development activity, and design quality in terms of the object-oriented metrics CBO, DIT, MIF, and NOC. They found that the most successful projects exhibited lower design quality. They argue that perhaps in successful projects the main developers tend to shift their attention to lateral activities, such as replying to users in forums, instead of focusing on enhancing the code quality. Our results seem to contradict theirs, but this is not the case. First, their conceptualization of success is different from our conceptualization of attractiveness. Moreover, we considered structural complexity in terms of the CBO and LCOM4 metrics together, while they used a different set of metrics to represent the notion of design quality, having only CBO in common with the present study. Therefore, a straightforward comparison between their study and ours is not possible.

VI. CONCLUSION

A systematic review of 63 empirical studies showed that there is little research addressing the characteristics or properties of Free Software projects, such as their quality, growth, and evolution [35]. Our study contributes with an unprecedented analysis of source code metrics from thousands of Free Software projects, causally linking software source code characteristics with attractiveness. In doing so, we expect to raise awareness on an important topic so far neglected. Free Software projects fail when they lack attractiveness. Therefore, understanding what influences attractiveness provides managerial knowledge to project leaders, pointing them in the right direction when prioritizing their resources.

Our results indicated that source code size and structural complexity explain a relevant percentage of the attractiveness of Free Software projects. Attractiveness is based on human perceptions and influenced by people's cognition, making it a complex issue, hard to understand and explain completely. Nevertheless, our study was able to explain 18% of the variance in the number of software users (downloads) and 12% of the variance in the number of project developers (members) through a set of four source code metrics. These statistical results are significant given the social context in which the adoption of, and volunteering for, Free Software projects take place.

In this paper, we showed that lines of code (LOC) has a significant effect on the number of project users. Our results also indicated that structural complexity (SC) has a negative influence on project attractiveness. Therefore, a project will face greater difficulties to grow if it does not observe source code attributes such as cohesion, coupling, and modularity, which favor developer contributions such as new features and bug fixes.

In other words, our analysis indicated that growth in software structural complexity may decrease the positive effects of newly added features on attractiveness. Ideally, a project should keep its complexity constant as new code is incorporated, because developers are interested in improving the software, and the users in the improvements. This demonstrates to project leaders (in communities, foundations, governments, and companies) the importance of monitoring metrics such as LCOM4 and CBO together with NM and LOC, thereby increasing their chances of forming a community of contributors around their software, further enhancing its quality. Thus, projects should grow while managing their complexity, preserving new members' willingness to contribute.

Our study differs from related work because we analyzed a large sample of Free Software projects. Table I shows how diverse our sample of 6,773 projects is. There are projects with thousands of modules (7,177 – Broadcom replacement firmware¹²), millions of lines of code (2,983,103 – Broadcom replacement firmware), large structural complexity (4,940 – pyCDK¹³), several hundred members (288 – TinyOS¹⁴), and hundreds of millions of downloads (941,498,760 – MinGW: Minimalist GNU for Windows¹⁵). This sample was based on well-defined criteria, and the number of projects involved provided us with statistical confidence in the results.

¹² sourceforge.net/projects/newbroadcom
¹³ sourceforge.net/projects/pycdk
¹⁴ sourceforge.net/projects/tinyos
¹⁵ sourceforge.net/projects/mingw

Nevertheless, this study has some limitations that motivate future work. Our sample is restricted to projects written in C available at SourceForge.net and our analysis to a limited set of metrics. In the future, we will include projects from other repositories, and extend this study to other source code metrics and programming languages such as C++ and Java.
Furthermore, widely known projects such as GNU/Linux and Firefox should be included in studies of this kind, for their metrics may represent values that could be seen as targets or references.

Finally, we acknowledge that source code metrics are not the only variables capable of influencing attractiveness. Our previous work has identified that factors such as the restrictiveness of the license, type of project, software life-cycle stage, and intended audience are all capable of influencing attractiveness [8]. At first sight, assuming that all these variables from our previous study are independent from the source code metrics we studied here, roughly 40% of attractiveness variance would then be explained. However, including variables in an equation in a statistically sound manner is not a trivial task. Accordingly, there is a need to further identify the interaction between that set of variables and the ones reported in this study.

ACKNOWLEDGMENTS

The authors of this paper are supported by CNPQ, FAPESP, and the Qualipso project. This research has been developed in the USP Free Software Competence Center and the authors would like to thank Claudia Melo, Lucianna Almeida, Joenio Costa, Beraldo Leal, and Nelson Lago for their contributions.

REFERENCES

[1] Y. Benkler, The Wealth of Networks: How Social Production Transforms Markets And Freedom. Yale University Press, 2006.
[2] A. Wasserman and E. Capra, “Evaluating Software Engineering Processes in Commercial and Community Open Source Projects,” in Workshop Emerging Trends in FLOSS Research and Development, 2007.
[3] D. Riehle, “The Economic Motivation of Open Source Software: Stakeholder Perspectives,” IEEE Computer, vol. 40, no. 4, pp. 25–32, 2007.
[4] Forrester-Consulting, “Open Source Paves the Way for the Next Generation of Enterprise IT,” Forrester Research, Tech. Rep., 2008.
[5] M. Michlmayr, F. Hunt, and D. Probert, “Quality Practices and Problems in Free Software Projects,” in First International Conference on Open Source Systems, M. Scotto and G. Succi, Eds., Genova, Italy, 2005, pp. 309–310.
[6] K. J. Stewart and S. Gosain, “The Impact of Ideology on Effectiveness in Open Source Software Development Teams,” MIS Quarterly, vol. 30, no. 2, pp. 291–314, June 2006.
[7] E. S. Raymond, The Cathedral & the Bazaar, T. O’Reilly, Ed. Sebastopol, CA, USA: O’Reilly & Associates, Inc., 1999.
[8] C. Santos Jr., J. Pearson, and F. Kon, “Attractiveness of Free and Open Source Software Projects,” in Proceedings of the 18th European Conference on Information Systems (ECIS), Pretoria, South Africa, 2010, (forthcoming).
[9] E. E. Mills, “Software Metrics,” Software Engineering Institute, SEI - Carnegie Mellon University, Tech. Rep., 1988.
[10] E. Tempero, “On Measuring Java Software,” in ACSC ’08: Proceedings of the Thirty-First Australasian Conference On Computer Science, vol. 74. Darlinghurst, Australia: Australian Computer Society, Inc., 2008, pp. 7–7.
[11] T. C. Jones, Applied Software Measurement: Assuring Productivity and Quality. New York: McGraw-Hill, 1991.
[12] M. Lorenz and J. Kidd, Object-Oriented Software Metrics. Prentice Hall, 1994.
[13] K. Beck, Smalltalk: best practice patterns. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1997.
[14] J. Bansiya and C. Davis, “Automated Metrics and Object-Oriented Development: Using QMOOD++ for Object-Oriented Metrics,” Dr. Dobb’s Journal, vol. 22, no. 12, pp. 42, 44–48, December 1997.
[15] S. R. Chidamber and C. F. Kemerer, “A Metrics Suite for Object-Oriented Design,” IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476–493, 1994.
[16] M. Hitz and B. Montazeri, “Measuring Coupling and Cohesion in Object-Oriented Systems,” in Proceedings of International Symposium on Applied Corporate Computing, 1995.
[17] D. P. Darcy, C. F. Kemerer, S. A. Slaughter, and J. E. Tomayko, “The Structural Complexity of Software: An Experimental Test,” Software Engineering, IEEE Transactions on, vol. 31, no. 11, pp. 982–995, Nov. 2005.
[18] C. Richter, Designing Flexible Object-Oriented Systems with UML. Thousand Oaks, CA, USA: New Riders Publishing, 1999.
[19] J. P. Johnson, “Open source software: Private provision of a public good,” Journal of Economics and Management Strategy, vol. 11, no. 4, pp. 637–662, 2002.
[20] K. Crowston and B. Scozzi, “Open Source Software Projects as Virtual Organizations: Competency Rallying for Software Development,” in IEE Proceedings Software, vol. 149, no. 1, 2002, pp. 3–17.
[21] U. Raja and M. J. Tretter, “Investigating open source project success: A data mining approach to model formulation, validation and testing,” Working Paper, Texas A&M University, College Station, Texas, Tech. Rep. Paper-071-31, 2006.
[22] M. Shaikh and T. Cornford, “Version management tools: Cvs to bk in the linux kernel,” Long Range Planning, vol. 34, pp. 699–725, 2003.
[23] A. Mockus, R. T. Fielding, and J. Herbsleb, “A case study of open source software development: the apache server,” in ICSE ’00: Proceedings of the 22nd international conference on Software engineering. New York, NY, USA: ACM, 2000, pp. 263–272.
[24] V. Balijepally, R. K. Mahapatra, S. P. Nerur, and K. Price, “Are two heads better than one for software development? the productivity paradox of pair programming,” MIS Quarterly, vol. 33, no. 1, pp. 91–118, 2009. [Online]. Available: http://aisel.aisnet.org/misq/vol33/iss1/7/
[25] K. Crowston and J. Howison, “Hierarchy and centralization in free and open source software team communications,” Knowledge Technology & Policy, vol. 18, pp. 65–85, 2006.
[26] A. E. Hassan, Z. M. Jiang, and R. C. Holt, “Source versus object code extraction for recovering software architecture,” Reverse Engineering, Working Conference on, vol. 0, pp. 67–76, 2005.
[27] G. Robles, J. M. Gonzalez-Barahona, M. Michlmayr, and J. J. Amor, “Mining Large Software Compilations over Time: Another Perspective of Software Evolution,” in Proceedings of the International Workshop on Mining Software Repositories (MSR 2006), Shanghai, China, 2006.
[28] A. Terceiro and C. Chavez, “Structural Complexity Evolution in Free Software Projects: A Case Study,” in QACOS-OSSPL 2009: Proceedings of the Joint Workshop on Quality and Architectural Concerns in Open Source Software (QACOS) and Open Source Software and Product Lines (OSSPL), M. Ali Babar, B. Lundell, and F. van der Linden, Eds., 2009.
[29] J. F. Hair, W. C. Black, B. J. Babin, R. E. Anderson, and R. L. Tatham, Multivariate data analysis, 6th ed. Upper Saddle River, NJ: Pearson Education In, 2006.
[30] H. Barkmann, R. Lincke, and W. Löwe, “Quantitative Evaluation of Software Quality Metrics in Open-Source Projects,” in AINA Workshops, 2009, pp. 1067–1072.
[31] I. Stamelos, L. Angelis, A. Oikonomou, and G. L. Bleris, “Code Quality Analysis in Open Source Software Development,” Information Systems Journal, vol. 12, pp. 43–60, 2002.
[32] V. Midha, “Does Complexity Matter? The Impact of Change in Structural Complexity On Software Maintenance and New Developers’ Contributions in Open Source Software,” in ICIS 2008 Proceedings, 2008.
[33] E. Capra, C. Francalanci, and F. Merlo, “An Empirical Study on the Relationship Between Software Design Quality, Development Effort and Governance in Open Source Projects,” IEEE Transactions on Software Engineering, vol. 34, no. 6, pp. 765–782, Nov.-Dec. 2008.
[34] D. Barbagallo, C. Francalanci, and F. Merlo, “The Impact of Social Networking on Software Design Quality and Development Effort in Open Source Projects,” in ICIS 2008 Proceedings, 2008. [Online]. Available: http://aisel.aisnet.org/icis2008/201
[35] K.-J. Stol, M. A. Babar, B. Russo, and B. Fitzgerald, “The Use of Empirical Methods in Open Source Software Research: Facts, Trends and Future Directions,” in FLOSS’09: Proceedings of the 2009 ICSE Workshop on Emerging Trends in Free/Libre/Open Source Software Research and Development. Washington, DC, USA: IEEE Computer Society, 2009, pp. 19–24.