Source Code Attractiveness PDF
Abstract: A significant number of Free Software projects have been widely used and are considered successful. However, an even larger number of them cannot overcome the initial steps towards building an active community of users and developers. In this study, we investigated whether there are relationships between source code metrics and attractiveness, i.e., the ability of a project to attract users and developers. To verify these relationships, we analyzed 6,773 Free Software projects from the SourceForge.net repository. The results indicated that attractiveness is indeed correlated with some source code metrics. This suggests that measurable attributes of the project source code somehow affect the decision to contribute to and adopt Free Software. The findings described in this paper show that it is relevant for project leaders to monitor source code quality, particularly a few objective metrics, since these can have a positive influence on a project’s chances of forming a community of contributors and users around its software, enabling further enhancement in quality.

I. INTRODUCTION

The adoption of Free and Open Source Software¹ has significantly increased in the last decades, to the point of becoming influential to the global economy [1]. Although Free Software emerged as a movement supported by volunteer developers, many large companies are now involved in it [2], [3]. According to a Forrester Consulting survey, which compared large companies in Europe and North America [4], the usage of Free Software is currently widespread in the back-end, middleware, office productivity tools, and business applications software categories. Moreover, this survey states that 92% of the senior business and IT executives say that Free Software products have met and, in some cases, exceeded their quality expectations.

¹ In this study, we consider the terms Free Software and Open Source Software (OSS) equivalent.

This satisfaction and quality are usually achieved thanks to the collaboration of a large user and developer community that reports failures, fixes bugs, and adds features. In fact, the Free Software development model is said to offer two main advantages: the potential for peer review and the possibility of attracting developers from different parts of the world [5]. Hence, an important issue for Free Software projects is to attract volunteers [6].

However, not all Free Software projects reach success and high quality [5]. The number of inactive projects is undoubtedly higher than the number of active projects. To illustrate this scenario, consider the data extracted in November 2009 from SourceForge.net, one of the most popular Free Software repositories. Out of its 201,494 projects, only 60,642 had more than one release, 40,228 had been downloaded more than once, and 23,754 had more than one member. Finally, only 12,141 projects matched all these criteria simultaneously. This may indicate that no more than 6% of the projects on SourceForge.net are able to have a healthy community of users and developers benefiting from a Bazaar style of development [7].

Santos Jr. et al. [8] defined a theoretical model for attractiveness as a crucial construct for Free Software projects, proposing its (i) typical origins (e.g., license type and intended audience); (ii) indicators (e.g., number of members and downloads); (iii) consequences (e.g., levels of activity and time for task completion). They suggested that the success of any project depends on its level of attractiveness to potential contributors and users. Based on this model, our study explored some of the factors that may enable projects to build a community by attracting users and developers. Specifically, our focus rests on objective factors: we investigated whether attractiveness can also be influenced by measurable source code attributes, namely structural complexity and size.

Although source code metrics have been proposed since the 1970s [9], their potential use as guidelines for software development has not been fully explored yet [10]. In particular, we have observed that many Free Software projects do not practice source code quality evaluation and have no tools available to do so. This lack of systematic code evaluation leaves a lot of room for improvement in Free Software projects’ processes and practices [5].

In general, despite the high importance of source code in the Free Software community, reflected in the “show me the code” culture, source code metrics are often not perceived as an indicator of quality. To address this apparent contradiction, we argue, theoretically, that source code metrics are related to project attractiveness and, thus, influence its success. To verify these ideas empirically, we analyzed 6,773 projects
written in the C language from SourceForge.net. We show that, considering such a sample, one structural complexity metric and two size metrics play an important role in explaining attractiveness, here represented by the number of downloads and the number of project members.

The remainder of this paper is organized as follows: Section II presents the theoretical foundations of source code metrics and attractiveness. Section III presents our hypotheses and shows the selection criteria and definition for the variables used in our study. Section IV evaluates the hypotheses and discusses their results. Section V reviews related work. Conclusions and future research directions are discussed in Section VI.

II. THEORETICAL BACKGROUND

In this section we present the definitions of the selected source code metrics (Section II-A) and the attractiveness concept, including its proxies (Section II-B). These were chosen to build a statistical model that represents the relationships proposed.

A. Source code metrics

For our subset of source code metrics, we used the “module” concept as a general term for the different types of structures used in software development. Therefore, it stands for classes, abstract data types, and source files. Specifically in this study, we consider a C source file as a “module”.

Similarly, we generalized the concept of “method” to “function”, denoting a portion of source code that performs a specific task.

The most commonly used metric to measure software size is Lines of Code (LOC), which indicates the number of non-blank and non-comment source code lines. Using the LOC metric as a basis for comparison between projects requires the projects to be written in the same programming language [11].

Number of Modules, on the other hand, is another useful size indicator, which is somewhat less influenced by programming languages and line-level coding styles. Thus, it can be used to compare projects written in different languages [10].

When considering characteristics such as maintainability, flexibility, comprehension effort, and source code quality in general, one has to take into account not only the size metrics described above but also structural metrics, such as the ones that follow.

Number of Functions (NOF) is used to measure module size in terms of the operations it implements. This metric is used to help identify the reuse potential of a module. In general, modules with a large number of functions are more difficult to reuse because they tend to be less cohesive [12]. Hence, a module should not have an excessive number of functions [13].

Number of Public Variables (NPV) and Number of Public Functions (NPF) are metrics related to module encapsulation. They measure the potential communication among modules [14]. Since good programming practices recommend that module variables be manipulated only via accessor functions [13], module variables should be private, which means that the optimal value for NPV is zero. The number of public functions in a module represents the size of the “module interface”. Functions are directly related to the operations provided by their module. High values for this metric indicate that a module has a lot of functions and, probably, many responsibilities, conflicting with good programming practices [13].

Cohesion is a measure of the diversity of “topics” that a module implements. High cohesion indicates that a module focuses on a single aspect of the system, while low cohesion indicates that it deals with several different aspects. In terms of understanding, modification, and maintainability, highly cohesive modules should be sought. A metric commonly used for cohesion is Lack of Cohesion of Methods (LCOM), originally proposed by Chidamber and Kemerer [15]. High LCOM values indicate low cohesion, while low LCOM values indicate high cohesion.

The first LCOM definition, called LCOM1, corresponds to the number of function pairs of a module that do not manipulate any module variable in common. Since it has received several criticisms and revision proposals, throughout this study we used the revised definition by Hitz and Montazeri, known as LCOM4 [16]. In order to calculate the LCOM4 of a module, it is necessary to build an undirected graph in which the nodes are its functions and variables. For every function, there should be an edge between it and each other function or variable it uses. The LCOM4 value is the number of connected components of this graph.

Coupling is a measure of how one module is connected to other modules in the project. High coupling indicates a greater difficulty in changing the modules of the system, since a change in one module may have an impact on all other modules that are coupled to it. In other words, if coupling is high, the software tends to be less flexible, more difficult to adapt, and more difficult to understand. A straightforward way to measure coupling is Coupling Between Objects (CBO), also proposed by Chidamber and Kemerer. CBO measures how many modules are used by the one being analyzed [15].

The more complex a piece of software, the more challenging it is to change and evolve it. Coupling and cohesion have been described and discussed in many works as essential indicators of structural complexity [17]. Moreover, it is widely known that to build high-quality and flexible software, it is advisable to seek low coupling and high cohesion [18].

In fact, Darcy et al. [17] showed that, individually, neither coupling nor cohesion is related to software maintenance effort. Both must be considered together. When combined, the product of coupling and cohesion as a metric is positively correlated with the maintenance effort.

In conclusion, seven source code metrics, selected according to criteria shown in Section III, were used in the statistical model for our study. In particular, we use the product of coupling (CBO) and cohesion (LCOM4) as our metric of structural complexity (SC) [17].
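As a rough illustration of the two structural metrics described in this section, the following Python sketch counts the connected components of a module's usage graph (LCOM4) and multiplies the result by a coupling value to obtain SC. The function names, the `uses` mapping, and the toy module are assumptions made for this example; they do not describe how the metrics tool used in the study is implemented.

```python
def lcom4(functions, variables, uses):
    """Count connected components of the undirected usage graph of one module.

    functions, variables: names of the module's functions and variables (nodes).
    uses: dict mapping a function to the functions/variables it uses (edges).
    """
    nodes = list(functions) + list(variables)
    parent = {node: node for node in nodes}  # union-find forest

    def find(node):
        # Follow parent links to the representative, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for function, targets in uses.items():
        for target in targets:
            # References to names outside the module are ignored (assumption).
            if function in parent and target in parent:
                parent[find(function)] = find(target)

    return len({find(node) for node in nodes})


def structural_complexity(cbo, lcom4_value):
    """SC as the product of coupling (CBO) and lack of cohesion (LCOM4)."""
    return cbo * lcom4_value


# Hypothetical C source file: two I/O functions share `handle`,
# while `format_date` only touches `locale`.
toy_lcom4 = lcom4(
    functions=["open_file", "read_line", "format_date"],
    variables=["handle", "locale"],
    uses={
        "open_file": ["handle"],
        "read_line": ["handle"],
        "format_date": ["locale"],
    },
)
toy_sc = structural_complexity(cbo=3, lcom4_value=toy_lcom4)  # CBO of 3 is made up
```

In this toy module the graph splits into two components, which hints that the file mixes two unrelated responsibilities; a fully cohesive module would yield a value of 1.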
Fig. 1. Attractiveness research model – adapted from Santos Jr. et al. [8].
languages (e.g., C++, Java), as well as trying to identify similarities and discrepancies among projects written in different languages.
• More than one download. Projects with no downloads are probably either non-development projects, or projects that have just started, or are other special cases.

This provided us with a list of 11,433 projects. After this preliminary sampling, the following steps were automatically executed by scripts developed by our group to perform data collection:
1) Download the code of all the projects. This resulted in the source code for 10,128 projects, since some of them had no available files (empty “files” section in the SourceForge.net project pages);
2) Run Analizo sequentially for all projects and store the computed metrics in a single database. The metrics were successfully computed for only 6,773 projects, because (i) some downloaded files did not contain source code (e.g., binary-only downloads), (ii) the source code was not written in C (the project was incorrectly classified as being written in C), or (iii) some files could not be processed by Analizo due to severe errors in the source code (e.g., syntax errors);
3) Cross-join the two datasets. Finally, the two datasets (the SourceForge.net data available from the University of Notre Dame and FLOSSMole on the one side, and the source code metrics calculated by Analizo on the other side) were cross-joined so that we could perform the needed statistical analysis.

Table I summarizes our sample; the complete data set used for this study is available on the Web¹¹. Section III-C discusses in detail how we selected the variables presented in Table I. This table shows the natural values of minimum, maximum, arithmetic mean, and standard deviation for each variable, indicating the characteristics of our sample.

¹¹ ccsl.ime.usp.br/mangue/data

We analyzed our selected variables in their natural form (Raw) to verify their distribution, which is presented in the first part of Table I. We observed that the skewness and kurtosis of the distributions showed high values, indicating non-normality [29]. Because of this non-normality, we transformed the variables to a logarithmic scale for linearization, which reduced the skewness and kurtosis values and made the variables suitable for multiple regression [20]. The arithmetic mean and standard deviation of the logarithm values can be seen in the second part of Table I.

C. Variables

Among the fourteen source code metrics that Analizo provided, we selected seven for our initial analysis: total LOC, total NM, total NOF, total NPV, total NPF, average LCOM4, and average CBO. To be able to apply to the procedural paradigm of the C language some metrics that are widely used in the literature, such as NOF, NPV, NPF, LCOM4, and CBO, we assumed a mapping of the object concepts of “class” and “method” to the C concepts of “source file” and “function”.

In this first study, we limited the scope of metrics used to be able to reach a simple yet comprehensive model to relate source code and attractiveness. Thus, the ACC, AMZ_Size, COF, DIT, NOC, NOV, and RFC metrics were left out of the scope of this study.

Nevertheless, in the first stage of our statistical analysis, the LOC, NOF, NPV, and NPF metrics showed a high correlation with one another, according to Pearson’s parametric correlations, as shown by the bold numbers in Table II. Highly correlated variables are representations of the same attribute, making it unnecessary to use more than one. Since all these metrics represent a similar concept, we selected one of them, LOC, for our statistical analysis to reduce multicollinearity.

We have also analyzed the Spearman and Kendall non-parametric correlations (see Table III and Table IV, respectively), given that some of our variables are not normally distributed. In our analysis, we observed that after transforming our variables to their logarithmic form, Pearson correlations performed just as well as the non-parametric indices. Thus, we chose the Pearson parametric correlation because it is the most commonly used form of correlation index, and it also provides the basis for the regression analysis we performed later. That way, we could maintain consistency, since multiple regression techniques are based on parametric indices [29].

According to the analysis described above, we ended up considering LOC and NM as size metrics. Theoretically, the more LOC, the more NM. However, it is possible to have
TABLE II
PARAMETRIC CORRELATIONS: PEARSON
Variable CBO LCOM4 SC NM LOC NPV NOF NPF Mbrs DLs
Coupling Between Objects - 0.141 0.723 0.380 0.608 0.423 0.434 0.492 0.113 0.129
Lack of Cohesion on Methods 0.141 - 0.786 0.019 0.472 0.102 0.311 0.361 0.080 0.107
Structural Complexity 0.723 0.786 - 0.254 0.666 0.338 0.493 0.564 0.127 0.156
Number of Modules 0.308 0.019 0.254 - 0.799 0.730 0.815 0.827 0.311 0.344
Lines of Code 0.608 0.472 0.666 0.799 - 0.872 0.923 0.927 0.328 0.410
Number of Public Variables 0.423 0.102 0.338 0.730 0.872 - 0.756 0.761 0.303 0.386
Number of Functions 0.434 0.311 0.493 0.815 0.923 0.756 - 0.886 0.320 0.380
Number of Public Functions 0.492 0.361 0.564 0.827 0.927 0.761 0.886 - 0.308 0.365
Number of Members 0.113 0.080 0.127 0.311 0.328 0.303 0.320 0.308 - 0.676
Number of Downloads 0.129 0.107 0.156 0.344 0.410 0.386 0.380 0.365 0.676 -
TABLE III
NON-PARAMETRIC CORRELATIONS: SPEARMAN
Variable CBO LCOM4 SC NM LOC NPV NOF NPF Mbrs DLs
Coupling Between Objects - 0.340 0.803 0.473 0.662 0.523 0.518 0.566 0.156 0.169
Lack of Cohesion on Methods 0.340 - 0.773 0.213 0.490 0.348 0.478 0.516 0.129 0.162
Structural Complexity 0.803 0.773 - 0.370 0.685 0.478 0.571 0.631 0.164 0.196
Number of Modules 0.473 0.213 0.370 - 0.793 0.718 0.818 0.828 0.284 0.320
Lines of Code 0.662 0.490 0.685 0.793 - 0.863 0.918 0.922 0.307 0.392
Number of Public Variables 0.523 0.348 0.478 0.718 0.863 - 0.758 0.765 0.280 0.363
Number of Functions 0.518 0.478 0.571 0.818 0.918 0.758 - 0.895 0.300 0.362
Number of Public Functions 0.566 0.516 0.631 0.828 0.922 0.765 0.895 - 0.288 0.347
Number of Members 0.156 0.129 0.164 0.284 0.307 0.280 0.300 0.288 - 0.598
Number of Downloads 0.169 0.162 0.196 0.320 0.392 0.363 0.362 0.347 0.598 -
TABLE IV
NON-PARAMETRIC CORRELATIONS: KENDALL
Variable CBO LCOM4 SC NM LOC NPV NOF NPF Mbrs DLs
Coupling Between Objects - 0.244 0.650 0.341 0.483 0.377 0.373 0.410 0.118 0.114
Lack of Cohesion on Methods 0.244 - 0.597 0.148 0.341 0.240 0.333 0.362 0.097 0.109
Structural Complexity 0.650 0.597 - 0.262 0.497 0.345 0.413 0.460 0.124 0.132
Number of Modules 0.341 0.148 0.262 - 0.605 0.546 0.641 0.648 0.217 0.219
Lines of Code 0.483 0.341 0.497 0.605 - 0.660 0.763 0.771 0.233 0.270
Number of Public Variables 0.377 0.240 0.345 0.546 0.660 - 0.588 0.596 0.213 0.249
Number of Functions 0.373 0.333 0.413 0.641 0.763 0.588 - 0.864 0.228 0.248
Number of Public Functions 0.410 0.362 0.460 0.648 0.771 0.596 0.864 - 0.219 0.237
Number of Members 0.118 0.097 0.124 0.217 0.233 0.213 0.228 0.219 - 0.471
Number of Downloads 0.114 0.109 0.132 0.219 0.270 0.249 0.248 0.237 0.471 -
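The correlation screening summarized in Tables II to IV can be sketched in a few lines of Python. The snippet computes a Pearson coefficient on log-transformed values and then flags the size-related metrics whose published Pearson correlation with LOC (Table II) reaches a cutoff. The toy input values, the base-10 logarithm, and the 0.85 cutoff (one possible reading of "approximately 0.9 or higher") are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical raw metric values for three projects (not taken from the study).
loc = [100, 1_000, 10_000]
nm = [1, 10, 100]

# Log-transform before correlating; the base (10) is an assumption.
r_toy = pearson([math.log10(v) for v in loc], [math.log10(v) for v in nm])

# Pearson correlations with LOC, copied from Table II.
corr_with_loc = {"NM": 0.799, "NPV": 0.872, "NOF": 0.923, "NPF": 0.927}

CUTOFF = 0.85  # assumed reading of "approximately 0.9 or higher"
redundant_with_loc = sorted(
    metric for metric, r in corr_with_loc.items() if r >= CUTOFF
)
```

Under this cutoff, NOF, NPF, and NPV are flagged as redundant with LOC while NM is not, which matches the selection of LOC and NM as the two size metrics.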
more lines of code without having more modules, by adding code to existing modules. Also, a piece of software could have more modules but keep the same number of lines of code when refactoring is applied. Furthermore, we found that number of modules (NM) did not correlate highly with the other metrics, since our criterion for a high correlation was a Pearson correlation value of approximately 0.9 or higher, as emphasized in Table II.

In conclusion, LOC and NM were collected since they measure different kinds of size, being more and less influenced by programming languages and coding styles, respectively. Finally, to obtain the value of our structural complexity metric (SC), explained in Section II, CBO and LCOM4 were multiplied. These three metrics did not show high correlations with other metrics. As expected, SC showed a positive correlation with both CBO and LCOM4. However, it was not as high as the other high correlations, because CBO and LCOM4 had a low correlation with each other. This means that SC, statistically, represents different attributes when compared to CBO and LCOM4, thus endorsing the theory that CBO and LCOM4 together offer different information.

In summary, our multiple regression model ended up with the following variables:
• Independent variables (source code metrics)
  – Structural Complexity (SC), the product of the CBO and LCOM4 metrics;
  – Lines of Code (LOC), the sum of lines of code in all modules of the project;
  – Number of Modules (NM), the total number of all modules of the project.
• Dependent variables (attractiveness)
  – Number of Downloads: a proxy for the number of users of the project;
  – Number of Members: a proxy for the number of developers in the project.

The model developed in this study revolves around attractiveness, aiming at the explanation of its causes. We defined a multiple regression model that has attractiveness as its dependent variable. It was measured through two indicators: number of downloads and number of members. Thus, we have two different regressions, one for each attractiveness indicator.
They are the variables explained by the source code attributes proposed in our hypotheses. Consequently, the SC, LOC, and NM metrics, which represent the source code attributes, are the independent variables, i.e., the influencers of attractiveness.

D. Research Hypotheses

In this first study about the relationships between source code metrics and attractiveness, we investigated whether two attributes, structural complexity and size, obtained via four source code metrics, might influence the attractiveness of Free Software projects. Thereby, we can later observe whether these attributes influence people’s perception of quality as a consequence of attractiveness. According to the metrics chosen to represent structural complexity and size, we formulated three hypotheses:

H1 – Free Software projects with higher structural complexity have lower attractiveness. The higher the software complexity, the more difficult it is to understand its source code for maintenance and evolution purposes. This leads to an increase in the maintenance effort, and makes it more difficult to attract new members and users to the project. Over time, with fewer members and users, the project may lose its ability to add new features and fix bugs and, consequently, its ability to evolve and meet the users’ changing requirements.

H2 – Free Software projects with more lines of code have higher attractiveness. To some extent, lines of code reflect the number of features of the project and the amount of work that has been put into it. Therefore, projects with more lines of code will usually attract more users (since they have more features) and developers (since they offer more opportunities for contribution).

H3 – Free Software projects with a higher number of modules have higher attractiveness. The number of modules may indicate the project size and the possibility of working in parallel in independent modules. More modules may indicate a concern with good design and better modularization, which facilitates contributions. This attracts more members, who can write more features and fix more bugs, which would then attract more users.

IV. HYPOTHESES TESTING

We specified a multiple regression model to explain the relationships between the selected source code metrics and attractiveness in Free Software projects. Before running this model, we analyzed and applied statistical techniques to the descriptive statistical values of our dataset presented in Table I, discussed in Section III-B. With the results in hand, we selected the variables of our regression model according to our scope definition and the analysis of the Pearson parametric correlations and the Spearman and Kendall non-parametric correlations, shown in Table II, Table III, and Table IV respectively, and presented in detail in Section III-C. Finally, with the linearized values of SC, NM, and LOC (independent variables) and number of downloads and number of members (dependent variables), we tested our hypotheses according to our statistical multiple regression model composed of these variables.

Table V summarizes the regression results based on the Pearson’s correlation values. These statistical results indicated a linear dependency between our source code metrics and each attractiveness variable. In this table, β is a coefficient that indicates the size of the influence of each metric on each attractiveness indicator.

As we can see in Table V, lines of code is more strongly correlated with downloads and members than structural complexity and number of modules, according to the standardized beta (Std. β). Standardized betas are calculated to allow comparisons between variables that are measured using different scales (e.g., lines of code and structural complexity). One cannot compare regular beta coefficients without first standardizing them.

Moreover, structural complexity has a negative correlation with attractiveness, as expected. Note that the T-test and P (probability) values indicate whether a source code metric is a statistically significant predictor or influencer of the attractiveness indicators. For downloads, the number of modules is not significant because its P-value is greater than 0.05. Finally, in the last line of Table V, R-squared values indicate the percentage of the variance in attractiveness (users and developers) that this set of source code metrics is capable of explaining. So, roughly speaking, an R-squared of 20 percent indicates that a set of predictors can explain 20 percent of the variance of a dependent variable. We obtained the following equations:

downloads = 1.551 − 0.286 × log(SC) + 0.856 × log(LOC) + 0.008 × log(NM)

members = −0.668 − 0.033 × log(SC) + 0.126 × log(LOC) + 0.087 × log(NM)

Each equation has one R-value. The coefficient (β) of each variable is the size of the influence that one of the source code metrics (the independent variable) has on attractiveness (the dependent variable). So, a one-unit change in a log-transformed independent variable is associated with a change of size β in the dependent variable, on average, when the other variables are held constant.

The R-squared value represents the share of the variance of the dependent variable that can be explained by the set of independent variables. In our analysis, it indicated that source code metrics explain 18% (R² = 0.180) of the variance in the number of downloads and 12% (R² = 0.121) of the variance in the number of members. These are meaningful values given the social context in which the adoption of, and volunteering for, Free Software projects take place.

A. Hypothesis 1

The data analysis supports our first hypothesis – Free Software projects with higher structural complexity have lower attractiveness. In fact, structural complexity has a negative influence on attractiveness. When related to downloads, it presents a β coefficient of −0.286 with p < 0.001. This means that structural complexity has a statistically significant impact on user interest.

In the Free Software context, structural complexity may indicate the difficulty of making improvements to the software, such as new features and bug fixes. So, most users may lose
TABLE V
EQUATIONS AND PEARSON CORRELATIONS
                                        Downloads                             Members
Metric                        β       Std. β  T-value  P-value     β       Std. β  T-value  P-value
(Constant)                    1.551   -        6.12    <0.001     -0.668   -       -8.47    <0.001
Structural Complexity (log)  -0.286  -0.150   -8.616   <0.001     -0.033  -0.058   -3.238    0.001
Lines of Code (log)           0.856   0.506   18.624   <0.001      0.126   0.249    8.846   <0.001
Number of Modules (log)       0.008   0.004    0.186    0.852      0.087   0.148    6.625   <0.001
R                             0.425                                0.348
R²                            0.180                                0.121
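To make the regression results in Table V more concrete, the sketch below evaluates the two fitted equations for a given project and checks that the reported R² values agree with the squares of the reported R values. The coefficients are copied from Table V; the `predict` helper, the base-10 logarithm, and the example inputs are assumptions, since the text does not state the logarithm base, and the returned value is on whatever linearized scale the dependent variable was modeled.

```python
import math

# Unstandardized coefficients (β) from Table V.
COEFFS = {
    "downloads": {"const": 1.551, "SC": -0.286, "LOC": 0.856, "NM": 0.008},
    "members": {"const": -0.668, "SC": -0.033, "LOC": 0.126, "NM": 0.087},
}


def predict(indicator, sc, loc, nm, log=math.log10):
    """Evaluate one regression equation; `log` defaults to base 10 (assumption)."""
    c = COEFFS[indicator]
    return c["const"] + c["SC"] * log(sc) + c["LOC"] * log(loc) + c["NM"] * log(nm)


# Effect of a tenfold increase in LOC with SC and NM held constant
# (hypothetical inputs chosen so that every other log term is zero).
baseline = predict("downloads", sc=1, loc=1, nm=1)
tenfold_loc = predict("downloads", sc=1, loc=10, nm=1)
loc_effect = tenfold_loc - baseline

# R values from Table V; squaring them should reproduce the reported R².
R = {"downloads": 0.425, "members": 0.348}
r_squared = {name: round(r * r, 2) for name, r in R.items()}
```

With the base-10 assumption, a tenfold increase in LOC shifts the downloads prediction by the LOC coefficient (0.856), and the squared R values round to 0.18 and 0.12, in line with the 18% and 12% of explained variance reported in the text.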
interest in the software because another project may have a greater capacity to meet their evolving needs. Therefore, a smaller number of users, generating fewer reports, could lead to fewer bug fixes and new features, which in turn could lead to fewer users in the future.

When related to members, structural complexity presents a β of −0.033 with p = 0.001, indicating that developers avoid joining projects with high structural complexity. More complex source code is more difficult to understand and, consequently, to change. This may prevent new developers from joining the project. With fewer members, the community around a project is less active.

B. Hypothesis 2

The second hypothesis – Free Software projects with more lines of code have higher attractiveness – is also supported by our data. Lines of code has a positive influence on attractiveness.

For downloads, this metric has β = 0.856, with p < 0.001. In this context, lines of code can be an indication of the number of software features and the amount of work that has been put into the project so far. The more features available, the more users will become interested in the project. This may make the software better known and more useful, attracting new members and users.

In addition, lines of code in relation to number of members indicated that developers are interested in larger projects. The β coefficient of this metric for members is 0.126 (p < 0.001). Therefore, for both downloads and members, lines of code is the metric with the highest influence because it is associated with software features and project size.

C. Hypothesis 3

The most interesting results were related to our third hypothesis – Free Software projects with a higher number of modules have higher attractiveness. For downloads, the data does not support the hypothesis: the high p-value (p = 0.852) does not allow us to claim that the number of modules has any influence on the number of downloads. For members, on the other hand, the hypothesis is confirmed: the number of modules influences the number of members with β = 0.087 and p < 0.001, which is statistically significant.

Both lines of code and number of modules are metrics that represent software size. The fact that both influence the number of members, but only lines of code influences the number of downloads, makes us wonder whether they represent different characteristics of software size. In this context, lines of code is probably related to the number of features in the project, which helps to attract both users and developers to the project.

However, when lines of code is kept constant, different values of the number of modules represent different ways of organizing these features over different modules. A higher number of modules thus indicates higher modularity, which makes it easier for developers to work on the project and requires less coordination effort. For users, on the other hand, it probably does not matter whether the software is modular or not; they are only interested in the provided features.

Finally, collaborators in Free Software projects often start participating in the project as users, attracted by the software features. After that, those users who have the potential to become developers may begin to contribute code. While a high number of lines of code (and thus of features) is enough to attract users, project leaders should pay attention to source code quality. To turn users into developers, the project has to provide source code that is easy to understand and modify, by keeping structural complexity as low as possible and modularity at a good level.

V. RELATED WORK

Large Free Software projects such as Debian GNU/Linux, GNOME, and KDE have invested in the creation of dedicated teams for quality assurance. These efforts involve everything from removing bugs and obsolete components to the definition of standards and strategies to prevent bugs and improve quality [5]. However, most projects do not have the resources to have a dedicated quality team.

Michlmayr et al. [5] performed a study on quality assurance problems in Free Software such as unsupported code, configuration management, security updates, users not knowing how to report bugs, the difficulty in attracting volunteers, lack of documentation, and problems with coordination and communication. None of these problems, however, is related to the quality of the source code per se.

Barkmann et al. [30] analyzed 146 Free Software projects written in Java, identifying the correlation between a set of object-oriented metrics and their theoretical ideal values. However, in their work the values of source code metrics were not associated with problems or attractiveness of Free Software projects.

Stamelos et al. [31] presented empirical results on the relationship between the size of application components and
the delivered quality measured as user satisfaction. Quality characteristics of 100 applications written for GNU/Linux were compared to industrial standards. The results indicated that the so-called structural quality (e.g., component size) of an application is related to user satisfaction.

Midha [32] analyzed 450 projects from SourceForge.net and verified that high values of McCabe’s Cyclomatic Complexity and Halstead’s Effort (complexity metrics) are positively correlated with the number of bugs and with the time needed to fix bugs. These metrics were also found to be negatively correlated with contributions from new developers, i.e., more complex code is less likely to attract new developers. However, Midha’s study used complexity metrics measured at the subroutine level, while in our study we use complexity metrics at the module level.

Capra et al. [33] have shown that open governance is associated with higher software design quality in a study with 75 Free Software projects. They defined software design quality in terms of 5 object-oriented metrics, of which only CBO is used in our study. An open governance structure together with the lack of formal management and strict deadlines enables developers to enhance software design to have a high-quality product, since they do not suffer pressure to release

Software projects fail when they lack attractiveness. Therefore, understanding what influences attractiveness provides managerial knowledge to project leaders, pointing them in the right direction when prioritizing their resources.

Our results indicated that source code size and structural complexity explain a relevant percentage of the attractiveness of Free Software projects. Attractiveness is based on human perceptions and influenced by people’s cognition, making it a complex issue, hard to understand and explain completely. Nevertheless, our study was able to explain 18% of the variance in the number of software users and 12% of the variance in the number of project developers through a set of four source code metrics. These statistical results are meaningful given the social context in which the adoption of, and volunteering for, Free Software projects take place.

In this paper, we showed that lines of code (LOC) has a significant effect on the number of project users. Our results also indicated that structural complexity (SC) has a negative influence on project attractiveness. Therefore, a project will face greater difficulties in growing if it does not observe source code attributes such as cohesion, coupling, and modularity, which favor developer contributions such as new features and bug fixes.

In other words, our analysis indicated that software struc-
the software [33]. Moreover, a better software design fosters a tural complexity growth may decrease the positive effects of
more open governance by allowing developers to work in in- new added features on attractiveness. Ideally, a project should
dependent modules without the need for explicit coordination keep its complexity constant as new code is incorporated,
activities. However, Capra’s study has not addressed the issue because developers are interested in improving the software,
of attractiveness. and the users in the improvements. This demonstrates to
Bargallo et al. [34] analyzed 56 Free Software projects, project leaders (in communities, foundations, governments,
studying the relationship between software design quality and and companies) the importance to monitor metrics such as
project success. They defined success in terms of downloads, LCOM4 and CBO together with NM and LOC, thereby in-
page views and development activity, and design quality in creasing their chances of forming a community of contributors
terms of the object-oriented metrics CBO, DIT, MIF, and around their software, further enhancing its quality. Thus,
NOC. They found that the most successful projects exhibited projects should grow managing their complexity, keeping the
lower design quality. They argue that perhaps in successful new members willingness to contribute.
projects the main developers tend to shift their attention to Our study differs from related work because we analyzed
lateral activities, such as replying to users in forums, instead a large sample of Free Software projects. Table I shows how
of focusing on enhancing the code quality. Our results seemed diverse our sample of 6,773 projects is. There are projects
to contradict theirs, but this is not the case. First, their concep- with thousands of modules (7,177 – Broadcom replacement
tualization of success is different from our conceptualization of firmware12 ), millions of lines of code (2,983,103 – Broadcom
attractiveness. Moreover, we considered structural complexity replacement firmware), large structural complexity (4,940 –
in terms of CBO and LCOM4 metrics together, while they pyCDK13 ), several hundred members (288 – TinyOS14 ), and
used a different set of metrics to represent the notion of design hundreds of millions of downloads (941,498,760 – MinGW:
quality, having only CBO in common with the present study. Minimalist GNU for Windows15 ). This sample was based
Therefore, a straightforward comparison between their study on well-defined criteria and the number of projects involved
and ours is not so simple. provided us with statistical confidence in the results.
Nevertheless, this study has some limitations that motivate
VI. C ONCLUSION future work. Our sample is restricted to projects written in
A systematic review of 63 empirical studies showed that C available at SourceForge.net and our analysis to a limited
there is little research addressing the characteristics or prop- set of metrics. In the future, we will include projects from
erties of Free Software projects, such as their quality, growth, other repositories, and extend this study to other source code
and evolution [35]. Our study contributes with an unprece- metrics and programming languages such as C++ and Java.
dented analysis of source code metrics from thousands of 12 sourceforge.net/projects/newbroadcom
Free Software projects, causally linking software source code 13 sourceforge.net/projects/pycdk
characteristics with attractiveness. In doing so, we expect to 14 sourceforge.net/projects/tinyos