Calculating Geometric Means
The equation can also be rearranged to calculate a financial rate of return when you know the starting value, the end value, and the time period. Use this form when the average rate of return (or population growth rate) is needed:

Average rate multiplier = (end value / start value)^(1/years)
Note: If you subtract 1 from the result of the equation above, you get the compound interest rate. To use this equation with years = 5, take the "fifth root," which is the same as raising to the power of 1/5 (or 0.2).

Problem submitted by a student: "A recent article suggested that if you earn $25,000 a year today and the inflation rate continues at 3 percent per year, you'll need to make $33,598 in 10 years to have the same buying power. ... Confirm that this statement is accurate by finding the geometric mean rate of increase."

Solution using a formula in Excel: =POWER(33598/25000, 0.1) = 1.03
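The same check can be done outside a spreadsheet. The following is a minimal Python sketch of the calculation described above; the function name `average_growth_multiplier` is an illustrative choice and does not come from the original article.

```python
def average_growth_multiplier(start_value, end_value, periods):
    """Return the geometric mean growth multiplier per period.

    This is the n-th root of (end / start), where n is the number of periods.
    """
    return (end_value / start_value) ** (1.0 / periods)


# Student problem: $25,000 today versus $33,598 in 10 years.
multiplier = average_growth_multiplier(25000, 33598, 10)

# Subtracting 1 from the multiplier gives the compound rate per year.
rate = multiplier - 1

print(f"multiplier = {multiplier:.4f}, annual rate = {rate:.2%}")
```

The multiplier comes out very close to 1.03, which supports the 3 percent figure quoted in the problem.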
Consider this example. Suppose you wanted to calculate the geometric mean of the numbers 2 and 32. This simple example can be done in your head. First, take the product: 2 times 32 is 64. Because there are only two numbers, the n-th root is the square root, and the square root of 64 is 8. Therefore the geometric mean of 2 and 32 is 8.

Now, let's solve the problem using logs. In this case, we will convert to base-2 logs so that we can solve the problem in our head (in fact, any base could be used). Converting our numbers, we have:

2 = 2^1
32 = 2^5
2^1 x 2^5 = 2^6 (= 64)
the square root of 2^6 is 2^3 (= 8)

Of course, the shortcut is to take the average of the two exponents (1 and 5), which is 3, and 2^3 is 8.

Problem: Can you calculate the geometric mean of these 5 numbers in your head? 2^3, 2^5, 2^8, 2^3, 2^1 (These values of course equal 8, 32, 256, 8, and 2.) (Hint: The 5 exponents add up to 20.) The answer is given under "Answer" near the end of this page.

From the discussion above, you can see that the Geometric Mean can be calculated by either of two procedures on a calculator, depending upon which functions are available. Computer-based spreadsheet programs like Excel have built-in geometric mean functions, and in general you should use these (see below) to save time if a computer with the appropriate software is available.

Calculation Procedure 1: Multiply all of the data points, and take the n-th root of this product.

Example: Suppose you have this beach monitoring data from different dates (data are Enterococci bacteria per 100 milliliters of sample):

6 ent./100 ml
50 ent./100 ml
9 ent./100 ml
1200 ent./100 ml

Geometric Mean = 4th root of (6)(50)(9)(1200) = 4th root of 3,240,000
Geometric Mean = 42.4 ent./100 ml

On a good scientific calculator, you would multiply the numbers together, press equals, then the root key, then the number 4 to get the fourth root (or use the exponent key with 0.25 for the last step).
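Procedure 1 translates directly into code. This is a minimal Python sketch using the beach monitoring values from the example; the function name `geometric_mean_by_root` is illustrative, and `math.prod` assumes Python 3.8 or later.

```python
import math


def geometric_mean_by_root(values):
    """Procedure 1: multiply all values, then take the n-th root of the product."""
    product = math.prod(values)
    return product ** (1.0 / len(values))


# Enterococci per 100 ml, from the beach monitoring example.
enterococci = [6, 50, 9, 1200]

gm = geometric_mean_by_root(enterococci)
print(f"Geometric mean = {gm:.1f} ent./100 ml")
```

Raising the product to the power 1/4 is the same operation as entering 0.25 with the exponent key on a calculator.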
Calculation Procedure 2: Take the average of the logs, then convert to a base 10 number

Of course, many calculators do not have a root key that allows the calculation of any root, so you must use the logarithm function, which is more widely available on calculators. To use this calculation procedure, you must have a calculator that gives logarithms (log or ln) and antilogarithms (10^x or e^x, sometimes labeled exp).

The first step in calculating the Geometric Mean using this method is to determine the logarithm of each data point using your calculator. Next, add all of the data point logarithms together and divide this sum by the number of data points (n). In other words, take the average of the logs. Finally, convert this log average back to a base 10 number using the antilogarithm function key on the calculator.

Example (using previous data):

log 6 = 0.77815
log 50 = 1.69897
log 9 = 0.95424
log 1200 = 3.07918
Sum = 6.51054

The logarithm of the Geometric Mean is 6.51054/4 = 1.62764 (the average of the logs). From your calculator, determine the number whose logarithm is 1.62764 (use the antilogarithm key), and you will find that the Geometric Mean = 42.4 ent./100 ml.

This process works whether you use natural logs ("ln" key) or base 10 logs ("log" key). That is, on your calculator you could do ln(x1), ln(x2), etc., then use the 'e^x' key on the average of the logs, or you could do log(x1), log(x2), etc., then use the '10^x' key on the average of the logs (key names may vary among calculators).

Incidentally, for this example data set, the arithmetic mean (average) of the four data points is:

Arithmetic Mean = (6 + 50 + 9 + 1200)/4 = 1265/4
Arithmetic Mean = 316.3 colonies/100 ml
For positive data, the geometric mean is always less than the arithmetic mean (except of course when all the data points have an identical value).

On most scientific calculators, your key sequence to calculate the geometric mean would be:

1. Enter a data point.
2. Press either the Log or ln function key.
3. Record the result or store it in memory, and repeat for each remaining data point.
4. Calculate the mean (average) of these log values.
5. Calculate the antilog of this mean ('10^x' key if you used the 'Log' key, 'e^x' key if you used the 'ln' key).
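The average-of-logs procedure can be sketched in Python as follows. The function name `geometric_mean_by_logs` and its `log`/`antilog` parameters are illustrative assumptions, chosen only to show that base 10 and natural logs give the same answer.

```python
import math


def geometric_mean_by_logs(values, log=math.log10, antilog=lambda y: 10 ** y):
    """Procedure 2: average the logs, then take the antilog of that average.

    `log` and `antilog` must be a matching pair (log10 with 10**y, or ln with exp).
    """
    mean_log = sum(log(v) for v in values) / len(values)
    return antilog(mean_log)


data = [6, 50, 9, 1200]  # ent./100 ml, same data as the worked example

gm_base10 = geometric_mean_by_logs(data)                                    # 'log' and '10^x' keys
gm_natural = geometric_mean_by_logs(data, log=math.log, antilog=math.exp)  # 'ln' and 'e^x' keys
arithmetic = sum(data) / len(data)

print(f"base 10 logs:    {gm_base10:.1f}")
print(f"natural logs:    {gm_natural:.1f}")
print(f"arithmetic mean: {arithmetic:.2f}")
```

Python 3.8 and later also provide `statistics.geometric_mean` in the standard library, which may be preferable to a hand-written version in practice.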
Excel #Num! overflow error

In Excel and Quattro, the geometric mean function may return an error if you apply it to a very long list of numbers. This occurs because of a numeric overflow (the product of the numbers is too large for the software to represent, given the way the function is written). If this occurs, you can use an "array formula." An array formula is one that repeats the same calculation over an array (list) of numbers. This "average of the logs" formula will work fine in such situations:

{=EXP(AVERAGE(LN(A1:A200)))}

Do not type the curly brackets. Enter the formula "=EXP(A....", then create the array formula by pressing Control+Shift+Enter simultaneously on your keyboard while your cursor is inside the formula cell. Change A1 and A200 to the actual locations of the first and last values of the data set.
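The same overflow can be reproduced in any environment that uses double-precision floating point. The sketch below is a Python illustration, not a description of how Excel works internally; the data set of 200 values of 1000 is a made-up example chosen so that the product exceeds the largest representable number.

```python
import math

# Hypothetical data: 200 values of 1000. Their product is 10**600,
# far beyond the largest double-precision value (about 1.8e308).
values = [1000.0] * 200

# Multiply-then-root: the running product overflows to infinity.
product = 1.0
for v in values:
    product *= v
naive = product ** (1.0 / len(values))

# Average-of-logs: every intermediate number stays small.
stable = math.exp(sum(math.log(v) for v in values) / len(values))

print("multiply-then-root:", naive)
print("average of logs:   ", stable)
```

The first result is infinite, while the log-based result recovers the expected value of 1000 to within rounding error.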
The calculation of the Geometric Mean may appear impossible if one or more of the data points is zero (0). In these cases, however, the convention is to substitute a value of '1', one half the limit of detection, or some other value for each zero or "less than" value, so that the information contained in these data is not lost. For example, the US Food and Drug Administration in its shellfish sanitation program regulations requires the substitution of a value that is one significant digit less than the detection limit [i.e. "less than 2" becomes "1.9"]. Because of how the geometric mean is calculated, the precise substitution value generally does not appreciably affect the result of the calculation, and substitution ensures that all the data remain usable.

Here is an example with a non-detect (assuming the detection limit was 2 bacteria per 100 milliliters):

1100
0 ("less than 2")
30
13000

Geometric Mean = 4th root of 1100 x 1 x 30 x 13000 = 4th root of 429,000,000
Geometric Mean = 143.9

Incidentally, substituting 1.9 for the "less than" value results in a geometric mean of 169.0, which a t-test (alpha = 0.05) finds nearly statistically different from the result obtained with the substituted value of 1.0. See additional comments in the bacteria data section below.
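The effect of the substitution value is easy to explore in code. In this minimal Python sketch, a non-detect is marked with `None`; that marker and the function name `geometric_mean_with_substitution` are assumptions made for illustration only.

```python
import math


def geometric_mean_with_substitution(values, substitute):
    """Geometric mean where each non-detect (marked as None) is replaced
    by `substitute` before the average of the logs is taken."""
    filled = [substitute if v is None else v for v in values]
    mean_log = sum(math.log(v) for v in filled) / len(filled)
    return math.exp(mean_log)


# Example data: the second sample was reported as "less than 2".
samples = [1100, None, 30, 13000]

gm_sub_1 = geometric_mean_with_substitution(samples, 1.0)    # substitute '1'
gm_sub_19 = geometric_mean_with_substitution(samples, 1.9)   # FDA-style "1.9"

print(f"substituting 1.0: {gm_sub_1:.1f}")
print(f"substituting 1.9: {gm_sub_19:.1f}")
```

Running the function with several candidate substitution values is a quick way to see how sensitive a particular data set is to the choice.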
Debate on the use of substitutions for below-reporting-limit and other censored data
Many statisticians have criticized common procedures for substituting values for non-detects or below-reporting-limit data. Alternatives, such as "delta lognormal models," have also received criticism and even legal challenges when applied to regulatory discharge permits. These problems and alternative analysis strategies are presented in Helsel (1990, 2005) and EPA (2002). These references also contain useful citations to other publications.
deviations of the log-transformed data. However, a special problem is created when reporting standard deviations of log data. That is because plus or minus (+/-) a log constant creates unequal error bars when converting back to base 10 (see note below on plotting geometric means).

To overcome other log transformation problems, values less than detection limits should be replaced with a non-zero value to avoid log-of-zero errors. As noted above, certain regulatory programs, such as the US FDA's shellfish sanitation program, require the substitution of a number one significant digit less than the detection limit [i.e. "less than 2" becomes "1.9"]. Other agencies have required models to predict the variance of these below-reporting-limit data.

Another special problem with bacteria testing is that bacterial plates can be inundated with bacteria, so that colony forming units are expressed as exceeding a certain number. These "greater than" values are similarly converted for geometric mean calculations (FDA requires conversion to the next significant digit, so ">1200" becomes "1300"). Regulatory programs like these also have water quality standards that incorporate median values and 90th percentile values because of concerns about possible non-normal distributions of even the log-transformed data.

The calculated means and variances of log-transformed data can be plugged into a t-test to evaluate whether there is a statistical difference between two stations. To answer the question of whether there is a statistical difference among three or more stations, use an ANOVA test. When analyzing log-transformed data, you may be surprised to find that two sites with remarkably different arithmetic means may not be statistically different from one another. The substitution values for non-detects can sometimes affect the outcomes of statistical tests, especially in cases where a large percentage of the data are non-detect or zero.
Helsel (1990, 2005) describes a variety of tests and approaches that are more robust and valid in evaluating this type of data.
For example, to calculate the geometric mean of the values +12%, -8%, and +2%, instead calculate the geometric mean of their decimal multiplier equivalents of 1.12, 0.92, and 1.02, to compute a geometric mean of 1.0167. Subtracting 1 from this value gives the geometric mean of +1.67% as a net rate of population growth (or financial return).

Incidentally, if you do not have a negative percent value in a data set, you should still convert the percent values to the decimal equivalent multiplier. It is important to recognize that when dealing with percents, the geometric mean of percent values does not equal the geometric mean of the decimal multiplier equivalents. For example:

Geometric mean of [12%, 4%, 2%] does not equal the geometric mean of [1.12, 1.04, 1.02].
4.6% does not equal 5.9%.
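The conversion to decimal multipliers can be sketched in Python as follows. The function names are illustrative, and the second function exists only to reproduce the mismatch described above, not as a recommended calculation.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values via the average of the logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def mean_rate_from_percents(percent_changes):
    """Convert percent changes to multipliers, take the geometric mean,
    then convert the result back to a percent rate."""
    multipliers = [1 + p / 100 for p in percent_changes]
    return (geometric_mean(multipliers) - 1) * 100


# +12%, -8%, +2% become 1.12, 0.92, 1.02.
net_rate = mean_rate_from_percents([12, -8, 2])

# With all-positive percents, the two approaches still disagree.
via_multipliers = mean_rate_from_percents([12, 4, 2])
via_raw_percents = geometric_mean([12, 4, 2])

print(f"net rate for +12%, -8%, +2%: {net_rate:.2f}%")
print(f"via multipliers:  {via_multipliers:.1f}%")
print(f"via raw percents: {via_raw_percents:.1f}%")
```

Working with multipliers also avoids the problem that a negative percent has no real logarithm.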
Calculating Geometric Means with Both Large Negative and Positive Numbers Combined
I have received a number of queries, particularly from those analyzing gene block microarray data sets, about how to calculate geometric means on data sets that include both very large negative and very large positive numbers. The analysis of data from gene blocks to evaluate similarity is a complex topic, and the statistics of this field are evolving, so you should perform an internet search to find the latest thinking on this topic. However, in principle, comparing data sets consisting of very large negative and positive numbers together is an easy matter: all that is required is to temporarily suspend the negative signs of the data.

Consider, for example, two sample data sets as follows:

A = {-5, -3, -2, 3} and B = {-1, 0, 2, 4}

The mean of data set A is -1.75, and the mean of data set B is +1.25. A simple Student's t-test (assuming alpha = 0.05 and equal sample variances) would suggest these samples are not statistically different from one another. This approach would be no different than if you were to calculate the geometric mean in these two data sets:

A' = {-100000, -1000, -100, 1000} and B' = {-10, 1, 100, 10000}

If you were to take off the negative signs, take the log, then add the negative sign back on, you could then compare the means of the A' and B' data sets. In fact, you might have noticed that data sets A and B are really the log (base 10) transformed data sets A' and B'. You might therefore conclude that A' and B' are not statistically different samples using the same t-test. Of course, as with any statistical analysis, you have to make sure you have not violated the assumptions of the statistical test (in this case, you must assume that the log-transformed data are normally distributed and that the sample variances are equal).
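The sign-suspending transformation and the t-test comparison can be sketched with the Python standard library. The function names below are illustrative. The critical value 2.447 (two-tailed, alpha = 0.05, 6 degrees of freedom) is taken from a standard t table and is hard-coded here as an assumption rather than computed.

```python
import math
from statistics import mean, variance


def signed_log10(x):
    """Take log10 of the magnitude, then put the original sign back on.
    Zero has no logarithm, so it is rejected rather than guessed at."""
    if x == 0:
        raise ValueError("log of zero is undefined; substitute a value first")
    return math.copysign(math.log10(abs(x)), x)


def pooled_t_statistic(a, b):
    """Student's t statistic for two samples, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


a_raw = [-100000, -1000, -100, 1000]   # data set A'
b_raw = [-10, 1, 100, 10000]           # data set B'

a_log = [signed_log10(x) for x in a_raw]   # recovers data set A
b_log = [signed_log10(x) for x in b_raw]   # recovers data set B

t = pooled_t_statistic(a_log, b_log)
T_CRITICAL = 2.447  # two-tailed, alpha = 0.05, df = 6 (from a t table)

print(f"mean A = {mean(a_log):.2f}, mean B = {mean(b_log):.2f}, t = {t:.3f}")
print("statistically different" if abs(t) > T_CRITICAL else "not statistically different")
```

For real analyses, a statistics package that reports the p-value directly would normally be used instead of a table lookup.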
Using Method 1, you would take the 17th root of the product 15 x 15 x 15 x 25 x 25 x 25 x 25 x 25 x 25 x 25 x 25 x 25 x 35 x 35 x 35 x 35 x 35, which is also equal to 25.221.
In a spreadsheet, you would type this formula: =((15^3)*(25^9)*(35^5))^(1/17)

As you might imagine, if you have large mid-point values or large frequencies, your calculator or spreadsheet program may not be able to compute the formula because the intermediate numbers are impossibly large, and the result would be an error. To calculate the geometric mean in these cases, you must use Method 2. You might also consider the spreadsheet "array formula" method in the "Excel #Num! overflow error" callout box above. If your grouped data include large negative numbers, you have no choice but to think of a clever transformation to make the values positive and use Method 2.

For Method 2, as shown in the table above, you would calculate the weighted mean of the natural logarithms of the mid-point values, which in this case is 3.228. When this value is converted back from natural logs, the geometric mean is 25.221.

Interestingly, this problem is quite similar to one faced by the Buzzards Bay NEP in evaluating the extent of oiling from an oil spill. In that case, the data consisted of an average width of oil and the length of beach affected. For example, 1500 ft of beach may have had a band of oil between 0 and 5 ft wide, 10,000 ft may have been documented to have a band of oil between 5 and 10 ft wide, etc. The length of beach oiled became the frequency for the interval. Whether the geometric mean is an appropriate metric for evaluating this type of data, or any other data set, always needs to be carefully considered.
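Method 2 for grouped data amounts to a frequency-weighted mean of logs. This is a minimal Python sketch; the function name `grouped_geometric_mean` is illustrative, and the mid-points and frequencies are the ones used in the spreadsheet formula above.

```python
import math


def grouped_geometric_mean(midpoints, frequencies):
    """Geometric mean of grouped data: frequency-weighted mean of the
    natural logs of the mid-points, converted back with exp()."""
    total = sum(frequencies)
    weighted_log_sum = sum(f * math.log(m) for m, f in zip(midpoints, frequencies))
    mean_log = weighted_log_sum / total
    return mean_log, math.exp(mean_log)


midpoints = [15, 25, 35]
frequencies = [3, 9, 5]   # 17 observations in total

mean_log, gm = grouped_geometric_mean(midpoints, frequencies)
print(f"weighted mean of logs = {mean_log:.3f}")
print(f"geometric mean        = {gm:.3f}")
```

Because only sums of logs are accumulated, this form stays well within floating-point range even for very large frequencies.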
Working Backwards
The following problem was posed by a student: If Geomean(8, a) = 12, what is a?

The question can most easily be rephrased using the nth root definition of geometric mean. That is:

square root of (8 x a) = 12

Solve by first squaring both sides:

(8 x a) = 144
a = 144/8 = 18

Using logs, the mathematical solution is as follows. First express the problem as the mean of logs:

(ln(8) + ln(a))/2 = ln(12)

Solving:

ln(8) + ln(a) = 2 x ln(12)
ln(a) = (2 x ln(12)) - ln(8)
a = exp((2 x ln(12)) - ln(8))
a = exp(2.8904)
a = 18
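The log-based solution generalizes to any number of known values. The Python sketch below solves for one missing value given the others and the target geometric mean; the function name `solve_missing_value` is an illustrative assumption.

```python
import math


def solve_missing_value(known_values, target_gm):
    """Find the one missing value a such that the geometric mean of
    known_values plus a equals target_gm.

    From mean of logs: (sum(ln(known)) + ln(a)) / n = ln(target),
    so ln(a) = n * ln(target) - sum(ln(known)).
    """
    n = len(known_values) + 1
    log_a = n * math.log(target_gm) - sum(math.log(v) for v in known_values)
    return math.exp(log_a)


a = solve_missing_value([8], 12)
print(f"a = {a:.4f}")

# Check by recomputing the geometric mean of 8 and a.
check = math.sqrt(8 * a)
print(f"Geomean(8, a) = {check:.4f}")
```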
Answer
The answer to the mental math problem above: The exponents add to 20, and 20 divided by 5 is 4, so the geometric mean is 2^4, or 16.
References Cited
EPA. 2002. Development Document for Proposed Effluent Limitations Guidelines and Standards for the Concentrated Aquatic Animal Production Industry Point Source Category. Appendix E: Modified Delta-Log Normal Distribution.

EPA. 2002. Development Document for Proposed Effluent Limitations Guidelines and Standards for the Concentrated Aquatic Animal Production Industry Point Source Category. Appendix F: Alternative Statistical Methods.

Helsel, Dennis R. 1990. Less Than Obvious: Statistical Treatment of Data Below the Reporting Limit.

Helsel, Dennis R. 2005. More Than Obvious: Better Methods for Interpreting Nondetect Data.