Amarok now reads the tags written by Foobar2000 and by mp3gain (written when you call mp3gain without the -a or -r options) from MP3 files. However, the final part of MP3 support is trickier: the RVA2 tag in the ID3v2.4 spec.
Naturally, the specification leaves out an all-important detail: the format of the peak volume field. It tells you which bits represent the peak volume, but not how to interpret them.
Luckily, mutagen, a Python audio metadata library, supports this tag, so its implementation can serve as a reference. However, that implementation tries to be clever, so reverse-engineering it to arrive at the format of the original data requires some work.
The documentation on the Python class implementing the RVA2 frame support says that the peak volume is a float between 0 and 1. So 0 is silent, 1 is full volume (digital full scale). This doesn’t seem right to me, because the replay gain specification points out that it is possible to have a peak volume over 1 in some circumstances in a compressed audio file. But we’ll leave that aside for the moment.
Let’s start with the code. data contains the raw bytes, the first of which is a number specifying how many bits (not bytes) of the remaining data are occupied by the number representing the peak volume.
peak = 0
bits = ord(data[0])
bytes = min(4, (bits + 7) >> 3)
# not enough frame data
if bytes + 1 > len(data): raise ID3JunkFrameError
shift = ((8 - (bits & 7)) & 7) + (4 - bytes) * 8
for i in range(1, bytes+1):
    peak *= 256
    peak += ord(data[i])
peak *= 2**shift
return (float(peak) / (2**31-1))
Let’s start with bytes. This is simply bits (the number of bits representing the peak volume) rounded up to the nearest multiple of 8, then divided by 8, and capped at 4. So if bits is 8n + k (with 0 ≤ k < 8), bytes is n in the case that k = 0 and (n+1) in the case that k > 0, as long as that comes to no more than 4.
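To check the rounding rule, here is a small Python 3 sketch that tabulates the byte count for a few bit widths. The helper name peak_byte_count is mine, not mutagen's; the expression inside it is copied from the listing above.

```python
def peak_byte_count(bits):
    """Bytes read for a peak field that is `bits` bits wide (capped at 4).

    Hypothetical helper name; the expression is the one from the listing.
    """
    return min(4, (bits + 7) >> 3)


# A few widths either side of the byte boundaries, plus one over 32 bits.
for bits in (1, 8, 9, 16, 17, 32, 40):
    print(bits, peak_byte_count(bits))
```

Widths of 8 and 16 need exactly 1 and 2 bytes, 9 and 17 spill into one more byte, and anything over 32 bits is still read as 4 bytes.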
The next variable is the shift. This is the first bit of clever magic, and it takes some time spent staring at it (preferably with a pad and paper to hand) to arrive at the following conclusion:
- if k = 0, shift is 8(4 – n)
- if k > 0, shift is (8 – k) + 8(4 – (n + 1))
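Evaluating the expression numerically is a useful cross-check on the pad-and-paper result. The sketch below (Python 3; peak_shift is a name I made up) computes shift exactly as the listing does and compares it with 32 minus bits for every width from 1 to 32.

```python
def peak_shift(bits):
    """Shift applied to the raw peak value, as computed in the listing.

    Hypothetical helper name; `nbytes` stands in for the listing's `bytes`.
    """
    nbytes = min(4, (bits + 7) >> 3)
    return ((8 - (bits & 7)) & 7) + (4 - nbytes) * 8


# For widths up to 32 bits the two-part expression collapses to 32 - bits.
mismatches = [b for b in range(1, 33) if peak_shift(b) != 32 - b]
print("widths where shift != 32 - bits:", mismatches)
```

So, for widths of at most 32 bits, the shift is just whatever is needed to scale a bits-wide number up to 32 bits.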
Then we read the bytes into peak, most significant byte first. Remember that if k > 0, (8 – k) of the bits we have read are not part of the number. Now we multiply by 2**shift, which shifts peak left (shift is never negative, because bytes is capped at 4) so that the number is scaled up to fill 32 bits (I assume here that Python is treating peak as an integer). Then we turn peak into a float and divide it by 2^31 – 1. This constant is a magic number, being the largest value that can be stored in a signed 32-bit integer.
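To experiment with the whole calculation, here is a Python 3 transcription of the listing above. It is a sketch rather than mutagen's actual code: the function name decode_peak is mine, I raise ValueError where the original raises ID3JunkFrameError, and int.from_bytes replaces the byte-by-byte loop.

```python
def decode_peak(data: bytes) -> float:
    """Decode a peak field: one bit-count byte followed by the value bytes."""
    bits = data[0]
    nbytes = min(4, (bits + 7) >> 3)
    if nbytes + 1 > len(data):
        # Stand-in for ID3JunkFrameError in the original listing.
        raise ValueError("not enough frame data")
    shift = ((8 - (bits & 7)) & 7) + (4 - nbytes) * 8
    # Big-endian read, equivalent to the peak *= 256 / peak += byte loop.
    peak = int.from_bytes(data[1:1 + nbytes], "big")
    peak *= 2 ** shift
    return peak / (2 ** 31 - 1)


# A 16-bit field holding 0x8000 and one holding 0x4000.
print(decode_peak(bytes([16, 0x80, 0x00])))
print(decode_peak(bytes([16, 0x40, 0x00])))
```

The first value comes out a hair above 1.0 and the second a hair above 0.5, which is the effect of dividing by 2^31 – 1 instead of 2^31.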
Something that might shed light on this is that, when mutagen writes the peak volume out, it simply writes the value multiplied by 2^15 as a 16-bit unsigned integer. This would make interpreting the value as simple as placing a “decimal” point after the first binary digit (so we get 1 digit before the point and 15 after). Note that this does indeed allow a peak volume greater than 1 (but less than 2).
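The write side can be sketched the same way. This is my own illustration in Python 3, not mutagen's code: the function names are invented, and the leading bit-count byte of 16, the big-endian byte order, the rounding and the clamp to 0xFFFF are all assumptions on my part.

```python
import struct


def encode_peak_16(peak: float) -> bytes:
    """Write a peak as 1.15 fixed point: value * 2**15 in 16 unsigned bits."""
    if not 0.0 <= peak < 2.0:
        raise ValueError("peak must be in [0, 2) to fit in 16 bits")
    # Assumption: round to nearest and clamp so values just under 2 still fit.
    value = min(0xFFFF, round(peak * 2 ** 15))
    # Assumption: a bit-count byte of 16, then the value in big-endian order.
    return bytes([16]) + struct.pack(">H", value)


def fixed_point_1_15(raw: int) -> float:
    """Read a 16-bit value as 1 binary digit before the point and 15 after."""
    return raw / 2 ** 15


print(encode_peak_16(1.0))
print(encode_peak_16(1.5))
print(fixed_point_1_15(0xC000))
```

Full scale encodes as 0x8000 and 1.5 as 0xC000, so the top bit is the single digit before the point.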
I’m left with two questions:
- Why do we divide the number by MAX_INT_32, rather than simply 2^31? (I just made up that constant name now, don’t complain that it’s wrong.)
- Why does mutagen scale the number up to 32 bits when it reads RVA2 tags, and then write a 16-bit number when it writes them out?
Answers on a postcard (or just in the comments).