Using Python 3.11.10, I computed the set difference between two ID files with hundreds of millions of lines each (the roaringbitmap package would not install on newer Python versions, which is why I stayed on 3.11.10).
1. Implementation notes
Assume each file contains one number per line.
We use RoaringBitmap: each file is loaded into its own bitmap, where every bit position represents one element, so each bitmap covers 200 million bit positions. At the end we simply take the difference of the two bitmaps.
RoaringBitmap compresses the bit positions. For example, 100 consecutive bits that are all 1 can be stored as a single run of length 100 (a simplified illustration; the real encoding is more elaborate), so there is no need to worry about memory blowing up.
If the 200-million-line files contain duplicate values, the computation is even faster.
If the values are not contiguous and have gaps, the computation may be somewhat slower; on balance, it should not be much slower.
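Before scaling to 200 million lines, here is a minimal sketch of the set semantics we rely on. It uses only the calls that also appear in the full script below (RoaringBitmap(), add, the - operator, len, and iteration):

from roaringbitmap import RoaringBitmap

# Two tiny bitmaps: `a` holds 0..9, `b` holds 0..7.
a = RoaringBitmap()
b = RoaringBitmap()
for i in range(10):
    a.add(i)
for i in range(8):
    b.add(i)

# Difference: elements present in `a` but not in `b`.
diff = a - b
print(len(diff))     # 2
print(sorted(diff))  # [8, 9]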
2. Generating the two files
# ids2.txt has 10 fewer numbers than ids1.txt, so that the difference shows up later
import time

start_ts = int(time.time())
max_num = 200_000_000

with open('ids1.txt', 'w') as f:
    for i in range(max_num):
        f.write(str(i) + "\n")

with open('ids2.txt', 'w') as f:
    for i in range(max_num):
        if i < max_num - 10:
            f.write(str(i) + "\n")

print(int(time.time() - start_ts))
Generating the two 200-million-line files took 44 seconds, with a file size of 1.8 GB.
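A fair share of that time is likely spent on the 200 million individual write calls. A hedged sketch of a batched variant for ids1.txt (the chunk size is an arbitrary choice, not something measured in the original test):

import time

start_ts = int(time.time())
max_num = 200_000_000
chunk = 1_000_000  # arbitrary batch size; tune for your machine

with open('ids1.txt', 'w') as f:
    for lo in range(0, max_num, chunk):
        hi = min(lo + chunk, max_num)
        # One write call per million numbers instead of one per line.
        f.write("\n".join(map(str, range(lo, hi))) + "\n")

print(int(time.time() - start_ts))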
3. Computing the difference
The code is as follows.
from roaringbitmap import RoaringBitmap
import time


def read_ids_to_bitmap(file_path):
    bitmap = RoaringBitmap()
    with open(file_path, 'r') as f:
        for line in f:
            try:
                id_num = int(line.strip())
                bitmap.add(id_num)
            except ValueError:
                continue
    return bitmap


def compute_difference(file1_path, file2_path, output_path):
    start_time = time.time()

    # Read both files into RoaringBitmaps
    print("Reading first file...")
    bitmap1 = read_ids_to_bitmap(file1_path)
    print(f"First file loaded. Size: {len(bitmap1)} IDs")

    print("Reading second file...")
    bitmap2 = read_ids_to_bitmap(file2_path)
    print(f"Second file loaded. Size: {len(bitmap2)} IDs")

    # Compute difference
    print("Computing difference...")
    diff = bitmap1 - bitmap2

    # Write result to output file
    print("Writing result...")
    with open(output_path, 'w') as f:
        for id_num in diff:
            f.write(f"{id_num}\n")

    print(f"Difference computed. Result size: {len(diff)} IDs")
    print(f"Total time: {time.time() - start_time:.2f} seconds")


if __name__ == "__main__":
    file1 = "ids1.txt"
    file2 = "ids2.txt"
    output = "diff.txt"
    compute_difference(file1, file2, output)
Computing the difference took 42 seconds.
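Since ids2.txt is missing the last 10 numbers, diff.txt should contain exactly the values 199999990 through 199999999. A quick sanity check with the standard library:

with open('diff.txt') as f:
    missing = [int(line) for line in f]

print(len(missing))                # expected: 10
print(min(missing), max(missing))  # expected: 199999990 199999999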
I first came across RoaringBitmap through the HBase database, when working with HBase coprocessors; the Java RoaringBitmap is even more convenient to use. The RoaringBitmap GitHub address is below.
Finally, a quick plug: you are welcome to visit my newly built website, which collects and organizes quality everyday software for the Mac.