Spark Core Programming in Detail: RDD Partitioners

License: CC 4.0 BY-SA

Tags: #spark #bigdata #distributed

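1. Introduction to RDD Partitioners

Every Spark partitioner extends the abstract class Partitioner.
The partitioner determines how many partitions the RDD has and which partition each record lands in after a shuffle, and therefore how many reduce tasks run.
Only key-value RDDs can have a partitioner. For a non-key-value RDD, and for a key-value RDD that has not yet been partitioned by key, the partitioner value is None.
Partition indices of an RDD range from 0 to (numPartitions - 1).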
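
To make the contract above concrete, here is a minimal sketch in plain Scala. SketchPartitioner is a simplified stand-in that mirrors the two abstract members of Spark's Partitioner (numPartitions and getPartition); it is not the Spark class itself, and ParityPartitioner is a hypothetical example invented for illustration.

```scala
// Simplified stand-in for the Partitioner contract (assumption: NOT the Spark class).
abstract class SketchPartitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Hypothetical partitioner: Int keys are routed by parity, anything else goes to partition 0.
class ParityPartitioner extends SketchPartitioner {
  def numPartitions: Int = 2

  def getPartition(key: Any): Int = key match {
    case i: Int => math.abs(i % 2) // i % 2 can be -1 for negative odd keys, so take abs
    case _      => 0
  }
}

object PartitionerSketch {
  def main(args: Array[String]): Unit = {
    val p = new ParityPartitioner
    Seq(1, 2, 3, -5, 10).foreach { k =>
      val idx = p.getPartition(k)
      // Every index must fall in 0 .. numPartitions - 1.
      require(idx >= 0 && idx < p.numPartitions)
      println(s"key=$k -> partition $idx")
    }
  }
}
```

In a real Spark job, a class like this would extend org.apache.spark.Partitioner instead and be passed to partitionBy on a key-value RDD.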

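2. HashPartitioner

This is the default partitioner. For a given key, it computes the key's hashCode and takes the remainder modulo the number of partitions to get the partition index. A negative remainder is shifted into the non-negative range, and a null key always goes to partition 0.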

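// Excerpt from the Spark source (package org.apache.spark).
// Utils is Spark's internal org.apache.spark.util.Utils.
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  // null keys go to partition 0; all other keys use a non-negative hashCode modulo
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  // Two HashPartitioners are equal when they have the same number of partitions
  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner => h.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}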
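
The arithmetic in getPartition can be tried without a Spark cluster. The sketch below is a standalone approximation: nonNegativeMod is re-implemented locally to mirror what Utils.nonNegativeMod does, and HashPartitionSketch and partitionFor are names invented for this example.

```scala
object HashPartitionSketch {
  // Local re-implementation (assumption: mirrors Spark's Utils.nonNegativeMod).
  // Scala's % keeps the sign of the dividend, so a negative hashCode
  // would otherwise produce a negative partition index.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Hypothetical helper with the same logic as HashPartitioner.getPartition.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _    => nonNegativeMod(key.hashCode, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    Seq[Any]("hello", -7, 8, null).foreach { k =>
      println(s"key=$k -> partition ${partitionFor(k, numPartitions)}")
    }
  }
}
```

With 4 partitions, "hello" (hashCode 99162322) lands in partition 2, and the key -7 lands in partition 1 rather than -3, because the negative remainder is shifted up by the partition count.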