Spark源码分析之分区器的作用

RangePartitioner.determineBounds(candidates, partitions)

}

def sketch[K : ClassTag](

rdd: RDD[K],

sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {

val shift = rdd.id

// val classTagK = classTag[K] // to avoid serializing the entire partitioner object

val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>

val seed = byteswap32(idx ^ (shift << 16))

val (sample, n) = SamplingUtils.reservoirSampleAndCount(

iter, sampleSizePerPartition, seed)

//包装成三元组，（索引号，分区的内容个数，抽样的内容）

Iterator((idx, n, sample))

}.collect()

val numItems = sketched.map(_._2).sum

//返回（数据条数，（索引号，分区的内容个数，抽样的内容））

(numItems, sketched)

}

真正的抽样算法在SamplingUtils中,因为在Spark中是须要一次性取多个值的，是以直接去前n个数值，然后依次概率调换即可：

def reservoirSampleAndCount[T: ClassTag]( 
      input: Iterator[T], 
      k: Int, 
      seed: Long = Random.nextLong()) 
    : (Array[T], Long) = { 
    //创建临时数组 
    val reservoir = new Array[T](k) 
    // Put the first k elements in the reservoir. 
    // 掏出前k个数，并把对应的rdd中的数据放入对应的序号的数组中 
    var i = 0 
    while (i < k && input.hasNext) { 
      val item = input.next() 
      reservoir(i) = item 
      i += 1 
    } 
 
    // If we have consumed all the elements, return them. Otherwise do the WordStrment. 
    // 如不雅全部的元素，比要采取的采样数少，那么直接返回 
    if (i < k) { 
      // If input size < k, trim the array 	
			 5/9   首页 上一页 3 4 5 6 7 8 下一页 尾页	
			

　　推荐阅读
　　SDN能解决很多问题，但不包括安全
            
            跟着数字化企业尽力寻求最佳安然解决筹划来保护其赓续扩大的收集，很多企颐魅正在寻求供给互操作性功能的下一代对象。软件定义收集(SDN)具有很多的优势，经由过程将多个设备的┞菲握平面整>>>详细阅读


本文标题：Spark源码分析之分区器的作用
地址：http://www.17bianji.com/lsqh/34919.html
 1/2    1