Histogram in Spark (1)

Spark’s DoubleRDDFunctions provides a histogram function for RDD[Double]. However, there is no histogram function for RDD[String]. Here is a quick exercise that builds one. We will use an immutable Map in this exercise.

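Create a dummy RDD[String] and apply the aggregate method to calculate the histogram. The first function adds one element to the running count within a partition; the second merges the per-partition results: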

scala> val d=sc.parallelize((1 to 10).map(_ % 3).map("val"+_.toString))
scala> d.aggregate(Map[String,Int]())(
     | (m,c)=>m.updated(c,m.getOrElse(c,0)+1),
     | (m,n)=>n.foldLeft(m){case (map,(k,v))=>map.updated(k,v+map.getOrElse(k,0))}
     | )
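
To make the two steps concrete, here is a minimal plain-Scala sketch that imitates what aggregate does, without needing a Spark context. The object name AggregateSketch, the helper names seqOp and combOp, and the chunk size of 4 used to stand in for partitions are all assumptions made for illustration; Spark decides the real partitioning itself.

```scala
// Plain-Scala imitation of the aggregate call above; no Spark required.
object AggregateSketch {
  // Per-partition step: add one element to the running histogram.
  def seqOp(m: Map[String, Int], c: String): Map[String, Int] =
    m.updated(c, m.getOrElse(c, 0) + 1)

  // Cross-partition step: merge two partial histograms.
  def combOp(m: Map[String, Int], n: Map[String, Int]): Map[String, Int] =
    n.foldLeft(m) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0)) }

  // Same values as the dummy RDD: val1, val2, val0, val1, ...
  val data: List[String] = (1 to 10).map(_ % 3).map("val" + _.toString).toList

  // Hypothetical partitioning: chunks of 4 elements play the role of partitions.
  val partitions: List[List[String]] = data.grouped(4).toList

  // Each "partition" is folded on its own, starting from the empty Map.
  val partials: List[Map[String, Int]] =
    partitions.map(_.foldLeft(Map.empty[String, Int])(seqOp))

  // The partial histograms are then merged into the final result.
  val histogram: Map[String, Int] =
    partials.foldLeft(Map.empty[String, Int])(combOp)
}
```

AggregateSketch.histogram holds the counts val0 -> 3, val1 -> 4 and val2 -> 3, whichever chunk size is chosen.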

The second function passed to aggregate merges two maps. We can pull it out into a standalone Scala function:

scala> def mapadd[T](m:Map[T,Int],n:Map[T,Int])={
     | n.foldLeft(m){case (map,(k,v))=>map.updated(k,v+map.getOrElse(k,0))}
     | }

It combines the histograms computed on different partitions by adding up the counts for each key:

scala> mapadd(Map("a"->1,"b"->2),Map("a"->2,"c"->1))
res3: scala.collection.immutable.Map[String,Int] = Map(a -> 3, b -> 2, c -> 1)
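
Because Spark merges the partition results in no fixed order, the merge function should give the same answer regardless of argument order or grouping, and the empty Map should act as a neutral starting value. The following is a small sketch that checks these properties for mapadd on a few hand-picked maps; the object name MapAddCheck and the sample maps are assumptions made for illustration, and a handful of examples is a sanity check rather than a proof.

```scala
// Sanity checks for the merge function, runnable without Spark.
object MapAddCheck {
  // Same logic as mapadd above, written with foldLeft.
  def mapadd[T](m: Map[T, Int], n: Map[T, Int]): Map[T, Int] =
    n.foldLeft(m) { case (acc, (k, v)) => acc.updated(k, v + acc.getOrElse(k, 0)) }

  // Hand-picked sample histograms (illustrative only).
  val a: Map[String, Int] = Map("a" -> 1, "b" -> 2)
  val b: Map[String, Int] = Map("a" -> 2, "c" -> 1)
  val c: Map[String, Int] = Map("b" -> 5)

  // Swapping the arguments does not change the merged counts.
  val orderIndependent: Boolean = mapadd(a, b) == mapadd(b, a)

  // Merging (a, b) first or (b, c) first gives the same result.
  val groupingIndependent: Boolean =
    mapadd(mapadd(a, b), c) == mapadd(a, mapadd(b, c))

  // The empty Map leaves the other histogram unchanged.
  val emptyIsNeutral: Boolean = mapadd(Map.empty[String, Int], a) == a
}
```

Map equality in Scala compares contents rather than iteration order, so these comparisons are not affected by the order in which keys were inserted.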

Using mapadd, we can rewrite the aggregate step:

scala> d.aggregate(Map[String,Int]())(
     | (m,c)=>m.updated(c,m.getOrElse(c,0)+1),
     | mapadd(_,_)
     | )
