In an earlier post, I mentioned that first
aggregate function is actually performed a “first-none-null”. This post is a consequences from that bug/feature.
Here is a quick test of dropDuplicates
DF-method within the Spark-shell
scala> case class Rec(id:Int, k:Int, v:String) defined class Rec scala> val df=sqlContext.createDataFrame(Seq(Rec(1,2,null),Rec(1,3,"a"))) df: org.apache.spark.sql.DataFrame = [id: int, k: int, v: string] scala> df.show +---+---+----+ | id| k| v| +---+---+----+ | 1| 2|null| | 1| 3| a| +---+---+----+ scala> df.dropDuplicates(Seq("id")).show +---+---+---+ | id| k| v| +---+---+---+ | 1| 2| a| +---+---+---+
As you can see here that the result is even not one of the input record! If we consider first
only returning none-null
value as a feature, above behavior of dropDuplicates
is definitely a bug.
For our own project, we actually have to create our own First
aggregate expression to allow true first
. Then we implemented dedupByKey
method within SMV.