dropDuplicates may create unexpected results

In an earlier post, I mentioned that the first aggregate function actually behaves as a "first-non-null". This post describes a consequence of that bug/feature.

Here is a quick test of the dropDuplicates DataFrame method in the Spark shell:

scala> case class Rec(id:Int, k:Int, v:String)
defined class Rec

scala> val df=sqlContext.createDataFrame(Seq(Rec(1,2,null),Rec(1,3,"a")))
df: org.apache.spark.sql.DataFrame = [id: int, k: int, v: string]

scala> df.show
+---+---+----+
| id|  k|   v|
+---+---+----+
|  1|  2|null|
|  1|  3|   a|
+---+---+----+


scala> df.dropDuplicates(Seq("id")).show
+---+---+---+
| id|  k|  v|
+---+---+---+
|  1|  2|  a|
+---+---+---+

As you can see, the result is not even one of the input records! Even if we consider first returning only non-null values to be a feature, the above behavior of dropDuplicates is definitely a bug.
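
What is happening is that dropDuplicates on a subset of columns effectively groups by those columns and applies the First aggregate to every remaining column, so the null in v gets "filled in" from the other row. Here is a minimal sketch that reproduces the same mixed row explicitly, assuming the same df and the same Spark 1.x first-non-null behavior described above:

import org.apache.spark.sql.functions.first

// Group by the dedup key and take "first" of every other column.
// With the first-non-null behavior, k comes from the first row (2)
// while v comes from the second row ("a"), producing the same
// mixed record as dropDuplicates(Seq("id")).
df.groupBy("id")
  .agg(first("k").as("k"), first("v").as("v"))
  .show()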

For our own project, we had to create our own First aggregate expression that implements a true first. We then implemented a dedupByKey method within SMV on top of it.
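
If you cannot pull in SMV, the same effect (one complete input record per key, with no column mixing) can be approximated by keying and reducing the underlying RDD. This is only a sketch of the idea, not the SMV implementation, and which record survives per key is not deterministic:

// Dedup by the "id" column while always returning whole input records.
val keyIdx = df.schema.fieldNames.indexOf("id")
val dedupedRdd = df.rdd
  .keyBy(row => row.get(keyIdx))   // key each full Row by its id value
  .reduceByKey((r1, r2) => r1)     // keep one complete Row per key
  .values
val deduped = sqlContext.createDataFrame(dedupedRdd, df.schema)
deduped.show()
// Shows either (1, 2, null) or (1, 3, "a"): an arbitrary record per key,
// but never a mixture of two records.

A custom First aggregate expression, as mentioned above, keeps that logic inside the DataFrame/aggregate world instead of dropping down to the RDD API.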
