dropDuplicates may create unexpected results

In an earlier post, I mentioned that the first aggregate function actually behaves as a "first-non-null". This post describes a consequence of that bug/feature.

Here is a quick test of the dropDuplicates DataFrame method in the Spark shell:

scala> case class Rec(id:Int, k:Int, v:String)
defined class Rec

scala> val df=sqlContext.createDataFrame(Seq(Rec(1,2,null),Rec(1,3,"a")))
df: org.apache.spark.sql.DataFrame = [id: int, k: int, v: string]

scala> df.show
+---+---+----+
| id|  k|   v|
+---+---+----+
|  1|  2|null|
|  1|  3|   a|
+---+---+----+


scala> df.dropDuplicates(Seq("id")).show
+---+---+---+
| id|  k|  v|
+---+---+---+
|  1|  2|  a|
+---+---+---+

As you can see, the result is not even one of the input records! Even if we consider first returning only non-null values to be a feature, the above behavior of dropDuplicates is definitely a bug.
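
What is happening is that dropDuplicates on a subset of columns effectively groups by those columns and applies the First aggregate to every remaining column, so the null in v gets "filled in" from the other row. Here is a minimal sketch that reproduces the same mixed row explicitly, assuming the same df and the same Spark 1.x first-non-null behavior described above:

import org.apache.spark.sql.functions.first

// Group by the dedup key and take "first" of every other column.
// With the first-non-null behavior, k comes from the first row (2)
// while v comes from the second row ("a"), producing the same
// mixed record as dropDuplicates(Seq("id")).
df.groupBy("id")
  .agg(first("k").as("k"), first("v").as("v"))
  .show()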

For our own project, we had to create our own First aggregate expression that implements a true first. We then implemented a dedupByKey method within SMV on top of it.
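
If you cannot pull in SMV, the same effect (one complete input record per key, with no column mixing) can be approximated by keying and reducing the underlying RDD. This is only a sketch of the idea, not the SMV implementation, and which record survives per key is not deterministic:

// Dedup by the "id" column while always returning whole input records.
val keyIdx = df.schema.fieldNames.indexOf("id")
val dedupedRdd = df.rdd
  .keyBy(row => row.get(keyIdx))   // key each full Row by its id value
  .reduceByKey((r1, r2) => r1)     // keep one complete Row per key
  .values
val deduped = sqlContext.createDataFrame(dedupedRdd, df.schema)
deduped.show()
// Shows either (1, 2, null) or (1, 3, "a"): an arbitrary record per key,
// but never a mixture of two records.

A custom First aggregate expression, as mentioned above, keeps that logic inside the DataFrame/aggregate world instead of dropping down to the RDD API.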
