In the Spark documentation, the aggregate function `first` is described as:

Aggregate function: returns the first value of a column in a group

This is not accurate! It actually returns the first NON-NULL value of the column.
A scary example shows the problem. Input data with 3 columns "k, t, v":

z, 1, null
z, 2, 1.5
z, 3, 2.4
Code:
df.groupBy("k").agg( $"k", first($"t"), first($"v") )
Output:
z, 1, 1.5
This result is a mix of 2 records!
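The mixing happens because each column is aggregated independently: first($"t") returns 1 from the first row, while first($"v") skips that row's null and returns 1.5 from the second row. A minimal plain-Scala sketch of that per-column fold (not Spark itself, just the same "keep the first non-null value" rule, with nulls modeled as Option):

```scala
// The three input rows of group "z", traced column by column.
// Each column is folded independently, so the results can come from different rows.
val tCol: Seq[Option[Int]]    = Seq(Some(1), Some(2), Some(3))
val vCol: Seq[Option[Double]] = Seq(None, Some(1.5), Some(2.4))

// Mirrors First's update rule: if the accumulator is still null, take the input.
def firstNonNull[A](col: Seq[Option[A]]): Option[A] =
  col.foldLeft(Option.empty[A])((acc, v) => if (acc.isEmpty) v else acc)

println(firstNonNull(tCol)) // Some(1)   -> comes from row 1
println(firstNonNull(vCol)) // Some(1.5) -> comes from row 2
```

Combining the two results gives exactly the mixed record z, 1, 1.5.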
I believe this is a bug in Spark (still present in 1.4.0) rather than a feature of the First aggregate function.
The behaviour is confirmed by the Spark source code:
spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
```
/**
 * Aggregate function: returns the first value in a group.
 *
 * @group agg_funcs
 * @since 1.3.0
 */
def first(e: Column): Column = First(e.expr)
```
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala:
```
override val updateExpressions = Seq(
  /* first = */ If(IsNull(first), child, first)
)
```
In fact, all the basic aggregate functions built on Expression ignore null in this way, for example count:

```
override val updateExpressions = Seq(
  /* currentCount = */ If(IsNull(child), currentCount, currentCount + 1L)
)
```
Ignoring null can be perfectly reasonable for count or sum. In that sense

```
count($"v")
```

is actually the non-null count of column "v", and

```
count(lit(1))
```

is the total count of records.
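The distinction can be sketched in plain Scala over the example's "v" column, again modeling nulls as Option (a sketch of the semantics, not Spark code):

```scala
// The "v" column of the example: one null and two values.
val vCol: Seq[Option[Double]] = Seq(None, Some(1.5), Some(2.4))

val nonNullCount = vCol.count(_.isDefined) // 2 -> like count($"v"), nulls skipped
val totalCount   = vCol.size               // 3 -> like count(lit(1)), every row counted

println(nonNullCount) // 2
println(totalCount)   // 3
```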