In the Spark documentation, the aggregate function `first` is described as:

Aggregate function: returns the first value of a column in a group

This is not accurate! It actually returns the first NON-NULL value of the column.
A scary example shows the problem. Input data with 3 columns "k, t, v":

z, 1, null
z, 2, 1.5
z, 3, 2.4
Code:
df.groupBy("k").agg( $"k", first($"t"), first($"v") )
Output:
z, 1, 1.5
This result is a mix of 2 records!
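The mixing happens because each column is aggregated independently: first($"t") returns 1 from the first row, while first($"v") skips that row's null and returns 1.5 from the second row. A minimal plain-Scala sketch of that per-column fold (not Spark itself, just the same "keep the first non-null value" rule, with nulls modeled as Option):

```scala
// The three input rows of group "z", traced column by column.
// Each column is folded independently, so the results can come from different rows.
val tCol: Seq[Option[Int]]    = Seq(Some(1), Some(2), Some(3))
val vCol: Seq[Option[Double]] = Seq(None, Some(1.5), Some(2.4))

// Mirrors First's update rule: if the accumulator is still null, take the input.
def firstNonNull[A](col: Seq[Option[A]]): Option[A] =
  col.foldLeft(Option.empty[A])((acc, v) => if (acc.isEmpty) v else acc)

println(firstNonNull(tCol)) // Some(1)   -> comes from row 1
println(firstNonNull(vCol)) // Some(1.5) -> comes from row 2
```

Combining the two results gives exactly the mixed record z, 1, 1.5.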
I believe this is a bug in Spark (still present in 1.4.0) rather than a feature of the First aggregate function.
The behaviour is confirmed by the Spark source code:
spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
```
/**
 * Aggregate function: returns the first value in a group.
 *
 * @group agg_funcs
 * @since 1.3.0
 */
def first(e: Column): Column = First(e.expr)
```
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala:
```
override val updateExpressions = Seq(
  /* first = */ If(IsNull(first), child, first)
)
```
In fact, all the basic aggregate functions built on Expression ignore null in this way, for example count:

```
override val updateExpressions = Seq(
  /* currentCount = */ If(IsNull(child), currentCount, currentCount + 1L)
)
```
Ignoring null can be perfectly reasonable for count or sum. In that sense

```
count($"v")
```

is actually the non-null count of column "v", and

```
count(lit(1))
```

is the total count of records.
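The distinction can be sketched in plain Scala over the example's "v" column, again modeling nulls as Option (a sketch of the semantics, not Spark code):

```scala
// The "v" column of the example: one null and two values.
val vCol: Seq[Option[Double]] = Seq(None, Some(1.5), Some(2.4))

val nonNullCount = vCol.count(_.isDefined) // 2 -> like count($"v"), nulls skipped
val totalCount   = vCol.size               // 3 -> like count(lit(1)), every row counted

println(nonNullCount) // 2
println(totalCount)   // 3
```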