Be Careful with First

In the Spark documentation, the aggregate function “first” is described as:

Aggregate function: returns the first value of a column in a group

This is not accurate: it actually returns the first NON-NULL value of the column. A scary example shows the problem:

Input data with three columns “k, t, v”:

z, 1, null
z, 2, 1.5
z, 3, 2.4

Code:

df.groupBy("k").agg(
  $"k",
  first($"t"),
  first($"v")
)

Output:

z, 1, 1.5

This result is a mix of two records: t comes from record 1, while v comes from record 2 (the first non-null value of that column).

I believe this is a bug in Spark (still present in 1.4.0) rather than an intended feature of the First aggregate function.
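The null-skipping semantics are easy to reproduce outside Spark. Here is a minimal plain-Scala sketch of the behavior, not Spark code: the `Row` case class and the `firstNonNull` helper are my own, and `Option` stands in for SQL NULL.

```scala
// Model the sample data; None stands in for SQL NULL.
case class Row(k: String, t: Int, v: Option[Double])

val rows = Seq(
  Row("z", 1, None),
  Row("z", 2, Some(1.5)),
  Row("z", 3, Some(2.4))
)

// Mirrors First's update expression: first = If(IsNull(first), child, first),
// i.e. keep scanning until a non-null value is seen.
def firstNonNull[A](values: Seq[Option[A]]): Option[A] =
  values.foldLeft(Option.empty[A])((acc, x) => if (acc.isEmpty) x else acc)

val firstT = firstNonNull(rows.map(r => Option(r.t))) // Some(1), from record 1
val firstV = firstNonNull(rows.map(_.v))              // Some(1.5), from record 2
```

Taken together, `firstT` and `firstV` reproduce the mixed output row “z, 1, 1.5” above.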


3 Responses to Be Careful with First

  1. xuxiaoshun says:

    This is true according to the spark source code:
    spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala:

    /**
     * Aggregate function: returns the first value in a group.
     *
     * @group agg_funcs
     * @since 1.3.0
     */
    def first(e: Column): Column = First(e.expr)

    spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala:

    override val updateExpressions = Seq(
      /* first = */ If(IsNull(first), child, first)
    )

    In fact, all the basic aggregate functions implemented as Expressions “ignore” null in this way, such as count:

    override val updateExpressions = Seq(
      /* currentCount = */ If(IsNull(child), currentCount, currentCount + 1L)
    )

  2. Bo Zhang says:

    “Ignoring” null could be perfectly reasonable for count or sum. In that sense

    count($"v")

    is actually the non-null count of column “v”, and

    count(lit(1))

    is the total count of records.

  3. Pingback: dropDuplicates may create unexpected result | Big Data Analytics with Spark
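As a footnote to the comments above, the two counts can be sketched in plain Scala (`Option` models SQL NULL; the variable names are my own illustration):

```scala
// The v column of the sample data; None stands in for SQL NULL.
val vs: Seq[Option[Double]] = Seq(None, Some(1.5), Some(2.4))

val nonNullCount = vs.count(_.isDefined).toLong // like count($"v"): nulls skipped
val totalCount   = vs.size.toLong               // like count(lit(1)): every record
```

Here `nonNullCount` is 2 and `totalCount` is 3, matching the distinction Bo Zhang describes.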
