On the Unreasonable Effectiveness of NULL Checks
If you filter for values on a NULLABLE column, strongly consider adding a NOT NULL check as well.
NULLABLE columns in a relational database are useful: they can save space, and they can model 1-to-0-or-1 relationships. However, they can affect your query performance in surprising ways, and you must cover for them by making queries a little more complex.
Our database includes a table for tracking individual activities of our players. These activities may reference a single learning objective, so we have the NULLABLE field learning_objective_id on our player_activities table; it is a foreign key to our learning_objective table. Without discussing the alternatives to this design, let's focus on the performance effects.
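A minimal sketch of the schema described above. Only the table names and the learning_objective_id column come from this article; the other columns and types are assumptions for illustration:

```sql
-- Hypothetical schema matching the description above.
CREATE TABLE learning_objective (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE player_activities (
    id                    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    player_id             bigint NOT NULL,
    -- NULLABLE: an activity may reference zero or one learning objective
    learning_objective_id bigint REFERENCES learning_objective (id)
);
```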
First, the table is large, so let’s get a baseline:
psql 1> explain select count(*) from player_activities;
QUERY PLAN
-----------------------------------------------------------------
Finalize Aggregate (cost=5365250.92..5365250.93 rows=1 width=8)
...
We have an index for learning_objective_id on this table. Here is the cost for searching for a specific kind of player activity by learning objective:
psql 1> explain select count(*) from player_activities where learning_objective_id = 1;
QUERY PLAN
-----------------------------------------------------------------
Aggregate (cost=7868.23..7868.24 rows=1 width=8)
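The index mentioned above would look something like the following; the index name is an assumption, since the article does not state it:

```sql
-- Assumed definition of the index on the NULLABLE foreign key.
CREATE INDEX player_activities_learning_objective_id_idx
    ON player_activities (learning_objective_id);
```

A partial index declared with WHERE learning_objective_id IS NOT NULL is another option worth profiling for columns that are mostly NULL.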
This cost may not be terrible in the grand scheme of things, and we can celebrate, but if this query were embedded in something more complex, such as nested subqueries, groupings, and joins, we might be thankful for an improvement. Look what happens when we add a NULL check on this column:
psql 1> explain select count(*) from player_activities where learning_objective_id is not null and learning_objective_id = 1;
QUERY PLAN
-----------------------------------------------------------------
Aggregate (cost=4780.10..4780.11 rows=1 width=8)
This improvement is a feature of checking for NULL on a NULLABLE column specifically, not of adding just any extra comparison. For example, we could instead exclude learning_objective_ids matching some other real value and see that the cost is actually worse than just looking up the one value:
psql 1> explain select count(*) from player_activities where learning_objective_id != 2 and learning_objective_id = 1;
QUERY PLAN
-----------------------------------------------------------------
Aggregate (cost=7935.40..7935.41 rows=1 width=8)
The general rule of thumb is: if you are filtering or joining on a NULLABLE column, also check that it is NOT NULL. This is not a perfect rule, and we have run into rare situations where the extra check made performance worse. As always with optimization advice, profile!
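Applied to a join, the rule of thumb looks like this. The query is a hypothetical example, not one from our codebase:

```sql
-- Hypothetical example: filtering the NULLABLE foreign key
-- with an explicit IS NOT NULL check alongside the join.
SELECT lo.name, count(*)
FROM player_activities pa
JOIN learning_objective lo
  ON lo.id = pa.learning_objective_id
WHERE pa.learning_objective_id IS NOT NULL
GROUP BY lo.name;
```

An inner join already excludes rows where learning_objective_id is NULL, so the check does not change the result set; whether it changes the plan is exactly the kind of thing to verify with EXPLAIN.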