Think Twice, Join Once: Distinct On in PostgreSQL

Joining tables is routine when writing a SQL-based application, and I can write a join in my sleep. It is still frustrating when you want to join one table to another and keep only one matching row per record, and you realize that SQL doesn’t have a concise built-in way of expressing that. Think of creating a report that shows only the most recent login for each customer. They may have logged in multiple times, but you only want to see the newest one. Fortunately, Postgres has a special feature called the Distinct On clause that allows you to do exactly that. Let’s see that with the example I just mentioned (I am using Postgres 9.2):

Create Table Users
(
	 username Text Not Null
	,email Text Not Null
	,Constraint users_pk Primary Key (username)
);

Insert Into users (username, email)
Values
 ('Alice', 'alice@example.com')
,('Bob', 'bob@example.com')
,('Charlie', 'charlie@example.com');

Create Table logins
(
	 username Text Not Null
	,browser Text Not Null
	,login_time Timestamptz Not Null
	,Constraint logins_pk Primary Key (username, login_time)
	,Constraint logins_fk_username Foreign Key (username) References users(username)
);

Insert Into logins (username, browser, login_time)
Values
 ('Alice', 'IE6', '2013-01-01 00:00:00')
,('Alice', 'IE6', '2013-01-02 00:00:00')
,('Alice', 'IE7', '2013-01-03 00:00:00')
,('Bob', 'FF22', '2013-02-01 00:00:00')
,('Bob', 'FF23', '2013-02-02 00:00:00')
,('Bob', 'FF23', '2013-02-03 00:00:00')
,('Charlie', 'Opera 11', '2013-03-01 00:00:00')
,('Charlie', 'Opera 11', '2013-03-02 00:00:00')
,('Charlie', 'Opera 11', '2013-03-03 00:00:00')
;

We have multiple users and multiple logins for each user, so let’s write a query using the Distinct On clause in Postgres to get each user’s most recent login:

Select Distinct On (u.username)
	 u.username
	,u.email
	,l.browser
	,l.login_time
From users u
Join logins l On l.username = u.username
Order By u.username, login_time Desc
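
To make the result easier to check, here is a sketch of the same query trimmed to the columns that matter. It assumes the users and logins tables and sample rows defined above, and that the session's TimeZone setting has not changed since the rows were inserted (the login_date alias and the To_Char formatting are only there to keep the output independent of how your client prints a Timestamptz).

```sql
-- Same Distinct On query as above, reduced to the columns needed to verify it.
-- Assumes the users/logins sample data from this post.
Select Distinct On (u.username)
	 u.username
	,l.browser
	,To_Char(l.login_time, 'YYYY-MM-DD') As login_date
From users u
Join logins l On l.username = u.username
Order By u.username, l.login_time Desc;

-- Expected rows for the sample data (one per user, newest login kept):
--  Alice   | IE7      | 2013-01-03
--  Bob     | FF23     | 2013-02-03
--  Charlie | Opera 11 | 2013-03-03
```

Each user had three logins, and only the one with the latest login_time survives.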

The rules are simple: put the column or columns that uniquely identify the record you want to see once (in this case, username) in the Distinct On clause, then put those same columns first in the Order By clause. The remaining Order By columns decide which row is kept for each group; here, login_time Desc keeps the newest login. Notice that the query does not work if we add a column to the Distinct On clause without also putting it into the Order By clause:

Select Distinct On (u.username, u.email)
	 u.username
	,u.email
	,l.browser
	,l.login_time
From users u
Join logins l On l.username = u.username
Order By u.username, login_time Desc

ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: Select Distinct On (u.username, u.email)
                                        ^
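
A minimal sketch of the fix, assuming the same users and logins tables as above: list every Distinct On expression at the start of the Order By, and only then the column that picks the winning row.

```sql
-- Both Distinct On columns appear first in Order By, so Postgres accepts the query.
-- login_time Desc still decides which login is kept for each (username, email) pair.
Select Distinct On (u.username, u.email)
	 u.username
	,u.email
	,l.browser
	,l.login_time
From users u
Join logins l On l.username = u.username
Order By u.username, u.email, l.login_time Desc;
```

Because each username has exactly one email in this data, this returns the same three rows as the single-column version.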

Now, to show that this works no matter how many joins there are, let’s create a new table representing messages sent to the user:

Create Table messages
(
	 message_id Serial Not Null
	,username Text Not Null
	,message Text Not Null
	,message_time Timestamptz Not Null
	,Constraint messages_pk Primary Key (message_id)
	,Constraint messages_fk_username Foreign Key (username) References users(username)
);

Insert Into messages (username, message, message_time)
Values
 ('Alice', 'First Message', '2013-04-01 00:00:00')
,('Alice', 'Second Message', '2013-04-02 00:00:00')
,('Alice', '3rd Message', '2013-04-03 00:00:00')
,('Bob', 'First Message', '2013-05-01 00:00:00')
,('Bob', 'Second Message', '2013-05-02 00:00:00')
,('Bob', '3rd Message', '2013-05-03 00:00:00')
,('Charlie', 'First Message', '2013-06-01 00:00:00')
,('Charlie', 'Second Message', '2013-06-02 00:00:00')
,('Charlie', '3rd Message', '2013-06-03 00:00:00')
;

Once again, we get the most recent login by running the query, even though we are now joining to messages, which has multiple records per user.

Select Distinct On (u.username)
	 u.username
	,u.email
	,l.browser
	,l.login_time
	,m.message
From users u
Join logins l On l.username = u.username
Join messages m On m.username = u.username
Order By u.username, login_time Desc, m.message_time Asc
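
To see what Distinct On is collapsing in that query, here is a quick sketch that counts the joined rows per user. It assumes the users, logins, and messages sample data from this post; the joined_rows alias is just a label for illustration.

```sql
-- Two one-to-many joins multiply: every login is paired with every message.
Select
	 u.username
	,Count(*) As joined_rows
From users u
Join logins l On l.username = u.username
Join messages m On m.username = u.username
Group By u.username
Order By u.username;

-- With the sample data, each user has 3 logins x 3 messages:
--  Alice   | 9
--  Bob     | 9
--  Charlie | 9
```

Distinct On keeps the first of those nine rows per user according to the Order By. With login_time Desc and message_time Asc, that is the newest login paired with the oldest message, so the message column shows 'First Message' for every user.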

Now, if we just wanted to get timestamps of the messages or logins, this feature would be uninteresting and easily done using group by and the max() aggregate function.

Select
	 u.username
	,u.email
	,Max(l.login_time) As last_login_time
	,Max(m.message_time) As last_message_time
From users u
Join logins l On l.username = u.username
Join messages m On m.username = u.username
Group By u.username, u.email

But notice how trying to get the most recent message falls on its face:

Select
	 u.username
	,u.email
	,Max(m.message) As last_message
From users u
Join logins l On l.username = u.username
Join messages m On m.username = u.username
Group By u.username, u.email

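You get the second message for each user, which is clearly not what we wanted. Max() on a text column compares the strings themselves, not the time they were sent, and 'Second Message' sorts after both 'First Message' and '3rd Message'. In other DBMSs that don’t have this feature, here’s how you’d actually get the last message: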

Select
	 s.username
	,s.email
	,m.message
From
(
	Select
		 u.username
		,u.email
		,Max(m.message_time) As last_message_time
	From users u
	Join logins l On l.username = u.username
	Join messages m On m.username = u.username
	Group By u.username, u.email
) s
Join messages m on m.username = s.username and m.message_time = s.last_message_time

Unsightly! So, this feature is for when you want to join to another table just once without using a tangle of group by’s, aggregate functions, and subqueries. The beauty of SQL comes when you can express your intent, and Distinct On allows you to express that you want to join only once.
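
For comparison, here is a sketch of the same "last message per user" report written with Distinct On. It assumes the users and messages sample data from this post.

```sql
-- One row per user; message_time Desc makes the newest message the one that is kept.
Select Distinct On (u.username)
	 u.username
	,u.email
	,m.message
From users u
Join messages m On m.username = u.username
Order By u.username, m.message_time Desc;

-- With the sample data, every user's newest message is '3rd Message'.
```

No subquery and no aggregate: the Order By states which row wins, and Distinct On keeps it.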

Posted in Databases
One comment on “Think Twice, Join Once: Distinct On in PostgreSQL”
  1. Adam Donahue says:

    This post would be more useful if you included output from running the queries you mention above.
