Tag Archives: Data Modeling

The definitive answer to the ID vs TABLE_ID debate

This is a generated image showing two tables with surrogate keys, one with the surrogate key named ID and another with the surrogate key named TABLE_ID. The is a box below the tables with the text "ID vs TABLE_ID".

Few debates in database design spark as much disagreement as what to name a surrogate key. There has been a debate over the years about the name of that column, and it generally falls into two choices:

  1. Every table that has a surrogate key should name the column ID.
  2. Every table that has a surrogate key should name the column TABLE_ID.

I have very strong opinions about this debate, born out of over 35 years of building database-backed business applications.

My sincere belief is that every surrogate key should be prefixed with the table name. No, I’m not going to fall for the straw man argument that any of the other columns in a table should be prefixed with the table name. The surrogate key is special.

The ID folks try to make some of the following arguments:

  1. It is short and universal. You don’t have to think as much.
  2. If you are using an ORM (Object-Relational Mapping framework like Hibernate for example), you don’t have to do the incredibly simple thing of updating the ORM to expect table_id instead of just id, because most ORMs default to just ID. Of course, they don’t say it that way; they just say “Our ORM expects ID.”.
  3. We already have tables that just use ID, so we should stick to that standard.

And.. That’s it. Now, sometimes folks will talk about already having to prefix column names in select clauses, and employee.employee_id is redundant, but the reality is that you are almost certainly going to alias the employee table name as either e or maybe emp, you’re never really going to type employee.employee_id.

So, why are the ID folks wrong?

To me, the biggest, most fundamental thing is this:

You have a thing in the database, and you are calling it two different names depending on where it is located

If we have an employee table with an ID column as a surrogate key, whenever we use that surrogate key in another table as a foreign key, we typically put the table name in front of it. So ID in the employee table, and then EMPLOYEE_ID as a foreign key in any table related to the employee table.

If we call this thing EMPLOYEE_ID in the employee table and EMPLOYEE_ID whenever we use it as a foreign key (and yes, I am aware this thing could have another name when used in a different context, for example, MANAGER_ID), then it has the same name everywhere it is used.

It also goes the other way, in that you have hundreds of things named ID, but they are all different things in different tables.

This puts to rest the “It’s short and universal, you don’t have to think as much” argument because you end up thinking more every time you access the table joined to another table.

I strongly believe that the ID to EMPLOYEE_ID naming convention is a cognitive tax that must be paid over and over again.

However, there are a host of other reasons why table_id is the right way to go…

ID as a surrogate key means you are unable to join tables with a USING clause

I totally understand that some people don’t like the USING clause, but I find it very nice.

select ...
  from department d
  join employee e using(department_id);

The above is much cleaner, contains fewer opportunities for mistakes, and is easier to read compared to the below.

select ...
  from department d
  join employee e on d.id = e.department_id;

There are so many ways to make mistakes in the 2nd statement compared to the first statement. Now make this an 8-way table join, and the fun just compounds!

Clarity in queries

Surrogate key names now carry meaning without qualification, which is great in SQL, views, etc.

Self-joins are clearer

where manager_id = employee_id

reads better than

where manager_id = id

Bridge tables read naturally

Associative tables like employee_project(employee_id,project_id, ...) are self-documenting and enable USING clauses in both directions.

Grepability/searching/diagnostics is much easier

Searching logs or code for where a given entity flows is much harder if every table has a column named ID. Sometimes ID means EMPLOYEE_ID, sometimes ID means DEPARTMENT_ID, etc.

REST APIs and JSON payloads are going to read better

Your API contracts and/or JSON payloads are probably going to use EMPLOYEE_ID and DEPARTMENT_ID anyway.

But wait, what about our tables that already use ID?

We’ve already discussed the cognitive tax when using ID, and, given the choice, I’d remove that cognitive tax on all new tables in the database. With old tables that used ID, there are some choices:

  1. Bite the bullet and do the rename of the column, and then update all your applications (APEX makes this pretty easy to find and replace those references) and source code (ALL_SOURCE & ALL_VIEWS). This could be a big project, but the tools are all there. You could do a big bang approach and do everything at once, or do it over time, table by table.
  2. Create a view on top of the legacy tables that does the rename of the ID column to TABLE_ID, and have new applications and code use the view.
  3. Rename the table and the column, and then create a view on top with the old table name that maintains the legacy ID name.

In conclusion

The surrogate key naming decision may seem small at first, but in practice, it touches every query, every log, every API contract, and more. Prefix your surrogate keys (and ONLY your surrogate keys). Your future self, your team, and your database will thank you.


About those auditing columns… a change in perspective.

I’m a bit of a database pureist. I believe that if you are building a system that will capture information for a business or organization, and that information matters to the business, then you should build the tables that will support the insert, update, and delete of information in third normal form (at least, there are additional normal forms) and the best way to figure out how those tables should be built is to work directly with the business by building an Entity Relationship Diagram (note: a Table Diagram is NOT an ERD!) and getting business by-in before you start creating tables.

A frequent business requirement is “We’d like to know by who and when information was created or updated.” Traditionally, we have added four columns to our tables to do this: CREATED, CREATED_BY, UPDATED, and UPDATED_BY. As an aside (I have a feeling there will be a lot of them in this post), I despise when folks do CREATED_DATE, and UPDATED_DATE. You don’t do NAME_VARCHAR2!

Oracle Quick SQL, built into Oracle APEX, builds these same traditional columns on your tables if you ask it to, and then they build a BUI (before update or insert) trigger on the table. Note, I think that a BUI trigger is “wrong” since we’ve had compound triggers in the database since Oracle 11g, and that every table that has normal triggers (basically, if you are not using Edition Based Redefinition) should only have a single trigger, and it should be a compound trigger.

That said, quick SQL builds a trigger that looks like this:

create or replace trigger employee_biu
    before insert or update
    on employee
    for each row
begin
    if inserting then
        :new.created := sysdate;
        :new.created_by := coalesce(sys_context('APEX$SESSION','APP_USER'),user);
    end if;
    :new.updated := sysdate;
    :new.updated_by := coalesce(sys_context('APEX$SESSION','APP_USER'),user);
end employee_biu;
/

As a purist, I have always hated that the trigger always populates the UPDATED and UPDATED_BY columns even on the first insert. To me, those columns should only be populated if the row was actually updated, since that is the name of the columns. By populating them all the time, if you want to answer the question “How many rows have actually been updated?”, you will need to compare (let’s use a table with a hundred million rows) a hundred million values:

select count(*)
  from sometable
 where created != updated;

In addition, you are going to store extra stuff that you don’t really need. Instead, I believed (notice the past tense!), those columns should be nullable and should only be populated if you actually update the column. A hundred million row table with only seven updated rows and an index on updated would be worlds faster than a table that stores all these needless UPDATED and UPDATED_BY by values.

Now, because we have a nullable column, we do introduce the “issues” that come along with a nullable column. You’ll need to use NVL or, even better, COALESCE every time you look at the column, and if you forget, you can end up with bugs. The number of bugs that nulls give us in systems is pretty large. But, because you did things “right” (you didn’t needlessly populate a column that didn’t need to be populated), you would be aware of this, and you wouldn’t fall prey to those bugs.

However, a recent conversation with Anton Nielsen convinced me to update my perspective.

At first, Anton tried to make the argument that every insert is an update too. I wasn’t buying that argument at all. He, of course, brought up all the issues with a nullable column and indexes on virtual columns (Why not add a virtual column that is a COALESCE(UPDATED,CREATED) as LAST_TOUCHED?), etc. But the thing that really convinced me was this:

“In all my years of building applications the business people always ask ‘Who touched this row last?’, they never ask ‘How many rows were touched after they were created?’.”

Anton Nielsen

Maybe those columns should have been called LAST_TOUCHED and LAST_TOUCHED_BY (and, indeed, if you are going to populate them on insert, they really should be), but we can fix this by adding a comment to the UPDATED and UPDATED_BY columns:

comment on column employee.updated is 'Populated during row creation and whenever the row is updated. Effectively that makes this column the LAST_TOUCHED column.';

comment on column employee.updated_by is 'Populated during row creation and whenever the row is updated. Effectively that makes this column the LAST_TOUCHED_BY column. This is set to the current APEX user or the database user depending on the context.';

Let me know what you think.