Wednesday, March 24, 2010

Rendering Trees with Closure Tables

I got a comment from a reader about the Naive Trees section of my presentation SQL Antipatterns Strike Back. I've given this presentation at the MySQL Conference & Expo in the past.

I'd also like to mention that I've developed these ideas into a new book, SQL Antipatterns: Avoiding the Pitfalls of Database Programming. The book is now available in Beta and for pre-order from Pragmatic Bookshelf.

Here's the reader's question:
I would like to ask if there's a way I can dump all the hierarchies in a single query using a closure table? For example I have a following tree:

rootree
- 1stbranch
- midbranch
- corebranch
- leafnodes
- lastbranch
- lastleaf

and I want to display it like:

rootree -> 1stbranch
rootree -> midbranch
rootree -> midbranch -> corebranch
rootree -> midbranch -> corebranch -> leafnodes
rootree -> lastbranch
rootree -> lastbranch -> lastleaf

The Closure Table is a design for representing trees in a relational database by storing all the paths between tree nodes. Using the reader's example, one could define and populate two tables like this:
drop table if exists closure;
drop table if exists nodes;

create table nodes (
node int auto_increment primary key,
label varchar(20) not null
);

insert into nodes (node, label) values
(1, 'rootree'),
(2, '1stbranch'),
(3, 'midbranch'),
(4, 'corebranch'),
(5, 'leafnodes'),
(6, 'lastbranch'),
(7, 'lastleaf');

create table closure (
ancestor int not null,
descendant int not null,
primary key (ancestor, descendant),
foreign key (ancestor) references nodes(node),
foreign key (descendant) references nodes(node)
);

insert into closure (ancestor, descendant) values
(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (1,7),
(2,2),
(3,3), (3,4), (3,5),
(4,4), (4,5),
(5,5),
(6,6), (6,7),
(7,7);
What we need to do is find all the descendants of the root node 1, then for each of these descendant nodes, list its ancestors in order, separated by an arrow. We can use MySQL's useful GROUP_CONCAT() function to build this list for us.
select group_concat(n.label order by n.node separator ' -> ') as path
from closure d
join closure a on (a.descendant = d.descendant)
join nodes n on (n.node = a.ancestor)
where d.ancestor = 1 and d.descendant != d.ancestor
group by d.descendant;

Here's the output in the MySQL client. It looks like what the reader asked for:
+-------------------------------------------------+
| path |
+-------------------------------------------------+
| rootree -> 1stbranch |
| rootree -> midbranch |
| rootree -> midbranch -> corebranch |
| rootree -> midbranch -> corebranch -> leafnodes |
| rootree -> lastbranch |
| rootree -> lastbranch -> lastleaf |
+-------------------------------------------------+
I do assume for the purposes of ordering that all of a node's ancestors have a lower node number. You could alternatively use a pathlength column to the closure table and sort by that.

The Closure Table design is nice compared to the Nested Sets (or Preorder Traversal) design, because it supports the use of referential integrity. By using indexes, the EXPLAIN report shows that MySQL query optimizer does a pretty good job on it (I've omitted a few columns for brevity):
+-------+--------+-------------------+--------------------------+
| table | type | ref | Extra |
+-------+--------+-------------------+--------------------------+
| d | range | NULL | Using where; Using index |
| a | ref | test.d.descendant | |
| n | eq_ref | test.a.ancestor | |
+-------+--------+-------------------+--------------------------+

44 comments:

xxx said...

Wow. It seems your book is what I need. Ordered a PDF version -)

info said...

Hi Bill,

How would you do this and taking all the structure into an array instead of a big string?

Thanks!

Vani said...

I came across your slides while looking for better patterns for storing hierarchical data. It's definitely one of the best articles i came across.

Closure tables just got my attention. I am planning to use it for a folders design for my project.Folders in my project are heavily shared, moved, created, deleted in the system. They are also permission-ed.

Do you think closure tables is a good idea? or Path enumeration is better?

I really appreciate your feedback.

Thanks!

Bill Karwin said...

@Vani: Hi thanks for your comment.

Closure Table has many advantages, but if one of the common operations you want to do is to move a subtree, this may not be the best design.

Path Enumeration may be easier for that particular kind of change. To move a subtree, you can do it by doing a string-replace on the prefix of a path.

In the example in this blog, suppose I want to move 'corebranch' to be a child of 'lastbranch':

UPDATE Folders
SET path = REPLACE(path, 'rootree/midbranch/', 'rootree/lastbranch/)
WHERE path LIKE 'rootree/midbranch/corebranch/%'

The best tree design for your app depends on what kinds of operations you need to do with your tree.

Vani said...

@Bill: Thank you Sir for your prompt reply.

My app simply resembles the google docs for example. I let users organize their files in folders and share them with fellow users.I was looking at hierarchical tags(labels, as in gmail/google docs). But the problem with that is sharing, i am not sure if hierarchical tags works well with permissions.

Sheldon said...

Hi Bill

First, thanks a lot for the slides from your presentation on closure trees. They are great, finally I'll be able to get fast trees working with referential integrity.

Anyway, I wanted to ask, when you mention that you depend on the id order in this post, and the alternative would be to depend on some kind of length field, I'm really not sure what you mean, or how that could be used. Would you mind elaborating, I'm trying to get this working with preorder traversal like in the example, but as far as I can tell the order wouldn't be correct if I say, moved a subtree about?

Would really appreciate the help.

Regards
Sheldon

milehighrockstar said...

Bill, thanks for pointing me to your blog! I'm reading your book and it's very helpful (as well as Joe Celko's two SQL for Smarties books on hierarchical data). I'm sold on the versatility and integrity of closure tables; however, I'm finding the queries for returning the data I need to be less than intuitive.

Using the above example, how would one restructure the SELECT statement to list only rows with complete paths, i.e., exclude the rows containing sub-paths of succeeding rows?

Desired result:
rootree -> 1stbranch
rootree -> midbranch -> corebranch -> leafnodes
rootree -> lastbranch -> lastleaf

Thanks again for your help!

Jordi said...

Hi Bill, first of all, I am reading your and I think it is a treasure! Its full of real examples.

My question is how I can implement a threaded comment system, where the comments are paginated?

I don't know how I can get it done with closure tables.

For example: Show 5 comments and their replies

Comment 1
Reply
Reply
Reply
Comment 2
Reply
Comment 3
Comment 4
Comment 5
Reply
Reply
Reply
Reply

Jordi said...

I think I'll do 2 queries. The first to return all 5 comments and then make a query that returns me the answers to these 5 comments.

In a column of the table where comment information resides, will store if it is a comment or a response.

Sergio said...

Hi Bill. I've noticed in your closure table samples each node (even leaf nodes) are referencing to themselves. There's no way to distinguish root nodes from branches and last segments of the tree, if yo do it this way. I would rather let root nodes to have a reference to themselves only. In this case they can be distinguished from the rest by either length, which is 0 or equal ancestor and descendant values.

On another note, wondering if you have any suggestions regarding the most effective way to find all last segments of a particular tree?

Bill Karwin said...

Hi @Sergio, thanks for the questions.

The root node of a tree is the node that has no ancestors, besides itself.

SELECT c.* FROM closure AS c
LEFT OUTER JOIN closure AS anc
ON anc.descendant = c.descendant AND anc.ancestor <> c.ancestor
WHERE anc.ancestor IS NULL

Finding leaves is the complementary query, finding nodes with no descendent besides themselves.

SELECT c.* FROM closure AS c
LEFT OUTER JOIN closure AS des
ON des.ancestor = c.ancestor AND des.descendant <> c.descendant
WHERE des.descendant IS NULL

Russ said...

I've looked into the closure tables and implemented a version using path_length, and so far I love it.

However, there is one thing I have yet to solve. I can grab all ancestors or descendants of a node easily with path length. However, in my tree some nodes are in multiple subtrees, and I haven't figured out how to enumerate a single path. That is, when I say give me all ancestors and sort by path_length, this does not give me a single path to that particular node (as it would if the node was only in a single path). Instead, it's a mix of two or more paths, but I can't match a level 2 with a level 3 correctly (which ancestor of the level 3 is the correct match for that particular path?).

Is there a way around this in SQL, or must it be left up to the application layer. The path chosen isn't all that important - just that it's a correct path (or alternatively, return all paths and then leave it up to the application layer to pick one).

Altec said...

Hi, interesting topic.
I would like ask which models (for hierarchical data) it is best for which application.
(I mean shich model is the best for e-shop, blog, ,....)
Thanks.

Altec said...
This comment has been removed by the author.
Bill Karwin said...

Hi Altec, thanks for your question, but I can't answer it at that level. There are many types of tasks you might need to do with hierarchical data in any given application. The choice of what design to use for the data has more to do with the specific queries you need to run, not so much with the purpose of the application.

Niko Brauc said...

I've used closure tables in some of my recent apps and have mixed feelings. I still like nested set very much but
I wanted to implement multiple parents. That's ideal for closure tables. And now everything works simple and fast.

But I wonder if self referencing nodes are really needed.
Isn't it simpler using table with null-able anestors, eg:

CREATE TABLE `category_closure` (
`category_id` int(10) unsigned NOT NULL,
`ancestor_id` int(10) unsigned DEFAULT NULL,
`distance` tinyint(3) unsigned DEFAULT NULL,
UNIQUE KEY `category_id` (`category_id`,`ancestor_id`),
KEY `ancestor_id` (`ancestor_id`)
);

In this case we needn't (shouldn't) inserting self referencing nodes, eg. (2, 2, 0),
only childrend and other descendants.

I know that now we should by hand prevent inserting multimple inserting
of the same root category, eg.
(1, null, null)
(1, null, null)

But on the other hand root categories manipulating can be much simpler.

What's your opinion? Is it worth using null-able ancestor?

Bill Karwin said...

@Nico: I have at least three reasons why the self-referencing nodes are useful.

1. Primary key columns must be non-nullable, and the two columns of ancestor, descendant in the closure table serve as the best primary key.

2. The self-referencing row makes it easier when you add a new child node. For example, if you have a path A-B-C-D and you want to add a new child of D, you just run:

SELECT ancestor, E FROM closure WHERE descendant = D

If you didn't have a self-referencing row (D,D), you'd have to add the last path (D,E) by hand anyway.

3. Most of the time when you want to query either a chain of ancestors of D, you'd want to include D in the result too. E.g. a breadcrumbs query. If you didn't have the self-referencing row, you'd get the ancestors of D, but not D itself, from this query:

SELECT ancestor FROM closure WHERE descendant = D

With the self-referencing row, you get ancestors and also D itself. You get a similar benefit when you want to query for a subtree and include the top node of that subtree.

SELECT descendant FROM closure WHERE ancestor = B

So yes, I do think the self-referencing row gives several benefits, even though it looks superfluous at first.

Niko Brauc said...

More than enough reasons for the symmetrical structure :-) Thanks for the quick and comprehensive response.

Justin Zaun said...

Hi Bill,

First, thanks for all your different posts and reticles on Closure Tables. They've helped a lot. I have everything just about working but I can't seem to select my tree in the correct order out of my table. I get all the correct nodes, just not in the correct order, even if ordering by pathlength.

You can see my full SQL here:

http://stackoverflow.com/questions/8252323/mysql-closure-table-hierarchical-database-how-to-pull-information-out-in-the-c

Thanks.

Tukki said...

@Bill, thanks for your great posts on closure tables.

And I made a closure table demo base on your posts, and choosing another solution to render sub-tree without care about the parents. I'm just a newbie, and I don't know if this absolutely right, but it worked.

Here this my note: http://goo.gl/UIXPA

anietog said...

Hi Bill,
first of all thanks a lot for this article, it is really useful at least for me.

I was wondering this desing needs to be changed slightly if we want to store more than one tree? i mean to have N root nodes? Or as you have answered @Sergio each node with no ancestors will be the root of a particular tree? Thanks a lot for your time

anietog said...

I have already found an answer to my question. http://stackoverflow.com/questions/4958570/how-to-find-distinct-root-nodes-of-newest-nodes-in-trees-hold-in-closure-table

Thanks a lot

Kevin said...

Bill, thanks for all your great work on the subject. It's been helpful to me multiple times in the past and currently with your writings on closure tables.

I have replicated the model you prescribed in this post, but the query you use to generate the below...

+-------------------------------------------------+
| path |
+-------------------------------------------------+
| rootree -> 1stbranch |
| rootree -> midbranch |
| rootree -> midbranch -> corebranch |
| rootree -> midbranch -> corebranch -> leafnodes |
| rootree -> lastbranch |
| rootree -> lastbranch -> lastleaf |
+-------------------------------------------------+

I can't figure out how to also get the ID of each element in there. Once I extract that data from my database, I will manipulate the results and build a tree structure for my application. The user will select something from that tree and I will have to know the primary key (not the friendly name) of the node they selected for when I go to commit the changes back to the database... So I'd like something like the following

+-------------------------------------------------+
| path |
+-------------------------------------------------+
| 1.rootree -> 2.1stbranch |
| 1.rootree -> 3.midbranch |
| 1.rootree -> 3.midbranch -> 4.corebranch |
| 1.rootree -> 3.midbranch -> 4.corebranch -> 5.leafnodes |
+-------------------------------------------------+

That would allow me to parse out the ID into my applications rendered tree object so I can associate that ID with the user's selection (from the tree).

It's not obvious to me how I'd do this with the group_concat function providing some of the heavy lifting.

Bill Karwin said...

Kevin, thanks for your comment.

You can put expressions inside the GROUP_CONCAT. For example:

SELECT GROUP_CONCAT( CONCAT(id, '.', n.label) ORDER BY n.node SEPARATOR ' -> ') AS path

Kevin said...

Bill, thanks that worked well... Oddly, that's the first thing I tried, before I replied to you, but I was accessing my database through phpMyAdmin running on Ubuntu/Apache. The results of embedding the CONCAT() within the GROUP_CONCAT came up as the below

path
[BLOB - 13B]
[BLOB - 17B]
[BLOB - 12B]
[BLOB - 43B]
[BLOB - 20B]
[BLOB - 42B]
[BLOB - 56B]
[BLOB - 67B]

When I ran the query from the command line, it came up just as expected.

+---------------------------------------------------------------------+
| path |
+---------------------------------------------------------------------+
| 1.root>2.Form |
| 1.root>3.Function |
| 1.root>6.Fit |
| 1.root>2.Form>4.Architectural Documentation |
| 1.root>2.Form>5.Code |
| 1.root>2.Form>5.Code>7.Software Components |
| 1.root>2.Form>5.Code>7.Software Components>8.Source Code |
| 1.root>2.Form>5.Code>7.Software Components>9.Compiled / Binary Code |
+---------------------------------------------------------------------+

Kevin said...

Ahhh.. in PhpMyAdmin, there's an expansion above the results that allows you to select "Show BLOB Contents"... that did it.

Bill Karwin said...

Paul d'Aoust asked me: "You allude to using a pathlength column to assist in sorting, in a blog post; can you elaborate on that?"

insert into closure (ancestor, descendant, pathlength) values
(1,1,0), (1,2,1), (1,3,1), (1,4,2), (1,5,3), (1,6,1), (1,7,2),
(2,2,0),
(3,3,0), (3,4,1), (3,5,2),
(4,4,0), (4,5,1),
(5,5,0),
(6,6,0), (6,7,1),
(7,7,0);

select group_concat(n.label order by a.pathlength desc separator ' -> ') as path
from closure d
join closure a on (a.descendant = d.descendant)
join nodes n on (n.node = a.ancestor)
where d.ancestor = 1 and d.descendant != d.ancestor
group by d.descendant;

This allows you to order the nodes correctly even if the node id is not usable for sorting.

JDelage said...

Bill,

This is super useful. I'm wondering if you'd have a suggestion for adding a sort order to a closure table. That is to say, what if I want to list all the leaves of a given tree in order (e.g., starting with the first subtree on the left and so on).

How would you do that? The best I can come up with is an additional table for all the relationships where depth=1, and include a sort_order column...

Unknown said...
This comment has been removed by the author.
Gerryjun said...

Sir, can you help me, how can i get the top most ancestor of a given child, the child can be of any depth and i need only the ancestor with no parent or is self referencing.

Thanks

Danilo said...

To create whole tree, even if you have more than one root node (as in my case), you can do this:

SELECT * FROM `nodes` a
join closure b
on a.id=b.descendant
where ancestor in (
SELECT c.ancestor FROM closure AS c
LEFT OUTER JOIN closure AS anc
ON anc.descendant = c.descendant AND anc.ancestor <> c.ancestor
WHERE anc.ancestor IS NULL
)

Thanks for this BrILLiant technique!

Antonin Januska said...

You've got a pretty amazing answer there. I wonder if you could help me with my dilemma: http://stackoverflow.com/questions/11790108/what-table-structure-to-use-for-nested-data

It's basically a hierarchical structure like the one you describe but each branch has numerous sub-branches that can be moved around (id stays the same position is different) and I can't wrap my head around calling up (for example):

the first top-level branch, the second mid-level branch within that top-level branch, and the fourth bottom level branch within that mid-level branch. Does that make sense?

I can imagine adding a position column to the closure table to simplify this but i still can't wrap my head around getting this done.

Bill Karwin said...

Hi Antonin,

Order of siblings within a tree isn't handled well in any design for hierarchical data that I've seen. The Nested Sets design is the closest one that allows you to order siblings implicitly, since the left/right values have a progressive sequence. But figuring out a given ordinal child from that is not supported.

If you add a position column in a closure table, you'd have to store the same position in many rows, because a given node could have many paths leading to it. Storing the values redundantly like that would create a risk for data anomalies.

I'm still struggling with this problem, but I think in closure table it would require an extra table to record the sibling order for each entry in the hierarchy.

P.S.: I don't answer questions on StackOverflow anymore.

NexTech IT Solutions said...

@Bill,

It was really a valuable resource for me to get to know about Closure Tables at the right time. I am not novice, but yes, I am new to the field, and I am doing an MLM business website.

Closure tables are good for it, I suppose. But do you suggest it's use for a very deep structure? A tree that may go down up to say 500 steps? At next step, there will be 500 inserts to the closure table per node insertion. According to you, is Closure table a good way in that situation?

Thanks.

Bill Karwin said...

@NexTech IT Solutions,

No algorithm or pattern is best for all cases. To find out if any given solution is the best for your application, you need to test it yourself.

The answer, I would guess, depends on how frequently you insert versus how frequently you query the tree. So it depends on your application's use of the tree data. Even if it's expensive or inconvenient to add entries to the tree, but you only need to do that once per day, maybe it's worth it.

I'm not going to be able to answer that for you.

nayyer said...

Hi Bill, do you know of a generic sql query for finding all the leaves which would work on all databases?

Bill Karwin said...

@nayyer,

I can't speak for all databases, because there are some databases that fail to support standard SQL.

But the following solutions should both find leaf nodes, and they are both standard SQL:

SELECT leaf.ancestor
FROM closure AS leaf
LEFT OUTER JOIN closure AS subleaf
ON leaf.ancestor = subleaf.ancestor
AND subleaf.ancestor <> subleaf.descendant
WHERE subleaf.ancestor IS NULL;

SELECT leaf.ancestor
FROM closure AS leaf
GROUP BY leaf.ancestor
HAVING COUNT(*) = 1;

NZ said...

Hi Bill, thanks for the answer above. Do you know of a better way using standard sql to display the closure tree. Right now I get all the leaves and then build using all the paths from leaves to root. Any other better way?

Bill Karwin said...

NZ,

I haven't found a way to get this kind of output except by using GROUP_CONCAT(). That function is MySQL-specific, but there are ways to simulate it with other brands of database.

There could also be a way to use standard SQL recursive CTE syntax to get this output, but I haven't experimented with that, since MySQL doesn't support it.

You might find some interesting tips here: http://explainextended.com/2010/06/21/group_concat-in-sql-server/

ken said...

Hi Bill,

Did not know where to post a question.

I am using a closure tree to describe the relationship of questions within sections of a dynamically created question air. I am able to build the questions just fine. Now I want to traverse the nodes. When I get the end of a branch, I want to move to the next branch. Since sibling nodes are not necessarily sequential how do I do this.

for example
Q1 leads to Q1.1 which leads to Q1.1.1, now it is time to move to Q2, how do I do this?

I am using php and mysql

Bill Karwin said...

Hi ken, thanks for your question.

This question comes up frequently with discussions of rendering trees. All tree designs have the same challenge, it's not just a Closure Table thing.

The problem is to define a preferred order when rendering a tree, and make a query that can fetch the tree entries in that order.

The solution is to generate a string that contains the "breadcrumbs" of ancestors of a given node, and then sort by that string.

If you want the sort order to be independent of the node id's or their names, then you have to store the sort order in another table. Basically, these would represent ordinal position of siblings in any subtree. Then generate a breadcrumbs string based on those values.

I would like to write another blog to demonstrate this technique.

ken said...

I for one would love to read it :)

Webnet said...

I would also enjoy reading it :)

Nate C said...

I would also be quite interested in reading about the sorting method