Sometimes The API Just Won't Do It

So I've been having lots of fun (read: horrible pain) this week thanks to some quirks of Drupal that only really present themselves when you are looping through, loading, saving and manipulating nodes quickly. The scenario for this sort of thing is normally (yup, you've probably guessed it) importing.

I had an ugly problem. I had to import 64,000 XML documents I received from a client into Drupal as nodes. Doesn't sound too bad? If it were one XML document per node, with everything I needed contained within each document, it wouldn't be. But actually there are more like four documents per node.

Why? When the client's IT team exported the data from their system, they produced one copy of an article for each category it was in.

As a result I had to parse the first document I came to, save *their* unique document ID somewhere - an ID found in all XML documents relating to that node - then continue on to the next document. I looped through the documents until I found another one with that ID, but this time I was only interested in the taxonomy data. Now this is where the fun starts.
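Stripped of Drupal entirely, the matching logic looks roughly like the sketch below. The array keys (client_id, title, category) and the sample data are made up for illustration; the real documents were XML, and a plain array stands in for the saved nodes.

<?php
// Hypothetical parsed documents: each carries the client's ID and one category.
$documents = array(
  array('client_id' => 'A1', 'title' => 'First article', 'category' => 'News'),
  array('client_id' => 'B2', 'title' => 'Second article', 'category' => 'Sport'),
  array('client_id' => 'A1', 'title' => 'First article', 'category' => 'Politics'),
);

// Client ID => article data, standing in for the saved nodes.
$articles = array();
foreach ($documents as $doc) {
  $id = $doc['client_id'];
  if (!isset($articles[$id])) {
    // First sighting of this ID: keep the full article.
    $articles[$id] = array('title' => $doc['title'], 'categories' => array());
  }
  // Every sighting, first or later, contributes its category.
  $articles[$id]['categories'][] = $doc['category'];
}
?>

In the real import, each of those "add a category" steps was a node load, a change to the taxonomy and a node save.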

The first problem was with node_load(). The node object is cached inside the function using a static variable. I didn't realise this, so I spent a good deal of time wondering where the hell some (not all) CCK data had gone - specifically file and nodereference fields. Fortunately, someone in IRC pointed out a little-known feature of the node_load() function: it has a $reset parameter that, when set to TRUE, clears the function's static node cache so the node is loaded fresh from the database.

I changed my function call to look like this and, finally, my nodes started coming out right:

<?php
  $my_node = node_load($nid, NULL, TRUE);

  krumo($my_node);
?>
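If you haven't met this pattern before, here is a minimal sketch of it in plain PHP, with no Drupal involved. It is not node_load()'s actual code: demo_load() and the $GLOBALS['demo_db'] array are hypothetical stand-ins for the loader and the database.

<?php
// Hypothetical stand-in for the database: ID => stored value.
$GLOBALS['demo_db'] = array(1 => 'first version');

// Results are kept in a static variable for the rest of the request,
// unless the caller passes $reset = TRUE.
function demo_load($id, $reset = FALSE) {
  static $cache = array();
  if ($reset) {
    $cache = array();
  }
  if (!isset($cache[$id])) {
    $cache[$id] = $GLOBALS['demo_db'][$id];
  }
  return $cache[$id];
}

$before = demo_load(1);                    // primes the static cache
$GLOBALS['demo_db'][1] = 'second version'; // the stored data changes
$stale = demo_load(1);                     // still served from the cache
$fresh = demo_load(1, TRUE);               // cache cleared, value re-read
?>

Here $stale still holds the old value, and only the call with $reset set to TRUE sees the change. That is harmless when a page loads a node once, and a real trap when an import loads and saves the same node over and over in a single request.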

So that was one static variable caching issue dealt with. I thought, at this point, I was home and dry. How wrong can you be!

Remember that from this point on all I needed to do was load subsequent matching documents, pull out their category data and attach it to the corresponding node as a taxonomy term. So I've loaded my reset node, I've loaded the category from the XML document, I've looked up the corresponding taxonomy term and I've added it to $node->taxonomy and done a node_save().

Yet when I load the same node again to add the next term, the taxonomy data I added on the previous loop was gone! What the deuce??

Turns out the taxonomy_node_get_terms() function, used to populate the $node->taxonomy property of the node object, *also* caches the terms from the previous run in a static variable. However, it does not respect the reset from node_load() and worse, it has no reset parameter of its own.

(You don't want to know how many hours it took me to work this out.)

So what was happening? My terms were being saved successfully, but when I re-loaded the node to apply more terms, an old, cached copy of its terms came back with it. That copy didn't include the term I had saved on the previous pass, so the next node_save() overwrote it, giving the impression it was never saved.
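The overwrite is easy to reproduce outside Drupal. The sketch below is a simplified model, not taxonomy.module's code: demo_get_terms() and demo_add_term() are hypothetical names, and the $GLOBALS['demo_term_node'] array stands in for the term_node table.

<?php
// Hypothetical stand-in for the term_node table: node ID => term IDs.
$GLOBALS['demo_term_node'] = array(1 => array());

// Static cache with no reset parameter, like the function described above.
function demo_get_terms($nid) {
  static $terms = array();
  if (!isset($terms[$nid])) {
    $terms[$nid] = $GLOBALS['demo_term_node'][$nid];
  }
  return $terms[$nid];
}

// Load the (possibly cached) terms, add one, and write the whole set back.
function demo_add_term($nid, $tid) {
  $terms = demo_get_terms($nid);
  $terms[] = $tid;
  $GLOBALS['demo_term_node'][$nid] = $terms;
}

demo_add_term(1, 10); // the cache is primed with the empty set here
demo_add_term(1, 20); // reads the stale empty set, so term 10 is lost
$stored = $GLOBALS['demo_term_node'][1];
?>

After both calls $stored contains only term 20. The first save did work; the second one wrote back a term list built from stale data.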

How to get around this? Ditch the API! =(

Here's how I got my taxonomy terms back, ignoring the node object since it contained an incorrect set and I couldn't change that:

<?php
  $node = node_load($nid, NULL, TRUE);

  // Existing taxonomy terms must be loaded from the db;
  // node_load() can't get around the caching problems in taxonomy.module.
  $existing_terms = array();
  $result = db_query("SELECT * FROM {term_node}
        WHERE nid = %d
        AND vid = %d",
        $node->nid, $node->vid);
  while ($term = db_fetch_object($result)) {
    $existing_terms[] = $term;
  }

  // Save back the terms we just rescued. Starting from an empty array
  // means a node with no terms yet gets array() rather than NULL.
  $values = array('taxonomy' => array());
  foreach ($existing_terms as $term) {
    $values['taxonomy'][] = $term->tid;
  }

  $node->taxonomy = $values['taxonomy'];
?>

Apparently this whole thing is going to be a whole lot smarter in Drupal 7, but for now if you have static variable caching problems, specifically with taxonomy, there's not much you can do except forget about the API. It won't help you. In fact, quite the opposite.

So, lessons learned:

1. Odd behaviour where data appears to be missing/overwritten in Drupal 5 and 6 could well be a static variable issue - search for statics that might be in the way.

2. If you need to save a node, load it again and then manipulate the taxonomy, you *must* get the terms manually from the database.

3. Thankfully, node caching can be prevented by using the node_load() function's $reset parameter.

*phew*

What a week.

Great timing...

Funny you should post this article during a week we were having similar issues. Thank you for the heads up and your solution. Not exactly our problem but this will help.

A unified static caching API went into D7 in May 2009

http://drupal.org/node/156281

So you won't have to deal with this pain in the future. Many of us have dealt with this at one time or another.

Yeah... Drupal sort of

Yeah... Drupal sort of assumes you really only want to load a node once per request. This sort of work-around is one way to take care of it... but I wonder if the batch api would have been a way to work with it? That way each operation would be a new request to the server, so the static variables would have been cleared?

In any case, good job working around Drupal's little quirks!

Possibly

I actually intended to use the Batch API when I started looking at the job, but someone put me off saying it was very slow and I had 64,000 documents to process. However you may well be right - the Batch API might have made this go away.

separate HTTP request per document

I was that someone that said the batch API could add a lot of overhead...

I hadn't run into the caching problem as the bulk data loads I worked on used neither CCK nor taxonomy.

I'd say Greg's workaround is going to be far faster and is in many ways a cleaner approach than working around the cache by making new requests.

A fast running import is much easier to debug and test - as well as having a lower impact on the live site.

Sean

I worked around the caching

I worked around the caching problem using a crafty drupal_http_request() call to a MENU_CALLBACK which processed just one node at a time, hence nothing got cached as the page load finished.

My calling page would then loop through and keep requesting a hidden page.

This took a long time, so in the end I implemented a nifty progress bar and used a page refresh to do a batch. This avoided the PHP timeout too.

Can't remember the exact details of it all now, was last summer...

Sneaky!

Like your style. Probably would've taken too long in this case, but nice workaround. =)

D6 has some definite inconsistencies

There are definitely lots of inconsistencies in D6, especially around taxonomies.. lack of hooks, lack of cache clearing (like you found), etc.. I'm looking forward to seeing how D7 improves it.

It sounds to me like all

It sounds to me like all developers have a lesson to learn from your painful week.

If you are creating a caching mechanism, always provide a way to reset the cache at any time, in operations that return cached data.
