Tuesday, August 23, 2011

Migrating a Django Postgres DB from Concrete Inheritance to Abstract Inheritance

Django comes with several ways of implementing model inheritance, and specifying which one you would like to use takes only a line or two of code. Setting up your database for the very first time is extremely easy, but migrating between types after you have existing data is infinitely harder. I have a live site whose needs have changed since I built the database, and the concrete inheritance I set up is no longer needed. In my case, the site should be using abstract inheritance.

There are a number of gotchas in this process, and I will outline exactly how I navigated my way through it. First, I should note that my project uses django-south and these migrations rely heavily on it. I asked for advice on Stack Overflow and I'll use a similar example here. This article will walk you through migrating from the "before" schema to the "after" schema while keeping all your data intact.

Before:

app1/models.py:

    class Model1(base_app.models.BaseModel):
        field1 = models.CharField(max_length=1000)
        field2 = models.CharField(max_length=1000)
    
app2/models.py:

    class Model2(base_app.models.BaseModel):
        field1 = models.CharField(max_length=1000)
        field2 = models.CharField(max_length=1000)
    
base_app/models.py:

    class BaseModel(models.Model):
        user1 = models.ForeignKey(User, related_name="user1")
        user2 = models.ForeignKey(User, related_name="user2")
        another_field = models.CharField(max_length=1000)

        objects = CustomManager()

After:

app1/models.py:

    class Model1(base_app.models.BaseModel):
        field1 = models.CharField(max_length=1000)
        field2 = models.CharField(max_length=1000)

        objects = CustomManager()
    
app2/models.py:

    class Model2(base_app.models.BaseModel):
        field1 = models.CharField(max_length=1000)
        field2 = models.CharField(max_length=1000)

        objects = CustomManager()

    
base_app/models.py:

    class BaseModel(models.Model):
        user1 = models.ForeignKey(User, related_name="%(class)s_user1")
        user2 = models.ForeignKey(User, related_name="%(class)s_user2")
        another_field = models.CharField(max_length=1000)
    
        class Meta:
            abstract = True

These are the models I will use to walk you through this migration process. Essentially what we are doing here is removing the one-to-one relationships between Model1/BaseModel and Model2/BaseModel and instead placing each of the fields of BaseModel into Model1 and Model2 so they actually reside in the tables for Model1 and Model2. The tricky part is migrating all your data while we work through this.

Make a Copy of BaseModel

When you add the "abstract = True" property to BaseModel (don't do this yet), South will delete the entire BaseModel table. If you don't make a copy of its data first, you will have no way of copying the fields from the old BaseModel rows into the respective new models. Making a copy isn't bad:
  1. Add BaseModelCopy within base_app/models.py that has an identical schema to BaseModel.
  2. Run a schema migration (python manage.py schemamigration base_app --auto; python manage.py migrate base_app).
  3. Run a data migration to copy the existing objects in BaseModel to BaseModelCopy (python manage.py datamigration base_app copy_contents; (then edit the contents of the data migration); python manage.py migrate base_app).
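As a sketch, the copy model and the edited data migration might look something like this (the field names follow the example models above; the migration file name, related_names, and the frozen `orm` keys are placeholders that will differ in your project):

    # base_app/models.py -- mirror every field of BaseModel exactly
    class BaseModelCopy(models.Model):
        user1 = models.ForeignKey(User, related_name="copy_user1")
        user2 = models.ForeignKey(User, related_name="copy_user2")
        another_field = models.CharField(max_length=1000)

    # base_app/migrations/00XX_copy_contents.py -- the edited forwards() body
    class Migration(DataMigration):
        def forwards(self, orm):
            for base in orm.BaseModel.objects.all():
                orm.BaseModelCopy.objects.create(
                    id=base.id,  # keep the same id so rows can be matched later
                    user1_id=base.user1_id,
                    user2_id=base.user2_id,
                    another_field=base.another_field,
                )

Note that BaseModelCopy needs different related_names from BaseModel, or the two sets of ForeignKeys on User will conflict.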
Add a Field on the Child Classes to Store the ID

The way Django deals with concrete inheritance is a little interesting when you look closely at it. The child classes Model1 and Model2 do not have 'id' columns in the database; their primary key is actually a column named "basemodel_ptr_id", so the unique id is stored in the corresponding BaseModel row. When you eventually add the "abstract = True" property, South will naturally delete this "basemodel_ptr_id" column and you will lose all references to the original BaseModel object. This is clearly very bad, and we would have no way of recovering those relationships if we did that. Thus, we need to add a field to Model1 and Model2 that stores the id of the corresponding BaseModel object so we can reference it in the data migrations after we move to abstract base classes. Doing this is also straightforward:
  1. Add a field "tmp_id = models.IntegerField()" on both Model1 and Model2.
  2. Run a schema migration on app1 and app2.
  3. Run a data migration on app1 and app2 copying the id of the corresponding base model into the "tmp_id" field.
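The forwards() body of that data migration can be as simple as this sketch (shown for app1; app2 is identical with Model2, and the migration file name is a placeholder):

    # app1/migrations/00XX_store_tmp_id.py -- sketch
    class Migration(DataMigration):
        def forwards(self, orm):
            for obj in orm.Model1.objects.all():
                # obj.pk is the basemodel_ptr_id column, i.e. the id of
                # the parent BaseModel row -- exactly what we need to keep.
                obj.tmp_id = obj.pk
                obj.save()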
Prepare the BaseModel to be Abstract

There are two things you will need to change in your BaseModel class before you can add the "abstract = True" property. First, you will need to fix the related_name for every ForeignKey field in BaseModel. If you don't do this, there will be multiple tables with ForeignKeys on the User object with the same related_name and this conflict will cause an error. Adding the name of the class to the related name as shown in the "after" state of the database will solve this issue.

The second issue is that you will have to remove any managers on BaseModel. These managers will throw an error and instead need to be bound to the child classes.

Mark the BaseModel as Abstract and Create a New Primary Key

There is yet another gotcha with this step. When we add the property "abstract = True" to BaseModel, after django deletes this model, it will automatically add an "id" field to Model1 and Model2 and it will make this field a primary_key. This is fine if we have no data, but since we do have data, we have no way of specifying what to set these initial "id"s to, and since it is a primary_key and there can't be any overlaps, we simply can't do this. Conveniently, we have this "tmp_id" field full of unique values, and we can use this as our primary_key. We can specify this when we make our migration:
  1. Add the property "abstract = True" on BaseModel.
  2. Add a field "id = models.IntegerField(primary_key=False)" on BaseModel
  3. Change the "tmp_id = models.IntegerField()" on Model1 and Model2 to be "tmp_id = models.IntegerField(primary_key=True)".
  4. Run schema migrations on app1, app2, and base_app.
  5. Run a datamigration that copies the relevant data from BaseModelCopy to every corresponding object in Model1 and Model2. To do this, just make use of the "tmp_id" field we created and the new BaseModelCopy class. Make sure this datamigration also copies the "id" field.
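A sketch of the step 5 data migration for app1 (app2 is identical with Model2); you may need to freeze base_app into the migration (the --freeze base_app flag) so that BaseModelCopy is reachable from app1's frozen orm:

    # app1/migrations/00XX_restore_base_fields.py -- sketch
    class Migration(DataMigration):
        def forwards(self, orm):
            for obj in orm.Model1.objects.all():
                # tmp_id (now the primary key) holds the id of the original
                # BaseModel row, which we preserved in BaseModelCopy.
                copy = orm['base_app.BaseModelCopy'].objects.get(id=obj.tmp_id)
                obj.id = copy.id  # fill the plain "id" IntegerField too
                obj.user1_id = copy.user1_id
                obj.user2_id = copy.user2_id
                obj.another_field = copy.another_field
                obj.save()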
Remove all our Temporary Models and Fields

Our database should be near-perfect now and you should be able to run the development server to view your site. Inserts will not work yet because we have not specified a way to auto-increment the primary key of objects when we insert them, but we'll get to that later. Right now, if we see all our data has migrated successfully, we are ready to remove all this temporary stuff we created.
  1. Remove the "tmp_id" columns from Model1 and Model2. 
  2. Change the "id" field on BaseModel to "id = models.IntegerField(primary_key=True)".
  3. Remove the model BaseModelCopy.
  4. Run schema migrations on app1, app2, and base_app.
Build the AutoField Functionality on Your Primary Key

The final gotcha of this process is that Postgres handles the Django field models.AutoField() a bit strangely. You can inspect this using pgAdmin or whatever GUI you have and see how it adds a "serial" property to the field. Because of this peculiarity, South will not allow you to migrate from an IntegerField() to an AutoField() out of the box. There is a workaround for this migration, but if that seems too messy to you, you can simply handle the auto-increment manually. Handling it manually only takes a couple of lines, but make sure you implement some sort of database locking to ensure you don't run into race conditions.
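To make the idea concrete, here is a self-contained sketch of manual id assignment, shown with plain sqlite3 (not Django or Postgres) so it can run standalone; the table name mirrors the example above, and in Django you would do the equivalent inside the model's save() while holding a database lock:

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode so we can manage
# transactions explicitly with BEGIN IMMEDIATE / COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE app1_model1 (id INTEGER PRIMARY KEY, field1 TEXT)")

def insert_with_manual_id(conn, field1):
    # BEGIN IMMEDIATE takes the write lock before we read max(id), so two
    # concurrent inserts cannot both compute the same next id -- the race
    # condition the locking is meant to prevent.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (max_id,) = conn.execute(
            "SELECT COALESCE(MAX(id), 0) FROM app1_model1").fetchone()
        new_id = max_id + 1
        conn.execute("INSERT INTO app1_model1 (id, field1) VALUES (?, ?)",
                     (new_id, field1))
        conn.execute("COMMIT")
        return new_id
    except Exception:
        conn.execute("ROLLBACK")
        raise

print(insert_with_manual_id(conn, "first"))   # 1
print(insert_with_manual_id(conn, "second"))  # 2
```

The same max+1-inside-a-lock pattern is what the couple of lines in your save() method would implement.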

If you are not using Postgres, you may be able to completely skip this step and instead just change the id field on BaseModel to "id = models.AutoField(primary_key=True)" and run a migration, but I can't confirm this.

Push Changes to your Production Server

I'm assuming you made all of these migrations on a local server and not a production server, so the final step is to deploy the changes. We now have our new schema with all the data filled in, and it should be working nicely with our code base on our local server. However, our set of migrations poses a problem for deployment: each set of migrations in the steps above relies on the full set of migrations from the previous step having been completed, including migrations in other apps. This means the migrations for one app depend on the migrations for another app, so running through a full set of migrations for a single app would not work.

To avoid this issue, we could have pushed each individual set of migrations along with the code to the production server right as we generated them. Alternatively, we can simply clear out the migration history for each of our apps, dump the local datastore, and then rebuild the production database from that dump. This is the method I used because I happened to be doing this at 4am when no users were generating new data on the site.

So, these are the steps for how to do this:
  1. On your local server, remove all of the migrations from the migrations/ folder for app1, app2, and base_app. The history is no longer important to us because we did not specify a way to do backwards migrations anyway.
  2. On your local server, create initial migrations using south for app1, app2, and base_app, clearing any ghost migrations (python manage.py schemamigration app1 --initial --delete-ghost-migrations).
  3. On your local server, run the migrations. South will tell you that nothing needs to be changed. This is good.
  4. Save all your changes to git or whatever you use.
  5. Dump the local database (pg_dump --no-owner --no-acl -U username postgres_db_dev > postgres_db_dev_dump.psql).
  6. Clone your working repository onto your production machine.
  7. Copy your sql dump to your production machine.
  8. Delete the production db (dropdb -U username postgres_db).
  9. Re-create the db (createdb -U username postgres_db).
  10. Load the new data (psql -U username postgres_db < postgres_db_dev_dump.psql).
Now your production server should look identical to the local server where you had a working version. Now that they are synced, you can go back to using whatever regular deployment scripts you use and everything should be just fine.

Conclusion

The methods used in this tutorial are useful if you have a live server with data you don't want to lose, but also have a span of time where you can be confident that users will not be adding new data. If users had entered new data into the production site during this process, their data would have been lost during the final deployment to the production server. My site has negligible traffic from 1am-7am when I did this, and the database is small (<50,000 objects of type BaseModel), so I could do this quickly. To pull this off on a medium-to-high traffic site that relies on user-generated data, you should really schedule some "read-only" downtime and alert your users to ensure they don't enter data that gets overwritten.