Drupal is my database -- importing 80,000 convicts with Table Wizard and Migrate

One of the goals of the Founders and Survivors project (the one that concerns me the most) is to compile and publish data about the Van Diemen’s Land convicts from a variety of sources, and to create links or cross-references between records from different sources that relate to the same individual (or, potentially, the same family). Our sources include convict indents carried on the convict ships, conduct records, police gazettes, and registers of births, deaths and marriages. In addition, we have been collecting biographies and family histories submitted by descendants of convicts, which are valuable sources of information about the lives of convicts after they left the penal system.

Biographies have been submitted by members of the public through our Drupal website as a custom content type built with the CCK module. The other data exists in paper records, which are being transcribed and in some cases photographed. Transcriptions have been entered in a variety of desktop-oriented data collection systems, including FileMaker Pro, Microsoft Access and Microsoft Excel. My goal is to make this data available to other researchers and members of the public in open formats and through the web. As I have got to know Drupal better, I have come to see that it is a powerful way of representing structured data as well as free-form documents on the web, and it can also handle the privacy and access control we need to prevent unethical use of the data.

The Archives of Tasmania have an online Index to Tasmanian Convicts, and I was given a copy of this data in Microsoft Access. This index will form the basis of the entire Founders and Survivors database, so that a record of a convict in the index will contain links to all of the other data we have on that individual. After exporting the index data to a comma-separated text file (so that I wouldn’t have to touch Access again) I hacked together a Perl script to extract data from this file and insert it directly into the node and content tables in Drupal’s database. The script looked as if it did what I wanted, but I wasn’t confident in it and didn’t run it on our production database before I left for LCA2010 and DrupalSouth, where Angie Byron gave me a better solution. (Why yes, I do lose years off my life every time I attempt a bulk migration!)

Angie’s article is a good tutorial on the generic data migration process using the Table Wizard and Migrate modules. Here is the full process I needed to set up the migration for the Archives index (which is still underway – 80,000 records don’t appear instantaneously).

First I created a content type in Drupal with the same data structure as the index. Every node must have a title, and I wanted the index records to have a composite title containing the convict’s index number and name and the name of the ship on which he or she arrived in Van Diemen’s Land. My first solution was to use the Automatic Nodetitles module, which did just this for manually entered records. However, after running the migration process on a sample, I found that the whole batch of migrated records would have an automatic title made up of components of the first record in the batch. (I should probably report this as an issue and even try to fix it, but I needed a more immediate solution.) Instead I used the Rules module (which, like Automatic Nodetitles, depends on the Token module) to update the title of each node after it is created.
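The title pattern itself is just a string of tokens. It looks something like the following, though the field machine names here are invented for this write-up rather than copied from our site:

    [field_con_id-raw] [field_given_names-raw] [field_surname-raw] ([field_ship-raw])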

Table Wizard makes any database table available to Drupal Views. It looks for tables in the default Drupal database (or another database), so I loaded the contents of the CSV file into its own table in the database. The Table Wizard administration page at admin/content/tw lists the tables managed by Table Wizard and the other tables in the database that can be added to it. Each exposed table has an analysis link which provides information about each field.
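Getting the CSV into its own table takes a couple of statements in MySQL. A rough sketch, with column names invented for the example (the real index has more fields):

    -- Create a table matching the CSV's columns. A primary key helps
    -- Migrate keep track of which source rows it has imported.
    CREATE TABLE convict_index (
      con_id      VARCHAR(16) NOT NULL,  -- convict index number
      given_names VARCHAR(64),
      surname     VARCHAR(64),
      ship        VARCHAR(64),           -- ship of arrival
      PRIMARY KEY (con_id)
    );

    -- Bulk-load the CSV exported from Access, skipping its header row.
    LOAD DATA LOCAL INFILE 'convict_index.csv'
    INTO TABLE convict_index
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;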

If you only want to view data in an external table, Table Wizard will suffice. The Migrate module maps the structure of the external table onto data structures within Drupal and imports the data so that it can be searched, viewed and modified like any other Drupal node. The Migrate Extras module is needed to migrate into fields created with CCK. Under ‘Add a content set’ on the Migrate dashboard at admin/content/migrate, I selected the index record type as the destination and the Table Wizard table as the source. Clicking on this content set allows me to map source fields to destination fields and change other migration settings.

There are two ways to execute the migration: from the Migrate dashboard, or using drush, the Drupal Shell. From the dashboard I can import samples of data or clear all imported data; this is useful for testing the import settings.
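The drush side is worth sketching, though the exact subcommands vary between Migrate versions, so check drush help on your own install; the content set name below is invented:

    # See what migrate commands your Migrate version provides
    drush help | grep -i migrate

    # Run the import for a content set (name is an assumption)
    drush migrate import convict_index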

By default, new nodes imported by Migrate are not published. After testing the migration process on small samples, I cleared the imported nodes, went back to edit the content set and set the default value for Node: Published to 1 so that imported nodes are published.
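A quick way to confirm that a test batch came in published is to count nodes of the new type directly in the database (the content type’s machine name is invented here):

    -- Count published nodes of the imported content type
    SELECT COUNT(*) FROM node
    WHERE type = 'convict_index_record' AND status = 1;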

The index data is now being migrated at a rate of about 12 records per minute. It runs out of memory after (usually) 136 records and attempts to start a new batch, but terminates instead. (These issues require more investigation on my part; they seem to be related to drush permissions.) I am running drush migrate from a cron job so that the migration process can continue unattended. At 12 records per minute, 80,000 records need about 6,700 minutes, or roughly four and a half days of continuous running, so the index might all be online within a week.
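The cron job is nothing fancy: a crontab line that re-runs drush on a schedule, so each out-of-memory crash costs at most a few minutes. Roughly this, with the paths, schedule and drush arguments all assumptions for the example:

    # Restart the migration every 15 minutes so it survives the
    # out-of-memory crashes. Paths and arguments are illustrative only.
    */15 * * * * cd /var/www/drupal && drush migrate import convict_index >> /var/log/fas-migrate.log 2>&1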