Database Replication & Merging

Oooh that's a bit tricky!

Ok, so let's imagine that you have designed a Database and obeyed all my unreasonable rules:

Every table has a non-compound, surrogate, Integer Primary Key

Then you need to send a copy of this Database to another site. Not a problem, especially with SQLite as you can just copy the file - easy! But, what happens when you want to merge the data from these databases into one Super Database, or you need to update the Database in Site A with the data from Site B?

Now we have a problem! Each Database will generate its own Primary Keys. Database A has no way of knowing which Keys Database B has issued, so sooner or later both Databases will issue the same Primary Key values. When we come to merge this data, the Database receiving the data will explode, because Primary Keys MUST be unique!

Merging Explodes Database!

Site A has allocated a Primary Key value of 1 to a person called John Smith, and Site B has also allocated Primary Key 1 to Harry Bloggs!

So HOW do we deal with this nastiness? There is no 'correct' way of doing this. I will attempt to outline a few ways you could deal with Replication / Merging, but you will have to decide which is the best way for you. I once spent six months on a project involving a colossal database - just on the thorny RepliMerge (that's a new word I just invented) issue - and it was a truly dreadful project to work on!

How About GUIDs?

First things first - what is a GUID? A GUID is defined as a Globally Unique IDentifier - you can read all the techie stuff at Wikipedia.

Basically a GUID is a 128 bit integer, usually written as a 32 character Hexadecimal string. It is formed from a bunch of stuff, which can include the device's MAC address and some time based algorithms. It almost guarantees that a unique identifier will be created. There is a very slim chance that time changes like daylight saving may cause issues, but you should be pretty safe with a GUID. There is a very good article on Aspfaq.com detailing the pros and cons of using GUIDs as Primary Keys.
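If you fancy experimenting, SQLite's built-in randomblob() function can knock up a GUID-ish value for you. Here is a minimal sketch (note: this gives you a random 128 bit value rather than a formal RFC 4122 GUID, but the 'practically impossible to collide' idea is the same):

-- 'people' keyed on a random 128 bit value rendered as 32 hex characters.
-- Not a formal GUID, but collisions are vanishingly unlikely.
create table people (
id text primary key default ( lower( hex( randomblob(16) ) ) ),
name text );

insert into people ( name ) values ( 'John Smith' );
select id , name from people;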

So why don't we all use GUIDs as Primary Keys?

For some applications it can be goo-id to use GUIDs, but they are a bit on the complex side!

Compound Primary Keys?

OK, I know I said that compound primary keys are Evil, but rules are meant to be ignored if they stop you from building your super database application!

To avoid using GUIDs we could create our tables using compound or composite Primary Keys - a bit like this:

create table people (
site_id integer ,
id integer ,
name text ,
primary key ( site_id , id ) );

OK, so that will allow us to store a SITE_ID value in our table, and that combined with the ID column provides us with a unique Primary Key - as the Primary Key requires both elements.

The SITE_ID value would probably be added as a constant value, so would not be a problem.
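For example (the site numbers here are my own invention - each site would burn in its own):

-- Site A stamps every row with site_id 1, Site B with site_id 2
insert into people ( site_id , id , name ) values ( 1 , 1 , 'John Smith' );
insert into people ( site_id , id , name ) values ( 2 , 1 , 'Harry Bloggs' );

Both rows have an ID of 1, but the compound Primary Key ( site_id , id ) keeps them unique.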

But.... we can not now use the automatic key generation that an INTEGER PRIMARY KEY column would normally provide to populate our ID column, so we must generate our own ID values. This COULD be done via an extra table that stores the highest 'Key' value used by each table in the database. We would then need to write a trigger that gives each new row the next value for the table we are inserting into, and then updates the extra table with the value we have just used.
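The trigger below assumes a key-tracking table along these lines (the myKeys name and its columns are just my invention - call yours whatever you like):

-- one row per table whose keys we are managing
create table myKeys (
table_name text primary key ,
last_key_value integer );

-- seed it with the highest key used so far ( zero for an empty table )
insert into myKeys ( table_name , last_key_value ) values ( 'People' , 0 );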

create trigger People_Trigger
after insert on People
begin
-- give the freshly inserted row the next key from myKeys
update people set id = ( select last_key_value + 1 from myKeys where table_name = 'People' )
where rowid = new.rowid;
-- record that we have now used that key
update myKeys set last_key_value = last_key_value + 1 where table_name = 'People' ;
end;
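A quick sanity check (assuming the myKeys row for 'People' starts at 0):

-- no id supplied - the trigger fills it in after the insert
insert into people ( site_id , name ) values ( 1 , 'John Smith' );
select site_id , id , name from people; -- id comes back as 1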

Imagine a perfect World where SQLite allowed you to create a compound Primary Key that had an autoincremented value as part of the key - hmmmmm, dream on!

I'm not a big fan of compound primary keys, but if they work for you ... then they work for you.

Again, we would need to write a similar trigger for every table in our database - yeeeurk!

Scheduled Data Cleaning

There is also the possibility of creating an overnight Data Cleaning type of routine.

All databases are sent to the central master database, and all new values are inserted into the 'Master' Database. The Primary Key values created in the Master Database are then updated in the individual databases, which are then returned to the various sites. I know this breaks one of my other rules about never updating a Primary Key - but that's life!

By adding Foreign Keys with an On Update Cascade clause like this ....

-- remember that SQLite only enforces these clauses when
-- foreign key support is switched on: pragma foreign_keys = on;
create table people_addresses (
id integer primary key ,
people_id integer ,
address_id integer ,
foreign key ( people_id ) references people ( id ) on update cascade ,
foreign key ( address_id ) references addresses ( id ) on update cascade );

...we can make sure that changes to Primary Keys propagate to tables that have Foreign Key links to them.
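So if the Master decides that Site A's person number 1 is really person number 42 (numbers plucked out of thin air), one update does the lot:

pragma foreign_keys = on ; -- cascades only fire when enforcement is switched on
update people set id = 42 where id = 1 ;
-- any people_addresses rows that pointed at 1 now point at 42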

The Data Cleaning method could lead to enormous problems if versions are not tightly controlled.

Vandalising the internal Sequence Table

This method will work for a one way merge - distributed databases into a central database.

The SQLite 'Integer Primary Key' mechanism will generate the next available number (without ever reusing a number if you use the AutoIncrement clause). You can use numbers right up to the highest possible integer value that SQLite is capable of storing - 9,223,372,036,854,775,807 - a VERY big number.

When you design your database and the distribution of database versions you may decide that each site will only ever have a maximum of say 1 million records. SO, you COULD alter the starting sequence number that SQLite uses for Autoincrement Primary Keys for each table in the various databases - YES it is possible!

When you use Autoincremented Primary Keys - SQLite will create a table called SQLITE_SEQUENCE. This table has two columns: name and seq. Unsurprisingly the 'name' column stores the table name and the 'seq' column stores the largest rowid that has ever been used for this table. We can change these numbers!

Site A Database:

update sqlite_sequence set seq = 1000000 where name = 'People' ;
update sqlite_sequence set seq = 1000000 where name = 'Addresses' ;
....

Site B Database:

update sqlite_sequence set seq = 2000000 where name = 'People' ;
update sqlite_sequence set seq = 2000000 where name = 'Addresses' ;
...

So as long as we never have more than 1 million records in each database - the Primary Key values will always be unique.
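One wrinkle to watch out for (do check this against your own SQLite version): a table only gets a row in SQLITE_SEQUENCE after its first insert, so on a brand new database the updates above will quietly do nothing. You can seed the rows yourself:

-- on a virgin database the 'People' row does not exist yet,
-- so insert it instead of updating it
insert into sqlite_sequence ( name , seq ) values ( 'People' , 1000000 );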

With that HUGE maximum integer value of over 9 quintillion you could make your sequence ranges much bigger and have an astronomical number of databases! If these values are insufficient for your needs then you need to think about hiring a team of super techie DBAs, and you really should not be using SQLite!

When the databases come home to be merged, the Master database could use a non autoincrement Primary Key, so the data would slip in nicely!
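Here is a minimal sketch of the homecoming merge, assuming the Site A file is called site_a.db (the file and column names are just examples):

-- run against the Master database; the Primary Key values travel across unchanged
attach database 'site_a.db' as siteA ;

insert into people ( id , name )
select id , name from siteA.people ;

detach database siteA ;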

This technique will work if you just want to merge the data from Site A and Site B into a central master database, but.... if you want to do a two way data merge - Site A to Site B and Site B to Site A - then a different method would be needed! It would be nice if SQLite let us wind SQLITE_SEQUENCE back to a number lower than the highest rowid actually in the table, then we could use this method for cross database merging - but AutoIncrement will always pick a number one larger than the largest value ever used in the table, so it won't!
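Don't take my word for it - try it on a scratch copy:

-- wind the sequence back below the real maximum...
update sqlite_sequence set seq = 5 where name = 'People' ;
-- ...but AutoIncrement still picks one more than the largest id ever used
insert into people ( name ) values ( 'Test Person' ) ;
select max( id ) from people ; -- nowhere near 6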

Roll Your Own Sequences

Using a technique similar to the one discussed above, you could fairly easily create the same effect, only this time allowing multiple merges.

You could create your own version of a sequence table and use a Trigger remarkably similar to the one discussed in the 'Compound Primary Key' section:

create trigger People_Trigger
after insert on People
begin
-- give the freshly inserted row the next key from myKeys
update people set id = ( select last_key_value + 1 from myKeys where table_name = 'People' )
where rowid = new.rowid;
-- record that we have now used that key
update myKeys set last_key_value = last_key_value + 1 where table_name = 'People' ;
end;

If the initial values in each database's Sequence table were set to a range as discussed above then the need for a compound primary key would be removed.

create table myKeys ( table_name text primary key , last_key_value integer );

Site A:

insert into myKeys ( table_name , last_key_value ) values ( 'People' , 1000000 );
insert into myKeys ( table_name , last_key_value ) values ( 'Addresses' , 1000000 );

Site B:

insert into myKeys ( table_name , last_key_value ) values ( 'People' , 2000000 );
insert into myKeys ( table_name , last_key_value ) values ( 'Addresses' , 2000000 );

etc.
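With the People trigger from above in place (and a plain 'id integer' column rather than a compound key), an insert at Site A should now pick up the next id in Site A's range:

insert into people ( name ) values ( 'John Smith' ) ;
select id , name from people ; -- id comes back as 1000001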

Which Method is Best?

You don't really think I'm going to answer that question, do you? These are just a few ideas - I'm sure you'll think up better ways to RepliMerge yourself; let me know if you do.

Angry Caveman says 'Make your own mind up!'

Grrrrrr! You have to make your own mind up!

For one way merging into a central database, I like the SQLITE_SEQUENCE kludging method - BUT you MUST test, test and test again that it works for you. These are only ideas, and if you break your data using any of these methods then please don't throw lawyers at me - I don't have any money, and you really should investigate further and make sure you are safe and happy before using ANY Data Replication / Merging technique!

Good Luck!