Magnus' blog


Data cloning - Part 5: Recloning

What happens when you clone data that has already been cloned before? This is what I call “recloning”. In the initial solution you would have to wipe the data manually, because otherwise the clone would conflict on insert. The improved solution cleans up the previous entity data instead, so the same data can be recloned any number of times. This lets a developer modify the cloned data to test something and then reclone to reset it to a known state.

In the optimistic scenario this could be handled with an on conflict (id) do update set ..., which updates the row when it conflicts on the id. However, this does not cover the case where a previously cloned row has since been deleted from the source data. Merely updating would leave the stale row in the importer’s database, possibly duplicating list entries. The question becomes: how can we know which rows were deleted, so that we can delete them too?

You could delete all related data before cloning, but that runs into an edge case: some other row, say a debug row, may point at the entity, in which case the deletion fails with a foreign key constraint violation. To allow for such scenarios, an alternative is to use on conflict for updated rows and to delete only the rows that were actually removed.

It is not possible to encode the deleted rows in the cloning data itself, as the exporter does not know what is in the importer’s database. Additionally, if the importer’s database contains an existing entity with the same id that differs from the cloned one, the clone should still replace it. The information therefore has to be encoded elsewhere.

This is where the builder becomes useful again. The builder knows the relations between the different tables. The builder simply needs to be passed as an argument to both the exporter and importer, so that the information is known in both cases.

The importer has two important properties:

  • Tables are always imported in dependency order: a table’s dependencies are imported before the table itself, so the foreign key to the parent table is already known.
  • A temporary table is used for importing, allowing the ids to be visible in the database before the data is inserted into the main table.

The importer can use this to execute a delete statement between the insertion into the temporary table and the copy into the main table:

copy temp_main_table from stdin with csv header;

delete from main_table
where id not in (select id from temp_main_table)
and foreign_key in (select foreign_key from temp_main_table);

insert into main_table
select * from temp_main_table
on conflict (id) do update set
    x = excluded.x,
    y = excluded.y;

This deletes every row that belongs to the recloned entity but is no longer present in temp_main_table, while rows that conflict on id are updated in place.
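The delete-then-upsert semantics can be simulated in memory. The sketch below models both tables as `{id: row}` dicts, with `foreign_key` standing for the column that points at the cloned parent entity; it mirrors the SQL above but is not the project’s actual code:

```python
# In-memory sketch of delete-then-upsert: keep a row if it is
# being re-imported, or if it belongs to a different entity.
def reclone(main_table: dict, temp_table: dict) -> dict:
    incoming_fks = {row["foreign_key"] for row in temp_table.values()}
    # delete: rows of the recloned entity absent from the new clone data
    result = {
        rid: row for rid, row in main_table.items()
        if rid in temp_table or row["foreign_key"] not in incoming_fks
    }
    # upsert: insert new rows, overwrite conflicting ones
    result.update(temp_table)
    return result

# Entity 1 previously cloned rows 1-3; the new clone contains row 1
# (changed) and row 4, so rows 2 and 3 must be deleted. Entity 2's
# row is untouched because its foreign_key is not in the clone data.
main = {
    1: {"foreign_key": 1, "x": "old"},
    2: {"foreign_key": 1, "x": "old"},
    3: {"foreign_key": 1, "x": "old"},
    9: {"foreign_key": 2, "x": "other entity"},
}
temp = {
    1: {"foreign_key": 1, "x": "new"},
    4: {"foreign_key": 1, "x": "new"},
}
assert reclone(main, temp) == {
    1: {"foreign_key": 1, "x": "new"},
    4: {"foreign_key": 1, "x": "new"},
    9: {"foreign_key": 2, "x": "other entity"},
}
```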

A small inconvenience is that do update set requires every column to be listed explicitly. The list can be gathered from the database itself by string-concatenating over the columns in the schema:

select
  'on conflict (id) do update set ' ||
  string_agg(column_name || ' = excluded.' || column_name, ', ')
from information_schema.columns
where table_name = 'main_table'
  and table_schema = 'public'
  and column_name != 'id';
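The same clause can also be assembled on the client once the column names are known. A minimal Python sketch (the function name is made up):

```python
# Build the "on conflict ... do update set" clause from a list of
# column names, skipping the conflict key itself.
def build_upsert_clause(columns: list[str], key: str = "id") -> str:
    assignments = ", ".join(
        f"{col} = excluded.{col}" for col in columns if col != key
    )
    return f"on conflict ({key}) do update set {assignments}"

clause = build_upsert_clause(["id", "x", "y"])
# "on conflict (id) do update set x = excluded.x, y = excluded.y"
```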

Verifying the configuration

Since this change requires both the exporter and the importer to know the table configuration, a misconfiguration can occur if they run different versions. Ideally the configuration could be part of the transferred data, but serializing and deserializing it across versions would be cumbersome to maintain. Instead, the configuration is passed as an argument on both sides.

To verify that the correct configuration is in use, the configuration is toString()'ed and written as the first file in the ZIP. The importer can then check that the table configuration it is executing with matches the one the exporter used. No serialization takes place, so the configuration remains developer data rather than user data.

Conclusion

With the delete-then-upsert approach, recloning the same entity is safe regardless of what has changed since the last clone. Developers can freely modify cloned data to test a fix and reset to a known state whenever needed. The next blog post will explore how to remove sensitive data from a clone.

View the source on github.com/HHMagnus/PostgresqlJavaEntityCloner
