Magnus' blog


Data cloning - Part 6: Masking data

Since the process will take all rows for the specified entities, it could also include sensitive information. Imagine you want to pick an entity that is in some way related to the user. In that case, the user rows will also be included in the export. However, you might not want, for example, password information leaving the production server. This is where masking comes in and allows the developer to specify masking functions to remove any sensitive information that is not needed in the export.

To allow the implementor to specify how a masking operation should happen, the interface for masking is exposed in the configuration builder:

CloneConfigurationBuilder<T> masker(String tableName, TableMasker tableMasker);

Where the TableMasker is a simple interface for masking a row:

public interface TableMasker {
    Row mask(Row row);
}

Using this, a constant masking can easily be applied to a configuration:

CloneConfiguration CLONE_CONFIGURATION = CloneConfiguration
    .builder("main_table", "id", Long.class)
    .masker("main_table", row -> row.mask("password", "not cloned"))
    .build();

Setting all passwords to “not cloned”. This solves the problem for the configuration builder.

Streaming through CSV rows

One of the goals of the project is to be able to move larger amounts of entities at a time, and to do that, the rows are streamed. This works perfectly with the normal flow where the COPY command copies into a stream one row at a time. However, in order to modify the rows, they have to be read in a way that allows for transformation.

The postgresql Java library has a CopyOut object that allows reading the rows one at a time, providing a byte[] for each row. This structure is a good one, but the Apache Commons CSV is built around Reader, which pulls information x bytes at a time. Therefore, they don’t work well together. This leaves two solutions: either a fake Reader that wraps the CopyOut and reads from it, or constructing a CSVParser for each row. The CSVParser seems like a simpler implementation, so constructing it for each row should be performant enough and seems more straightforward than implementing a fake Reader.

The code expands the COPY such that it only does the parsing when a masking function is configured:

final var maskerOpt = cloneConfiguration.tableMasker(tableName);
if (maskerOpt.isEmpty()) {
    copyManager.copyOut(copyOutSql, zos);
    return;
}
final var masker = maskerOpt.get();

final var copyOut = copyManager.copyOut(copyOutSql);
maskCopyOut(zos, copyOut, masker);

The maskCopyOut function then parses and calls the mask function for each row:

private static void maskCopyOut(
            final ZipOutputStream zos,
            final CopyOut copyOut,
            final TableMasker masker
    ) throws SQLException, IOException {
        final var headerRow = copyOut.readFromCopy();

        if (headerRow == null) {
            throw CloneError.COPY_OUT_HEADER_EMPTY.toException();
        }
        
        zos.write(headerRow);

        final var headers = parseCsvValues(headerRow);

        byte[] row;
        while((row = copyOut.readFromCopy()) != null) {
            final var values = parseCsvValues(row);

            final var rowToBeMasked = new TableMasker.Row(headers, values);
            final var maskedRow = masker.mask(rowToBeMasked);

            final var maskedCsvRow = printCsvRow(maskedRow);
            zos.write(maskedCsvRow.getBytes(StandardCharsets.UTF_8));
        }
    }

The code is simplified, where the real code also contains more error handling, but it shows how the parsing is done in a streaming way.

Alternative use cases

Some databases contain unique constraints that are sequences — for example, a counter of the total number of X rows in a database. In that case, the importer could conflict on a row containing duplicate values, since the sequence generation is only local to the specific environment. In such a scenario, the masking function can be used to null a counter for a specific environment.

For example, if you have a counter that is a unique sequence, you could null it:

CloneConfiguration CLONE_CONFIGURATION = CloneConfiguration
    .builder("main_table", "id", Long.class)
    .masker("main_table", row -> row.mask("counter", null))
    .build();

There is no definitive way to do this, so it is something that should be considered on a case-by-case basis. For example, if the counter is used in another place, it might be inappropriate to null it, and the implementor might need to manually remove the other row.


Another alternative is that you want to add something to a String or increase a number, which can also be done with the masking function:

CloneConfiguration CLONE_CONFIGURATION = CloneConfiguration
    .builder("main_table", "id", Long.class)
    .masker("main_table", row -> row.mask("counter", counter ->
        Integer.toString(Integer.parseInt(counter) + 1)))
    .joinByJoinTableForeignKey("sub_table", "main_id")
    .masker("sub_table", row -> row.mask("note", value ->
        value + " This value has been cloned from another environment."))
    .build();

Conclusion

The masking layer completes the tool. Sensitive values can be scrubbed before leaving the source environment, and the same API handles broader transformation needs like clearing unique sequences or annotating cloned rows. Over six posts, the project has grown from a single COPY call into a self-contained, configurable, and safe way to reproduce production issues locally.

This marks the last blog post for the data cloning project. Thank you for reading!

View the source on github.com/HHMagnus/PostgresqlJavaEntityCloner

View next or previous post: