Migration guide - 0002_fill_person_distinct_id2

Last updated: Oct 25, 2022

0002_fill_distinct_id2 is an async migration added to migrate the data from the old person_distinct_id table to the new person_distinct_id2 table.

This is needed for faster person_distinct_id queries as the old schema worked off of (distinct_id, person_id) pairs, making it expensive for our analytics queries, which need to map from distinct_id to the latest person_id.

The new schema works off of distinct_id columns, leveraging ReplacingMergeTrees with a version column we store in postgres.

We migrate teams one-by-one to avoid running out of memory.

The migration strategy:

1. Write any new updates to both tables
2. Insert all non-deleted (`team_id`, `distinct_id`, `person_id`) rows from `person_distinct_id` into `person_distinct_id2` (this migration)
3. Once migration has run, we only read/write from/to pdi2.

FAQ

Is it dangerous for this migration to be in an errored state?

No, the migration copies data to the new table, but that new table is not used until the migration has successfully completed.

Questions?

Was this page useful?

ClickHouse

ClickHouse® is an open-source, high performance columnar OLAP database management system for real-time analytics using SQL. We use it to store information like: event person person distinct id / session and to power all our analytics queries. This is a guide for how to operate ClickHouse with respect to our stack. Metrics As with any database it is important to keep an eye on metrics to make sure everything is in ship shape. Most of these metrics shouldn't be a surprise. The metrics you should…