Missing Data while reading data from database

Our team uses Apache NiFi to offload data from source database transform it and send it to our stakeholders with Apache Kafka. Recently Swiss Regulators as one of our stakeholders claimed that there were missing records for some customers.

Comments

relational databases are tough to work with if you need real time data. Also, its best practice to not include the names of companies or stakeholders when asking for help publicly online :)
Ondrej Brichta - Mon, 12/20/2021 - 11:06 :::
1 answer

The first check I intended to do was the comparison of data for the mentioned customer between our source database and Apache Kafka, but then realized this is not possible since the retention time for our Kafka topics are only two days and the mentioned customer data is older. So I had to test other customers. And there I witnessed missing data for a few customers and my assumption was true, the way NIFI reads the data ("read committed") and tries to prevent dirty reads we should not see missing data, but as the source database commits both the creation of new record (creation of new customer ID) and optional data (birthday of a customer) together - and "reading" or selecting that data is only possible AFTER BOTH are committed therefore we lose some records where the customer still has not entered the birthdate which is a mandatory registration field and takes a few seconds after the initial registration.

Even though my team aim to provide real-time data but the solution here was to add a 500 milliseconds delay which was tested which is covers the difference between the registration time of customer and updating time of its record to add the the birthdate.

Comments

Thanks for this solution! Relational databases can be quite tricky in that regard. Other than the delay maybe using a higher isolation level for the DB might work, since I fear a bigger delay might be needed later on with a growing DB.

Sascha Pleßberger - Mon, 12/13/2021 - 14:51 :::