The first check I intended to do was the comparison of data for the mentioned customer between our source database and Apache Kafka, but then realized this is not possible since the retention time for our Kafka topics are only two days and the mentioned customer data is older. So I had to test other customers. And there I witnessed missing data for a few customers and my assumption was true, the way NIFI reads the data ("read committed") and tries to prevent dirty reads we should not see missing data, but as the source database commits both the creation of new record (creation of new customer ID) and optional data (birthday of a customer) together - and "reading" or selecting that data is only possible AFTER BOTH are committed therefore we lose some records where the customer still has not entered the birthdate which is a mandatory registration field and takes a few seconds after the initial registration.
Even though my team aim to provide real-time data but the solution here was to add a 500 milliseconds delay which was tested which is covers the difference between the registration time of customer and updating time of its record to add the the birthdate.
Thanks for this solution! Relational databases can be quite tricky in that regard. Other than the delay maybe using a higher isolation level for the DB might work, since I fear a bigger delay might be needed later on with a growing DB.
Comments