The different techniques of Change Data Capture
The change data capture techniques are used in order to identify changes and Oracle CDC is usually the baseline system used to synchronize another system with usually the same incremental changes or to store different trials of changes. The said trail may be used for other uses such as to update a data warehouse or in order to run analyses across the changes like in order to identify different patterns of changes. Through this article, you will learn about the four most common ways to perform CDC. The said four ways are as follows:
1. The date modified method
There are a lot of transactional applications that keep track of metadata in every row which includes the names of those who created or modified the row, as well as when the row was created and modified recently. The usual approach to oracle CDC in such a scenario is that of keeping a track of when changes are extracted, and in a subsequent filter on the date modified column to only retrieve rows that were modified since the most recent time data was extracted. This is an approach that has a few challenges that are usually of no concern but then there are those that may be a cause of concern. These are as follows:
- Data deletes are a big challenge because there is usually no date modified for a deleted row. In an extreme case of delete, the truncate table which is usually very uncommon in transactional applications, there are chances of an occurrence.
- The date modified has to be available on all tables and must be set in a reliable manner. In order to do so, the database triggers is usually the right method and these may introduce an overhead on the transactional application.
- The date modified can be indexed in order to lower the impact of the select statement at the cost of storing and also updating the additional index.
2. The Diff method
The diff method for CDC compares the current state of the data with previous state of the data in order to identify what all was changed. The usual challenges with this approach include:
- In order to perform the diff method, the most important requirement is a presence of a lot of resources in order to compute the differences between the data, and also the resource consumption grows at least linearly with the growth in data volume.
- With this method the CDC cannot be performed in real-time because it extracts a lot of resources in order to perform.
3. The triggers method
In shadow tables, the triggers method is used to perform CDC. The shadow tables might store an entire row in order to keep track of every single column change. It may also store the primary only apart from the operation type like insert or delete etc. this method is also prone to a few challenges like:
- Usually the lower-overhead alternative might only store the primary key of the table and requires a join back to the source table in order for it to retrieve the changes which primarily, increases the load to retrieve the changes, and secondly, loses intermediate changes if multiple changes took place on the same row.
- Firing the trigger, and storing the row changes in a shadow table, introduces overhead most of the time. In extreme cases, 100% overhead on the transaction i.e. instead of .1 second it may take .2 seconds to complete a transaction.
4. Last but not the least, the Log-Based Change Data Capture method
All changes in a transaction log are stored in transactional databases so that the committed state of the database is recovered, should the database crash. This method takes advantage of this aspect of the transactional database in order to read the changes from the log. In this method, a few challenges are involved which are as follows:
- It is difficult to interpret the changes in the transaction log as there are no documented standards on how the changes are stored like transaction logs from different database vendors, and there are many scenarios that must all be considered and tested for example to consider clustered databases, rollbacks and savepoints, updates and deletes, etc.).
- A lot of databases are optimized to only use internal identifiers in order to recover database row changes which is insufficient to perform CDC and record the changes on an altogether different system. The primary key columns are required to retrieve the context of the updates and the introduction of supplemental logging will increase the volume of data that is written to the transaction logs in a small percentage and there is very little measurable performance impact on the transactional application.
These methods are tried and tested and if one can overcome all these challenges, the CDC may be performed in the best manner.