
NIFI-4521 MS SQL CDC Processor #2231

Closed · wants to merge 11 commits

Conversation

patricker
Contributor

A new Processor + new Bundle for Microsoft SQL Server CDC reading.

The processor works similarly to a multi-table QueryDatabaseTable. It stores state for each table, can load snapshots of existing tables on the first run, and has features for handling large change sets.

I created unit tests as well.

You can read more about Microsoft's table schemas for storing CDC data here:
https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server
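
For readers unfamiliar with the layout: each capture instance gets a change table under the cdc schema that pairs the captured source columns with CDC metadata columns. A minimal illustrative query (the dbo_Names_CT capture instance and the [name]/[id] columns are hypothetical; the __$ columns and sys.fn_cdc_map_lsn_to_time come from Microsoft's documented CDC schema):

SELECT __$start_lsn,
       __$seqval,
       __$operation,   -- 1=delete, 2=insert, 3=update (before image), 4=update (after image)
       sys.fn_cdc_map_lsn_to_time(__$start_lsn) AS tran_end_time,
       [name], [id]    -- the captured source columns (hypothetical)
FROM [cdc].[dbo_Names_CT]
ORDER BY __$start_lsn, __$seqval;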

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

import java.util.ArrayList;
import java.util.List;

public class MSSQLCDCUtils {
Contributor

I didn't see any (non-constant) member variables; it seems like all the variables and methods could be static?

Contributor Author

Yes and no. The main code would be fine with just statics, but the unit tests need to override certain methods (ref MockCaptureChangeMSSQL in CaptureChangeMSSQLTest).
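
A minimal sketch of that pattern, with illustrative names (only MSSQLColumnDefinition's constructor shape is taken from this PR; the getColumns method shown here is a hypothetical stand-in):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;
import java.util.Collections;
import java.util.List;

public class MSSQLCDCUtils {
    // Instance method rather than static so tests can substitute canned metadata
    protected List<MSSQLColumnDefinition> getColumns(Connection con, String tableName) throws SQLException {
        // The real implementation would query the database's CDC metadata here
        throw new UnsupportedOperationException("requires a live connection");
    }
}

// In the test sources, the mock bypasses the database entirely:
class MockMSSQLCDCUtils extends MSSQLCDCUtils {
    @Override
    protected List<MSSQLColumnDefinition> getColumns(Connection con, String tableName) {
        return Collections.singletonList(new MSSQLColumnDefinition(Types.VARCHAR, "name", 1, false));
    }
}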

MSSQLColumnDefinition col = new MSSQLColumnDefinition(jdbcType, columnName, columnOrdinal, isColumnKey==1);
tableColumns.add(col);
}
} catch (SQLException e) {
Contributor

You don't really need this catch block just to rethrow the same exception; maybe wrap it in a CDCException or replace the catch block with an empty finally block?

Contributor Author

Agreed.
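
A sketch of the two options being agreed on (CDCException comes from the nifi-cdc-api module; loadColumns is a hypothetical stand-in for the surrounding code):

// Before: the catch block only rethrows the same exception
try {
    loadColumns(rs);
} catch (SQLException e) {
    throw e; // adds nothing
}

// Option 1: translate to the CDC module's exception type
try {
    loadColumns(rs);
} catch (SQLException e) {
    throw new CDCException("Failed to read CDC column metadata", e);
}

// Option 2: drop the catch entirely and let the SQLException propagate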

}
}

public void ComputeCapturePlan(Connection con, MSSQLCDCUtils mssqlcdcUtils) throws SQLException {
Contributor

Nitpick on leading capital letter in method name

Contributor Author

Agreed.

public static final String INITIAL_TIMESTAMP_PROP_START = "initial.timestamp.";

public static final PropertyDescriptor RECORD_WRITER = new PropertyDescriptor.Builder()
.name("record-writer")
Contributor

Very awesome that you are using the Record API here!

}

return new PropertyDescriptor.Builder()
.name(propertyDescriptorName)
Contributor

Nitpick to add displayName() as well; I know they're the same, but this way it won't show up in any searches for PropertyDescriptors that have name() but not displayName().
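
Roughly, for the dynamic-property builder in question, that would look like the following (the validator choice is illustrative, not from this PR):

return new PropertyDescriptor.Builder()
        .name(propertyDescriptorName)
        .displayName(propertyDescriptorName) // same value, but keeps name()-without-displayName() audits clean
        .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
        .dynamic(true)
        .build();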

protected List<PropertyDescriptor> descriptors;
protected Set<Relationship> relationships;

protected final Map<String, MSSQLTableInfo> schemaCache = new ConcurrentHashMap<String, MSSQLTableInfo>(1000);
Contributor

Should this be configurable due to memory concerns? If each MSSQLTableInfo is likely to be small (just a few short strings or whatever), then this number is probably fine.

Contributor Author

There actually isn't a reason to have an initial value at all. I copied this from the revolving cache used in other processors, but removed the code that limits the size of the cache. Since this processor does not allow inputs, the schema cache can't grow forever the way it can in processors that accept an input FlowFile and use expression language for the table name.
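
In other words, the field could simply be (a sketch of the simplification being described):

// No initial capacity or eviction needed; the key set is bounded by the configured table list
protected final Map<String, MSSQLTableInfo> schemaCache = new ConcurrentHashMap<>();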

final DBCPService dbcpService = processContext.getProperty(DBCP_SERVICE).asControllerService(DBCPService.class);

final boolean takeInitialSnapshot = processContext.getProperty(TAKE_INITIAL_SNAPSHOT).asBoolean();
final int fullSnapshotRowLimit = processContext.getProperty(FULL_SNAPSHOT_ROW_LIMIT).asInteger();
Contributor

You would have to add ".evaluateAttributeExpressions()" here if you add EL support to this field (see comment above).
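
That is, if the descriptor were built with EL support (ExpressionLanguageScope is the NiFi 1.7+ scoping API; this is a sketch, not the PR's code):

// On the PropertyDescriptor:
//     .expressionLanguageSupported(ExpressionLanguageScope.VARIABLE_REGISTRY)
// Then, when reading the property:
final int fullSnapshotRowLimit = processContext.getProperty(FULL_SNAPSHOT_ROW_LIMIT)
        .evaluateAttributeExpressions()
        .asInteger();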

@InputRequirement(InputRequirement.Requirement.INPUT_FORBIDDEN)
@Tags({"sql", "jdbc", "cdc", "mssql"})
@CapabilityDescription("Retrieves Change Data Capture (CDC) events from a Microsoft SQL database. CDC Events include INSERT, UPDATE, DELETE operations. Events "
+ "for each table are output as Record Sets, ordered by the time, and sequence, at which the operation occurred.")
Contributor

Should probably mention here that in a cluster, this processor is recommended to be run on the Primary Node only.

@Test
public void testRetrieveAllChanges() throws SQLException, IOException {
setupNamesTable();

Contributor

Line 215 is not blank, so it throws a CheckStyle violation

Contributor Author

I've never been able to get CheckStyle to run on my unit tests. It's like IntelliJ doesn't recognize them as valid targets for the plugin. Thanks.

" OBJECT_NAME(object_id) AS [tableName], \n" +
" SCHEMA_NAME(OBJECTPROPERTY(source_object_id, 'SchemaId')) AS [sourceSchemaName],\n" +
" OBJECT_NAME(source_object_id) AS [sourceTableName] \n" +
"FROM [cdc].[change_tables]";

Microsoft recommends not referencing system tables like change_tables directly; use sys.sp_cdc_help_change_data_capture instead. https://docs.microsoft.com/en-us/sql/relational-databases/system-tables/cdc-change-tables-transact-sql
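
For reference, a sketch of what the procedure-based lookup would look like (con is an open java.sql.Connection; the result-set column names follow Microsoft's documentation for sys.sp_cdc_help_change_data_capture):

try (CallableStatement stmt = con.prepareCall("{call sys.sp_cdc_help_change_data_capture}");
     ResultSet rs = stmt.executeQuery()) {
    while (rs.next()) {
        String sourceSchema = rs.getString("source_schema");
        String sourceTable = rs.getString("source_table");
        String captureInstance = rs.getString("capture_instance");
        // ... build table metadata from the row ...
    }
}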

Contributor Author

Yes, this is true. Two reasons why I did not go this route:

  • I could not find a reliable way to build unit tests that require SQL stored procedures. I'm running this processor in our production environment and it's quickly becoming more and more important, so I need to ensure I can run unit tests on it. If you can provide me with some guidance on building Apache DB stored procedures and integrating them into unit tests I can take a look, but then there is the permissions issue...
  • This stored procedure requires additional permissions beyond SELECT. I use a read-only NiFi account with limited permissions, and when I run this procedure with no arguments it returns no rows because the account cannot retrieve some of the data this SP needs. That same account works fine with this processor. A lot of ETL accounts are severely limited when it comes to permissions, for obvious reasons, so it felt best to me to support those scenarios.

@@ -175,10 +175,12 @@ private static DataType getDataType(final int sqlType, final ResultSet rs, final
return RecordFieldType.RECORD.getDataType();
}

final String columnName = rs.getMetaData().getColumnName(columnIndex);
Contributor Author

@markap14 In #2386 you added a readerSchema to the ResultSetRecordSet constructor. I was working with this class and do not need this functionality. I've updated the code so that a null readerSchema can be passed, as in my case there is no record reader, just a record writer.

Let me know if you have any concerns. I ran the unit tests for QueryRecord and they all ran without failure.

@patricker
Contributor Author

@mattyb149 Code updated.

@patricker
Contributor Author

@mattyb149 Rebased to cleanup version number issues.

@MikeThomsen
Contributor

@patricker if you can get this working with their Docker image of the Linux build, I might have time to help out.

@patricker
Contributor Author

@MikeThomsen I appreciate that. I've looked at this Docker image, and it looks like it will probably work. I'm not sure if I'll have time; I'll wait and see.

<dependency>
<groupId>org.apache.nifi</groupId>
<artifactId>nifi-cdc-mssql-nar</artifactId>
<version>1.7.0-SNAPSHOT</version>
Contributor

Sorry this has taken so long, can you rebase against the latest master and update the version(s) to 1.8.0-SNAPSHOT? Please and thanks!

Contributor Author

Rebased and version numbers updated. I had some files that I had failed to update from 1.6.0 to 1.7.0, so it wasn't even building correctly as it was. It builds and seems OK now.

@patricker
Contributor Author

@mattyb149 Ready when you are. I have plans to add in new functionality to support 'SQL Server Change Tracking', which is a simpler form of change tracking. Would really like to see this PR merged in prior to making any future changes.

@tle-totalwine

Can you suggest step-by-step instructions to build the jar file from your NIFI-4521 MS SQL CDC Processor?

@patricker
Contributor Author

@tle-totalwine Just got back from vacation. It looks like a conflict has arisen in the last 5 months, so I'm not sure which version of NiFi this will work with specifically. The conflict looks small, but I would need some time to resolve it.

You can build this by checking out the code from Git and running an mvn build for this module. From your command line, navigate to the module folder; I'd suggest the root cdc folder, nifi/nifi-nar-bundles/nifi-cdc. From there, run mvn clean package and see if it builds OK. For example:
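
Roughly (the output location is an assumption based on this PR's module names, not confirmed by the thread):

cd nifi/nifi-nar-bundles/nifi-cdc
mvn clean package
# the built .nar should appear under a target/ folder of the nifi-cdc-mssql-nar module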

@wiardvanrij

Any update?

@patricker
Contributor Author

@wiardvanrij I've updated the branch. I also have a user who is trying to test this change for me in his environment.

@wiardvanrij

> @wiardvanrij I've updated the branch. I also have a user who is trying to test this change for me in his environment.

Thank you, I will start testing it too :)

@patricker
Contributor Author

I will probably need to clean up a Checkstyle violation. I don't get the violation in IntelliJ, but I think I see it and can push a fix.

@marcelojscosta

Hi folks, I will try to test this branch in my environment too.

I will report back in the next few days.

@marcelojscosta

marcelojscosta commented Sep 1, 2019

Hi @patricker. I am trying to build branch NIFI-4521 and got the errors shown below.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.0:test (default-test) on project nifi-persistent-provenance-repository: There are test failures.
[ERROR]
[ERROR] Please refer to /Users/marcelojscosta/workstation/nifi/nifi-nar-bundles/nifi-provenance-repository-bundle/nifi-persistent-provenance-repository/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, [date].dumpstream and [date]-jvmRun[N].dumpstream.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :nifi-persistent-provenance-repository

Any tips to get past this error?
Regards,
Marcelo

@readl1

readl1 commented Jan 8, 2020

I have been using the MS CDC processor for ~4 months now. I have noticed what I think is a bug in the CDC processor: when I use more than one table in the CDC table list, I get an error and no CDC tables are read. The data in the queue below is from removing one of the tables from the list; both tables work independently. I am using NiFi v1.10.0, CDC MSSQL v1.9.2.

[screenshots: the error message and the queued data in NiFi]

@patricker
Contributor Author

@readl1 Out of curiosity, and to maybe make the error message more confusing, can you lower case your table names in the Processor configuration so they will match the ones in the error message?

@readl1

readl1 commented Jan 8, 2020

I figured it out. The issue is not with the case of the table names, but with spaces in the comma-separated list.

table1, table2, table3 won't work

table1,table2,table3 will work

final String[] allTables = schemaCache.keySet().toArray(new String[schemaCache.size()]);

String[] tables = StringUtils
.split(processContext.getProperty(CDC_TABLES).evaluateAttributeExpressions().getValue(), ",");
Contributor

Per the other comment, this might be better handled with a stream split -> trim -> collect, in order to handle whitespace between the delimiter and table names.
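
A sketch of that suggestion (property and variable names follow the snippet above; assumes java.util.Arrays and java.util.stream.Collectors imports):

final String rawTableList = processContext.getProperty(CDC_TABLES)
        .evaluateAttributeExpressions().getValue();
final List<String> tables = Arrays.stream(rawTableList.split(","))
        .map(String::trim)                 // tolerates "table1, table2, table3"
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());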

Contributor Author

@mattyb149 I've updated the code, and tweaked a unit test to cover it. What are your thoughts on working towards merging this :) I feel like it's gotten quite a bit of external testing by users at this point.


@mattyb149 @patricker Are we still trying to get this merged into master? Anything I can do from the testing standpoint to help that along?

@patricker force-pushed the NIFI-4521 branch 3 times, most recently from 135eb80 to 0dd0d6d on January 24, 2020 15:54
@felipemartim

Nice work! When should we expect this to get merged?

@pvillard31
Contributor

@patricker - it sounds like it would be a nice feature to get in based on all the interactions we had here.

Can you rebase the pull request against 1.14.0-SNAPSHOT? Are there still open items to fix/rework on this pull request?

@mattyb149 - any thoughts?

@patricker
Contributor Author

@pvillard31 There is one open issue related to very large transaction IDs not fitting in a BigInt. I have not researched it very much, but I'm already feeling motivated to check it out if this open and active PR might get merged after 3.5 years... I would LOVE to get this merged.

I'll take some time tomorrow to check out this one open issue.

@pvillard31
Contributor

Yeah, I know the feeling; I also have very old PRs that it would be nice to have merged. Let me know how it goes with your investigation and we can include it in the next release.

@patricker
Contributor Author

@ravitejatvs @readl1 I've been trying to get binary(10) working; I spent a lot of time on Friday. It's not that it's actually difficult; I found a way to store the binary(10) value as hex. It's that my unit test framework uses Apache DB, and the same functions don't exist there as far as I can tell. Still researching.

@readl1

readl1 commented Mar 10, 2021

> @ravitejatvs @readl1 I've been trying to get binary(10) working; I spent a lot of time on Friday. It's not that it's actually difficult; I found a way to store the binary(10) value as hex. It's that my unit test framework uses Apache DB, and the same functions don't exist there as far as I can tell. Still researching.

Let me know if I can help at all. I have a DB I can test this against: an MSSQL box with values larger than bigint.

@readl1

readl1 commented Mar 15, 2021

I am starting to see this processor deliver duplicate records. Have @patricker @ravitejatvs seen something similar? It's like the state is not being updated on the next run and it pulls the same data again.

@readl1

readl1 commented Mar 16, 2021

> I am starting to see this processor deliver duplicate records. It's like the state is not being updated on the next run and it pulls the same data again.

I wonder if, when the SQL Server uses a different timezone than the NiFi cluster, there could be a loop.

@patricker
Contributor Author

@readl1 Hmm. State is one of the last things we save, and if the state save fails then we remove the whole FlowFile.

                    // Persist the per-table state, then commit the session
                    stateManager.setState(statePropertyMap, Scope.CLUSTER);
                    session.commit();
                } catch (IOException e) {
                    // If state could not be saved, drop the FlowFile rather than emit data twice
                    session.remove(cdcFlowFile);

@readl1

readl1 commented Mar 17, 2021 via email

@readl1

readl1 commented Mar 17, 2021

@patricker
Ok, I have found a way to reproduce this. Let me know if we can get on a screenshare so I can show you the issue. I have it running on a 30-second interval and it's pulling data from yesterday on each run.

SELECT max(sys.fn_cdc_map_lsn_to_time ( __$start_lsn )) FROM
returned: 2021-03-16 18:17:34.163

Attribute in nifi: maxvalue.tran_end_time = 2021-03-16 18:17:34.163

Value stored in the state: 2021-03-16 18:17:34.163

How do you compare the current time stored in the state to the time coming from the CDC table?

@patricker
Contributor Author

@readl1 I need to review the code to see what debug options I left in place. Worst case we can add in an option to put the SQL statements into the NiFi log.

My email is available on my profile page. Shoot me a message and let's see if we can figure it out.
https://github.com/patricker.

@patricker
Contributor Author

Had a great meeting with @readl1, and we found the root cause of the issue. It turns out it is caused by a server-side change in MS SQL. Here are the documentation notes from Microsoft:

https://docs.microsoft.com/en-us/sql/relational-databases/system-tables/cdc-lsn-time-mapping-transact-sql?view=sql-server-ver15

> Note that java.sql.Timestamp values can no longer be used to compare values from a datetime column starting from SQL Server 2016. This limitation is due to a server-side change that converts datetime to datetime2 differently, resulting in non-equitable values. The workaround to this issue is to either change datetime columns to datetime2(3), use String instead of java.sql.Timestamp, or change database compatibility level to 120 or below.
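
A sketch of the String-based workaround from that note (the query shape is illustrative; cdc.lsn_time_mapping and its tran_end_time column are from the linked documentation, and storedMaxTime stands in for the value kept in processor state):

// Bind the saved max time as a String and compare as datetime2(3) on the server side,
// instead of binding a java.sql.Timestamp that SQL Server 2016+ converts inconsistently
PreparedStatement ps = con.prepareStatement(
        "SELECT start_lsn, tran_end_time FROM cdc.lsn_time_mapping " +
        "WHERE tran_end_time > CAST(? AS datetime2(3))");
ps.setString(1, storedMaxTime); // e.g. "2021-03-16 18:17:34.163"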

@github-actions

We're marking this PR as stale due to lack of updates in the past few months. If after another couple of weeks the stale label has not been removed this PR will be closed. This stale marker and eventual auto close does not indicate a judgement of the PR just lack of reviewer bandwidth and helps us keep the PR queue more manageable. If you would like this PR re-opened you can do so and a committer can remove the stale tag. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth which in turn helps yours.

@github-actions github-actions bot added the Stale label Jul 18, 2021
@github-actions github-actions bot closed this Aug 11, 2021
@kuleshov01

Is this version considered working?

@patricker
Contributor Author

@djshura2008 As far as I know it still works great. I changed companies and haven't used it in about a year, but there are no known issues I'm aware of.
