After years of demanding Reversal Data Lineage implementations in complex, heterogeneous IT environments, solving widely different critical issues, we would like to share some considerations. We hope they will help those who are about to undertake this initiative.

In recent years we have met many customers who, driven by specific regulations or determined to undertake new strategic initiatives (Migrations and Digital/Data-driven Transformations above all), felt the need to regain deep knowledge of their physical IT processes through the adoption of a Reversal Data Lineage system (much desired, but complicated to obtain).

This need arises because, over the last 20+ years, those customers paid to develop and implement their physical IT processes, while the knowledge of those processes remained in the hands of the employees, who then left, changed jobs, retired, and so on.

The need is most acute for those customers who, over the same 20+ years, invested to implement and expand their physical IT processes while the knowledge of those processes sat entirely in the hands of a single IT “individual” who operated manually on each of them. The big risk was losing all of the enterprise’s knowledge of its physical IT processes the moment that individual changed company or role, or retired.

Reading the feature lists of some Data Governance tools, or Data Governance articles, one might think Reversal Data Lineage is a simple goal, fairly obvious to achieve. Actually, this is not the case.

Sometimes, during meetings we had to reply to customer statements like:

“We are interested in reversing Data Lineage only from jobs. Possible SQL content does not matter.”

“We know we have ‘Stored Procedures’ calls; there are just a few transformation rules in them, and we think they are not so important.”

“Data Lineage obtained from technical metadata is sufficient for the CDO; transformations contained in parametric jobs and dynamic queries are not in scope for now.”

“We know the Data Lineage is not 100% complete; we will address those issues and repair the ‘holes’ manually.”

Recovering knowledge of the physical IT processes to satisfy Data Governance requirements or, more simply, to regain the enterprise’s “lost knowledge” is a fundamental investment for any Company. Enterprises invest in Reversal Data Lineage because it is the basis on which a Company builds both its “Policies” and its “Business Process” definitions (with the related workflows), which are closely related to each other.

What is the point of investing so much money and time in defining Policies and Business Processes, modifying the Corporate Structure, and defining new Roles and Workflows, while relying on a reversed Data Lineage that is partial, incomplete, and not error-proof?

A Reversal Data Lineage, if 100% complete and “certified”, will answer not only CDOs’ regulatory requests, but also questions like: “What is worth migrating?”; “What should we not migrate (because it is obsolete)?”; “What real impact will a change to a specific transformation rule have?”; “How can we save more money in our cloud migration initiative?”; “Exactly which physical process generated the ‘YY’ result on 24th February at 12.34?”; “Why and where did a specific Data Quality decrease occur?”; and so on.

These are all questions with a strong impact on a Company’s business results and, not least, answering them has immediate positive consequences for the success of the Business Initiatives that rely on a Reversal Data Lineage solution.

We often read articles about Data Quality initiatives or tools whose scope is mainly to define metrics and rules to evaluate Data Quality throughout the entire lifecycle and to support business decisions correctly (and we are not talking only about name-and-address Data Quality issues).

Let us take an example: data are generated by application “A”, undergo a transformation and are loaded into application “B”; then, after a final transformation, they are also loaded into application “C”.

“When our ‘metric evaluation system’ raises a ‘red flag’ because the data in application ‘C’ do not have the expected quality level, how can we find out where the problem is?”

“How can we tell whether the metric rule itself is wrong, whether there was a development error, or whether the last ‘transformation rule’ applied was wrong?”

To answer such questions, a Company needs to know the transformation rules of the entire end-to-end physical process; the alternative is to risk spending far more money and time searching for the right answer (and, while the error is being tracked down, nobody can expect the company to take decisions based on that information).
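To make the point concrete, here is a minimal sketch in Python of how a column-level lineage graph lets you walk back from the flagged column in application “C” to every upstream transformation that could have caused the issue. The table, column and rule names are invented purely for illustration; they do not refer to any specific tool or customer system.

```python
# Hypothetical column-level lineage: each target column maps to the
# upstream columns it is derived from and the transformation applied.
# All names ("A.amount", "rule_b1", ...) are made up for this example.
lineage = {
    "C.net_revenue": [("B.gross_revenue", "rule_c1: gross - fees")],
    "B.gross_revenue": [("A.amount", "rule_b1: sum per customer"),
                        ("A.currency", "rule_b1: convert to EUR")],
}

def trace_upstream(column, graph, depth=0):
    """Walk the lineage graph upstream from a flagged column, printing
    every source column and transformation rule along the path."""
    for source, rule in graph.get(column, []):
        print("  " * depth + f"{column} <- {source}   [{rule}]")
        trace_upstream(source, graph, depth + 1)

# A Data Quality "red flag" on C.net_revenue: list every candidate cause.
trace_upstream("C.net_revenue", lineage)
```

With such a view, the question “is the metric wrong, the development wrong, or the last transformation rule wrong?” becomes a short, ordered list of rules to inspect instead of an open-ended search.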

It is advisable to give Metadata and reversed Data Lineage Quality the same importance as Data Quality: nobody wants to base decisions on wrong metadata, or on a wrong, incomplete, or non-“Certified” reversed Data Lineage.

Let us state, then, the 5 main features a Reversal Data Lineage solution MUST HAVE to be “Useful”:

100% End-to-End Complete: including parametric Jobs and dynamic queries, and extending to the deepest level (column level) across the entire mixed-technology Application Chain. The “Metadata Quality” concept cannot be neglected, particularly in the Data Governance area, because in the end Business Processes and physical processes must be closely related and consistent.

Certified: a Reversal Data Lineage that is Complete and Certified through Metadata collected from the running IT application chains avoids misleading information and, in addition to satisfying regulatory needs, helps save money and dramatically reduce risks by answering questions like: “What do we really need to migrate, and what not?”; “What can we off-load to save money?”; “How can we reduce Cloud costs by migrating only the useful IT physical processes?”.

Automatic: any manual intervention needed to complete a specific Data Lineage proves very burdensome at first, and even more so in the medium/long term when keeping it aligned, often at unsustainable cost and, in the end, without “error-proof” results.

Historicized: this feature is quite different from “versioning” and helps answer audit questions like: “Which physical process generated the xxx result in column yyy at 5.34 pm on 3rd March 2019?” Every executed IT process needs to be uniquely historicized, even if its logic does not change, because it may run again five minutes later and generate a different result due to different data content. Each run can then easily be retrieved for audit purposes.

Intelligible: a reversed Data Lineage based only on technical metadata is often presented as an enormous “constellation” of objects and connections, which is practically impossible to understand at first glance. The tool should be able to apply every possible “filter” to produce the simplest and clearest representation of that reversed Data Lineage. “Labelling” is a manual activity and not an option, as it is too burdensome to maintain and keep aligned over time. Some examples of “automatic” filtering (sketched after this list) are: 1) discarding any object that does not match the related column in the database; 2) discarding all “not run” objects and connections (where the “not run” period is a customer parameter; such objects are probably obsolete); 3) discarding all objects and connections not strictly related to the specific executed physical process.
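As an illustration of filters 2) and 3) above, and of how historicized run records support them, here is a minimal Python sketch. The object names, the run log and the thresholds are assumptions made up for the example, not any specific tool’s API or data model.

```python
from datetime import datetime, timedelta

# Hypothetical historicized run log: every execution of every lineage
# object is recorded with its own timestamp (the "Historicized" feature).
run_log = {
    "job_load_B":      [datetime(2019, 3, 3, 17, 34), datetime(2019, 3, 3, 17, 39)],
    "job_load_C":      [datetime(2019, 3, 3, 17, 40)],
    "job_legacy_feed": [datetime(2017, 1, 10, 2, 0)],   # has not run for years
}

def filter_not_run(objects, since):
    """Filter 2): keep only objects that ran at least once after 'since';
    the rest are probably obsolete and only clutter the lineage view."""
    return {o for o in objects if any(t >= since for t in run_log.get(o, []))}

def filter_single_run(objects, run_instant, tolerance=timedelta(minutes=30)):
    """Filter 3): keep only objects involved in THIS specific execution,
    i.e. those with a historicized run close to the instant under audit."""
    return {o for o in objects
            if any(abs(t - run_instant) <= tolerance for t in run_log.get(o, []))}

all_objects = set(run_log)
recent = filter_not_run(all_objects, since=datetime(2019, 1, 1))
print(recent)   # 'job_legacy_feed' is filtered out as probably obsolete
audited = filter_single_run(recent, run_instant=datetime(2019, 3, 3, 17, 34))
print(audited)  # only the objects executed around 5.34 pm on 3rd March 2019
```

The point is not the code itself but the principle: the filters are computed automatically from collected metadata and historicized runs, with no manual labelling to maintain.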

In our experience, across many Reversal Data Lineage implementations for important customers, a Reversal Data Lineage that lacks the five fundamental requirements above will not fully support decision-making activity… so it is not really useful!

We know the above-mentioned features might sometimes seem “nice to have” or close to a “utopia”; that is why we are open to any discussion about how such a result can be achieved and how it can save money and time.