How to Get Data Lineage by Analyzing Oracle’s SQL Statements and Stored Procedures?
An important foundational work in data governance is to analyze the data lineage in an organization. With complete data lineage, we can use it for data traceability, table and column change impact analysis, data compliance certification, data quality inspection, and more. The main methods of analyzing data lineage are automatic parsing, system tracking, machine learning and manual collection. Automatic parsing mainly refers to using tools to parse files such as SQL statements, stored procedures, and ETL. In this article, we’ll introduce how to get data lineage by analyzing Oracle’s SQL statements and stored procedures, so as to better conduct data governance.
The SQL statements that generate the data lineages are as follows:
- SELECT
- INSERT
- UPDATE
- MERGE
- CREATE VIEW
- CREATE TABLE
- STORED PROCEDURE
SELECT
You might be wondering, since the SELECT statement doesn’t add or modify data, how is data lineage generated? The secret lies in the select list in the SELECT statement, where the data can be transformed. Take the following SELECT statement as an example:
We can see that the data for totalSal comes from emp.sal and emp.commission, where the data is transformed once. This data lineage generated inside the SELECT is temporary, but when this SELECT statement is combined with CREATE VIEW or CREATE TABLE, this data lineage is truly formed.
E.g:
This CREATE VIEW statement forms the data lineage from emp.sal and emp.commission to v_sal.mySal via SELECT.
Therefore, we can see that analyzing the SELECT statement is the basis for the data lineage analysis of the SQL statement. In general, it is relatively intuitive and simple to obtain data lineage by analyzing SQL statements, but the key to the problem is that the efficiency of manual analysis is too low. For such a large amount of SQL in the enterprise, manual analysis is almost impossible.
STORED PROCEDURE
Stored procedures can contain complex logic processing, such as conditional judgments, loop branches, etc. Therefore, it is often used to complete tasks such as data extraction, transformation, loading, and cleaning. Among them, a large number of data lineage are generated. In order to better manage the data in the enterprise, it is essential to sort out the data lineage in the storage process.
To analyze the data lineage in stored procedures, the cursor is a key factor, and the data flow is generally processed around the cursor.
In the following Oracle PL/SQL stored procedure, the cursor CURSOR cur_stclerk is first defined.
Then through this cursor, use LOOP to insert data from the employees table into the tables emp_temp and emp_detls_temp.
So we can establish a data lineage from the employees table to the tables emp_temp and emp_detls_temp.
The complete PLSQL stored procedure is as follows:
Automated data lineage analysis tools
SQLFlow supports analyzing SQL statements from more than 20 mainstream databases. It supports bigquery, couchbase, dax, db2, greenplum, hana, hive, impala, informix, mdx, mysql, netezza, odbc, openedge, oracle, postgresql, redshift, snowflake, sparksql, sqlserver, sybase, teradata, vertica.
It also supports analyzing stored procedures and dynamic SQL statements.
Analyze SQL statements through the UI
Through the UI of SQLFlow, the data lineage of a SQL can be quickly obtained, and the visual results can be obtained to help users quickly understand the data lineage in a SQL.
Analyze SQL data lineage through API
Sometimes, we need to store the data lineage obtained from the analysis as metadata in our own data governance platform and integrate it with other metadata. At this time, we can use the Restful API provided by SQLFlow to automate the data lineage analysis using scripts such as shell and python.
Here we use curl to show how to use API to access SQLFlow for data lineage analysis.
- Prepare the userid and secret code required to connect to SQLFlow with the API
- Use userid and secret code to get token. Note that replace the corresponding parts of the command with your own userid and secret code. curl -X POST “https://api.gudusoft.com/gspLive_backend/user/generateToken” -H “Request-Origion:testClientDemo” -H “accept:application/json;charset=utf-8” -H “Content-Type:application/x-www-form-urlencoded;charset=UTF-8” -d “secretKey=YOUR SECRET KEY” -d “userId=YOUR USER ID HERE”
- Submit the SQL statement to be analyzed to SQLFlow for analysis, and return the processing result. Note, replace the corresponding part of the command with your own userid and the token returned in the command above. curl -X POST “https://api.gudusoft.com/gspLive_backend/sqlflow/generation/sqlflow?showRelationType=fdd” -H “Request-Origion:testClientDemo” -H “accept:application/json;charset=utf-8” -H “Content-Type:multipart/form-data” -F “sqlfile=” -F “dbvendor=dbvoracle” -F “ignoreRecordSet=true” -F “simpleOutput=false” -F “sqltext=create view v_sal(mySal) as select sal + commission as totalSal from emp;” -F “userId=YOUR USER ID HERE” -F “token=YOUR TOKEN HERE”
- The returned json contains the following data lineages: “relations”: [ { “id”: “3”, “type”: “fdd”, “effectType”: “create_view”, “target”: { “id”: “11”, “column”: “mySal”, “parentId”: “9”, “parentName”: “v_sal”, }, “sources”: [ { “id”: “3”, “column”: “sal”, “parentId”: “2”, “parentName”: “emp”, }, { “id”: “4”, “column”: “commission”, “parentId”: “2”, “parentName”: “emp”, } ], “processId”: “10” } ]
For more detailed information, see the official website: SQLFlow Restful API.
Conclusion
This article introduces how to obtain data lineage by analyzing Oracle’s SQL statements and stored procedures, so as to better conduct data governance. At the same time, it also introduces how to use the SQLFlow tool to automate the work of analyzing data lineage in SQL statements to improve the efficiency and level of data self-care.
If you want to learn more about data lineage, we would like to advise you to visit Gudu SQLFlow for more information. As one of the best data lineage tools on the market today, it can not only analyze SQL script files, obtain data lineage, and perform visual display, but also allow users to provide data lineage in CSV format and perform visual display.