Instantly get the data lineage from complex SQL statements
Data lineage plays an important role in Data Governance, you can refer to this blog for why data lineage is so important for an enterprise and how can we correctly use the data lineage. In this article, we will introduce how to use Gudu SQLFlow UI, SQLFlow Rest API and SQLFlow Java library to retrieve data lineage from complex SQL statements with which you can potentially integrate in your own data governance platform.
A complex SQL statement example
select data.*, location.remote
from (
select e.last_name, e.department_id, d.department_name
from employees e
left outer join department d
on (e.department_id = d.department_id)
) data inner join
(
select s.remote,s.department_id
from source s
inner join location l
on s.location_id = l.id
) location on data.department_id = location.department_id;
Above SQL has some complexities and let’s take this SQL statement as an example source of retrieving data lineage with SQLFlow. Note that in the real situation, our SQL statements might be far more complicated than this example especially when we have views or stored procedures to deal with.
To generate the correspond data lineage, we need to know which fields or columns are included in the top-level select list, and what is the data source for these fields or columns comes from. After some analysis, an ideal result should be as shown in the figure below:
To archive the above results, we have to:
1.Expand the asterisk (*
in the SQL statement) in data.*
to find the corresponding fields
The asterisk in the Select
list stands for all fields of the relation, we will need to retrieve the exact fields to generate the data lineage. In this example, Gudu SQLFlow is able to get the fields info based on the contextual information provided in the SQL statement. Nevertheless, most of SQL statements are not able to provide sufficient information for us to detect what are the fields in the relation of the asterisk. In that case, giving metadata would be necessary to generate the result. You can refer to this article for how to get metadata.
2.Perform data tracing for each field in the select
list to find the original tables and fields. This process may require multi-level tracing until we find the final data source as the result.
Gudu SQLFlow UI
You can check our online Gudu SQLFlow Cloud. This is probably the most easiest way to instantly archive data lineage cause no toolkit or software is required on your side. The usage is quite simple, directly paste the SQL statement to be processed to the input area, and then click Visualize button.
If you prefer to not display the intermediate processing steps in the data lineage, you can show the source data with target data by config the setting as:
Download the data lineage in Json for further processing:
Gudu SQLFlow Rest API
Calling SQLFlow Rest API in your code to instantly get the data lineage from SQL statements. You can embed the request in your code and processing the result after getting the Rest Api response.
A Gudu SQLFlow Cloud Premium Account is required to use the Rest Api feature. Check our docs here to get the way of calling our Rest Api.
You will also need to install SQLFlow On-Premise on your server if you would like to use SQLFlow internally.
You can also check our sample code, the sample code is in Python:
- Sample code of connecting to the Gudu SQLFlow server ( Gudu SQLFlow on-premise server ) deployed on your own server.
- Sample code with SQLFlow public cloud server
Check here for the data lineage generated by the above sample code.
Gudu SQLFlow Java library
Use the Gudu SQLFlow Java library to analyze the SQL statements and get the data lineage. The advantage of using the Gudu SQLFlow Java class library is that you don’t need to install the Gudu SQLFlow server software and it can be executed independently without relying on any third-party Java class library, easy to integrate into your own project.
We’ve prepared a Java file for you which you can directly use without compilation as long as you have Java install on your machine. To get the select
list in the above example, you can run the Java file with the following parameters:
/s /topselectlist
The result by default is in XML. Check here for the result of the execution. You can config the result in Json as well by adding the /json
flag.