Gudu SQLFlow : What’s it and why it is important?
What’s Gudu SQLFlow?
In large data warehouses and data lakes, complete data lineage can be used for data traceability, impact analysis of table and column changes, proof of data compliance, data quality checks, and more. So what exactly is data lineage? According to Wikipedia, data lineage includes where the data came from, what happened, and where it moved over time. Gudu SQLFlow gives complete data lineage by analyzing SQL scripts. It can handle SQL statements of more than 20 kinds of databases, including complex stored procedures and dynamic SQL statements. With the powerful SQL processing capability of the self-developed general-purpose SQL parser, Gudu SQLFlow is one of the most popular data lineage tools available on the market today.
Gudu SQLFlow generates data lineages by analyzing SQL queries and stored procedures. Entities in the data lineage model include tables, columns, functions, relationships, and other entities. The combination of entities and dataflows shows the lineage from one table/column to another.
1. Data flow unit
A data flow unit includes source entities, target entities, and the type of data flow between them.
This is the data flow generated for the above SQL query.
1.1 Source and target entities
Source and target entities often reference tables, views, and other relationships, such as common table expressions, result sets generated during query execution. It may also refer to files in HDFS systems etc.
1.2 Types of data flow
There are two types of data flow: direct data flow and indirect data flow.
1.2.1 Direct data flow
Direct data flow means that the data of the target entity comes directly from the source entity. In the diagram above, RS-1.FirstName directly from Person.FirstName.
Arrows are used to represent direct data flow in the diagram:
1.2.2 Indirect data flow
Indirect data flow means that the data of the target column does not come from the source column, but the data of the source column/table affects the result data of the target column.
Dashed arrows are used to represent indirect data flow in the diagram:
Source columns in indirect data flow typically appear in the following clauses:
- Group by clause
- Where clause
- Winddows function
- Join condition
The values of COUNT() and SUM(SAL) are affected by the value of the deptno column in the group by clause. Therefore, the indirect data flow will be created as follows:
For other indirect data flows, we will discuss later.
1.2.3 Join operator
Join establishes a link between two or more columns in a join condition. Strictly speaking, this relationship is not a data flow type.
A join relationship will be created after parsing the above SQL. It represents the connection relationship between tbl, between tbl.key and TT.key.
2. Entities in the data flow
When building a data flow between two entities: a source entity and a target entity. They can be column-to-column, table-to-column, or table-to-table.
2.1 Column to Column
This is the most common case where both entities in the dataflow are columns.
2.2 Table to Column
When we say a table affects the value of a column, we usually mean that the total number of rows in a table affects the value of a column. Typically, this column is derived from the COUNT() function.
Table-to-column data flow is represented by using the psedou column of table RelationRows, which will build an indirect data flow from scott, relation row RS-1.num_emp.
2.3 Table to table
Sometimes, there will be data flow between two tables. For example, in an alter table rename SQL statement, a table-to-table data flow will be created. In effect, table-to-table data flow is determined by using the RelationRows table pseudo column.
3. Data lineage
Data lineage consists of many basic data flow units.
Data lineage analysis diagram:
The output can also be in XML or JSON format.
Conclusion
Thank you for reading our article and we hope it can help you to have a better understanding of Gudu SQLFlow. If you want to know more about Gudu SQLFlow, we would like to advise you to visit Gudu SQLFlow for more information. With the powerful SQL processing capability of the self-developed general-purpose SQL parser, Gudu SQLFlow is your first choice for analyzing SQL data lineage.