Metadata Management System in Data Warehouse
I don’t think many people have heard of the term metadata management system. Of course, people who don’t work with data warehouses rarely come into contact with this kind of system, and even those who do work in this area may still know little about it, so today I’m going to talk about the metadata management system in a data warehouse. Before diving in, let’s figure out what metadata is.
Definition of Metadata
According to the traditional definition, metadata is data about data. In a data warehouse system, metadata helps data warehouse administrators and developers easily find the data they care about. Metadata is the data that describes how the data in the data warehouse is structured and how it is built. It can be divided into two categories according to purpose: technical metadata and business metadata.
Technical metadata stores the technical details of the data warehouse system and is used by developers and administrators when developing and managing the data warehouse. It mainly includes the following information (a brief sketch follows the list):
- A description of the data warehouse structure, including the definition of the warehouse schema, views, dimensions, hierarchies, and derived data, and the location and content of the data mart;
- The architectures and schemas of the business systems, the data warehouse, and the data marts;
- Algorithms for aggregation, including metric and dimension definition algorithms, data granularity, subject areas, aggregations, summaries, predefined queries and reports;
- The mapping from the operational environment to the data warehouse environment, including the source data and their contents, data partitioning, data extraction, cleaning and transformation rules, data refresh rules, and security (user authorization and access control).
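To make the technical side concrete, here is a minimal sketch of what one technical metadata record for a warehouse table might look like. The record layout, table names, and rules are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass, field

@dataclass
class TechnicalMetadata:
    """Hypothetical technical-metadata record for one warehouse table."""
    table_name: str                 # physical table in the warehouse
    source_system: str              # operational system the data comes from
    source_tables: list             # source tables feeding this table
    transformation_rule: str        # extraction / cleaning / transformation logic
    refresh_rule: str               # how and when the data is refreshed
    granularity: str                # level of detail stored in the table
    authorized_roles: list = field(default_factory=list)  # access control

# An example entry; every name and rule here is made up for illustration.
orders_fact = TechnicalMetadata(
    table_name="dw.fact_orders",
    source_system="erp",
    source_tables=["erp.orders", "erp.order_items"],
    transformation_rule="join orders to items, drop cancelled orders, convert amounts to USD",
    refresh_rule="incremental load, daily at 02:00",
    granularity="one row per order item",
    authorized_roles=["dw_developer", "sales_analyst"],
)
print(orders_fact.table_name, "<-", orders_fact.source_tables)
```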
Business metadata describes the data in the data warehouse from a business perspective. It provides a semantic layer between the user and the actual system, so that business personnel who do not understand computer technology can also “read” the data in the data warehouse. Business metadata covers the data model, object names, and attribute names expressed in the user’s business terms; the rules for accessing the data and the origin of the data; and the analysis methods, formulas, and reports provided by the system. Specifically, it includes the following:
- Enterprise conceptual model: This is the most important information that business metadata should provide. It represents the high-level view of the enterprise data model: the business concepts of the entire enterprise and the relationships among them. Based on this enterprise model, business personnel who do not understand database technology or SQL can still understand the data in the data warehouse.
- Multidimensional data model: This is an important part of the enterprise conceptual model. It tells the business analyst which dimensions exist in the data mart, the categories of those dimensions, the data cubes, and the aggregation rules. A data cube here represents the multidimensional organization of a fact table and its dimension tables for one subject area.
- Dependency between the business conceptual model and the physical data: The business metadata described above only represents the business view of the data. These business views correspond to actual objects such as the data warehouse or database tables and fields, and the dimensions and hierarchies in multidimensional databases; this correspondence should also be recorded in the metadata knowledge base (a brief sketch follows this list).
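As a companion sketch, business metadata can be pictured as business terms that carry a plain-language definition, an owner, and an explicit pointer down to the physical warehouse objects they map to. All terms, tables, and columns below are made up for illustration.

```python
# Hypothetical business-metadata entries: business terms on top, each carrying a
# definition, an owner, and an explicit mapping down to physical warehouse objects.
business_terms = {
    "Order Revenue": {
        "definition": "Total amount charged to the customer for an order item",
        "owner": "Sales department",
        "physical_object": "dw.fact_orders",
        "physical_column": "net_amount_usd",
    },
    "Customer Region": {
        "definition": "Sales region the ordering customer belongs to",
        "owner": "Sales department",
        "physical_object": "dw.dim_customer",
        "physical_column": "region_name",
    },
}

def describe(term: str) -> str:
    """Translate a business term into the physical object that stores it."""
    entry = business_terms[term]
    return (f"{term}: {entry['definition']} "
            f"(stored in {entry['physical_object']}.{entry['physical_column']})")

print(describe("Order Revenue"))
```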
The Role of Metadata
1. Metadata is required for data integration.
The biggest characteristic of a data warehouse is its integration. This is reflected not only in the data it contains, but also in the process of implementing a data warehouse project. On the one hand, the data extracted from each data source must be stored in the data warehouse according to a unified schema, and the correspondences and transformation rules between those sources and the data in the warehouse must be stored in the metadata knowledge base. On the other hand, building a data warehouse directly is often time-consuming and labor-intensive, so in practice people may first build data marts according to a unified data model and then build the data warehouse on top of those marts.
However, as the number of data marts grows, a “spider web” phenomenon easily forms, and metadata management is the key to resolving it. If you pay attention to metadata management while building the data marts, integrating them into the data warehouse will go smoothly; if you ignore it, the final integration will be difficult or even impossible.
2. The semantic layer defined by metadata can help users understand the data in the data warehouse.
End users are unlikely to be as familiar with database technology as data warehouse administrators or developers, so they urgently need a “translator” that lets them clearly understand the meaning of the data in the data warehouse. Metadata implements the mapping between the business model and the data model, so the data can be “translated” into terms users understand, helping end users make sense of and use the data.
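For instance, a semantic layer driven by such metadata could let an end user ask for a measure by its business name while the system generates the physical SQL behind the scenes. The following toy sketch assumes hypothetical table and column names and is not the API of any particular product.

```python
# A toy semantic layer: measures and dimensions are defined once in metadata,
# and the physical SQL is generated for the user. All names are hypothetical.
SEMANTIC_LAYER = {
    "measures": {"monthly revenue": "SUM(f.net_amount_usd)"},
    "dimensions": {"region": "c.region_name", "month": "d.calendar_month"},
    "from_clause": (
        "dw.fact_orders f "
        "JOIN dw.dim_customer c ON f.customer_id = c.customer_id "
        "JOIN dw.dim_date d ON f.order_date_id = d.date_id"
    ),
}

def build_query(measure: str, by: str) -> str:
    """Turn business terms into a physical SQL statement via the semantic layer."""
    measure_expr = SEMANTIC_LAYER["measures"][measure]
    dim_expr = SEMANTIC_LAYER["dimensions"][by]
    alias = measure.replace(" ", "_")
    return (f"SELECT {dim_expr} AS {by}, {measure_expr} AS {alias}\n"
            f"FROM {SEMANTIC_LAYER['from_clause']}\n"
            f"GROUP BY {dim_expr}")

print(build_query("monthly revenue", by="region"))
```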
3. Metadata is the key to ensuring data quality.
After the data warehouse or data mart is established, users often have doubts about the data when using it. These doubts usually arise because the underlying data is not “transparent” to them, so they are naturally suspicious of the results. With the help of a metadata management system, end users can easily trace the ins and outs of each piece of data and the rules used to extract and transform it, so they naturally gain confidence in the data; of course, they can also easily spot quality problems in it.
4. Metadata can support changing requirements.
With the development of information technology and the evolution of enterprise functions, business requirements are constantly changing. How to build a software system that evolves smoothly as requirements change is an important problem in software engineering. Traditional information systems often adapt to changing requirements through documentation alone, but that is far from enough. A successful metadata management system can effectively manage the workflow, data flow, and information flow of the entire business, so that the system does not depend on specific developers, thereby improving its scalability.
The Current State of Metadata Management
From the sections above, we can see that metadata is almost the “soul” of a data warehouse and even of a business intelligence (BI) system. Precisely because metadata plays such an important role throughout the data warehouse life cycle, the data warehouse solutions of the various vendors all mention metadata management.
Unfortunately, no solution has proposed a complete management model for metadata; each only manages part of it. The metadata-related data warehouse tools currently on the market can be roughly divided into five categories:
- Data extraction tools: extract, transform, and integrate data from business systems into the data warehouse, such as Ardent’s DataStage, Pentaho’s open-source ETL product Kettle, and ETI’s Extract. These tools only provide technical metadata, with little support for business metadata.
- Front-end presentation tools: including OLAP analysis, reporting, and business intelligence tools, such as Cognos’ PowerPlay and Business Objects’ BO. They support multidimensional business views by mapping relational tables onto business-relevant facts and dimensions, which in turn enables multidimensional analysis of the data in the data warehouse. These tools all provide a semantic layer corresponding to business metadata and technical metadata.
- Modeling tools: business modeling tools aimed at non-technical people, which can provide higher-level, business-oriented semantics, such as CA’s ERwin, Sybase’s PowerDesigner, and Rational’s Rose.
- Metadata storage tools: Metadata is usually stored in a dedicated database, which is like a “black box”, and there is no way for outsiders to know how the metadata used and generated by these tools is stored. There is also a class of tools called Metadata Repositories that provide a centralized repository for metadata, independent of other tools. These tools include Microsoft’s Repository, Ardent’s MetaStage and Sybase’s WCC, among others.
- Metadata management tools: current metadata management tools fall roughly into three categories. The first is the specialized tools offered by vendors such as IBM and CA, for example the MetaStage that IBM obtained through its acquisition of Ascential and CA’s DecisionBase. The second is third-party metadata management tools that do not depend on any particular BI product, such as DAG’s MetaCenter and the open-source Pentaho Metadata. The third is metadata management tools built by integrators, such as Primeton’s MetaCube.
Metadata Management Standard
As the saying goes, nothing can be done well without rules and standards. A very important reason why metadata management is difficult is the lack of a unified standard, so metadata management solutions vary from company to company. In recent years, the gradual maturation of the Meta Data Coalition’s OIM (Open Information Model) standard and the OMG’s CWM (Common Warehouse Metamodel) standard, together with the merger of the MDC and the OMG, has given data warehouse vendors a unified standard and paved the way for metadata management.
From the development history of metadata, it is not difficult to see that there are two main methods for metadata management:
- For relatively simple environments, build a centralized metadata repository according to common metadata management standards.
- For more complex environments, the metadata of each part of the system is managed separately, forming a distributed metadata knowledge base; integrated metadata management is then achieved by establishing a standard metadata exchange format.
At present, the CWM (Common Warehouse MetaModel) standard of the OMG family has become a unified standard in the metadata management community:
OMG is an international standardization organization with more than 500 members, and the famous CORBA standard comes from this organization. The main purpose of the Common Warehouse Metamodel is to help different data warehouse tools, platforms and metadata knowledge bases to exchange metadata in a heterogeneous environment. In March 2001, OMG issued the CWM 1.0 standard.
The CWM model includes both metadata storage and metadata exchange, and is based on the following three industry standards:
- UML (Unified Modeling Language): the modeling language in which the CWM metamodel is defined.
- MOF (Meta Object Facility): It is a storage standard for OMG metamodels and metadata, and provides an access interface to the metadata repository in a heterogeneous environment.
- XMI (XML Metadata Interchange): enables metadata to be exchanged as XML documents; a brief sketch of the idea follows this list.
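To illustrate the idea behind XML-based metadata interchange, the sketch below serializes a table definition to XML with Python’s standard library and parses it back as another tool might. The element names are simplified placeholders, not the actual CWM/XMI schema.

```python
# A rough sketch of XML-based metadata interchange: one tool serializes a table
# definition to XML, another parses it back. The element names are simplified
# placeholders, not the real CWM/XMI schema.
import xml.etree.ElementTree as ET

def export_table(name, columns):
    """Serialize one table's metadata to an XML string."""
    root = ET.Element("Metadata", attrib={"note": "CWM-like example"})
    table = ET.SubElement(root, "Table", attrib={"name": name})
    for col_name, col_type in columns.items():
        ET.SubElement(table, "Column", attrib={"name": col_name, "type": col_type})
    return ET.tostring(root, encoding="unicode")

def import_tables(xml_text):
    """Parse the XML back into a plain dictionary another tool can consume."""
    root = ET.fromstring(xml_text)
    return {
        table.get("name"): {c.get("name"): c.get("type") for c in table.findall("Column")}
        for table in root.findall("Table")
    }

payload = export_table("dw.fact_orders",
                       {"order_id": "BIGINT", "net_amount_usd": "DECIMAL(18,2)"})
print(import_tables(payload))
```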
CWM defines a set of syntax and semantics specifications for sharing metadata between data warehouse and business intelligence (BI) tools. It mainly includes the following four specifications:
- CWM Metamodel: a model that describes the data warehouse system;
- CWM XML: the XML representation of the CWM metamodel;
- CWM DTD: the interchange format for DW/BI shared metadata;
- CWM IDL: the application programming interface (API) for accessing DW/BI shared metadata.
Metadata Management Functions
1. Data Map
The data map presents all kinds of data entities, and the metadata of the data processing between them, as a hierarchical topology graph. By controlling the granularity displayed at each level, it can support graphical queries and auxiliary analysis in different scenarios: development, operations and maintenance, and business use.
2. Metadata Analysis
Data lineage analysis: starting from an entity, trace its processing back to the data source interfaces of the data system. Different types of entities may involve different kinds of processing. For example, a base warehouse entity involves the ETL process; a warehouse summary table may involve both the ETL process and the warehouse summarization process; an indicator additionally involves the process that generates the indicator.
Data source interface entities are provided by the source systems and serve as the input to the data system; every other data entity is derived from them through one or more kinds of processing. Data lineage analysis lets users examine these processes as needed: what each process does, what input it requires, and what output it produces.
Impact analysis: starting from an entity, find the processing entities or other entities that depend on it; if desired, all dependent entities can be found recursively. This supports assessing the scope of an entity’s influence when it changes or needs to be modified.
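Both data lineage analysis and impact analysis can be seen as traversals over one dependency graph, upstream in the first case and downstream in the second. The minimal sketch below uses hypothetical entity names to show the idea.

```python
# Minimal sketch: lineage and impact analysis as traversals over one dependency
# graph. Keys are entities; values are the upstream entities they are built from.
# Every entity name here is hypothetical.
UPSTREAM = {
    "report.monthly_revenue": ["dw.agg_orders_monthly"],
    "dw.agg_orders_monthly": ["dw.fact_orders"],
    "dw.fact_orders": ["staging.orders"],
    "staging.orders": ["source.erp_orders"],
    "source.erp_orders": [],
}

def lineage(entity):
    """Trace an entity back toward its data sources (data lineage analysis)."""
    found = set()
    for parent in UPSTREAM.get(entity, []):
        found.add(parent)
        found |= lineage(parent)
    return found

def impact(entity):
    """Find every entity that depends on this one, recursively (impact analysis)."""
    dependents = {child for child, parents in UPSTREAM.items() if entity in parents}
    for child in set(dependents):
        dependents |= impact(child)
    return dependents

print(lineage("report.monthly_revenue"))  # everything the report was derived from
print(impact("staging.orders"))           # everything affected if staging.orders changes
```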
Entity association analysis: view how a specific piece of data is used from the perspective of the entities associated with it and the processes it participates in, forming a network of entities and processes, so as to better understand the entity’s importance. This function can support applications such as assessing the impact of requirement changes.
Entity difference analysis: examine different metadata entities and show the differences between them in graphs and tables, including differences in names, attributes, data lineage, and effects on other parts of the system. Data systems contain many similar entities (such as data tables) that differ only slightly in name or attributes, or that even share attribute names while being used in different applications. For various reasons, these small differences directly affect statistical results, so the data system needs a clear picture of them. This function helps unify the statistical caliber and evaluate the differences between similar entities.
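A simple form of entity difference analysis is to compare the attribute lists of two similar tables. The sketch below, using made-up table definitions, reports the attributes unique to each entity and the shared attributes whose types differ.

```python
# A small sketch of entity difference analysis: compare two similar tables'
# attribute lists and report what differs. The table definitions are made up.
def diff_entities(first, second):
    """Return attributes unique to each entity and shared names whose types differ."""
    return {
        "only_in_first": sorted(set(first) - set(second)),
        "only_in_second": sorted(set(second) - set(first)),
        "type_conflicts": {name: (first[name], second[name])
                           for name in set(first) & set(second)
                           if first[name] != second[name]},
    }

sales_daily = {"order_id": "BIGINT", "amount": "DECIMAL(18,2)", "region": "VARCHAR(32)"}
sales_daily_v2 = {"order_id": "BIGINT", "amount": "DECIMAL(20,4)", "channel": "VARCHAR(16)"}
print(diff_entities(sales_daily, sales_daily_v2))
```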
Indicator consistency analysis: graphically compare the data flow graphs of two indicators to see whether their calculation processes are consistent. This function is a specific application of indicator lineage analysis. It helps users see clearly whether the data objects and transformations involved at each stage of the business analysis data flow are the same for the two indicators being compared, understand the ins and outs of an indicator, and recognize differences between indicators with the same name in different departments, thereby increasing users’ trust in the value of the indicators.
3. Assisting Application Optimization
Metadata accurately describes the data in the data system, the data processing processes, and the relationships between the data. Using metadata analysis functions such as lineage analysis, impact analysis, and entity association analysis, the technical resources related to an application can be identified and, combined with the application’s life-cycle management process, used to assist in optimizing the data system’s applications.
4. Assisting Security Management
The data stored in the enterprise data platform and the various analysis applications provided involve various sensitive information in the company’s operations. Therefore, in the process of data system construction, comprehensive security management mechanisms and measures must be adopted to ensure the data security of the system.
The security management module of the data system is responsible for managing data sensitivity, customer privacy information, and audit logs, and it effectively monitors data access and function usage across the data system. To control access to sensitive data and customer privacy information and to refine permissions further, the security management module should be based on metadata: the metadata management module provides the definitions of sensitive data and customer privacy information, assisting the security management module in carrying out the relevant security controls.
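As a rough sketch of how metadata can underpin security management, sensitivity labels recorded in the metadata repository can drive access checks in the security module. The labels, roles, and column names below are all hypothetical.

```python
# Sketch: sensitivity labels kept in the metadata repository drive access checks
# in the security module. The columns, labels, and roles are all hypothetical.
SENSITIVITY = {
    "dw.dim_customer.phone_number": "customer_private",
    "dw.dim_customer.region_name": "internal",
    "dw.fact_orders.net_amount_usd": "internal",
}

ROLE_CLEARANCE = {
    "sales_analyst": {"internal"},
    "privacy_officer": {"internal", "customer_private"},
}

def can_access(role, column):
    """Allow access only if the role's clearance covers the column's sensitivity label."""
    label = SENSITIVITY.get(column, "internal")
    return label in ROLE_CLEARANCE.get(role, set())

print(can_access("sales_analyst", "dw.dim_customer.phone_number"))    # False
print(can_access("privacy_officer", "dw.dim_customer.phone_number"))  # True
```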
5. Metadata-based Development Management
The main phases of data system project development are requirements analysis, design, development, testing, and launch. The development management application can provide corresponding functions for these phases, managing and supporting their workflows, related resources, rule constraints, and input and output information.