Research Problems in Data Warehousing
Nikunj Kakadiya, LDRP Institute of Technology and Research, Gandhinagar
Abstract- The topic of data warehousing encompasses architectures, algorithms, and tools for bringing together selected data from multiple databases into a single repository, called a data warehouse, suitable for direct querying and analysis. In recent years data warehousing has become a prominent buzzword in the database industry, yet it has received comparatively little attention from the database research community. In this paper I motivate the concept of a data warehouse, describe its architecture, and propose a number of technical issues arising from that architecture that I believe are suitable topics for exploratory research.
I. INTRODUCTION
Providing integrated access to multiple distributed heterogeneous databases or other information sources has become one of the leading issues in database research and industry. Data integration follows two well-known approaches:
1) Lazy or on-demand approach: This is a two-step process:
(1) Accept the query, determine the appropriate set of information sources to answer the query, and generate the appropriate subqueries or commands for each information source.
(2) Obtain results from the information sources, perform appropriate translation, filtering, and merging of the information, and return the final answer to the user or application.
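To make the two steps concrete, here is a minimal Python sketch of the lazy approach; the Source class, the relevance test, and the merge step are hypothetical simplifications, not a prescribed design.

    # Minimal sketch of the lazy (on-demand) approach; all names are hypothetical.
    class Source:
        def __init__(self, name, rows):
            self.name = name
            self.rows = rows  # rows kept in this source's own format

        def relevant(self, query):
            # Step (1): decide whether this source can contribute to the query.
            return any(r["table"] == query["table"] for r in self.rows)

        def execute(self, query):
            # Evaluate the sub-query against this source's data.
            return [r for r in self.rows if r["table"] == query["table"]]

    def lazy_query(query, sources):
        answer = []
        for s in (s for s in sources if s.relevant(query)):   # step (1)
            answer.extend(s.execute(query))                   # step (2): merge
        return answer

    sources = [Source("crm", [{"table": "customers", "id": 1, "name": "Ada"}]),
               Source("sales", [{"table": "orders", "id": 7, "total": 42.0}])]
    print(lazy_query({"table": "customers"}, sources))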
2) Eager or in-advance approach: This is the alternative to the lazy approach. It works as follows:
(1) Information from each source that may be of interest is extracted in advance, translated and filtered as appropriate, merged with relevant information from other sources, and stored in a centralized repository.
(2) When a query is posed, it is evaluated directly at the repository, without accessing the original information sources.
This approach is commonly referred to as data warehousing.
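As a contrast with the lazy sketch above, the following Python fragment sketches the eager approach: data is extracted, translated, and merged in advance, and queries touch only the local repository. The row format and the filtering rule are hypothetical.

    # Minimal sketch of the eager (warehousing) approach; names are hypothetical.
    def build_warehouse(sources):
        warehouse = []
        for rows in sources.values():
            for row in rows:
                # Translate/filter in advance: drop source-private fields.
                warehouse.append({k: v for k, v in row.items()
                                  if not k.startswith("_")})
        return warehouse

    def query_warehouse(warehouse, table):
        # Queries are evaluated directly at the repository,
        # without accessing the original information sources.
        return [r for r in warehouse if r["table"] == table]

    sources = {"crm":   [{"table": "customers", "id": 1, "name": "Ada", "_raw": "x"}],
               "sales": [{"table": "orders", "id": 7, "total": 42.0}]}
    print(query_warehouse(build_warehouse(sources), "orders"))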
The lazy approach to integration is appropriate for information that changes rapidly, for clients with unpredictable needs, and for queries that operate on large amounts of data from a very large number of information sources; however, it may incur inefficiency and delay in query processing. Warehousing, by contrast, is appropriate for clients that require specific, predictable portions of the available information, and for clients that demand high query performance but do not need the most recent state of the information.
The lazy and warehousing approaches are each viable solutions to the data integration problem, and each is appropriate for certain scenarios. In this paper I consider research problems associated with the warehousing approach.
II. INDUSTRIAL PERSPECTIVE
Before considering the research problems associated with data warehousing, we note that there has been great interest in the topic within the database industry over the last several years. I believe that a truly general, efficient, flexible, and scalable data warehousing architecture requires a number of technical advances, outlined below.
The importance of data warehousing in the commercial segment appears to be due to a need for enterprises to gather all of their information into a single place for in-depth analysis, and the desire to decouple such analysis from the online transaction processing system. Analytical processing that involves very complex queries and few or no updates – usually termed decision support – is one of the primary uses of data warehouses.
III. ARCHITECTURE OF A DATA WAREHOUSING SYSTEM
Figure 1 illustrates the basic architecture of a data warehousing system. The bottom of the diagram shows the information sources. Connected to each information source is a wrapper/monitor. The wrapper component of this module is responsible for translating information from the native format of the source into the format and data model used by the warehousing system, while the monitor component is responsible for automatically detecting changes of interest in the source data and reporting them to the integrator.
When a new information source is attached to the warehousing system, or when relevant information at the source changes, the new or modified data is propagated to the integrator. The integrator is responsible for installing the information in the warehouse which may include filtering the information, summarizing it, or merging it with information from other sources.
The data warehouse itself can use an off-the-shelf or special purpose data management system. Although in Figure 1 we illustrate a single, centralized warehouse, the warehouse certainly may be implemented as a distributed database system, and in fact data parallelism or distribution may be necessary to provide the desired performance.
The architecture and basic functionality we have described are more general than those provided by most commercial data warehousing systems. In particular, current systems usually assume that the sources and the warehouse subscribe to a single data model, that propagation of information from the sources to the warehouse is performed as a batch process, and that queries from the integrator to the sources are never needed.
IV. RESEARCH PROBLEMS
Based on the general architecture for a data warehousing system described in Section III, we now outline a number of research problems that arise from the warehousing approach.
Wrapper/Monitors
The wrapper/monitor component shown in figure 1 has two interrelated responsibilities:
1. Translation: Making the underlying information source appear as if it subscribes to the data model used by the warehousing system. For example, if the information source consists of a set of flat files but the warehouse data model is relational, then the wrapper/monitor must support an interface that presents the data from the information source as if it were relational. The translation problem is inherent in almost all approaches to integration – both lazy and eager – and is not specific to data warehousing. Typically, a component that translates an information source into a common integrating model is called a translator or wrapper; a minimal sketch of one appears after this list.
2. Change Detection: Monitoring the information source for changes to the data that are relevant to the warehouse, and propagating those changes to the integrator. Note that this functionality relies on translation since, like the data itself, changes to the data must be translated from the format and model of the information source into the format and model used by the warehousing system.
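The following sketch illustrates the translation responsibility for a simple case: a wrapper that presents a flat CSV file as if it were a relational table. The file layout and table name are hypothetical.

    import csv, io

    # Sketch of the translation half of a wrapper/monitor: make a flat-file
    # source appear to subscribe to the relational model.
    def wrap_flat_file(text, table_name):
        reader = csv.DictReader(io.StringIO(text))
        return [(table_name, dict(row)) for row in reader]

    flat_file = "id,name\n1,Ada\n2,Grace\n"
    for table, row in wrap_flat_file(flat_file, "customers"):
        print(table, row)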
One approach is to ignore the change detection issue altogether and simply propagate entire copies of relevant data from the information source to the warehouse periodically. The integrator can combine this data with existing warehouse data from other sources, or it can request complete information from all sources and recompute the warehouse data from scratch. Ignoring change detection may be acceptable in certain scenarios, for example when it is not important for the warehouse data to be current and it is acceptable for the warehouse to be off-line occasionally. However, if currency, efficiency, and continuous access are required, then we believe that detecting and propagating changes and incrementally folding the changes into the warehouse will be the preferred solution.
In considering the change detection problem, we have identified several relevant types of information sources:
• Cooperative sources: Sources that provide triggers or other active database capabilities, so that notifications of changes of interest can be programmed to occur automatically.
• Logged sources: Sources maintaining a log that can be queried or inspected, so changes of interest can be extracted from the log.
• Queryable sources: Sources that allow the wrapper/monitor to query the information at the source, so that periodic polling can be used to detect changes of interest.
• Snapshot sources: Sources that do not provide triggers, logs, or queries. Instead, periodic dumps, or snapshots, of the data are provided offline, and changes are detected by comparing successive snapshots.
Each type of information source capability presents interesting research problems for change detection. For example, in cooperative sources, although triggers and active databases have been explored in depth, putting such capabilities to use in the warehousing context still requires addressing the translation aspect; similarly for logged sources. In queryable sources, in addition to translation, one must consider performance and semantic issues associated with polling frequency: if the frequency is too high, performance will degrade, while if the frequency is too low, changes of interest may not be detected in a timely way. In snapshot sources, the challenge is to compare very large database dumps and detect the changes of interest in an efficient and scalable way (sketched below).

An important related problem in all of these scenarios is to develop appropriate representations for the changes to the data, especially if a non-relational model is used. Finally, we note that a different wrapper/monitor component is needed for each information source, since the functionality of the wrapper/monitor depends on the type of the source (database system, legacy system, news wire, etc.) as well as on the data provided by that source. Clearly it is undesirable to hard-code a wrapper/monitor for each information source participating in a warehousing system, especially if new information sources become available frequently. Hence, a significant research issue is to develop techniques and tools that automate or semi-automate the process of implementing wrapper/monitors, through a toolkit or specification-based approach.
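For snapshot sources in particular, the following in-memory Python sketch shows the basic idea of snapshot differencing: compare two successive dumps keyed on a primary key and report inserts, deletes, and updates. Real snapshot dumps are far too large to compare this way, which is exactly why efficient, scalable differencing is a research problem; the key name and record layout here are hypothetical.

    # Sketch of change detection for snapshot sources: diff two dumps.
    def snapshot_diff(old, new, key="id"):
        old_by_key = {r[key]: r for r in old}
        new_by_key = {r[key]: r for r in new}
        inserts = [r for k, r in new_by_key.items() if k not in old_by_key]
        deletes = [r for k, r in old_by_key.items() if k not in new_by_key]
        updates = [r for k, r in new_by_key.items()
                   if k in old_by_key and old_by_key[k] != r]
        return inserts, deletes, updates

    old = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
    new = [{"id": 1, "name": "Ada L."}, {"id": 3, "name": "Edsger"}]
    print(snapshot_diff(old, new))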
Integrator
Assume that the warehouse has been loaded with its initial set of data obtained from the information sources. The ongoing job of the integrator is to receive change notifications from the wrapper/monitors for the information sources and reflect these changes in the data warehouse; see Figure 1.
• In most data warehousing scenarios, the views stored at the warehouse tend to be more complicated than conventional views. For example, even if the warehouse and the information sources are relational, the views stored in the warehouse may not be expressible using a standard relational view definition language (such as SQL) over the base data.
• Data warehouses also tend to contain highly aggregated and summarized information, so efficient view maintenance in the presence of aggregation and summary information appears to be an open problem; one incremental idea is sketched after this list.
• In a data warehouse, the views may not need to be refreshed after every modification or set of modifications to the base data. Rather, large batch updates to the base data may be considered, in which case efficient view maintenance techniques may involve different algorithms than are used for conventional view maintenance.
• In a data warehousing environment it may be necessary to transform the base data (sometimes referred to as data scrubbing) before it is integrated into the warehouse. Transformations might include, for example, aggregating or summarizing the data, sampling the data to reduce the size of the warehouse, discarding or correcting data suspected of being erroneous, inserting default values, or eliminating duplicates and inconsistencies.
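As an illustration of the aggregation point above, the sketch below incrementally maintains a per-group SUM at the warehouse under batched inserts and deletes, keeping an auxiliary COUNT so deletions can be folded in without recomputing from the base data. The view definition and change format are hypothetical.

    # Sketch: incremental maintenance of an aggregate view under batch updates.
    def apply_batch(view, inserts, deletes):
        # view maps group -> [sum, count]; the count makes deletes applicable.
        for group, amount in inserts:
            entry = view.setdefault(group, [0.0, 0])
            entry[0] += amount
            entry[1] += 1
        for group, amount in deletes:
            entry = view[group]
            entry[0] -= amount
            entry[1] -= 1
            if entry[1] == 0:         # no base rows left in this group
                del view[group]
        return view

    view = {}
    apply_batch(view, [("east", 10.0), ("east", 5.0), ("west", 7.0)], [])
    apply_batch(view, [], [("west", 7.0)])
    print(view)   # {'east': [15.0, 2]}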
Finally, we note that although integrators can be based purely on the data model used by the warehousing system, a different integrator still will be needed for each data warehouse, since a different set of views over different base data will be stored.
Warehouse Specification
In the previous section we drew an analogy between maintenance of a data warehouse and materialized view maintenance.
For conventional view maintenance, algorithms have been developed to automatically generate active database rules for maintaining SQL-defined views. Each rule is “triggered” by the notification of an update that may affect the view, and the rule modifies the view appropriately. A similar approach may be applied to data warehousing if a rule-driven integrator is used. Each integrator rule is triggered by a change notification (possibly of a specific type) from a wrapper/monitor. Similar to the view maintenance rules, integrator rules must update the warehouse to reflect the base data updates.
Thus, the research challenge in realizing the ideal architecture is to devise a warehouse specification language, rule capabilities, wrapper/monitor interfaces, and appropriate algorithms to permit developers of a data warehousing system to generate the integrator and the relevant change detection mechanisms automatically.
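A minimal sketch of a rule-driven integrator follows: rules register themselves against a notification type, and each incoming change notification triggers the matching rules, which update the warehouse. The notification types and warehouse layout are hypothetical; a real specification language would generate such rules rather than have them hand-written.

    # Sketch of a rule-driven integrator; all names are hypothetical.
    rules = {}

    def rule(notification_type):
        def register(fn):
            rules.setdefault(notification_type, []).append(fn)
            return fn
        return register

    @rule("insert:orders")
    def maintain_daily_totals(warehouse, change):
        day = change["date"]
        totals = warehouse["daily_totals"]
        totals[day] = totals.get(day, 0.0) + change["total"]

    def integrate(warehouse, change):
        # Fire every rule triggered by this change notification.
        for fn in rules.get(change["type"], []):
            fn(warehouse, change)

    warehouse = {"daily_totals": {}}
    integrate(warehouse, {"type": "insert:orders",
                          "date": "1996-01-02", "total": 42.0})
    print(warehouse)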
Optimizations
In this section we outline three optimizations that can improve the performance of the architecture described in Section III: filtering irrelevant modifications at the sources, storing additional data at the warehouse for “self-maintainability,” and efficiently managing multiple materialized views.
Update Filtering
Many modifications at the information sources are irrelevant to the views stored at the warehouse, and in some cases this irrelevance can be detected cheaply at the source itself. Related techniques allow distributed integrity constraints to be checked at a single site when certain types of modifications occur. We believe that these classes of techniques can be adapted to data warehousing, whereby as many changes as possible are filtered at the source rather than propagated to the integrator.
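The sketch below shows the simplest form of this idea for insertions: the wrapper/monitor tests each modification against the warehouse view's selection predicate and propagates only the relevant ones. The predicate is hypothetical, and deciding relevance for general updates and deletions is considerably subtler than this.

    # Sketch of update filtering at the source; the predicate is hypothetical.
    def view_predicate(row):
        return row["region"] == "east" and row["total"] > 100

    def filter_changes(changes):
        # Propagate only changes that can affect the warehouse view.
        return [c for c in changes if view_predicate(c["row"])]

    changes = [{"op": "insert", "row": {"region": "east", "total": 250}},
               {"op": "insert", "row": {"region": "west", "total": 999}}]
    print(filter_changes(changes))   # only the first change is propagated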
Self-Maintainability
When the integrator receives a change notification, it may need to fetch additional data from the same or different sources in order to integrate that change into the warehouse. Issuing queries to the sources can incur a processing delay, the queries may be expensive, and such queries are the basis of the “warehouse maintenance anomalies.” Even worse, when information sources are highly secure or are legacy systems, ad-hoc queries may not be permitted at all. Consequently, it may be desirable to ensure that, as much as possible, queries to the sources are not required in order to keep the warehouse data consistent.
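One way to achieve this is to store auxiliary copies of the base data the views depend on, making the warehouse self-maintainable: the sketch below maintains a join view on an insertion using only data already held at the warehouse. The schemas are hypothetical, and choosing how much auxiliary data to store is itself the research question.

    # Sketch of self-maintainability for view = orders JOIN customers.
    def insert_order(state, order):
        state["orders"].append(order)                     # auxiliary base copy
        cust = state["customers"].get(order["cust_id"])   # local lookup only,
        if cust is not None:                              # no source query
            state["view"].append({**order, "name": cust["name"]})

    state = {"customers": {1: {"name": "Ada"}},           # auxiliary data
             "orders": [], "view": []}
    insert_order(state, {"id": 7, "cust_id": 1, "total": 42.0})
    print(state["view"])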
Multiple View Optimization
Data warehouses may contain multiple views, for example to support different types of analysis. When these views are related to each other, e.g., if they are defined over overlapping portions of the base data, then it may be more efficient not to materialize all of the views, but rather to materialize certain shared “subviews,” or portions of the base data, from which the warehouse views can be derived. When applicable, this approach can reduce storage costs at the warehouse and can reduce the effort required to integrate base data modifications into the warehouse. However, these savings must be balanced against slower query response at the warehouse, since some views may not be fully materialized.
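The sketch below illustrates the idea with two related aggregate views derived on demand from one materialized shared subview; only the subview is stored and maintained. The view definitions are hypothetical.

    # Sketch of multiple-view optimization via a shared materialized subview.
    def shared_subview(sales):
        # Stored once: just the columns both derived views need.
        return [{"region": s["region"], "year": s["year"], "total": s["total"]}
                for s in sales]

    def totals_by_region(sub):
        out = {}
        for s in sub:
            out[s["region"]] = out.get(s["region"], 0.0) + s["total"]
        return out

    def totals_by_year(sub):
        out = {}
        for s in sub:
            out[s["year"]] = out.get(s["year"], 0.0) + s["total"]
        return out

    sales = [{"region": "east", "year": 1995, "total": 10.0, "rep": "x"},
             {"region": "east", "year": 1996, "total": 5.0,  "rep": "y"}]
    sub = shared_subview(sales)   # materialized; the derived views are not
    print(totals_by_region(sub), totals_by_year(sub))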
Miscellaneous Issues
Other issues that arise in the data warehousing environment include warehouse management, source and warehouse evolution, duplicate and inconsistent information, and outdated information.
V. CONCLUSION
In the area of integrating multiple, distributed, heterogeneous information sources, data warehousing is a viable, and in some cases superior, alternative to traditional approaches. Traditional approaches request, process, and merge information from the sources when queries are posed. In the data warehousing approach, information is requested, processed, and merged continuously, so that it is readily available for direct querying and analysis at the warehouse. Although the concept of data warehousing is already prominent in the database industry, we believe there are a number of important open research problems, described above, that must be solved in order to realize the flexible, powerful, and efficient data warehousing systems of the future.