Curino Carlo Aldo
Present position: Post-doc at UCLA
|Thesis title:||Panta Rhei: Database Evolution and Integration from Practice to Vision|
|Advisor:||Letizia Tanca, Carlo Zaniolo|
|Research area:||Web, multimedia and databases|
"Panta Rhei": everything is in a state of flux. Does this affect Information Systems? Does it affect real-life database management? To answer these questions, we analyzed the evolution history of popular Web Information Systems, including the well-known Wikipedia. Beyond confirming our initial hunch about the severity of the problem, the analysis suggests the need for better methods and tools to support graceful schema evolution and data archiving under schema evolution.
Schema evolution is an unsolved problem for traditional information systems, and it is further exacerbated in web information systems and public scientific databases. As of today, schema evolution remains an error-prone and time-consuming undertaking, because the Database Administrator (DBA) lacks the methods and tools needed to manage and automate it by (i) predicting and evaluating the effects of proposed schema changes, (ii) rewriting queries and applications to operate on the new schema, and (iii) migrating the database. The PRISM system, which we developed, takes a first major step toward addressing this need by providing: (i) a language of Schema Modification Operators (SMOs) to express complex schema changes concisely, (ii) tools that allow the DBA to evaluate the effects of such changes, (iii) optimized translation of old queries to work on the new schema version, (iv) automatic data migration, and (v) full documentation of the changes that have occurred, as needed to support data provenance, database flashback, and historical queries. PRISM solves these problems by integrating recent theoretical advances on mapping composition and invertibility into a design that also achieves usability and scalability. PRISM has been tested against the Wikipedia schema history with encouraging results.
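To make the idea of Schema Modification Operators concrete, the following is a minimal sketch of how a sequence of operators could drive both schema evolution and query translation. It is not PRISM's actual SMO syntax or rewriting algorithm: the classes `RenameTable` and `RenameColumn`, the functions `evolve_schema` and `rewrite_query`, and the toy table and column names are all assumptions made for illustration, and a query is reduced to a simple (table, columns) projection.

```python
from dataclasses import dataclass


# Hypothetical, simplified operators; not PRISM's actual SMO syntax.
@dataclass(frozen=True)
class RenameTable:
    old: str
    new: str


@dataclass(frozen=True)
class RenameColumn:
    table: str
    old: str
    new: str


def evolve_schema(schema, smos):
    """Return the schema obtained by applying each operator in order."""
    # Copy the input so the old schema version stays available.
    result = {table: list(cols) for table, cols in schema.items()}
    for op in smos:
        if isinstance(op, RenameTable):
            result[op.new] = result.pop(op.old)
        elif isinstance(op, RenameColumn):
            result[op.table] = [
                op.new if col == op.old else col for col in result[op.table]
            ]
    return result


def rewrite_query(query, smos):
    """Translate a (table, columns) projection from the old schema to the new one."""
    table, columns = query
    columns = list(columns)
    for op in smos:
        if isinstance(op, RenameTable) and op.old == table:
            table = op.new
        elif isinstance(op, RenameColumn) and op.table == table:
            columns = [op.new if col == op.old else col for col in columns]
    return table, columns


# Toy schema and operator sequence (illustrative names only).
old_schema = {"cur": ["cur_id", "cur_title", "cur_text"]}
smos = [
    RenameTable("cur", "page"),
    RenameColumn("page", "cur_title", "page_title"),
]

new_schema = evolve_schema(old_schema, smos)
new_query = rewrite_query(("cur", ["cur_title"]), smos)
```

Because the operator sequence is kept as data, the same list serves as a record of the schema history and as the input to query translation. Operators that split or merge tables would need the mapping composition and inversion machinery mentioned above, which this sketch does not attempt.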
Furthermore, the accountability obligations of companies and organizations have made even more urgent the old problem of managing the history of database information, which is particularly taxing for databases subject to schema changes. The other system we developed, named PRIMA, addresses this difficult problem by introducing two key pieces of new technology. The first is a method for publishing the history of a relational database in XML, whereby the evolution of the schema and of its underlying database are given a unified representation. This temporally grouped representation makes it easy to formulate sophisticated historical queries on any given schema version using standard XQuery. The second makes schema evolution transparent to users, who can write queries against the current schema while retrieving data from one or more schema versions. The system relieves the user of the expensive and difficult task of rewriting such queries into equivalent ones for the appropriate versions of the schema. This rewriting is realized by (i) exploiting the same SMO language we defined for PRISM to represent the mappings between successive schema versions, and (ii) using an XML integrity constraint language (XIC) to capture, in the XML representation of the schema history, the mappings needed by the rewriting engine. Two implementations are presented, one relying on a native XML query engine, and another exploiting mature relational technology to overcome the XML performance bottleneck. Temporal-specific query optimization techniques are presented, both for answering complex temporal queries and for supporting efficient temporal coalescing.
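For readers unfamiliar with temporal coalescing, the sketch below shows the basic operation: rows that carry the same value and whose validity periods overlap or meet are merged into a single row. It is a generic illustration, not PRIMA's optimized implementation. The function name `coalesce`, the row layout `(value, start, end)`, the half-open `[start, end)` periods, and the sample history are all assumptions.

```python
from itertools import groupby


def coalesce(rows):
    """Merge rows with equal values whose [start, end) periods overlap or meet.

    Each row is a (value, start, end) tuple; periods are half-open.
    """
    result = []
    # Sort by value, then by start time, so mergeable rows become neighbours.
    ordered = sorted(rows, key=lambda row: (row[0], row[1]))
    for value, group in groupby(ordered, key=lambda row: row[0]):
        cur_start = cur_end = None
        for _, start, end in group:
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Overlapping or adjacent period: extend the current one.
                cur_end = max(cur_end, end)
            else:
                # Gap found: emit the finished period and start a new one.
                result.append((value, cur_start, cur_end))
                cur_start, cur_end = start, end
        result.append((value, cur_start, cur_end))
    return result


# Hypothetical history of one attribute (years used as timestamps).
history = [
    ("Engineer", 2001, 2003),
    ("Engineer", 2003, 2005),
    ("Manager", 2005, 2008),
    ("Engineer", 2010, 2011),
]
coalesced = coalesce(history)
```

In this sample the two adjacent "Engineer" periods collapse into one spanning 2001 to 2005, while the later "Engineer" period stays separate because a gap precedes it.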
The PRISM and PRIMA systems are integrated into the Panta Rhei Framework, which is designed to provide powerful tools that: (i) facilitate schema evolution and guide the Database Administrator in planning and evaluating changes, (ii) support automatic rewriting of legacy queries against the current schema version, (iii) enable efficient archiving of the histories of data and metadata, and (iv) support complex temporal queries over such histories.
Another crucial aspect of fully addressing the problem of Information Systems evolution is support for data integration. Current business reality involves frequent acquisitions and mergers of companies and organizations, and increasingly tight interaction between the information systems of cooperating companies. We present our contributions to Context-ADDICT, a framework supporting on-the-fly (semi-)automatic data integration, context modeling and context-aware data filtering. Data integration is achieved in Context-ADDICT by means of automatic ontology-based data source wrapping and integration. While this approach rests on a solid theoretical background, it also provides heuristics to solve the practical aspects of data integration in a dynamic environment. The context management aspects of Context-ADDICT are outside the scope of this thesis, and our contributions on this topic are only briefly summarized when presenting the overall framework. Context-ADDICT is still under active development, although various components have already been developed and tested with encouraging results. We conclude by proposing a unified framework that combines the contributions on schema evolution and data integration discussed so far. The framework we propose, named ADAM (Advanced Data And Metadata Manager), seeks to: (i) deliver practical solutions to the problems of automatic schema mapping and assisted schema evolution, by exploiting the synergies of an integrated approach, and (ii) enable Semantic Web applications to operate on such data.