The Impedance Imperative
Tuples + Objects + Infosets = Too Much Stuff!

Dave Thomas, Bedarra Corp., Carleton University and University of Queensland

COLUMN

THINGS ARE SO COMPLEX YOU NEED AN M.SC. TO PROGRAM CRUD!

Once upon a time it was possible for every new programmer to quickly learn how to write readable programs to Create, Read, Update and Delete business information. These so-called CRUD applications, along with reporting, were pervasive throughout business and essentially defined IT or MIS as it was called in those days.

These IT design patterns were so pervasive that they were incorporated into numerous useful application program generators for COBOL, PL/I and RPG (Synon, for example) and dominated the RAD market for well over a decade. Fourth generation languages such as Adabas Natural, PowerHouse, and Mapper were designed for CRUD applications operating over complex file structures and databases. Using these tools, it was straightforward for a businessperson with minimal training to develop useful, robust applications.

ALL YOU NEED IS TUPLES

The plethora of proprietary languages and file structures was a major enabler for the late Ted Codd’s [1] relational model. SQL promised a single uniform abstraction called a “relational table” and three simple and powerful operations for all applications: “Select, Project and Join”. The expressive power and uniformity of the relational model and SQL appeared very attractive to vendors who could then compete on engineering, rather than on the expressive power of their language and database structure.

SQL is quite good for simple CRUD applications on normalized tables. Unfortunately, SQL isn’t computationally complete and often needs to be embedded within another programming language in order to build a CRUD application. This resulted in several vendor-specific embedded SQL dialects that extended SQL beyond the standard. In order to navigate through relations, embedded SQL programming often requires an alternative interface using cursors (relation-relative pointers) and tuples materialized in memory as record structures. Fortunately, this hybrid embedded SQL was only used on the server, but it requires special skills and often makes applications expensive to port from one vendor to another.
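The cursor style described above can be sketched in a modern host language. The following is only an illustration, not vendor embedded SQL: it uses Python's standard sqlite3 module, and the customer table, its columns, and the overdrawn_names function are hypothetical names invented for the example.

```python
import sqlite3

# Hypothetical schema and data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO customer (id, name, balance) VALUES (?, ?, ?)",
    [(1, "Acme", 120.0), (2, "Bolt", -35.5), (3, "Cogs", 0.0)],
)


def overdrawn_names(connection):
    """Walk the result set one tuple at a time through a cursor."""
    cursor = connection.cursor()
    cursor.execute("SELECT id, name, balance FROM customer ORDER BY id")
    names = []
    while True:
        row = cursor.fetchone()    # materialize the next tuple in host memory
        if row is None:            # cursor has moved past the last tuple
            break
        _id, name, balance = row   # unpack the record structure by position
        if balance < 0:            # logic expressed in the host language, not SQL
            names.append(name)
    return names


print(overdrawn_names(conn))
```

Even in this small sketch the program switches between two languages and two views of the data: a declarative query over a relation, and a procedural loop over individual records.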

Despite the popularity of the relational model, it was often used only as an implementation technology, largely due to the need to have normalized tables. Information architects and business modellers used the more expressive and simpler Entity Relationship (ER) model (born again in UML). In order to implement a relational schema and associated operations, analysts and designers who work at the application level are forced to mechanically translate to an underlying relational model. It is surprising, given the utility of the ER model, that there were few ER databases or at least ER languages other than ZIM.

For over a decade, relational technology remained unable to compete with the strong specific solutions of previous generations. This is not an uncommon lesson and one that often gets ignored by new technology zealots. Consider how long it will be before performance of a J2EE server meets or exceeds a previous generation TP monitor.

However, after many years of engineering, the relational databases can finally claim the performance and flexibility of keyed files (blobs); network databases (special support for transitive closures such as recursive bill of materials or CAD); and at least rudimentary support for text (XML in a Blob). Unfortunately each of these database and SQL extensions is quite proprietary, much in the way the early 4GLs were.

MINIMAL AFFORDANCES FOR OBJECTS AND CONTENT

In order to accommodate the increasing demand for objects and content, the DBMS vendors replied with the Third Generation Database Manifesto [3]. In particular, they added new native types to the database to support objects (called user-defined data types) and large text types. Both of these extended types were syntactic extensions of Blobs, which were added largely to support images and documents. SQL was extended to allow query operations over Blobs using special content selector objects. Recently, text types have been enhanced to support XML schemas or DTDs.

ALL YOU NEED IS OBJECTS

In the mid 80s, programming languages moved away from procedural languages to object languages. At this point the technical seam between objects and tuples became clearly exposed since objects always encapsulated their data and hence tuples or tables needed to be explicitly materialized as objects or tuples (objectified or tuplified). Object zealots pontificated the Object-Oriented Database Manifesto [2] touting the expressive power and uniformity of objects (starting to sound familiar?). Object databases, it was claimed, solved the impedance mismatch between the object programming language and the database since everything was an object.

In the hopes of becoming the next Oracle, VC-funded OODBMS vendors rushed to bring out object-only databases such as Gemstone, Object Design, and Objectivity. Unfortunately for OODBMS advocates, while there are some solutions (AS/400 and Gemstone persistent stores) that have been very successful for application development, for the most part relational databases do CRUD and offer a simplicity and performance that has been impossible to match except in niche applications such as CAD.

WRAP IT AND MAP IT

In order to deal with the entrenched nature of relational data, the OO language community opted for wrapping the relational database as an object. This still left a world far too complex for most developers, who were required to objectify or tuplify data as it moved between “object land” and “tuple land”. Suffice it to say that it is a non-trivial exercise to write an object program that accesses a relational database. To address this issue, a number of researchers and vendors developed relational mapping frameworks such as TopLink and ADO. These mapping frameworks, with their associated tools and wizards, reduced the need for developers to understand the mapping details. However, to use them properly, developers required an intimate knowledge of the framework.
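To make the objectify/tuplify round trip concrete, here is a minimal hand-written sketch using Python's standard sqlite3 and dataclasses modules. It does not represent how TopLink, ADO, or any other framework works; the LineItem class, the line_item table, and the tuplify and objectify helpers are hypothetical names chosen for the illustration.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class LineItem:
    """An object in "object land": data plus behaviour."""
    sku: str
    quantity: int
    unit_price: float

    def total(self) -> float:
        return self.quantity * self.unit_price


def tuplify(item: LineItem) -> tuple:
    """Flatten an object into a positional tuple for "tuple land"."""
    return astuple(item)


def objectify(row: tuple) -> LineItem:
    """Rebuild an object from a row; the column order must match the class."""
    return LineItem(*row)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_item (sku TEXT, quantity INTEGER, unit_price REAL)")
conn.execute("INSERT INTO line_item VALUES (?, ?, ?)", tuplify(LineItem("A-1", 3, 2.5)))

rows = conn.execute("SELECT sku, quantity, unit_price FROM line_item").fetchall()
items = [objectify(row) for row in rows]
print(items[0].total())
```

The field list appears three times here (class, table definition, and query), and nothing checks that the three stay in step. That redundancy is the burden mapping frameworks try to hide.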

The constant need to materialize and dematerialize objects is extremely inefficient. A lot of machine cycles are wasted using these wrappers and mappers. Indeed, we are fortunate that disks are so slow and CPUs are so fast. However, memory-resident relational data is becoming commonplace with 64-bit machines, and I doubt the overhead will be tolerable in the long term. In principle, it should be possible to optimize through these frameworks; however, this requires that the compiler have intimate knowledge of the mapper, the wrapper and potentially the database itself.

ALL YOU NEED IS <INFOSETS>

XML is rapidly moving beyond DTDs and documents to the brave new world of XML schemas and Infosets [9]. All of the major vendors have announced XQuery [10] product offerings of one form or another. XML Schema adds another declarative form to the impedance stack, while XQuery introduces another sophisticated language for processing semi-structured information. Both the XML Infoset and Object proponents argue convincingly that the world is a tree, or potentially a digraph; for example, a purchase order contains the supplier and purchaser information followed by a variable-length list of line items being purchased. Those who have just come up to speed on DTDs and XSLT will need to keep running to keep up with the XML train.

XQUERY – YET ANOTHER QUERY LANGUAGE

In order to provide a sound foundation for XML query and update processing of semi-structured information, the XQuery activity has defined a semantic model for XML based on Infoset [9]. It is too early to tell how well XQuery will be accepted and used, although the major vendors are all showing planned products that implement XQuery in one form or another.

This means that a developer will need to know one language for walking the tuples (e.g., SQL), another for navigating the objects (e.g., Java or C#), and a third for the Infoset (e.g., XQuery). XQuery also assumes knowledge of XPath, which is being refined in parallel with XQuery. While XQuery should provide the ability to directly query both tuples and infosets, since in principle infosets are a superset of tuples, it isn’t clear that this can be done as elegantly as with SQL. It can be argued that, given the ability to directly query both relational and XML data, one can handle lots of problems without needing objects. The frustration with XQuery is that it is not a complete language, which means that once again developers must use multiple languages to achieve a simple CRUD application with a decent UI.

Unfortunately, the lack of explicit XML values means that parsing and generation tools always need to be present to interface between infosets and current OO programming languages. Nowhere is this more evident than in SOAP and XML-RPC, the XML equivalents of procedural RPC and OO RMI. The complexity of serializing and deserializing objects and tuples to and from XML itself requires a special-purpose toolkit.

While there are many critics who claim that XML processing can never be efficient, the same was said of relational databases and object technology. There are well-known techniques for efficient serialization. Further, recent research has demonstrated that most XML processing can be done without an inefficient memory-based DOM model. Both event-driven parsing and streaming are becoming the rage, although they have existed for years in content engineering languages [11] such as Omnimark [5] as alternatives to XSLT, which is inefficient and error-prone to debug.
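As a rough illustration of event-driven processing without a DOM, the sketch below uses the SAX interface in Python's standard library (xml.sax). It is not Omnimark and not any particular vendor's streaming engine; the purchase-order document, its element and attribute names, and the QuantityCounter handler are assumptions made up for the example.

```python
import xml.sax

# Hypothetical purchase order, invented for this illustration.
ORDER_XML = b"""<order>
  <supplier>Acme</supplier>
  <purchaser>Bolt</purchaser>
  <item sku="A-1" quantity="3"/>
  <item sku="B-2" quantity="5"/>
</order>"""


class QuantityCounter(xml.sax.ContentHandler):
    """Accumulates a running total as parse events arrive; no tree is built."""

    def __init__(self):
        super().__init__()
        self.total = 0

    def startElement(self, name, attrs):
        # Called once per opening tag, in document order.
        if name == "item":
            self.total += int(attrs["quantity"])


handler = QuantityCounter()
xml.sax.parseString(ORDER_XML, handler)
print(handler.total)
```

The handler keeps only the state it needs (a single integer), so memory use does not grow with the number of line items, in contrast to loading the whole order into a DOM.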

A TOOL OPPORTUNITY

It is clearly unnecessary to have three different ways to declare, navigate, communicate and convert values from these three solitudes. It is even worse if one chooses to use multiple OO languages, different DBMS vendors, and different XML schema processors. But this is not an unrealistic scenario for development of an enterprise application. Clearly the impedance problem gives rise to a tool opportunity.

Meta programming [11] or generative programming is clearly the least offensive way to cope with this mess. A model-driven generator can clearly address the syntactic redundancy and associated mappings. This, however, is the easy part. Processing is still far too complicated. Unfortunately, most generative tools do not support debugging at the level of the abstraction, forcing programmers to have deep knowledge of the generated code and the underlying framework.
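The "easy part" can be sketched in a few lines. The toy generator below derives both a relational table definition and an object class from one declarative model, so the field list is written only once. It is a sketch under stated assumptions, not a description of any real tool: the MODEL dictionary format, the SQL_TYPES mapping, and the generate_ddl and generate_class functions are all hypothetical.

```python
import sqlite3

# Hypothetical single source of truth for one business entity.
MODEL = {
    "name": "Customer",
    "fields": [("id", "int"), ("name", "str"), ("balance", "float")],
}

# Assumed mapping from model types to SQLite column types.
SQL_TYPES = {"int": "INTEGER", "str": "TEXT", "float": "REAL"}


def generate_ddl(model):
    """Emit the "tuple land" declaration for the model."""
    columns = ", ".join(f"{name} {SQL_TYPES[kind]}" for name, kind in model["fields"])
    return f"CREATE TABLE {model['name'].lower()} ({columns})"


def generate_class(model):
    """Emit the "object land" declaration for the same model as source text."""
    lines = [
        "from dataclasses import dataclass",
        "",
        "",
        "@dataclass",
        f"class {model['name']}:",
    ]
    lines += [f"    {name}: {kind}" for name, kind in model["fields"]]
    return "\n".join(lines) + "\n"


ddl = generate_ddl(MODEL)
sqlite3.connect(":memory:").execute(ddl)  # check that the generated DDL is accepted
print(ddl)
print(generate_class(MODEL))
```

Note what the sketch leaves out: the generated artifacts are plain text, so any error surfaces in the generated code rather than in the model, which is the debugging problem described above.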

SURELY WE CAN DO BETTER!

The impedance of incompatible type systems imposes a constant runtime overhead in addition to the syntactic burden. We need to move beyond the three solitudes and go beyond the gratuitous complexity that exists today. There is no reason we can’t develop languages that are as productive and easy to use as 4GLs and that have underlying execution semantics based on an integrated type system where tuples, objects and infosets are all first class. Microsoft .NET has already demonstrated that it is possible to have an OO language that can contain tuples and records as native types.

There are some early exemplars that show lots of promise. XDuce [6] provides an interesting semantics for Infosets and related research. Xtatic investigates XML as a native type in an OO language like C#. The recent research paper on the unification of tables, objects and documents [7] provides an interesting example of how an existing OO language such as C# or Java can be semantically and syntactically enhanced to address the problem.

We need to step back and consider the accidental complexity that arises when all of the various technical components are presented to a business developer. In isolation, each of these technologies has clear merits, but even a simple business application is far too complex when we compose the parts. We need to apply our considerable efforts to developing languages and tools as simple and useful for business users as 4GLs have been and continue to be. We need a computationally complete end-user programming language that allows a mere mortal to create and deploy applications across a federated collection of semi-structured information.

REFERENCES

[1] E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387.

[2] Malcolm Atkinson, François Bancilhon, David DeWitt, Klaus Dittrich, David Maier, Stanley Zdonik, “The Object-Oriented Database System Manifesto”, Proceedings of the First International Conference on Deductive and Object-Oriented Databases (1989)

[3] C. J. Date, Hugh Darwen, Foundation for Future Database Systems: The Third Manifesto, Second Edition, Addison-Wesley, 2000 (ISBN: 0-201-70928-7).

[4] Eric van der Vlist, Comparing XML Schema Languages, http://www.xml.com/pub/a/2001/12/12/schemacompare.html

[5] Mark Baker, Internet Programming with OmniMark, Kluwer Academic Publishers, Boston, ISBN 0-7923-7237-9, October 2000, 412 pp.

[6] Haruo Hosoya and Benjamin C. Pierce, XDuce, http://xduce.sourceforge.net/papers.html

[7] Erik Meijer and Wolfram Schulte, Unifying Tables, Objects, and Documents, http://research.microsoft.com/~emeijer/Papers/XS.pdf

[8] Dave Thomas, Content Engineering – Time To Get Serious About Semi-Structured Information, Otland September 2003.

[9] The XML Infoset, http://www.w3.org/TR/xml-infoset/

[10] D. Chamberlin, “XQuery: An XML query language”, tutorial overview, IBM Systems Journal 41(4), 2002.

[11] Dave Thomas: “Reflective Software Engineering - From MOPS to AOSD”, in Journal of Object Technology, vol. 1, no. 4, September-October 2002, pp. 17-26. http://www.jot.fm/issues/issue_2002_09/column1

About the author

Dave Thomas is cofounder/chairman of Bedarra Research Labs (www.bedarra.com), www.Online-Learning.com and the Open Augment Consortium (www.openaugment.org) and a founding director of the Agile Alliance (www.agilealliance.com). He is an adjunct research professor at Carleton University, Canada and the University of Queensland, Australia. Dave is the founder of and past CEO of Object Technology International (www.oti.com) creators of the Eclipse IDE Platform, IBM VisualAge for Smalltalk, for Java, and MicroEdition for embedded systems. Contact him at dave@bedarra.com or www.davethomas.net.

Cite this column as follows: Dave Thomas: “The Impedance Imperative – Tuples+Objects+Infosets=Too Much Stuff!”, in Journal of Object Technology, vol. 2, no. 5, September-October 2003, pp. 7-12. http://www.jot.fm/issues/issue_2003_09/column1
