The Birth of the Relational Model - Part 1 (go to Part 2)
Note - Applied Information Science has copied article from its source at http://www.intelligententerprise.com/9811/frm_online2.shtml in fear that the original posting may be removed. Certain sections have been emphasized and/or commented by the editor.
A look back at Codd's original papers and how the relational model has evolved over time in today's leading database systems
It was thirty years ago today / Dr. Edgar showed the world the way ...
-- with apologies to Lennon and McCartney
Excuse the poetic license, but it was 30 years ago, near enough, that Dr. Edgar F. Codd began work on what would become The Relational Model of Data. In 1969, he published the first in a brilliant series of highly original papers describing that work -- papers that changed the world as we know it. Since that time, of course, many people have made contributions (some of them quite major) to database research in general and relational database research in particular; however, none of those later contributions has been as significant or as fundamental as Codd's original work. A hundred years from now, I'm quite sure, database systems will still be based on Codd's relational foundation.
The First Two Papers
As I already mentioned, Codd's first relational paper, "Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks," was published in 1969. Unfortunately, that paper was an IBM Research Report; as such, it carried a Limited Distribution Notice and therefore wasn't seen by as many people as it might have been. (Indeed, it's since become something of a collector's item.) But the following year a revised version of that first paper was published in Communications of the ACM, and that version was much more widely disseminated and received much more attention (at least in the academic community). Indeed, that 1970 version, "A Relational Model of Data for Large Shared Data Banks," is usually credited with being the seminal paper in the field, though that characterization is perhaps a little unfair to its predecessor. Those first two papers of Codd's are certainly unusual in one respect: They stand up very well to being read -- and indeed, repeatedly reread -- nearly 30 years later! How many papers can you say that of? At the same time, it has to be said that they're not particularly easy to read. The writing is terse and a little dry, the style theoretical and academic, the notation and examples rather mathematical in tone. As a consequence, I'm sure I'm right in saying that to this day, only a tiny percentage of database professionals have actually read them. So I thought it would be interesting and useful to devote a short series of articles to a careful, unbiased, retrospective review and assessment of Codd's first two papers.
As I began to get involved in writing that review, however, I came to realize that it would be better not to limit myself to just the first two papers, but rather to take a look at all of Codd's early relational publications. Over the next few months, therefore, I plan to consider the following important papers of Codd's in addition to the two already mentioned: "Relational Completeness of Data Base Sublanguages;" "A Data Base Sublanguage Founded on the Relational Calculus;" "Further Normalization of the Data Base Relational Model;" "Interactive Support for Nonprogrammers: The Relational and Network Approaches;" and "Extending the Relational Database Model to Capture More Meaning."
One last preliminary remark: I don't mean to suggest that Codd's early papers got every last detail exactly right, or that Codd himself foresaw every last implication of his ideas. Indeed, it would be quite surprising if this were the case! Minor mistakes and some degree of confusion are normal and natural when a major invention first sees the light of day; think of the telephone, the automobile, or television (or even computers themselves; do you remember the prediction that three computers would be sufficient to serve all of the computing needs of the United States?). Be that as it may, I will, of course, be liberally applying the "20/20 hindsight" principle in what follows. Indeed, I think it's interesting to see how certain aspects of the relational model have evolved over time.
Codd's Fundamental Contributions
For reference purposes, let me briefly summarize Codd's major contributions here. (I limit myself to relational contributions only! It's not as widely known as it ought to be, but the fact is that Codd deserves recognition for original work in at least two other areas as well -- namely, multiprogramming and natural language processing. Details of those other contributions are beyond the scope of this article, however.)
Probably Codd's biggest overall achievement was to make database management into a science; he put the field on solid scientific footing by providing a theoretical framework (the relational model) within which a variety of important problems could be attacked in a scientific manner. In other words, the relational model really serves as the basis for a theory of data. Indeed, the term "relational theory" is preferable in some ways to the term "relational model," and it might have been nice if Codd had used it. But he didn't.
Codd thus introduced a welcome and sorely needed note of clarity and rigor into the database field. To be more specific, he introduced not only the relational model in particular, but the whole idea of a data model in general. He stressed the importance of the distinction (regrettably still widely misunderstood) between model and implementation. He saw the potential of using the ideas of predicate logic as a foundation for database management and defined both a relational algebra and a relational calculus as a basis for dealing with data in relational form. In addition, he defined (informally) what was probably the first relational language, "Data Sublanguage ALPHA;" introduced the concept of functional dependence and defined the first three normal forms (1NF, 2NF, 3NF); and defined the key notion of essentiality.
The 1969 Paper
Now I want to focus on the 1969 paper (although I will also mention points where the thinking in the 1970 paper seems to augment or supersede that of the 1969 version). The 1969 paper --which was, to remind you, entitled "Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks" -- contains an introduction and the following six sections:
1. A Relational View of Data
2. Some Linguistic Aspects
3. Operations on Relations
4. Expressible, Named, and Stored Relations
5. Derivability, Redundancy, and Consistency
6. Data Bank Control.
The paper's focus is worthy of note. As both the title and abstract suggest, the focus is not so much on the relational model per se as it is on the provision of a means of investigating, in a precise and scientific manner, certain notions of data redundancy and consistency. Indeed, the term "the relational model" doesn't appear in the paper at all, although the introduction does speak of "a relational view ... (or model) of data." The introduction also points out that the relational "view" enjoys several advantages over "the graph (or network) model presently in vogue: It provides a means of describing data in terms of its natural structure only (that is, all details having to do with machine representation are excluded); it also provides a basis for constructing a high-level retrieval language with Ĺ’maximal [sic] data independence'" (that is, independence between application programs and machine data representation -- what we would now call, more specifically, physical data independence). Note the term "retrieval language," by the way; the 1970 paper replaced it with the term "data language," but the emphasis throughout the first two papers was always heavily on query rather than update. In addition, the relational view permits a clear evaluation of the scope and limitations of existing database systems as well as the relative merits of "competing data representations within a single system." (In other words, it provides a basis for an attack on the logical database design problem.) Note the numerous hints here of interesting developments to come.
A Relational View of Data
Essentially, this section of Codd's paper is concerned with what later came to be called the structural part of the relational model; that is, it discusses relations per se (and briefly mentions keys), but it doesn't get into the relational operations at all (what later came to be called the manipulative part of the model).
The paper's definition of relation is worth examining briefly. That definition runs more or less as follows: "Given sets S1, S2, ..., Sn (not necessarily distinct), R is a relation on those n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on. We shall refer to Sj as the jth domain of R ... R is said to have degree n." (And the 1970 paper adds: "More concisely, R is a subset of the Cartesian product of its domains.")
Although mathematically respectable, this definition can be criticized from a database standpoint -- here comes the 20/20 hindsight! -- on a couple of counts. First, it doesn't clearly distinguish between domains on the one hand and attributes, or columns, on the other. It's true that the paper does introduce the term attribute later, but it doesn't define it formally and or use it consistently. (The 1970 paper does introduce the term active domain to mean the set of values from a given domain actually appearing in the database at any given time, but this concept isn't the same as attribute, either.) As a result, there has been much confusion in the industry over the distinction between domains and attributes, and such confusions persist to the present day. (In fairness, I should add that the first edition of my book An Introduction to Database Systems (Addison-Wesley, 1975) was also not very clear on the domain-vs.-attribute distinction.)
The 1969 paper later gives an example that -- at least from an intuitive standpoint -- stumbles over this very confusion. The example involves a relation called PART with (among others) two columns called QUANTITY_ON_HAND and QUANTITY_ON_ORDER. It seems likely that these two columns would be defined on the same domain in practice, but the example clearly says they're not. (It refers to them as distinct domains and then says those domains "correspond to what are commonly called ... attributes.")
Note, too, that the definition of relation specifies that the domains (and therefore attributes) of a relation have an ordering, left to right. The 1970 paper does say that users shouldn't have to deal with "domain-ordered relations" as such but rather with "their domain-unordered counterparts" (which it calls relationships), but that refinement seems to have escaped the attention of certain members of the database community -- including, very unfortunately, the designers of SQL, in which the columns of a table definitely do have a left-to-right ordering.
Codd then goes on to define a "data bank" (which we would now more usually call a database, of course) to be "a collection of time-varying relations ... of assorted degrees," and states that "each [such] relation may be subject to insertion of additional n-tuples, deletion of existing ones, and alteration of components of any of its existing n-tuples." Here, unfortunately, we run smack into the historical confusion between relation values and relation variables. In mathematics (and indeed in Codd's own definition), a relation is simply a value, and there's just no way it can vary over time; there's no such thing as a "time-varying relation." But we can certainly have variables -- relation variables, that is -- whose values are relations (different values at different times), and that's really what Codd's "time-varying relations" are.
A failure to distinguish adequately between these two distinct concepts has been another rich source of subsequent confusion. For this reason, I would have preferred to couch the discussions in the remainder of this series of columns in terms of relation values and variables explicitly, rather than in terms of just relations -- time-varying or otherwise. Unfortunately, however, this type of approach turned out to involve too much rewriting and (worse) restructuring of the material I needed to quote and examine from Codd's own papers, so I reluctantly decided to drop the idea. I seriously hope no further confusions arise from that decision!
Back to the 1969 paper. The next concept introduced is the crucial one --very familiar now, of course -- of a key (meaning a unique identifier). A key is said to be nonredundant if every attribute it contains is necessary for the purpose of unique identification; that is, if any attribute were removed, what would be left wouldn't be a unique identifier any more. ("Key" here thus means what we now call a superkey, and a "nonredundant" key is what we now call a candidate key -- candidate keys being "nonredundant," or irreducible, by definition.)
Incidentally, the 1970 paper uses the term primary key in place of just key. Observe, therefore, that "primary key" in the 1970 paper does not mean what the term is usually taken to mean nowadays, because: a) it doesn't have to be nonredundant, and b) a given relation can have any number of them. However, the paper does go on to say that if a given relation "has two or more nonredundant primary keys, one of them is arbitrarily selected and called the primary key."
The 1970 paper also introduces the term foreign key. (Actually, the 1969 paper briefly mentions the concept too, but it doesn't use the term.) However, the definition is unnecessarily restrictive, in that -- for some reason -- it doesn't permit a primary key (or candidate key? or superkey?) to be a foreign key. The relational model as now understood includes no such restriction.
Well, that's all I have room for this month. At least I've laid some groundwork for what's to come, but Codd's contributions are so many and so varied that there's no way I can deal with them adequately in just one or two articles. It's going to be a fairly lengthy journey.
No comments:
Post a Comment