Why The Century Dictionary?
A Preface to The Century Dictionary Online

A NUMBER of people have asked us why we have gone to all this trouble to put a hundred-year-old dictionary on the World Wide Web. The answer can be summed up in three words: it's free, it's big, and it's beautiful. Not long after I finished working for the Oxford English Dictionary and started working with the DjVu development group at AT&T Labs, I realized that the DjVu format offered exciting, new possibilities for publishing dictionaries and other large-scale reference works online. A handful of experimental samples that I created for the DjVu Zone website confirmed this, and my wife and I began to consider how, with limited personal resources and working in our spare time, something on this scale might actually be done. Obviously, we needed a text in the public domain, and several candidates immediately suggested themselves.
    We could have chosen (and indeed might still do someday) an early dictionary of chiefly historical interest, such as Dr. Johnson's, the early Webster's, or Joseph Worcester's dictionaries. All these texts are relatively small, which might have made the digitization process easier, but such works are of interest and use chiefly to scholars and would arouse only limited curiosity among the general public.
Why The Century Dictionary
Two of our goals have been to publish a large dictionary that people can still use with profit and to promote by example the adoption of the DjVu format for large-scale page image projects. Neither of these goals could have been achieved if we had begun with one of these early dictionaries. The early editions of Funk and Wagnalls New Standard Dictionary of the English Language and Webster's New International Dictionary suggested a second category of dictionary modern enough and comprehensive enough still to be used for most purposes. They were also relatively small, though in this case small size might have proved a liability, as these texts were both linked directly to larger, later editions not in the public domain. The Century Dictionary suddenly loomed before us, like a great ship rising on the horizon, as the obvious, indeed the only choice.
    In a way The Century Dictionary can be considered the Titanic of dictionaries, fabled in its day as the largest, most comprehensive dictionary yet completed. (In fact, with one exception, it is still the largest English language dictionary in existence.) Its chief editor, William Dwight Whitney, Professor of Comparative Philology and Sanskrit at Yale, was the most renowned English-speaking linguist of his time. Funded by the Century Company of New York, which also published the Century Magazine and St. Nicholas, Whitney, who had been an editor of the famed 1864 edition of Webster's, was able to assemble a staff of contributors that included such luminaries of American scholarship as Charles Peirce, Gifford Pinchot, and John Dewey. The printer of The Century Dictionary was Theodore Low De Vinne, the finest American printer of the time, and the contributing artists included the young Ernest Thompson Seton.
The Titanic of dictionaries
Indeed, The Century Dictionary is still considered by many scholars to be the greatest American dictionary of all time. After the 1914 edition, however, it was dropped by the Century Company and eventually lost to public view. Lexicographers, as a rule, tend to be a lot more modest than dictionary marketers and publicists, but it is the latter who create and maintain the "aura" of a dictionary in the public mind. Without an active publishing company to tout it with the kind of hype one hears constantly applied to certain other works, The Century Dictionary was consigned to what one might call a mute afterlife in the back rooms of second-hand bookshops and the dens of a few collectors. Of course, it has always had its admirers, as evidenced, among other things, by the fact that it is still actively cited, and perhaps more tellingly by the fact that over the years its contents (illustrative quotations as well as definitions) have been quietly pillaged by other dictionaries.


    From time to time, various people have floated schemes to revive The Century Dictionary, either by keying or scanning and OCR, but for one reason or another these came to naught. (I suspect that the Ur-Idee for The Century Dictionary Online came from one of my colleagues at Bellcore, Robert A. Amsler, who long ago dreamed of scanning the text and somehow creating an Internet dictionary.) Until quite recently, the use of digital page images in a project of any size was impractical, and so the only possibility was to extract ASCII text of some sort. The OCR engines of even a few years ago were notoriously unreliable with pages like those in a dictionary, that is, pages with arcane and complex typesetting arrangements. The other alternative, keyboarding as was done with the OED, was prohibitively expensive for anyone but a large publishing company with lots of money to lavish on its flagship product. Furthermore, extracting ASCII text by whatever means was only the beginning of a long and expensive editing process before such text could be considered fit for public consumption. In any event, an ASCII text would not have done justice to The Century Dictionary. The elegant typeface and lavish illustrations of The Century Dictionary are among its chief glories. Sacrificing these would mean sacrificing a good part of the work's essential character.
    DjVu offered us the perfect solution. It is arguably the best page image format currently available. Unlike PDF, which it resembles in certain superficial ways, it is not simply a "wrapper" for bulky page images, but a complex image compression format that uses shape clustering to create small, efficient images from large, high-resolution scans.
The perfect solution
Compression rates vary depending on a number of factors, but typically we were able to achieve compressed sizes less than 30% of the sizes of the original black and white TIFFs of The Century Dictionary pages without perceptible loss of quality. DjVu pages have the added advantage that they can be downloaded separately and display "progressively," so users do not experience the kind of long download waits they must endure when viewing similar sorts of files in PDF and other formats. DjVu images can also contain an OCR text chunk, which allows them to be searched with the plug-in's "Find" function or with an outside index. Thus DjVu documents can be said to combine the best features of page images and ASCII: the exact "look and feel" of the original pages and the search capabilities normally possible only with ASCII texts.
    Capturing the exact look and feel of the original pages of The Century Dictionary is actually more important than that phrase may make it sound. Quite apart from the somewhat "aesthetic" desire to have the illustrations and typeface mentioned above, there is also the issue of the many special characters that, far from being mere decorations, are of critical importance in accurately characterizing the lexicon. These include phonetic representations, obsolete letters or letters from different alphabets, special symbols and diacritical marks, and, of course, the illustrations that in many instances may be worth the proverbial thousand words. Most online dictionaries either eschew these special characters, thus throwing away valuable information, or resort to clumsy workarounds, such as tiny GIF images, to represent them. DjVu's solution to this problem is both elegant and efficient.
    With advice from the DjVu gurus at AT&T Labs, we began somewhat timidly in the late fall of 2000 by making a few test scans of the page with the entry for "buffalo" to determine, first of all, the kind of image to use: color, grayscale, or black and white. DjVu can do wonderful compressions of color images of old pages with lovely, sepia tones. Indeed, my own "boutique" samples of Dr. Johnson and Webster for DjVu Zone had used color quite effectively, and it was my natural instinct to attempt color as well for The Century Dictionary. The test results, however, were surprisingly disappointing. A color scan at 300 DPI (dots per inch), which is the standard resolution for files to be converted to DjVu, was satisfactory for the printed text but far too "lossy" for the illustrations. The tightly drawn lines suggesting the manes of the two breeds of buffalo, for instance, bled together into an unpleasant-looking, muddy tangle. Even at 300 DPI, the color TIFFs were already huge files requiring considerable memory to process, and it was not really feasible with our machines to increase the resolution sufficiently to comb clean, as it were, the buffalo's hair in color. (I might point out parenthetically that the better-looking color image used as a background for the Global Language Resources website is actually a "colorized" version of a black and white page.) The grayscale test produced somewhat smaller TIFF files, but did not really improve the images and left the pages looking simply dingy. The black and white tests proved in the end to be the most satisfactory in all respects. At 400 DPI, they still produced the smallest TIFF files and yet sharp enough images to capture the detailed lines of the illustrations. Even the printed text came out sharper in black and white and yielded better optical character recognition. We tested black and white images at 600 DPI as well, but the images were larger in size and the gain in quality turned out to be minimal.
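    The size differences among these scan types follow from simple arithmetic. The sketch below is only an illustration: it assumes a hypothetical 9 x 12 inch page (the actual page dimensions are not given here) and counts raw, uncompressed pixel data at 24 bits per pixel for color, 8 for grayscale, and 1 for black and white. Real TIFF files are usually compressed, so actual sizes would differ, but the proportions show why color was so much heavier to process.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed page size in inches; not taken from the dictionary itself.
PAGE_W, PAGE_H = 9, 12

# (label, resolution in DPI, bits per pixel) for the scan types discussed above.
SCAN_TYPES = [
    ("color, 300 DPI", 300, 24),
    ("grayscale, 300 DPI", 300, 8),
    ("black and white, 400 DPI", 400, 1),
    ("black and white, 600 DPI", 600, 1),
]

for label, dpi, bpp in SCAN_TYPES:
    size_mb = raw_scan_bytes(PAGE_W, PAGE_H, dpi, bpp) / 1_000_000
    print(f"{label}: about {size_mb:.1f} MB uncompressed")
```

    Under these assumptions, a black and white page at 400 DPI carries less than a tenth of the raw data of a color page at 300 DPI, even though it has more pixels.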
    So we decided to go with black and white images at 400 DPI. It may seem a trifle stark to some, but it is readable and it is, after all, the way new dictionaries look in print on fresh pages. Having made this decision, we proceeded to design the website with the black and white color-scheme in mind. We quickly found that the dictionary itself was full of artwork that could be used to give the website an appropriately special flavor. The initial prototype consisted of the HTML interface and the front matter and first sixteen pages of Volume I, which we scanned on a small "home" scanner hooked up to an iMac.
Halting first steps and an angel
The work, which commenced in mid January 2001, proved to be unacceptably slow. Even having cut the pages of the sample, without an automatic feeder we had to scan every other page upside down and rotate it 180 degrees. However carefully we seemed to hold the pages, these rotated images still came out rather "skewed," and in spite of our best efforts at routinizing the process, we estimated that we could only manage perhaps fifteen scans an hour. At this rate, assuming that we could work flat-out three hours a night, the best we might hope for was a single volume per month.
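    The estimate can be checked with a few lines of arithmetic. In this minimal sketch the scanning rate and nightly hours come from the figures above, while the 900-page volume is a hypothetical round number chosen only for illustration.

```python
import math


def nights_to_scan(pages, scans_per_hour=15, hours_per_night=3):
    """Return the number of working nights needed to scan `pages` pages."""
    pages_per_night = scans_per_hour * hours_per_night
    return math.ceil(pages / pages_per_night)


# Hypothetical volume length; the real page counts are not stated here.
ASSUMED_PAGES_PER_VOLUME = 900

print(nights_to_scan(ASSUMED_PAGES_PER_VOLUME), "nights for one volume")
```

    At forty-five pages a night, a volume of that assumed length would take twenty nights of uninterrupted work, which is roughly a month once missed evenings are allowed for.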
    At this point the Century Dictionary Project found what can best be described as its "angel." In the image processing group at AT&T Labs, the most trusted and respected specialist in high-volume scanning is Tom Johnson of Root Technologies in Princeton, New Jersey. Root Technologies had been an early DjVu partner, and was responsible for scanning such major DjVu projects as the beautiful Fishes of Wisconsin, The Journal of the Acoustical Society of America, and the collected proceedings of the Neural Information Processing Systems (NIPS) conferences. At the suggestion of Yann Le Cun of AT&T Labs, I called up Tom Johnson one afternoon in late January, showed him my own "amateur" prototype, and made an arrangement to have Root Technologies scan the volumes of The Century Dictionary. That weekend my wife carefully cut the bindings of the entire first volume, whose pages we wrapped in cellophane and paper and packed in an old Chianti box filled with "packing peanuts" which we mailed off to Princeton. A week later we received the first CD-ROM with "deskewed" TIFFs of every page of Volume I. In less than two months we were able to finish all eight volumes of the first edition. The results speak for themselves. The Century Dictionary rested securely on the expertise of the De Vinne Press; in a similar way, The Century Dictionary Online rests securely on the brilliant work of Root Technologies.
    Of course, there was still plenty of work to keep us busy. After each volume had been cut up and before it had been shipped out, my wife or daughter would go through it page by page checking for damaged pages that needed to be taped and creating the headword index that we use for the "Find Entry" function. Among other things, this allowed us to program this function and begin testing it before we had the actual page images. When the TIFFs for each new volume arrived (those were very exciting days), we subjected them to a multi-phase conversion process that lasted about twenty-four hours for each volume.
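    How the "Find Entry" program actually consults the headword index is not described here. One plausible approach, offered purely as a sketch, is to record the first headword on each page and use a binary search to find the page on which a requested word would fall. The sample headwords and page numbers below are invented, and plain lowercase string comparison stands in for true dictionary alphabetization.

```python
from bisect import bisect_right


def find_page(first_headwords, word):
    """Return the page whose range of headwords should contain `word`.

    `first_headwords` is a list of (headword, page_number) pairs giving the
    first entry on each page, sorted by headword. Words that sort before the
    first headword are sent to the first page.
    """
    keys = [headword for headword, _ in first_headwords]
    position = bisect_right(keys, word.lower()) - 1
    return first_headwords[max(position, 0)][1]


# Invented sample index; these are not the dictionary's real page numbers.
sample_index = [("buff", 710), ("buffet", 711), ("bug", 712)]

print(find_page(sample_index, "buffalo"))
print(find_page(sample_index, "bugle"))
```

    An index of this kind needs only one entry per page, which would be consistent with its having been compiled by hand before the page images existed.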
Step by step to DjVu
First the files were copied to the hard drive of a trusty, powerful, and very hard-working laptop computer running Linux. The next step was the actual compression and conversion to DjVu files. Even with the processes running in a loop, this would take several hours. The third and in some ways the most vital step was the OCR process, again run in a UNIX shell loop, which read each DjVu image, created an ASCII text version of that page, and inserted it as a compressed text chunk in the DjVu file itself so that the file could be searched using the plug-in. The OCR phase for each volume typically ran from early evening until the next morning. When the OCR was finished, the files were ready to upload to the website using ftp (file transfer protocol). As each volume represented about 80 MB even compressed as DjVu, the transfer "upstream" through a DSL line took several hours. Once the files were in place on the website, a DjVu multipage index was created for each volume and the Find Entry program was adjusted to recognize the new pages. The process was not yet done, however. Back on the laptop, we now used the "djvutxt" utility to extract the ASCII text from each page into a metafile for that page, which was then indexed with all the others for the Full Text Search function. The new index was then uploaded and "staged" on the website.
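    The last stage, extracting each page's text and indexing it for full-text search, can be sketched in a few lines. This is an illustration under stated assumptions, not the project's actual code: the command form follows the DjVuLibre version of "djvutxt" (input file, then output file), the file names are hypothetical, and the index is a simple word-to-pages map built from sample text instead of real OCR output.

```python
import re
from collections import defaultdict


def djvutxt_command(djvu_path, txt_path):
    """Build the argument list for extracting one page's hidden text.

    Assumes the DjVuLibre form: djvutxt inputdjvufile outputtxtfile.
    The command is only constructed here, not run.
    """
    return ["djvutxt", djvu_path, txt_path]


def build_index(page_texts):
    """Map each lowercase word to the sorted list of pages containing it."""
    index = defaultdict(set)
    for page, text in page_texts.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page)
    return {word: sorted(pages) for word, pages in index.items()}


def search(index, word):
    """Return the pages on which `word` occurs, or an empty list."""
    return index.get(word.lower(), [])


# Hypothetical file names and stand-in text for two pages.
print(djvutxt_command("page0001.djvu", "page0001.txt"))
sample_texts = {1: "Buffalo, a wild ox.", 2: "The ox and the buffalo."}
sample_index = build_index(sample_texts)
print(search(sample_index, "Buffalo"))
```

    Keeping one text file per page, as described above, means a search hit can point straight to the page image that contains it.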
Testing the work in progress

    One interesting side-effect of this process was that from the time we had the first volume ready we were able to run a beta test of the "work in progress." The small group of invited testers included some people who had been among my testers in the first official "Performance Trial" of the OED Online back in 1996, and who had chosen to stay with me later on. They knew about online dictionaries and what to expect from them. Most of them had not used DjVu before, but they quickly became familiar with the plug-in's features and began to make useful suggestions about the interface, the functionality, and the help pages. They also began reporting any flaws they found to a "revision file" which we will use to make corrections after the first phase of the Century Dictionary Project is complete.


    So what have we done then? Why The Century Dictionary Online? It needs to be emphasized, though the marketers and publicists referred to above are loath to do so, that no dictionary can satisfy everyone's requirements completely. This is one reason that there are so many specialized dictionaries of various kinds: learner's dictionaries, technical dictionaries, dictionaries of slang, dictionaries of regional or historical English, bilingual dictionaries, dictionaries of new words. It is also why there are general dictionaries of different sizes: school dictionaries, pocket dictionaries, concise dictionaries, collegiate dictionaries, and, of course, the so-called unabridged dictionaries, though these expensive, flagship products seem to have become a rare, perhaps even a dying breed. With paper books, the decision about which kind of dictionary you actually needed was relatively easy to make. If you needed quick information about relatively common words, you needed a book you could hold in one hand and perhaps carry around with you, a pocket dictionary or a collegiate, but not a six-inch thick three-column affair or one in multiple volumes. If you were a scholar, on the other hand, and had to find out about rare or even obsolete words, a collegiate, in spite of the appellation, simply didn't have enough words (collegiates typically define fewer than 200,000 terms).
    The advent of online dictionaries has changed this "rule of thumb" in interesting ways. It is hard to tell from an online user interface and even from the look of some entries just what kind of dictionary you are dealing with, and thus whether it is really what you need. Another rule of thumb, unfortunately true for the most part with online dictionaries as with those sold as books, is that the cheaper it is, the less you have. To get access to the unabridged wordlist of the OED, even if you have no special taste for its Anglicized spellings, pronunciations, and general approach, you have to pay out a considerable amount of money. You can get the little Webster's and the little American Heritage free, on the other hand, but a little dictionary is all you are getting. There is a free version of the 1909 edition of Webster's New International Dictionary which has a few more, if older, terms than a modern collegiate, but it is a fairly sloppy affair and considerably smaller than the revised version of the same edition which appeared only a few years later. Unfortunately, its initial keyboarders, in addition to introducing errors, neglected to preserve such important information as the pronunciations and Greek words in the etymologies. This is a noble effort to provide a full-sized, free dictionary to the Internet community, but its flaws of design and execution make it unlikely to trouble any dictionary publisher's sleep. What might be called the Internet dictionary gap remains.
Why The Century Dictionary Online

    We expect The Century Dictionary Online to fill this gap. As I mentioned above, no dictionary is everything to everybody, and The Century Dictionary Online is no exception. You will not find terms like "bad hair day," "ribbit," "Rogernomics," or even "Reaganomics" here. After all, the last edition of The Century Dictionary came out when Reagan was a toddler. More seriously, you will not find terms like "AIDS," "motherboard," "television," "World Wide Web," and even some common words I have used here. Dictionary publicists always like to emphasize their new words, but we cannot hide the fact that The Century Dictionary Online, advanced as it was in its day, has no newer words yet. However, this should not obscure the more important fact that the vast majority of English words, and virtually all the words of the core vocabulary of English, were already known more than a century ago and are in fact beautifully, deeply defined in The Century Dictionary. The online interface, which allows the instantaneous traversal of thousands of pages in many volumes, may disguise for some users just how large this dictionary really is, how very much is here. It is quite simply the largest, most comprehensive dictionary freely available on the World Wide Web. Furthermore, its American orientation (excellent American pronunciations, preference given to American spelling forms, attention to words of American origin) gives it a special relevance in our time. When it was first published, it was hailed as a glory of American scholarship, yet at that time it was not at all obvious that American English would emerge, as it has in the past century, as by far the most important variety of English.
    To return at last to the premise with which this little preface began (in our beginnings, after all, we may find our ends), we created The Century Dictionary Online because it is free, it is big, and it is beautiful. I should add, finally, a couple more reasons: married to DjVu technology, it is innovative, and perhaps most important, it is still an American treasure.

JEFFERY A. TRIGGS.
    MADISON, N.J., March 20th, 2001.