A NUMBER of people have asked us why we have gone to all this trouble
to put a hundred-year-old dictionary on the World Wide Web. The answer
can be summed up in three words: it's free, it's big, and
it's beautiful. Not long after I finished working for the Oxford
English Dictionary and started working with the DjVu
development group at AT&T Labs, I realized that the DjVu format
offered exciting, new possibilities for publishing dictionaries and other
large-scale reference works online. A handful of experimental samples
that I created for the DjVu Zone website confirmed this, and my wife and I
began to consider how, with limited personal resources and working in our spare
time, something on this scale might actually be done.
Obviously, we needed a text in the public domain, and
several candidates immediately suggested themselves.
We could have chosen
(and indeed might still do someday) an early dictionary of chiefly historical
interest, such as Dr. Johnson's, the early Webster's, or Joseph Worcester's
dictionaries. All these texts are relatively small, which might have made
the digitization process easier, but such works are of interest and use chiefly
to scholars and would arouse only limited curiosity among the general public.
Why The Century Dictionary |
---|
Two of our goals have been to publish a large dictionary that people can still
use with profit and to promote by example the adoption of the DjVu
format for large-scale page image projects. Neither of these goals could have
been achieved if we had begun with one of these early dictionaries.
The early editions of
Funk and Wagnalls New Standard Dictionary of the
English Language and
Webster's New International Dictionary
suggested a second category of dictionary modern enough and comprehensive
enough still to be used for most purposes. They were also relatively small,
though in this case small size might have proved a liability, as these texts
were both linked directly to larger, later editions not in the public domain.
The Century Dictionary suddenly loomed before us, like a great
ship rising on the horizon, as the obvious, indeed the only choice.
In a way
The Century Dictionary can be considered the
Titanic of dictionaries,
fabled in its day as the largest, most comprehensive dictionary yet completed.
(In fact, with one exception, it is still the largest English language dictionary in existence.)
Its chief editor, William Dwight Whitney, Professor of Comparative Philology
and Sanskrit at Yale, was the most renowned English-speaking linguist of his
time. Funded by the Century Company of New York, which also published the
Century Magazine and
St. Nicholas,
Whitney, who had been an editor of the famed 1864 edition of Webster's, was
able to assemble a staff of contributors that included such luminaries of
American scholarship as Charles Peirce, Gifford Pinchot, and John Dewey. The printer of
The
Century Dictionary was Theodore Low De Vinne, the finest American printer
of the time, and the contributing artists included the young Ernest Thompson
Seton.
The Titanic of dictionaries |
---|
Indeed,
The Century Dictionary is still
considered by many scholars to be the greatest American dictionary of all time.
After the 1914 edition, however, it was dropped by the Century Company and
eventually lost to public view. Lexicographers, as a rule, tend to be a lot
more modest than dictionary marketers and publicists, but it is the latter who
create and maintain the "aura" of a dictionary in the public mind.
Without an active publishing company to tout it
with the kind of hype one hears constantly applied to certain other works,
The Century Dictionary was consigned to what one might call a mute
afterlife in the back rooms of second-hand bookshops and the dens of a few
collectors. Of course, it has always had its admirers, as evidenced, among
other things, by the fact that it is still actively cited, and perhaps more
tellingly by the fact
that over the years its contents (illustrative quotations as well as
definitions) have been quietly pillaged by other dictionaries.
From time to time, various people have floated
schemes to revive
The Century Dictionary, either by keying or scanning
and OCR, but for one reason or another these came to naught. (I suspect that
the
Ur-Idee for
The Century Dictionary Online came from one
of my colleagues at Bellcore, Robert A. Amsler, who long ago dreamed of
scanning the text and somehow creating an Internet dictionary.)
Until quite recently, the use of digital page images in a project of any
size was impractical, and so the only possibility was to extract ASCII text
of some sort. The OCR engines of even a few years ago were notoriously
unreliable with pages like those in a dictionary, that is, pages with arcane
and complex typesetting arrangements. The other alternative, keyboarding as
was done with the
OED, was prohibitively expensive for anyone but a
large publishing company with lots of money to lavish on its flagship
product. Furthermore, extracting ASCII text by whatever means was only the
beginning of a long and expensive editing process before such text could be
considered fit for public consumption. In any event, an ASCII text would not
have done justice to
The Century Dictionary. The elegant
typeface and lavish illustrations of
The Century Dictionary
are among its chief glories. Sacrificing these would mean
sacrificing a good part of the work's essential character.
DjVu offered us the perfect solution. It is
arguably the best page image format currently available. Unlike PDF, which
it resembles in certain superficial ways, it is not simply a "wrapper" for
bulky page images, but a complex image compression format that uses
shape clustering to create small, efficient images from large,
high-resolution scans.
Compression rates vary depending on a number of
factors, but typically we were able to achieve compressed sizes less than
30% of the sizes of the original black and white TIFFs of
The Century
Dictionary pages without perceptible loss of quality. DjVu pages have
the added advantage that they can be downloaded separately and display
"progressively," so users do not experience the kind of long download waits
they must endure when viewing similar sorts of files in PDF and other
formats. DjVu images can also contain an OCR text chunk, which allows them
to be searched with the plug-in's "Find" function or with an outside index.
Thus DjVu documents can be said to combine the best features of page images and
ASCII: the exact "look and feel" of the original pages and the search
capabilities normally possible only with ASCII texts.
Capturing the exact look and feel of the original
pages of
The Century Dictionary is actually more important than that
phrase may make it sound. Quite apart from the somewhat "aesthetic" desire
to have the illustrations and typeface mentioned above, there is also the
issue of the many special characters that, far from being mere decorations,
are of critical importance in accurately characterizing the lexicon. These
include phonetic representations, obsolete letters or letters from different
alphabets, special symbols and diacritical marks, and, of course, the
illustrations that in many instances may be worth the proverbial thousand words.
Most online dictionaries either eschew these special characters, thus throwing
away valuable information, or resort to clumsy workarounds, such as tiny GIF
images, to represent them. DjVu's solution to this problem is both elegant
and efficient.
With advice from the DjVu gurus at AT&T Labs,
we began somewhat timidly in the late fall of 2000 by making a few test scans
of the page with the entry for "buffalo" to determine, first of all, the kind
of image to use: color, grayscale, or black and white. DjVu can do wonderful
compressions of color images of old pages with lovely, sepia tones. Indeed,
my own "boutique" samples of Dr. Johnson and Webster for DjVu Zone had used
color quite effectively, and it was my natural instinct to attempt color as
well for
The Century Dictionary. The test results, however, were
surprisingly disappointing. A color scan at 300 DPI (dots per inch), which
is the standard resolution for files to be converted to DjVu, was satisfactory
for the printed text but far too "lossy" for the illustrations. The tightly
drawn lines suggesting the manes of the two breeds of buffalo, for instance, bled
together into an unpleasant-looking, muddy tangle. Even at 300 DPI, the color
TIFFs were already huge files requiring considerable memory to process, and it
was not really feasible with our machines to increase the resolution sufficiently
to comb clean, as it were, the buffalo's hair in color. (I might point out
parenthetically that the better-looking color
image used as a background for the Global Language Resources website is actually
a "colorized" version of a black and white page.) The grayscale test
produced somewhat smaller TIFF files, but did not really improve the
images and left the pages looking simply dingy. The black and white tests
proved in the end to be the most satisfactory in all respects. At 400 DPI, they
still produced the smallest TIFF files and yet sharp enough images to capture
the detailed lines of the illustrations. Even the printed text came out sharper
in black and white and yielded better optical character recognition. We tested
black and white images at 600 DPI as well, but the images were larger in size
and the gain in quality turned out to be minimal.
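
As a rough illustration of why the file sizes ranked this way, the sketch below computes uncompressed scan sizes for the four modes tested. The 9 by 12 inch page is a hypothetical figure chosen only for the arithmetic, not a measurement of the actual volumes, and real TIFFs may be smaller if they use internal compression.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Hypothetical page dimensions in inches (an assumption, not a measurement).
PAGE_W, PAGE_H = 9, 12

# Scan mode -> (resolution in DPI, bits per pixel).
MODES = {
    "color, 300 DPI": (300, 24),
    "grayscale, 300 DPI": (300, 8),
    "black and white, 400 DPI": (400, 1),
    "black and white, 600 DPI": (600, 1),
}

sizes = {
    name: raw_scan_bytes(PAGE_W, PAGE_H, dpi, bpp)
    for name, (dpi, bpp) in MODES.items()
}

for name, size in sizes.items():
    print(f"{name}: {size / 1e6:.2f} MB")
```

Under these assumptions a one-bit image at 400 DPI holds more pixels than a color image at 300 DPI yet needs less than a tenth of the storage, because each pixel costs one bit instead of twenty-four.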
So we decided to go with black and white images at 400
DPI. It may seem a trifle stark to some, but it is readable and it is, after
all, the way new dictionaries look in print on fresh pages. Having made this
decision, we proceeded to design the website with the black and white
color-scheme in mind. We quickly found that the dictionary itself was full of
artwork that could be used to give the website an appropriately special flavor.
The initial prototype consisted of the HTML interface and the front matter
and first sixteen pages of Volume I, which we scanned on a small "home"
scanner hooked up to an iMac.
Halting first steps and an angel |
---|
The work, which commenced in mid-January 2001,
proved to be unacceptably slow. Even having cut the pages of the sample,
without an automatic feeder we had to scan every other page upside down and
rotate it 180 degrees. However carefully we seemed to hold the pages, these
rotated images still came out rather "skewed," and in spite of our best
efforts at routinizing the process, we estimated that we could only manage
perhaps fifteen scans an hour. At this rate, assuming that we could work
flat-out three hours a night, the best we might hope for was a single
volume per month.
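
The estimate above can be checked with a little arithmetic. The sketch below uses the two figures given in the text (fifteen scans an hour, three hours a night); the 900-page volume is a hypothetical round number used only to show the calculation.

```python
import math

SCANS_PER_HOUR = 15   # figure from the estimate above
HOURS_PER_NIGHT = 3   # figure from the estimate above

def nights_needed(pages, scans_per_hour=SCANS_PER_HOUR,
                  hours_per_night=HOURS_PER_NIGHT):
    """Number of working nights needed to scan the given number of pages."""
    per_night = scans_per_hour * hours_per_night
    return math.ceil(pages / per_night)

pages_per_night = SCANS_PER_HOUR * HOURS_PER_NIGHT

# Hypothetical volume length; the real page counts are not stated here.
EXAMPLE_PAGES = 900

print(pages_per_night)
print(nights_needed(EXAMPLE_PAGES))
```

At 45 pages a night, a volume of that assumed length would take 20 nights of uninterrupted work, which is consistent with the rough figure of a volume per month.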
At this point the Century Dictionary Project
found what can best be described as its "angel." In the image
processing group at AT&T Labs, the most trusted and respected specialist in
high-volume scanning is Tom Johnson of Root Technologies in Princeton,
New Jersey. Root Technologies had been an early DjVu partner, and was responsible for
scanning such major DjVu projects as the beautiful
Fishes of Wisconsin,
The Journal of the Acoustical Society of America, and the collected
proceedings of the Neural Information Processing Systems (NIPS) conferences.
At the suggestion of Yann Le Cun of AT&T Labs, I called up Tom Johnson one
afternoon in late January, showed him my own "amateur" prototype, and made an
arrangement to have Root Technologies scan the volumes of
The Century
Dictionary. That weekend my wife carefully cut the bindings of the entire
first volume, whose pages we wrapped in cellophane and paper and packed in an
old Chianti box filled with "packing peanuts" which we mailed off to Princeton.
A week later we received the first CD-ROM with "deskewed" TIFFs of every page
of Volume I. In less than two months we were able to finish all eight volumes
of the first edition. The results speak for themselves.
The Century
Dictionary rested securely on the expertise of the De Vinne Press; in a
similar way,
The Century Dictionary Online rests securely on the
brilliant work of Root Technologies.
Of course, there was still plenty of work to keep
us busy. After each volume had been cut up and before it had been shipped out,
my wife or daughter would go through it page by page checking for damaged
pages that needed to be taped and creating the headword index that we use for
the "Find Entry" function. Among other things, this allowed us to program this
function and begin testing it before we had the actual page images. When the
TIFFs for each new volume arrived (those were very exciting days), we subjected
them to a multi-phase conversion process that lasted about twenty-four hours
for each volume.
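
The text does not say how the "Find Entry" function is implemented. One plausible way to use a headword index of this kind, sketched here with hypothetical data and function names, is to record the first headword on each page and locate a word by binary search.

```python
from bisect import bisect_right

# Hypothetical index: (first headword on the page, page number),
# sorted by headword. The entries are invented for illustration.
HEADWORD_INDEX = [
    ("a", 1),
    ("abandon", 5),
    ("abbey", 9),
    ("abdomen", 12),
]

def normalize(word):
    """Lowercase and trim a query so it compares cleanly with the index."""
    return word.strip().lower()

def find_page(word, index=HEADWORD_INDEX):
    """Return the page whose first headword is the last one <= the query."""
    keys = [head for head, _ in index]
    pos = bisect_right(keys, normalize(word)) - 1
    return index[max(pos, 0)][1]

print(find_page("abbot"))
```

This sketch compares plain lowercase strings. A real dictionary index would also need a collation rule for hyphens, spaces, and accented letters, and an index of this shape can be built and tested before any page images exist.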
First the files were copied to the hard drive of a trusty,
powerful, and very hard-working laptop computer running Linux. The next step
was the actual compression and conversion to DjVu files. Even with the processes
running in a loop, this would take several hours. The third and in some ways
the most vital step was the OCR process, again run in a UNIX shell loop, which
read each DjVu image, created an ASCII text version of that page, and inserted
it as a compressed text chunk in the DjVu file itself so that the file could be
searched using the plug-in. The OCR phase for each volume typically ran from
early evening until the next morning. When the OCR was finished, the files were
ready to upload to the website using ftp (file transfer protocol). As each
volume represented about 80 MB even compressed as DjVu, the transfer "upstream"
through a DSL line took several hours. Once the files were in place on the website, a DjVu
multipage index was created for each volume and the Find Entry program adjusted
to recognize the new pages. The process was not yet done, however. Back on the
laptop, we now used the "djvutxt" utility to extract the ASCII text from each
page into a metafile for that page, which was then indexed with all the others
for the Full Text Search function. The new index was then uploaded and "staged"
on the website.
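
The final indexing step can be pictured as building an inverted index over the per-page text. The sketch below is a minimal illustration, not the indexer actually used: it assumes the text of each page has already been extracted (for example with "djvutxt"), and the page identifiers and sample text are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """Map each word to the sorted list of page ids that contain it.

    pages: dict of page id -> extracted text of that page.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    """Return the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)

# Hypothetical extracted text for two pages, invented for illustration.
SAMPLE_PAGES = {
    "v1_p0712": "buffalo, n. A large bovine quadruped of the Old World.",
    "v1_p0713": "buffer, n. Anything that lessens the shock of a concussion.",
}

index = build_index(SAMPLE_PAGES)
print(search(index, "buffalo"))
print(search(index, "of the"))
```

Because the index maps words to pages rather than to positions within them, a search returns the page images to display, and the reader sees the original typography instead of the raw OCR text.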
Testing the work in progress |
---|
One interesting side-effect of this process was
that from the time we had the first volume ready we were able to run a beta test
of the "work in progress." The small group of invited testers included some
people who had been among my testers in the first official "Performance Trial"
of the
OED Online back in 1996, and who had chosen to stay with me later
on. They knew about online dictionaries and what to expect from them. Most of
them had not used DjVu before, but they quickly became familiar with the
plug-in's features and began to make useful suggestions about the interface, the
functionality, and the help pages. They also began reporting flaws as
they found them to a "revision file" which we will use to make corrections after
the first phase of the Century Dictionary Project is complete.
So what have we done then? Why
The Century
Dictionary Online? It needs to be emphasized, though the marketers and
publicists referred to above are loath to do so, that
no dictionary
can satisfy everyone's requirements completely. This is one reason that there
are so many specialized dictionaries of various kinds: learner's dictionaries,
technical dictionaries, dictionaries of slang, dictionaries of regional or
historical English, bilingual dictionaries, dictionaries of new words. It is
also why there are general dictionaries of different sizes: school dictionaries,
pocket dictionaries, concise dictionaries, collegiate dictionaries, and, of
course, the so-called unabridged dictionaries, though these expensive, flagship
products seem to have become a rare, perhaps even a dying breed. With paper
books, the decision about which kind of dictionary you actually needed was
relatively easy to make. If you needed quick information about relatively
common words, you needed a book you could hold in one hand and perhaps carry
around with you, a pocket dictionary or a collegiate, but not a six-inch-thick,
three-column affair or one in multiple volumes. If you were a scholar, on the
other hand, and had to find out about rare or even obsolete words, a collegiate,
in spite of the appellation, simply didn't have enough words (collegiates
typically define fewer than 200,000 terms).
The advent of online dictionaries has changed
this "rule of thumb" in interesting ways. It is hard to tell from an online
user interface and even from the look of some entries just what kind of
dictionary you are dealing with, and thus whether it is really what you need.
Another rule of thumb, unfortunately true for the most part with online
dictionaries as with those sold as books, is that the cheaper it is, the less
you have. To get access to the unabridged wordlist of the
OED, even if
you have no special taste for its Anglicized spellings, pronunciations, and
general approach, you have to pay out a considerable amount of money. You can get
the little
Webster's and the little
American Heritage free, on
the other hand, but a little dictionary is all you are getting. There is a
free version of the 1909 edition of
Webster's New International
Dictionary which has a few more, if older, terms than a modern collegiate,
but it is a fairly sloppy affair and considerably smaller than the revised
version of the same edition which appeared only a few years later.
Unfortunately, its initial keyboarders, in addition to
introducing errors, neglected to preserve such important information as the
pronunciations and Greek words in the etymologies. This is a noble effort to
provide a full-sized, free dictionary to the Internet community, but its
flaws of design and execution make it unlikely to trouble any dictionary
publisher's sleep. What might be called the Internet dictionary gap remains.
Why The Century Dictionary Online |
---|
We expect
The Century Dictionary Online to
fill this gap. As I mentioned above, no dictionary is everything to everybody,
and
The Century Dictionary Online is no exception. You will not find
terms like "bad hair day," "ribbit," "Rogernomics," or even "Reaganomics" here.
After all, the last edition of
The Century Dictionary came out when
Reagan was a toddler. More seriously, you will not find terms like "AIDS,"
"motherboard," "television," "World Wide Web," and even some common words I
have used here. Dictionary publicists always like to emphasize their new words,
but we cannot hide the fact that
The Century Dictionary Online, advanced
as it was in its day, has no newer words yet. However, this should not obscure
the more important fact that the
vast majority of English words, and virtually
all the words of the core vocabulary of English, were already
known more than a century ago and are in fact beautifully, deeply defined in
The Century Dictionary. The online interface, which allows the
instantaneous traversal of thousands of pages in many volumes, may disguise
for some users just how
large this dictionary really is, how very much is
here. It is quite simply the largest, most comprehensive dictionary freely
available on the World Wide Web. Furthermore, its American orientation
(excellent
American pronunciations, preference given to
American
spelling forms, attention to words of
American origin) gives it a
special relevance in our time. When it was first published, it was hailed as a
glory of American scholarship, yet at that time it was not at all
obvious that American English would emerge, as it has in the past century, as
by far the most important variety of English.
To return at last to the premise with which this
little preface began (in our beginnings, after all, we may find our ends), we
created
The Century Dictionary Online because it is
free, it is
big, and it is
beautiful. I should add, finally, a couple
more reasons: married to DjVu technology, it is
innovative, and perhaps
most important, it is still an
American treasure.
JEFFERY A. TRIGGS.
MADISON, N.J., March 20th, 2001.