This is the home page for the corpus building specifications used by the Cuneiform Digital Library. This page is under development: additional sections are planned to cover the generation of various kinds of scholarly metadata (signlists, concordances, dictionaries and other tools) from corpora prepared according to the corpus specifications described here.
Our basic model for corpus and tool development is that text corpora are annotated at the source level so that they can be parsed by machine and merged with lists of varying complexity to produce tools for describing and exploring the corpora. Because textual instability is a basic operating condition for cuneiform corpora, we assume that the tools must always be rebuildable programmatically from scratch from the annotated texts and the data lists; changes in texts or data lists are then naturally reflected in the working products.
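The rebuild-from-scratch model can be illustrated with a minimal Python sketch. It is not the CDL tooling: the function name build_concordance, the in-memory data shapes, and the sample text ids and readings are all assumptions made for this illustration. The point it shows is that the tool (here a small concordance) is derived entirely from the annotated texts plus a data list, so a correction to a text shows up as soon as the tool is rebuilt.

```python
from collections import defaultdict


def build_concordance(texts, headwords):
    """Rebuild a concordance from scratch; nothing is cached between runs.

    texts:     hypothetical mapping of text id -> list of lines, where each
               line is a list of (written form, headword) pairs
    headwords: hypothetical data list mapping headword -> gloss
    """
    concordance = defaultdict(list)
    for text_id in sorted(texts):
        for line_no, tokens in enumerate(texts[text_id], start=1):
            for form, headword in tokens:
                # Only headwords present in the data list are indexed.
                if headword in headwords:
                    concordance[headword].append((text_id, line_no, form))
    return dict(concordance)


# Illustrative data only; the ids and readings are made up for the sketch.
headwords = {"lugal": "king", "e": "house"}
texts = {
    "T1": [[("lugal", "lugal"), ("e2", "e")]],
    "T2": [[("lugal-e", "lugal")]],
}

before = build_concordance(texts, headwords)

# A text is corrected: the single line of T2 is now read differently.
texts["T2"] = [[("e2-a", "e")]]
after = build_concordance(texts, headwords)
```

Because the concordance is never edited by hand, the stale entry for T2 under "lugal" disappears on the second build without any separate cleanup step.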
The core CDL standard for entering textual data is known as ATF, the ASCII Transliteration Format. The ATF specification has several parts, which are described on the ATF home page.
ATF is converted to an XML form (named XTF) by a program known informally as "the ATF processor". This program validates the ATF extensively to ensure that the input is properly formed, and it validates the results against an XML schema, which is documented in the GDL and XTF2 manuals.
Lemmatization is the process of annotating each instance of a word form with its dictionary headword. CDL uses interlinear lemmatization in the ATF transliterations so that lemmatization data remains synchronized with textual changes.
Lemmatization conventions and other features for linguistic annotation of morphology and syntax are documented on the linguistic annotation page.
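To illustrate why interlinear lemmatization keeps annotations synchronized with the text, here is a minimal Python sketch. The notation it parses is a simplified assumption for this example: a transliteration line followed by a line beginning with "#lem:" whose entries are separated by ";". The actual conventions are those given on the linguistic annotation page, and the function name pair_lemmata and the sample readings are hypothetical.

```python
def pair_lemmata(lines, lem_prefix="#lem:", sep=";"):
    """Pair the words of each text line with the lemmata on the line below it.

    Returns (paired, problems). A text line whose word count no longer
    matches its lemma line is reported instead of being paired.
    """
    paired = []
    problems = []
    pending = None  # (label, words) of the most recent text line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(lem_prefix):
            if pending is None:
                problems.append(f"lemma line without a text line: {line}")
                continue
            label, words = pending
            lemmata = [item.strip() for item in line[len(lem_prefix):].split(sep)]
            if len(lemmata) != len(words):
                problems.append(
                    f"{label} has {len(words)} word(s) but {len(lemmata)} lemma(ta)"
                )
            else:
                paired.append((label, list(zip(words, lemmata))))
            pending = None
        else:
            # Assumed layout: a line label, a space, then space-separated words.
            label, _, content = line.partition(" ")
            pending = (label, content.split())
    return paired, problems


# Illustrative input only; line 2 has been edited without updating its lemmata.
sample = [
    "1. lugal e2 mu-du3",
    "#lem: lugal[king]; e[house]; du[build]",
    "2. lugal-e",
    "#lem: lugal[king]; e[house]",
]
paired, problems = pair_lemmata(sample)
```

Because each annotation sits directly beneath the line it describes, an edit that changes the number of words in a line can be detected mechanically, as the mismatch reported for line 2 shows.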
Lines in texts in the corpora can be linked to other lines using a simple interlinear notation, which is described in the documentation on linking.
This mechanism is utilized by DCCLT to manage the relationships between composite texts and individual sources; it can also be used to indicate citations of texts such as proverbs in literary or lexical compositions or omens in letters.
Textual metadata is maintained in catalogs, the most important of which is the CDLI main catalog of cuneiform documents. This catalog is a FileMaker dataset that is exported nightly to XML, and the XML version is the basis for subsequent use of the catalog in the CDL system.
The CSS used to display the documentation exposes inadequacies in Internet Explorer (6.x and below). You may need to use Safari or one of the Gecko-based browsers such as Firefox to browse this documentation tree.
Questions about this document may be directed to Steve Tinney.