XWF: XTF Word Forms

Steve Tinney, 9/30/04

Introduction

The XWF format is the output of a simplifying adapter that is applied to XTF files to provide a stream of words and discontinuities suitable for feeding to parsers which don't care about all the gory written details.

Data model

The data model is presently defined by cdl/tools/xwf.dtd.

The parser expects all parseable sequences to begin with a <d type="text"/> discontinuity.
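As an illustration of this requirement, the following minimal sketch checks that a stream opens with a text discontinuity. It is not part of the CDL tools: the sample document, its IDs and word forms, and the function name starts_with_text_discontinuity are all invented for the example; only the namespace and the element and attribute names come from this document.

```python
import xml.etree.ElementTree as ET

XWF_NS = "http://emegir.info/xwf"

# Hypothetical sample stream; the text ID, word IDs and word forms are
# made up for illustration and do not come from a real XTF file.
SAMPLE = """<stream xmlns="http://emegir.info/xwf" xml:id="P000001">
  <d type="text" xml:lang="sux"/>
  <w xml:id="P000001.1.1" xml:lang="sux">lugal</w>
  <w xml:id="P000001.1.2" xml:lang="sux">e2</w>
  <d type="line"/>
  <w xml:id="P000001.2.1" xml:lang="sux">dumu</w>
</stream>"""


def starts_with_text_discontinuity(xml_text: str) -> bool:
    """Return True if the first child of the stream is <d type="text"/>."""
    root = ET.fromstring(xml_text)
    children = list(root)
    if not children:
        return False
    first = children[0]
    # ElementTree spells namespaced tags as "{namespace}localname".
    return first.tag == f"{{{XWF_NS}}}d" and first.get("type") == "text"


if __name__ == "__main__":
    print(starts_with_text_discontinuity(SAMPLE))
```

A consumer could run such a check before handing the stream to a parser, so that a malformed stream is rejected early rather than misparsed.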

Element types

stream
This wrapper element carries the namespace and text ID and contains a stream of w and d child nodes.

Attributes:

xmlns
The default namespace is set to http://emegir.info/xwf
xml:id
The xml:id of the XTF document, which is the P- or Q-id of the text in the CDL system.

w = Word
This is the formatted version of the word, which may be represented in XTF by a sequence of graphemes, grapheme-groups and other features. The element's PCDATA is the form which should be used by the parser.

Attributes:

xml:id
The xml:id of the w element in the XTF file.
xml:lang
The xml:lang of the w element in the XTF file (required).
rws
The rws attribute of the w element in the XTF file (optional).

d = Discontinuity
Several XTF features are emitted as discontinuities, including formatting layout features (e.g., line-breaks), punctuation (rare in cuneiform texts), content fields and physical damage to the manuscript. This element is empty.

Attributes:

type
The type of discontinuity, which is computed from the corresponding XTF element. Possible values: text; line; field; punct; break; blank.
xml:lang
When type="text", xml:lang contains the lang of the text as a whole; it is used by the parser to determine when to treat words as foreign.
form
The form of the discontinuity is empty for line breaks (intra-line line-breaks [ATF ;] are not reported). When type="field", form is set to the value of the XTF f element's type field. When type="break" or type="blank", form is the unit: sign, word, or line. Breakage or anepigraphic regions represented in the XTF file as a number of columns are emitted with form="line" and with size set to a conventional 50 lines.
size
The extent of the break or anepigraphic region, given as an integer count of the unit named by the form attribute.
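The element types above can be read into a flat token list with a few lines of Python. This is a sketch under stated assumptions, not the parser's actual implementation: the sample stream, its IDs, the field value "sg" and the helper names read_tokens and foreign_words are hypothetical, and treating a word as foreign whenever its xml:lang differs from that of the text discontinuity is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
LANG = f"{{{XML_NS}}}lang"  # how ElementTree spells the xml:lang attribute

# Hypothetical sample stream; IDs, word forms and the field value "sg"
# are invented for illustration only.
SAMPLE = """<stream xmlns="http://emegir.info/xwf" xml:id="P000001">
  <d type="text" xml:lang="sux"/>
  <w xml:id="P000001.1.1" xml:lang="sux">lugal</w>
  <d type="field" form="sg"/>
  <w xml:id="P000001.1.2" xml:lang="akk">szarru</w>
  <d type="line"/>
  <d type="break" form="line" size="3"/>
  <w xml:id="P000001.5.1" xml:lang="sux">dumu</w>
</stream>"""


def read_tokens(xml_text):
    """Flatten the w and d children of a stream into a list of dicts."""
    tokens = []
    for node in ET.fromstring(xml_text):
        name = node.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        if name == "w":
            tokens.append({
                "kind": "w",
                "form": node.text or "",   # PCDATA: the form used for parsing
                "lang": node.get(LANG),
                "rws": node.get("rws"),    # optional, so may be None
            })
        elif name == "d":
            size = node.get("size")
            tokens.append({
                "kind": "d",
                "type": node.get("type"),
                "lang": node.get(LANG),    # only expected when type="text"
                "form": node.get("form", ""),
                "size": int(size) if size is not None else None,
            })
    return tokens


def foreign_words(tokens):
    """Return word forms whose language differs from the text language."""
    text_lang = None
    foreign = []
    for tok in tokens:
        if tok["kind"] == "d" and tok["type"] == "text":
            text_lang = tok["lang"]
        elif tok["kind"] == "w" and tok["lang"] != text_lang:
            foreign.append(tok["form"])
    return foreign


if __name__ == "__main__":
    toks = read_tokens(SAMPLE)
    print(len(toks), foreign_words(toks))
```

Keeping words and discontinuities in one ordered list preserves the stream order that a downstream parser depends on; size is converted to an integer only when the attribute is present.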

Character Set

The output character set of the XWF stream is the same as the XTF input stream; this should normally be Unicode/UTF-8.

Adapter Script

The XTF to XWF transformation is implemented by cdl/tools/xtf2xwf.xsl.


XWF questions can be directed to Steve Tinney.