XML CISWAB
This wiki page describes the steps to transform the XML-WAB format into a restricted CISWAB format for the pages of Wittgenstein's Nachlass.
Our new ideas: Oct 2013
- The input file for our search engine must be TEI-conformant XML.
- You change the tags of our CISWAB DTD into TEI-conformant XML tags, so that the format is TEI-conformant.
- You transform the very complex XML file from Alois (see TS213_100-103.xml) directly into our CISWAB XML file.
- We no longer need the intermediate .html file and can discard our Perl programs.
- We write an XSLT stylesheet that renders the hits of our search engine (which are now XML text) as HTML for display in the browser.
All Mathematical Notations
Mathematical notations are marked specially: we use <notation>, and you call it <seg type="notation">.
Is <seg> used exclusively for notations? If not, it needs an attribute to tell notations apart.
I found a notation: <seg type="notation">~p</seg>.
ana Attribute
This is a very good idea of yours:
ana="f:Ts-213,100r abnr:621 satznr:1545">
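To show how a consumer such as the search engine could read this attribute back, here is a minimal Python sketch. It assumes that the @ana value is a whitespace-separated list of key:value tokens and that values contain no spaces; the function name parse_ana is made up for illustration.

```python
def parse_ana(ana):
    """Split an @ana value into a dict of its key:value tokens.

    Assumption: tokens are separated by whitespace and the key is
    everything before the first colon of a token.
    """
    fields = {}
    for token in ana.split():
        key, _, value = token.partition(":")
        fields[key] = value
    return fields


info = parse_ana("f:Ts-213,100r abnr:621 satznr:1545")
print(info)  # {'f': 'Ts-213,100r', 'abnr': '621', 'satznr': '1545'}
```

Packing several identifiers into one attribute keeps the element compact, at the price that every consumer has to split the value like this.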
SPACE Problems
Spaces are a big problem: they should be preserved in the XML file, but they usually get lost in parsing. So we previously decided to replace every space with the XML tag <sp/>. For example: Ich<sp/><sp/>gehe.
I do not know whether this is allowed in TEI, or whether TEI offers a better mechanism. Maybe you can find a better solution!
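The <sp/> convention described above can be sketched in Python as a round trip: every space becomes one empty <sp/> element and is turned back into a space when reading. This is only an illustration with the standard library; the function names are made up, and <ab> is used as the wrapper element purely as an assumption.

```python
import xml.etree.ElementTree as ET


def encode_spaces(text, tag="ab"):
    """Return an element in which every space is an empty <sp/> element."""
    parts = text.split(" ")
    elem = ET.Element(tag)
    elem.text = parts[0]
    for part in parts[1:]:
        sp = ET.SubElement(elem, "sp")
        sp.tail = part          # text that follows this space
    return elem


def decode_spaces(elem):
    """Rebuild the original string, one space per <sp/> element."""
    out = [elem.text or ""]
    for child in elem:
        if child.tag == "sp":
            out.append(" ")
        out.append(child.tail or "")
    return "".join(out)


encoded = encode_spaces("Ich  gehe.")
print(ET.tostring(encoded, encoding="unicode"))
print(repr(decode_spaces(encoded)))  # 'Ich  gehe.'
```

For comparison, Python's own parser keeps both spaces of <ab>Ich  gehe.</ab> in the element text, so whether spaces get lost depends on the tools in the chain and not on XML as such.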
Choices
„denn es <choice> <seg>ist</seg><seg>klingt</seg></choice>“
We call it <alternative> and can rename it: <alternative> <alt>ist</alt><alt>klingt</alt></alternative>
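The renaming is mechanical, as this small Python sketch shows. It assumes a plain one-to-one mapping from <alternative>/<alt> to <choice>/<seg> with no namespaces involved; the wrapper element <ab> and the function name are only for illustration.

```python
import xml.etree.ElementTree as ET

# Old CISWAB name -> TEI name
RENAMES = {"alternative": "choice", "alt": "seg"}


def rename_alternatives(root):
    """Rename the CISWAB elements in place and return the same tree."""
    for elem in root.iter():
        elem.tag = RENAMES.get(elem.tag, elem.tag)
    return root


old = ET.fromstring(
    "<ab>denn es <alternative><alt>ist</alt><alt>klingt</alt></alternative></ab>"
)
new = rename_alternatives(old)
print(ET.tostring(new, encoding="unicode"))
# <ab>denn es <choice><seg>ist</seg><seg>klingt</seg></choice></ab>
```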
TEI compliant
I first changed the CISWAB output file until it validated against the “tei_all” DTD, to see what needed to be changed. I then used the result to see what the XSL stylesheet needed to map. wab2cis has produced TS-213-max.xml from Ts-213, and the result is compliant. I have tried to stick with the structure that CIS used, and hopefully this is OK as a first draft. I’m not entirely sure what we should do with all the whitespace.
Do you want to make a template for a TEI header for the CIS item, i.e. metadata stating that the file is a stripped version of File_xx?
Teiheader
The teiHeader has a minimum set of required components:
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title></title>
    </titleStmt>
    <publicationStmt></publicationStmt>
    <sourceDesc></sourceDesc>
  </fileDesc>
</teiHeader>
In addition, publicationStmt and sourceDesc each require some child elements of their own.
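As a starting point for a header template, the following Python sketch builds this minimum structure. It assumes that a plain <p> paragraph inside publicationStmt and sourceDesc is enough to satisfy the requirement, which should be checked against tei_all; the function name and the sample texts are placeholders, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET


def minimal_tei_header(title, publication_note, source_note):
    """Build a teiHeader that contains only the required components."""
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    # Assumption: one prose paragraph is used as the required content.
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = publication_note
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = source_note
    return header


header = minimal_tei_header(
    "Ts-213 (CISWAB)",               # placeholder title
    "Unpublished working file",      # placeholder publication note
    "Stripped version of File_xx",   # placeholder source note
)
print(ET.tostring(header, encoding="unicode"))
```

Keeping the header in one function makes it easy to fill in the same metadata for every transformed item.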
Logic for Stylesheets
The logic for the stylesheet can be described as follows: for the transformation I mostly use a variant of the XSLT copy pattern. For CIS this means that the generic rule for an element applies templates to its children without copying the element itself. The following rules then override the generic one to fit WAB files to the CIS model.
- Ignore <facsimile> elements and their children.
- Ignore <fw> elements (page numbers, often outside of the structure; do you want to keep these?).
- Copy the <body> and <text> elements as is and apply templates to their child elements. (Old CISWAB only has <text>, but it is my understanding that <body> is required for TEI.)
- The <ab> element is copied, with its @xml:id copied to @n and an added @ana holding abnr:{count of preceding <ab> elements plus this <ab>}. Templates are applied to its children.
- The sentence-level element is copied. Its identifiers, e.g. {f:Ts-213,230r abnr:1178 satznr:3684}, are written to @ana. (Some of these could probably find better homes in other attributes, but I don’t have detailed knowledge of the TEI attributes.) Templates are applied to its children.
- <lb> and <pb> elements are copied together with their @facs. (Do you want to keep the @facs?)
- The <choice> element is copied. Templates are applied to its children.
- A <choice> element within a <choice> is copied as is. Templates are applied to its children.
- <*>: all child elements of <choice> (except <choice>) are changed into a <seg>, to keep the logic similar to CIS (<alternative><alt>). Within these <choice>s, @type='stripped' indicates that the old element name was stripped away. Templates are applied to the children.
- When a <seg> with @type='notation' is encountered, it is copied as is, and templates are applied to its children.
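To make the interplay of the generic rule and its overrides concrete, here is a rough Python sketch of the same logic. It is not the XSLT stylesheet itself but a re-statement with the standard library, under several assumptions: no namespaces, the root element is simply recreated, @type='stripped' is put on the new <seg>, and the rule that writes f, abnr and satznr to @ana is left out. The sample input and all function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # ElementTree's name for xml:id
IGNORED = {"facsimile", "fw"}   # dropped together with their children
KEPT = {"text", "body"}         # copied as is, children processed


def append_text(target, text):
    """Append character data at the end of target's mixed content."""
    if not text:
        return
    if len(target):             # text after the last child is that child's tail
        target[-1].tail = (target[-1].tail or "") + text
    else:
        target.text = (target.text or "") + text


def process_children(src, dst, counter, in_choice=False):
    """Apply the rules to all children of src, writing the output into dst."""
    append_text(dst, src.text)
    for child in src:
        transform(child, dst, counter, in_choice)
        append_text(dst, child.tail)


def transform(elem, parent, counter, in_choice=False):
    tag = elem.tag
    if tag in IGNORED:
        return
    if tag == "choice":
        out = ET.SubElement(parent, "choice")
        process_children(elem, out, counter, in_choice=True)
    elif in_choice:
        # Assumption: the marker attribute is placed on the new <seg>.
        out = ET.SubElement(parent, "seg", {"type": "stripped"})
        process_children(elem, out, counter)
    elif tag == "ab":
        counter[0] += 1         # preceding <ab> elements plus this one
        out = ET.SubElement(parent, "ab")
        if elem.get(XML_ID) is not None:
            out.set("n", elem.get(XML_ID))
        out.set("ana", f"abnr:{counter[0]}")
        process_children(elem, out, counter)
    elif tag in KEPT or (tag == "seg" and elem.get("type") == "notation"):
        out = ET.SubElement(parent, tag, dict(elem.attrib))
        process_children(elem, out, counter)
    elif tag in ("lb", "pb"):
        out = ET.SubElement(parent, tag)
        if elem.get("facs") is not None:
            out.set("facs", elem.get("facs"))
    else:
        process_children(elem, parent, counter)   # generic rule: keep content only


def wab_to_cis(root):
    """Transform a parsed WAB tree into a new CIS-style tree."""
    out_root = ET.Element(root.tag)   # assumption: the root is recreated
    process_children(root, out_root, [0])
    return out_root


SAMPLE = (                            # invented input, not real WAB data
    '<TEI><facsimile><graphic url="x.jpg"/></facsimile><text><body><fw>100</fw>'
    '<ab xml:id="ab1">denn es <choice><orig>ist</orig><reg>klingt</reg></choice>'
    '<lb facs="#f1"/> <hi>so</hi> <seg type="notation">~p</seg></ab>'
    '</body></text></TEI>'
)
result = wab_to_cis(ET.fromstring(SAMPLE))
print(ET.tostring(result, encoding="unicode"))
```

In the sample, <facsimile> and <fw> disappear, <hi> loses its tags but keeps its text, and the two children of <choice> come out as <seg> elements.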