XML CISWAB
This wiki page describes the steps to transform the XML-WAB format into a restricted CISWAB format for the pages of Wittgenstein's Nachlass.
Our new ideas (Oct 2013):
- The input file for our search engine must be TEI-conformant XML.
- You change the tags of our CISWAB DTD into TEI-conformant XML tags, so that the format is TEI-conformant.
- You transform the very complex XML file from Alois (see TS213_100-103.xml) directly into our CISWAB XML file.
- We no longer need the intermediate .html file and can discard our Perl programs.
- We write an XSLT stylesheet that turns the hits of our search engine (which are now XML text) into HTML output for display in the browser.
All Mathematical Notations
All mathematical notations are marked specially: we use <notation>, you call it <seg type="notation">.
Is <seg> used exclusively for notations? If not, it needs an attribute.
I found a notation: <seg type="notation">~p</seg>.
ana Attribute
This is a very good idea of yours:
ana="f:Ts-213,100r abnr:621 satznr:1545">
SPACE Problems
Spaces are a big problem: they should be preserved in the XML file, but the parser usually discards them. So we decided earlier to replace every space with the XML tag <sp/>. For example: Ich<sp/><sp/>gehe.
I do not know whether this is allowed in TEI, or whether TEI offers a better mechanism. Maybe you can find a better solution!
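The replacement described above can be sketched in a few lines of Python. This is only an illustration: the function name spaces_to_sp is made up here, and the sketch assumes the input is plain text that still has to be XML-escaped. Note that in TEI P5 the element name <sp> is already taken (it means "speech"), while a <space> element exists for significant space, so a TEI-conformant version would probably need a different tag name.

```python
from xml.sax.saxutils import escape


def spaces_to_sp(text):
    """Escape plain text for XML and replace every space with an <sp/> tag.

    Hypothetical helper: <sp/> is the tag name used on this page,
    not a TEI-conformant choice.
    """
    # Escape first, so that the inserted tags are not escaped themselves.
    return escape(text).replace(" ", "<sp/>")


print(spaces_to_sp("Ich  gehe."))
```

Because every single space becomes its own tag, runs of spaces survive any whitespace normalisation done by a parser.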
Transform the spaces in the text into an <sp/> tag or an XML entity like:
Choices
„denn es <choice> <seg>ist</seg><seg>klingt</seg></choice>“
We call this construct <alternative> and can rename it: <alternative> <alt>ist</alt><alt>klingt</alt></alternative>
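Since the two constructs have the same shape, the renaming is mechanical. A minimal sketch with Python's standard library, assuming the CISWAB file has no XML namespace (the function name rename_tags is only illustrative):

```python
import xml.etree.ElementTree as ET

# Old CISWAB tag name -> TEI tag name, as discussed above.
RENAME = {"alternative": "choice", "alt": "seg"}


def rename_tags(root):
    """Rename <alternative>/<alt> to <choice>/<seg> in place."""
    for element in root.iter():
        element.tag = RENAME.get(element.tag, element.tag)
    return root


sample = ET.fromstring(
    "<s>denn es <alternative><alt>ist</alt><alt>klingt</alt></alternative></s>"
)
print(ET.tostring(rename_tags(sample), encoding="unicode"))
```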
TEI compliant
I first changed the CISWAB output file so that it validates against the “tei_all” DTD, to see what needed to be changed. I then used the result to work out what the XSL stylesheet has to map. wab2cis has produced TS-213-max.xml from Ts-213, and the result is compliant. I have tried to stick to the structure that CIS used; hopefully this is OK as a first draft. I’m not entirely sure what we should do with all the whitespace.
Do you want to make a template for a TEI header for the CIS item, i.e. metadata stating that the file is a stripped version of File_xx?
Teiheader
The teiHeader has a minimum set of required components:
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title></title>
    </titleStmt>
    <publicationStmt></publicationStmt>
    <sourceDesc></sourceDesc>
  </fileDesc>
</teiHeader>
In addition, publicationStmt and sourceDesc each require some content of their own.
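A header template along these lines could be generated as follows. This is a sketch, not an agreed template: the function name and its parameters are hypothetical, and filling publicationStmt and sourceDesc with a single <p> each is one simple way to satisfy their content requirements. The TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET


def minimal_tei_header(title, publication_note, source_note):
    """Build the minimal teiHeader skeleton with one <p> per statement."""
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = publication_note
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = source_note
    return header


# The texts below are placeholders, not real metadata.
header = minimal_tei_header(
    "Ts-213 (CIS version)",
    "Unpublished working file.",
    "Stripped version of File_xx.",
)
print(ET.tostring(header, encoding="unicode"))
```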
Logic for Stylesheets
The logic of the stylesheet is as follows. For the transformation I mostly use a variant of the XSLT copy pattern. For CIS this means that the generic rule for an element is to apply templates to its children without copying the element itself. The following rules then override the generic one to fit the WAB files to the CIS model:
- Ignore <facsimile> elements and their children.
- Ignore <fw> elements (page numbers, often outside the structure; do you want to keep them?).
- Copy the <body> and <text> elements as is and apply templates to their child elements. (Old CISWAB only has <text>, but my understanding is that <body> is required for TEI.)
- Copy the <ab> element, with its @xml:id copied to @n, and add an @ana with the value abnr:{count of preceding <ab>s plus this <ab>}. Apply templates to the children.
- Copy the <s> element. The ids, e.g. {f:Ts-213,230r abnr:1178 satznr:3684}, are written to @ana. (Some of these could probably find better homes in other attributes, but I don’t have detailed knowledge of the TEI attributes.) Apply templates to the children.
- Copy <lb> and <pb> elements, including their @facs. (Do you want to keep the @facs?)
- Copy the <choice> element. Apply templates to the children.
- Copy a <choice> element nested within a <choice> as is. Apply templates to the children.
- Change all other child elements of <choice> (<*>, except <choice>) into <seg>, to keep the logic similar to CIS (<alternative><alt>). These <seg>s get @type='stripped' to indicate that the old element name was stripped away. Apply templates to the children.
- When a <seg> with @type='notation' is encountered, copy it as is and apply templates to its child elements.
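To make the rules easier to discuss, here is a rough Python sketch of the same logic. It is not the actual stylesheet. It assumes that the input has no XML namespace, that the f: part of @ana comes from the @n of the most recent <pb>, and that <s> gets its @xml:id copied to @n like <ab> (inferred from the example ids on this page). All function names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
DROPPED = {"facsimile", "fw"}  # ignored together with their children
COPIED = {"text", "body", "ab", "s", "lb", "pb", "choice"}


def add_text(parent, text):
    """Append text at the current end of parent's mixed content."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def new_attributes(src, state):
    """Attributes for a copied element; also updates the running counters."""
    if src.tag == "pb":
        # Assumption: the page for "f:" is taken from pb/@n.
        state["page"] = src.get("n", state["page"])
    if src.tag == "ab":
        state["ab"] += 1
        ana = f"abnr:{state['ab']}"
    elif src.tag == "s":
        state["s"] += 1
        ana = f"f:{state['page']} abnr:{state['ab']} satznr:{state['s']}"
    else:
        return dict(src.attrib)  # copied as is, including @facs
    attrs = {"n": src.get(XML_ID)} if src.get(XML_ID) else {}
    attrs["ana"] = ana
    return attrs


def apply_rules(src, out, state, in_choice=False):
    """Apply the rules to src and append the result to out."""
    if src.tag in DROPPED:
        return
    is_notation = src.tag == "seg" and src.get("type") == "notation"
    if in_choice and src.tag != "choice":
        # Any non-choice child of <choice> becomes <seg type="stripped">.
        target = ET.SubElement(out, "seg", {"type": "stripped"})
    elif src.tag in COPIED or is_notation:
        target = ET.SubElement(out, src.tag, new_attributes(src, state))
    else:
        target = out  # generic rule: keep the content, drop the element
    add_text(target, src.text)
    for child in src:
        apply_rules(child, target, state, in_choice=(src.tag == "choice"))
        add_text(target, child.tail)


def wab_to_cis(root):
    """Return the transformed <text> element of a WAB document."""
    holder = ET.Element("holder")
    apply_rules(root, holder, {"ab": 0, "s": 0, "page": ""})
    return holder.find("text")
```

A call such as wab_to_cis(ET.parse(path).getroot()) would return the new <text> element. Text following an ignored element is kept, which matches what an empty XSLT template would do.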
In the DTD you specify the XML attribute id
- I have been thinking about using xml:id, since it is used in Alois's files and is also one of the attributes allowed on all elements.
- See http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-s.html; we could probably move the values around a bit, for instance:
instead of <s n="Ts-213,i-r[10]_1" ana="f:Ts-213,i-r abnr:9 satznr:23"> (S.30) we can use <s xml:id="Ts-213,i-r[10]_1" n="23" facs="Ts-213,i-r" ana="9">
The n attribute is described as a number or label for the element, so a counter should fit well here; I think the enumeration is a good fit. The facs attribute is the same one used on the pb elements and is correct. The only thing that is not very self-describing is using the ana attribute to give the position of the containing ab element. This practice should be documented. If possible, I would suggest not using ana here, but simply pointing to the parent's ab/@n. However, putting the value into ana is allowed (it is described as one or more analytical units separated by spaces), so for ease of use it could be documented this way.
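The proposed redistribution of the values can be sketched like this. The helper name redistribute is hypothetical, and the sketch assumes every <s> carries an @n and an @ana of the form "f:... abnr:... satznr:...". One caveat that the sketch does not check: an id such as Ts-213,i-r[10]_1 contains a comma and brackets, which a validating parser may reject as an xml:id value.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def redistribute(s):
    """Move the packed @ana values of an <s> into separate attributes."""
    # "f:Ts-213,i-r abnr:9 satznr:23" -> {"f": ..., "abnr": ..., "satznr": ...}
    fields = dict(part.split(":", 1) for part in s.get("ana", "").split())
    s.set(XML_ID, s.attrib.pop("n"))  # old label becomes the id
    s.set("n", fields["satznr"])      # sentence counter
    s.set("facs", fields["f"])        # facsimile page, as on <pb>
    s.set("ana", fields["abnr"])      # position of the containing <ab>
    return s


sentence = ET.fromstring(
    '<s n="Ts-213,i-r[10]_1" ana="f:Ts-213,i-r abnr:9 satznr:23">...</s>'
)
print(ET.tostring(redistribute(sentence), encoding="unicode"))
```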
choice can enclose a choice
Is it necessary that a choice can enclose a choice? <!ELEMENT choice (choice|seg)*>
seg TAGS
The attributes of seg should be specified in detail. Instead of
<!ATTLIST seg
type CDATA #IMPLIED>
the allowed values should be clearly specified:
<!ATTLIST seg
type (stripped|notation) "stripped">
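Until the DTD is tightened, the same restriction can be checked with a small script. This is a sketch assuming no XML namespace; the function name is illustrative. A missing @type counts as "stripped", mirroring the default in the proposed ATTLIST.

```python
import xml.etree.ElementTree as ET

ALLOWED_SEG_TYPES = {"stripped", "notation"}


def invalid_seg_types(root):
    """Return the @type values of all <seg> elements outside the allowed set."""
    return [
        seg.get("type")
        for seg in root.iter("seg")
        if seg.get("type", "stripped") not in ALLOWED_SEG_TYPES
    ]


sample = ET.fromstring(
    '<s><seg type="notation">~p</seg><seg>a</seg><seg type="other">b</seg></s>'
)
print(invalid_seg_types(sample))
```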
It would also be good to specify the type of choice on the <choice> element.
linebreaks
Here is something strange: Alois always gives us a line break that indicates whether or not it is a hyphenation line break. This information is not in your file!
ozusagen -- einen Ein<lb/>flussß
should be: in<lb rend="hyphen"/>fluß
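The reason this matters for searching can be shown with a short sketch: when the plain text of a sentence is extracted, a hyphenation line break has to join the two word halves, while an ordinary line break separates two words. The sketch assumes that an <lb/> without rend="hyphen" always stands for a word boundary and renders it as a space; the function name plain_text is made up.

```python
import xml.etree.ElementTree as ET


def plain_text(element):
    """Flatten an element to text, joining words split by hyphenation breaks."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "lb":
            # Assumption: a non-hyphen line break is a word boundary.
            if child.get("rend") != "hyphen":
                parts.append(" ")
        else:
            parts.append(plain_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


sample = ET.fromstring('<s>einen Ein<lb rend="hyphen"/>fluß auf<lb/>uns</s>')
print(plain_text(sample))
```

Without the rend attribute the two cases cannot be told apart, and "Ein" and "fluß" would be indexed as separate words.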
Strange Characters
Here is another thing. What is this: Dassß diese Erfahrung aber‘
See around:
Dassß diese Erfahrung aber <choice> <seg type="stripped">das Verstehen
pagebreak tags
Our page breaks specify the facsimile. The facsimile corresponds to the actual page (this is our “et” resolution). See: <pb n="Ts-213,7r"/>