XML CISWAB

From WASTwiki

This wiki page describes the steps to transform the XML-WAB format into a restricted CISWAB format for the pages of Wittgenstein's Nachlass.

Our new ideas: Oct 2013

  1. The input file for our search engine must be XML-TEI conformant.
  2. You change the tags of our CISWAB DTD into TEI-conformant XML tags, so that the file is TEI-conformant.
  3. You transform the very complex XML file from Alois (see TS213_100-103.xml) directly into our CISWAB XML file.
  4. We no longer need the earlier .html file and can discard our Perl programs.
  5. We write an XSLT stylesheet that renders the hits of our search engine (which are now XML text) as HTML for display in the browser.


All Mathematical Notations

are marked specially in our format with <notation>; you call it <seg type="notation">.

Is <seg> used exclusively for notations, or does it otherwise need an attribute?

I found a notation: <seg type="notation">~p</seg>.

ana Attribute

This is a very good idea of yours:

ana="f:Ts-213,100r abnr:621 satznr:1545">

SPACE Problems

Spaces are a big problem: they should be present in the XML file, but they are usually discarded by the parser. So we decided earlier to replace every space with the XML tag <sp/>. For example: Ich<sp/><sp/>gehe.

I do not know whether this is allowed in TEI, or whether TEI offers a better approach. Maybe you can find a better solution!

Transform the spaces in the text into a <sp/> tag or an XML entity like:

There is a <space> element in TEI. We can replace all spaces with it.
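The replacement discussed here can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `encode_spaces` is hypothetical, the input is plain character data without markup, and every single space becomes one empty tag (default `<sp/>`, as in the example above).

```python
from xml.sax.saxutils import escape


def encode_spaces(text: str, tag: str = "sp") -> str:
    """Replace every space in plain text with an empty XML tag."""
    # Escape &, < and > first, so the inserted tags are the only markup.
    return escape(text).replace(" ", f"<{tag}/>")


print(encode_spaces("Ich  gehe."))  # Ich<sp/><sp/>gehe.
```

Passing `tag="space"` produces the TEI element name instead; whether an empty `<space/>` per blank is the best TEI encoding is left open here.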

Choices

„denn es <choice> <seg>ist</seg><seg>klingt</seg></choice>“

In CISWAB we call this <alternative> with <alt> children, and we can rename it: <alternative> <alt>ist</alt><alt>klingt</alt></alternative>
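A minimal Python sketch of this renaming, assuming documents without namespaces and a one-to-one mapping between the two vocabularies. The function name `rename_alternatives` and the `RENAME` table are hypothetical helpers for illustration, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# CISWAB element name -> TEI element name (assumed one-to-one mapping)
RENAME = {"alternative": "choice", "alt": "seg"}


def rename_alternatives(root: ET.Element) -> ET.Element:
    """Rename <alternative>/<alt> elements in place to <choice>/<seg>."""
    for elem in root.iter():
        elem.tag = RENAME.get(elem.tag, elem.tag)
    return root


src = "<s>denn es <alternative><alt>ist</alt><alt>klingt</alt></alternative></s>"
result = rename_alternatives(ET.fromstring(src))
print(ET.tostring(result, encoding="unicode"))
# <s>denn es <choice><seg>ist</seg><seg>klingt</seg></choice></s>
```

Swapping keys and values in `RENAME` gives the opposite direction.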

TEI compliant

I first changed the CISWAB output file so that it validates against the “tei_all” DTD, to see what needed to be changed. I then used the result to see what needed to be mapped by the XSL stylesheet. wab2cis has produced TS-213-max.xml from Ts-213, and the result is compliant. I have tried to stick with the structure that CIS used, and hopefully this is OK as a first draft. I'm not entirely sure what we should do with all the whitespace.

Do you want to make a template for a TEI header for the CIS item, i.e. metadata stating that the file is a stripped version of File_xx?

teiHeader

The teiHeader has a minimum set of required components:

 <teiHeader>
    <fileDesc>
       <titleStmt>
          <title></title>
       </titleStmt>
       <publicationStmt></publicationStmt>
       <sourceDesc></sourceDesc>
    </fileDesc>
 </teiHeader>

In addition, publicationStmt and sourceDesc each require some child content of their own.
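As a sketch of how such a header template could be generated, the following Python builds the skeleton above and fills publicationStmt and sourceDesc with a prose <p> each, which is the simplest content TEI accepts there. The function name `minimal_tei_header` and all text values are hypothetical placeholders, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET


def minimal_tei_header(title: str, publication: str, source: str) -> ET.Element:
    """Build a minimal teiHeader with prose-only publicationStmt and sourceDesc."""
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    # Neither element may stay empty; a single <p> is used as the minimal content.
    publication_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication_stmt, "p").text = publication
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source
    return header


# Placeholder values, for illustration only.
header = minimal_tei_header(
    "Ts-213 (CISWAB version)",
    "Unpublished working file",
    "Stripped version of File_xx",
)
print(ET.tostring(header, encoding="unicode"))
```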

Logic for Stylesheets

The logic for the stylesheet can be described as follows. For the transformation I mostly use a version of the XSLT copy pattern. For CIS this means that the generic match of an element applies templates to its children without copying the element itself. The following rules then override the generic one to fit WAB files to the CIS model.

  1. Ignore <facsimile> elements and their children.
  2. Ignore <fw> elements (page numbers, often outside the structure; do you want to keep them?).
  3. Copy the <body> and <text> elements as is and apply templates to their child elements. (Old CISWAB only has text, but as I understand it, body is required for TEI.)
  4. The <ab> element is copied, with its @xml:id copied to @n and an added @ana carrying abnr: {count of preceding <ab>s plus the current <ab>}. Templates are applied to its children.
  5. The <s> element is copied. The ids {f:Ts-213,230r abnr:1178 satznr:3684} are written to @ana. (Some of these could probably find better homes in other attributes, but I don't have detailed knowledge of the TEI attributes.) Templates are applied to its children.
  6. <lb> and <pb> elements are copied, together with their @facs. (Do you want to keep the @facs?)
  7. The <choice> element is copied. Templates are applied to its children.
  8. A <choice> element within a <choice> is copied as is. Templates are applied to its children.
  9. All other child elements of <choice> (<*>, except <choice>) are changed into a <seg>, to keep the logic similar to CIS (<alternative><alt>). These <seg>s get @type="stripped" to indicate that the old element name was stripped away. Templates are applied to their children.
  10. When a <seg> with @type="notation" is met, it is copied as is, and templates are applied to its children.
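To make the interplay of the generic rule and its overrides concrete, here is a rough Python re-statement of the rules above. It is not the XSLT stylesheet itself: the names `wab_to_cis`, `convert` and `append_text` are hypothetical, namespaces are ignored, the root element is simply copied, copied elements keep all their attributes, and the composition of @ana on <s> (rule 5) is left out.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
IGNORED = {"facsimile", "fw"}               # rules 1-2
COPIED = {"text", "body", "s", "lb", "pb"}  # rules 3, 5, 6


def append_text(parent, text):
    """Append character data after the last node currently inside parent."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def convert(src, out, state):
    """Write the CIS form of src into out, then re-attach the text after src."""
    target = out  # generic rule: drop the element name, keep its content
    if src.tag in IGNORED:
        target = None  # skip the element and everything inside it
    elif src.tag == "choice":  # rules 7-8
        target = ET.SubElement(out, "choice", dict(src.attrib))
    elif out.tag == "choice":  # rule 9: any other child of a choice
        target = ET.SubElement(out, "seg", {"type": "stripped"})
    elif src.tag == "ab":  # rule 4
        state["abnr"] += 1
        attrs = {"n": src.get(XML_ID, ""), "ana": f"abnr:{state['abnr']}"}
        target = ET.SubElement(out, "ab", attrs)
    elif src.tag == "seg" and src.get("type") == "notation":  # rule 10
        target = ET.SubElement(out, "seg", dict(src.attrib))
    elif src.tag in COPIED:
        target = ET.SubElement(out, src.tag, dict(src.attrib))
    if target is not None:
        append_text(target, src.text)
        for child in src:
            convert(child, target, state)
    append_text(out, src.tail)


def wab_to_cis(root):
    """Return a new tree holding the CIS version of a WAB document."""
    out = ET.Element(root.tag, dict(root.attrib))
    state = {"abnr": 0}
    append_text(out, root.text)
    for child in root:
        convert(child, out, state)
    return out
```

For example, feeding it `<ab xml:id="a1"><s>es <choice><orig>ist</orig><reg>klingt</reg></choice> <hi>gut</hi></s></ab>` wrapped in text/body keeps <ab> and <s>, turns <orig> and <reg> into <seg type="stripped">, and keeps only the text of the unlisted <hi>.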

In the DTD you specify the XML attribute id

  1. I have been thinking about using xml:id, since it is used in Alois's files and is one of the attributes allowed on all elements.
  2. See http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-s.html. We could probably move the values around a bit, for instance:

<s n="Ts-213,i-r[10]_1" ana="f:Ts-213,i-r abnr:9 satznr:23">(S.30)
For instance, we can use <s xml:id="Ts-213,i-r[10]_1" n="23" facs="Ts-213,i-r" ana="9">

The n attribute is described as a number or label for the element, and a counter should be good here, so I think the sentence number is a good fit. The facs attribute is the same one used on the pb elements and is correct. The only thing that is not very self-describing is using the ana attribute to give the position of the containing ab element. This practice should be documented. If possible, I would suggest not using ana here, but just pointing to the value of the parent ab/@n. However, putting the value into ana is allowed (it is described as one or more analytical units separated by spaces), so for ease of use it could be documented this way.

DECISION

No xml:id. We take this (with facs):
<s n="Ts-213,i-r[10]_1" ana="facs:Ts-213,i-r abnr:9 satznr:23">(
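Since the decision packs three values into one ana attribute, a consumer has to split them again. A minimal Python sketch, assuming the units are whitespace-separated `key:value` pairs as in the example above; the function name `parse_ana` is hypothetical.

```python
def parse_ana(ana: str) -> dict:
    """Split an ana value such as 'facs:Ts-213,i-r abnr:9 satznr:23' into fields."""
    fields = {}
    for unit in ana.split():  # split() also tolerates repeated spaces
        key, _, value = unit.partition(":")
        # Counters become integers; everything else stays a string.
        fields[key] = int(value) if value.isdigit() else value
    return fields


print(parse_ana("facs:Ts-213,i-r abnr:9 satznr:23"))
# {'facs': 'Ts-213,i-r', 'abnr': 9, 'satznr': 23}
```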

choice can enclose a choice

@Max: Is it necessary that a choice can enclose a choice? <!ELEMENT choice (choice|seg)*>

@Öyvind: Yes! There are 27 occurrences of a choice within a choice in all of Alois's XML. This was also the DTD originally described by CIS, with alternative | alt | alternative.

@Öyvind: Do you mean we keep any existing @type attributes on <choice>?

<choice type="em">
<orig type="em1">
<seg type="notation" subtype="p" rend="literal">Zei<lb rend="shyphen"/>chen
<del type="d">erklärung</del>verbindung</seg></orig>

I solved this by adding a rule that suppresses an orig element if another orig element with type="alt2" exists. There will probably be more exceptions for choosing a dipl/normal version. Maybe a better approach would be to look at which switches Vemund has used for choosing versions?

seg tags

seg should have detailed attributes. The current declaration

<!ATTLIST seg
           type CDATA #IMPLIED>

should be specified more precisely:

<!ATTLIST seg
  type (stripped|notation) "stripped">

It would also be good to specify the type of choice on <choice>.

linebreaks

Here is something strange: Alois always gives us a linebreak that indicates whether or not it is a hyphenation linebreak. This information is not in your file!

ozusagen -- einen Ein<lb/>flussß

should be: in<lb rend="hyphen"/>fluß


Strange Characters

Here is another thing. What is this: Dassß diese Erfahrung aber‘

See around:

          <s n="Ts-213,7r[5]_2" ana="f:Ts-213,7r abnr:197 satznr:486">Dassß diese Erfahrung aber <choice>
                 <seg type="stripped">das Verstehen

pagebreak tags

Our pagebreaks specify the facsimile. The facsimile corresponds to the actual page (this is our “et” resolution).

See: <pb n="Ts-213,7r"/>

Information outside sentences

Information outside sentences <s … > should be removed. An <ab> consists only of sentences, linebreaks or pagebreaks. This is very important.


<ab>
 <s ………. > | <pb> | <lb>
</ab>
Actual: <pb facs="Ts-213_i-r"
               rend="recto"
               n="pagename_Ts-213,i-r pageref_Ts-213,1"/>Ts-213#c1Ts-213#c1<s n="Ts-213,i-r[1]_1" 
               ana="facs:Ts-213,i-r  abnr:1 satznr:1">Verstehen.</s>
           <lb/>
        </ab>
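A small Python check for this constraint could look as follows. It is a sketch under assumptions: the function name `stray_content` is hypothetical, the document has no namespaces, and whitespace between elements is not counted as stray text.

```python
import xml.etree.ElementTree as ET

ALLOWED = {"s", "pb", "lb"}


def stray_content(ab: ET.Element) -> list:
    """List everything directly inside an <ab> that is not <s>, <pb> or <lb>."""
    problems = []
    if ab.text and ab.text.strip():
        problems.append(f"text: {ab.text.strip()!r}")
    for child in ab:
        if child.tag not in ALLOWED:
            problems.append(f"element: <{child.tag}>")
        # Text following a child belongs to the <ab>, not to the child.
        if child.tail and child.tail.strip():
            problems.append(f"text: {child.tail.strip()!r}")
    return problems


actual = ET.fromstring(
    '<ab><pb facs="Ts-213_i-r" rend="recto"/>Ts-213#c1Ts-213#c1'
    '<s n="Ts-213,i-r[1]_1">Verstehen.</s><lb/></ab>'
)
print(stray_content(actual))  # ["text: 'Ts-213#c1Ts-213#c1'"]
```

An empty list means the <ab> contains only the three allowed element types.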


Notation

Why is this a notation?

<s n="Ts-213,ii-r[3]_2" ana="facs:Ts-213,ii-r abnr:28 satznr:63">Er ist eine <choice type="em">
             <seg type="stripped">
             <seg type="notation">Zei<lb rend="shyphen"/>chenerklärungverbindung</seg>
             </seg>


WAB Marks

Please remove the WAB marks. It is too much for now:

<seg type="wabmarks-secml_h" part="N">?∕</seg>
   <seg type="wabmarks-secmr_h" part="N">√</seg>
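One way to strip such elements while keeping the surrounding text is sketched below in Python. The function name `drop_segs` is hypothetical, and the sketch assumes that the marks can be recognised by a @type value starting with `wabmarks-`, as in the two examples above.

```python
import xml.etree.ElementTree as ET


def drop_segs(root: ET.Element, unwanted) -> ET.Element:
    """Remove every <seg> whose @type satisfies unwanted(); keep the text after it."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "seg" and unwanted(child.get("type", "")):
                index = list(parent).index(child)
                tail = child.tail or ""
                # Move the text that follows the element to its left neighbour.
                if index > 0:
                    previous = parent[index - 1]
                    previous.tail = (previous.tail or "") + tail
                else:
                    parent.text = (parent.text or "") + tail
                parent.remove(child)
    return root


src = ('<s>a<seg type="wabmarks-secml_h" part="N">?</seg> b '
       '<seg type="wabmarks-secmr_h" part="N">x</seg>c<seg type="notation">~p</seg></s>')
cleaned = drop_segs(ET.fromstring(src), lambda t: t.startswith("wabmarks-"))
print(ET.tostring(cleaned, encoding="unicode"))
# <s>a b c<seg type="notation">~p</seg></s>
```

The predicate is a parameter, so other unwanted @type values can be filtered with the same function.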


Page numbers

Please remove the page numbers. It is too much for now:

<seg type="int-ref"
     n="Ts-213,144r_Ts-213,165r"
     corresp="Ts-213#73"
                      part="N">S. 165</seg>


edinst attribute

Please remove edinst. It is too much for now:

<seg type="edinst" part="N">
    <s n="Ts-213,145r[4]_1" ana="facs:Ts-213,145r abnr:760 satznr:2423">Zu 
 
 S. 99
           </seg>