XML CISWAB: Difference between revisions

Revision as of 19 November 2013, 09:21

This wiki page describes the steps to transform the XML-WAB format into a restricted CISWAB format for the pages of Wittgenstein's Nachlass.

Our new ideas (October 2013)

  1. The input file for our search engine must be TEI-conformant XML.
  2. You change the tags of our CISWAB DTD into TEI-conformant XML tags, so that the format is TEI-conformant.
  3. You transform the very complex XML file from Alois (see TS213_100-103.xml) directly into our CISWAB XML file.
  4. We no longer need the earlier .html file and can throw away our Perl programs.
  5. We write an XSLT stylesheet that renders the hits of our search engine (which are now XML text) as HTML for display in the browser.


Mathematical notations

All mathematical notations are marked specially. We used <notation>; you call it <seg type="notation">.

Is <seg> used exclusively for notations? If not, it needs an attribute.

I found a notation: <seg type="notation">~p</seg>

ana Attribute

This is a very good idea of yours:

ana="f:Ts-213,100r abnr:621 satznr:1545">

Space problems

Spaces are a big problem: they should be present in the XML file, but the parser usually discards them. So we previously decided to replace every space with the XML tag <sp/>. For example: Ich<sp/><sp/>gehe.

I have no idea whether this is allowed in TEI or whether TEI has a better mechanism. Maybe you can find a better solution!

Transform the spaces in the text into a <sp/> tag or an XML entity like:

There is a <space> element in TEI. We can replace all spaces with this.
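
The replacement described above can be sketched in a few lines of Python. This is only an illustration: the function name mark_spaces is made up, it works on plain character data (not on a parsed XML tree), and whether one empty <space/> per space character is the right TEI encoding remains an open question.

```python
def mark_spaces(text, tag="space"):
    """Replace every space character in `text` with an empty XML element.

    Assumption: `text` is raw character data without markup, so no
    escaping or attribute handling is needed.
    """
    # One element per space, so runs of spaces survive any later
    # whitespace normalisation by a parser.
    return text.replace(" ", "<%s/>" % tag)


# The example from this page, once with the TEI name and once with
# our old tag name.
print(mark_spaces("Ich  gehe."))
print(mark_spaces("Ich  gehe.", tag="sp"))
```

Keeping the tag name as a parameter makes it easy to switch between the old <sp/> and the TEI <space/> while the decision is pending.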

Choices

„denn es <choice> <seg>ist</seg><seg>klingt</seg></choice>“

We call it <alternative> and can rename it: <alternative> <alt>ist</alt><alt>klingt</alt></alternative>

TEI compliant

I first changed the CISWAB output file so that it validates against the “tei_all” DTD, to see what needed to be changed. I then used the result to see what needed to be mapped by the XSL stylesheet. wab2cis has produced TS-213-max.xml from Ts-213, and the output is compliant. I have tried to stick with the structure that CIS used, and hopefully this is OK as a first draft. I’m not entirely sure what we should do with all the whitespace.

Do you want to make a template for a TEI header for the CIS item, i.e. metadata stating that the file is a stripped version of File_xx?

teiHeader

The teiHeader has a minimum set of required components:

 <teiHeader>
    <fileDesc>
       <titleStmt>
          <title></title>
       </titleStmt>
       <publicationStmt></publicationStmt>
       <sourceDesc></sourceDesc>
    </fileDesc>
 </teiHeader>

publicationStmt and sourceDesc each require some additional content.
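
As a starting point for a header template, here is a small Python sketch that builds the minimal structure with the standard library. It assumes that a single <p> is acceptable as the content of publicationStmt and sourceDesc (the simplest form TEI allows); the function name and the example strings are placeholders, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET


def minimal_tei_header(title, publication, source):
    """Build a teiHeader with the minimum required components."""
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")

    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    # Assumption: a prose paragraph is enough for these two statements.
    pub_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub_stmt, "p").text = publication

    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source
    return header


# Placeholder metadata, not the real values for any file.
header = minimal_tei_header(
    "Ts-213 (CISWAB version)",
    "Internal working file",
    "Stripped version of File_xx",
)
print(ET.tostring(header, encoding="unicode"))
```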

Logic for Stylesheets

The stylesheet logic is as follows. For the transformation I mostly use a variant of the XSLT copy pattern. For CIS, the generic rule is that matching an element applies templates to its children without copying the element itself. The rules below then override the generic rule to fit the WAB files to the CIS model.

  1. Ignore <facsimile> elements and their children.
  2. Ignore <fw> elements (page numbers, often outside the structure; do you want to keep them?).
  3. Copy the <text> and <body> elements as is and apply templates to their child elements. (The old CISWAB only has <text>, but as I understand it <body> is required for TEI.)
  4. The <ab> element is copied, with its @xml:id copied to @n and an added @ana whose value is abnr:{count of preceding <ab>s plus this <ab>}. Templates are applied to the children.
  5. The <s> element is copied. The ids {f:Ts-213,230r abnr:1178 satznr:3684 } are written to @ana. (Some of these could probably find better homes in other attributes, but I don’t know the TEI attributes in enough detail.) Templates are applied to the children.
  6. <lb> and <pb> elements are copied, including their @facs. (Do you want to keep the @facs?)
  7. The <choice> element is copied. Templates are applied to the children.
  8. A <choice> element within a <choice> is copied as is. Templates are applied to the children.
  9. All other child elements of <choice> (everything except <choice>) are changed into <seg>, to keep the logic similar to CIS (<alternative><alt>). These <seg>s get @type="stripped" to indicate that the old element name was stripped away. Templates are applied to the children.
  10. When a <seg> with @type="notation" is met, it is copied as is, and templates are applied to its child elements.
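
To make the rules above easier to discuss, here is a rough sketch of them in Python rather than XSLT. It is not the actual stylesheet: all function names are made up, the TEI namespace is ignored, <s> is simply copied with its existing attributes (how the ids for @ana are assembled is not covered), and the element names in the sample (orig, corr, hi) are only illustrative.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
IGNORED = {"facsimile", "fw"}                            # rules 1 and 2
COPIED = {"text", "body", "s", "lb", "pb", "choice"}     # rules 3, 5, 6, 7, 8


def add_text(parent, text):
    """Append character data to parent, after its last child if it has one."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def apply_children(src, out, state, in_choice=False):
    """Write the text and the transformed children of src into out."""
    add_text(out, src.text)
    for child in src:
        transform(child, out, state, in_choice)
        add_text(out, child.tail)


def transform(el, out, state, in_choice):
    """Apply the first matching rule to el and append the result to out."""
    if el.tag in IGNORED:
        return
    if el.tag == "ab":                                       # rule 4
        state["abnr"] += 1
        attrs = {"n": el.get(XML_ID, ""), "ana": "abnr:%d" % state["abnr"]}
        apply_children(el, ET.SubElement(out, "ab", attrs), state)
    elif el.tag == "seg" and el.get("type") == "notation":   # rule 10
        apply_children(el, ET.SubElement(out, "seg", dict(el.attrib)), state)
    elif el.tag in COPIED:
        new = ET.SubElement(out, el.tag, dict(el.attrib))
        apply_children(el, new, state, in_choice=(el.tag == "choice"))
    elif in_choice:                                          # rule 9
        new = ET.SubElement(out, "seg", {"type": "stripped"})
        apply_children(el, new, state)
    else:                                                    # generic rule
        apply_children(el, out, state)


def wab_to_cis(text_el):
    """Return a transformed copy of a <text> element."""
    out = ET.Element(text_el.tag, dict(text_el.attrib))
    apply_children(text_el, out, {"abnr": 0})
    return out


sample = ET.fromstring(
    '<text><body><fw>7</fw>'
    '<ab xml:id="Ts-213,7r[5]"><s>denn es '
    '<choice><orig>ist</orig><corr>klingt</corr></choice>'
    ' <hi>so</hi><lb facs="Ts-213,7r"/></s></ab>'
    '</body></text>'
)
result = wab_to_cis(sample)
print(ET.tostring(result, encoding="unicode"))
```

The order of the branches matters: the notation <seg> and the copied elements are checked before the "child of <choice>" rule, so a nested <choice> stays a <choice> and is not turned into a stripped <seg>.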

In the DTD you specify the XML attribute id

  1. I have been thinking about using xml:id, since it is used in Alois's files and is also one of the attributes allowed on all elements.
  2. See http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-s.html. We could probably move the values around a bit. For instance, instead of

<s n="Ts-213,i-r[10]_1" ana="f:Ts-213,i-r abnr:9 satznr:23">(S.30)

we can use

<s xml:id="Ts-213,i-r[10]_1" n="23" facs="Ts-213,i-r" ana="9">

The n attribute is described as a number or label for the element, so a counter such as the sentence number is a good fit here. The facs attribute is the same one used on the pb elements and is correct. The only thing that is not self-describing is using the ana attribute to give the position of the containing ab element. This practice should be documented. If possible I would suggest not using ana here and just pointing to the parent's ab/@n instead. But putting the value into ana is allowed (it is described as one or more analytical units separated by spaces), so for ease of use it could be documented that way.

Decision

No xml:id. We take this (with facs):
<s n="Ts-213,i-r[10]_1" ana="facs:Ts-213,i-r abnr:9 satznr:23">(
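
Since the @ana value packs several key:value pairs into one string, consumers will need to split it again. A minimal sketch, assuming the pairs are separated by whitespace and each key is separated from its value by the first colon (the function name parse_ana is made up):

```python
def parse_ana(ana):
    """Split an @ana value like 'facs:Ts-213,i-r abnr:9 satznr:23' into a dict."""
    fields = {}
    # split() without arguments also tolerates double spaces between pairs.
    for token in ana.split():
        key, _, value = token.partition(":")
        fields[key] = value
    return fields


print(parse_ana("facs:Ts-213,i-r abnr:9 satznr:23"))
```

The values stay strings; converting abnr and satznr to integers is left to the caller.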

choice can enclose a choice

@Max: Is it necessary that a choice can enclose a choice? <!ELEMENT choice (choice|seg)*>

@Öyvind: Yes! There are 27 occurrences of a choice within a choice in all of Alois's XML. This was also the DTD originally described by CIS, with alternative | alt | alternative.
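
A count like this can be re-checked on any file with a few lines of Python. The sketch below counts <choice> elements that are direct children of another <choice>; it assumes un-namespaced element names, and the sample string is invented for the demonstration, not taken from the Nachlass files.

```python
import xml.etree.ElementTree as ET


def count_nested_choices(root):
    """Count <choice> elements whose parent is also a <choice>."""
    return sum(
        1
        for outer in root.iter("choice")
        for child in outer
        if child.tag == "choice"
    )


# Invented sample: one nested choice and one plain choice.
sample = ET.fromstring(
    "<s><choice><seg>a</seg>"
    "<choice><seg>b</seg><seg>c</seg></choice>"
    "</choice><choice><seg>d</seg></choice></s>"
)
print(count_nested_choices(sample))
```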

@Öyvind: Do you mean we keep any existing @type attributes on <choice>?

<choice type="em">
<orig type="em1">
<seg type="notation" subtype="p" rend="literal">Zei<lb rend="shyphen"/>chen
<del type="d">erkl¨rung</del>verbindung</seg></orig>

I solved this by adding a rule that suppresses <orig> elements if another <orig> element with type="alt2" exists. There will probably be more exceptions when choosing a dipl/normal version. Maybe a better approach would be to look at the switches Vemund has used for choosing versions?

seg TAGS

<seg> should have precisely specified attributes. The declaration

<!ATTLIST seg

           type CDATA #IMPLIED>

should be clearly specified, for example:

<!ATTLIST seg

  type (stripped|notation) "stripped">

It would also be good to specify the type of choice on <choice>.

linebreaks

Here is something strange: Alois always gives us a line break that indicates whether or not it is a hyphenation line break. This information is not in your file!

ozusagen -- einen Ein<lb/>flussß

should be: in<lb rend="hyphen"/>fluß
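
The reason the @rend value matters shows up as soon as the text is flattened for searching. In the sketch below a hyphenating <lb> joins the two word halves, while any other <lb> becomes a space. Treating both "hyphen" and "shyphen" as hyphenation markers is an assumption on my side, and the function name and sample sentence are only illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: both values mark a line break inside a hyphenated word.
HYPHEN_RENDS = {"hyphen", "shyphen"}


def plain_text(el):
    """Flatten el to text, joining words across hyphenating <lb> elements."""
    parts = [el.text or ""]
    for child in el:
        if child.tag == "lb":
            # A normal line break separates words; a hyphenating one does not.
            if child.get("rend") not in HYPHEN_RENDS:
                parts.append(" ")
        else:
            parts.append(plain_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


sample = ET.fromstring('<s>einen Ein<lb rend="hyphen"/>fluß auf<lb/>uns</s>')
print(plain_text(sample))
```

Without the @rend information the first break would also be turned into a space, and a search for the full word would miss the hit.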


Strange Characters

Another thing: what is this? Dassß diese Erfahrung aber‘

See the surrounding context:

          <s n="Ts-213,7r[5]_2" ana="f:Ts-213,7r abnr:197 satznr:486">Dassß diese Erfahrung aber <choice>
                 <seg type="stripped">das Verstehen

pagebreak tags

Our page breaks specify the facsimile. The facsimile corresponds to the actual page (this is our “et” resolution):

See: <pb n="Ts-213,7r"/>

Information outside sentences

Information outside sentences (<s … >) should be removed. An <ab> consists only of sentences, line breaks or page breaks. This is very important.


<ab>

  <s ………. > | <pb> | <lb>

</ab>

Actual: <pb facs="Ts-213_i-r"
               rend="recto"
               n="pagename_Ts-213,i-r pageref_Ts-213,1"/>Ts-213#c1Ts-213#c1<s n="Ts-213,i-r[1]_1" 
               ana="facs:Ts-213,i-r  abnr:1 satznr:1">Verstehen.</s>
           <lb/>
        </ab>
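
A check for this rule is easy to automate. The sketch below reports any text or element that sits directly inside an <ab> but outside <s>, <pb> and <lb>. The function name is made up, namespaces are ignored, and the opening <ab> tag in the sample is added by me so that the fragment from above parses on its own.

```python
import xml.etree.ElementTree as ET

ALLOWED = {"s", "pb", "lb"}


def stray_content(ab):
    """List text and element names found in <ab> outside s, pb and lb."""
    problems = []
    if ab.text and ab.text.strip():
        problems.append(ab.text.strip())
    for child in ab:
        if child.tag not in ALLOWED:
            problems.append("<%s>" % child.tag)
        # Text after a child element is also outside any sentence.
        if child.tail and child.tail.strip():
            problems.append(child.tail.strip())
    return problems


sample = ET.fromstring(
    '<ab><pb facs="Ts-213_i-r" rend="recto" '
    'n="pagename_Ts-213,i-r pageref_Ts-213,1"/>Ts-213#c1Ts-213#c1'
    '<s n="Ts-213,i-r[1]_1" ana="facs:Ts-213,i-r  abnr:1 satznr:1">'
    'Verstehen.</s><lb/></ab>'
)
print(stray_content(sample))
```

An empty list means the <ab> follows the rule; anything else is material that still has to be removed.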