XML CISWAB: Difference between revisions

From WASTwiki

Revision as of 13 December 2013, 09:17

This wiki page describes the steps to transform the XML-WAB format into a restricted CISWAB format for the pages of Wittgenstein's Nachlass.

Our new ideas: Oct 2013

  1. The input file for our search engine must be XML-TEI conformant.
  2. You change the tags of our CISWAB DTD into TEI-conformant XML tags, so that the file is TEI-conformant.
  3. You transform the very complex XML file from Alois (see TS213_100-103.xml) directly into our CISWAB XML file.
  4. We no longer need the intermediate .html file and can discard our Perl programs.
  5. We write an XSLT stylesheet that renders the hits of our search engine (which are now XML text) as HTML for display in the browser.


All Mathematical Notations

All mathematical notations are specially marked: we use <notation>, you call it <seg type="notation">.

Is <seg> used exclusively for notations? If not, it needs an attribute.

I found a notation: <seg type="notation">~p</seg>

ana Attribute

This is a very good idea of yours:

ana="f:Ts-213,100r abnr:621 satznr:1545">

SPACE Problems

Spaces are a big problem: they should be preserved in the XML file, but the parser usually discards them. We therefore decided earlier to replace every space with an XML tag, <sp/>. For example: Ich<space/><space/>gehe.

Transform the spaces in the text into a <sp/> tag or an XML entity.

There is a <space> element in TEI. We can replace all spaces with this.

Multiple spaces should be combined.
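A minimal sketch of this replacement in Python, under assumptions that are not taken from this page: the helper name encode_spaces is made up, the input is plain character data without markup, and a run of several spaces is combined into a single <space/> that records its length in TEI's quantity and unit attributes. Whether one element per run or one per space is wanted is still open.

```python
import re


def encode_spaces(text: str) -> str:
    """Replace every run of spaces in plain character data by one <space/> element.

    Assumption: a single space becomes <space/>, a longer run is combined into
    one element that carries the run length in @quantity.
    """
    def replacement(match: re.Match) -> str:
        length = len(match.group(0))
        if length == 1:
            return "<space/>"
        return f'<space quantity="{length}" unit="chars"/>'

    return re.sub(r" +", replacement, text)


print(encode_spaces("Ich  gehe"))
```

The function must only be applied to text nodes, because the spaces between attributes inside a tag have to stay as they are.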

Choices

„denn es <choice> <seg>ist</seg><seg>klingt</seg></choice>“

In the old CISWAB we call this <alternative> with <alt> children:

<alternative> <alt>ist</alt><alt>klingt</alt></alternative>
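The renaming from the old CISWAB names to the TEI names can be sketched in a few lines of Python. This is only an illustration: the function name to_tei_names and the wrapping <satz> element in the sample are assumptions, and attributes are left untouched.

```python
import xml.etree.ElementTree as ET

# Old CISWAB element name -> TEI element name
RENAME = {"alternative": "choice", "alt": "seg"}


def to_tei_names(root: ET.Element) -> ET.Element:
    """Rename <alternative>/<alt> to <choice>/<seg> in place."""
    for element in root.iter():
        element.tag = RENAME.get(element.tag, element.tag)
    return root


old = ET.fromstring(
    "<satz>denn es <alternative><alt>ist</alt><alt>klingt</alt></alternative></satz>"
)
new = to_tei_names(old)
print(ET.tostring(new, encoding="unicode"))
```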


Compare. We have:

<satz n="Ts-213,ii-r[3]_2" f="Ts-213,ii-r" abnr="28" satznr="63">Er ist eine <choice type="dsl"><orig type="alt2">
Zeichenverbindung</orig></choice> von mehreren möglichen und im Gegensatz zu den andern möglichen.  </satz>

You have:

<s n="Ts-213,ii-r[3]_2" ana="facs:Ts-213,ii-r abnr:28 satznr:63">Er ist eine <choice type="em">
   <seg type="stripped">
      <seg type="notation">Zei<lb rend="shyphen"/>chenerklärungverbindung</seg>
   </seg>
   <seg type="stripped">
      <choice type="dsl">
         <seg n="dsl_alt1">Zeichenerklärung</seg>
         <seg n="dsl_alt2"> Zeichenverbindung</seg>
      </choice>
   </seg>
</choice> von mehreren möglichen und im Gegensatz zu den<lb/> andern möglichen.</s>


Idea 1

We produce two files: a diplomatic and a normalized version.

  • One file like the normalized version (ask Vemund for the XSLT commands to get the right choices).
  • The second file is almost the file you sent me just now: all the choices are there. My student Pattrick has developed a program that can produce all possible readings of a sentence by resolving all choices.
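To make the idea of "all possible readings" concrete, here is a small Python sketch. It is not Pattrick's program; the function name readings and the sample sentence are assumptions. A <choice> contributes exactly one of its alternatives, nested choices are flattened into their parent, and every other element concatenates its text with the readings of its children.

```python
import itertools
import xml.etree.ElementTree as ET


def readings(element: ET.Element) -> list[str]:
    """Return every plain-text reading of an element by resolving all choices."""
    if element.tag == "choice":
        # Exactly one alternative is picked; whitespace between them is ignored.
        alternatives = []
        for child in element:
            alternatives.extend(readings(child))
        return alternatives
    # Any other element: combine text, child readings and tails in document order.
    parts = [[element.text or ""]]
    for child in element:
        parts.append(readings(child))
        parts.append([child.tail or ""])
    return ["".join(combination) for combination in itertools.product(*parts)]


sentence = ET.fromstring(
    "<s>denn es <choice><seg>ist</seg><seg>klingt</seg></choice> so</s>"
)
for reading in readings(sentence):
    print(reading)
```

The number of readings is the product of the number of alternatives of all independent choices in the sentence, so it can grow quickly for long sentences.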

In our search engine we have almost only text, as you can see in the previous example.


In our version so far we have only dsl_alt2 and no alternative:

Er ist eine Zeichenverbindung von mehreren möglichen un

TEI compliant

I first edited the CISWAB output file until it validated against the “tei_all” DTD, to see what needed to be changed. I then used the result to work out what the XSL stylesheet has to map. wab2cis has produced TS-213-max.xml from Ts-213, and the result is compliant. I have tried to stick to the structure that CIS used, and hopefully this is OK as a first draft. I'm not entirely sure what we should do with all the whitespace.

Do you want to make a TEI header template for the CIS item, i.e. metadata stating that the file is a stripped version of File_xx?

TEI header

The teiHeader has a minimum set of required components:

<teiHeader>
   <fileDesc>
      <titleStmt>
         <title></title>
      </titleStmt>
      <publicationStmt></publicationStmt>
      <sourceDesc></sourceDesc>
   </fileDesc>
</teiHeader>

publicationStmt and sourceDesc each require some additional child elements.

Logic for Stylesheets

The logic of the stylesheet is as follows. For the transformation I mostly use a variant of the XSLT copy pattern. For CIS this means that the generic rule for an element is to process its children without copying the element itself. The following rules then override the generic one to fit the WAB files to the CIS model.

  1. Ignore <facsimile> elements and their children.
  2. Ignore <fw> elements (page numbers, often outside the structure. Do you want to keep these?).
  3. Copy the <body> and <text> elements as they are and apply templates to their child elements. (The old CISWAB only has text, but it is my understanding that body is required for TEI.)
  4. The <ab> element is copied, with its @xml:id copied to @n, and an @ana is added with the value abnr:{count of the preceding <ab>s plus the <ab> itself}. Templates are applied to the children.
  5. The <s> element is copied. The ids {f:Ts-213,230r abnr:1178 satznr:3684} are written to @ana. (Some of these could probably find better homes in other attributes, but I don't have detailed knowledge of the TEI attributes.) Templates are applied to the children.
  6. <lb> and <pb> elements are copied together with their @facs. (Do you want to keep the @facs?)
  7. The <choice> element is copied. Templates are applied to the children.
  8. A <choice> element within a <choice> is copied as is. Templates are applied to the children.
  9. All other child elements of <choice> (<*>, except <choice>) are changed into <seg>, to keep the logic similar to CIS (<alternative><alt>). These <seg>s get @type='stripped' to indicate that the old element name was stripped away. Templates are applied to the children.
  10. When a <seg> with @type='notation' is met, it is copied as is, and templates are applied to its child elements.
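The rules above can be imitated in Python to check one's understanding of them. This is a sketch and not the XSLT stylesheet itself: the names transform and fill are made up, elements are assumed to carry no namespace, and the attribute rewriting of rules 4 and 5 (@xml:id to @n, building @ana) is left out. Copied elements simply keep all their attributes.

```python
import xml.etree.ElementTree as ET

IGNORED = {"facsimile", "fw"}                               # rules 1 and 2
COPIED = {"text", "body", "ab", "s", "lb", "pb", "choice"}  # rules 3 to 8


def fill(parent, items):
    """Attach strings and elements to parent in document order."""
    last = None
    for item in items:
        if isinstance(item, str):
            if last is None:
                parent.text = (parent.text or "") + item
            else:
                last.tail = (last.tail or "") + item
        else:
            parent.append(item)
            last = item


def transform(element, in_choice=False):
    """Return the output items (strings and elements) for one input element."""
    if element.tag in IGNORED:
        return []
    items = [element.text or ""]
    for child in element:
        items.extend(transform(child, in_choice=(element.tag == "choice")))
        items.append(child.tail or "")
    if in_choice and element.tag != "choice":
        out = ET.Element("seg", {"type": "stripped"})       # rule 9
    elif element.tag in COPIED or (
        element.tag == "seg" and element.get("type") == "notation"
    ):
        out = ET.Element(element.tag, dict(element.attrib))  # rules 3 to 8 and 10
    else:
        return items                                        # generic rule: content only
    fill(out, items)
    return [out]


source = ET.fromstring(
    '<ab><fw>12</fw><s>Er ist eine <choice type="dsl"><orig>Zeichenerklärung</orig>'
    "<orig> Zeichenverbindung</orig></choice> von <hi>mehreren</hi> möglichen.</s></ab>"
)
result = transform(source)[0]
print(ET.tostring(result, encoding="unicode"))
```

In the sample, <fw> disappears, <hi> is unwrapped by the generic rule, and both <orig> elements become <seg type="stripped">.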

In the DTD you specify the XML attribute id

  1. I have been thinking about using xml:id, since it is used in Alois's files and is also one of the attributes that is allowed on all elements.
  2. See http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-s.html. We could probably move the values around a bit, for instance:

<s n="Ts-213,i-r[10]_1" ana="f:Ts-213,i-r abnr:9 satznr:23">(S.30)

For instance, we can use:

<s xml:id="Ts-213,i-r[10]_1" n="23" facs="Ts-213,i-r" ana="9">

The n attribute is described as a number or label for the element, so a counter should be a good fit here. The facs attribute is the same one that is used on the pb elements, and is correct. The only thing that is not very self-describing is using the ana attribute to give the position of the containing ab element. This practice should be documented. If possible I would suggest not using ana here, but just pointing to the parent ab/@n. However, putting the value into ana is allowed (it is described as one or more analytical units separated by spaces), so for ease of use it could be documented that way.

DECISION

No xml:id. We take this form (with facs):
<s n="Ts-213,i-r[10]_1" ana="facs:Ts-213,i-r abnr:9 satznr:23">(
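Since the decided form packs three values into one ana attribute, any consumer has to split it again. A minimal Python sketch, assuming the key:value layout shown above and that the values never contain a space or a colon (the function names parse_ana and build_ana are made up):

```python
def parse_ana(ana: str) -> dict[str, str]:
    """Split an @ana value like 'facs:Ts-213,i-r abnr:9 satznr:23' into its parts."""
    return dict(part.split(":", 1) for part in ana.split())


def build_ana(facs: str, abnr: int, satznr: int) -> str:
    """Build the @ana value in the decided form."""
    return f"facs:{facs} abnr:{abnr} satznr:{satznr}"


info = parse_ana("facs:Ts-213,i-r abnr:9 satznr:23")
print(info["facs"], info["abnr"], info["satznr"])
```

Because the string is split on any whitespace, an accidental double space between two parts does no harm.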

choice can enclose a choice

@Max: Is it necessary that a choice can enclose a choice? <!ELEMENT choice (choice|seg)*>

@Öyvind: Yes! There are 27 occurrences of a choice within a choice in all of Alois's XML. This also matches the DTD originally described by CIS, with alternative | alt | alternative.

@Öyvind: Do you mean we keep any existing @type attributes on <choice>?

<choice type="em">
<orig type="em1">
<seg type="notation" subtype="p" rend="literal">Zei<lb rend="shyphen"/>chen
<del type="d">erklärung</del>verbindung</seg></orig>

I solved this by adding a rule that suppresses orig elements if another orig element with type="alt2" exists. There will probably be more exceptions when choosing a diplomatic/normalized version. Maybe a better approach would be to look at which switches Vemund has used for choosing versions?
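One possible reading of this rule, sketched in Python. The assumptions are mine: the rule is applied per <choice> and only looks at its direct <orig> children, the function name keep_alt2_only is made up, and the type value "alt1" in the sample is hypothetical.

```python
import xml.etree.ElementTree as ET


def keep_alt2_only(choice: ET.Element) -> None:
    """If a direct <orig type="alt2"> child exists, drop the other <orig> children."""
    origs = choice.findall("orig")
    if any(orig.get("type") == "alt2" for orig in origs):
        for orig in origs:
            if orig.get("type") != "alt2":
                choice.remove(orig)


choice = ET.fromstring(
    '<choice type="dsl"><orig type="alt1">Zeichenerklärung</orig>'
    '<orig type="alt2"> Zeichenverbindung</orig></choice>'
)
keep_alt2_only(choice)
print(ET.tostring(choice, encoding="unicode"))
```

A choice without any alt2 alternative is left unchanged, which is where the further exceptions mentioned above would have to be added.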

seg TAGS

The attributes of seg should be specified in detail. Instead of

<!ATTLIST seg
           type CDATA #IMPLIED>

the allowed values should be clearly specified:

<!ATTLIST seg
  type (stripped|notation) 'stripped'>

It would also be good to specify the type of choice in <choice>.

linebreaks

Here is something strange: Alois always gives us a line break that indicates whether or not it is a hyphenation line break. This information is missing from your file!

ozusagen -- einen Ein<lb/>flussß

should be: in<lb rend="hyphen"/>fluß
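Such cases can be searched for automatically. The following Python sketch lists every <lb/> that has no rend attribute although it sits in the middle of a word. It is a heuristic under my own assumptions: the function name unmarked_word_breaks is made up, elements carry no namespace, and only text that is directly adjacent to the <lb/> within the same parent is inspected.

```python
import xml.etree.ElementTree as ET


def unmarked_word_breaks(root: ET.Element) -> list[tuple[str, str]]:
    """Return (word start, word end) for each <lb/> without @rend inside a word."""
    hits = []
    for parent in root.iter():
        before = parent.text or ""
        for child in parent:
            after = child.tail or ""
            if (
                child.tag == "lb"
                and child.get("rend") is None
                and before[-1:].isalpha()
                and after[:1].isalpha()
            ):
                hits.append((before.split()[-1], after.split()[0]))
            before = after
    return hits


sample = ET.fromstring(
    '<s>einen Ein<lb/>fluss und Zei<lb rend="shyphen"/>chen zu den<lb/> andern</s>'
)
print(unmarked_word_breaks(sample))
```

Only the first line break of the sample is reported: the second one carries a rend attribute, and the third one is followed by a space.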


Strange Characters

Here is another thing. What is this: Dassß diese Erfahrung aber‘

See around:

          <s n="Ts-213,7r[5]_2" ana="f:Ts-213,7r abnr:197 satznr:486">Dassß diese Erfahrung aber <choice>
                 <seg type="stripped">das Verstehen

pagebreak tags

Our page breaks specify the facsimile. The facsimile corresponds to the actual page (this is our “et” resolution).

See: <pb n="Ts-213,7r"/>

Information outside sentences

Information outside the sentences (<s … >) should be removed. An <ab> consists only of sentences, line breaks and page breaks. This is very important.


<ab>
 <s ………. > | <pb> | <lb>
</ab>
Actual: <pb facs="Ts-213_i-r"
               rend="recto"
               n="pagename_Ts-213,i-r pageref_Ts-213,1"/>Ts-213#c1Ts-213#c1<s n="Ts-213,i-r[1]_1" 
               ana="facs:Ts-213,i-r  abnr:1 satznr:1">Verstehen.</s>
           <lb/>
        </ab>
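A sketch of such a clean-up step in Python, with made-up names and a simplified sample. It deletes all text that stands directly inside an <ab>, outside of any child element, and reports children other than <s>, <pb> and <lb> instead of deleting them, because they may still wrap a sentence.

```python
import xml.etree.ElementTree as ET

ALLOWED = {"s", "pb", "lb"}


def clean_ab(ab: ET.Element) -> list[str]:
    """Drop stray text inside an <ab>; return the tags of unexpected children."""
    ab.text = None
    unexpected = []
    for child in ab:
        child.tail = None
        if child.tag not in ALLOWED:
            unexpected.append(child.tag)
    return unexpected


ab = ET.fromstring(
    '<ab><pb n="Ts-213,i-r"/>Ts-213#c1Ts-213#c1'
    '<s n="Ts-213,i-r[1]_1">Verstehen.</s>\n<lb/></ab>'
)
leftover = clean_ab(ab)
print("".join(ab.itertext()))
```

After the call, the only text left in the sample <ab> is the sentence itself.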


Notation

Why is this a notation?

<s n="Ts-213,ii-r[3]_2" ana="facs:Ts-213,ii-r abnr:28 satznr:63">Er ist eine <choice type="em">
             <seg type="stripped">
             <seg type="notation">Zei<lb rend="shyphen"/>chenerklärungverbindung</seg>
             </seg>


WAB Marks

Please remove the WAB marks. It is too much for now:

<seg type="wabmarks-secml_h" part="N">?∕</seg>
   <seg type="wabmarks-secmr_h" part="N">√</seg>


Page numbers

Please remove the page numbers; it is too much for now:

<seg type="int-ref"
     n="Ts-213,144r_Ts-213,165r"
     corresp="Ts-213#73"
                      part="N">S. 165</seg>
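Removing the WAB marks and these page-number references amounts to deleting certain <seg> elements together with their content while keeping the text that follows them. A Python sketch under assumptions: the names drop_segs and is_unwanted are made up, elements carry no namespace, and the types to drop are taken from the two examples (a type beginning with "wabmarks-", and "int-ref").

```python
import xml.etree.ElementTree as ET


def is_unwanted(seg_type: str) -> bool:
    """True for WAB marks and internal page references."""
    return seg_type.startswith("wabmarks-") or seg_type == "int-ref"


def drop_segs(root: ET.Element, should_drop=is_unwanted) -> None:
    """Delete matching <seg> elements, keeping the text that follows them."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag == "seg" and should_drop(child.get("type", "")):
                tail = child.tail or ""
                # Move the tail text to where it stays in reading order.
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child


sentence = ET.fromstring(
    '<s>Text <seg type="wabmarks-secmr_h" part="N">√</seg>geht '
    '<seg type="int-ref" part="N">S. 165</seg>weiter <seg type="notation">~p</seg>.</s>'
)
drop_segs(sentence)
print("".join(sentence.itertext()))
```

Elements that wrap a sentence, such as a <seg> around an <s>, are not covered by this sketch, because there the content has to be kept.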


edinst Attribute

Please remove edinst; it is too much for now:

<seg type="edinst" part="N">
    <s n="Ts-213,145r[4]_1" ana="facs:Ts-213,145r abnr:760 satznr:2423">Zu 
 
 S. 99
           </seg>


Attribute "subhead"

The tag should be removed: seg type="subhead" corresp

We have (I don't know where the [33] comes from):

<ab n="Ts-213,175r[1]" abnr="502">
<satz n="Ts-213,175r[1]_1" f="Ts-213,175r" abnr="502" satznr="1317">[33] Wie wirkt die einmalige Erklärung der Sprache, das  
Verständnis?  </satz>
</ab>

You have:

       <ab n="Ts-213,175r[1]" ana="abnr:885">
           <seg type="subhead" corresp="Ts-213#c46" rend="41" part="N">
              <s n="Ts-213,175r[1]_1" ana="facs:Ts-213,175r abnr:885 satznr:2803">
                 <seg type="mark-ref"
                      n="Ts-213,150r_Ts-213,175r"
                      corresp="Ts-213#76"
                      part="N"/>Wie wirkt die einmalige Erklärung der Sprache, das Verständnis?
           </seg>
        </ab>


Strange Words: enenthalten

Another strange thing: enenthalten

</choice>
   <lb/> nicht enenthalten.
      <s n="Ts-213,175r[4]et174v[1]_2"
           ana="facs:Ts-213,175r abnr:888 satznr:2809">(