A short (very short) XML primer

So, XML. It’s actually very easy. It’s a lot like HTML, except you get to make up your own tags. The rules are:

  1. Any text ought to be surrounded by tags: <my_tag>text</my_tag>
  2. Nest tags properly; do not let them overlap. <my_tag><bold_text>text</my_tag></bold_text> is not allowed. Use <my_tag><bold_text>text</bold_text></my_tag> instead.
  3. There needs to be a tag that starts and ends the document. This is generally called the root node.
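
If you want to see rules 2 and 3 bite, hand a few snippets to an XML parser. Here is a minimal sketch using Python's standard-library xml.etree.ElementTree. Python is just my choice for a quick check, and the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<my_tag><bold_text>text</bold_text></my_tag>"
overlapping = "<my_tag><bold_text>text</my_tag></bold_text>"  # breaks rule 2
no_root = "<a>one</a><b>two</b>"                              # breaks rule 3

print(is_well_formed(good))         # True
print(is_well_formed(overlapping))  # False
print(is_well_formed(no_root))      # False
```

Note that this only checks well-formedness. ElementTree does not validate against a dtd.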

That’s it, mostly.* Follow those rules and you will have well-formed XML. Valid XML, however, requires a set of tag definitions to validate against. Those definitions are written in a DTD (document type definition) file. Like so:

<!ELEMENT my_tag (tags and stuff my_tag can contain) >

A DTD needs an ELEMENT declaration for every tag in your XML file. Note that in my previous post, I showed two DTDs for two XML files. A line like so:

<!ELEMENT english_term (#PCDATA)>

means that the tag <english_term>...</english_term> contains text but no other tags. A line like so:

<!ELEMENT english_entry (english_term, pronunciation, definitions+, subentry*)>

lists the tags that <english_entry>...</english_entry> contains. Note that some of these tags have characters after them: +, *, ?. Those have a special meaning:

  • + means that 1 or more tags of this type can exist within the defined tag
  • * means that zero or more tags of this type can exist within the defined tag
  • ? means that zero or 1 tag of this type can exist within the defined tag
  • no mark means that exactly 1 tag of this type must exist within the defined tag
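
If the marks look familiar, that is because they behave like the same marks in regular expressions. The sketch below leans on that: it turns a flat, comma-separated content list into a regex and matches it against an element's child tags. It is an illustration only, not a real validator. It ignores #PCDATA, choices written with |, and nested groups, and the function names are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model):
    """Turn '(a, b+, c*)' into a regex over space-terminated tag names."""
    items = model.strip().strip("()").split(",")
    parts = []
    for item in items:
        item = item.strip()
        mark = item[-1] if item[-1] in "+*?" else ""   # no mark: exactly one
        name = item.rstrip("+*?")
        parts.append("(?:" + re.escape(name) + " )" + mark)
    return re.compile("".join(parts))

def children_match(element, model):
    """Check the element's direct children, in order, against the model."""
    sequence = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(sequence) is not None

MODEL = "(english_term, pronunciation, definitions+, subentry*)"

ok = ET.fromstring(
    "<english_entry><english_term/><pronunciation/>"
    "<definitions/><definitions/></english_entry>"
)
missing = ET.fromstring(
    "<english_entry><english_term/><definitions/></english_entry>"
)

print(children_match(ok, MODEL))       # True: two definitions, no subentry
print(children_match(missing, MODEL))  # False: pronunciation is required
```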

There. That is basic (very basic) XML. Easy!

*Then there are attributes and namespaces and other optional complications.

Converting static pages to dynamic pages using PHP and MySQL, Part 1

The challenge:
Take Adam Walker’s Carrajina dictionary (http://carrajina.conlang.org/dicthome.html) and stick it in a database so that Adam can add an entry whenever he wants without re-writing an html page and possibly messing up his html. One stricture is that Adam does not want to learn too much new stuff – no SQL, no PHP, just HTML. And if he can enter SAMPA and have it turn into IPA, that’s a plus!
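
The SAMPA-to-IPA wish is mostly a lookup table. Here is a rough sketch in Python (a stand-in for illustration; in PHP the same idea would be a lookup array passed to strtr). The table covers only a handful of single-character symbols, and the names SAMPA_TO_IPA and sampa_to_ipa are made up. A real converter would need the full symbol set and would have to handle multi-character symbols.

```python
# Partial SAMPA -> IPA table: single-character symbols only.
SAMPA_TO_IPA = {
    '"': "ˈ",  # primary stress
    "%": "ˌ",  # secondary stress
    ":": "ː",  # length mark
    "S": "ʃ",
    "Z": "ʒ",
    "T": "θ",
    "D": "ð",
    "N": "ŋ",
    "@": "ə",
    "I": "ɪ",
    "E": "ɛ",
    "O": "ɔ",
    "U": "ʊ",
}

_TABLE = str.maketrans(SAMPA_TO_IPA)

def sampa_to_ipa(text):
    """Replace each known SAMPA character; leave everything else alone."""
    return text.translate(_TABLE)

print(sampa_to_ipa('"SIp'))  # ˈʃɪp
```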

Step 1: get the html source code for the current dictionary pages. Examine them carefully and figure out the structure. Outcome:

For the English to Carthaginian side:

Each entry consists of an English word, followed by a pronunciation. Then come definitions: a number if there’s more than one, a part of speech, and the definition text – a set of glosses in Carrajina sometimes with explanation. In addition, some entries have subentries, which have an English word or phrase, the definition number if there’s more than one, and the definition text.

I turned this into XML to make it easier to analyze. Turning it into a spreadsheet might also have worked. The DTD for this part of the dictionary is:

<!ELEMENT english-carthaginian (english_entry+)>
<!ELEMENT english_entry (english_term, pronunciation, definitions+, subentry*)>
<!ELEMENT english_term (#PCDATA)>
<!ELEMENT pronunciation (#PCDATA)>
<!ELEMENT definitions (definition_entry+)>
<!ELEMENT definition_entry (definition_number, part_of_speech, definition) >
<!ELEMENT definition_number (#PCDATA) >
<!ELEMENT part_of_speech (#PCDATA) >
<!ELEMENT definition (#PCDATA) >
<!ELEMENT subentry (english_term, pronunciation?, definition+)>
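
To make the dtd above concrete, here is a small sketch of one entry written to that structure, plus a function that flattens it into one tuple per definition, which is roughly the shape a database row will want. The entry text is placeholder data, not a real dictionary entry, and the function name definition_rows is invented. I am using Python's standard library here purely for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder content only; the tag names follow the dtd above.
SAMPLE = """
<english-carthaginian>
  <english_entry>
    <english_term>EXAMPLE WORD</english_term>
    <pronunciation>PRONUNCIATION</pronunciation>
    <definitions>
      <definition_entry>
        <definition_number>1</definition_number>
        <part_of_speech>n.</part_of_speech>
        <definition>PLACEHOLDER GLOSS</definition>
      </definition_entry>
    </definitions>
  </english_entry>
</english-carthaginian>
""".strip()

def definition_rows(xml_text):
    """Yield (term, number, part_of_speech, definition) per definition_entry."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("english_entry"):
        term = entry.findtext("english_term")
        for d in entry.iter("definition_entry"):
            yield (
                term,
                d.findtext("definition_number"),
                d.findtext("part_of_speech"),
                d.findtext("definition"),
            )

rows = list(definition_rows(SAMPLE))
print(rows)
```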

Creating the database model, I decided to put the definitions in their own table. Also, since the subentry is more or less identical to a head entry, I decided to put them in the same table. The diagram:

[Diagram: data model for english-carthaginian tables]
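
In case the diagram is hard to read, here is a sketch of the two decisions in table form. Everything specific is an assumption on my part for illustration: the table and column names are invented, and I am using Python's built-in sqlite3 as a stand-in so the example runs anywhere. The real thing will be MySQL, and its column types will differ.

```python
import sqlite3

# Hypothetical names. Head entries and subentries share one table;
# a subentry points at its head entry through parent_id.
SCHEMA = """
CREATE TABLE english_term (
    id            INTEGER PRIMARY KEY,
    term          TEXT NOT NULL,
    pronunciation TEXT,
    parent_id     INTEGER REFERENCES english_term(id)  -- NULL for a head entry
);
CREATE TABLE english_definition (
    id                INTEGER PRIMARY KEY,
    term_id           INTEGER NOT NULL REFERENCES english_term(id),
    definition_number INTEGER,
    part_of_speech    TEXT,
    definition        TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder data: one head entry, one subentry, one definition.
head_id = conn.execute(
    "INSERT INTO english_term (term, pronunciation) VALUES (?, ?)",
    ("HEAD WORD", "PRONUNCIATION"),
).lastrowid
conn.execute(
    "INSERT INTO english_term (term, parent_id) VALUES (?, ?)",
    ("SUBENTRY PHRASE", head_id),
)
conn.execute(
    "INSERT INTO english_definition "
    "(term_id, definition_number, part_of_speech, definition) "
    "VALUES (?, ?, ?, ?)",
    (head_id, 1, "n.", "PLACEHOLDER GLOSS"),
)

subentries = conn.execute(
    "SELECT term FROM english_term WHERE parent_id = ?", (head_id,)
).fetchall()
print(subentries)  # [('SUBENTRY PHRASE',)]
```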

For the Carthaginian to English side:

<!ELEMENT carthaginian-english (carrajina_entry+)>
<!ELEMENT carrajina_entry (carrajina_term, pronunciation, definitions+, etymology?, note?, idioms*)>
<!ELEMENT carrajina_term (#PCDATA)>
<!ELEMENT pronunciation (#PCDATA)>
<!ELEMENT definitions (definition_entry+)>
<!ELEMENT definition_entry (definition_number, part_of_speech, definition) >
<!ELEMENT definition_number (#PCDATA) >
<!ELEMENT part_of_speech (#PCDATA) >
<!ELEMENT definition (#PCDATA) >
<!ELEMENT etymology (#PCDATA)>
<!ELEMENT note (#PCDATA) >
<!ELEMENT idioms (idiom, idiom_definition)>
<!ELEMENT idiom  (#PCDATA)>
<!ELEMENT idiom_definition (#PCDATA) >

Again, I put the definitions in their own table. Although English and Carthaginian definitions have the same structure, I will keep them in separate tables. Since an entry can have multiple idioms associated with it, those will go in their own table, too. Since etymologies are common, I will keep those in the same table as the terms. Notes, however, are not very common, so I will put them in their own table. This saves space, because most term rows would otherwise carry an empty note column. It would also be perfectly valid to keep notes in the same table as the terms.

[Diagram: data model for carthaginian-english tables]
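
Here is a sketch of how the optional note table plays out in practice. As before, the table and column names are my own invention for illustration, the data is placeholder text, and Python's sqlite3 stands in for MySQL. The definitions table is left out to keep it short. The point is the LEFT JOIN: terms without a note still come back, just with an empty note.

```python
import sqlite3

# Hypothetical names. Etymology lives with the term; notes and idioms
# get their own tables keyed by term_id.
SCHEMA = """
CREATE TABLE carrajina_term (
    id            INTEGER PRIMARY KEY,
    term          TEXT NOT NULL,
    pronunciation TEXT,
    etymology     TEXT
);
CREATE TABLE carrajina_note (
    term_id INTEGER PRIMARY KEY REFERENCES carrajina_term(id),
    note    TEXT NOT NULL
);
CREATE TABLE carrajina_idiom (
    id               INTEGER PRIMARY KEY,
    term_id          INTEGER NOT NULL REFERENCES carrajina_term(id),
    idiom            TEXT NOT NULL,
    idiom_definition TEXT NOT NULL
);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

insert_term = "INSERT INTO carrajina_term (term, etymology) VALUES (?, ?)"
term_a = db.execute(insert_term, ("TERM A", "PLACEHOLDER ETYMOLOGY")).lastrowid
term_b = db.execute(insert_term, ("TERM B", "PLACEHOLDER ETYMOLOGY")).lastrowid

# Only TERM A has a note; TERM B costs nothing in the note table.
db.execute(
    "INSERT INTO carrajina_note (term_id, note) VALUES (?, ?)",
    (term_a, "PLACEHOLDER NOTE"),
)

terms_with_notes = db.execute(
    "SELECT t.term, n.note "
    "FROM carrajina_term AS t "
    "LEFT JOIN carrajina_note AS n ON n.term_id = t.id "
    "ORDER BY t.id"
).fetchall()
print(terms_with_notes)  # [('TERM A', 'PLACEHOLDER NOTE'), ('TERM B', None)]
```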

The full model:

[Diagram: full data model]

Note: ignore the VARCHAR(45). I haven’t decided on the length of these fields yet.

Questions? Comments? Mistakes?