Generalizing the Perseus XML Document Manager
Anne Mahoney, Jeffrey A. Rydberg-Cox*, David A. Smith, Clifford E. Wulfman
Perseus Project, Tufts University; *University of Missouri, Kansas City
amahoney@perseus.tufts.edu; rydbergcoxj@umkc.edu; dasmith@perseus.tufts.edu;
cwulfman@perseus.tufts.edu
Paper presented at the workshop on Web-Based Language Documentation and Description,
12-15 December 2000, Philadelphia, USA.
Abstract. The Perseus Digital Library includes tools for morphological
analysis; encoding and presentation of lexica; metadata and
cataloguing; abstract mapping of various SGML and XML DTDs;
and display of document sections. The system was developed for Ancient
Greek and has been extended to support Latin and Italian. We describe
how we are generalizing this document management system for other languages
and for use by other projects. The original implementation did not
clearly separate infrastructure from project-specific data and configuration.
Determining which data elements, template files, and routines are
part of the system and which are part of the various corpora has helped
us determine what is crucial to the infrastructure of a multi-lingual digital
library, which naming conventions and meta-data standards should be shared
among co-operating projects, and what features can be configurable by individual
projects. One goal of the present work is to make the Perseus DL
infrastructure available as open-source software.
1. INTRODUCTION
The Perseus Project manages a digital library containing almost 9 million
words of Greek, over 4 million words of Latin, and growing corpora in Italian
and German, as well as 55 million words of English. We have
developed tools for morphological analysis; encoding and presentation
of lexica; metadata and cataloguing; abstract mapping of various SGML
and XML DTDs; and display of document sections. We are now working on
generalizing this document management system for use by other projects.
In this paper, we will explain our strategies for packaging this
existing infrastructure, describe the problems we've run into and how
we solved them, and describe the system itself and how other projects
expect to use it. The Perseus Digital Library is on line at
http://www.perseus.tufts.edu.
Our document management system and linguistic tools were originally designed
for Ancient Greek. We extended the system to Latin several years ago, and
more recently have added full support for Italian; this fall, we have begun
work on support for Arabic. Supporting a language in our system includes
encoding one or more lexica in SGML, adding this language's inflexion rules
to the morphological analysis database, and adding texts to the digital
library. We use the lexicon to seed the morphological database, then
make new words available as they appear in new
texts added to the collection. Section 2 describes the morphological
analysis program, and section 3 describes our methods for encoding lexica and
presenting them in the digital library.
Although we use the TEI DTD for
new texts, we have a variety of older texts that use different DTDs,
and we have also incorporated materials from other projects that use
other DTDs, or use the TEI in different ways. Our system manages
varying DTDs by mapping relevant structural features to abstract
structures. For example, "chapter 5" might be represented by the
fifth occurrence of a "<chapter>" element,
by "<div2 type=chapter n=5>",
by "<milestone unit=chapter id="Ch5" />", or in
some other way. The mapping scheme converts all of these to the
abstract "chapter 5". Thereafter, routines processing the XML files
need not be concerned with the details of their original DTDs. We describe
the abstract structure mapping further below, in section 4.
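The idea can be illustrated with a minimal Python sketch. It is not the Perseus implementation: the function name and the three sample documents are hypothetical, the XML is assumed to be well-formed with quoted attributes, and the milestone is assumed to carry an "n" attribute rather than the "id" shown above.

```python
import xml.etree.ElementTree as ET

def abstract_chapters(xml_text):
    """Return the abstract chapter numbers found in a document, in order.

    Three concrete encodings are recognized (an illustrative subset):
    a <chapter> element, numbered by occurrence; a <divN type="chapter">
    element with an n attribute; and a <milestone unit="chapter"> marker.
    """
    chapters = []
    occurrences = 0
    for el in ET.fromstring(xml_text).iter():
        if el.tag == "chapter":
            occurrences += 1
            chapters.append(str(occurrences))
        elif el.tag.startswith("div") and el.get("type") == "chapter":
            chapters.append(el.get("n"))
        elif el.tag == "milestone" and el.get("unit") == "chapter":
            chapters.append(el.get("n"))
    return chapters

# Three hypothetical documents that all contain "chapter 5".
counted = "<book>" + "<chapter/>" * 5 + "</book>"
typed = '<book><div2 type="chapter" n="5"/></book>'
marked = '<book><p>text<milestone unit="chapter" n="5"/>more</p></book>'
```

Whatever the markup, a caller asks only whether "5" is among the chapters returned, which is the sense in which later routines can ignore the original DTD.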
One of the strengths of our document management system is its flexible,
modular display front end. We present texts in HTML over the Web, but have also
experimented with presentation in XML and in Adobe's PDF. Readers can
request display of a section of a document, using the standard citation
scheme for the text if there is one (for example, book and line for Homer's
Iliad, or book, chapter, and verse for the Bible). The document system
identifies the desired section based on structural metadata provided by
the DTD mapping engine and converts it to display format based on a template
supplied by the corpus editor or digital librarian. User preferences also
affect the display; in particular, since we cannot yet assume that all
potential readers have access to Unicode display fonts, we can convert
non-Roman characters into any of several popular fonts for display.
Additional modules in the display system manage implicit searching of feature
databases (for geography, timelines, and keyword lookups) and automatic
connection of texts to morphological analyses, glosses, and collocation data.
In generalizing this document management system for use outside the Perseus
Project, we have identified various problems and limitations. Most important,
the original implementation did not clearly separate infrastructure from
project-specific data and configuration. Determining which data elements,
template files, and routines are part of the system and which are part of the
various corpora has helped us determine what is crucial to the infrastructure
of a multi-lingual digital library, which naming conventions and meta-data
standards should be shared among co-operating projects, and what features
can be configurable by individual projects. Section 5 discusses what we have
learned from the exercise of generalizing the system.
2. RULE-BASED MORPHOLOGICAL ANALYSIS
Morpheus, the rule-based morphological system, is the foundation
of linguistic analysis in the Perseus Digital Library. It was first
developed for Greek by Gregory Crane in 1985, extended to Latin in 1996, and
extended to Italian in 1999. Morpheus maintains separate databases
for morphological information ("what are the endings of the present tense,
active voice?") and lexical information ("is volo a regular verb?"
"what is the stem of femina?"). This allows new forms to be added
easily, usually automatically: if the stem ama- (Latin, = "love") is
known to belong to a first-conjugation verb,
forms like amavisti, amabantur, ames will be recognized wherever they
appear in the texts.
The original implementation, Greek Morpheus, can handle regular verbs
and nouns, irregular verbs (in Greek, mostly suppletive) and nouns, verb prefixes
(a very common kind of derivation), and the various dialects of Greek in common
use in the archaic and classical periods. Virtually all inflections in Greek
are endings, though many past-tense verb forms take a prefix (the "temporal augment")
and some stems are formed by reduplication of the first consonant. Morpheus
therefore assumes that inflected words can be divided into stems and endings. The
stems are related to lexical headwords (e.g. the stems pemp- and pepomph-
belong to the verb pempô, "send") so that tools using Morpheus
can offer definitions as well as morphological analyses. For each stem, moreover,
Morpheus knows the relevant grammatical category (the "conjugation" or
"declension"), which determines the possible endings. It can then recognize that
pempoimi is a valid form, but pempeiên is not: both use endings
for the first person singular, present optative active, but only the first of these
endings is appropriate for the verb pempô.
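A toy Python sketch of this stem-and-ending check follows. It is only an illustration of the principle, not Morpheus itself: the table layout, the class labels "thematic" and "athematic", and the function name are assumptions made for the example, and each table holds a single entry.

```python
# Lexical information: stem -> (headword, inflectional class).
STEMS = {"pemp": ("pempô", "thematic")}

# Morphological information: class -> ending -> analysis.
ENDINGS = {
    "thematic": {"oimi": "1st sg pres opt act"},
    "athematic": {"eiên": "1st sg pres opt act"},
}

def analyze(form):
    """Try every split of the form into stem + ending and keep the
    splits whose ending is allowed for the stem's class."""
    analyses = []
    for i in range(1, len(form)):
        stem, ending = form[:i], form[i:]
        if stem in STEMS:
            headword, inflection_class = STEMS[stem]
            if ending in ENDINGS[inflection_class]:
                analyses.append((headword, ENDINGS[inflection_class][ending]))
    return analyses

print(analyze("pempoimi"))   # one analysis, under the headword pempô
print(analyze("pempeiên"))   # no analysis: the ending belongs to another class
```

Because the two tables are separate, adding a new stem to STEMS makes all of its regular forms recognizable without touching ENDINGS.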
When the Perseus Project received a grant from the National Endowment for the
Humanities to add coverage of Roman art, Roman history, and Latin literature, one
of the first necessary steps was to generalize Morpheus to handle Latin. Because
Latin morphology, like Greek, uses endings, this was straightforward. The only
processes in Latin that were not already accounted for in Greek were assimilation
of prefixes and syncopation of certain endings. In Greek, most verb prefixes end
in vowels (epi-, kata-, apo-, and so on), while Latin has many prefixes that
end in consonants (ad-, in-, sub-). These prefixes may be left as they are
(adfero, "carry to") or may be assimilated to a following consonant (affero);
current printed Latin editions may do either. Morpheus had to recognize
that affero = ad- + fero.
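As a rough Python sketch of the prefix problem, the function below accepts a prefix either as written or with its final consonant fully assimilated to the first letter of the verb. The prefix list comes from the examples above; the function name, the set of known verbs, and the restriction to total assimilation are simplifying assumptions, not a description of how Morpheus does it.

```python
def split_prefix(form, known_verbs, prefixes=("ad", "in", "sub")):
    """Return (prefix, verb) pairs that could underlie the form."""
    results = []
    for prefix in prefixes:
        rest = form[len(prefix):]
        if rest not in known_verbs:
            continue
        written = form[:len(prefix)]
        # Total assimilation: the final consonant of the prefix becomes
        # a copy of the first letter of the verb (ad + fero -> affero).
        assimilated = prefix[:-1] + rest[0]
        if written in (prefix, assimilated):
            results.append((prefix, rest))
    return results

print(split_prefix("adfero", {"fero"}))   # [('ad', 'fero')]
print(split_prefix("affero", {"fero"}))   # [('ad', 'fero')]
```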
Latin also has a class of perfect endings that can be syncopated: amavisti,
the full form (= "thou hast loved"), frequently appears as amasti. The
syncopated endings could be considered simply alternate endings, analogous to the
dialectal variations in Greek, but it is convenient to recognize the syncopation
because this is how the forms are usually presented in textbooks and student grammars.
Certain clitic particles in Latin are conventionally written as suffixes (-que, -ve, -ne),
and Morpheus has to recognize those as well, but this rarely presents problems
as there are few cases where the form is ambiguous. That is, forms like eque which
admit of two analyses (vocative of equus, "O horse," or e + -que, "and out of")
are quite rare.
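The ambiguity can be shown with a small Python sketch that returns every reading of a form, with and without a clitic. The set of recognized forms and the function name are hypothetical; a real analyzer would consult the morphological databases rather than a fixed set.

```python
CLITICS = ("que", "ve", "ne")

def clitic_analyses(form, recognized):
    """Return (base, clitic) readings; clitic is None for the whole word."""
    analyses = []
    if form in recognized:
        analyses.append((form, None))
    for clitic in CLITICS:
        base = form[: -len(clitic)]
        if form.endswith(clitic) and base in recognized:
            analyses.append((base, clitic))
    return analyses

RECOGNIZED = {"eque", "e", "arma", "virum"}
print(clitic_analyses("eque", RECOGNIZED))      # [('eque', None), ('e', 'que')]
print(clitic_analyses("virumque", RECOGNIZED))  # [('virum', 'que')]
```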
Extension of Morpheus to Italian was straightforward, since Italian morphology
works on the same principles as Greek and Latin. We expect that Morpheus in its
present form will work for any Indo-European language, and for any other language whose
morphology is based primarily on endings. Other languages present more problems. Currently,
we are planning work on Arabic, related to one of our collaborators' projects in the history
of science. Morpheus is not the best tool for Arabic morphology, which is based on
vowel changes and infixation as well as affixation. Moreover, standard Arabic texts do not even contain the
vowels, which means it is necessary to parse a form in context to recognize which of
several possible forms it is. We expect, therefore, to use a different morphological
analysis engine for Arabic.
3. ENCODING AND PRESENTING LEXICA
Because we use the TEI DTD for our texts, it was a natural choice for our
lexica as well. We do not use the strict TEI dictionary tag set, however, because
older print dictionaries are not completely consistent in their structure and the
strict structures associated with the <entry> tag set do not allow for the inevitable
variation that occurs in these dictionaries. For this reason, we use the much less strict
<entryfree> syntax for most of the dictionaries in the digital library.
The actual display of dictionary entries is handled by the
document management system described below. The lexica are also integrated with
Morpheus: every Greek and Latin word anywhere in the Perseus digital
library is linked to a word study tool, based on the morphological analysis. Users
can click on a word and see an automatically generated hypertext giving the
morphological analysis and other resources based on the dictionary headword. These
resources include a short definition (automatically extracted from the lexicon),
word frequency charts, links to searching tools, links to grammar helps and, of course,
links to the full definitions of the word in the dictionary.
Tagging all of our lexica according to a consistent format such as the TEI has
allowed us to develop several scalable tools to extract and re-present the
knowledge encoded in these documents. As noted above, the Perseus morphological
analysis system maintains a separate database for lexical information. This has
allowed us to develop programs to extract lists of lexical forms from dictionary
entries, check them against the existing lexical database, and add new entries where
appropriate. Thus, when the National Endowment for the Humanities provided funding
to enter the standard unabridged Greek-English lexicon (LSJ), we were able to add
approximately 70,000 extra words to the lexical database with little extra hand
work. Similarly, one of the first steps in expanding the morphological analysis
system to Italian was entering an Italian dictionary, extracting lexical information,
and using it to create a new lexical database. Our work on Arabic will begin with
entry of Lane's Arabic-English Lexicon.
We can extract additional information from lexica. We have developed
programs that extract definitions from the lexica and generate lists of
words with similar definitions, or possible synonyms,
using vector-space document similarity models. This works not only for other
words in the same lexicon, but for words in other lexica, even in other languages.
In addition, we have written programs to scan dictionary entries and
extract short definitions that we can present to end-users as part of the word study tool,
or can include in a vocabulary list for students.
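A minimal Python sketch of a vector-space comparison of definitions is given below. It uses plain word counts and cosine similarity, which is one common form of the model and is assumed here for illustration; the weighting actually used in Perseus is not described in this paper, and the three short definitions are made up for the example.

```python
import math
from collections import Counter

def vector(definition):
    """Represent a definition as a bag of lower-cased words."""
    return Counter(definition.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(headword, definitions):
    """Rank the other headwords by similarity of their definitions."""
    target = vector(definitions[headword])
    scores = [(cosine(target, vector(text)), word)
              for word, text in definitions.items() if word != headword]
    return sorted(scores, reverse=True)

# Hypothetical short definitions drawn from two different lexica.
DEFINITIONS = {
    "angelos": "messenger envoy one that announces",
    "nuntius": "messenger one that announces news",
    "equus": "horse steed",
}
print(most_similar("angelos", DEFINITIONS)[0][1])   # nuntius
```

Since the comparison works on the English definitions, the headwords themselves may come from different lexica or different languages.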
These lexica also provide important data to the Perseus citation
and cross referencing engine. This tool allows us to display links
to other texts that cite the document currently being displayed. A
simple example is a commentary, which explicitly talks about another
text. For example, when a reader views the text of Thucydides or
Homer's Iliad, we are able to show notes from several commentaries
about these texts. Much more exciting, however, is the fact that each
of these citations is also displayed as a link from the cited text back
to the commentary. For example, a reader of Herodotus 3.119 might be
interested to know that Jebb cites this passage in his commentary on
Sophocles' Antigone. Our text display system generates a link
to Jebb's commentary when a user is reading this passage in the text of Herodotus.
Lexica are rich with the sorts of citations that make this display system
truly useful. The LSJ Greek-English lexicon, for example, contains
more than 200,000 citations of texts that exist in the Perseus digital
library. Each of these citations is converted into a link allowing
users to see that the dictionary offers specific suggestions about
the way that a word is being used in a particular context. For
example, a person reading Homer's Odyssey 16.323 will see an
active link from the word phere to the section of the LSJ entry
for pherô that cites this passage.
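The reversal of citations amounts to building an inverted index. The Python sketch below shows the principle with the two examples just mentioned; the labels used for documents and passages are informal strings chosen for readability, not the identifiers the system uses.

```python
from collections import defaultdict

def invert(citations):
    """Map each cited passage to the documents that cite it."""
    backlinks = defaultdict(list)
    for citing, cited in citations:
        backlinks[cited].append(citing)
    return backlinks

CITATIONS = [
    ("Jebb, commentary on Sophocles' Antigone", "Herodotus 3.119"),
    ("LSJ, entry for pherô", "Odyssey 16.323"),
]

backlinks = invert(CITATIONS)
# Links to offer a reader of Herodotus 3.119:
print(backlinks["Herodotus 3.119"])
```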
All of the lexica that have been made publicly available in the
Perseus Digital Library are general dictionaries, designed to provide
broad coverage of a language. These sorts of dictionaries, however,
often cannot provide the level of detail that is necessary to understand
how a single author is using a word. For this reason, we are working on
several specialized lexica for classical authors such as Homer and Pindar.
These dictionaries can be integrated into the word study tool and displayed
when users are reading works by one of these authors, while they can also
provide information that can be used in all of the knowledge management tools
described above. Because these reversible citations and automatically
presented dictionary definitions seem to us an effective way to integrate
linguistic information into the presentation of a text, we are also working
on specialized lexica for English authors, notably the Shakespeare lexica of
Dyce, Onions, and Schmidt. We will therefore need to enhance our linguistic
infrastructure to offer help to readers of texts in the primary language
of the digital library.
4. OVERVIEW OF THE XML DOCUMENT MANAGER
The Perseus text processing system manages XML and SGML texts encoded
according to various different DTDs. The key to the system is the mapping
of specific SGML elements to abstract structural elements. If a user
wishes to read Our Mutual Friend, book 3, chapter 6, or if a
commentary refers to Iliad, book 22, line 361, the document
management system can identify this section of the text by its citation
scheme (by book and chapter, or book and line), no matter what DTD was
used for Dickens or for Homer.
In addition, the text processing system manages multiple versions of
the same text. Just as structural elements are mapped to abstract
structures, so texts are mapped to abstract works (called "abstract
bibliographic objects," or ABO). A user reading Homer may begin with the
Greek text, but can also move to the English text of the same section.
Similarly, a commentary written with reference to the Greek text can
be offered to readers of the English text.
Using our system, digital librarians create partial mappings
between elements in a DTD (e.g., div1, div2, and lb) and abstract
structural elements (act, scene, and line) from which the text
processing system generates lookup tables (indices) of the elements so
mapped. Thus what is encoded as <div2 type="scene"> in one
document and as <scene> in another are both indexed as an
abstract, structural "scene." This mapping hides the
use of different DTDs from the higher-level processing routines.
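A simplified Python sketch of such a partial mapping and the index built from it follows. The two mapping tables, the sample documents, and the shape of the index keys are assumptions made for illustration; the routine actually used is the Perl program in the appendix, which writes a lookup table rather than an in-memory dictionary.

```python
import xml.etree.ElementTree as ET

# Partial mappings from concrete element names to abstract structures.
TEI_MAP = {"div1": "act", "div2": "scene", "lb": "line"}
NAMED_MAP = {"act": "act", "scene": "scene"}

def build_index(xml_text, mapping):
    """Index every mapped element under its chain of (type, n) pairs."""
    index = {}

    def walk(el, path):
        kind = mapping.get(el.tag)
        if kind is not None:
            path = path + ((kind, el.get("n")),)
            index[path] = el
        for child in el:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return index

TEI_STYLE = ('<text><div1 n="1"><div2 n="1">first</div2>'
             '<div2 n="2">second</div2></div1></text>')
NAMED_STYLE = ('<play><act n="1"><scene n="1">first</scene>'
               '<scene n="2">second</scene></act></play>')

# The same abstract citation works against either encoding.
citation = (("act", "1"), ("scene", "2"))
for text, mapping in ((TEI_STYLE, TEI_MAP), (NAMED_STYLE, NAMED_MAP)):
    element = build_index(text, mapping)[citation]
    print("".join(element.itertext()))
```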
These abstractions facilitate the implementation of knowledge
discovery tools, including full-text searching (based on words, not
mere strings), identification of toponyms and generation of
maps, identification of dates and generation of timelines, and implicit
keyword searching.
5. GENERALIZING THE DOCUMENT MANAGER FOR OTHER PROJECTS
The Perseus Digital Library is well known among classicists. Other
scholars wishing to make electronic editions of Greek and Latin texts have
wanted to use its resources, in particular the lexica and morphological
analysis facilities. Until recently, they could only do this by making
explicit links from their HTML texts to the Perseus site.
A collaborating project, the Stoa Consortium
at the University of Kentucky, had begun to develop tools of its own, but it
quickly became clear that this would be too much work for the Stoa's limited
resources. At the same time, other projects also expressed interest in the
digital library toolset. We therefore decided to make the toolset available.
It will ultimately be generally available under an open-source license, though
it is not yet ready for general release.
The Perseus toolset was written for one project, operating one digital
library on one web server. Naturally, it was not written with portability
in mind. In the course of converting "project" code to "product" code, we
discovered that we had not clearly distinguished infrastructure,
project-specific data, and configuration data. Getting all this straight has helped
us identify the real core of each of the modules in the system.
We have worked on meta-data standards and naming rules for texts and ABOs.
While each project could determine its own naming rules, it is convenient if
projects that are to share data can ensure there are no name conflicts. In the
Perseus Digital Library, texts are stored in SGML or XML files with descriptive names
(for example, soph.oc_eng.sgml for an English translation of Oedipus at
Colonus by Sophocles). When the text is normalized, its XML version receives
an internal name like 1999.01.0190.xml, which serves to identify the text to
the rest of the system. The formal name of this text is Perseus:text:1999.01.0190,
and each derived file (normalized XML, lookup table, citation list, and so on) has
a file name incorporating the numbers 1999.01.0190. The formal naming scheme has
three parts: the naming authority ("Perseus"), the object type ("text"), and the
specific object identifier ("1999.01.0190"). The naming authority section of the
name is crucial for federation of libraries: a link to "Perseus:text:1999.01.0190" is
a link to a text in the Perseus library, not the local library. If two co-operating
digital libraries were to have copies of the same SGML source file, they could use
the same name for it.
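The three-part scheme is easy to handle in code. The following Python sketch splits a formal name and tests whether it belongs to the local library; the class and function names are hypothetical.

```python
from typing import NamedTuple

class FormalName(NamedTuple):
    authority: str     # naming authority, e.g. "Perseus"
    object_type: str   # e.g. "text" or "abo"
    identifier: str    # e.g. "1999.01.0190"

def parse_name(name):
    """Split a formal name on its first two colons only, so that the
    identifier may itself contain punctuation."""
    authority, object_type, identifier = name.split(":", 2)
    return FormalName(authority, object_type, identifier)

def is_local(name, local_authority):
    """True if the name was assigned by the local library."""
    return parse_name(name).authority == local_authority

print(parse_name("Perseus:text:1999.01.0190"))
print(is_local("Perseus:text:1999.01.0190", "Livres"))   # False
```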
Naming for ABOs is similar. Oedipus at Colonus, for example, is known
to the digital library as Perseus:abo:tlg,0011,007, while Hamlet is
Perseus:abo:shak,hamlet. Every text that is a version of one of these plays
(a translation, a particular edition) has a meta-data record that indicates it is a
version of this abstract bibliographic object; every commentary, similarly, has a
meta-data record declaring it a commentary on the abstract bibliographic object. Just
as for texts, the formal name falls into three parts: naming authority, object type,
and specific identifier. The intention here is that co-operating libraries will use
the same identifiers for ABOs even if they include different versions of the texts. For
example, if a hypothetical co-operating library called Livres were to include French
translations of these two plays, the French texts might be called Livres:text:2000.01.0001
and Livres:text:2000.01.0002, but they would be declared versions of
Perseus:abo:tlg,0011,007 and Perseus:abo:shak,hamlet respectively. But
if the Livres library were to produce an edition of, say, Racine's Phèdre,
it would assign its own ABO identifier, perhaps Livres:abo:r1.
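The effect of these meta-data records can be sketched in Python as follows. The record layout and function names are assumptions for the example, and "Example:text:commentary" is a placeholder identifier, not a real text; the point is only that a commentary attached to an ABO can be offered to readers of any version of it.

```python
# (text, relation, ABO) records, as might be read from the meta-data.
RECORDS = [
    ("Perseus:text:1999.01.0190", "version", "Perseus:abo:tlg,0011,007"),
    ("Livres:text:2000.01.0001", "version", "Perseus:abo:tlg,0011,007"),
    ("Example:text:commentary", "commentary", "Perseus:abo:tlg,0011,007"),
]

def abo_of(text, records):
    """Return the ABO that a text is a version of, if any."""
    for name, relation, abo in records:
        if name == text and relation == "version":
            return abo
    return None

def related(text, relation, records):
    """Other texts standing in the given relation to the same ABO."""
    abo = abo_of(text, records)
    if abo is None:
        return []
    return [name for name, rel, target in records
            if rel == relation and target == abo and name != text]

# A reader of the French text is offered the commentary and the English version.
print(related("Livres:text:2000.01.0001", "commentary", RECORDS))
print(related("Livres:text:2000.01.0001", "version", RECORDS))
```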
Meta-data assertions about the texts come from two different places, the TEI header and
a hand-maintained database. The TEI header supplies most of the Dublin Core fields that
we use: title, creator, contributor, language, and source (from the <sourceDesc> element).
Our texts have DC type "text"; we also use type "image" for the pictures in our digital
library. From the TEI header we also determine some project-specific meta-data elements,
in particular the funder and the citation scheme (as described in section 4 above).
Co-operating projects will use these fields in the same way by virtue of using the
DTD in the same way.
The hand-maintained meta-database includes the Dublin Core relation field, which we
use to relate works to collections and other groupings of texts. We also use
the relation field to indicate the ABO that a particular SGML file is a version
of or a commentary on, if any. We use the Dublin Core date field with the "Available"
qualifier to indicate the date on which our electronic version of the text became
available; we do not currently record the
creation date of the original work, the publication date of the print edition we
worked from, or any of the other various relevant dates. We use the Dublin Core
identifier field for those few texts that are in HTML rather than SGML or XML; it
holds a URL for the file. One important project-specific field that is maintained by
hand is a publication status: public, in development, or restricted access.
Co-operating projects will need to maintain these fields as well. Currently, these
fields can be updated by a web-based application or by editing a canonical textual
version of the database.
We do not currently use the Dublin Core publisher, format, coverage, rights, or
subject fields for texts.
The basic display mechanism uses HTML templates to format pages, and
stylesheets (written in CoST) to turn XML into HTML. We allow projects, or
collections within projects, to override all or part of the default stylesheet
or template, so that their texts can have a distinctive appearance. Because
the Perseus Digital Library already uses this facility extensively, it is
easy to provide it to co-operating projects as well.
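One simple way to model this kind of layered override is a chain of lookups from the most specific level to the default. The Python sketch below is only an illustration of that design; the file names and setting keys are invented, and the paper does not say how the override is actually implemented.

```python
from collections import ChainMap

# Hypothetical settings at three levels, most general first.
DEFAULT = {"page": "default_page.html", "stylesheet": "default.cost"}
PROJECT = {"stylesheet": "project.cost"}
COLLECTION = {"page": "collection_page.html"}

def effective_settings(*layers):
    """Merge layers; earlier (more specific) layers win."""
    return dict(ChainMap(*layers))

print(effective_settings(COLLECTION, PROJECT, DEFAULT))
```

A collection that overrides nothing simply inherits the project's settings, and a project that overrides nothing inherits the defaults.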
6. REFERENCES
Agosti, Maristella, Fabio Crestani, Massimo Melucci. 1998. "On the Use of Information Retrieval Techniques for the Automatic Construction of Hypertext". Information Processing and Management 32:2, 133-144.
Arms, William Y. 2000. Digital Libraries. Cambridge: MIT Press.
Birnbaum, David, and David A. Mundie. 1999. "The Problem of Anomalous Data."
Markup Languages: Theory and Practice 1.4, 1-14.
Burnard, Lou. 1995. "What is SGML, and How Does It Help?" Computers
and the Humanities 29, 41-50. http://www.uic.edu/orgs/tei/sgml/teiedw25/.
Crane, Gregory. 1991. "Generating and Parsing Classical Greek." Literary and
Linguistic Computing 6, 243-245.
Crane, Gregory. 1998. "New Technologies for Reading: The Lexicon and the Digital Library." Classical World 91, 471-501.
Crane, Gregory. 2000. "Extending a Digital Library: Beginning a Roman Perseus."
New England Classical Journal 27, 140-160.
Lane, Edward William. 1863. An Arabic-English Lexicon. London: Williams
and Norgate.
Liddell, Henry George, Robert Scott, Sir Henry Stuart Jones, Roderick McKenzie. 1843. A
Greek-English Lexicon. Ninth edition, 1940. Oxford University Press.
Lesk, Michael. 1997. Practical Digital Libraries: Books, Bytes, and Bucks. San Francisco: Morgan Kaufmann Publishers.
Lubell, Joshua. 1999. "Structured Markup on the Web: A Tale of Two Sites."
Markup Languages: Theory and Practice 1.3, 7-22.
Rydberg-Cox, Jeffrey A. 2000. "Word Co-Occurrence and Lexical Acquisition in
Ancient Greek Texts." Literary and Linguistic Computing 15, 121-129.
Rydberg-Cox, Jeffrey A. (forthcoming) "Mining Data from the Electronic Greek
Lexicon." Classical Journal.
Rydberg-Cox, Jeffrey A., Robert F. Chavez, Anne Mahoney, David A. Smith,
Gregory R. Crane. 2000. "Knowledge Management in the Perseus Digital Library."
Ariadne 25, http://www.ariadne.ac.uk/issue25/rydberg-cox/
Smith, David A., Anne Mahoney, Jeffrey A. Rydberg-Cox. 2000. "Management of XML
Documents in an Integrated Digital Library." Proceedings of Extreme Markup Languages 2000,
219-224.
Sperberg-McQueen, C., and L. Burnard. 1994. Guidelines for Electronic Text Encoding and Interchange. Chicago: Text Encoding Initiative.
7. APPENDIX: SAMPLE CODE
7.1. Indexing and abstract DTD mapping
This routine creates the lookup table that abstracts from the specific element names
used in the DTD. It is run on each new or changed text. Ideally, the lists of
elements to be indexed, mappings from concrete tags to abstract structural elements,
and elements to be suppressed would be in an external table rather than directly in
the code. Adding support for another DTD requires modifying these lists, which
therefore are at least partly project-specific configuration rather than code.
Notes about significant features appear throughout the code.
#!/usr/bin/perl
use XML::Parser;
## useful tags to index
my %idxTags = map { $_, 1 } qw(div div0 div1 div2 div3 div4 div5 div6 div7
group text front body back milestone pb cb lb l
frag entry entryfree orth
poem speech sp section
head figure docauthor doctitle
pageinfo printpgno illus);
my %tagTypes = ( ## elements to be mapped onto abstractions
'pb' => 'page', ## e.g., the <pb> element denotes an abstract page
'cb' => 'column',
'lb' => 'line',
'l' => 'line',
'speech' => 'sp',
'frag' => 'fragment',
'pageinfo' => 'spage',
'printpgno' => 'page',
'illus' => 'figure',
'entryfree' => 'entry',
);
my %fakeEmpty = (
'controlpgno' => 1,
'pageinfo' => 1,
'printpgno' => 1,
);
my %suppress = (
'tei.header' => 1,
'teiheader' => 1,
'note' => 1,
'verse' => 1,
'oracle' => 1,
'quotedtext' => 1,
'quote' => 1,
'castgroup' => 1,
'list' => 1,
'table' => 1,
'rdg' => 1,
'bibl' => 1, # mostly to kill
);
my %printContent = (
head => 1,
orth => 1,
figure => 1,
illus => 1,
docauthor => 1,
doctitle => 1,
);
## The Expat XML parser uses call-back routines
my $parser = new XML::Parser(Handlers => {Start => \&handle_start,
End => \&handle_end,
Char => \&handle_char,
Default => \&handle_default});
my $pstack = 0;
## We might need to defer some lines if they happen inside others.
my @defer = ();
## Keep track of element context
my @context = ();
## Keep track of active elements
my @suppress = (0);
my $chunk = 0;
## Keep track of language for heads
my(@lang) = ('');
my $pfile = shift @ARGV;
if ($pfile ne '') {
$parser->parsefile($pfile);
}
else {
$parser->parse(*STDIN);
}
foreach my $defLine (@defer) {
print $defLine, "\n";
}
sub handle_start { ## Called for start of a new element
my $p = shift;
my $el = shift;
my %atts = @_;
## This may be useful later...
delete $atts{'teiform'};
my $curContext = join("",@context);
push @context, make_start_tag($el,\%atts);
push @suppress, ($suppress{lc($el)} ? 1 : $suppress[$#suppress]);
my $newlang = $atts{'lang'} ? lc($atts{'lang'}) : $lang[$#lang];
push @lang, $newlang;
print make_start_tag($el,\%atts) if $pstack && !$suppress[$#suppress];
$el = lc $el;
++$pstack if $printContent{$el};
return if $pstack > 1;
## If the current tag's purpose in life is to print its content,
## but it's being suppressed, bail out.
return if $printContent{$el} && $suppress[$#suppress];
if (!$idxTags{$el} or ($pstack and !$printContent{$el})) {
if ($chunk and $atts{'id'} ne '') {
my $idLine = join("\t", $p->current_byte, $p->depth, 'id',
$atts{'id'}, 0, $curContext);
if ($pstack) {
push @defer, $idLine;
}
else {
foreach my $defLine (@defer) {
print $defLine, "\n";
}
@defer = ();
print $idLine, "\n";
}
}
return;
}
return if ($el eq 'milestone') && ($atts{'unit'} eq 'para');
## Only throw out PHI prose lines for now.
return if ($el eq 'lb') && (lc($atts{'ed'}) eq 'phi');
my $n = $atts{'n'};
my $id = $atts{'id'};
## Are we an empty tag? If not, tielut will blow away our state when we
## exit this tag. tielut loads the output of this routine into a database
my $isEmpty = 0;
$isEmpty = 1 if $fakeEmpty{$el} || ($p->recognized_string =~ /\/>$/);
my $type = $tagTypes{$el};
if ($atts{'name'} ne '') { # For old Perseus texts.
$type = $atts{'name'};
}
elsif ($type eq 'line' and lc($n) eq 'tr') {
$type = 'tr line';
$n = '-1';
}
elsif ($atts{'type'} ne '' and $el ne 'entry' and $type ne 'entry') {
$type = $atts{'type'};
}
elsif ($atts{'unit'} ne '') {
$type = $atts{'unit'};
}
if ($atts{'ed'} ne '' and lc($atts{'ed'}) ne 'p' and $type) {
$type = "$atts{'ed'} $type";
}
$type = $el if $type eq '';
## Some DIVs are troublesome.
$suppress[$#suppress] = 1 if $suppress{$type};
if ($type eq 'line' and $n eq '') {
$n = '-1';
}
## If the line is split, only count the initial one.
return if ($type eq 'line'
and (lc($atts{'part'}) eq 'm' or lc($atts{'part'}) eq 'f'));
$n = $atts{'key'} if defined($atts{'key'});
$chunk++;
foreach my $defLine (@defer) {
print $defLine, "\n";
}
@defer = ();
## Output: byte position in XML file, nest depth of tags, abstract element type,
## counter (how many of this element we've seen), whether this is an empty "marker"
## element as opposed to a container, and the list of tags open at this point in
## the XML
if ($n =~ /=/ and !defined($atts{'key'})) {
my $count = 0;
foreach my $i (split /:/, $n) {
my($curType,$curN) = split /=/, $i, 2;
print "\n" if $count++;
print join("\t", $p->current_byte, $p->depth, lc($curType), $curN,
$isEmpty, $curContext);
}
}
else {
print join("\t", $p->current_byte, $p->depth, lc($type), $n,
$isEmpty, $curContext);
}
if ($id ne '') {
my $idLine = join("\t", $p->current_byte, $p->depth, 'id', $id,
$isEmpty, $curContext);
if ($pstack) {
push @defer, $idLine;
}
else {
print "\n", $idLine;
}
}
if ($type eq 'card' and $n ne '') {
print "\n", join("\t", $p->current_byte, $p->depth, 'line', $n,
$isEmpty, $curContext);
}
## Only suppress the content of suppressed elements, not their LUT lines.
if ($pstack and !$suppress[$#suppress]) {
print "\t";
print "" if $lang[$#lang];
}
else {
print "\n";
}
1;
}
sub handle_end { ## Called for end of an element
my $p = shift;
my $el = shift;
pop @context;
my $oldlang = pop @lang;
--$pstack if $printContent{lc($el)};
return if pop(@suppress);
if ($pstack) {
print "</$el>";
}
elsif ($printContent{lc($el)}) {
print "" if $oldlang;
print "\n";
}
1;
}
sub handle_char { ## Called for contents of elements
my($p,$s) = @_;
return unless $pstack;
return if $suppress[$#suppress];
$s = $p->original_string; # we want all characters escaped
$s =~ s/\n/ /gs;
print $s;
}
sub handle_default {
## Do nothing.
}
sub make_start_tag {
my $el = shift;
my $atts = shift;
my $res = "<$el";
while (my($att,$val) = each %$atts) {
## Escape markup characters in attribute values as XML entities.
$val =~ s/\&/\&amp;/g;
$val =~ s/\</\&lt;/g;
$val =~ s/>/\&gt;/g;
$val =~ s/\"/\&quot;/g;
$val =~ s/\'/\&apos;/g;
$res .= " $att=\"$val\"";
}
$res .= ">";
$res;
}
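For readers who want to consume the table this routine writes, the Python sketch below parses one output line. It assumes the tab-separated field order given in the note inside the routine, with an optional seventh field holding printed content; the field names and the sample line are hypothetical.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    byte_offset: int   # position of the element in the XML file
    depth: int         # nesting depth of the element
    kind: str          # abstract element type, e.g. "line"
    n: str             # counter or label for this element
    is_empty: bool     # marker element rather than container
    context: str       # start tags open at this point
    content: str       # printed content, if any

def parse_line(line):
    """Split one lookup-table line into named fields."""
    fields = line.rstrip("\n").split("\t", 6)
    fields += [""] * (7 - len(fields))   # pad the optional content field
    return Entry(int(fields[0]), int(fields[1]), fields[2], fields[3],
                 fields[4] == "1", fields[5], fields[6])

# A made-up line of the kind the routine might print for a line marker.
SAMPLE = "1024\t4\tline\t361\t1\t<text><body>"
print(parse_line(SAMPLE))
```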
7.2. Lexicon entry
This is a relatively short entry from the Greek-English Lexicon, showing
the use of the <entryfree> element and the inconsistent structure of the print
entry. Greek is encoded in Beta-code, as devised by the Thesaurus Linguae Graecae project. Note the <sense> tags and their attributes: the first group, which could be labelled "I." but is not, centers on the idea of a messenger, while the only sense in the second group, labelled "II.", is a cult title for a god. The lexicon supplies citations for each sense, and we have encoded them using the ABO codes for the authors and works they refer to.
<entryFree key="a)/ggelos"><orth extent=full lang=greek>a)/ggelos</orth>,
<gen lang=greek>o(</gen>, <gen lang=greek>h(</gen>,
<tr>messenger, envoy</tr>,
<bibl n="Perseus:abo:tlg,0012,001:2:26"><author>Il.</author><biblScope>2.26</biblScope></bibl>, etc.;
<foreign lang=greek>di' a)gge/lwn o(mile/ein tini/</foreign> <bibl n="Perseus:abo:tlg,0016,001:5:92"><author>Hdt.</author><biblScope>5.92</biblScope></bibl>.<foreign lang=greek>z/</foreign>,
cf. <bibl><title>SIG</title><biblScope>229.25</biblScope></bibl> (<placeName>Erythrae</placeName>):—
prov., <foreign lang=greek>*)ara/bios a)/</foreign>., of a loquacious person, <bibl n="Perseus:abo:tlg,0541,001:32"><author>Men.</author><biblScope>32</biblScope></bibl>.
<sense n="2" level="3"> generally, <tr>one that announces</tr> or <tr>tells</tr>, e.g. of birds of augury,
<bibl n="Perseus:abo:tlg,0012,001:24:292"><author>Il.</author><biblScope>24.292</biblScope></bibl>,
<bibl n="Perseus:abo:tlg,0012,001:296"><biblScope>296</biblScope></bibl>;
<foreign lang=greek>*mousw=n a)/ggelos</foreign>, of a poet, <bibl n="Perseus:abo:tlg,0002,001:769"><author>Thgn.</author><biblScope>769</biblScope></bibl>;
<foreign lang=greek>a)/ggele e)/aros . . xelidoi=</foreign> <bibl n="Perseus:abo:tlg,0261,001:74"><author>Simon</author><biblScope>74</biblScope></bibl>;
<foreign lang=greek>a)/. a)/fqoggos</foreign>, of a beacon, <bibl n="Perseus:abo:tlg,0002,001:549"><author>Thgn.</author><biblScope>549</biblScope></bibl>;
of the nightingale, <foreign lang=greek>o)/rnis . . *dio\s a)/</foreign>. <bibl n="Perseus:abo:tlg,0011,005:149"><author>S.</author><title>El.</title><biblScope>149</biblScope></bibl>:
c. gen. rei, <foreign lang=greek>a)/. kakw=n e)mw=n</foreign> <bibl n="Perseus:abo:tlg,0011,002:277"><author>Id.</author><title>Ant.</title><biblScope>277</biblScope></bibl>;
<foreign lang=greek>a)/ggelon glw=ssan lo/gwn</foreign> <bibl n="Perseus:abo:tlg,0006,008:203"><author>E.</author><title>Supp.</title><biblScope>203</biblScope></bibl>;
<foreign lang=greek>ai)/sqhsis h(mi=n a)/.</foreign> <bibl n="Perseus:abo:tlg,2000,001:5:3:3"><author>Plot.</author><biblScope>5.3.3</biblScope></bibl>;
neut. pl., <foreign lang=greek>a)/ggela ni/khs</foreign> <bibl n="Perseus:abo:tlg,2045,001:34:226"><author>Nonn.</author><title>D.</title><biblScope>34.226</biblScope></bibl>. </sense>
<sense n="3" level="3"> <tr>angel</tr>,
<bibl n="Perseus:abo:tlg,0527,001:28:12"><author>LXX</author> <title>Ge.</title><biblScope>28.12</biblScope></bibl>,
al., <bibl n="Perseus:abo:tlg,0031,001:1:24"><title>Ev.Matt.</title><biblScope>1.24</biblScope></bibl>,
al., <bibl n="Perseus:abo:tlg,0018,001:2:604"><author>Ph.</author><biblScope>2.604</biblScope></bibl>, etc. </sense>
<sense n="4" level="3"> in later philos., <tr>semi-divine being</tr>,
<foreign lang=greek>h(liakoi\ a)/.</foreign> <bibl n="Perseus:abo:tlg,2003,001:4:141b"><author>Jul.</author><title>Or.</title><biblScope>4.141b</biblScope></bibl>,
cf. <bibl n="Perseus:abo:tlg,2023,006:2:6"><author>Iamb.</author><title>Myst.</title><biblScope>2.6</biblScope></bibl>,
<bibl><author>Procl.</author></bibl> <tr>in R.</tr><bibl><biblScope>2.243</biblScope></bibl> K.;
<foreign lang=greek>a)/. kai\ a)rxa/ggeloi</foreign> <bibl><title>Theol.Ar.</title><biblScope>43.10</biblScope></bibl>,
cf. <bibl n="Perseus:abo:tlg,4066,003:183"><author>Dam.</author><title>Pr.</title><biblScope>183</biblScope></bibl>,
al.: also in mystical and magical writings, <bibl><author>Herm.</author></bibl> ap. <bibl><author>Stob.</author><biblScope>1.49.45</biblScope></bibl>,
<bibl><title>PMag.Lond.</title><biblScope>46.121</biblScope></bibl>, etc. </sense>
<sense n="II" level="2"> title of Artemis at Syracuse, <bibl><author>Hsch.</author></bibl></sense>
</entryFree>
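Because the citations carry ABO codes in the n attribute of <bibl>, they can be pulled out of an entry mechanically. The Python sketch below does so with a regular expression, since the SGML above has unquoted attributes and is not well-formed XML; the pattern, the function name, and the choice to render the passage with dots are assumptions for illustration, and the fragment is taken from the entry above.

```python
import re

# An ABO code, then a colon-separated passage reference.
BIBL = re.compile(r'<bibl n="(Perseus:abo:[^":]+):([^"]+)">')

def citations(entry):
    """Return (ABO, passage) pairs for every coded citation."""
    return [(work, passage.replace(":", "."))
            for work, passage in BIBL.findall(entry)]

FRAGMENT = ('<bibl n="Perseus:abo:tlg,0012,001:2:26"><author>Il.</author>'
            '<biblScope>2.26</biblScope></bibl>, etc.; '
            '<bibl n="Perseus:abo:tlg,0016,001:5:92"><author>Hdt.</author>'
            '<biblScope>5.92</biblScope></bibl>, '
            'cf. <bibl><title>SIG</title><biblScope>229.25</biblScope></bibl>')

for work, passage in citations(FRAGMENT):
    print(work, passage)
```

Citations without an n attribute, such as the reference to SIG, are skipped because they do not point to a work in the digital library.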