Chapter 8. The Record Model

Table of Contents
Local Representation
Internal Representation
Configuring Your Data Model
Exchange Formats

The Zebra system is designed to support a wide range of data management applications. The system can be configured to handle virtually any kind of structured data. Each record in the system is associated with a record schema which lends context to the data elements of the record. Any number of record schemas can coexist in the system. Although it may be wise to use only a single schema within one database, the system poses no such restrictions.

The record model described in this chapter applies to the fundamental, structured record type grs, introduced in the Section called Record Types in Chapter 5.

Records pass through three different states during processing in the system.

When records are accessed by the system, they are represented in their local, or native format. This might be SGML or HTML files, News or Mail archives, MARC records. If the system doesn't already know how to read the type of data you need to store, you can set up an input filter by preparing conversion rules based on regular expressions and possibly augmented by a flexible scripting language (Tcl). The input filter produces as output an internal representation, a tree structure.
When records are processed by the system, they are represented in a tree-structure, constructed by tagged data elements hanging off a root node. The tagged elements may contain data or yet more tagged elements in a recursive structure. The system performs various actions on this tree structure (indexing, element selection, schema mapping, etc.),
Before transmitting records to the client, they are first converted from the internal structure to a form suitable for exchange over the network - according to the Z39.50 standard.

Local Representation

As mentioned earlier, Zebra places few restrictions on the type of data that you can index and manage. Generally, whatever the form of the data, it is parsed by an input filter specific to that format, and turned into an internal structure that Zebra knows how to handle. This process takes place whenever the record is accessed - for indexing and retrieval.

The RecordType parameter in the zebra.cfg file, or the -t option to the indexer tells Zebra how to process input records. Two basic types of processing are available - raw text and structured data. Raw text is just that, and it is selected by providing the argument text to Zebra. Structured records are all handled internally using the basic mechanisms described in the subsequent sections. Zebra can read structured records in many different formats. How this is done is governed by additional parameters after the "grs" keyword, separated by "." characters.

Four basic subtypes to the grs type are currently available:

grs.sgml: This is the canonical input format — described below. It is a simple SGML-like syntax.
grs.regx.filter: This enables a user-supplied input filter. The mechanisms of these filters are described below.
grs.tcl.filter: Similar to grs.regx but using Tcl for rules.
grs.marc.abstract syntax: This allows Zebra to read records in the ISO2709 (MARC) encoding standard. In this case, the last parameter abstract syntax names the .abs file (see below) which describes the specific MARC structure of the input record as well as the indexing rules.
: This filter reads XML records. Only one record per file is supported. The filter is only available if Zebra/YAZ is compiled with EXPAT support.

Canonical Input Format

Although input data can take any form, it is sometimes useful to describe the record processing capabilities of the system in terms of a single, canonical input format that gives access to the full spectrum of structure and flexibility in the system. In Zebra, this canonical format is an "SGML-like" syntax.

To use the canonical format specify grs.sgml as the record type.

Consider a record describing an information resource (such a record is sometimes known as a locator record). It might contain a field describing the distributor of the information resource, which might in turn be partitioned into various fields providing details about the distributor, like this:

<Distributor> <Name> USGS/WRD </Name> <Organization> USGS/WRD </Organization> <Street-Address> U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW </Street-Address> <City> ALBUQUERQUE </City> <State> NM </State> <Zip-Code> 87102 </Zip-Code> <Country> USA </Country> <Telephone> (505) 766-5560 </Telephone> </Distributor>

The keywords surrounded by <...> are tags, while the sections of text in between are the data elements. A data element is characterized by its location in the tree that is made up by the nested elements. Each element is terminated by a closing tag - beginning with </, and containing the same symbolic tag-name as the corresponding opening tag. The general closing tag - </> - terminates the element started by the last opening tag. The structuring of elements is significant. The element Telephone, for instance, may be indexed and presented to the client differently, depending on whether it appears inside the Distributor element, or some other, structured data element such a Supplier element.

Record Root

The first tag in a record describes the root node of the tree that makes up the total record. In the canonical input format, the root tag should contain the name of the schema that lends context to the elements of the record (see the Section called Internal Representation). The following is a GILS record that contains only a single element (strictly speaking, that makes it an illegal GILS record, since the GILS profile includes several mandatory elements - Zebra does not validate the contents of a record against the Z39.50 profile, however - it merely attempts to match up elements of a local representation with the given schema):

<gils> <title>Zen and the Art of Motorcycle Maintenance</title> </gils>

Variants

Zebra allows you to provide individual data elements in a number of variant forms. Examples of variant forms are textual data elements which might appear in different languages, and images which may appear in different formats or layouts. The variant system in Zebra is essentially a representation of the variant mechanism of Z39.50-1995.

The following is an example of a title element which occurs in two different languages.

<title> <var lang lang "eng"> Zen and the Art of Motorcycle Maintenance</> <var lang lang "dan"> Zen og Kunsten at Vedligeholde en Motorcykel</> </title>

The syntax of the variant element is <var class type value>. The available values for the class and type fields are given by the variant set that is associated with the current schema (see the Section called The Variant Set (.var) Files).

Variant elements are terminated by the general end-tag </>, by the variant end-tag </var>, by the appearance of another variant tag with the same class and value settings, or by the appearance of another, normal tag. In other words, the end-tags for the variants used in the example above could have been omitted.

Variant elements can be nested. The element

<title> <var lang lang "eng"><var body iana "text/plain"> Zen and the Art of Motorcycle Maintenance </title>

Associates two variant components to the variant list for the title element.

Given the nesting rules described above, we could write

<title> <var body iana "text/plain> <var lang lang "eng"> Zen and the Art of Motorcycle Maintenance <var lang lang "dan"> Zen og Kunsten at Vedligeholde en Motorcykel </title>

The title element above comes in two variants. Both have the IANA body type "text/plain", but one is in English, and the other in Danish. The client, using the element selection mechanism of Z39.50, can retrieve information about the available variant forms of data elements, or it can select specific variants based on the requirements of the end-user.

Input Filters

In order to handle general input formats, Zebra allows the operator to define filters which read individual records in their native format and produce an internal representation that the system can work with.

Input filters are ASCII files, generally with the suffix .flt. The system looks for the files in the directories given in the profilePath setting in the zebra.cfg files. The record type for the filter is grs.regx.filter-filename (fundamental type grs, file read type regx, argument filter-filename).

Generally, an input filter consists of a sequence of rules, where each rule consists of a sequence of expressions, followed by an action. The expressions are evaluated against the contents of the input record, and the actions normally contribute to the generation of an internal representation of the record.

An expression can be either of the following:

INIT: The action associated with this expression is evaluated exactly once in the lifetime of the application, before any records are read. It can be used in conjunction with an action that initializes tables or other resources that are used in the processing of input records.
BEGIN: Matches the beginning of the record. It can be used to initialize variables, etc. Typically, the BEGIN rule is also used to establish the root node of the record.
END: Matches the end of the record - when all of the contents of the record has been processed.
/pattern/: Matches a string of characters from the input record.
BODY: This keyword may only be used between two patterns. It matches everything between (not including) those patterns.
FINISH: The expression associated with this pattern is evaluated once, before the application terminates. It can be used to release system resources - typically ones allocated in the INIT step.

An action is surrounded by curly braces ({...}), and consists of a sequence of statements. Statements may be separated by newlines or semicolons (;). Within actions, the strings that matched the expressions immediately preceding the action can be referred to as $0, $1, $2, etc.

The available statements are:

begin type [parameter ... ]

Begin a new data element. The type is one of the following:

record: Begin a new record. The following parameter should be the name of the schema that describes the structure of the record, eg. gils or wais (see below). The begin record call should precede any other use of the begin statement.
element: Begin a new tagged element. The parameter is the name of the tag. If the tag is not matched anywhere in the tagsets referenced by the current schema, it is treated as a local string tag.
variant: Begin a new node in a variant tree. The parameters are class type value.

data

Create a data element. The concatenated arguments make up the value of the data element. The option -text signals that the layout (whitespace) of the data should be retained for transmission. The option -element tag wraps the data up in the tag. The use of the -element option is equivalent to preceding the command with a begin element command, and following it with the end command.

end [type]

Close a tagged element. If no parameter is given, the last element on the stack is terminated. The first parameter, if any, is a type name, similar to the begin statement. For the element type, a tag name can be provided to terminate a specific tag.

The following input filter reads a Usenet news file, producing a record in the WAIS schema. Note that the body of a news posting is separated from the list of headers by a blank line (or rather a sequence of two newline characters.

BEGIN { begin record wais } /^From:/ BODY /$/ { data -element name $1 } /^Subject:/ BODY /$/ { data -element title $1 } /^Date:/ BODY /$/ { data -element lastModified $1 } /\n\n/ BODY END { begin element bodyOfDisplay begin variant body iana "text/plain" data -text $1 end record }

If Zebra is compiled with support for Tcl (Tool Command Language) enabled, the statements described above are supplemented with a complete scripting environment, including control structures (conditional expressions and loop constructs), and powerful string manipulation mechanisms for modifying the elements of a record. Tcl is a popular scripting environment, with several tutorials available both online and in hardcopy.