From BibTeX to HTML semantic nets

J.J. Almeida and J.C. Ramalho

Contents

Introduction

DBIB: the tool

Usage

Generated Files

BibTeX extensions

Author index

Thesaurus and subject index

Thesaurus syntax

Future work

Example

Thesaurus file: 'doc.the'

Bibtex file: 'doc.bib'

Produced pages


Introduction

Most researchers at universities and research centers use LaTeX to produce their documentation.

Once LaTeX is chosen as the text formatter and processor, using BibTeX to manage references and bibliographies follows naturally.

So, in a university, there are probably some hundreds of bib files, and normally some of them are huge. This sometimes makes bib files hard to use and manage, and the need arises for some tool or mechanism capable of dealing with the problem.

This work is about building such a tool: one that provides the means to navigate inside a bib file, in a linear or structural way.

The idea was to use WWW browsers for navigation, thus serving two needs: navigating inside a bib file and making that information available to the community. Our problem therefore became the development of a BibTeX processor that generates index structures and produces parametrized trees of HTML pages.

The developed tool can produce three different indexes (by author, by subject, and in BibTeX file order) and one individual page for each bibitem in the BibTeX file.


DBIB: the tool

This work is somewhat related to a project called DAVID, so it became DBIB.

Perl was the language chosen to develop the tool, due to its orientation towards text processing and record processing, and its easy and powerful file manipulation.

Usage

The user has some switches to parametrize the tool and to control which output files are generated.

To invoke the tool, use the following syntax:

   dbib [-s] [-a] [-b] [-sdp] thesaurus bibfile [doc_pages_dir]

  -s   generate subject index
  -a   generate author index
  -b   generate bibtex list
  -sdp single doc page  (bibfile.dp.html)

As you can see, 'dbib' must be invoked with two arguments; all the others are optional:

thesaurus
is the name of a text file containing the declaration of a subject tree according to a specific syntax (see the "Thesaurus syntax" section of this document).
bibfile
is a normal BibTeX file.
doc_pages_dir
is an optional argument that stands for the name of the directory where the individual document HTML pages will be created. This directory must exist before the tool is invoked. If this parameter isn't supplied, DBIB will assume that a subdirectory named 'docpages' exists.

The switches control the index generation and the individual HTML page generation.

Switches -s, -a, and -b work in conjunction. If none of them is supplied all the index files will be generated. If one or two of them are supplied, only the index files bound to those switches will be generated.
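The way the three index switches combine can be sketched in a few lines. The sketch below is written in Python rather than in DBIB's own Perl, and the function name and index labels are ours, chosen only for illustration; it is not taken from the DBIB source.

```python
ALL_INDEXES = {"subject", "author", "bibtex"}


def indexes_to_generate(subject=False, author=False, bibtex=False):
    """Return the index kinds selected by the -s, -a and -b switches."""
    requested = {
        name
        for name, given in (("subject", subject), ("author", author), ("bibtex", bibtex))
        if given
    }
    # Supplying none of the three switches means "generate every index".
    return requested or set(ALL_INDEXES)


print(sorted(indexes_to_generate()))             # no switch given
print(sorted(indexes_to_generate(author=True)))  # only -a given
```

Treating "no switch" as "all indexes" keeps the common case short on the command line, at the cost of having no way to request zero indexes.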

Concerning output, we can distinguish two major modes of operation:

multipage
In this mode, an HTML page will be created for each bibitem in the bib file. All these pages will be created in a specific directory, and the index HTML pages will have the appropriate hyperlinks set to each of them.
single page
In this mode, the output produced for the different bibitems will be concatenated in a single file. It will still be possible to access each of them individually, since the appropriate HTML hyperlinks will be generated.

We can think of two situations in which to apply these modes. If you have a small bib file, say with about 10 to 20 bibitems, you are advised to use single mode. If you have a larger bib file, multipage mode is advisable.

Switch -sdp selects single mode. Multipage mode is the default.

Generated Files

DBIB generates several files: one for each index, and one or more for the individual pages (depending on the selected mode, single or multipage).

We had to be careful with the names of the files to prevent overwriting, so we use the name of the BibTeX file as a prefix, which is concatenated with the appropriate string in each case:

bibfile.aut.html
author index file - contains an ordered list of authors, each with the list of their publications (a short description of each publication is generated, together with a link to a page where all the information about that publication is available). This file is created if switch -a is given or in default mode (neither -a nor -b nor -s is given).
bibfile.sub.html
subject index file - same as the author file but more structured. This file has a rich structure (translated into a visual structure and into the right hyperlinks), taken from the 'thesaurus' that is passed as an argument. This file is created if switch -s is given or in default mode (neither -a nor -b nor -s is given).
bibfile.html
bibtex file - contains the HTML version of the BibTeX file. It is composed of a list of bibitems in the same order as they appear in the original file. This file is created if switch -b is given or in default mode (neither -a nor -b nor -s is given).
docpages/....html
bibitem individual pages - if single mode is selected, only one page with all the references will be generated in the current directory; in multipage mode, one page will be generated for each bibitem, and the directory 'docpages' (by default) will contain all the individual pages; the page reference is built from the bibitem citation key.
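The naming scheme for the index files can be summarized as a small lookup. This is only a sketch in Python: the function names are hypothetical, and since the text does not say whether the '.bib' extension is stripped before the name is used as a prefix, the prefix is simply taken as given. The '.dp.html' suffix for the single-page file comes from the usage summary above.

```python
# Suffix appended to the prefix for each kind of index file.
INDEX_SUFFIXES = {
    "author": ".aut.html",
    "subject": ".sub.html",
    "bibtex": ".html",
}


def index_file_names(prefix, kinds):
    """Map each requested index kind to the name of its output file."""
    return {kind: prefix + INDEX_SUFFIXES[kind] for kind in kinds}


def single_page_name(prefix):
    """Name of the file holding all bibitem pages in single mode (-sdp)."""
    return prefix + ".dp.html"


print(index_file_names("bibfile", ["author", "subject", "bibtex"]))
print(single_page_name("bibfile"))
```

Because every name starts with the same prefix, running the tool on two different bib files in the same directory does not overwrite the first set of indexes.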


BibTeX extensions

To get the most out of DBIB you must be aware of the assumptions we made and the extensions we have added.

Here is a list of fields that you can add to each bibitem. None of them is required, but if you supply them you'll end up with richer output:

keyword
(used to build the subject index) - comma-separated list of strings (case sensitive). Example:
        keyword = "perl, tool, text processing",
abstract
- free text - can contain several paragraphs; will be used in the individual page generation.
url
- if you want to make your publication available in ps, dvi or other format, you must fill this field with the proper url. Example:
        url = "http://www.di.uminho.pt/~jj/bib/dbib.ps",
docpage
- if you have already created an HTML page for some document and you don't want DBIB to overwrite it, or you simply want DBIB to take into account that some individual pages have already been created, supply in this field the url of the document page. If you do that, DBIB won't create the corresponding individual page, but it will use the supplied url to link the index files to this page.


Author index

As we said before, the author index is generated when option -a is given or when no options are given.

The author index is based on the "author" field of BibTeX items. Author field may have:

Names can be introduced in normal order (Ex: J.J. Dias Almeida) or normalized notation (Ex: Almeida, J.J. Dias).

In the author index of dbib all names are normalized and sorted; no miracles are performed to guess that "J. Almeida" is the same as "J.J. Dias Almeida", so you must be careful about how you write names.
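The normalization step can be illustrated with a short sketch. It is written in Python, not in DBIB's Perl, and it rests on a simplifying assumption of ours: in a name given in normal order, the last word is taken to be the surname. The function names are hypothetical, and real BibTeX name parsing has more cases than this handles.

```python
import re


def split_authors(field):
    """Split a BibTeX author field into names (BibTeX separates them with 'and')."""
    return [name.strip() for name in re.split(r"\s+and\s+", field.strip()) if name.strip()]


def normalize(name):
    """Return the name in 'Surname, Given names' form."""
    if "," in name:
        # Already in normalized notation: only tidy the spacing.
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        # Assumption: the last word of a normal-order name is the surname.
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return f"{surname}, {given}" if given else surname


def author_index(entries):
    """Build a sorted list of (normalized author, [citation keys]) pairs."""
    index = {}
    for key, author_field in entries.items():
        for name in split_authors(author_field):
            index.setdefault(normalize(name), []).append(key)
    return sorted(index.items())


print(normalize("J.J. Dias Almeida"))   # Almeida, J.J. Dias
print(normalize("Almeida, J.J. Dias"))  # Almeida, J.J. Dias
```

As the text warns, "J. Almeida" and "J.J. Dias Almeida" normalize to different strings here and would end up as two separate entries in the index.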


Thesaurus and subject index

Subject index is based on the Thesaurus file and the "keyword" field of the bibtex items. Keyword field may have:

DBIB will give a warning for every keyword not present in the Thesaurus. Such a keyword will nevertheless appear in the index as a top-level subject.
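The keyword check can be sketched as follows. This is an illustration in Python with hypothetical function names, and it assumes that a keyword counts as "present in the Thesaurus" when it matches, case sensitively, either a term or one of its synonyms; the set of known names below was copied by hand from the example thesaurus at the end of this document.

```python
def split_keywords(field):
    """Split a comma-separated keyword field, keeping case as written."""
    return [word.strip() for word in field.split(",") if word.strip()]


def unknown_keywords(field, known):
    """Return the keywords that match no known term or synonym (exact match)."""
    return [word for word in split_keywords(field) if word not in known]


# Terms and synonyms taken from the example thesaurus 'doc.the' (partial).
KNOWN = {"camila", "DAVID", "david", "SGML", "sgml", "Perl", "perl",
         "Project", "project", "formal specification"}

for word in unknown_keywords("perl, tool, text processing", KNOWN):
    print(f"warning: keyword '{word}' is not in the thesaurus")
```

With the keyword field from the example in the previous section, "perl" is recognized through the synonym of Perl, while "tool" and "text processing" each produce a warning.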

The generated index contains:

The thesaurus file defines the set of valid keywords (terms). A term may be associated with:

A term is valid if:

Thesaurus syntax

Term definitions are separated by empty lines. Each term entry looks like:
      term
      * description line (optional)
      # url              (optional)
      = synonyms         (optional)
      > specific terms   (optional)

The special characters (*, #, = and >) must be at the beginning of the line.
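A reader for this format fits in a few lines. The following Python sketch is not DBIB's parser; the function and field names are ours, and it assumes that synonyms and specific terms are comma separated, as the specific terms are in the example thesaurus. Lines that start with none of the four special characters are ignored after the first line of an entry.

```python
import re

# Field selected by the special character in the first column of a line.
MARKERS = {"*": "description", "#": "url", "=": "synonyms", ">": "specific"}


def parse_thesaurus(text):
    """Parse thesaurus text into a dict mapping each term to its fields."""
    terms = {}
    # Term definitions are separated by empty lines.
    for block in re.split(r"\n\s*\n", text.strip("\n")):
        lines = [line.rstrip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        entry = {"description": [], "url": None, "synonyms": [], "specific": []}
        for line in lines[1:]:
            field = MARKERS.get(line[0])  # marker must be the first character
            rest = line[1:].strip()
            if field == "description":
                entry["description"].append(rest)
            elif field == "url":
                entry["url"] = rest
            elif field in ("synonyms", "specific"):
                entry[field] += [t.strip() for t in rest.split(",") if t.strip()]
        terms[lines[0].strip()] = entry
    return terms


SAMPLE = """\
Project
* Related Projects
> camila, DAVID
= project

DAVID
* project - document specification and manipulation
# http://www.di.uminho.pt/~jcr/projectos/david/princ.html
= david
"""

for term, fields in parse_thesaurus(SAMPLE).items():
    print(term, fields)
```

Keeping each entry as a plain dictionary makes it easy to derive both the set of valid keywords (terms plus synonyms) and the term hierarchy (the specific-term lists) from one parse.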


Future work

No invariants are tested yet (e.g. the Thesaurus relation should be antitransitive, antisymmetric and antireflexive).
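Such a test could look like the sketch below. It is written in Python and reflects only our reading of the three properties, applied to the relation "term > specific term": no term lists itself, no two terms list each other, and no term lists directly a term that it already reaches through an intermediate one. The function name and message format are hypothetical.

```python
def check_invariants(specific):
    """Return messages for violations found in a term -> specific-terms mapping."""
    pairs = {(a, b) for a, subs in specific.items() for b in subs}
    problems = []
    for a, b in sorted(pairs):
        if a == b:
            problems.append(f"reflexive: {a} > {a}")
        elif (b, a) in pairs and a < b:
            # Report each symmetric pair once.
            problems.append(f"symmetric: {a} <> {b}")
    for a, b in sorted(pairs):
        for c in specific.get(b, ()):
            if a != b and b != c and (a, c) in pairs:
                problems.append(f"transitive: {a} > {b} > {c} and {a} > {c}")
    return problems


print(check_invariants({"A": ["B", "C"], "B": ["C"]}))
```

The check only inspects paths of length two; detecting longer cycles would need a graph traversal, which is left out of this sketch.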

Configuration files are not yet available.

Some tools are planned to improve thesaurus and BibTeX construction.


Example

We present a small example of dbib usage. We took a small BibTeX file on the subject of 'Document Processing' and applied dbib to it to generate the HTML version.

The original files are listed below, together with some images of the generated HTML pages.

Thesaurus file: 'doc.the'

Document Processing
> Project, Specification and Programming Language, Paradigms

Project
* Related Projects
> camila, DAVID
= project

Specification and Programming Language
* Languages related to this framework
> camila, SGML, Perl, LaTeX, ODA
= formal specification

Paradigms
> Literate Programming

LaTeX
* text processor and programming tool
= latex

ODA
* Open Document Architecture
= oda

Literate Programming
* Documenting while programming
= literate programming

Perl
* programming language
= perl

SGML
* standard generalized markup language (ISO 8879)
= sgml

camila
* project and tool - A Platform for Software Mathematical Development
# http://www.di.uminho.pt/~lsb/camila.html

DAVID
* project - document specification and manipulation
# http://www.di.uminho.pt/~jcr/projectos/david/princ.html
= david

Bibtex file: 'doc.bib'

@book{barbosa95,
author = "Luis Barbosa and J.J. Almeida",
title  = "System Prototyping in CAMILA",
year = 1995,
note = "Lecture notes for the system Design Course, Computer System
Engineering, University of Bristol",
publisher="University of Minho",
url="http://www.di.uminho.pt/~lsb/camila.ps.gz",
keyword = "camila, formal specification",
}

@InProceedings{DAVID95,
author =         "J.C. Ramalho and J.J. Almeida and P.R. Henriques",
title =          "DAVID - Algebraic Specification of Documents",
booktitle = "TWLT10 - Algebraic Methods in Language Processing - AMiLP95",
year =   "1995",
editor =         "A.Nijholt & G.Scollo & R.Steetskamp",
number =         "10",
series =         "Twente Workshop on Language Technology",
address =        "Twente University - Holland",
month =          "Dec.",
keyword = "david,camila,sgml,formal specification,perl,project",
abstract =  "It is becoming normal that a document should serve several
purposes.  However, the majority of available text processors is
purpose-oriented, reducing the necessary flexibility and reusability of
documents.  Some waste of time arises from adapting the same text to
each different purpose, when this task could be done automatically
(from the first version of the document) with an appropriate system.
...

We intend to build on camila (a specification language and prototyping
environment developed at Universidade do Minho, by the Computer Science
group) developing the above mentioned system as one of its extensions.",
}

...

Produced pages