Publications Scripts

by Christopher Twigg

These scripts are used to generate both the publications and courses pages on the Carnegie Mellon Graphics website. They are posted here in the hope that they will be useful to other schools. If you use these for your own page, we request only that (1) you include an acknowledgement and a link back to this page on each of the pages you generate (the easiest way to do this is to add some text to the footer), and (2) if you make any changes that you think would be useful to others, you make these available for others to use.

The software was written largely by Christopher Twigg with some modifications by Jim McCann. It is all written in C++ and should be easy to compile on any Linux machine (Windows and OS X should also be doable but might take a little more work). It is released under a BSD license, and can be downloaded here.

It is likely that you won't want your publications page to look exactly like ours. Superficial changes can easily be made using style sheets; larger changes can be made by modifying the code itself (primarily genPages.cxx and coursesPages.cxx). I have tried to document the code, but if you have any questions please email me at cdtwigg at cs.cmu.edu.

Overview

The main publications database is a single BiBTeX file. This is for several reasons:

It seems silly to use dynamically generated pages and an SQL database for content that changes only once every few months; doing so would add the overhead of maintaining the database, worrying about security, etc.
BiBTeX is used throughout the academic world, and we know that it contains everything needed to fully describe a publication. On the other hand, developing a database schema that contains everything we could possibly need to describe publications would take substantial effort and involve dealing with lots of rules about the difference between conferences and journals, etc.
We would like to be able to supply all our publications in BiBTeX format at the end of the day, and if we start with this format this becomes trivial.

Our particular BiBTeX database has been enhanced with a number of extra fields, such as abstract, images, etc. It is described in full here.

Running the scripts

To run the scripts, you first need to be logged in locally to kip.graphics.cs.cmu.edu, which is our web server. Since the scripts depend on a number of libraries which are installed in the /usr0/www-scripts/lib directory, you will need to add this to your LD_LIBRARY_PATH environment variable. In csh or tcsh, the command to do this is

setenv LD_LIBRARY_PATH /usr0/www-scripts/lib:$LD_LIBRARY_PATH

You can have this set every time you log into kip by adding the following to your .cshrc file (assuming again that you are using tcsh):

if( `hostname` =~ 'kip.graphics.cs.cmu.edu' ) then
  setenv LD_LIBRARY_PATH /usr0/www-scripts/lib
endif

pubsPages

To get a full list of options, run the pubsPages script from the command line with /usr0/www-scripts/genPages/pubsPages --help. The best way to run it, though, is with a configuration file, so that you don't need to specify all the options on the command line. To do this, simply list all the options in a file sthg.cfg with the format

option1 = value
option2 = value

Then, run pubsPages sthg.cfg.
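To make the option = value format concrete, here is a minimal Python sketch of reading such a file. It is only an illustration, not the parser that pubsPages actually uses (that lives in the C++ source). The function name parse_config and the decision to ignore blank lines are assumptions made for the example; the option names are the ones described in the sections that follow.

```python
def parse_config(text):
    """Parse 'option = value' lines into a dict (hypothetical helper)."""
    options = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        # Split on the first '=' only, so values may themselves contain '='.
        name, sep, value = line.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"line {line_number}: expected 'option = value'")
        options[name.strip()] = value.strip()
    return options


sample = """
bibtex-file = glab.bib
bbl-file = dummy.bbl
server-name = http://myserver.cs.cmu.edu
"""
config = parse_config(sample)
print(config["server-name"])
```

Splitting on the first = sign keeps URLs with query strings intact, which seemed like the least surprising choice for a sketch.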

The script will read in the appropriate files (described below) and generate the files abstracts.*.php, blurbs.php, and index.php. It will also create various images in image-root/sidebar, image-root/pubIcons, and image-root/abstracts. If any errors are encountered, it will exit and not change any files. You should then fix the errors and re-run the scripts.

Note that the script relies on certain bibtex-generated files to run correctly. I therefore suggest running it from a wrapper script like the following (here, dummy.tex is the placeholder LaTeX file described below):

#!/bin/sh

latex dummy
bibtex dummy
latex dummy
bibtex dummy

LD_LIBRARY_PATH=/usr0/www-scripts/lib /usr0/www-scripts/genPages/pubsPages sthg.cfg

Further explanation of the various options follows.

--bibtex-file and --bbl-file

In order for the script to run, it first needs a BiBTeX file in the appropriate format. Less obviously, it also needs a compiled BiBTeX file to generate the citation that appears in the abstracts file. To generate this, we need to create a dummy.tex that simply cites all publications, like this,

\documentclass{article}
\author{Nobody}
\title{Placeholder Document}

\begin{document}
\maketitle

This is a placeholder document which just includes all the
references in glab.bib so that I can later extract formatting
information about them.

\bibliographystyle{plain}
\nocite{*}
\bibliography{glab}

\end{document}

We simply need to run latex and bibtex on this file to generate the appropriate dummy.bbl file. If any publications are missing from the bbl file, the script will report them and exit with an error. It is therefore important that the bbl file be kept up to date with the original BiBTeX file, which is the purpose of the updatePubs wrapper script described above.
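The consistency check between the BiBTeX file and the bbl file can be sketched roughly as follows. This is a Python illustration under simplifying assumptions, not the actual C++ code (which uses btparse rather than regular expressions). The helper names and the second key, Biv:2009:more, are hypothetical.

```python
import re


def bibtex_keys(bib_text):
    """Return the citation keys of all entries in a .bib file, in order."""
    keys = []
    # Matches e.g. '@article{Biv:2008:rays,' and captures the entry type and key.
    for kind, key in re.findall(r"@(\w+)\s*\{\s*([^,\s]+)\s*,", bib_text):
        if kind.lower() not in ("string", "comment", "preamble"):
            keys.append(key)
    return keys


def bbl_keys(bbl_text):
    """Return the set of keys that have a \\bibitem in a .bbl file."""
    return set(re.findall(r"\\bibitem(?:\[[^\]]*\])?\{([^}]+)\}", bbl_text))


def missing_from_bbl(bib_text, bbl_text):
    """List the .bib keys that have no matching \\bibitem in the .bbl file."""
    cited = bbl_keys(bbl_text)
    return [key for key in bibtex_keys(bib_text) if key not in cited]


bib = """@inproceedings{Biv:2008:rays,
  title = {Ray Tracing Really Fast},
  author = {Roy G. Biv},
}
@article{Biv:2009:more,
  title = {A Hypothetical Second Entry},
}
"""

bbl = r"""\begin{thebibliography}{1}

\bibitem{Biv:2008:rays}
Roy~G. Biv.
\newblock Ray tracing really fast.

\end{thebibliography}
"""

print(missing_from_bbl(bib, bbl))
```

A non-empty result means the bbl file is stale and latex and bibtex need to be re-run on dummy.tex.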

Note that here we are using the "plain" bibliography style. More exotic bibliography styles may stick extra junk into the bbl file which may cause the bbl parser to throw an error, so if you are unhappy with the current format and would rather use, e.g., ACM style, you may need to modify the script somewhat.

--homepage-file

In order to generate links to people's home pages, they need to be listed in the homepages file. This is a file with the format firstname lastname homepage, like this:

John Smith http://www.example.com/~jsmith
Jenny Public http://www.example.com/~jbrown
George Washington http://www.whitehous.gov/~george

Note that the current script only uses the last name and first initial (this is because there was some variation in the way some people spell their first names). So if we ever get two people with the same last name and first initial, this will need to be changed.
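The matching rule described above can be illustrated with a short Python sketch. It is not the script's own code; the function names and the case-insensitive comparison are assumptions made for the example.

```python
def load_homepages(text):
    """Map (last name, first initial) to a URL from 'firstname lastname homepage' lines."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        first, last, url = parts
        # Only the last name and first initial are kept, so spelling
        # variations in first names still resolve to the same entry.
        table[(last.lower(), first[0].lower())] = url
    return table


def homepage_for(table, first, last):
    """Return the homepage for an author, or None if the author is not listed."""
    return table.get((last.lower(), first[0].lower()))


homepages = load_homepages("""
John Smith http://www.example.com/~jsmith
George Washington http://www.whitehous.gov/~george
""")

print(homepage_for(homepages, "Jon", "Smith"))
print(homepage_for(homepages, "Jane", "Smith"))
```

Both lookups print John Smith's URL. The first shows the tolerance for spelling variations; the second shows the collision that would need to be handled if two people ever shared a last name and first initial.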

At present the homepage file contains anyone who has, or has ever had, an association with Carnegie Mellon; other coauthors are not linked. This is mostly laziness on my part, since I did not want to add every coauthor, but it also helps to signal to readers which paper authors are CMU-associated.

The header and footer files are placed at the start and end, respectively, of every HTML file generated, and provide the opportunity to add logos, links to other pages, etc. They should be fairly self-explanatory.

--extension

The current graphics lab site uses PHP for all its pages (to enable rotating sidebar images). I recognize, though, that other sites may want to use a .html or .htm extension for the generated pages, and have provided that option.

--bibtex-database

All publications include a link to view the BiBTeX for convenience in citation. To make this possible, we strip out all the "extended" parts of the bibtex entry and place it in a GDBM database that is indexed by the key (think of GDBM as a file-based hashtable). This makes it possible to write a simple script that pulls up bibtex entries very quickly, like this one:

#!/usr/local/bin/perl -Tw

use strict;
use GDBM_File;
use CGI qw(:standard);

my $query = new CGI;

my $filename = "bibtexEntries.gdbm";
my %entries = ();
tie %entries, 'GDBM_File', $filename, &GDBM_READER, 0644
  or die "cannot open file '$filename'.";

my $key = param('key');
print $query->header('text/plain');
if( defined($key) && defined($entries{$key}) )
{
  print "$entries{$key}\n";
}
else
{
  print "Error: unknown key.\n";
}

untie %entries;

--venues-list

In order to generate the shortened publication names (e.g., "ACM SIGGRAPH") that appear on the publications index page, we need some way to map from full publication names ("2006 ACM SIGGRAPH / Eurographics Symposium on Computer Animation") to shortened names ("Symposium on Computer Animation"). I have chosen to do this with regular expressions, which theoretically gives a lot of flexibility. The regular expression format is the Perl regex format, described here. Each line of the file has the form "regular expression" Shortened Name; here are some samples:

"SIGGRAPH.*Conference on Sketches.*Applications" ACM SIGGRAPH Sketch
"ACM Transactions on Graphics" ACM Transactions on Graphics
"\d{4} ACM SIGGRAPH\s*/\s*Eurographics Symposium on Computer Animation" Symposium on Computer Animation

If you don't want to worry about having these publication names appearing, simply provide an empty file for the venues list and they will be automatically dropped.
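For readers who want to test their patterns before running the script, the following Python sketch shows one way such a venues list could be applied. It is an approximation only: the real script uses Boost's Perl-style regexes, and the first-match-wins and search-anywhere behaviour here are assumptions, as are the function names.

```python
import re


def load_venues(text):
    """Parse lines of the form "regular expression" Shortened Name."""
    rules = []
    for line in text.splitlines():
        match = re.match(r'\s*"(.*)"\s+(.+?)\s*$', line)
        if match:
            rules.append((re.compile(match.group(1)), match.group(2)))
    return rules


def short_venue(rules, full_name):
    """Return the shortened name for a venue, or None if no pattern matches."""
    for pattern, short in rules:
        if pattern.search(full_name):  # assumption: first matching rule wins
            return short
    return None


venues = load_venues(r'''
"SIGGRAPH.*Conference on Sketches.*Applications" ACM SIGGRAPH Sketch
"ACM Transactions on Graphics" ACM Transactions on Graphics
"\d{4} ACM SIGGRAPH\s*/\s*Eurographics Symposium on Computer Animation" Symposium on Computer Animation
''')

full = "2006 ACM SIGGRAPH / Eurographics Symposium on Computer Animation"
print(short_venue(venues, full))
```

With an empty venues list, load_venues returns no rules and short_venue returns None for every publication, which mirrors the "automatically dropped" behaviour described above.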

--images-root-filesystem and --images-root-server

The script generates copies of images in various sizes for things like publication page icons, abstracts page images, and sidebar images. We need to specify where these images live, both in terms of the www server document root (e.g., /images) and in terms of the filesystem (e.g., /usr0/www-graphics/images). The scripts create a number of directories in this path, including image-root/sidebar, image-root/pubIcons, and image-root/abstracts.

--server-name

To make the generated HTML shorter and cleaner, it makes sense to provide relative URLs like /projects/myproj rather than http://www.example.com/projects/myproj wherever possible. We do this by stripping the server name out of whatever URLs we see; this requires that we know the name of the server. To set this, specify server-name = http://myserver.cs.cmu.edu in the config file.
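The stripping step is simple enough to show in a few lines of Python. This is a sketch of the idea rather than the script's actual code; the function name strip_server_name is hypothetical.

```python
def strip_server_name(url, server_name):
    """Turn an absolute URL on our own server into a server-relative one."""
    prefix = server_name.rstrip("/")
    if url == prefix:
        return "/"
    # Require a '/' after the server name so that a different host which
    # merely starts with the same characters is not stripped by mistake.
    if url.startswith(prefix + "/"):
        return url[len(prefix):]
    return url  # URLs on other servers are left untouched


server = "http://www.example.com"
print(strip_server_name("http://www.example.com/projects/myproj", server))
print(strip_server_name("http://other.example.org/paper.pdf", server))
```

The first call prints /projects/myproj; the second URL lives on a different server and comes back unchanged.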

coursesPages

The courses pages look very similar to the publications pages, and the script shares a significant amount of code. However, BiBTeX doesn't make any sense for class descriptions, so I have instead substituted a simple (and mostly self-explanatory) XML format.

Here is a sample of a "courses" file; the current complete courses.xml can be found here.

<?xml version="1.0" encoding="UTF-8"?>
<courses>
<course number="15-462"
    name="Computer Graphics I"
    description="Introduction to graphics, undergraduate level."
    image="http://www.example.com/graphicsI/graphicsI_image.jpg">
  <offering semester="fall" year="2004"
      href="http://www.example.com/graphicsI_fall04/"
      instructor="Jane Public" />
  <offering semester="spring" year="2004"
      href="http://www.example.com/graphicsI_spring04/index.html"
      instructor="John Smith" />
</course>
</courses>

Basically, the file is divided into "courses" and "offerings", where the course tag describes the basics of the course (number, description, image) and specific offerings of each course are then listed. Note that the file must be in valid XML format or Xerces will refuse to parse it; this means that tags are case-sensitive and any "empty" element that doesn't have a closing tag must use the special syntax <tag />.

Note that it doesn't matter in what order courses and offerings are specified: after they are read in, they are automagically sorted by (1) most recent offering and (2) course number (lowest numbers first).
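The sort order can be illustrated with a small Python sketch that reads a courses file and orders the courses. It is not the script's own code (which uses Xerces in C++). The ordering of semesters within a year, the plain string comparison of course numbers, and the two extra courses in the sample are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed ordering of semesters within a calendar year.
SEMESTER_ORDER = {"spring": 0, "summer": 1, "fall": 2}


def offering_key(offering):
    """Sortable (year, term) pair for one <offering> element."""
    return (int(offering.get("year")), SEMESTER_ORDER[offering.get("semester")])


def sorted_courses(xml_text):
    """Sort courses by most recent offering first, then by course number."""
    courses = ET.fromstring(xml_text).findall("course")

    def course_key(course):
        offerings = course.findall("offering")
        year, term = max((offering_key(o) for o in offerings), default=(0, -1))
        # Negate so that newer offerings sort first; ties fall back to the
        # course number, lowest first.
        return (-year, -term, course.get("number"))

    return sorted(courses, key=course_key)


sample = """<courses>
<course number="15-462" name="Computer Graphics I">
  <offering semester="fall" year="2004" instructor="Jane Public" />
  <offering semester="spring" year="2004" instructor="John Smith" />
</course>
<course number="15-463" name="Hypothetical Course B">
  <offering semester="spring" year="2005" instructor="Jane Public" />
</course>
<course number="15-100" name="Hypothetical Course A">
  <offering semester="fall" year="2004" instructor="John Smith" />
</course>
</courses>"""

for course in sorted_courses(sample):
    print(course.get("number"), course.get("name"))
```

Here 15-463 comes first because it was offered most recently (spring 2005); the other two were both last offered in fall 2004, so they fall back to course number order.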

One major difference between the courses script and the publications script is that the courses page script makes no effort to resize images to fit; this is simply because course images change very slowly and it is therefore not an undue burden on the maintainer to make sure that images are appropriately sized.

Most of the options are basically the same for the coursesPages script as for the pubsPages script. We will briefly go over the differences.

--courses-file

A list of the courses in the XML format described above.

--other-courses-file

If you look at the current courses page, you will notice that a list of additional courses appears at the bottom. Since this list appears on the index page but not on the descriptions page, it needs to be kept separate from the standard footer; that is the purpose of the file specified in --other-courses-file.

Building the scripts

Library dependencies

The scripts depend on the following libraries, all of which (except those installed natively) are currently located in /usr0/www-scripts.

ImageMagick: Used for resizing images for icons
btparse: Used for parsing out BiBTeX entries
Xerces-C: Parsing of XML files, used in the courses pages script
libcurl: Used for fetching URLs
gdbm: Stripped-down BiBTeX entries are stored in a GDBM database for fast access from a CGI script
Boost: Lots of useful platform-independent utilities for filesystem manipulation and regex parsing.

Things to watch out for

Style sheets

As much as possible, I have tried to separate the appearance of the pages from the actual HTML code that is generated. If you look at the generated code, it looks something like this:

<div class="abstract" id="Biv:2008:rays">
<h2>Ray Tracing Really Fast</h2>
<p>Roy G. Biv</p>
<h3>Abstract</h3>
<p>Amazing stuff is done here.</p>
<h3>Citation</h3>
etc.

Things like class identifiers have been added to make it easier to control appearance using style sheets. This makes the code cleaner, faster to load, and more viewable on older browsers. If you want to change the appearance of the generated pages, I suggest trying to accomplish it by changing the style sheets rather than the actual generated HTML. glab.css in the root is the primary style sheet for pages in the graphics.cs.cmu.edu domain. For more information on the use of style sheets, I suggest Jeffrey Zeldman's book.

International characters

International characters like ć and č are handled differently in HTML and in BiBTeX. The latter specifies them with an escape sequence, as in "\'c", while the former requires that we specify special characters in an entity format such as "&#263;". Translating from the BiBTeX form to the HTML form is just a lookup table, but I couldn't find one online, so to avoid having to hand-generate it I added special cases for the ones that we see in our lab. It may well be necessary to add more in the future. The function to look at is HtmlAccented( char in, char accentType ) in latexParser.cxx. It should be fairly self-explanatory; just follow the same pattern. If someone wants to write a magic translation function that handles all accent types and characters, this would certainly be welcome.
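As a starting point for such a general translation function, here is a hedged sketch in Python (not a drop-in replacement for the C++ HtmlAccented). Instead of a hand-built table, it pairs each LaTeX accent command with a Unicode combining mark and lets Unicode normalization find the precomposed character. The function name html_accented and the set of accents covered are assumptions; accents without a precomposed character are left untouched.

```python
import re
import unicodedata

# LaTeX accent command -> Unicode combining mark.
COMBINING = {
    "'": "\u0301",  # acute, as in \'c
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis (umlaut)
    "~": "\u0303",  # tilde
    "v": "\u030c",  # caron (hacek), as in \v{c}
    "c": "\u0327",  # cedilla
}

# Symbol accents may be followed directly by a letter (\'c) or by a braced
# letter (\'{c}); letter-named accents must be braced (\v{c}) so that
# commands such as \vspace are not mistaken for accents.
ACCENT_RE = re.compile(
    r"""\\(?:(?P<sym>['"`^~])(?:\{(?P<braced>[A-Za-z])\}|(?P<bare>[A-Za-z]))"""
    r"""|(?P<cmd>[vc])\{(?P<arg>[A-Za-z])\})"""
)


def html_accented(text):
    """Replace LaTeX accent sequences with HTML numeric character references."""
    def replace(match):
        accent = match.group("sym") or match.group("cmd")
        letter = match.group("braced") or match.group("bare") or match.group("arg")
        composed = unicodedata.normalize("NFC", letter + COMBINING[accent])
        if len(composed) != 1:
            return match.group(0)  # no precomposed character; leave as-is
        return f"&#{ord(composed)};"
    return ACCENT_RE.sub(replace, text)


print(html_accented(r"Kova\v{c}i\'c"))  # Kova&#269;i&#263;
```

The same approach should carry over to C++ with a Unicode library, although I have not tried it; for a handful of characters the existing special cases may still be the simpler option.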