Greenstone Digital Library Software (GSDL) : A Tutorial: Paper: S
Jaba Das
Documentation and Research Training Centre
Indian Statistical Institute
Bangalore-560 059
email: [email protected]
Abstract
This tutorial describes how to build your own digital library using
the Greenstone Digital Library (GSDL) Software -- a
comprehensive, open-source system for constructing, presenting, and
maintaining digital collections. Collections can be built and rebuilt
automatically in GSDL. The collections are easily maintainable and
include effective full-text searching and metadata-based browsing
facilities that are attractive and easy to use. Browsing utilizes
hierarchical structures that are created automatically from metadata
associated with the source documents. Collections can include text,
pictures, audio, and video, and can be built with an easy-to-use tool
called the Collector. Documents in the collection can be in any language,
and the GSDL interface itself is available in many languages, including
Chinese and Arabic. The system is extensible and customizable: software
"plugins" can accommodate different document and metadata types
according to user requirements.
1. INTRODUCTION
The Greenstone software runs under Unix, Windows and Mac OS X, and is distributed under
the GNU General Public License. General users can download the software and set up a digital
library system. Those with programming skills can extend and customize the system according to
their own requirements.
cd "C:\Program Files\gsdl" (You need the quotation marks because of the space in
Program Files.)
The prompt will then look like
C:\Progra~1\GSDL>
Next, at the prompt type
C:\Progra~1\GSDL>setup.bat
This batch file sets up the environment for running the Greenstone programs. It must be run in
every new MS-DOS session before you use the Greenstone commands. Once the GSDL environment is
set, you are in a position to make, build and rebuild collections. The first program is the Perl
program mkcol.pl, whose name stands for “make a collection”. Run the program by typing
C:\Progra~1\GSDL>perl -S mkcol.pl
This lists the arguments that can be used with the command. The only required argument is
‘-creator’, which specifies who created the collection.
Let us now use the following command to create the initial files and subdirectories necessary for
our digital library. To give the collection the name test, type
C:\Progra~1\GSDL>perl -S mkcol.pl -creator [email protected] test
Here, [email protected] is the email address of the creator of the collection; you can use your
own address.
To view the newly created files, move to the newly created collection directory by typing
C:\Progra~1\GSDL>cd collect\test
You can list the contents of this directory by typing ‘dir’ at the prompt.
There should be seven subdirectories: archives, building, etc, images, import, index and perllib.
Now we must populate the collection with sample documents. Supposing the source material for
the test collection is in d:\xxx\jaba, then give the following command at the prompt:
C:\Progra~1\GSDL\collect\test>xcopy /s d:\xxx\jaba\*.* import
Alternatively, in Windows you can select the contents of the jaba directory and drag them into
the test collection’s import directory.
In the collection’s etc directory, there is a file called collect.cfg. It is the configuration file of the
‘test’ collection. If you open the file, you can see that it contains the following information:
creator [email protected]
maintainer [email protected]
public true
beta true
indexes document:text
defaultindex document:text
plugin ZIPPlug
plugin GMLPlug
plugin TEXTPlug
plugin HTMLPlug
plugin EMAILPlug
plugin ArcPlug
plugin RecPlug
This process takes about five minutes on a 1 GHz computer, and correspondingly longer on
slower machines. Note that you do not have to be in either the collect or test directories when this
command is entered; because GSDLHOME is already set by the initial ‘setup.bat’ command, the
Greenstone software can work out where the necessary files are.
Now let’s make some changes to the collection configuration file to customize its appearance.
First, give the collection a name. This will be treated by web browsers as the page title for the
front page of the collection, and used as the collection icon in the absence of a picture. Change
the line that reads collectionmeta collectionname "test" to something like collectionmeta
collectionname "First Project". Add a description of your collection between the quotes of the
line that reads collectionmeta collectionextra "". This is used as the about this collection text on
the collection’s home page. You can add, “This collection is an experimental one for my project
work.”
You can use any picture you can view in a web browser for a collection icon. Put the location of
the image between the quotes of the line collectionmeta iconcollection "" in the configuration file.
For example you could enter: _httpprefix_/collect/test/images/icon.gif if you have put a suitable
image in the collection’s images directory (collect\test\images in our example). Save the
collection configuration file, and close it. Fig. 2 shows collection icons.
Fig. 2
The next step is to “build” the collection, which creates all the indexes and files that make the
collection work. Type
C:\Progra~1\GSDL\collect\test>perl -S buildcol.pl
at the command prompt for a list of collection-building options. Then type
C:\Progra~1\GSDL\collect\test>perl -S buildcol.pl test
The newly built files are placed in the collection’s building directory, and they must replace the
old contents of the index directory. (If the current working directory is not test, type
cd "%GSDLHOME%\collect\test"
before going through the rd, ren and mkdir sequence below.) In Windows you can select the contents
of the test collection’s building directory and drag them into the index directory. Alternatively,
you can remove the index directory (and all its contents) by typing the command
rd /s index # on Windows NT/2000
deltree /Y index # on Windows 95/98
and then change the name of the building directory to index with
ren building index
Finally, type
mkdir building
You should be able to access the newly built collection from your Greenstone homepage. You
will have to reload the page if you already had it open in your browser, or perhaps even close the
browser and restart it (to prevent caching problems). Alternatively, if you are using the “local
library” version of Greenstone you will have to restart the library program. To view the new
collection, click on the image. The result should look something like Fig. 3.
Fig. 3
In summary then, the commands typed to produce the test collection are:
cd C:\Progra~1\gsdl # assuming default location
setup.bat
perl -S mkcol.pl -creator [email protected] test
cd "%GSDLHOME%\collect\test"
xcopy /s d:\xxx\jaba\* import # assuming D drive
perl -S import.pl test
perl -S buildcol.pl test
rd /s index # on Windows NT/2000
deltree /Y index # on Windows 95/98
ren building index
mkdir building
You can build a collection from a variety of documents such as Word files, PDF files, email, image
files, video files, MP3 files and HTML files. Email files must have the ‘.email’ extension. For
HTML files, you need to keep all the interlinked files in one folder. Image, video and MP3 files
need not be copied into the ‘import’ directory; instead, first put all of these files in the
‘images’ directory of your collection. While Word, email and PDF files need little extra work, for
image files, video files, MP3 files and HTML files with hyperlinks you have to change the plugin
options in the collection configuration file, i.e. the ‘collect.cfg’ file. You also have to create a
‘metadata.xml’ file in the ‘import’ directory to assign metadata to the images, video clips or
audio clips. Assigning metadata to the documents of a collection and formatting the configuration
file will be discussed in detail in later sections of this tutorial.
If, later on in your command-line session with Greenstone, you wish to return to the top level
Greenstone directory you can accomplish this by typing
cd $GSDLHOME
With the appropriate setup file sourced, we are now in a position to make, build and rebuild
collections. The first program we will look at is the Perl program mkcol.pl, whose name stands
for “make a collection.” First run the program by typing mkcol.pl to display a list of its
arguments on the screen. The only required argument is creator, which is used to specify who built
the collection.
Let us now use the command to create the initial files and directories necessary for our digital
library. To assign the collection the name test, type
mkcol.pl -creator [email protected] test
To view the newly created files, move to the newly created collection directory by typing
cd $GSDLHOME/collect/test
There should be seven subdirectories: archives, building, etc, images, import, index and perllib.
If the source material is on your hard disk, for instance under /home/jaba/text/, copy the
contents of the /home/jaba/text directory into the $GSDLHOME/collect/test/import directory.
Type the command
cp -r /home/jaba/text/* $GSDLHOME/collect/test/import
Here, -r stands for recursive copy.
In the collection’s etc directory, there will be a file called collect.cfg. Open the file using a text
editor such as vi or vim, popular editors on Linux.
cd $GSDLHOME/collect/test/etc
vi collect.cfg (or) vim collect.cfg
creator [email protected]
maintainer [email protected]
public true
plugin ZIPPlug
plugin GAPlug
plugin TEXTPlug
plugin HTMLPlug
plugin EMAILPlug
plugin PDFPlug
plugin RTFPlug
plugin WordPlug
plugin PSPlug
plugin ArcPlug
plugin RecPlug
The following table presents the differences in building a collection in Windows and Unix.
Windows                                          Unix
Run ‘setup.bat’ to make Greenstone               Run ‘source setup.bash’ or ‘source setup.csh’
programs available.                              to make programs available.

Old collection index replaced by typing          Old collection index replaced by typing
‘rd /s index’ then ‘ren building index’          ‘rm -r index/*’ then ‘mv building/* index’.
followed by ‘mkdir building’, or by using
a visual file manager.
Table 1 Collection-Building Differences between Windows and Unix
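The platform differences in Table 1 can be sketched as a small dry-run helper. It only assembles the command strings used in this tutorial and executes nothing. The function name collection_commands, its parameters and the example address are illustrative assumptions and not part of Greenstone; the Unix import.pl and buildcol.pl calls are assumed to follow the same pattern as mkcol.pl.

```python
def collection_commands(platform, collection, creator, source_dir):
    """Return the tutorial's command sequence as strings (dry run only)."""
    if platform == "windows":
        return [
            "setup.bat",
            f"perl -S mkcol.pl -creator {creator} {collection}",
            f'cd "%GSDLHOME%\\collect\\{collection}"',
            f"xcopy /s {source_dir}\\* import",
            f"perl -S import.pl {collection}",
            f"perl -S buildcol.pl {collection}",
            # Replace the old index with the freshly built one
            "rd /s index",
            "ren building index",
            "mkdir building",
        ]
    if platform == "unix":
        return [
            "source setup.bash",
            f"mkcol.pl -creator {creator} {collection}",
            f"cd $GSDLHOME/collect/{collection}",
            f"cp -r {source_dir}/* import",
            # Assumption: invoked in the same way as mkcol.pl
            f"import.pl {collection}",
            f"buildcol.pl {collection}",
            # Replace the old index with the freshly built one
            "rm -r index/*",
            "mv building/* index",
        ]
    raise ValueError(f"unknown platform: {platform!r}")


if __name__ == "__main__":
    for line in collection_commands("unix", "test", "creator@example.org", "/home/jaba/text"):
        print(line)
```

Printing the list before typing the commands makes it easy to check the collection name and source path once, rather than in every command.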
For example,
<Metadata name="Title">Theory of Library Classification</Metadata>
The Dublin Core metadata standard (1) is used for defining metadata types.
XML files are created manually under the import directory to assign metadata to documents.
If the ‘use_metadata_files’ option is specified, RecPlug reads an auxiliary metadata file called
metadata.xml. For an example, see Fig. 5.
<DirectoryMetadata>
<FileSet>
<FileName>AA.PDF</FileName>
<Description>
<Metadata name="Title">Informetrics: Scope, Definition, Methodology and Conceptual Questions</Metadata>
<Metadata name="Language" mode="accumulate">English</Metadata>
<Metadata name="Subject" mode="accumulate">Informetrics</Metadata>
<Metadata name="Creator" mode="accumulate">I K Ravichandra Rao</Metadata>
<Metadata name="Date" mode="accumulate">1998</Metadata>
<Metadata name="AZList" mode="accumulate">T.1</Metadata>
</Description>
</FileSet>
</DirectoryMetadata>
Fig. 5
‘Metadata name’ is the specific tag or field. Sometimes metadata is multi-valued, and new values
should accumulate rather than override previous ones. In that case mode="accumulate" should be
used, and it must be specified for every occurrence. The metadata.xml mechanism that is embodied
in RecPlug is just one way of specifying metadata for documents. It is easy to write different
plugins that accept metadata specifications in completely different formats.
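Writing metadata.xml by hand becomes tedious for larger collections, so the structure in Fig. 5 can also be generated. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name build_directory_metadata and the shape of its input are assumptions made for illustration; following Fig. 5, Title is left in the default mode and every other field is marked mode="accumulate".

```python
import xml.etree.ElementTree as ET


def build_directory_metadata(entries):
    """Build a DirectoryMetadata tree.

    entries: list of (file_name, [(metadata_name, value), ...]) pairs.
    """
    root = ET.Element("DirectoryMetadata")
    for file_name, fields in entries:
        file_set = ET.SubElement(root, "FileSet")
        ET.SubElement(file_set, "FileName").text = file_name
        description = ET.SubElement(file_set, "Description")
        for name, value in fields:
            meta = ET.SubElement(description, "Metadata", name=name)
            # As in Fig. 5: Title overrides, all other fields accumulate
            if name != "Title":
                meta.set("mode", "accumulate")
            meta.text = value
    return root


if __name__ == "__main__":
    tree = build_directory_metadata([
        ("AA.PDF", [
            ("Title", "Informetrics: Scope, Definition, Methodology and Conceptual Questions"),
            ("Subject", "Informetrics"),
            ("Creator", "I K Ravichandra Rao"),
            ("Date", "1998"),
        ]),
    ])
    print(ET.tostring(tree, encoding="unicode"))
```

The printed text can be saved as metadata.xml in the import directory; ElementTree also escapes characters such as & in titles.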
Fig. 6
Each line of the collection configuration file is essentially an “attribute, value” pair. Each
attribute gives a piece of information about the collection that affects how it is supposed to look
or how documents are to be processed.
The collection configuration file created by the mkcol.pl script, shown in Table 3 below, is a very
simple one and contains a bare minimum of information. Lines 1 and 2 stem from the creator
value supplied to the mkcol.pl program, and contain the E-mail addresses of the person who
created the collection and the person responsible for maintaining it (not necessarily the same
person).
Line 3 indicates whether the collection will be available to the public when it is built, and is either
true (the default, meaning that the collection is publicly available), or false (meaning that it is
not). This is useful when building collections to test software, or building collections of material
for personal use.
Line 4 indicates whether the collection is beta or not (this also defaults to true, meaning that the
collection is a beta release). Line 5 determines what collection indexes are created at build time:
in this example only the document text is to be indexed. Indexes can be constructed at the
document, section, and paragraph levels. They can contain the material in the text, or in any
metadata—most commonly Title. The form used to specify an index is level:data. For example,
to include an index of section titles as well, you should change line 5 to indexes document:text
section:Title. More than one type of data can be included in the same index by separating the data
types with commas. For example, to create a section-level index of titles, text and dates, the line
should read indexes section:text,Title,Date.
The default index defined in line 6 is the default to be used on the collection’s search page.
Lines 7–13 specify which plugins to use when converting documents to Greenstone Archive
format and when building collections from archive files. The Greenstone Archive format is an
XML-style format that divides documents into sections and can hold metadata at the document
or section level. The Greenstone archive files need not be created manually; they are created
automatically by the document processing plugins that are described in later sections.
    Attribute        Value
1   creator          [email protected]
2   maintainer       [email protected]
3   public           true
4   beta             true
5   indexes          document:text
6   defaultindex     document:text
7   plugin           ZIPPlug
8   plugin           GAPlug
9   plugin           TEXTPlug
10  plugin           HTMLPlug
11  plugin           EMAILPlug
12  plugin           ArcPlug
13  plugin           RecPlug
14  classify         AZList -metadata Title
15  collectionmeta   collectionname "sample collection"
16  collectionmeta   iconcollection ""
17  collectionmeta   collectionextra ""
18  collectionmeta   .document:text "document"
Table 3 Collection configuration file created by mkcol.pl
Line 14 specifies that an alphabetic list of titles is to be created for browsing purposes. Browsing
structures are constructed by “classifiers”. Classifiers are discussed in detail in later sections.
Lines 15–18 are used to specify collection-level metadata. The collectionname entry gives the
long form of the collection’s name, which is used as the collection’s “title” for the web browser.
The iconcollection entry gives the URL of the collection’s icon. If an index is specified (as in
line 18), the string following is displayed as the name of that index on the collection’s search
page. A particularly important piece of collection-level metadata is collectionextra, which gives
a stretch of text, surrounded by double quotes, describing the collection. This will be shown as
the “About this collection” text. This simple collection configuration file does not include any
examples of format strings, nor of the subcollection and language facilities provided by the
configuration file.
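Since each line of the configuration file is an “attribute, value” pair, the file is easy to inspect programmatically. The sketch below groups the lines by attribute. It is not Greenstone's own parser: the function name parse_collect_cfg is hypothetical, and treating lines that start with # as comments is an assumption. The standard shlex module is used so that a double-quoted string stays together as one value.

```python
import shlex
from collections import defaultdict


def parse_collect_cfg(text):
    """Group the lines of a collect.cfg-style text by attribute name."""
    config = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' carry no settings
        if not line or line.startswith("#"):
            continue
        attribute, *values = shlex.split(line)
        config[attribute].append(values)
    return dict(config)


if __name__ == "__main__":
    sample = '''
    public true
    indexes document:text section:Title
    plugin ZIPPlug
    plugin GAPlug
    collectionmeta collectionname "sample collection"
    collectionmeta collectionextra ""
    '''
    for attribute, values in parse_collect_cfg(sample).items():
        print(attribute, values)
```

Repeated attributes such as plugin keep their order in the list, which matters because earlier plugins take priority over later ones.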
If a collection contains documents in different languages, separate indexes can be built for each
language. Language is specified as metadata; values use the ISO 639 standard two-letter codes for
representing the names of languages: for example, en is English, zh is Chinese, and mi is Maori.
Since metadata values can be specified at the section level, parts of a document can be in
different languages. For example, if the configuration file contained
indexes section:text section:Title document:text paragraph:text
languages en zh mi
then section text, section title, document text, and paragraph text indexes would be created in
English, Chinese, and Maori, so twelve indexes altogether are created. Adding a couple of
subcollections multiplies the number of indexes again. Hence, one has to be careful and guard
against an index glut. (This index specification could be defined using the subcollection
facility rather than the languages facility. However, since the syntax precludes creating
subcollections of subcollections, it would then be impossible to index each language in the
subcollections separately.)
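The multiplication of indexes described above can be checked with a few lines of code. This is only a counting sketch: the function name expand_indexes is hypothetical, and it simply pairs every level:data specification on the indexes line with every code on the languages line.

```python
def expand_indexes(indexes_line, languages_line):
    """List every (level, data fields, language) index implied by the two lines."""
    specs = indexes_line.split()[1:]        # drop the 'indexes' keyword
    languages = languages_line.split()[1:]  # drop the 'languages' keyword
    expanded = []
    for spec in specs:
        level, data = spec.split(":", 1)
        fields = tuple(data.split(","))     # several data types may share one index
        for language in languages:
            expanded.append((level, fields, language))
    return expanded


if __name__ == "__main__":
    result = expand_indexes(
        "indexes section:text section:Title document:text paragraph:text",
        "languages en zh mi",
    )
    print(len(result))  # four index specifications times three languages
```

Running the same function on a planned configuration before building gives an early warning of an index glut.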
Following the keyword format is a two-part keyword, only one part of which is mandatory. The
first part identifies the list to which the format applies. The list generated by a search is called
Search, while the lists generated by classifiers are called CL1, CL2, CL3,… for the first, second,
third,… classifier specified in collect.cfg. The second part of the keyword is the part of the list to
which the formatting is to apply—either HList (for horizontal list, like the A-Z selector in an
AZList), VList (for vertical list, like the list of titles under an AZList), or DateList.
For example:
format CL4VList ... applies to all VLists in CL4
format CL2HList ... applies to all HLists in CL2
format CL1DateList ... applies to all DateLists in CL1
format SearchVList ... applies to the Search Results list
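The two-part keyword can be split mechanically, as the following sketch shows. It covers only the forms named in the text above (Search, CL followed by a number, and HList, VList or DateList); the function name parse_format_keyword is hypothetical, and other keywords that Greenstone may accept are rejected here.

```python
import re

# Optional list name (Search or CL<n>) followed by an optional list part
_KEYWORD = re.compile(r"^(Search|CL(\d+))?(HList|VList|DateList)?$")


def parse_format_keyword(keyword):
    """Split a format keyword such as 'CL4VList' into its two parts."""
    match = _KEYWORD.match(keyword)
    if not keyword or match is None:
        raise ValueError(f"unrecognised format keyword: {keyword!r}")
    list_name, classifier_number, part = match.groups()
    return {
        "list": list_name,
        "classifier": int(classifier_number) if classifier_number else None,
        "part": part,
    }


if __name__ == "__main__":
    for keyword in ("CL4VList", "CL2HList", "CL1DateList", "SearchVList", "VList"):
        print(keyword, parse_format_keyword(keyword))
```

A keyword with only the second part, such as VList, leaves the list name empty, reflecting that only one of the two parts is mandatory.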
plugin GAPlug
plugin TEXTPlug
plugin HTMLPlug
plugin PSPlug
plugin ArcPlug
plugin RecPlug -use_metadata_files
Note the last two paragraphs of the file. The format VList arranges the collection icons vertically
on the screen, as shown below. href and src contain the path of the image source file. srcicon
displays the collection icon image on the screen, and the title can also be displayed, as seen in
Fig. 6. Both the collection icon and the title are links. In this example, the icon links to the
actual image, and the title links to the actual image along with the title and a short description
of the image (as seen in Fig. 8).
Fig. 8
The same procedure is followed for audio and video clips, except that the audio/video source file
is given in href and src instead of the image file.
For an HTML document collection, the only modification required in collect.cfg is to add the
assoc_files option to the plugin HTMLPlug line so that all the associated files, such as .jpg and
.gif files, are included.
4.1. Plugins
Plugins are used to convert each source document depending on its format. A collection’s
configuration file lists all plugins that are used when building it. During the import operation,
each file or directory is passed to each plugin in turn until one is found that can process it—thus
earlier plugins take priority over later ones. If no plugin can process the file, a warning is printed
to standard error and processing passes to the next file. During building, the same procedure is
used, but the archives directory is processed instead of the import directory.
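The “first plugin that can process the file wins” rule can be illustrated with a short sketch. The ExtensionPlugin class is a hypothetical stand-in that claims files purely by extension, and the extensions given to each plugin name below are assumptions made for the example; real Greenstone plugins decide for themselves what they can process.

```python
import sys


class ExtensionPlugin:
    """Hypothetical stand-in for a plugin that claims files by extension."""

    def __init__(self, name, extensions):
        self.name = name
        self.extensions = tuple(extensions)

    def can_process(self, file_name):
        return file_name.lower().endswith(self.extensions)


def assign_plugins(file_names, plugins):
    """Offer each file to the plugins in order; the first that accepts it wins."""
    assignments = {}
    for file_name in file_names:
        for plugin in plugins:
            if plugin.can_process(file_name):
                assignments[file_name] = plugin.name
                break
        else:
            # No plugin accepted the file: warn and carry on with the next file
            print(f"warning: no plugin can process {file_name}", file=sys.stderr)
    return assignments


if __name__ == "__main__":
    pipeline = [
        ExtensionPlugin("TEXTPlug", [".txt", ".text"]),
        ExtensionPlugin("HTMLPlug", [".htm", ".html"]),
        ExtensionPlugin("EMAILPlug", [".email"]),
    ]
    print(assign_plugins(["notes.txt", "index.html", "clip.mp3"], pipeline))
```

Reordering the pipeline list changes which plugin claims a file that two plugins could handle, which is why the order of plugin lines in collect.cfg matters.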
GSDL has group collection plugins, collection-specific plugins and general plugins. This tutorial
describes a few of the general plugins below:
• nolinks: Does not trap links within the collection. This speeds up the
import/build process, but any links in the collection will be broken.
• description_tags: Interprets tagged document files.
• metadata_fields: Takes a comma-separated list of metadata types (defaults to
title) to extract. To rename the metadata in the Greenstone archive file, use
tag<newname> where tag is the HTML tag sought and newname its new name.
• hunt_creator_metadata: Finds as much metadata as possible about
authorship and puts it in the Greenstone archive document as Creator metadata.
Creator must also be included using the metadata_fields option.
• file_is_url: Use this option if a web mirroring program has been used to create
the structure of the documents to be imported.
• assoc_files: Gives a Perl regular expression that describes file types to be
treated as associated files. The default types are .jpg, .jpeg, .gif, .png, .css.
• rename_assoc_files: Renames files associated with documents. During this
process the directory structure of any associated files will become much
shallower (useful if a collection must be stored in limited space).
EMAILPlug (*.email)
EMAILPlug imports files containing E-mail, and deals with common E-mail formats such as those
used by the Netscape, Eudora, and Unix mail readers. Each source document is examined to see if
it contains an E-mail, or several E-mails joined together in one file, and if so its contents are
processed. The files can be in any directory, but their names must have the .email extension. The
plugin extracts Subject, To, From, and Date metadata. However, this plugin does not yet handle
MIME-encoded E-mails properly: although legible, they often look rather strange.
ArcPlug
ArcPlug processes the files named in archives.inf, which is used to communicate between the
import and build processes.
RecPlug
RecPlug recurses through a directory structure: it checks whether a file name is a directory and,
if so, passes the files it contains into the plugin pipeline. RecPlug also reads the metadata.xml
file. So, if you have a metadata.xml file, you have to give this plugin the -use_metadata_files
option, as presented in Fig. 7.
GAPlug (.xml)
GAPlug processes Greenstone archive files generated by import.pl. It is included by default and
handles the .xml files in the archives directory.
Fig. 9
For this you first have to create separate text files in which you enter author names, subject
headings and dates of publication. For example, here the file sub.txt (Fig. 10) contains the subject
headings for the test collection, auth.txt contains the author names and date.txt contains the dates
of publication. The layout of the data in these files is shown in Fig. 10. If an author has written
more than one document, there is no need to enter the author name more than once. The same
applies to subject headings that occur more than once.
Fig. 10
As seen in Table 4 below, lines 17 and 18, i.e. the classifiers for title and filenames, are there
by default. Now, let us see what modifications have to be made in the collect.cfg file to create
author, subject and date indexes. A classifier line starts with the keyword classify, followed by
the name of the classifier and any options. In the collect.cfg file you can include the following
lines below lines 17 and 18.
In this line ‘Hierarchy’ is the classifier being used, which displays the subject headings in a
hierarchical manner, i.e. from broader to narrower subjects (see Fig. 11). The ‘-hfile’ option
gives the name of the file where the metadata hierarchy is defined, here ‘sub.txt’. The
‘-metadata’ argument gives the assigned metadata name, i.e. Subject.
classify AZList -metadata Creator
In this line AZList is the classifier, which displays the author names in alphabetical order. The
metadata name here is Creator.
Fig. 11
has to be added in the collect.cfg file as shown in the (Table 4), line 31. For title and filenames
the lines
Fig. 12
As seen in Fig. 12 the dropdown box displays the author, title and filenames indexes.
Fig. 13
So, source files have to be edited to give them an HTML structure of sections and subsections. The
HTML plugin has a description_tags option that processes tags in the text like this:
<!--
<Section>
<Description>
<Metadata name="Title"> </Metadata>
</Description>
-->
<!--
<Section>
<Description>
<Metadata name="Title"> </Metadata>
</Description>
-->
<!--
<Section>
<Description>
<Metadata name="Title"> </Metadata>
</Description>
-->
<!--
</Section>
-->
The <!-- and --> markers are used because they indicate comments in HTML. Thus these section
tags will not affect document formatting. In the description part other kinds of metadata can be
specified, and you can include any subsections. Fig. 14 is a screenshot of a modified HTML file.
Fig 14
Then, as in the case of author, add the following line to the collect.cfg file
collectionmeta .section:Title "section titles"
6. FINDING INFORMATION
Greenstone Digital Library provides an easy browsing facility. For searching, users enter some
keywords and click on the search button.
Greenstone digital library systems usually comprise several separate collections, for example,
computer science technical reports, literary works, Internet FAQs and magazines. There is a common
home page for the digital library system which allows users to access any publicly accessible
collection; in addition, each collection has its own “about” page that gives users information
about how the collection is organized and the principles governing what is included in it. To get
back to the “about” page at any time, just click on the “collection” icon that appears at the top
left of all searching and browsing pages. The screen appears as shown below.
Fig. 15
As an example this collection “test” is used to describe the different ways of finding information.
Almost all icons are clickable. Several of these icons appear at the top of almost every page.
Fig. 16
Publications by subject can be accessed by pressing the subjects button. This brings up a list of
subjects, represented by subject headings as shown below.
Fig. 17
Publications by title can be accessed by pressing the titles a-z button. This brings up a list of
documents in alphabetic order as shown in figure below.
Fig. 18
Publications by filename can be accessed by pressing the filenames button. This brings up a list
of entries, sorted by original filename as shown below.
Fig. 19
Publications by author can be accessed by pressing the authors a-z button. This brings up a list of
documents, sorted by author name as shown in the figure below:
Fig. 20
Publications by date can be accessed by clicking on the date button. This brings up a list of all
the issues, sorted chronologically as shown below:
Fig. 21
Fig. 22
When a query is made, the titles of twenty matching documents are shown (Fig. 22). The number
of search results per page can be specified in the preferences; here it is set to 20. A navigation
facility is provided for viewing the next twenty results and for moving back and forth among the
search results. If users click the title of any document, or the little button beside it, they will
see the document. A maximum of 100 is imposed on the number of documents returned. This number
can be changed by clicking the preferences button at the top of the page, as shown in Fig. 23.
Fig. 23
6.2. Search terms
Whatever users type into the query box is interpreted as a list of words or "search terms." When a
multi-word term is given, GSDL searches alphabetically by the given term and also by adjacent
terms, and presents the results. It ignores punctuation marks in the query. For example, for the
query
Fig. 24
6.3. Query Type
There are two different kinds of query.
• Queries for all of the words. These look for documents (or chapters, or titles) that contain
all the words that have been specified, as shown in Fig. 25. Documents that satisfy the query
are displayed.
Fig. 25
• Queries for some of the words. Users just list some terms that are likely to appear in the
documents they are looking for. Documents are displayed in order of how closely they
match the query. When determining the degree of match, the criteria used are:
o The more search terms a document contains, the closer it matches;
o Rare terms are more important than common ones;
o Short documents match better than long ones.
The result of this type of search will look like Fig. 24. Users can enter many search terms, even a
whole sentence or a whole paragraph. If only one term is specified, documents will be
ordered by its frequency of occurrence.
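The three ranking criteria listed above can be made concrete with a toy scoring function. This is an illustrative sketch only and not the formula Greenstone actually uses: the function name rank, the logarithmic rarity weight and the square-root length normalisation are all assumptions chosen to reflect the three criteria in a simple way.

```python
import math
from collections import Counter


def rank(query, documents):
    """Order document ids by a toy relevance score for a 'some words' query.

    documents: dict mapping a document id to its text.
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)

    # Number of documents that contain each word (used to measure rarity)
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in set(query.lower().split()):
            if counts[term]:
                # Rare terms weigh more than common ones
                rarity = math.log(1 + n_docs / doc_freq[term])
                # Each matching term adds to the score
                score += counts[term] * rarity
        if score > 0:
            # Shorter documents are favoured over longer ones
            scores[doc_id] = score / math.sqrt(len(words))
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "digital library software",
        "b": "library",
        "c": "digital library software and many other unrelated words here",
        "d": "cooking recipes",
    }
    print(rank("digital library", docs))
```

With a single query term, the score reduces to the term's frequency scaled by document length, which is in the spirit of ordering by frequency of occurrence.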
Fig. 26
After changing the preferences, do not use the browser’s Back button, since that would undo the
changes. Instead, click any of the buttons on the search/browse bar on the preferences page.
Fig. 27
GSDL first looks for an exact match of the phrase “Library and Information Service”. It then also
searches by the constituent terms and presents the results.
Phrase matches are case-insensitive if ignore case differences is set on the Preferences page.
If the words AND, OR, and NOT appear in the query they are treated as ordinary search terms,
not operators. For operators users must use &, |, and !. In addition, parentheses can be used for
grouping.
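How a query written with &, | and ! might be evaluated against one document can be sketched with a small recursive-descent evaluator. This is not Greenstone's query engine. The function names are hypothetical, and the operator precedence (! binds tightest, then &, then |) and the case-insensitive comparison are assumptions. Words such as AND are treated as ordinary search terms, as described above.

```python
import re

_TOKEN = re.compile(r"\s*([&|!()]|[A-Za-z0-9]+)")


def tokenize(query):
    """Split a query into words and the symbols & | ! ( )."""
    tokens, pos = [], 0
    query = query.rstrip()
    while pos < len(query):
        match = _TOKEN.match(query, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def matches(query, document_words):
    """Return True if the document's words satisfy the Boolean query."""
    tokens = tokenize(query)
    words = {word.lower() for word in document_words}
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == "|":
            advance()
            right = parse_and()  # always parse the right side before combining
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            advance()
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        if peek() is None:
            raise ValueError("unexpected end of query")
        token = advance()
        if token == "!":
            return not parse_not()
        if token == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected symbol: {token}")
        return token.lower() in words  # an ordinary search term

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected symbol: {peek()}")
    return result


if __name__ == "__main__":
    document = "Library and Information Service".split()
    print(matches("(digital | library) & !archive", document))
```

Evaluating per document keeps the sketch short; a real search engine would instead combine posting lists from its indexes.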
Fig. 28
7. CONCLUSION
GSDL is easy-to-use software. It can be used to create digital collections ranging from a small
library to a large one. Having both a graphical user interface (through a Web browser) and a
command-line interface is very advantageous. Advanced users who are good at programming can
develop a customized collection using the command line. However, GSDL is not perfect and has some
drawbacks. For example, when a collection of Web documents is built through the Web browser
interface, it also indexes all the hyperlinked files, which makes it cumbersome for the user to
browse through the bulky list of files. Overall, though, it is very flexible software, and a
number of digital libraries have been developed using it, such as Project Gutenberg and the New
Zealand Digital Library (NZDL).
8. REFERENCES
1. Dublin Core Metadata Initiative. Available from http://dublincore.org/
2. New Zealand Digital Library. Available from http://www.nzdl.org/cgi-bin/library
3. Greenstone Digital Library Software. Available from http://www.greenstone.org/english/home.html