SiSU information Structuring Universe

Serialized information, Structured Units

software for electronic texts, document collections, books, digital libraries, and search

with "atomic search" and text positioning system (shared text citation numbering: "ocn")

outputs include: plaintext, html, xhtml, XML, LaTeX, pdf, SQL (PostgreSQL and SQLite)


What does SiSU do? Summary [*]

Book Samples [*]

Markup Examples [*]

Object Citation Numbering - ocn [*]

(a text positioning system)

Search - "Atomic" [*]

Of interest is the ease of streaming documents to a relational database at an object (roughly paragraph) level, and the resulting potential for increased precision in the presentation of search matches. The ability to serialise html, LaTeX, XML, SQL (or whatever else) is also inherent in, or incidental to, the design. For a description see the abandoned U.S. provisional patent application.
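
As a rough illustration of object-level storage and search, the sketch below loads a document into SQLite one object (roughly one paragraph) at a time and reports matches by object citation number. The table layout, the column names and the helper functions load_document and search are assumptions made for this example; they are not SiSU's actual schema or code.

```python
import sqlite3

def load_document(conn, title, text):
    """Store each blank-line-separated object with a sequential ocn.

    Hypothetical helper: the 'objects' table is an assumed layout.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects "
        "(title TEXT, ocn INTEGER, body TEXT)"
    )
    objects = [p.strip() for p in text.split("\n\n") if p.strip()]
    conn.executemany(
        "INSERT INTO objects (title, ocn, body) VALUES (?, ?, ?)",
        [(title, n, body) for n, body in enumerate(objects, start=1)],
    )
    return len(objects)

def search(conn, term):
    """Return (title, ocn) pairs for every object containing the term."""
    rows = conn.execute(
        "SELECT title, ocn FROM objects WHERE body LIKE ? ORDER BY title, ocn",
        ("%" + term + "%",),
    )
    return rows.fetchall()

conn = sqlite3.connect(":memory:")
sample = (
    "A heading\n\n"
    "First paragraph about contracts.\n\n"
    "Second paragraph about treaties."
)
load_document(conn, "sample", sample)
print(search(conn, "treaties"))  # [('sample', 3)]
```

Because a match is reported as a document plus an object number rather than as a whole document, the same number can be used to locate the passage in any other output of that document.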

License [*]

Download [*]

GNU / Linux / Unix


man pages:





document preparation can be on any platform, in any editor:

Markup Syntax


* Composite document

the composite document is a superset of the following documents:

SiSU description

SiSU examples

SiSU technical

SiSU chronology

SiSU license

SiSU standard

SiSU download

SiSU abandoned provisional patent

Note: the placement of SiSU documents on the Net predates the release of SiSU.

For less markup than the most elementary HTML you can have so much more.

SiSU - Serialized information, Structured Units for Electronic Documents, is a document creation/management framework with the following features:

(i) markup syntax: (a) simpler than html, (b) mnemonic, influenced by mail/messaging/wiki markup practices, (c) human readable, and easily writable,

(ii) (a) minimal markup requirement, (b) single file marked up for multiple outputs,

(iii) (a) multiple outputs include amongst others: html; pdf via LaTeX; (structured) XML; sql - currently PostgreSQL (and SQLite); ascii, (also texinfo), (b) takes advantage of the strengths implicit in these very different output types,

(iv) provides a common object positioning and citation system for all outputs, which is human relevant and machine usable: object citation numbering, all objects (paragraphs, headings, verse, tables etc. and images) are numbered identically, for citation purposes, in all outputs (html, pdf, sql etc.),

(v) use of Dublin Core and other meta-tags to permit the addition of some semantic information on documents, and to ease integration of rdf/rss feeds etc.,

(vi) creates organised directory/file structure for (file-system) output, easily mapped with its clearly defined structure, with all text objects numbered, you know in advance where in each document output type, a bit of text will be found (eg. from an sql search, you know where to go to find the prepared html output or pdf etc.)... there is more; easy directory management and document associations, the document preparation (sub-)directory may be used to determine output (sub-)directory, the skin used, and the sql database used,

(vii) search of document sets, the relational database retains information on the document structure, and citation numbering makes it possible for example to present search matches as an index of documents and locations within the document where the match is found,

(viii) "word maps" rudimentary index, consisting of all the words in a document and their (text/ object) locations within the text, (and the possibility of adding vocabularies),

(ix) easily skinnable, document appearance on a project/site wide, directory wide, or document instance level easily controlled/changed,

(x) in many cases a regular expression may be used (once in the document header) to define all or part of a document's structure, obviating or reducing the need to provide structural markup within the document,

(xi) is a batch processor for handling large document sets, ... though once generated they need not be re-generated, unless changes are made to the desired presentation of a particular output type,

(xii) possible to pre-process, which permits: the easy creation of standard form documents and templates/term-sheets; or the building of composite documents (master documents) from other sisu marked up documents, or marked up parts, i.e. importing documents or parts of text into a main document should this be desired,

(xiii) future proofing, a framework for adding further capability or updating existing capability as required: (a) modular (thanks in no small part to Ruby): if another output format is required, write another module, (b) easy to update output formats (e.g. the html, xhtml, latex/pdf produced can be updated in the program and run against the whole document set), (c) easy to add, modify, or have alternative syntax rules for input, should you need to,

(xiv) scalability, dependent on your file-system (in my case Reiserfs), on the relational database used (currently PostgreSQL and SQLite), and on your hardware,

(xv) only marked up files need be backed up, to secure the larger document set produced,

(xvi) document version and comparison considerations (a) possibility to easily check or guarantee that the substantive content of a document is unchanged, through md5 (or other) hash keys, (b) version control, documents integrated with time based version control system, default CVS with use of $Id$ tag, which SiSU checks (c) SiSU's minimalist markup makes for meaningful "diffing" of the substantive content of markup-files,

(xvii) document management,

(xviii) use your favourite editor, syntax highlighting files for markup, primarily (g)vim so far,

(xix) remote operations: (a) run SiSU on a remote server (having prepared sisu markup documents locally or on that server; this solution, where sisu is installed on the remote server, works whatever type of machine you choose to prepare your markup documents on), (b) alternatively (assuming sisu is available to you locally but not installed on the remote server), run sisu locally and configure it to securely copy (scp) its output to your remote host, (c) request a remotely located sisu markup file and process it locally by identifying it by its URL.
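
Three of the features listed above (object citation numbering, "word maps", and hash checks on substantive content) can be sketched together in a few lines. This is a minimal illustration only: it assumes that objects are separated by blank lines, and the function names number_objects, word_map and content_digest are hypothetical, not part of SiSU.

```python
import hashlib
import re
from collections import defaultdict

def number_objects(text):
    """Split text into blank-line-separated objects, numbered from 1."""
    objects = [p.strip() for p in text.split("\n\n") if p.strip()]
    return list(enumerate(objects, start=1))

def word_map(numbered):
    """Map each lower-cased word to the sorted object numbers it occurs in."""
    index = defaultdict(set)
    for ocn, body in numbered:
        for word in re.findall(r"[a-z0-9']+", body.lower()):
            index[word].add(ocn)
    return {word: sorted(ocns) for word, ocns in index.items()}

def content_digest(numbered):
    """MD5 digest of the content, ignoring line wrapping inside objects."""
    normalised = "\n".join(" ".join(body.split()) for _, body in numbered)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

text = "Title\n\nThe law of the sea.\n\nThe sea is wide."
numbered = number_objects(text)
index = word_map(numbered)
print(index["sea"])    # [2, 3]
print(index["title"])  # [1]

# Re-wrapping a paragraph changes the file but not the digest.
rewrapped = "Title\n\nThe law of\nthe sea.\n\nThe sea is wide."
print(content_digest(number_objects(rewrapped)) == content_digest(numbered))  # True
```

Since the numbers are derived from the source text alone, every output generated from that text can carry the same numbers, which is what allows a citation or a search hit to be followed from one output type to another.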

More information on SiSU provided at:

SiSU was developed in relation to legal documents, and is strong across a wide variety of texts (law, literature and, more generally, the humanities and part of the social sciences). SiSU handles images but is not at this time suitable for formulae or statistics, or for technical writing.

SiSU has been under development and in use for several years. The requirements for covering a wide range of documents within its use domain have been explored.

Some modules are more mature than others, the most mature being HTML and LaTeX / pdf. The PostgreSQL and search functions are usable and, together with ocn, unique (to the best of my knowledge). The XML output document set is "well formed" but largely proof of concept.

w3 since October 3 1993