Website of Frank Rügheimer

Tools for Condensed Random Sets

Download

The program and sample workflows use libraries and tools from the table utility package written by Christian Borgelt (included in the compressed source code). That package has been released under the GNU Lesser General Public License, version 2.1.

If you are planning to use the table utilities in your own programs, consider visiting Christian's website to obtain the most recent version of the package and its documentation.

Linked Publications

Description

Set-based descriptions, such as annotations and relations, are widely used for knowledge representation in computational linguistics and computational biology. The csvdist package provides utilities to induce, operate on and query distributions over subsets of large item domains. The tools also support sets over tree-structured value domains, which are associated with ontologies and can be used to consistently model full and partial knowledge at various resolution levels.

The model limits the number of parameters by using a two-layered approximation based on a coarsened probability distribution and coverage factors (see linked publications for technical detail). The representation emphasizes rates for singleton sets and element coverage, which are relevant for the estimation of probability bounds.

The hdist library implements functionality to induce and query distributions over sets and to project them between different resolutions on structured domains. The programs crsinduce and crsapply provide a command-line interface to this functionality. The psvmodel program serves as a platform for comparative evaluation of the flat CRS, the structured CRS and alternative approaches that model the presence of elements in sets as mutually independent.

Command-line interface

General usage and arguments for each program are explained by the integrated program help. The help screen can be accessed from the command line by calling the respective program without any arguments or by including the "-?" option:


USAGE: crsinduce [options] dotfile trnfile [outfile]

ARGUMENTS:
dotfile provides a hierarchy structure as a directed tree in simplified
        .dot format.
trnfile contains tables with training data. Format description follows:
        Each table has an optional comment part (line starting with '#')
        followed by a header line with field names. Data records take
        one line each, with fields separated by tabulators.
        Any field may either contain an identifier or a set of identifi-
        ers that are separated by spaces and enclosed in curly brackets
        (e.g. {A B C}). 
        By default, the last field is assumed to contain the attribute
        to be processed
outfile specifies the name of an output file for saving the induced mo-
        del. Output consists of several variable-value pairs for model
        parameters and an enriched tree structure annotated with the
        distribution information.

OPTIONS:
  -l#.# set value for Laplace correction to #.#      (default=0.0)
  -m<mode> set interpretation mode for non-leaf nodes; options are:
        disj, incl or other: non-leaf nodes indicate a non-specific
          outcome separate from possible expansions of the node.
          Note: This is similar to using an explicit "other" outcome in
          each branch. Unexpanded intermediate nodes can absorb
          probability mass and contribute towards the coverage rates of
          their respective sub-trees. Unlike with separate nodes,
          however, the generating grammar for instantiations over
          hierarchical domains prevents a node from occurring in the
          same set outcome as any node from its expansion sub-tree. If
          such invalid value combinations are encountered in
          instantiation specifications, the (implicit) ancestor node
          will be silently dropped. Distributions saved in this mode can
          also serve as intermediates if delayed expansion to any of the
          other interpretations is desired. This mode is used, e.g., for
          the Gene Ontology standard, which requires genes to be
          annotated with non-leaf terms when this constitutes the most
          specific characterization supported by the available
          evidence.  (default configuration)
        conj or impr: non-leaf nodes are expanded to the set of all leaf
          nodes in the subtree rooted at the respective node -- used,
          e.g., as an imprecise specification.
        expand, split or uncert: force expansion of outcome tree after
          training. Probability mass and coverage contributions are
          split according to the distribution of expansions of the node
          observed in the remaining data. This mode is consistent with a
          model that permits only leaf nodes and sets of leaf nodes as
          outcomes.
  -s    silent mode (suppress startup greeting and copyright message)
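
The -m interpretation modes can be illustrated with a small Python sketch. It is not part of the package and does not use the hdist library; the toy hierarchy CHILDREN and the function names expand_conj and drop_ancestors are hypothetical names chosen for this illustration. The first function mimics the conj/impr reading (a non-leaf node stands for all leaves below it), the second the clean-up described for disj/incl/other (an ancestor that appears together with a node from its own sub-tree is dropped).

```python
# Hypothetical toy hierarchy (parent -> children); not taken from the package.
CHILDREN = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": [],
    "A1": [],
    "A2": [],
}


def leaves_below(node, children=CHILDREN):
    """Return the set of leaf nodes in the subtree rooted at node."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    result = set()
    for kid in kids:
        result |= leaves_below(kid, children)
    return result


def expand_conj(instantiation, children=CHILDREN):
    """conj/impr reading: replace every node by all leaves of its subtree."""
    expanded = set()
    for node in instantiation:
        expanded |= leaves_below(node, children)
    return expanded


def descendants(node, children=CHILDREN):
    """Return all proper descendants of node."""
    found = set()
    stack = list(children.get(node, []))
    while stack:
        current = stack.pop()
        found.add(current)
        stack.extend(children.get(current, []))
    return found


def drop_ancestors(instantiation, children=CHILDREN):
    """disj/incl/other reading: drop any node that co-occurs with one of
    its own descendants, since such combinations are invalid."""
    items = set(instantiation)
    return {n for n in items if not (descendants(n, children) & items)}


if __name__ == "__main__":
    print(sorted(expand_conj({"A", "B"})))
    print(sorted(drop_ancestors({"A", "A1", "B"})))
```

How crsinduce handles these cases internally, and how the expand/split/uncert mode redistributes probability mass, is not covered by this sketch.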

USAGE: crsapply [options] mdlfile [datfile [outfile]]

DESCRIPTION:
  This program reads a hierarchical distribution model and a database of
  (set-)instantiations and assesses the fit of the data to the model.
  If outfile is specified, log-likelihoods for individual instantiations
  are written to that file. The program writes a table with fit
  statistics of the database as a whole to stdout.
ARGUMENTS:
mdlfile hdist model file (e.g. from crsinduce)
datfile database with (set-)instantiations to be assessed.
        Format description follows:
        Each table has an optional comment part (line starting with '#')
        followed by a header line with field names. Data records take
        one line each, with fields separated by tabulators.
        Any field may either contain an identifier or a set of identifi-
        ers that are separated by spaces and enclosed in curly brackets
        (e.g. {A B C}). 
        By default, the last field is assumed to contain the attribute
        to be processed. If datfile is not specified it defaults to 
        stdin.
outfile If outfile is specified, a table listing each valid (set-)
        instantiation from datfile along with its log-likelihood under
        the model specified in mdlfile will be written to that file.
        The interpretation of non-leaf symbols and the settings for
        Laplace corrections follow the specifications in mdlfile.
OPTIONS:
  -f <framefile>   read frame specification from framefile -- nodes 
        listed in framefile will be considered as leaves for query in-
        terpretation.
  -s    silent mode (suppress greeting and copyright message at startup)
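
The table format shared by trnfile and datfile can be read with a few lines of Python. The following is a minimal sketch based only on the format description in the help text above, not on the package's own reader. The function names parse_field and parse_table are hypothetical. The sketch assumes a single table per input and skips blank lines, neither of which is stated in the help text.

```python
import re

# A field enclosed in curly brackets holds a set of identifiers, e.g. {A B C}.
SET_PATTERN = re.compile(r"^\{(.*)\}$")


def parse_field(text):
    """Return a frozenset for '{A B C}'-style fields, else the identifier."""
    text = text.strip()
    match = SET_PATTERN.match(text)
    if match:
        return frozenset(match.group(1).split())
    return text


def parse_table(lines, column=-1):
    """Parse one table: '#' comment lines, a header line, then one
    tab-separated record per line. By default the last field is used."""
    header = None
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Skipping blank lines is an assumption of this sketch.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
            continue
        records.append(parse_field(fields[column]))
    return header, records


if __name__ == "__main__":
    # Invented sample data for illustration only.
    sample = [
        "# demo table\n",
        "gene\tterms\n",
        "g1\t{A B C}\n",
        "g2\tD\n",
    ]
    header, records = parse_table(sample)
    print(header)
    print(records)
```

Set-valued fields are returned as frozensets so that they can be counted or used as dictionary keys when tabulating observed set instantiations.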

After unpacking the source code, run the ./demo script to start a tutorial/demonstration of the program and its interface.