Website of Frank Rügheimer

k-nearest neighbor classifier

Download

The program and sample workflows use libraries and tools from the table utility package written by Christan Borgelt (included in the compressed sourcecode). That package has been released under the GNU LESSER GENERAL PUBLIC LICENSE Version 2.1.

If you are planning to use the table utilities in your own programs, consider visiting Christian's website to obtain the most recent version of the package and its documentation.

Description

The k-nearest neighbor algorithm implements a simple data classification method, in which new cases are assigned the category label most frequently observed among their closest neighbors within a set of examples. Since the approach does not involve the construction of an explicit model, it avoids strong assumptions about the distribution of the target classes. The k-nearest neighbor approach is an instance of a lazy learning algorithm.

The set of "closest elements" is determined using a distance measure on a vector space, with the parameter k controlling the desired cardinality of the set. In this implementation the notion of k-neighborhood has been extended to avoid ambiguity if the distance measure fails to discern a unique set of k-neighbors. To that end all examples, for which at most k-1 elements are closer to the case to be classified, are included in the k-neighborhood. For program behavior in cases with more than one maximally frequent class in the in the k-neighborhood see the option -R.

This implementation of the knn algorithm was designed to specifically support applications in Systems Biology. It operates on large tables and comes with measures suited to comparing gene expression vectors.

Command-line interface

General usage and arguments are explained in the help text that comes with the program. It can be accessed from the command line by calling the program without any arguments or by including the "-?" option:

USAGE: ./knn [options] domfile [-h hdrx] xmplfile [-h hdri] infile [outfile]
OPTIONS:
  -k#      set neighborhood size parameters (the actual number
           of neighbors may exceed the value as examples having
           equal distance from the current case are included
           as a group)        (default: "5")
  -R meth  select method for forced tie resolution (initially
           ties are resolved by iteratively removing the most
           distant points from the neighborhood. If this fails,
           no class is assigned. However, tie resolution can be
           enforced by randomly selecting among the candidate 
           classes or using the class encountered first)
           valid values: "random" ,"order"
                              (default: do not resolve)
  -a       align fields       (default: do not align)
  -w       do not write field names to output file
  -b#      blank   characters (default: " \t\r")
  -f#      field   separators (default: " \t")
  -r#      record  separators (default: "\n")
  -s       suppress matching of corresponding tuple pairs in
           example and input files. Used for leave-one-out
           cross-validation   (default: disabled)
  -c name  select target attribute column in example file
                              (default:last column)
  -l name  set name for column classifier output column
                              (default:knn)
  -m#      select distance/similarity measure
           permissible values are:
           "euclid ":      ordering with euclidean distance
           "correl ":      ordering based on correlation
                           neg. correlation increases distance
           "correl2":      ordering based on correlation
                           neg. correlation decreases distance
                              (default: euclidean distance)
  -C#      comment characters (default: "#")
  -x       extended output
           Setting this option adds columns with average value
           of similarity/distance to members of neighborhood,
           number of occurrences for the most frequent class in
           the neighborhood and the real size of the selected
           neighborhood. The latter may differ from the argument
           passed via -k due to the extended definition the
           neighborhood and the tie resolution strategy
           Note: distances will currently be averaged even if
           that operation is not meaningful for the selected
           measure. This may change in future versions
  -h       read table header (field names) from hdrfile
                    (default: field names in first record)
  hdrx     file containing table header (field names) for
           example file
  hdri     file containing table header (field names) for
           input case file
ARGUMENTS:
  domfile  file with domain specifications (see dom utility)
  xmplfile file containing classification examples (no header)
  infile   file containing cases to be classified (no header)
  outfile  output file for table with classification results to
           if omitted output is written to <stdout>

Note that domainfile has to be created prior to running the program. A convenient way to create a domain file is by using Christian Borgelt's dom tool, which is distributed with this program. Examples of typical workflows can be found in the makefile under the target names starting with test.