k-nearest neighbor classifier
The program and sample workflows use libraries and tools from the table utility package written by Christan Borgelt (included in the compressed sourcecode). That package has been released under the GNU LESSER GENERAL PUBLIC LICENSE Version 2.1.
If you are planning to use the table utilities in your own programs, consider visiting Christian's website to obtain the most recent version of the package and its documentation.
The k-nearest neighbor algorithm implements a simple data classification method, in which new cases are assigned the category label most frequently observed among their closest neighbors within a set of examples. Since the approach does not involve the construction of an explicit model, it avoids strong assumptions about the distribution of the target classes. The k-nearest neighbor approach is an instance of a lazy learning algorithm.
The set of "closest elements" is determined using a distance measure on a
vector space, with the parameter k controlling the desired cardinality of the
set. In this implementation the notion of k-neighborhood has been extended to
avoid ambiguity if the distance measure fails to discern a unique set of
k-neighbors. To that end all examples, for which at most k-1 elements are
closer to the case to be classified, are included in the k-neighborhood. For
program behavior in cases with more than one maximally frequent class in the in
the k-neighborhood see the option
This implementation of the knn algorithm was designed to specifically support applications in Systems Biology. It operates on large tables and comes with measures suited to comparing gene expression vectors.
General usage and arguments are explained in the help text that comes with the program. It can be accessed from the command line by calling the program without any arguments or by including the "-?" option:
USAGE: ./knn [options] domfile [-h hdrx] xmplfile [-h hdri] infile [outfile] OPTIONS: -k# set neighborhood size parameters (the actual number of neighbors may exceed the value as examples having equal distance from the current case are included as a group) (default: "5") -R meth select method for forced tie resolution (initially ties are resolved by iteratively removing the most distant points from the neighborhood. If this fails, no class is assigned. However, tie resolution can be enforced by randomly selecting among the candidate classes or using the class encountered first) valid values: "random" ,"order" (default: do not resolve) -a align fields (default: do not align) -w do not write field names to output file -b# blank characters (default: " \t\r") -f# field separators (default: " \t") -r# record separators (default: "\n") -s suppress matching of corresponding tuple pairs in example and input files. Used for leave-one-out cross-validation (default: disabled) -c name select target attribute column in example file (default:last column) -l name set name for column classifier output column (default:knn) -m# select distance/similarity measure permissible values are: "euclid ": ordering with euclidean distance "correl ": ordering based on correlation neg. correlation increases distance "correl2": ordering based on correlation neg. correlation decreases distance (default: euclidean distance) -C# comment characters (default: "#") -x extended output Setting this option adds columns with average value of similarity/distance to members of neighborhood, number of occurrences for the most frequent class in the neighborhood and the real size of the selected neighborhood. The latter may differ from the argument passed via -k due to the extended definition the neighborhood and the tie resolution strategy Note: distances will currently be averaged even if that operation is not meaningful for the selected measure. This may change in future versions -h read table header (field names) from hdrfile (default: field names in first record) hdrx file containing table header (field names) for example file hdri file containing table header (field names) for input case file ARGUMENTS: domfile file with domain specifications (see dom utility) xmplfile file containing classification examples (no header) infile file containing cases to be classified (no header) outfile output file for table with classification results to if omitted output is written to <stdout>
domainfile has to be created prior to running the
A convenient way to create a domain file is by using Christian Borgelt's
dom tool, which is distributed with this program. Examples of
typical workflows can be found in the makefile under the target names starting