NAME


class_discoverer.pl - multiclass discovery in array data


SYNOPSIS

class_discoverer.pl --datafile=<name> [--normalize] [--logtransform] [--printgenes] [--noprintgenes] [--score --classfile=<name>] [--genefile=<name>] [--k=<value>] [--Pcut=<value>] [--Nsuccess=<value>] [--Ntotal=<value>] [--Tstart=<value>] [--Tend=<value>] [--eta=<value>] [--minscore=<value>] [--help] [--man]


OPTIONS

--datafile=<name>
Specify the file with array data. The array data should be in tab-delimited format, where rows correspond to genes and columns correspond to individual experiments (measurements). The first column is reserved for gene IDs and the first row for experiment IDs. Two consecutive tabs are interpreted as a missing value. Gene IDs and experiment IDs should be unique. The following is an example of the array data format.
    Genes    Exp_1   Exp_2   Exp_3   Exp_4
    Gene_1   0       0       0       0
    Gene_2   1               1       1
    Gene_3   -1      -1      -1      -1
    Gene_4   0       1       2       3
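
The format above can be read with a few lines of code. The following is a minimal Python sketch, not the parser used by class_discoverer.pl; the function name read_array_data and the use of None for missing values are assumptions made for illustration.

```python
import io

def read_array_data(handle):
    """Parse tab-delimited array data.

    Returns (experiment_ids, data), where data maps each gene id to a
    list of floats, with None marking a missing value (empty field).
    """
    header = handle.readline().rstrip("\r\n").split("\t")
    experiments = header[1:]  # first column is reserved for gene ids
    if len(set(experiments)) != len(experiments):
        raise ValueError("experiment ids must be unique")
    data = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        gene, cells = fields[0], fields[1:]
        if gene in data:
            raise ValueError("duplicate gene id: " + gene)
        if len(cells) > len(experiments):
            raise ValueError("too many columns for gene: " + gene)
        # A row that ends early is padded with missing values.
        cells = cells + [""] * (len(experiments) - len(cells))
        # Two consecutive tabs produce an empty field, i.e. a missing value.
        data[gene] = [float(c) if c.strip() else None for c in cells]
    return experiments, data

example = (
    "Genes\tExp_1\tExp_2\tExp_3\tExp_4\n"
    "Gene_1\t0\t0\t0\t0\n"
    "Gene_2\t1\t\t1\t1\n"
)
experiments, data = read_array_data(io.StringIO(example))
```

Here the empty field between two tabs in the Gene_2 row is read as a missing value for Exp_2.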

--normalize
The array data is centered such that each experiment has average expression value zero.
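
As an illustration of the centering, the following Python sketch subtracts each experiment's mean from its values. It is not taken from class_discoverer.pl: the function name center_experiments is hypothetical, and skipping missing values (None) when computing the mean is an assumption.

```python
def center_experiments(columns):
    """Center each experiment (column) to mean zero.

    columns is a list of per-experiment value lists; None marks a
    missing value, which is ignored in the mean and left as None.
    """
    centered = []
    for values in columns:
        present = [v for v in values if v is not None]
        # Assumption: an experiment with no values is left unchanged.
        mean = sum(present) / len(present) if present else 0.0
        centered.append([None if v is None else v - mean for v in values])
    return centered

centered = center_experiments([[1.0, 2.0, 3.0], [1.0, None, 3.0]])
```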

--logtransform
The array data is logarithm transformed (base 2).
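
A base-2 logarithm transform can be sketched in Python as follows. How class_discoverer.pl treats zero or negative values is not stated here; treating them as missing (None) is an assumption of this sketch, as is the function name log2_transform.

```python
import math

def log2_transform(values):
    """Return log2 of each value; None and non-positive values become None."""
    result = []
    for v in values:
        if v is None or v <= 0:
            # Assumption: the logarithm is undefined here, so mark as missing.
            result.append(None)
        else:
            result.append(math.log2(v))
    return result

transformed = log2_transform([1.0, 2.0, 8.0, None, 0.0])
```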

--printgenes
Prints the discriminatory genes for classes. Default is to print the genes.

--noprintgenes
Turns off printing of the discriminatory genes for classes.

--score
The score for a given partitioning into classes is calculated. No class discovery is performed. A file with the classes for experiments has to be provided using --classfile.

--classfile=<name>
Specify the file with classes for the experiments. If --score is set, the score for the classes in the file is calculated. The file should contain two tab-separated columns. The first column should contain experiment IDs and the second column class labels. The following is an example of the class labeling format.
    Exp_1       1
    Exp_2       3
    Exp_3       2
    Exp_4       1
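
The class labeling format can be read as sketched below in Python. This is an illustration only, not the code used by class_discoverer.pl; the function name read_class_file is hypothetical, and class labels are kept as strings.

```python
import io
from collections import defaultdict

def read_class_file(handle):
    """Read 'experiment id <TAB> class label' lines.

    Returns (labels, groups): labels maps experiment id to class label,
    groups maps class label to the list of its experiment ids.
    """
    labels = {}
    groups = defaultdict(list)
    for line in handle:
        line = line.strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) < 2:
            raise ValueError("expected two tab-separated columns: " + line)
        exp_id, label = fields[0], fields[1]
        if exp_id in labels:
            raise ValueError("duplicate experiment id: " + exp_id)
        labels[exp_id] = label
        groups[label].append(exp_id)
    return labels, dict(groups)

example = "Exp_1\t1\nExp_2\t3\nExp_3\t2\nExp_4\t1\n"
labels, groups = read_class_file(io.StringIO(example))
```

For the example file above, Exp_1 and Exp_4 end up in the same class.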

--genefile=<name>
Specify the file with genes to be peeled (removed) from the data set prior to any analysis. The file should contain tab-separated columns. The first column should contain the indexes of the genes to be peeled. Further columns are optional and are ignored in the analysis.

--k=<value>
Specify number of classes.

--Pcut=<value>
Specify the P value cutoff used in the score calculation.

--Nsuccess=<value>
Specify maximum number of successful updates before lowering temperature in the simulated annealing scheme.

--Ntotal=<value>
Specify maximum total number of proposed label changes before lowering temperature in the simulated annealing scheme.

--Tstart=<value>
Specify start temperature in the simulated annealing scheme.

--Tend=<value>
Specify end temperature in the simulated annealing scheme.

--eta=<value>
Specify the factor by which the temperature is decreased in the simulated annealing scheme.
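
To show how the options --Nsuccess, --Ntotal, --Tstart, --Tend and --eta could interact, the following Python sketch implements a generic simulated annealing loop over class labels. It is not the implementation in class_discoverer.pl. The assumptions are: the score is maximized, a proposal changes the label of one random experiment, acceptance follows the Metropolis rule, and the temperature is multiplied by eta (with 0 < eta < 1) at each cooling step. The function name anneal and the toy score are hypothetical; see the paper cited under DESCRIPTION for the actual method.

```python
import math
import random

def anneal(labels, score, k, t_start, t_end, eta, n_success, n_total, rng=None):
    """Generic simulated annealing over class labels 0..k-1 (a sketch)."""
    if k < 2 or not 0.0 < eta < 1.0 or t_end <= 0.0:
        raise ValueError("need k >= 2, 0 < eta < 1 and t_end > 0")
    rng = rng or random.Random()
    labels = list(labels)
    current = score(labels)
    best, best_labels = current, list(labels)
    t = t_start
    while t > t_end:
        successes = 0
        # At most n_total proposals are made at each temperature.
        for _ in range(n_total):
            i = rng.randrange(len(labels))
            old = labels[i]
            labels[i] = rng.choice([c for c in range(k) if c != old])
            proposed = score(labels)
            delta = proposed - current
            # Metropolis rule (assumed): always accept improvements,
            # accept a worse score with probability exp(delta / t).
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current = proposed
                successes += 1
                if current > best:
                    best, best_labels = current, list(labels)
                # Cool down early after n_success accepted changes.
                if successes >= n_success:
                    break
            else:
                labels[i] = old  # reject the proposal
        t *= eta  # assumed multiplicative cooling
    return best_labels, best

# Toy score for illustration: number of labels that match a fixed target.
target = [0, 1, 0, 1, 1, 0]
toy_score = lambda labs: sum(a == b for a, b in zip(labs, target))
found, found_score = anneal([0] * 6, toy_score, k=2, t_start=1.0, t_end=0.01,
                            eta=0.8, n_success=10, n_total=100,
                            rng=random.Random(1))
```

In this sketch, a smaller eta cools faster and so evaluates fewer labelings, while larger values of n_success and n_total spend more proposals at each temperature.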

--minscore=<value>
Specify minimum score for recorded classes.

--help
Prints a help message and exits.

--man
Prints the manual page and exits.


NOTES

  1. For discovery of two classes, P values from random permutation tests are stored in the file 'pvalues.data' in binary format using the CPAN module Storable. If 'pvalues.data' is not compatible with your system you have to generate one using the included 'generate_pvalues.pl' program.

  2. The file 'pvalues.data' contains results for data sets with at most 100 experiments in total. If you are analysing a data set with more than 100 experiments and do not want to perform the permutation tests every time, you have to modify the subroutine 'new' in 'WilcoxonTest.pm'.


DESCRIPTION

See Y. Liu and M. Ringner, Multiclass discovery in array data, BMC Bioinformatics 5, 70 (2004) for a comprehensive description of the class discovery method.


AUTHORS

Yingchun Liu and Markus Ringner

Please report bugs to markus.ringner@med.lu.se