Repository 'w4mkmeans' in Galaxy Toolshed
Kmeans for W4m is a Galaxy tool wrapper for the R stats::kmeans function, for use with the Workflow4Metabolomics flavor of Galaxy.
This tool is built with planemo.
Arthur Eschenlauer (University of Minnesota)
See the NEWS section below
Using the intensities in the dataMatrix, this tool clusters samples, features (variables), or both from the W4M dataMatrix and writes the results to new columns in sampleMetadata, variableMetadata, or both, respectively.
- If several comma-separated K's are supplied, then one column is added for each K.
- This clustering is not hierarchical; each member of a cluster by definition is not a member of any other cluster.
- For feature-clustering, each feature is assigned to a cluster such that the feature's response for all samples is closer to the mean of all features for that cluster than to the mean for any other cluster.
- For sample-clustering, each sample is assigned to a cluster such that the sample's response for all features is closer to the mean of all samples for that cluster than to the mean for any other cluster.
Please note that some places in the W4m documentation refer to features (ion-intensity vs. retention-time and mass-to-charge ratio) as 'variables', since they are the variables used in statistical analysis.
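The assignment rule described above can be illustrated with a minimal Python sketch. It is not the tool's own code (the tool calls R's stats::kmeans); the toy matrix and the cluster means are made-up values, and `nearest_center` is a hypothetical helper name. The point is only that feature-clustering compares rows of the variable x sample matrix, while sample-clustering compares its columns.

```python
import math

def nearest_center(profile, centers):
    """Return the index of the center closest (Euclidean) to profile."""
    distances = [math.dist(profile, center) for center in centers]
    return distances.index(min(distances))

# Toy variable x sample matrix (made-up values): rows are features,
# columns are samples.
data_matrix = [
    [1.0, 1.2, 9.0, 9.1],  # feature f1
    [1.1, 0.9, 8.8, 9.3],  # feature f2
    [5.0, 5.1, 5.0, 4.9],  # feature f3
]

# Feature-clustering compares rows; sample-clustering compares columns.
feature_profiles = data_matrix
sample_profiles = [list(column) for column in zip(*data_matrix)]

# Hypothetical cluster means (in practice k-means estimates these).
feature_centers = [[1.05, 1.05, 8.9, 9.2], [5.0, 5.1, 5.0, 4.9]]
sample_centers = [[1.1, 1.0, 5.05], [9.05, 9.05, 4.95]]

# Cluster numbers are reported starting from 1.
feature_clusters = [nearest_center(p, feature_centers) + 1 for p in feature_profiles]
sample_clusters = [nearest_center(p, sample_centers) + 1 for p in sample_profiles]
print(feature_clusters)  # [1, 1, 2]
print(sample_clusters)   # [1, 1, 2, 2]
```

Each sample or feature receives exactly one cluster number, which reflects the non-hierarchical, non-overlapping nature of the clustering.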
Tool category | Upstream tool category | Downstream tool categories |
---|---|---|
Statistical Analysis | Preprocessing | Statistical Analysis |
File | Format |
---|---|
Data matrix | tabular |
Sample metadata | tabular |
Variable (i.e., feature) metadata | tabular |
Data matrix - input-file dataset
W4m variable x sample 'dataMatrix' (tab-separated values) file of the numeric data matrix, with . as decimal and NA for missing values; the table must not contain metadata apart from row and column names; the row and column names must be identical to the row names of the feature and sample metadata, respectively (see below)
Sample metadata - input-file dataset
W4m sample x metadata 'sampleMetadata' (tab-separated values) file of the numeric and/or character sample metadata, with . as decimal and NA for missing values
Feature metadata - input-file dataset
W4m variable x metadata 'variableMetadata' (tab-separated values) file of the numeric and/or character feature metadata, with . as decimal and NA for missing values
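As a rough illustration of the three input formats, the following Python sketch parses tiny in-memory tab-separated tables and checks that the names line up. The file contents, the top-left header cells, and the helper names `read_w4m_table` and `to_number` are all made up for this example. Treating "identical" as "same names in the same order" is an assumption of the sketch, not a documented rule of the tool.

```python
import csv
import io

def read_w4m_table(text):
    """Parse tab-separated text into (column_names, {row_name: cells})."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    return header[1:], {row[0]: row[1:] for row in body}

def to_number(cell):
    """Convert a dataMatrix cell to float; 'NA' marks a missing value."""
    return None if cell == "NA" else float(cell)

# Made-up example tables; the first header cell is a placeholder.
data_matrix_text = "dataMatrix\ts1\ts2\nf1\t10.5\tNA\nf2\t3.25\t4.0\n"
sample_metadata_text = "sampleMetadata\tgroup\ns1\tA\ns2\tB\n"
variable_metadata_text = "variableMetadata\tmz\nf1\t100.1\nf2\t200.2\n"

sample_names, matrix_rows = read_w4m_table(data_matrix_text)
intensities = {
    name: [to_number(cell) for cell in cells]
    for name, cells in matrix_rows.items()
}
_, sample_rows = read_w4m_table(sample_metadata_text)
_, variable_rows = read_w4m_table(variable_metadata_text)

# dataMatrix columns are samples; dataMatrix rows are features.
names_match = (
    sample_names == list(sample_rows)
    and list(intensities) == list(variable_rows)
)
print(names_match)
```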
prefix for cluster names - character(s) to add as prefix to category number (default = 'c')
Some tools require non-numeric values to discern categorical data; e.g., enter 'c' here to prepend 'c' to cluster numbers in the output; default 'c'.
ksamples - K or K's for samples (default = 0)
integer or comma-separated integers; zero (the default) or less will result in no calculation.
kfeatures - K or K's for features (default = 0)
integer or comma-separated integers; zero (the default) or less will result in no calculation.
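One way to read the K parameters is sketched below in Python. The function name `parse_k_list` is hypothetical, and the exact rules the tool applies (for instance, how it treats a value such as `3.0`) are not stated here; this sketch assumes that only tokens accepted by Python's `int()` count as integers, that duplicates are dropped, and that values of zero or less are skipped so that no clustering is requested for them.

```python
def parse_k_list(text):
    """Return the unique positive integers in a comma-separated string."""
    ks = []
    for token in text.split(","):
        try:
            k = int(token.strip())
        except ValueError:
            continue  # discard tokens that are not integers, without failing
        if k > 0 and k not in ks:
            ks.append(k)
    return ks

print(parse_k_list("3,4"))        # [3, 4]
print(parse_k_list("5,6,7,6,x"))  # [5, 6, 7]
print(parse_k_list("0"))          # []
```

An empty result corresponds to the default of 0, where no calculation is performed.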
iter_max - maximum iterations (default = 20)
maximum number of iterations per calculation (see stats::kmeans documentation).
nstart - how many random sets should be chosen (default = 20)
number of random sets of centers to start calculation (see stats::kmeans documentation).
algorithm - algorithm for clustering (default = 'Hartigan-Wong')
K-means clustering algorithm, default 'Hartigan-Wong'; alternatives 'Lloyd', 'MacQueen'; 'Forgy' is a synonym for 'Lloyd' (see stats::kmeans documentation).
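To give a feel for how iter_max and nstart interact, here is a minimal pure-Python sketch of Lloyd's algorithm with random restarts. It is not the tool's implementation and not the default Hartigan-Wong algorithm; the function names and the `seed` parameter are hypothetical, and iter_max is assumed to be at least 1. Each run stops after at most iter_max iterations, and the run with the smallest total within-cluster sum of squares among the nstart runs is kept.

```python
import math
import random

def lloyd(points, k, iter_max, rng):
    """One run of Lloyd's algorithm from k randomly chosen starting points."""
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(iter_max):
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    within_ss = sum(
        math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment)
    )
    return within_ss, assignment

def kmeans(points, k, iter_max=20, nstart=20, seed=0):
    """Run lloyd() nstart times and keep the run with the lowest within-SS."""
    rng = random.Random(seed)
    runs = [lloyd(points, k, iter_max, rng) for _ in range(nstart)]
    return min(runs, key=lambda run: run[0])

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
within_ss, labels = kmeans(points, k=2)
print(labels, round(within_ss, 3))
```

Raising nstart makes it less likely that a poor random start determines the reported clusters, at the cost of more computation.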
W4m sampleMetadata - (tab-separated values) file identical to the Sample metadata file given as an input argument, except for one column added for each K
- k# - cluster number for clustering samples with K = #
W4m variableMetadata - (tab-separated values) file identical to the Feature metadata file given as an input argument, except for one column added for each K
- k# - cluster number for clustering features with K = #
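The shape of the added columns can be sketched as follows. This Python fragment is illustrative only: `add_cluster_columns` is a hypothetical helper, the metadata and cluster numbers are made up, and the in-memory dictionaries stand in for the tab-separated files. It shows one `k#` column per K, with the cluster-name prefix prepended to each cluster number.

```python
def add_cluster_columns(metadata, clusters_by_k, prefix="c"):
    """Return a copy of the metadata rows with one 'k#' column per K."""
    result = {name: dict(row) for name, row in metadata.items()}
    for k, clusters in clusters_by_k.items():
        for name, cluster in clusters.items():
            result[name][f"k{k}"] = f"{prefix}{cluster}"
    return result

# Made-up sample metadata and cluster numbers for K = 3 and K = 4.
sample_metadata = {"s1": {"group": "A"}, "s2": {"group": "B"}}
clusters_by_k = {3: {"s1": 1, "s2": 3}, 4: {"s1": 2, "s2": 4}}

augmented = add_cluster_columns(sample_metadata, clusters_by_k)
print(augmented["s1"])  # {'group': 'A', 'k3': 'c1', 'k4': 'c2'}
```

With an empty prefix the cells would hold bare cluster numbers, which some downstream tools may not treat as categorical.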
scores - (tab-separated values) file with one line for each K.
- clusterOn - what was clustered - either 'sample' or 'feature'
- k - the chosen K for clustering
- totalSS - total (between-treatments plus total of within-treatments) sum of squares
- betweenSS - between-treatments sum of squares
- proportion - betweenSS / totalSS
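The three numeric score columns can be worked through on a small example. The Python sketch below uses made-up one-dimensional data and hypothetical helper names; it computes totalSS as the sum of squared distances to the overall mean, obtains betweenSS by subtracting the summed within-cluster sums of squares from totalSS, and reports their ratio.

```python
def sum_of_squares(points):
    """Sum of squared distances from each point to the mean of the points."""
    center = [sum(col) / len(points) for col in zip(*points)]
    return sum((x - c) ** 2 for p in points for x, c in zip(p, center))

def score(points, assignment):
    """Return totalSS, betweenSS and proportion for one clustering."""
    total_ss = sum_of_squares(points)
    within_ss = sum(
        sum_of_squares([p for p, a in zip(points, assignment) if a == label])
        for label in set(assignment)
    )
    between_ss = total_ss - within_ss
    return {
        "totalSS": total_ss,
        "betweenSS": between_ss,
        "proportion": between_ss / total_ss,
    }

# Made-up data: two tight clusters around 1 and 11.
points = [[0.0], [2.0], [10.0], [12.0]]
result = score(points, [1, 1, 2, 2])
print(result["totalSS"], result["betweenSS"])  # 104.0 100.0
```

A proportion close to 1 means that most of the variation lies between clusters rather than within them.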
Input files
Other input parameters
Input Parameter | Value |
---|---|
prefix for cluster names | c |
ksamples | 3,4 |
kfeatures | 5,6,7 |
iter_max | 20 |
nstart | 20 |
algorithm | Hartigan-Wong |
September 2018, Version 0.98.5 - Maintenance release
- Fix issue inherited by copy-paste from HegemanLab/w4mclassfilter#1.
March 2018, Version 0.98.4 - Maintenance release
- Update bioconda r-base dependency to v3.4.1
- Add dependency on conda packages libssh2 and krb5 needed by makePSOCKcluster on some platforms
- Make tool fail when no results are produced
- Change parameter defaults for iterations and random sets to improve convergence of results.
- Published to the Galaxy toolshed https://toolshed.g2.bx.psu.edu/view/eschen42/w4mkmeans/c415b7dc6f37
August 2017, Version 0.98.3 - Feature-tuning release
- Add (optional) prefix to category numbers for downstream tools that treat only non-numeric data as categorical.
- Accept as possible K only unique numbers convertible to integer; discard others without failing.
- Published to the Galaxy toolshed https://toolshed.g2.bx.psu.edu/view/eschen42/w4mkmeans/02cafb660b72
August 2017, Version 0.98.1 - First release
- Published to the Galaxy toolshed https://toolshed.g2.bx.psu.edu/view/eschen42/w4mkmeans/6ccbe18131a6
R Core Team (2017). stats::kmeans - K-Means Clustering, R Foundation for Statistical Computing.
Forgy, E. (1965). Cluster Analysis of Multivariate Data: Efficiency versus Interpretability of Classification. In Biometrics, 21 (3), pp. 768-769.
Guitton, Yann and Tremblay-Franco, Marie and Le Corguillé, Gildas and Martin, Jean-François and Pétéra, Mélanie and Roger-Mele, Pierrick and Delabrière, Alexis and Goulitquer, Sophie and Monsoor, Misharl and Duperier, Christophe et al. (2017). Create, run, share, publish, and reference your LC–MS, FIA–MS, GC–MS, and NMR data analysis workflows with the Workflow4Metabolomics 3.0 Galaxy online infrastructure for metabolomics. In The International Journal of Biochemistry & Cell Biology. [doi:10.1016/j.biocel.2017.07.002]
Giacomoni, F. and Le Corguille, G. and Monsoor, M. and Landi, M. and Pericard, P. and Petera, M. and Duperier, C. and Tremblay-Franco, M. and Martin, J.-F. and Jacob, D. et al. (2014). Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics. In Bioinformatics, 31 (9), pp. 1493–1495. [doi:10.1093/bioinformatics/btu813]
Hartigan, J. and Wong, M. (1979). Algorithm AS136: A k-means clustering algorithm. In Applied Statistics, 28, pp. 100-108. [doi:10.2307/2346830]
Lloyd, S. (1982). Least squares quantization in PCM. In IEEE Transactions on Information Theory, 28 (2), pp. 129–137. [doi:10.1109/tit.1982.1056489]
MacQueen, J. B. (1967). Some Methods for Classification and Analysis of MultiVariate Observations. In Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.