MULTICLUST

Multinomial clustering of multi-allelic, multi-locus population genomic data to assign population structure.

Installation

MULTICLUST is written in the C programming language, and will therefore need to be compiled prior to execution. To install from source, download the GitHub repository to your machine and install via:

git clone https://github.com/arunsethuraman/multiclust/ cd multiclust make

This should create an executable named "multiclust" inside the folder.

Synopsis

multiclust -k < n > | -1 < n > -2 < n > [-a -b -B < n > -bou < d > -c -C < n > -d < s > -e < d > -E < d > -f < d > -g < d > -h
	-i < n > -I -m < n > --missing < n > -M < n > -n < n > -o < s > -p < n > --projection --plus -R -s < n > -t < n > -T < d > -u < s > -v -w < s > -x -f < s > --format < s >
	--simulate <qfile> <pfile> <ofile>

	where < n > stands for integer, < s > for string, < d > for double

Description

MULTICLUST clusters multivariate discrete data (here multi-locus genomic data, multi-allelic, polyploid) observed on a sample of individuals using the EM algorithm (both accelerated and unaccelerated). It handles data missing at random. It assumes loci within an individual are independent (i.e. in linkage equilibrium). It allows the admixture model, where each locus is independently drawn from a cluster, or the mixture model, where each individual (i.e. the entire genotype of an individual) is drawn from a cluster.

Options

-a	Choose admixture model (default: no).

-b, --bootstrap	Bootstrap test of H0: K=<k>-1 vs. Ha: K=<k>, where <k> is
	given by -k option.  Specify number of bootstraps as
	argument (default: 0).

--bound	Lower bound for allele and mixing/admixing proportions
	(default: 1.000000e-08).

-B	DEBUG ONLY: turns on opt->test_run

-c	Constrain mixing proportions identical across individuals
	(only enforced with -a; default: no).

-C	The maximum number of iterations to fit (default: 0).

-d	Directory where output files are written (default: ./).

-e	Allowable log likelihood relative error for convergence
	(default: 0.0e+00).

-E	Allowable log likelihood absolute error for convergence
	(default: 1.0e-04).

-f	Name of data file (STRUCTURE format).

--format
	Format of data output file (default: stru).
		stru	STRUCTURE format, the default.
		ped	Plink's ped format.

-g	Adjust step size at most this many times (default: 0)

-h	Display help.

-i	Initial iterations prior to acceleration (default: 0)

-I	Alleles are indices (no sorting, etc.) (default: no)

-I1	Alleles are indices plus 1 (no sorting, etc.) (default: no)

-k	The number of clusters to fit (default: 6).

-1	The minimum number of clusters to fit (default: 6).

-2	The maximum number of clusters to fit (default: 6).

-m	The number of Rand EM initializations, 0 to avoid Rand EM
	(default: 50).

--missing
	Integer value that indicates missing (default: -9).

-M	Parallelization option
	(default: no parallelization). 

-n	Number of initializations to run EM to convergence
	(default: 50).

-o	Option to create unique output file name

-p	The ploidy (default: 2).

--projection
	Turn off simplex projection (default: on).

--plus
	Plus one to alleles when writing data (default: on).

-r	Random number (default: 1234567).

-R	Data file in R format (default: no).

-s	The acceleration scheme (default: No acceleration).

	Acceleration options:

		0 (default) - no acceleration

		1 - SQUAREM version 1

		2 - SQUAREM version 2

		3 - SQUAREM version 3

		4 - Quasi Newton version 1 (1 secant condition)

		5 - Quasi Newton version 2 (2 secant conditions)

		6 - Quasi Newton version 3 (3 secant conditions)

--simulate <qfile> <pfile> [<ofile>]	Simulate data from admixture <qfile>, <pfile>, and write data to <ofile>.

-u	Iterate until beat target

	-u n #: repeat until reach same max # times (default: 0)

	-u l #: repeat until reach max log likelihood # (default: 0.000000)

-t	The time (in minutes) to maximize likelihood (default: 0).
	Be sure to check convergence if you set the above!

-v	Level of verbosity (default: 3)

-w	Write best fits to file (default: yes)

	Also used to time repeated runs, as follows:

	-w n < n >: repeat at least < n > times (default: 1)

	-w t < n >: repeat at least < n > minutes (default: 0)

	-w m < n >: repeat at most < n > minutes (default: 0)

-x	Use block relaxation algorithm (default: no).

Citation

MULTICLUST - Fast Multinomial Clustering of multiallelic genotypes to infer genetic population structure by Arun Sethuraman, Wei-Chen Chen, Margaret Wanjiku, Karin S. Dorman (forthcoming) Please contact Dr. Sethuraman ([email protected]) with questions, or post bug reports to this GitHub page.

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
00README		00README
LICENSE		LICENSE
README.md		README.md
accel_em.c		accel_em.c
array.h		array.h
bootstrap.c		bootstrap.c
cline.c		cline.c
cline.h		cline.h
em_alg.c		em_alg.c
log_likelihood.c		log_likelihood.c
makefile		makefile
message.c		message.c
message.h		message.h
multiclust.c		multiclust.c
multiclust.h		multiclust.h
read_file.c		read_file.c
rnd_init.c		rnd_init.c
simplex.c		simplex.c
write_file.c		write_file.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MULTICLUST

Installation

Synopsis

Description

Options

Citation

About

Releases

Packages

Contributors 3

Languages

License

arunsethuraman/multiclust

Folders and files

Latest commit

History

Repository files navigation

MULTICLUST

Installation

Synopsis

Description

Options

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages