GenTrain 2 is a new genotype clustering algorithm that debuted in GenomeStudio 2009.2, released in December 2009. A preliminary comparison shows it is about 5% faster than GenTrain 1 and has a slightly higher call rate; about 0.5% of calls on autosomes and just under 1% of calls on Chromosome X can be different. In brief, GenTrain 1 and GenTrain 2 differ from each other about as much as either does from some 3rd-party algorithms. (Any semi-decent algorithm does _exactly_ the same thing on well-behaved SNPs, so differences are always in small percentages of SNPs/call-rate/calls.)

Despite the name similarity, GenTrain 2 is not an update to GenTrain 1, but a different algorithm altogether. It uses HWE (Hardy-Weinberg equilibrium), hopefully without suffering the same problem as Illuminus, and is about 5% faster than GenTrain 1. Both GenTrain 1 and GenTrain 2 are 2x to 5x faster than Illuminus - 5x on Windows, 2x on Linux. The call rate of GenTrain 2 is slightly higher than that of GenTrain 1, and about 0.5% of calls can be different between GenTrain 2 and GenTrain 1.

So I have finished running GenTrain 2 on the 4000 550k samples of the 1958 birth cohort (plus everything else). On a Windows-based computing grid, it would have taken about one hour and 20 minutes; on a Linux-based grid, it takes 2.5 hours. I don't have either, so it took a while.

While testing GenTrain 2, I found a bug that was introduced into Mono in Jul 2009 (r138254): Bug 574597 - [Regression?] poor compression in mono 2.6 with System.IO.Compression.GZipStream. The fix was committed right away but didn't make it into Mono 2.6.1 (released a couple of weeks before the bug fix). So most people should upgrade to 2.6.3 (Mar 11 2010).

AFAIK, only MRC Edinburgh personnel have the expertise to run GenTrain 1 on thousands of samples properly (don't ask me how they do it); neither Sanger nor T1DGC does. It is also because of the existence of Mono that I have provided the GenTrain 2 genotypes.

The genotypes were encrypted with fairly strong encryption - please ask the relevant people (not me) for the decryption key files. David, Neil, and possibly other DIL personnel have the decryption keys for the GenTrain 1 and GenTrain 2 genotypes for the 2600 1958BC samples typed under the T1DGC study; Panos, Simon Porter and possibly other Sanger personnel have the decryption keys for the GenTrain 1 and GenTrain 2 genotypes for the 1400 1958BC samples typed at Sanger.

Call rate, etc. are no measure of the "goodness" of a clustering algorithm - I think one wants to look at *the worst* disagreements and, in those SNPs, at which algorithm is correct. (In that sense, Illuminus is very poor, since in most of the instances where it disagrees with GenTrain 1 or 2, it is wrong.) If anybody learns anything interesting about how GenTrain 2 behaves compared to GenTrain 1, feel free to share.
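
For anyone who wants to hunt for the worst disagreements, here is a minimal Python sketch of one way to rank SNPs by how often two call sets differ. It is not part of GenomeStudio or of anything I ran; the function names, the "NC" no-call marker and the dictionary layout (SNP name to a list of genotype calls, in the same sample order for both call sets) are all assumptions made up for illustration.

```python
# Hypothetical helper: rank SNPs by discordance between two genotype call sets.
# Assumed input layout: {snp_name: [call_for_sample_0, call_for_sample_1, ...]}
# with both call sets listing samples in the same order.

NO_CALL = "NC"  # assumed marker for a no-call; adjust to your export format


def snp_discordance(calls_a, calls_b):
    """Return {snp: (differing, compared)} for SNPs present in both call sets.

    Only samples called by BOTH algorithms are compared, so a difference in
    call rate alone does not count as a disagreement.
    """
    stats = {}
    for snp in calls_a.keys() & calls_b.keys():
        a, b = calls_a[snp], calls_b[snp]
        if len(a) != len(b):
            raise ValueError(f"sample count mismatch for {snp}")
        both = [(x, y) for x, y in zip(a, b) if x != NO_CALL and y != NO_CALL]
        differing = sum(1 for x, y in both if x != y)
        stats[snp] = (differing, len(both))
    return stats


def worst_snps(calls_a, calls_b, top=10):
    """Return the `top` SNPs with the highest fraction of differing calls."""
    stats = snp_discordance(calls_a, calls_b)

    def rate(item):
        differing, compared = item[1]
        return differing / compared if compared else 0.0

    # Highest discordance first; ties broken by SNP name for stable output.
    return sorted(stats.items(), key=lambda item: (-rate(item), item[0]))[:top]
```

The SNPs at the top of the list are the ones worth pulling up as cluster plots; the ranking only says where the two algorithms disagree, not which of them is right.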

Here is the decryption procedure:

For those who prefer a GUI, launch "Gnu Privacy Assistant" (GPA; the Windows version comes from the GPA for Windows download, the Linux version from your typical Linux installation, really) - see the GPA screenshot. Click the "import" button to import the decryption key, then load the file into the file-manager and click the "decrypt" button, then answer the single pop-up question.

For command-line, well, it is just:

gpg --import <key_file>
gpg <data_file>

then answer the one question it asks. (No option is needed - gpg auto-detects encrypted data and the default action is to decrypt - it can't be simpler than that.)
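
If there are many data files, the same two commands can be scripted. The following is only a sketch in Python: the helper names and the example file names are hypothetical, and it deliberately passes nothing to gpg beyond what the two commands above use, so gpg may still ask its question for each file.

```python
import subprocess


def build_gpg_commands(key_file, data_files):
    """Build the argument lists: one key import, then one plain gpg call per file."""
    commands = [["gpg", "--import", key_file]]
    commands += [["gpg", data_file] for data_file in data_files]
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first one that fails."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical file names; substitute the key and data files you were given.
    run_all(build_gpg_commands("decryption_key.asc", ["genotypes.gpg"]))
```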

The problem with Illuminus (and some other bioinformatics/biostatistics software) is that it was written to publish and to claim credit - such programs are *not* written to do a job well. After a lot of publicity, etc., they get abandoned with a list of known or unknown bugs, and promised features don't get implemented, because, well, they are just promises. I don't know which is the best, but

Hin-Tak Leung, last updated 2010-03-25

