Thanks to the Amgen staff for their swift co-operation and feedback on the data corruption fix. They integrated the patch within a week and bumped the version from 1.5.0 to 1.5.2 (r45365; presumably there was an internal, unreleased 1.5.1).
The problem (and other similar issues for "in-house" XML processing), as detailed below, does not affect Windows users, but does affect Linux users of R. Or, more precisely, it affects anyone trying to process FlowJo workspace files on a system whose native encoding is not iso8859-1/latin1/codepage 1252. So only English MS Windows users are unaffected: Linux and Solaris default to UTF-8, and R users on CJK Windows are affected too.
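The mismatch is easy to reproduce outside R. What follows is a minimal Python sketch, not the flowFlowJo code: the sample element and the helper names `is_valid_utf8` and `strip_non_ascii` are made up for illustration, and the byte-dropping helper is only a stand-in for the multibyte stripping discussed in the email below. It shows that latin1 bytes are not necessarily valid UTF-8, and that stripping the offending bytes loses data without raising any error.

```python
# Hypothetical sample: an attribute containing a non-ASCII character,
# encoded as iso8859-1 (latin1), as a FlowJo workspace file typically is.
latin1_bytes = '<Sample name="CD4 \u00b5-beads"/>'.encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def strip_non_ascii(data: bytes) -> bytes:
    """Illustrative stand-in for the stripping step: drop every byte >= 0x80."""
    return bytes(b for b in data if b < 0x80)


# The latin1 byte 0xB5 (micro sign) on its own is not a valid UTF-8 sequence.
print(is_valid_utf8(latin1_bytes))

# After stripping, the text is valid, but the character is silently gone.
stripped = strip_non_ascii(latin1_bytes)
print(stripped.decode("utf-8"))

# Decoding with the encoding the file actually uses keeps the data intact.
print(latin1_bytes.decode("latin-1"))
```

On a system where everything is latin1 the stripping step removes nothing, which is why the problem stays hidden there.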
The acknowledgements on slide 16 of the flowFlowJo FICCS presentation listed these people:
|Gary Means||Florian Hahne|
|Katie Newhall||Nolwenn Le Meur|
|Research Information Systems||Adam Triester|
|Sharon Wong-Madden||Becton Dickinson|
|Cheng Su||Perry Haaland|
|University of Cambridge|
Most of the Amgen/Becton Dickinson people are probably Windows users; the FHCRC people probably only reviewed the work and do not actually use it. "Vincent Plagnol" is primarily a Linux user and has also published on flow cytometry in the last year... I don't know about the others.
Isn't it fun to find out that a scientific "discovery" may equal carelessness and data corruption?
Duncan Temple Lang (author of RSXML) has posted the result of our discussion on-line (link here).
--- On Mon, 15/3/10, Hin-Tak Leung <htl10@...> wrote:
> From: Hin-Tak Leung <htl10@...>
> Subject: silent data corruption in flowFlowJo, and fix
> To: paboyoun@..., gosinkj@...
> Cc: bioconductor@...
> Date: Monday, 15 March, 2010, 2:20
> Hi,
>
> Commit r41352 from j.gosink broke flowFlowJo Bioc's nightly
> check for most of summer/autumn 2009 until just before BioC
> 2.5 code freeze, p.aboyoun committed r42419 which involves
> using iconv() to strip multibyte data to make the nightly
> check pass. Unfortunately it "fixes" some flowjo workspace
> files but breaks others. I finally find the time to look at
> it - it is actually fairly serious and causes silent data
> corruption and here is the fix - please review and commit.
>
> The underlying issue is this: FlowJo workspaces files are,
> in most(?all) cases, XML with iso8859-1 encoding (a.k.a.
> 'latin1'). With win32 R which defaults to codepage 1252 (a
> superset of latin1), R check passes - everything is in
> latin1 and the data stripping has no effort. On Linux and
> other "modern" unix systems, which defaults to UTF-8, R
> check fails - not all iso8859-1 text is valid UTF-8 text and
> vice versa, and also, the multibyte data strip causes data
> corruption.
>
> The proper fix is to query libxml2 about the xml encoding
> and set the encoding explicitly - it is a substantial
> rewrite. As a side-effect, the code possibly run faster as
> well - most of the gsub() don't not need to be 'g'. The
> regular expressions are only concerned with manipulating the
> header and only need to match the first instance.
>
> Cheers,
> Hin-Tak Leung
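
As a rough illustration of the approach described in the email, here is a small Python sketch. It is not the actual patch, which is in R and asks libxml2 for the encoding; here a regular expression on the XML declaration stands in for that query. The sample content and the names `DECL` and `declared_encoding` are hypothetical. It reads the declared encoding, decodes the bytes explicitly with it, and rewrites the header with a single first-match substitution rather than a global one.

```python
import re

# Hypothetical workspace content, stored as iso8859-1 bytes.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<Workspace><Sample name="caf\u00e9"/></Workspace>'
).encode("latin-1")

# Match the encoding attribute of the XML declaration at the start of the data.
DECL = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default if absent."""
    m = DECL.match(data)
    return m.group(1).decode("ascii") if m else default


# Decode explicitly with the declared encoding instead of the locale's default.
enc = declared_encoding(raw)
text = raw.decode(enc)

# The header occurs only once, so one substitution (count=1) is enough;
# there is no need to scan the whole document for further matches.
header_fixed = re.sub(r'encoding="[^"]*"', 'encoding="UTF-8"', text, count=1)
print(header_fixed)
```

The `count=1` argument plays the role that `sub()` plays relative to `gsub()` in R: it stops after the first match.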
Hin-Tak Leung, last updated 2010-03-25