Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require sequence dictionary MD5s when writing CRAM. #1612

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -89,6 +89,7 @@ public CRAMContainerStreamWriter(
this.cramIndexer = indexer;
this.outputStreamIdentifier = outputIdentifier;
this.containerFactory = new ContainerFactory(samFileHeader, encodingStrategy, referenceSource);
samFileHeader.getSequenceDictionary().requireMD5sOrThrow(outputIdentifier);
}

/**
Expand Down
41 changes: 39 additions & 2 deletions src/main/java/htsjdk/samtools/SAMFileWriterFactory.java
Original file line number Diff line number Diff line change
Expand Up @@ -23,9 +23,10 @@
*/
package htsjdk.samtools;

import htsjdk.io.HtsPath;
import htsjdk.io.IOPath;
import htsjdk.samtools.cram.ref.CRAMReferenceSource;
import htsjdk.samtools.cram.ref.ReferenceSource;
import htsjdk.samtools.cram.structure.CRAMEncodingStrategy;
import htsjdk.samtools.util.BlockCompressedOutputStream;
import htsjdk.samtools.util.FileExtensions;
import htsjdk.samtools.util.IOUtil;
Expand All @@ -36,9 +37,11 @@

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.Deflater;
import static htsjdk.samtools.SamReader.Type.*;

Expand Down Expand Up @@ -634,6 +637,7 @@ private CRAMFileWriter createCRAMWriterWithSettings(
final Path referenceFasta) {

final CRAMReferenceSource referenceSource;
SAMFileHeader repairedHeader = header;
if (referenceFasta == null) {
log.info("Reference fasta is not provided when writing CRAM file " + outputFile.toUri().toString());
log.info("Will attempt to use a default reference or download as set by defaults:");
Expand All @@ -642,6 +646,10 @@ private CRAMFileWriter createCRAMWriterWithSettings(

referenceSource = ReferenceSource.getDefaultCRAMReferenceSource();
} else {
// CRAMs are required to have a SAMHeader with a sequence dictionary that contains MD5
// checksums for each sequence. Check the proposed dictionary to see if it has MD5s, and
// if not, attempt to update it with MD5s from the reference dictionary, if one exists.
repairedHeader = repairSequenceMD5s(header, new HtsPath(referenceFasta.toUri().toString()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An aside, should we add direct HtsPath(Path) HtsPath(File) constructors?

referenceSource = new ReferenceSource(referenceFasta);
}
OutputStream cramOS = null;
Expand Down Expand Up @@ -674,7 +682,7 @@ private CRAMFileWriter createCRAMWriterWithSettings(
indexOS,
presorted,
referenceSource,
header,
repairedHeader,
outputFile.toUri().toString());
setCRAMWriterDefaults(writer);

Expand All @@ -687,6 +695,35 @@ private void setCRAMWriterDefaults(final CRAMFileWriter writer) {
//writer.setEncodingParams(new CRAMEncodingStrategy());
}

// If any sequence records don't contain MD5s, attempt to repair them by consulting the reference's
// dictionary file, if one exists.
private SAMFileHeader repairSequenceMD5s(final SAMFileHeader header, final IOPath referenceFasta) {
final List<SAMSequenceRecord> missingMD5s = header.getSequenceDictionary().getSequencesWithMissingMD5s();
if (!missingMD5s.isEmpty()) {
final String missingMD5Message = SAMSequenceDictionary.createFormattedMD5Message("SAM header", missingMD5s);
log.warn(String.format(
"%s Attempting to use the reference dictionary to repair missing MD5s required for CRAM output.",
missingMD5Message));
final IOPath referenceDictionary = SAMSequenceDictionary.getFastaDictionaryFileName(referenceFasta);
try (final InputStream fastaDictionaryStream = referenceDictionary.getInputStream()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems really awkward:

  1. We already have SAMSequenceDictionaryExtractor and this reimplements what it does although it might be more efficient.
  2. Immediately after this method returns we open a ReferenceSource which owns a ReferenceSequenceFile which can be queried directly for a sequence dictionary.
  3. This is IOPath aware, which is good, but it will fail in any complex cases in the future like using RefGt where there's no simple way to go from a path name to a dictionary file.
  4. I think the reference fasta can be null in the case the program wants to use the default one.

I think we should eliminate these lines and the new methods on SAMSequenceDictionary and instead add an accessor for the sequence dictionary to ReferenceSource. Then do the repair call after opening the reference source and get the sequence dictionary directly from the ReferenceSource.

final SAMSequenceDictionary newDictionary =
SAMSequenceDictionary.loadSAMSequenceDictionary(fastaDictionaryStream);
newDictionary.requireMD5sOrThrow(referenceFasta.toString());
final SAMFileHeader repairedHeader = header.clone();
repairedHeader.setSequenceDictionary(newDictionary);
return repairedHeader;
} catch (final IOException | RuntimeException e) {
throw new RuntimeException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a CRAM or SAMException instead?

String.format(
"The attempt to repair the missing sequence MD5s (%s) using the reference dictionary %s failed.",
missingMD5Message,
referenceDictionary),
e);
}
}
return header;
}

@Override
public String toString() {
return "SAMFileWriterFactory [createIndex=" + createIndex + ", createMd5File=" + createMd5File + ", useAsyncIo="
Expand Down
100 changes: 100 additions & 0 deletions src/main/java/htsjdk/samtools/SAMSequenceDictionary.java
Original file line number Diff line number Diff line change
Expand Up @@ -24,8 +24,15 @@
package htsjdk.samtools;

import htsjdk.beta.plugin.HtsHeader;
import htsjdk.io.HtsPath;
import htsjdk.io.IOPath;
import htsjdk.samtools.util.BufferedLineReader;
import htsjdk.samtools.util.FileExtensions;
import htsjdk.samtools.util.Log;
import htsjdk.samtools.util.RuntimeIOException;

import java.io.IOException;
import java.io.InputStream;
import java.io.Serializable;
import java.math.BigInteger;
import java.security.MessageDigest;
Expand Down Expand Up @@ -294,6 +301,99 @@ public String md5() {
}
}


/**
* Given a fasta filename, return the name of the corresponding dictionary file.
*/
public static IOPath getFastaDictionaryFileName(IOPath fastaFile) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be adding more filename guessing and munging here. I think we can rely on ReferenceSequenceFile to find and provide the SequenceDictionary.

final String fastaName = fastaFile.getURIString();
int lastDot = fastaName.lastIndexOf('.');
return new HtsPath(fastaName.substring(0, lastDot) + FileExtensions.DICT);
}

/**
* Given a fasta dictionary file, returns its sequence dictionary
*
* @param fastaDictionary fasta dictionary file
* @return the SAMSequenceDictionary from fastaDictionaryFile
*/
public static SAMSequenceDictionary loadSAMSequenceDictionary( final IOPath fastaDictionary ) {
try ( final InputStream fastaDictionaryStream = fastaDictionary.getInputStream() ) {
return loadSAMSequenceDictionary(fastaDictionaryStream);
}
catch ( IOException e ) {
throw new RuntimeIOException("Error loading fasta dictionary file " + fastaDictionary, e);
}
}

/**
* Given an InputStream connected to a fasta dictionary, returns its sequence dictionary
*
* Note: does not close the InputStream it's passed
*
* @param fastaDictionaryStream InputStream connected to a fasta dictionary
* @return the SAMSequenceDictionary from the fastaDictionaryStream
*/
public static SAMSequenceDictionary loadSAMSequenceDictionary( final InputStream fastaDictionaryStream ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a good thing to have here. Can you update SAMSequenceDictionaryExtractor to call this method?

// Don't close the reader when we're done, since we don't want to close the client's InputStream for them
final BufferedLineReader reader = new BufferedLineReader(fastaDictionaryStream);

final SAMTextHeaderCodec codec = new SAMTextHeaderCodec();
final SAMFileHeader header = codec.decode(reader, fastaDictionaryStream.toString());

// Make sure we have a valid sequence dictionary before continuing:
if (header.getSequenceDictionary() == null || header.getSequenceDictionary().isEmpty()) {
throw new RuntimeException (
"Could not read sequence dictionary from given fasta stream " +
fastaDictionaryStream
);
}

return header.getSequenceDictionary();
}

final void requireMD5sOrThrow(final String contextMessage) {
final List<SAMSequenceRecord> missingMD5s = getSequencesWithMissingMD5s();
if (!missingMD5s.isEmpty()) {
throw new RuntimeException(SAMSequenceDictionary.createFormattedMD5Message(contextMessage, missingMD5s));
}
}

/**
* @return all sequences in this dictionary that do not have an MD5 attribute
*/
List<SAMSequenceRecord> getSequencesWithMissingMD5s() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem like very specific methods. It's a bit funny to have them here as package private. I wonder if it makes more sense to move them to the callsite.

return getSequences().stream().filter(
seqRec -> {
final String MD5 = seqRec.getMd5();
return MD5 == null || MD5.length() == 0;
}).collect(Collectors.toList());
}

/**
* Create a formatted error string message describing which MD5s are missing from the sequence dictionary.
*
* @param contextID a string recognizable to the user describing the input that is the source of the failure
* @param badSequenceRecords the sequence with missing MD5s
* @return a message string suitable for presentation to the user
*/
static String createFormattedMD5Message(final String contextID, List<SAMSequenceRecord> badSequenceRecords) {
final int MAX_ERRORS_REPORTED = 10;
if (badSequenceRecords.size() != 0) {
return String.format(
"The sequence dictionary for %s is missing the required MD5 checksum for some contigs: %s%s.",
contextID,
badSequenceRecords.stream()
.limit(Integer.min(badSequenceRecords.size(), MAX_ERRORS_REPORTED))
.map(SAMSequenceRecord::getSequenceName)
.collect(Collectors.joining(",")),
badSequenceRecords.size() > MAX_ERRORS_REPORTED ?
" and others":
"");
}
return null;
}

@Override
public int hashCode() {
return mSequences.hashCode();
Expand Down
23 changes: 20 additions & 3 deletions src/main/java/htsjdk/samtools/SAMSequenceRecord.java
Original file line number Diff line number Diff line change
Expand Up @@ -24,16 +24,16 @@
package htsjdk.samtools;

import htsjdk.samtools.util.Locatable;
import htsjdk.samtools.util.StringUtil;
import htsjdk.samtools.util.SequenceUtil;

import java.math.BigInteger;
import java.util.ArrayList;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
Expand Down Expand Up @@ -148,6 +148,23 @@ public SAMSequenceRecord setMd5(final String value) {
return this;
}

/**
* Set the M5 attribute for this sequence using the provided sequenceBases.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: M5 -> MD5

* @param sequenceBases sequence bases to use to compute the MD5 for this sequence
* @return the update sequence record
*/
public SAMSequenceRecord setComputedMd5(final byte[] sequenceBases) {
try {
final MessageDigest md5Digester = MessageDigest.getInstance("MD5");
if (md5Digester != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems broken. It doesn't actually digest the bases. Also, the bases should be normalized.

Why not use SequenceUtil.calculateMD5String()?

setMd5(SequenceUtil.md5DigestToString(md5Digester.digest()));
}
} catch (final NoSuchAlgorithmException e) {
throw new RuntimeException("Error getting MD5 algorithm for sequence record MD5 computation", e);
}
return this;
}

public String getDescription() {
return getAttribute(DESCRIPTION_TAG);
}
Expand Down
5 changes: 0 additions & 5 deletions src/test/java/htsjdk/samtools/CRAMBAIIndexerTest.java
Original file line number Diff line number Diff line change
Expand Up @@ -140,11 +140,6 @@ private void testMultipleContainerStream() throws IOException {
final int refId1 = 0;
final int refId2 = 1;

// for each ref, we alternate unmapped-placed with mapped

final int expectedMapped = 1;
final int expectedUnmappedPlaced = 2;

try (final ByteArrayOutputStream contentStream = new ByteArrayOutputStream();
final ByteArrayOutputStream indexStream = new ByteArrayOutputStream()) {
final CRAMContainerStreamWriter cramContainerStreamWriter = new CRAMContainerStreamWriter(
Expand Down
4 changes: 2 additions & 2 deletions src/test/java/htsjdk/samtools/CRAMCRAIIndexerTest.java
Original file line number Diff line number Diff line change
Expand Up @@ -43,13 +43,13 @@ private void testCRAIIndexer(Index index) throws IOException {

@Test
public void testMultiRefContainer() throws IOException {
final SAMFileHeader samFileHeader = new SAMFileHeader();
SAMFileHeader samFileHeader = new SAMFileHeader();
samFileHeader.setSortOrder(SAMFileHeader.SortOrder.coordinate);

samFileHeader.addSequence(new SAMSequenceRecord("1", 10));
samFileHeader.addSequence(new SAMSequenceRecord("2", 10));
samFileHeader.addSequence(new SAMSequenceRecord("3", 10));

samFileHeader = CRAMTestUtils.addFakeSequenceMD5s(samFileHeader);
final ReferenceSource source = new ReferenceSource(new FakeReferenceSequenceFile(samFileHeader.getSequenceDictionary().getSequences()));

byte[] cramBytes;
Expand Down
6 changes: 3 additions & 3 deletions src/test/java/htsjdk/samtools/CRAMComplianceTest.java
Original file line number Diff line number Diff line change
Expand Up @@ -179,7 +179,7 @@ private void doComplianceTest(
// write them to a cram stream
final ByteArrayOutputStream baos = new ByteArrayOutputStream();
final ReferenceSource source = new ReferenceSource(t.refFile);
final CRAMFileWriter cramFileWriter = new CRAMFileWriter(baos, source, samFileHeader, name);
final CRAMFileWriter cramFileWriter = new CRAMFileWriter(baos, source, CRAMTestUtils.addFakeSequenceMD5s(samFileHeader), name);
for (SAMRecord samRecord : samRecords) {
cramFileWriter.addAlignment(samRecord);
}
Expand Down Expand Up @@ -331,7 +331,7 @@ public void testBAMThroughCRAMRoundTrip(final String testFileName, final String
final File tempCRAMFile = File.createTempFile("testBAMThroughCRAMRoundTrip", FileExtensions.CRAM);
tempCRAMFile.deleteOnExit();
SAMFileHeader samHeader = getFileHeader(originalBAMInputFile, referenceFile);
writeRecordsToFile(originalBAMRecords, tempCRAMFile, referenceFile, samHeader);
writeRecordsToFile(originalBAMRecords, tempCRAMFile, referenceFile, CRAMTestUtils.addFakeSequenceMD5s(samHeader));

// read the CRAM records back in and compare to the original BAM records
List<SAMRecord> cramRecords = getSAMRecordsFromFile(tempCRAMFile, referenceFile);
Expand Down Expand Up @@ -359,7 +359,7 @@ public void testBAMThroughCRAMRoundTripViaPath() throws IOException {
try (FileSystem jimfs = Jimfs.newFileSystem(Configuration.unix())) {
final Path tempCRAM = jimfs.getPath("testBAMThroughCRAMRoundTrip" + FileExtensions.CRAM);
SAMFileHeader samHeader = getFileHeader(originalBAMInputFile, referenceFile);
writeRecordsToPath(originalBAMRecords, tempCRAM, referenceFile, samHeader);
writeRecordsToPath(originalBAMRecords, tempCRAM, referenceFile, CRAMTestUtils.addFakeSequenceMD5s(samHeader));

// read the CRAM records back in and compare to the original BAM records
List<SAMRecord> cramRecords = getSAMRecordsFromPath(tempCRAM, referenceFile);
Expand Down
19 changes: 16 additions & 3 deletions src/test/java/htsjdk/samtools/CRAMContainerStreamWriterTest.java
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ private ReferenceSource createReferenceSource() {
}

private void doTest(final List<SAMRecord> samRecords, final ByteArrayOutputStream outStream, final OutputStream indexStream) {
final SAMFileHeader header = createSAMHeader(SAMFileHeader.SortOrder.coordinate);
final SAMFileHeader header = CRAMTestUtils.addFakeSequenceMD5s(createSAMHeader(SAMFileHeader.SortOrder.coordinate));
final ReferenceSource refSource = createReferenceSource();

final CRAMContainerStreamWriter containerStream = new CRAMContainerStreamWriter(outStream, indexStream, refSource, header, "test");
Expand Down Expand Up @@ -102,7 +102,7 @@ public void testCRAMContainerStreamNoIndex() {

@Test(description = "Test CRAMContainerStream aggregating multiple partitions")
public void testCRAMContainerAggregatePartitions() throws IOException {
final SAMFileHeader header = createSAMHeader(SAMFileHeader.SortOrder.coordinate);
final SAMFileHeader header = CRAMTestUtils.addFakeSequenceMD5s(createSAMHeader(SAMFileHeader.SortOrder.coordinate));
final ReferenceSource refSource = createReferenceSource();

// create a bunch of records and write them out to separate streams in groups
Expand Down Expand Up @@ -161,7 +161,7 @@ public void testCRAMContainerStreamWithBaiIndex() throws IOException {
@Test(description = "Test CRAMContainerStream with crai index")
public void testCRAMContainerStreamWithCraiIndex() throws IOException {
final List<SAMRecord> samRecords = createRecords(100);
final SAMFileHeader header = createSAMHeader(SAMFileHeader.SortOrder.coordinate);
final SAMFileHeader header = CRAMTestUtils.addFakeSequenceMD5s(createSAMHeader(SAMFileHeader.SortOrder.coordinate));
try (ByteArrayOutputStream outStream = new ByteArrayOutputStream();
ByteArrayOutputStream indexStream = new ByteArrayOutputStream()) {
doTestWithIndexer(samRecords, outStream, header, new CRAMCRAIIndexer(indexStream, header));
Expand All @@ -171,6 +171,19 @@ public void testCRAMContainerStreamWithCraiIndex() throws IOException {
}
}

@Test(expectedExceptions=RuntimeException.class)
public void testRejectDictionaryWithMissingMD5s() {
final SAMFileHeader samHeader = createSAMHeader(SAMFileHeader.SortOrder.coordinate);
final ByteArrayOutputStream outStream = new ByteArrayOutputStream();
final ReferenceSource refSource = createReferenceSource();
try {
new CRAMContainerStreamWriter(outStream, null, refSource, samHeader, "test");
} catch (final RuntimeException e) {
Assert.assertTrue(e.getMessage().contains("missing the required MD5 checksum"));
throw e;
}
}

private void checkCRAMContainerStream(ByteArrayOutputStream outStream, ByteArrayOutputStream indexStream, String indexExtension) throws IOException {
// write the file out
final File cramTempFile = File.createTempFile("cramContainerStreamTest", ".cram");
Expand Down
Loading