Unified Reader
seqair reads BAM, bgzf-compressed SAM, and CRAM files through a format-agnostic reader interface that auto-detects the input format and dispatches to the appropriate parser, while presenting a uniform API to the rest of the codebase.
Sources: The unified reader design is seqair-specific. Format magic bytes come from [SAM1] §4.2 (BAM: BAM\1), [SAM1] §4.1 (BGZF: 1f 8b), and [CRAM3] §6 (CRAM: CRAM). Sort order detection uses the SO tag from [SAM1] §1.3 "The header section" (@HD line). See References.
Goals
- Backward compatibility: existing code that calls IndexedBamReader::open() continues to work unchanged.
- Auto-detection: IndexedReader::open(path) inspects the file to determine BAM vs SAM vs CRAM and opens the appropriate reader.
- Uniform downstream API: all three formats populate the same RecordStore via fetch_into(). The pileup engine, call pipeline, and all downstream code are format-agnostic.
- Forkable: all reader types support fork() for thread-safe parallel processing with shared immutable state.
Format detection
IndexedReader::open(path) MUST auto-detect the file format by inspecting magic bytes:
- Bytes 1f 8b (gzip magic) → verify BGZF structure (extra field with BC subfield). If not BGZF, reject with error: "file is gzip-compressed but not BGZF; use bgzip instead of gzip." If BGZF, decompress the first block:
  - Starts with BAM\x01 → BAM format
  - Starts with @ (0x40) → bgzf-compressed SAM
  - Otherwise → reject with error naming supported formats
- Bytes 43 52 41 4d (CRAM) → CRAM format
- Byte @ (0x40) at position 0 → uncompressed SAM. Reject with error: "uncompressed SAM files cannot be indexed; compress with bgzip file.sam then index with samtools index file.sam.gz."
- Otherwise → return an error listing supported formats (BAM, bgzf-compressed SAM, CRAM)
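The decision tree above can be sketched as two small classifiers. This is only an illustration under stated assumptions: the names Magic, classify_magic, and classify_bgzf_payload are hypothetical and not part of seqair, and the BGZF block decompression between the two stages is left out because it needs a DEFLATE implementation that the standard library does not provide.

```rust
/// Outcome of inspecting the first bytes of a file (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum Magic {
    /// gzip magic `1f 8b` plus a BGZF `BC` extra subfield.
    Bgzf,
    /// gzip magic without the BGZF extra subfield.
    PlainGzip,
    /// `CRAM` magic bytes.
    Cram,
    /// Leading `@`: uncompressed SAM text.
    UncompressedSam,
    Unknown,
}

/// Returns true if the gzip header in `head` has FEXTRA set and its extra
/// field contains a `BC` subfield of length 2, as BGZF requires.
fn has_bgzf_extra(head: &[u8]) -> bool {
    // 10-byte fixed gzip header, then a 2-byte little-endian XLEN.
    if head.len() < 12 || head[2] != 8 || head[3] & 0x04 == 0 {
        return false;
    }
    let xlen = u16::from_le_bytes([head[10], head[11]]) as usize;
    let Some(extra) = head.get(12..12 + xlen) else {
        return false;
    };
    let mut i = 0;
    while i + 4 <= extra.len() {
        let slen = u16::from_le_bytes([extra[i + 2], extra[i + 3]]) as usize;
        if i + 4 + slen > extra.len() {
            break; // subfield overruns the extra field
        }
        if extra[i] == b'B' && extra[i + 1] == b'C' && slen == 2 {
            return true;
        }
        i += 4 + slen;
    }
    false
}

/// First stage: classify the raw leading bytes of the file.
pub fn classify_magic(head: &[u8]) -> Magic {
    if head.starts_with(&[0x1f, 0x8b]) {
        if has_bgzf_extra(head) {
            Magic::Bgzf
        } else {
            Magic::PlainGzip
        }
    } else if head.starts_with(b"CRAM") {
        Magic::Cram
    } else if head.first() == Some(&b'@') {
        Magic::UncompressedSam
    } else {
        Magic::Unknown
    }
}

/// Second stage, for BGZF input only: classify the decompressed first block.
/// `None` means the payload is neither BAM nor SAM and should be rejected.
pub fn classify_bgzf_payload(block: &[u8]) -> Option<&'static str> {
    if block.starts_with(b"BAM\x01") {
        Some("BAM")
    } else if block.first() == Some(&b'@') {
        Some("bgzf-compressed SAM")
    } else {
        None
    }
}
```

In this sketch PlainGzip, UncompressedSam, and Unknown would each map to one of the error messages listed above; only Bgzf and Cram lead to an opened reader.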
After format detection, the reader MUST locate the appropriate index file:
- BAM: .bai, .bam.bai, or .csi (CSI supports references > 512 Mbp that BAI cannot)
- SAM: .tbi or .csi (tabix or CSI index)
- CRAM: .crai (CRAM index)
If the format is detected but no matching index is found, the error MUST name the expected index extension and suggest the tool to create it (samtools index for BAM/CRAM, samtools index or tabix for SAM).
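A minimal sketch of the index lookup and its error message follows. The names Format, candidate_index_paths, missing_index_message, and find_index are hypothetical. The probing order, and the reading of ".bai" as replacing the file extension while ".bam.bai" appends to it, are assumptions; the text above only lists the accepted extensions.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Bam,
    Sam,
    Cram,
}

/// Appends `suffix` to the full file name, e.g. `a.bam` + `.bai` -> `a.bam.bai`.
fn append_suffix(path: &Path, suffix: &str) -> PathBuf {
    let mut s = path.as_os_str().to_owned();
    s.push(suffix);
    PathBuf::from(s)
}

/// Index paths to probe, in order (the order is an assumption of this sketch).
pub fn candidate_index_paths(path: &Path, format: Format) -> Vec<PathBuf> {
    match format {
        Format::Bam => vec![
            append_suffix(path, ".bai"), // a.bam.bai
            path.with_extension("bai"),  // a.bai
            append_suffix(path, ".csi"), // a.bam.csi
        ],
        Format::Sam => vec![append_suffix(path, ".tbi"), append_suffix(path, ".csi")],
        Format::Cram => vec![append_suffix(path, ".crai")],
    }
}

/// Error text naming the expected extension and the tool that creates it.
pub fn missing_index_message(path: &Path, format: Format) -> String {
    let (ext, tool) = match format {
        Format::Bam => (".bai or .csi", "samtools index"),
        Format::Sam => (".tbi or .csi", "samtools index or tabix"),
        Format::Cram => (".crai", "samtools index"),
    };
    format!(
        "no index found for {}: expected {ext}; create one with {tool}",
        path.display()
    )
}

/// Returns the first candidate that exists on disk, or the error message.
pub fn find_index(path: &Path, format: Format) -> Result<PathBuf, String> {
    candidate_index_paths(path, format)
        .into_iter()
        .find(|p| p.exists())
        .ok_or_else(|| missing_index_message(path, format))
}
```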
Reader enum
The unified reader MUST be an enum dispatching to format-specific readers, not a trait object. This avoids dynamic dispatch overhead in the hot path and keeps the type concrete for fork().
The unified reader MUST expose:
- header() -> &BamHeader: all formats produce the same header type (target names, lengths, tid lookup). The SAM and CRAM parsers convert their native header representations to BamHeader.
- fetch_into(tid, start, end, store) -> Result<usize>: populates a RecordStore with records overlapping the region, identically to the BAM path.
- fork() -> Result<Self>: creates a lightweight copy sharing immutable state.
- shared() -> &Arc<_>: access to shared state for Arc::ptr_eq testing.
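The enum-dispatch shape can be illustrated as follows. Everything here is a stand-in: the BamReader, SamReader, and CramReader stubs, the reduced BamHeader and RecordStore types, and the String error type are assumptions made so the sketch compiles on its own, and shared() is omitted for brevity. The point is only that each method is a plain match over concrete variants, with no trait object involved.

```rust
use std::sync::Arc;

// Hypothetical, heavily reduced stand-ins for the real seqair types.
pub struct BamHeader {
    pub target_names: Vec<String>,
}

#[derive(Default)]
pub struct RecordStore {
    pub positions: Vec<u64>,
}

// Stamps out a stub format-specific reader so the sketch stays short.
macro_rules! stub_reader {
    ($name:ident) => {
        pub struct $name {
            header: Arc<BamHeader>,
        }
        impl $name {
            fn header(&self) -> &BamHeader {
                &self.header
            }
            // Stub: pushes one placeholder position so dispatch is observable.
            fn fetch_into(
                &mut self,
                _tid: u32,
                start: u64,
                _end: u64,
                store: &mut RecordStore,
            ) -> Result<usize, String> {
                store.positions.push(start);
                Ok(1)
            }
            fn fork(&self) -> Result<Self, String> {
                Ok(Self { header: Arc::clone(&self.header) })
            }
        }
    };
}
stub_reader!(BamReader);
stub_reader!(SamReader);
stub_reader!(CramReader);

/// Concrete enum: every call is a static match, not a vtable lookup.
pub enum IndexedReader {
    Bam(BamReader),
    Sam(SamReader),
    Cram(CramReader),
}

impl IndexedReader {
    pub fn header(&self) -> &BamHeader {
        match self {
            Self::Bam(r) => r.header(),
            Self::Sam(r) => r.header(),
            Self::Cram(r) => r.header(),
        }
    }

    pub fn fetch_into(
        &mut self,
        tid: u32,
        start: u64,
        end: u64,
        store: &mut RecordStore,
    ) -> Result<usize, String> {
        match self {
            Self::Bam(r) => r.fetch_into(tid, start, end, store),
            Self::Sam(r) => r.fetch_into(tid, start, end, store),
            Self::Cram(r) => r.fetch_into(tid, start, end, store),
        }
    }

    /// The fork keeps the same variant, so the result type stays concrete.
    pub fn fork(&self) -> Result<Self, String> {
        Ok(match self {
            Self::Bam(r) => Self::Bam(r.fork()?),
            Self::Sam(r) => Self::Sam(r.fork()?),
            Self::Cram(r) => Self::Cram(r.fork()?),
        })
    }
}
```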
For a given BAM file and its SAM/CRAM representations of the same data, fetch_into for the same region MUST produce records with equivalent logical content: same positions, flags, sequences, qualities, and CIGAR operations. Aux tags MUST contain the same set of tag names and values, but tag ordering and integer type codes (e.g., c vs i for small values) MAY differ between formats because BAM writers choose specific integer widths that SAM text cannot preserve.
RecordStore integration
Currently RecordStore::push_raw() takes raw BAM bytes. For SAM and CRAM, records arrive as parsed fields rather than BAM binary.
The RecordStore MUST provide push_fields() (or equivalent) that accepts pre-parsed record fields:
- pos, end_pos, flags, mapq, seq_len, matching_bases, indel_bases (fixed fields)
- qname bytes
- CIGAR as packed BAM-format u32 ops (SAM/CRAM parsers convert to this representation)
- sequence as &[Base] (already the enum type stored in the bases slab, so no conversion is needed)
- quality bytes
- aux tag bytes in BAM binary format (SAM/CRAM parsers serialize tags to this format)
This avoids a wasteful round-trip through BAM binary encoding.
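As one example of the conversion a SAM parser performs before calling push_fields(), the sketch below packs a textual CIGAR into BAM-format u32 ops. The encoding (length in the upper 28 bits, op code in the lower 4, with MIDNSHP=X mapped to 0 through 8) is the one defined by the BAM specification; the function name pack_cigar and its String error type are hypothetical.

```rust
/// Packs a SAM CIGAR string such as "5S90M2D" into BAM-format u32 ops,
/// each encoded as `(len << 4) | op`. A CIGAR of "*" yields no ops.
pub fn pack_cigar(text: &str) -> Result<Vec<u32>, String> {
    if text == "*" {
        return Ok(Vec::new());
    }
    let mut ops = Vec::new();
    let mut len: u32 = 0;
    let mut have_digits = false;
    for b in text.bytes() {
        if b.is_ascii_digit() {
            // Accumulate the decimal length; BAM stores it in 28 bits.
            len = len
                .checked_mul(10)
                .and_then(|v| v.checked_add(u32::from(b - b'0')))
                .filter(|v| *v < (1 << 28))
                .ok_or_else(|| format!("CIGAR length too large in {text:?}"))?;
            have_digits = true;
        } else {
            // The op code is the index of the letter in "MIDNSHP=X".
            let code = b"MIDNSHP=X"
                .iter()
                .position(|&c| c == b)
                .ok_or_else(|| format!("unknown CIGAR op {:?}", b as char))?;
            if !have_digits {
                return Err(format!("CIGAR op without length in {text:?}"));
            }
            ops.push((len << 4) | code as u32);
            len = 0;
            have_digits = false;
        }
    }
    if have_digits {
        return Err(format!("trailing digits in CIGAR {text:?}"));
    }
    Ok(ops)
}
```

For instance, 5S packs to (5 << 4) | 4 = 84, because S has op code 4.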
push_fields() and push_raw() MUST produce identical SlimRecord fixed fields and identical name/bases/cigar/qual slab contents for the same logical record. Aux tag slab contents MAY differ in integer type codes (a SAM-derived i:42 may serialize as c while the original BAM used C), since the original BAM writer's type choice is not recoverable from SAM text. Tests MUST verify fixed-field and non-aux-slab equivalence by pushing the same record both ways.
Shared header type
SAM headers are text (the @HD, @SQ, @RG, @PG, @CO lines). CRAM containers embed SAM header text. BamHeader already stores header_text: String and parses @SQ lines for targets.
BamHeader MUST gain a from_sam_text(text: &str) constructor that parses SAM header text directly, without requiring BGZF/BAM framing. This is used by both the SAM reader (header lines at start of file) and the CRAM reader (SAM header block in file header container).
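A hedged sketch of what such a constructor might look like is shown below. The BamHeader here is a reduced stand-in: header_text comes from the description above, while the targets field layout, the tid() helper, and the String error type are assumptions. Only the SN and LN fields of @SQ lines are parsed; all other header lines are kept verbatim in header_text.

```rust
/// Reduced stand-in for the real header type (field layout is assumed).
#[derive(Debug, Default, PartialEq)]
pub struct BamHeader {
    pub header_text: String,
    /// (SN, LN) for each @SQ line, in file order; the index is the tid.
    pub targets: Vec<(String, u64)>,
}

impl BamHeader {
    /// Parses plain SAM header text, with no BGZF or BAM framing involved.
    pub fn from_sam_text(text: &str) -> Result<Self, String> {
        let mut targets = Vec::new();
        for line in text.lines().filter(|l| l.starts_with("@SQ\t")) {
            let mut name = None;
            let mut len = None;
            // Skip the "@SQ" record type, then scan TAG:VALUE fields.
            for field in line.split('\t').skip(1) {
                if let Some(v) = field.strip_prefix("SN:") {
                    name = Some(v.to_string());
                } else if let Some(v) = field.strip_prefix("LN:") {
                    let parsed = v
                        .parse::<u64>()
                        .map_err(|e| format!("bad LN in {line:?}: {e}"))?;
                    len = Some(parsed);
                }
            }
            match (name, len) {
                (Some(n), Some(l)) => targets.push((n, l)),
                _ => return Err(format!("@SQ line missing SN or LN: {line:?}")),
            }
        }
        Ok(Self { header_text: text.to_string(), targets })
    }

    /// Looks up the target id for a reference name.
    pub fn tid(&self, name: &str) -> Option<usize> {
        self.targets.iter().position(|(n, _)| n == name)
    }
}
```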
Sort order validation
Indexed random access assumes coordinate-sorted data. If the @HD header line contains SO:unsorted or SO:queryname, the reader MUST return an error explaining that indexed region queries require coordinate-sorted input. SO:coordinate or absent SO (common in older files) MUST be accepted.
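The rule can be sketched as a small check on the header text. The function name check_sort_order and the String error type are hypothetical. The text above does not say how other SO values (such as SO:unknown) are treated; this sketch accepts them, which is an assumption.

```rust
/// Rejects headers that declare SO:unsorted or SO:queryname.
/// SO:coordinate, a missing SO tag, and a missing @HD line are accepted.
pub fn check_sort_order(header_text: &str) -> Result<(), String> {
    // @HD, when present, is the first line of the header.
    let so = header_text
        .lines()
        .next()
        .filter(|l| l.starts_with("@HD"))
        .and_then(|l| l.split('\t').find_map(|f| f.strip_prefix("SO:")));
    match so {
        Some(v @ ("unsorted" | "queryname")) => Err(format!(
            "header declares SO:{v}, but indexed region queries require \
             coordinate-sorted input"
        )),
        // Assumption: any other value, including an absent tag, is accepted.
        _ => Ok(()),
    }
}
```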
Fork semantics per format
BAM fork: shares Arc<BamShared> (index + header), opens fresh File handle.
SAM fork: shares Arc holding parsed tabix/CSI index + header, opens fresh File handle for BGZF reading.
CRAM fork: shares Arc holding CRAI index + header, opens fresh File handle for container reading. Each fork MUST have its own FASTA reader (via fasta_reader.fork()) for thread-safe reference lookups. Reference caching (r[cram.perf.reference_caching]) MUST be per-fork, not shared.
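The pattern common to all three forks (clone the Arc, open a fresh File) can be sketched generically. Shared and ForkableReader are hypothetical names, and target_names merely stands in for the parsed header and index; the per-fork FASTA reader and reference cache required for CRAM are not modeled here.

```rust
use std::fs::File;
use std::io;
use std::path::PathBuf;
use std::sync::Arc;

/// Hypothetical immutable state shared by every fork of one reader.
pub struct Shared {
    pub path: PathBuf,
    /// Stands in for the parsed header and index.
    pub target_names: Vec<String>,
}

pub struct ForkableReader {
    shared: Arc<Shared>,
    /// Owned by this fork only, so seeks never interfere across threads.
    file: File,
}

impl ForkableReader {
    pub fn open(path: PathBuf, target_names: Vec<String>) -> io::Result<Self> {
        let file = File::open(&path)?;
        Ok(Self { shared: Arc::new(Shared { path, target_names }), file })
    }

    /// Cheap copy: bumps the Arc refcount and opens a new file handle.
    pub fn fork(&self) -> io::Result<Self> {
        Ok(Self {
            shared: Arc::clone(&self.shared),
            file: File::open(&self.shared.path)?,
        })
    }

    /// Exposed so callers can confirm sharing with Arc::ptr_eq.
    pub fn shared(&self) -> &Arc<Shared> {
        &self.shared
    }

    /// Size of the underlying file, read through this fork's own handle.
    pub fn file_len(&self) -> io::Result<u64> {
        Ok(self.file.metadata()?.len())
    }
}
```

A test for fork semantics could then assert Arc::ptr_eq(parent.shared(), child.shared()) to confirm that the index and header were not re-parsed.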
Readers: alignment + reference bundle
The Readers struct bundles an IndexedReader (alignment) with an IndexedFastaReader (reference) in a single type. This eliminates the need for separate FASTA path passing — CRAM can access the reference it needs for sequence reconstruction, and all formats have uniform open/fork semantics.
Readers::open(alignment_path, fasta_path) MUST auto-detect the alignment format (via r[unified.detect_format]), open the appropriate reader, and open the FASTA reader. For CRAM, the fasta_path is passed to IndexedCramReader::open() for sequence reconstruction. For BAM/SAM, the FASTA reader is opened but not used by the alignment reader.
Readers::fork() MUST fork both the alignment reader and the FASTA reader, returning a new Readers with independent I/O handles but shared immutable state (indices, headers). The CRAM fork gets its own FASTA reader via IndexedFastaReader::fork().
Readers MUST expose:
- header() -> &BamHeader: delegates to the alignment reader's header.
- fetch_into(tid, start, end, store) -> Result<usize>: delegates to the alignment reader.
- fasta() -> &IndexedFastaReader and fasta_mut() -> &mut IndexedFastaReader: direct access for callers that need reference sequences independently of the alignment reader (e.g., the call pipeline's segment fetching).
- alignment() -> &IndexedReader and alignment_mut() -> &mut IndexedReader: direct access when needed.
IndexedReader::open(path) MUST continue to work for BAM and SAM files without a FASTA path. CRAM detection in IndexedReader::open() MUST return an error explaining that CRAM requires a reference and suggesting Readers::open() instead. This preserves backward compatibility for code that only needs BAM/SAM.
API surface
See r[io.non_exhaustive_enums] and r[io.minimal_public_api] in general.md for the general rules. For this module, IndexedReader, FormatDetectionError, and ReaderError are the primary enums subject to r[io.non_exhaustive_enums].