Unified Reader
seqair reads BAM, bgzf-compressed SAM, and CRAM files through a format-agnostic reader interface that auto-detects the input format and dispatches to the appropriate parser, while presenting a uniform API to the rest of the codebase.
Sources: The unified reader design is seqair-specific. Format magic bytes come from [SAM1] §4.2 (BAM:
BAM\1), [SAM1] §4.1 (BGZF:1f 8b), and [CRAM3] §6 (CRAM:CRAM). Sort order detection uses theSOtag from [SAM1] §1.3 "The header section" (@HDline). See References.
Goals
- Backward compatibility: existing code that calls
IndexedBamReader::open()continues to work unchanged. - Auto-detection:
IndexedReader::open(path)inspects the file to determine BAM vs SAM vs CRAM and opens the appropriate reader. - Uniform downstream API: all three formats populate the same
RecordStoreviafetch_into(). The pileup engine, call pipeline, and all downstream code are format-agnostic. - Forkable: all reader types support
fork()for thread-safe parallel processing with shared immutable state.
Format detection
IndexedReader::open(path) MUST auto-detect the file format by inspecting magic bytes:
- Bytes
1f 8b(gzip magic) → verify BGZF structure (extra field withBCsubfield). If not BGZF, reject with error: "file is gzip-compressed but not BGZF; usebgzipinstead ofgzip." If BGZF, decompress the first block: - Starts with
BAM\x01→ BAM format - Starts with
@(0x40) → bgzf-compressed SAM - Otherwise → reject with error naming supported formats
- Bytes
43 52 41 4d(CRAM) → CRAM format - Byte
@(0x40) at position 0 → uncompressed SAM. Reject with error: "uncompressed SAM files cannot be indexed; compress withbgzip file.samthen index withsamtools index file.sam.gz." - Otherwise → return an error listing supported formats (BAM, bgzf-compressed SAM, CRAM)
After format detection, the reader MUST locate the appropriate index file:
- BAM:
.bai,.bam.bai, or.csi(CSI supports references > 512 Mbp that BAI cannot) - SAM:
.tbior.csi(tabix or CSI index) - CRAM:
.crai(CRAM index)
If the format is detected but no matching index is found, the error MUST name the expected index extension and suggest the tool to create it (samtools index for BAM/CRAM, samtools index or tabix for SAM).
Reader enum
The unified reader MUST be an enum dispatching to format-specific readers, not a trait object. This avoids dynamic dispatch overhead in the hot path and keeps the type concrete for fork().
The unified reader MUST expose:
header() -> &BamHeader— all formats produce the same header type (target names, lengths, tid lookup). The SAM and CRAM parsers convert their native header representations toBamHeader.fetch_into(tid, start, end, store) -> Result<usize>— populates aRecordStorewith records overlapping the region, identically to the BAM path.fork() -> Result<Self>— creates a lightweight copy sharing immutable state.shared() -> &Arc<_>— access to shared state forArc::ptr_eqtesting.
For a given BAM file and its SAM/CRAM representations of the same data, fetch_into for the same region MUST produce records with equivalent logical content: same positions, flags, sequences, qualities, and CIGAR operations. Aux tags MUST contain the same set of tag names and values, but tag ordering and integer type codes (e.g., c vs i for small values) MAY differ between formats because BAM writers choose specific integer widths that SAM text cannot preserve.
RecordStore integration
Currently RecordStore::push_raw() takes raw BAM bytes. For SAM and CRAM, records arrive as parsed fields rather than BAM binary.
The RecordStore MUST provide push_fields() (or equivalent) that accepts pre-parsed record fields:
- pos, end_pos, flags, mapq, seq_len, matching_bases, indel_bases (fixed fields)
- qname bytes
- CIGAR as packed BAM-format u32 ops (SAM/CRAM parsers convert to this representation)
- sequence as
&[Base](already the enum type stored in the bases slab — no conversion needed) - quality bytes
- aux tag bytes in BAM binary format (SAM/CRAM parsers serialize tags to this format)
This avoids a wasteful round-trip through BAM binary encoding.
push_fields() and push_raw() MUST produce identical SlimRecord fixed fields and identical name/bases/cigar/qual slab contents for the same logical record. Aux tag slab contents MAY differ in integer type codes (a SAM-derived i:42 may serialize as c while the original BAM used C), since the original BAM writer's type choice is not recoverable from SAM text. Tests MUST verify fixed-field and non-aux-slab equivalence by pushing the same record both ways.
Shared header type
SAM headers are text (the @HD, @SQ, @RG, @PG, @CO lines). CRAM containers embed SAM header text. BamHeader already stores header_text: String and parses @SQ lines for targets.
BamHeader MUST gain a from_sam_text(text: &str) constructor that parses SAM header text directly, without requiring BGZF/BAM framing. This is used by both the SAM reader (header lines at start of file) and the CRAM reader (SAM header block in file header container).
Sort order validation
Indexed random access assumes coordinate-sorted data. If the @HD header line contains SO:unsorted or SO:queryname, the reader MUST return an error explaining that indexed region queries require coordinate-sorted input. SO:coordinate or absent SO (common in older files) MUST be accepted.
Fork semantics per format
BAM fork: shares Arc<BamShared> (index + header), opens fresh File handle.
SAM fork: shares Arc holding parsed tabix/CSI index + header, opens fresh File handle for BGZF reading.
CRAM fork: shares Arc holding CRAI index + header, opens fresh File handle for container reading. Each fork MUST have its own FASTA reader (via fasta_reader.fork()) for thread-safe reference lookups. Reference caching (r[cram.perf.reference_caching]) MUST be per-fork, not shared.
Readers: alignment + reference bundle
The Readers struct bundles an IndexedReader (alignment) with an IndexedFastaReader (reference) in a single type. This eliminates the need for separate FASTA path passing — CRAM can access the reference it needs for sequence reconstruction, and all formats have uniform open/fork semantics. Readers is parameterized on E: CustomizeRecordStore = () so callers can attach a per-record customize value at open time (see r[record_store.customize.trait]); the customize value's keep_record runs at fetch time and compute runs by pileup() after every fetch.
Readers::open(alignment_path, fasta_path) MUST auto-detect the alignment format (via r[unified.detect_format]), open the appropriate reader, and open the FASTA reader. For CRAM, the fasta_path is passed to IndexedCramReader::open() for sequence reconstruction. For BAM/SAM, the FASTA reader is opened but not used by the alignment reader. Readers::open is only available when E = (); for non-trivial customizers use open_customized.
Readers::<E>::open_customized(alignment_path, fasta_path, customize) MUST open the readers the same way as Readers::open but carry the user-supplied customize value along for pileup() to invoke on each fetched region. The customize value is stored by ownership inside the Readers struct.
Readers::fork() MUST fork both the alignment reader and the FASTA reader, returning a new Readers with independent I/O handles but shared immutable state (indices, headers). The CRAM fork gets its own FASTA reader via IndexedFastaReader::fork(). When E ≠ (), fork() MUST clone the customize value into the new Readers (the Clone bound on CustomizeRecordStore guarantees this is cheap).
A validated Tid(u32) newtype MUST wrap BAM target ids so downstream code cannot pass an unvalidated u32. Construction is only through the ResolveTid::resolve_tid(header) method, which MUST be implemented for u32 (range-check), &str/String/SmolStr (by-name lookup), and Tid (passthrough). ResolveTid::resolve_tid MUST return a typed TidError on failure (unknown contig name or out-of-range u32).
Segments and pileup
The pileup pipeline is intentionally split into a planning step (segments())
and an execution step (pileup()). The planning step produces a Segment —
a small, pre-resolved tile of the genome. The only way to drive a pileup()
is to hand it a Segment, which makes it impossible to silently fetch a whole
chromosome in one call without thinking about tile size.
The crate MUST expose a Segment type representing one bounded tile of a
genomic region. Segment MUST:
- Be constructible only by
Readers::segments(i.e. no public constructor). - Carry a validated
Tid, the contig name (SmolStr), an inclusive[start, end]range asPos0, anoverlap_startandoverlap_end(u32, in bases), and the contig's last valid 0-based position (contig_last_pos: Pos0). - Implement
Clone + Debug + PartialEq + Eq + Hashand beSend. It MUST NOT carry FASTA bases or any per-fetch buffer (so segments can be cheaply collected, shipped between threads, or stored). - Expose accessors:
tid() -> Tid,contig() -> &SmolStr,start() -> Pos0,end() -> Pos0(inclusive),len() -> u32(=end - start + 1),overlap_start() -> u32,overlap_end() -> u32,contig_last_pos() -> Pos0,starts_at_contig_start() -> bool(true iffstart == Pos0::ZERO),ends_at_contig_end() -> bool(true iffend == contig_last_pos). The "contig" predicates check the contig boundary, not the requested target range — for a sub-range query the user's first/last segments do not necessarily satisfy them.
Overlap fields on Segment MUST be expressed as a number of bases shared
with neighboring segments and default to 0 when no overlap is requested.
Overlap rules apply to the requested target range, not the contig:
if a caller asks for ("chr1", 1000, 2000) the iterator MUST NOT extend a
segment to position 900 just because there are bases there — the user did
not ask for them.
- The first segment of every requested target range MUST have
overlap_start == 0. - The last segment of every requested target range MUST have
overlap_end == 0. - Internal segments (segments where the previous/next core is fully inside
the requested range) MUST carry
overlap_start == overlap_end == requested_overlap. - When a tile's expanded right edge would extend past the requested
range's
end, the actualoverlap_endMUST equal the clamped distancetile_end - core_end, which is< requested_overlap. Same on the left foroverlap_start.
Segment MUST expose core_range() -> RangeInclusive<Pos0> returning the
inclusive sub-range of [start, end] excluding both overlaps — this is
the part of the segment a downstream tool should treat as "owned" by this
tile when deduplicating against neighbors.
SegmentOptions MUST expose:
SegmentOptions::new(max_len: NonZeroU32) -> Self— setsoverlap = 0.with_overlap(self, overlap: u32) -> Result<Self, SegmentOptionsError>— returnsErr(SegmentOptionsError::OverlapTooLarge { max_len, overlap })whenoverlap >= max_len.get()(would produce zero forward progress).max_len() -> NonZeroU32,overlap() -> u32accessors.
SegmentOptions MUST NOT implement Default — every caller is required to
commit to a max_len, since the absence of a sensible universal default is
the whole point of forcing a planning step.
IntoSegmentTarget MUST be a sealed trait (external crates cannot
implement it) that resolves a user-supplied target against the header
into one or more contig ranges. Implementations MUST be provided for at
least:
&str/String/SmolStr/Tid/u32— entire contig, half-open[0, contig_len)mapped to inclusive[Pos0::ZERO, contig_last_pos].&RegionString— parsedchrom:start-end, with missing start defaulting to position 1 (0-based 0) and missing end to the contig's last base.(R, Pos0, Pos0)whereR: ResolveTid— explicit inclusive range[start, end]against the resolved tid.()— every contig in header order whose length is non-zero.
An empty contig (length 0) MUST return ReaderError::EmptyContig for
single-target conversions and MUST be silently skipped for () (whole-genome).
Resolution failures (unknown contig, out-of-range tid, start > end,
end > contig_last_pos) MUST surface as typed ReaderError variants — never
by silently clamping.
Readers::segments(target: impl IntoSegmentTarget, opts: SegmentOptions)
MUST return Result<impl Iterator<Item = Segment>, ReaderError>. The
iterator MUST:
- Cover every base of every input range exactly once across
core_range()of consecutive tiles — i.e. the union ofcore_range()of all yielded segments equals the union of input ranges, with no overlap and no gap. - Produce tile cores (
Segment::core_range()) of length<= max_len, never splitting a tile across contigs. Each tile's full[start, end]is its core expanded byoverlapbases on each side, clipped to the requested target range; internal tiles therefore havelen() == max_len + 2 * overlap, while edge tiles are shorter. - Set
overlap_start/overlap_endperr[unified.segment_overlap]. The actual[start, end]of every tile is its core range expanded byoverlapbases on each side, clamped to the requested target range (not to the contig). At the requested range's edges this clamping reduces the corresponding overlap field to 0. - When the remaining contig after a tile's core would be
<= overlapbases long, the next tile's core MAY be shorter thanmax_len - 2*overlap; in the limiting case the last tile of a contig is one tile covering the remainder. The iterator MUST NOT yield zero-length tiles. - Yield tiles in
(tid, start)order. For multi-contig targets, all tiles for one contig MUST be emitted contiguously before moving on. - Borrow
&selfonly — building the iterator MUST NOT mutably borrowReaders, so callers can build the plan once and re-acquire&mut selffor eachpileup(&segment)call.
Readers::pileup(segment: &Segment) -> Result<PileupEngine<E::Extra>, ReaderError>
is the only entry point for driving a pileup. It MUST:
- Validate header consistency. Look up
segment.contig()against the current header. Ifheader.tid(segment.contig().as_str())does not return the same numeric tid stored insegment.tid(), returnReaderError::SegmentHeaderMismatch { contig, expected_tid }. Additionally, ifheader.target_len(tid) - 1does not equalsegment.contig_last_pos().as_u64(), returnReaderError::SegmentContigLengthMismatch { contig, segment_last_pos, header_last_pos }. Together these catch the foot-gun where aSegmentbuilt against one [Readers] is fed to a different [Readers] whose header doesn't match — both contig-order and contig-length differences across reference panels are surfaced as typed errors instead of silent zero-record fetches. - Call
fetch_into_customizedwithsegment.tid().as_u32(),segment.start(),segment.end()to load records into the internalRecordStore<E::Extra>, passing the customize value through.computeandkeep_recordboth run inline during push: rejected records are rolled back with zero slab waste and kept records already have their extras populated. - Fetch the reference sequence for
[segment.start(), segment.end()]inclusive via the FASTA reader, using the contig name carried by the segment (no header lookup). The fetch MUST use a u64-bounded path sosegment.end() == Pos0::max_value()does not silently truncate the last base. - Take the store (now
RecordStore<E::Extra>with populated extras) and construct aPileupEngine<E::Extra>with the fetched reference sequence pre-attached viaset_reference_seq.
No separate apply_customize pass is needed. The returned engine is
ready for pileups() iteration with no further configuration required.
There MUST NOT be a pileup(tid, start, end) overload — callers wanting
a one-shot region build a Segment via Readers::segments.
Each format reader (IndexedBamReader, IndexedSamReader, IndexedCramReader, and the format-agnostic IndexedReader) MUST expose fetch_into_customized<E: CustomizeRecordStore>(tid, start, end, store, customize) -> Result<FetchCounts> in addition to fetch_into. The reader MUST forward customize to RecordStore::push_raw/push_fields so its keep_record runs at push time. FetchCounts { fetched, kept } reports records produced by the reader's built-in overlap/unmapped checks (fetched) vs those that also passed keep_record (kept). Existing fetch_into MUST remain a thin wrapper passing &mut () (whose default keep_record returns true) so its signature and behavior are unchanged.
FetchCounts MUST be #[derive(Debug, Clone, Copy, PartialEq, Eq, Default)] with fetched: usize and kept: usize fields. kept <= fetched MUST always hold. The struct is re-exported from crate::reader::FetchCounts for callers that need to report filter statistics.
Readers MUST expose:
header() -> &BamHeader— delegates to the alignment reader's header.segments(target, opts) -> Result<impl Iterator<Item = Segment>>— seer[unified.readers_segments]. The only way to obtain aSegment.pileup(&Segment) -> Result<PileupEngine<E::Extra>>— seer[unified.readers_pileup].fetch_into(tid, start, end, store) -> Result<usize>— delegates to the alignment reader. Always loads into aRecordStore<()>(for custom extras, usepileupdirectly — extras are populated inline at push time).fasta() -> &IndexedFastaReaderandfasta_mut() -> &mut IndexedFastaReader— direct access for callers that need reference sequences independently of the alignment reader (e.g., the call pipeline's segment fetching).alignment() -> &IndexedReaderandalignment_mut() -> &mut IndexedReader— direct access when needed.customize() -> &E/customize_mut() -> &mut E— direct access to inspect or reset the customize value's state between regions.
IndexedReader::open(path) MUST continue to work for BAM and SAM files without a FASTA path. CRAM detection in IndexedReader::open() MUST return an error explaining that CRAM requires a reference and suggesting Readers::open() instead. This preserves backward compatibility for code that only needs BAM/SAM.
API surface
See r[io.non_exhaustive_enums] and r[io.minimal_public_api] in general.md for the general rules. For this module, IndexedReader, FormatDetectionError, and ReaderError are the primary enums subject to r[io.non_exhaustive_enums].