Base-typed Sequence Decoding
Sources: [SAM1] §4.2.4 "SEQ and QUAL encoding" (4-bit nibble-to-base mapping, =ACMGRSVTWYHKDBN). The Base enum, SIMD acceleration, and direct-to-Base decode table are seqair-specific design choices; the upstream spec defines only the nibble codes. See References.
Background
BAM sequences are stored as 4-bit nibbles (two bases per byte). The standard decode table maps nibbles to ASCII characters (=ACMGRSVTWYHKDBN). Downstream, the ASCII bytes must be converted to Base enum values (A, C, G, T, Unknown) at every pileup position — billions of times for whole-genome data.
By decoding directly into Base values at record construction time, we:
- Eliminate the per-position Base::from(u8) conversion (a LUT lookup per read per position)
- Get compile-time type safety: a Base is guaranteed to be one of 5 valid values, not an arbitrary byte
- Collapse IUPAC ambiguity codes (M, R, W, S, Y, K, etc.) and = to Base::Unknown at decode time, instead of at each use site
Decode table
A 16-entry lookup table MUST map 4-bit BAM nibble codes directly to Base discriminant values: 1→A(65), 2→C(67), 4→G(71), 8→T(84), and all other codes (0, 3, 5, 6, 7, 9–14, 15) → Unknown(78).
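The mapping above can be sketched as a compile-time table. This is a minimal illustration, assuming Base is a repr(u8) enum whose discriminants are the ASCII codes listed in this section; the derive list and the way the table is built are assumptions, not necessarily how seqair defines them.

```rust
/// Base with ASCII-valued discriminants, as described in this section.
/// The derives are an assumption made for this sketch.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Base {
    A = b'A',       // 65
    C = b'C',       // 67
    G = b'G',       // 71
    T = b'T',       // 84
    Unknown = b'N', // 78
}

/// 16-entry table indexed by the 4-bit BAM nibble code.
/// Every code other than 1, 2, 4, 8 stays at Unknown.
pub const DECODE_BASE_TYPED: [Base; 16] = {
    let mut table = [Base::Unknown; 16];
    table[1] = Base::A;
    table[2] = Base::C;
    table[4] = Base::G;
    table[8] = Base::T;
    table
};
```

Starting from an all-Unknown array and overwriting four slots keeps the "all other codes" rule implicit, so no ambiguity code can be missed by omission.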
Table invariant
Every entry in DECODE_BASE_TYPED (16 entries) and DECODE_PAIR_TYPED (256 × 2 = 512 bytes) MUST be a valid Base discriminant: A(65), C(67), G(71), T(84), or Unknown(78). This MUST be verified by an exhaustive test covering all entries.
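One way the exhaustive check could look is sketched below. For brevity the tables are modelled as raw bytes, and the pair table layout (high nibble first, then low nibble, per packed byte) is an assumption based on the BAM packing order; the helper name tables_are_valid is hypothetical.

```rust
/// Valid Base discriminants per this section: A, C, G, T, Unknown.
const VALID: [u8; 5] = [65, 67, 71, 84, 78];

/// Byte view of the 16-entry nibble table (modelled as u8 for this sketch).
const DECODE_BASE_TYPED: [u8; 16] = {
    let mut t = [78u8; 16];
    t[1] = 65;
    t[2] = 67;
    t[4] = 71;
    t[8] = 84;
    t
};

/// Assumed pair-table layout: for each packed byte, the base decoded from
/// the high nibble followed by the base decoded from the low nibble.
const DECODE_PAIR_TYPED: [[u8; 2]; 256] = {
    let mut t = [[78u8; 2]; 256];
    let mut i = 0;
    while i < 256 {
        t[i] = [DECODE_BASE_TYPED[i >> 4], DECODE_BASE_TYPED[i & 0x0F]];
        i += 1;
    }
    t
};

/// Hypothetical helper: true if all 16 + 512 entries are valid discriminants.
pub fn tables_are_valid() -> bool {
    DECODE_BASE_TYPED.iter().all(|b| VALID.contains(b))
        && DECODE_PAIR_TYPED.iter().flatten().all(|b| VALID.contains(b))
}
```

In a real test module this would be a single #[test] asserting tables_are_valid(); because both tables are small, visiting every entry is cheap.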
Decoding function
decode_bases(packed_seq, seq_len) MUST decode a 4-bit packed BAM sequence into a Vec<Base>. The function SHOULD use the SIMD-accelerated decode path (SSSE3/NEON) with a Base-valued lookup table so that its throughput matches the ASCII decoder.
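A scalar reference version of this function might look like the following. It is only a sketch of the fallback path: the SIMD dispatch is not shown, the parameter types are assumed, and the behaviour on a too-short packed_seq (a panic here) is a choice made for the example rather than something this section specifies.

```rust
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Base {
    A = b'A',
    C = b'C',
    G = b'G',
    T = b'T',
    Unknown = b'N',
}

const DECODE_BASE_TYPED: [Base; 16] = {
    let mut table = [Base::Unknown; 16];
    table[1] = Base::A;
    table[2] = Base::C;
    table[4] = Base::G;
    table[8] = Base::T;
    table
};

/// Scalar sketch: decode `seq_len` bases from 4-bit packed BAM sequence data.
/// Panics if `packed_seq` holds fewer than ceil(seq_len / 2) bytes.
pub fn decode_bases(packed_seq: &[u8], seq_len: usize) -> Vec<Base> {
    let mut bases = Vec::with_capacity(seq_len);
    for i in 0..seq_len {
        let byte = packed_seq[i / 2];
        // BAM stores the even-indexed base in the high nibble.
        let nibble = if i % 2 == 0 { byte >> 4 } else { byte & 0x0F };
        bases.push(DECODE_BASE_TYPED[nibble as usize]);
    }
    bases
}
```

For an odd seq_len the low nibble of the last byte is padding and is never read.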
Storage in RecordStore
The RecordStore MUST store decoded bases in a dedicated slab (Vec<Base>) separate from the data slab. The data slab continues to hold cigar, qual, and aux data. The seq_at method MUST return Base instead of u8.
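The two-slab layout can be illustrated with a heavily simplified store. Only the names RecordStore and seq_at come from this section; the field names, the push_record and data methods, and the (start, len) span bookkeeping are hypothetical and exist only to make the sketch self-contained.

```rust
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Base {
    A = b'A',
    C = b'C',
    G = b'G',
    T = b'T',
    Unknown = b'N',
}

/// Simplified stand-in for the real store (hypothetical layout).
#[derive(Default)]
pub struct RecordStore {
    bases: Vec<Base>,               // dedicated slab of decoded bases
    data: Vec<u8>,                  // cigar, qual, and aux bytes
    seq_spans: Vec<(usize, usize)>, // (start, len) into `bases`, one per record
}

impl RecordStore {
    /// Append one record and return its index (hypothetical helper).
    pub fn push_record(&mut self, seq: &[Base], other: &[u8]) -> usize {
        let start = self.bases.len();
        self.bases.extend_from_slice(seq);
        self.data.extend_from_slice(other);
        self.seq_spans.push((start, seq.len()));
        self.seq_spans.len() - 1
    }

    /// Return the base at `pos` within `record` as a typed Base, not a u8.
    pub fn seq_at(&self, record: usize, pos: usize) -> Base {
        let (start, len) = self.seq_spans[record];
        assert!(pos < len, "position out of range for record");
        self.bases[start + pos]
    }

    /// Read-only view of the non-sequence slab (hypothetical helper).
    pub fn data(&self) -> &[u8] {
        &self.data
    }
}
```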
PileupAlignment
PileupAlignment.base MUST be of type Base, not u8. This eliminates the Base::from(u8) conversion in the metrics accumulation hot loop.
ASCII-to-Base batch conversion
Reference sequences (FASTA) and SAM text sequences arrive as ASCII bytes, not 4-bit packed nibbles. These need the same one-time batch conversion to Base that BAM sequences get via decode_bases, but using an ASCII-aware path instead of the nibble decode table.
Base::from_ascii_vec(Vec<u8>) -> Vec<Base> MUST convert a vector of ASCII bytes to Base values in place, reusing the allocation. It maps A/a→A, C/c→C, G/g→G, T/t→T, and all other bytes→Unknown. The function MUST NOT allocate a new vector: it operates on the input's buffer and transmutes the result. This is safe because Base is repr(u8) and the function only writes valid Base discriminant bytes (65, 67, 71, 78, 84).
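A scalar sketch of the in-place conversion is shown below. It assumes the repr(u8) layout described above and rebuilds the vector with Vec::from_raw_parts instead of a literal transmute; that is one possible way to reuse the allocation, not necessarily the one seqair uses, and the SIMD path is omitted.

```rust
use std::mem::ManuallyDrop;

#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Base {
    A = b'A',
    C = b'C',
    G = b'G',
    T = b'T',
    Unknown = b'N',
}

impl Base {
    /// Scalar sketch: rewrite each byte to a valid discriminant, then
    /// reinterpret the same buffer as Vec<Base> without reallocating.
    pub fn from_ascii_vec(mut bytes: Vec<u8>) -> Vec<Base> {
        for b in bytes.iter_mut() {
            *b = match *b {
                b'A' | b'a' => b'A',
                b'C' | b'c' => b'C',
                b'G' | b'g' => b'G',
                b'T' | b't' => b'T',
                _ => b'N',
            };
        }
        // Prevent Vec<u8>'s destructor from freeing the buffer we hand over.
        let mut bytes = ManuallyDrop::new(bytes);
        let (ptr, len, cap) = (bytes.as_mut_ptr(), bytes.len(), bytes.capacity());
        // SAFETY: Base is repr(u8), so it has the same size and alignment as
        // u8; every byte was just rewritten to a valid Base discriminant; and
        // ptr/len/cap come from a live Vec<u8> whose ownership moves here.
        unsafe { Vec::from_raw_parts(ptr.cast::<Base>(), len, cap) }
    }
}
```

Normalising every byte before the reinterpretation is what makes the unsafe block sound: no byte outside the five discriminants can ever be observed as a Base.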
The ASCII batch converter MUST use SIMD acceleration (SSSE3 on x86_64, NEON on aarch64) with a scalar fallback, following the same dispatch pattern as decode_bases. This is the required path for all bulk u8→Base conversions outside BAM 4-bit decoding: FASTA reference sequences, SAM text sequences, and any other ASCII-encoded base data.
The SIMD ASCII converter MUST produce identical output to applying Base::from(u8) element-wise. This MUST be verified by property-based tests covering arbitrary byte values and lengths including SIMD boundary lengths (0, 1, 15, 16, 31, 32, 33, 63, 64, 65, 128).
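The equivalence requirement could be exercised with a harness along these lines. To stay dependency-free it uses a fixed-seed xorshift generator as a deterministic stand-in for a real property-testing crate; base_from_u8 stands in for Base::from(u8), and matches_scalar_reference and the round count are hypothetical.

```rust
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Base {
    A = b'A',
    C = b'C',
    G = b'G',
    T = b'T',
    Unknown = b'N',
}

/// Scalar reference: the element-wise mapping stated in this section.
fn base_from_u8(b: u8) -> Base {
    match b {
        b'A' | b'a' => Base::A,
        b'C' | b'c' => Base::C,
        b'G' | b'g' => Base::G,
        b'T' | b't' => Base::T,
        _ => Base::Unknown,
    }
}

/// SIMD boundary lengths listed in this section.
const LENGTHS: [usize; 11] = [0, 1, 15, 16, 31, 32, 33, 63, 64, 65, 128];

/// Returns true if `convert` agrees with the scalar reference on
/// pseudo-random byte vectors of every boundary length.
pub fn matches_scalar_reference(convert: impl Fn(Vec<u8>) -> Vec<Base>) -> bool {
    let mut state: u32 = 0x2545_F491; // fixed seed keeps the run reproducible
    for &len in LENGTHS.iter() {
        for _round in 0..64 {
            let input: Vec<u8> = (0..len)
                .map(|_| {
                    // xorshift32 step; take the top byte as the sample.
                    state ^= state << 13;
                    state ^= state >> 17;
                    state ^= state << 5;
                    (state >> 24) as u8
                })
                .collect();
            let expected: Vec<Base> = input.iter().map(|&b| base_from_u8(b)).collect();
            if convert(input) != expected {
                return false;
            }
        }
    }
    true
}
```

Passing the converter as a closure lets the same harness check the SSSE3, NEON, and scalar paths individually.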
FromStr validation
FromStr for Base MUST accept only a single character A, C, G, T, or N (case-insensitive) after trimming whitespace. Multi-character input (after trimming) MUST return Err(BaseError::MultipleChars). Any single character that is not A/C/G/T/N MUST return Err(BaseError::InvalidBase(byte)) where byte is the first byte of the trimmed input. Empty input MUST return Err(BaseError::Empty).
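The validation rules above can be sketched as follows. Two details are assumptions rather than statements from this section: N is mapped to Base::Unknown, and MultipleChars and Empty are modelled as payload-free variants.

```rust
use std::str::FromStr;

#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Base {
    A = b'A',
    C = b'C',
    G = b'G',
    T = b'T',
    Unknown = b'N',
}

/// Variant payloads other than InvalidBase(u8) are assumptions.
#[derive(Debug, PartialEq, Eq)]
pub enum BaseError {
    Empty,
    MultipleChars,
    InvalidBase(u8),
}

impl FromStr for Base {
    type Err = BaseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let trimmed = s.trim();
        let mut chars = trimmed.chars();
        let first = match chars.next() {
            Some(c) => c,
            None => return Err(BaseError::Empty),
        };
        // Length is checked before validity, so "XY" is MultipleChars.
        if chars.next().is_some() {
            return Err(BaseError::MultipleChars);
        }
        match first.to_ascii_uppercase() {
            'A' => Ok(Base::A),
            'C' => Ok(Base::C),
            'G' => Ok(Base::G),
            'T' => Ok(Base::T),
            'N' => Ok(Base::Unknown), // assumed mapping for N
            // Report the first byte of the trimmed input, as required.
            _ => Err(BaseError::InvalidBase(trimmed.as_bytes()[0])),
        }
    }
}
```

With this in place, "g".parse::<Base>() yields Ok(Base::G), while whitespace-only input yields Err(BaseError::Empty).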
Error display
BaseError variants MUST display the actual invalid value.
InvalidBase(byte) SHOULD format as "Invalid base: 0x{byte:02x}".
Empty SHOULD format as "Empty".
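A Display implementation matching the two formats above might look like this. The variant set is assumed from the FromStr rules in this document, and the message for MultipleChars is a placeholder because no format is given for it here.

```rust
use std::fmt;

/// Variant set assumed from the FromStr rules in this document.
#[derive(Debug, PartialEq, Eq)]
pub enum BaseError {
    Empty,
    MultipleChars,
    InvalidBase(u8),
}

impl fmt::Display for BaseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Two-digit lowercase hex of the offending byte.
            BaseError::InvalidBase(byte) => write!(f, "Invalid base: 0x{byte:02x}"),
            BaseError::Empty => write!(f, "Empty"),
            // Placeholder text: this variant's format is not specified here.
            BaseError::MultipleChars => write!(f, "Multiple characters"),
        }
    }
}

impl std::error::Error for BaseError {}
```

Formatting the byte as hex keeps the message unambiguous when the invalid input is a control or otherwise non-printable character.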