Record Encoder
The record encoder provides a unified, format-agnostic API for writing VCF/BCF records with compile-time state enforcement. A single Writer type handles all output formats (VCF, VCF.gz, BCF). Field definitions provide a single source of truth for header construction, key resolution, and documentation generation. See also VCF Header for string dictionary assignment.
Unified Writer
A single Writer<W, S> type MUST support all output formats (VCF text, BGZF-compressed VCF, BCF binary). The output format MUST be selected at construction time via OutputFormat. There MUST NOT be separate BcfWriter / VcfWriter types in the public API.
Writer::new(inner: W, format: OutputFormat) MUST create a writer in the Unstarted state. No header is required at construction time. Index co-production MUST be automatic for compressed formats (VcfGz → TBI, Bcf → CSI).
The writer MUST use a typestate pattern: Writer<W, Unstarted> only exposes write_header(), which consumes the writer and returns Writer<W, Ready>. Writer<W, Ready> exposes begin_record() and finish().
write_header(self, header: &VcfHeader) MUST consume the Unstarted writer, write the header to the output, set up the index builder, and return a Ready writer. The header MUST NOT be stored — all field keys are pre-resolved at registration time.
finish(self) on Writer<W, Ready> MUST flush all buffered data, write the BGZF EOF marker (for compressed formats), finalize the index builder, and return Result<Option<IndexBuilder>>.
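The writer typestate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's implementation: VcfHeader and IndexBuilder are reduced to hypothetical stand-ins, the header is written as plain text for every format, and no BGZF compression or real indexing is performed.

```rust
use std::io::{self, Write};
use std::marker::PhantomData;

/// Output format, selected once at construction time.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OutputFormat { Vcf, VcfGz, Bcf }

/// Zero-sized typestate markers.
pub struct Unstarted;
pub struct Ready;

/// Hypothetical stand-ins for the real header and index builder types.
pub struct VcfHeader { pub text: String }
#[derive(Debug, PartialEq, Eq)]
pub enum IndexBuilder { Tbi, Csi }

pub struct Writer<W: Write, S> {
    inner: W,
    format: OutputFormat,
    index: Option<IndexBuilder>,
    _state: PhantomData<S>,
}

impl<W: Write> Writer<W, Unstarted> {
    /// No header is needed yet; the writer starts in the Unstarted state.
    pub fn new(inner: W, format: OutputFormat) -> Self {
        Writer { inner, format, index: None, _state: PhantomData }
    }

    /// Consumes the Unstarted writer. The header is only borrowed, never stored.
    pub fn write_header(mut self, header: &VcfHeader) -> io::Result<Writer<W, Ready>> {
        self.inner.write_all(header.text.as_bytes())?;
        // Index co-production is automatic for the compressed formats.
        let index = match self.format {
            OutputFormat::Vcf => None,
            OutputFormat::VcfGz => Some(IndexBuilder::Tbi),
            OutputFormat::Bcf => Some(IndexBuilder::Csi),
        };
        Ok(Writer { inner: self.inner, format: self.format, index, _state: PhantomData })
    }
}

impl<W: Write> Writer<W, Ready> {
    /// Flushes the output and hands back the index builder, if any.
    pub fn finish(mut self) -> io::Result<Option<IndexBuilder>> {
        self.inner.flush()?;
        Ok(self.index)
    }
}

pub fn demo_writer() -> io::Result<Option<IndexBuilder>> {
    let header = VcfHeader { text: "##fileformat=VCFv4.3\n".to_string() };
    let writer = Writer::new(Vec::<u8>::new(), OutputFormat::Bcf);
    // writer.finish(); // would not compile: finish() exists only on Writer<W, Ready>
    let ready = writer.write_header(&header)?;
    ready.finish()
}
```

Because write_header takes self by value, the Unstarted writer cannot be used again after the header has been written.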
Field Definitions
Field definitions MUST carry all metadata needed for header construction, key resolution, and documentation: name (&'static str), Number, ValueType, and description (&'static str). Field definitions MUST be parameterized by a value type marker to enable type-safe key resolution.
Three definition types MUST be provided:
InfoFieldDef<V>: INFO field definition, produces InfoKey<V> on registration
FormatFieldDef<V>: FORMAT field definition, produces FormatKey<V> on registration
FilterFieldDef: FILTER definition, produces FilterId on registration
Field definitions MUST be constructible as const values so that downstream crates can define them at compile time.
A FieldDescription trait MUST provide read access to name, number, value type, and description. Both InfoFieldDef<V> and FormatFieldDef<V> MUST implement this trait (via a type-erased form) to enable documentation generation from heterogeneous collections of field definitions.
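A rough sketch of const-constructible field definitions and documentation generation from a heterogeneous collection follows. The Number and ValueType variants, the const fn new constructor, and the header_lines helper are assumptions made for illustration. The sketch uses plain trait objects where the spec mentions a type-erased form.

```rust
use std::marker::PhantomData;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Number { Count(u32), A, R, G, Unknown }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ValueType { Integer, Float, Flag, String }

/// Uninhabited value type markers (only two are shown here).
pub enum Flag {}
pub enum Str {}

pub struct InfoFieldDef<V> {
    name: &'static str,
    number: Number,
    value_type: ValueType,
    description: &'static str,
    _value: PhantomData<V>,
}

impl<V> InfoFieldDef<V> {
    /// const fn, so downstream crates can declare definitions as const items.
    pub const fn new(
        name: &'static str,
        number: Number,
        value_type: ValueType,
        description: &'static str,
    ) -> Self {
        InfoFieldDef { name, number, value_type, description, _value: PhantomData }
    }
}

pub trait FieldDescription {
    fn name(&self) -> &'static str;
    fn number(&self) -> Number;
    fn value_type(&self) -> ValueType;
    fn description(&self) -> &'static str;
}

impl<V> FieldDescription for InfoFieldDef<V> {
    fn name(&self) -> &'static str { self.name }
    fn number(&self) -> Number { self.number }
    fn value_type(&self) -> ValueType { self.value_type }
    fn description(&self) -> &'static str { self.description }
}

pub const DB: InfoFieldDef<Flag> =
    InfoFieldDef::new("DB", Number::Count(0), ValueType::Flag, "dbSNP membership");
pub const AA: InfoFieldDef<Str> =
    InfoFieldDef::new("AA", Number::Count(1), ValueType::String, "Ancestral allele");

/// Renders one ##INFO line per definition, regardless of its value marker.
pub fn header_lines(defs: &[&dyn FieldDescription]) -> Vec<String> {
    defs.iter()
        .map(|d| {
            let number = match d.number() {
                Number::Count(n) => n.to_string(),
                Number::A => "A".to_string(),
                Number::R => "R".to_string(),
                Number::G => "G".to_string(),
                Number::Unknown => ".".to_string(),
            };
            // The Debug names of ValueType coincide with the VCF type names.
            format!(
                "##INFO=<ID={},Number={},Type={:?},Description=\"{}\">",
                d.name(), number, d.value_type(), d.description()
            )
        })
        .collect()
}

pub fn demo_docs() -> Vec<String> {
    let defs: [&dyn FieldDescription; 2] = [&DB, &AA];
    header_lines(&defs)
}
```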
Field Registration
VcfHeaderBuilder MUST provide register_info, register_format, and register_filter methods that accept a field definition, add the corresponding header entry, and return a resolved typed key. Each method is only available in its corresponding typestate phase (register_filter in the Filters phase, register_info in the Infos phase, register_format in the Formats phase). This combines header construction and key resolution in a single step, eliminating the possibility of declaring a header field without resolving its key or vice versa, and enforces BCF string dictionary ordering at compile time.
VcfHeaderBuilder MUST provide a register_contig method (available in the Contigs phase) that adds a contig to the header and returns a ContigId carrying both the integer tid and the contig name.
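The phase-restricted registration methods might look roughly like the sketch below. It is simplified under several assumptions: the register methods take plain strings instead of field definitions, every method returns a bare FieldId rather than a typed key, the Contigs phase is omitted, and the phase transition methods infos() and formats() as well as build() are hypothetical names.

```rust
use std::marker::PhantomData;

// Phase markers; each register method exists in exactly one phase.
pub struct Filters;
pub struct Infos;
pub struct Formats;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FieldId { pub dict_idx: u32, pub name: String }

pub struct VcfHeaderBuilder<P> {
    lines: Vec<String>,
    next_idx: u32,
    _phase: PhantomData<P>,
}

impl<P> VcfHeaderBuilder<P> {
    /// Adds the header line and resolves the key in the same step.
    fn add(&mut self, line: String, name: &str) -> FieldId {
        self.lines.push(line);
        let id = FieldId { dict_idx: self.next_idx, name: name.to_string() };
        self.next_idx += 1;
        id
    }

    fn next_phase<Q>(self) -> VcfHeaderBuilder<Q> {
        VcfHeaderBuilder { lines: self.lines, next_idx: self.next_idx, _phase: PhantomData }
    }
}

impl VcfHeaderBuilder<Filters> {
    pub fn new() -> Self {
        // Dictionary index 0 is reserved for PASS.
        VcfHeaderBuilder { lines: Vec::new(), next_idx: 1, _phase: PhantomData }
    }
    pub fn register_filter(&mut self, name: &str, desc: &str) -> FieldId {
        self.add(format!("##FILTER=<ID={},Description=\"{}\">", name, desc), name)
    }
    pub fn infos(self) -> VcfHeaderBuilder<Infos> { self.next_phase() }
}

impl VcfHeaderBuilder<Infos> {
    pub fn register_info(&mut self, name: &str, number: &str, ty: &str, desc: &str) -> FieldId {
        let line = format!("##INFO=<ID={},Number={},Type={},Description=\"{}\">", name, number, ty, desc);
        self.add(line, name)
    }
    pub fn formats(self) -> VcfHeaderBuilder<Formats> { self.next_phase() }
}

impl VcfHeaderBuilder<Formats> {
    pub fn register_format(&mut self, name: &str, number: &str, ty: &str, desc: &str) -> FieldId {
        let line = format!("##FORMAT=<ID={},Number={},Type={},Description=\"{}\">", name, number, ty, desc);
        self.add(line, name)
    }
    pub fn build(self) -> String { self.lines.join("\n") }
}

pub fn demo_registration() -> (FieldId, FieldId, FieldId, String) {
    let mut b = VcfHeaderBuilder::<Filters>::new();
    let low_qual = b.register_filter("LowQual", "Low quality");
    let mut b = b.infos();
    // b.register_filter("q10", "..."); // would not compile: the Filters phase is over
    let dp = b.register_info("DP", "1", "Integer", "Total depth");
    let mut b = b.formats();
    let gq = b.register_format("GQ", "1", "Integer", "Genotype quality");
    (low_qual, dp, gq, b.build())
}
```

Since the phases can only advance in one direction, the dictionary indices come out in FILTER, INFO, FORMAT order without any runtime check.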
Field Identifiers and Keys
FieldId MUST carry both a BCF dictionary index (u32) and a string name (SmolStr). The BCF encoder uses the dictionary index; the VCF text encoder uses the name. Both are resolved at registration time, not per-record.
ContigId MUST carry both the integer tid (u32) and the contig name (SmolStr).
FilterId MUST carry both the BCF dictionary index and the filter name. A FilterId::PASS constant MUST be provided with dictionary index 0 and name "PASS".
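As an illustration of why identifiers carry both an index and a name, the sketch below resolves a FILTER column for each output path from the same FilterId values. It uses Cow<'static, str> from the standard library as a stand-in for SmolStr, and the helper functions filter_text and filter_bcf are hypothetical. A real BCF encoder would write the indices as a typed integer vector.

```rust
use std::borrow::Cow;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FilterId {
    pub dict_idx: u32,
    /// Stand-in for SmolStr so that the sketch needs only the standard library.
    pub name: Cow<'static, str>,
}

impl FilterId {
    /// PASS always occupies dictionary index 0.
    pub const PASS: FilterId = FilterId { dict_idx: 0, name: Cow::Borrowed("PASS") };
}

/// VCF text path: uses the names, joined with semicolons.
pub fn filter_text(filters: &[&FilterId]) -> String {
    filters.iter().map(|f| &*f.name).collect::<Vec<&str>>().join(";")
}

/// BCF path: uses the dictionary indices.
pub fn filter_bcf(filters: &[&FilterId]) -> Vec<u32> {
    filters.iter().map(|f| f.dict_idx).collect()
}

pub fn demo_filters() -> (String, Vec<u32>) {
    let low_qual = FilterId { dict_idx: 1, name: Cow::Borrowed("LowQual") };
    let strand = FilterId { dict_idx: 2, name: Cow::Owned(String::from("StrandBias")) };
    let failed = [&low_qual, &strand];
    (filter_text(&failed), filter_bcf(&failed))
}
```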
Typed Keys
Typed keys MUST encode the VCF value type at the Rust type level via uninhabited marker enums. This prevents passing a float value where an integer key is expected, or using an INFO key in a FORMAT context.
The following key types MUST be provided:
InfoKey<Scalar<i32>>: single integer INFO value
InfoKey<Scalar<f32>>: single float INFO value
InfoKey<Arr<i32>>: integer array INFO value
InfoKey<Arr<f32>>: float array INFO value
InfoKey<Flag>: flag INFO value (no data)
InfoKey<Str>: string INFO value
InfoKey<OptArr<i32>>: integer array with optional (missing) elements
FormatKey<Gt>: genotype FORMAT value
FormatKey<Scalar<i32>>: single integer FORMAT value
FormatKey<Scalar<f32>>: single float FORMAT value
Each typed key MUST provide an encode method that delegates to the corresponding encoder trait method. INFO keys MUST accept &mut impl InfoEncoder and a single value. FORMAT keys MUST accept &mut impl FormatEncoder and a slice of values (one per sample):
FormatKey<Gt>::encode(&self, enc: &mut impl FormatEncoder, gts: &[Genotype])
FormatKey<Scalar<T>>::encode(&self, enc: &mut impl FormatEncoder, values: &[T])
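The following sketch shows one way the typed INFO keys could delegate to the encoder trait while rejecting mismatched value types at compile time. Only the two scalar keys are covered. The public InfoKey::new constructor, the reduced InfoEncoder trait, and the TextInfo recorder are assumptions for illustration, since real keys come from registration. The generic marker is kept uninhabited by embedding std::convert::Infallible, which is one possible technique rather than a confirmed design.

```rust
use std::convert::Infallible;
use std::marker::PhantomData;

pub struct FieldId { pub dict_idx: u32, pub name: String }

/// Uninhabited generic marker: the Infallible payload makes it impossible to construct.
pub enum Scalar<T> {
    #[doc(hidden)]
    _Never(Infallible, PhantomData<T>),
}

/// Reduced encoder trait with just the methods this sketch needs.
pub trait InfoEncoder {
    fn info_int(&mut self, id: &FieldId, value: i32);
    fn info_float(&mut self, id: &FieldId, value: f32);
}

pub struct InfoKey<V> { pub id: FieldId, _value: PhantomData<V> }

impl<V> InfoKey<V> {
    /// Hypothetical constructor; in the spec, keys are produced by registration.
    pub fn new(dict_idx: u32, name: &str) -> Self {
        InfoKey { id: FieldId { dict_idx, name: name.to_string() }, _value: PhantomData }
    }
}

impl InfoKey<Scalar<i32>> {
    pub fn encode(&self, enc: &mut impl InfoEncoder, value: i32) {
        enc.info_int(&self.id, value);
    }
}

impl InfoKey<Scalar<f32>> {
    pub fn encode(&self, enc: &mut impl InfoEncoder, value: f32) {
        enc.info_float(&self.id, value);
    }
}

/// Toy encoder that renders a semicolon-separated INFO column.
#[derive(Default)]
pub struct TextInfo { pub out: String }

impl TextInfo {
    fn push(&mut self, id: &FieldId, value: String) {
        if !self.out.is_empty() {
            self.out.push(';');
        }
        self.out.push_str(&id.name);
        self.out.push('=');
        self.out.push_str(&value);
    }
}

impl InfoEncoder for TextInfo {
    fn info_int(&mut self, id: &FieldId, value: i32) { self.push(id, value.to_string()); }
    fn info_float(&mut self, id: &FieldId, value: f32) { self.push(id, value.to_string()); }
}

pub fn demo_keys() -> String {
    let dp = InfoKey::<Scalar<i32>>::new(1, "DP");
    let af = InfoKey::<Scalar<f32>>::new(2, "AF");
    let mut enc = TextInfo::default();
    dp.encode(&mut enc, 30);
    af.encode(&mut enc, 0.5);
    // dp.encode(&mut enc, 0.5); // would not compile: an i32 is expected
    enc.out
}
```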
Typestate Record Encoder
The record encoder MUST use a typestate pattern to enforce the correct calling sequence at compile time. A single RecordEncoder<'a, S> type MUST handle both BCF and VCF encoding via internal enum dispatch. The encoder MUST be parameterized by a state type that restricts which methods are available.
Three record states MUST be provided:
Begun: record has been started (fixed fields written). Only filter methods are available.
Filtered: filters have been written. INFO methods and state transition methods are available.
WithSamples: sample count has been declared. FORMAT methods and emit() are available.
State transitions MUST consume the encoder by value and return the encoder in the new state, preventing use of the old state:
writer.begin_record(contig, pos, alleles, qual) → RecordEncoder<Begun>
filter_pass(self) / filter_fail(self) / no_filter(self) on Begun → Filtered
begin_samples(self, n) on Filtered → WithSamples
emit(self) on Filtered (no samples) or WithSamples → consumed (borrow released)
The encoder types MUST be marked #[must_use] to warn at compile time if a record is silently discarded without calling emit().
The RecordEncoder type MUST NOT expose the writer's W: Write type parameter. The VCF text path MUST use &mut dyn Write internally so that the encoder type is RecordEncoder<'a, S> with no extra generic parameters.
begin_record() MUST be a method on Writer<W, Ready>. It MUST accept a ContigId, 1-based position, Alleles, and optional quality. It MUST clear all per-record state and write/buffer the fixed fields. Overflow checks on position and allele counts MUST use checked conversions and return typed errors.
Exactly one filter method MUST be called per record: filter_pass() for PASS, filter_fail(&[&FilterId]) for one or more failed filters, or no_filter() for not-applied. All three MUST consume the Begun state and return Filtered.
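The record state machine can be sketched as below for the VCF text path only. Many details are assumptions: Sink is a hypothetical in-memory stand-in for Writer<W, Ready>, begin_record takes plain strings instead of ContigId and Alleles, INFO methods are omitted (the column is always "."), format_int supports a single key, and emit() returns nothing because the sink cannot fail.

```rust
use std::marker::PhantomData;

// Typestate markers for the three record states.
pub struct Begun;
pub struct Filtered;
pub struct WithSamples;

/// Hypothetical stand-in for Writer<W, Ready>: collects finished lines in memory.
#[derive(Default)]
pub struct Sink {
    pub lines: Vec<String>,
    buf: String, // line buffer, reused across records
}

#[must_use = "a record is only written when emit() is called"]
pub struct RecordEncoder<'a, S> {
    sink: &'a mut Sink,
    n_samples: usize,
    _state: PhantomData<S>,
}

impl Sink {
    pub fn begin_record(&mut self, chrom: &str, pos: u64, reference: &str, alt: &str) -> RecordEncoder<'_, Begun> {
        self.buf.clear();
        // CHROM, POS, ID, REF, ALT, QUAL (ID and QUAL are left missing here).
        self.buf.push_str(&format!("{}\t{}\t.\t{}\t{}\t.", chrom, pos, reference, alt));
        RecordEncoder { sink: self, n_samples: 0, _state: PhantomData }
    }
}

impl<'a, S> RecordEncoder<'a, S> {
    /// Appends text and moves the encoder into the next state.
    fn advance<T>(self, text: &str) -> RecordEncoder<'a, T> {
        let RecordEncoder { sink, n_samples, .. } = self;
        sink.buf.push_str(text);
        RecordEncoder { sink, n_samples, _state: PhantomData }
    }

    fn write_line(self) {
        let sink = self.sink;
        sink.lines.push(sink.buf.clone());
    }
}

impl<'a> RecordEncoder<'a, Begun> {
    pub fn filter_pass(self) -> RecordEncoder<'a, Filtered> { self.advance("\tPASS") }
    pub fn no_filter(self) -> RecordEncoder<'a, Filtered> { self.advance("\t.") }
}

impl<'a> RecordEncoder<'a, Filtered> {
    pub fn begin_samples(self, n: usize) -> RecordEncoder<'a, WithSamples> {
        let mut next: RecordEncoder<'a, WithSamples> = self.advance("\t."); // empty INFO
        next.n_samples = n;
        next
    }
    pub fn emit(self) { self.advance::<Filtered>("\t.").write_line() }
}

impl<'a> RecordEncoder<'a, WithSamples> {
    pub fn format_int(&mut self, key: &str, values: &[i32]) {
        assert_eq!(values.len(), self.n_samples, "one value per sample");
        self.sink.buf.push('\t');
        self.sink.buf.push_str(key);
        for v in values {
            self.sink.buf.push_str(&format!("\t{}", v));
        }
    }
    pub fn emit(self) { self.write_line() }
}

pub fn demo_record() -> Vec<String> {
    let mut sink = Sink::default();
    let rec = sink.begin_record("chr1", 100, "A", "T");
    // rec.emit(); // would not compile: Begun has no emit()
    rec.filter_pass().emit();
    let mut rec = sink.begin_record("chr1", 200, "G", "C").no_filter().begin_samples(2);
    rec.format_int("DP", &[12, 9]);
    rec.emit();
    sink.lines
}
```

Each transition takes self by value, so an encoder in an earlier state cannot be reused, and the mutable borrow of the sink ends as soon as emit() returns.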
InfoEncoder Trait
An InfoEncoder trait MUST be provided for encoding INFO fields. It MUST be object-safe (all methods use &mut self and concrete parameter types). It MUST be implemented by RecordEncoder<'_, Filtered>.
INFO methods (info_int, info_float, info_ints, info_floats, info_flag, info_string, info_int_opts) MUST accept a &FieldId and the appropriate value. These methods MUST be infallible — they write to in-memory buffers which cannot fail.
InfoEncoder MUST provide n_allele() and n_alt() methods returning the number of alleles and alternate alleles for the current record.
If the same INFO field (identified by FieldId) is encoded more than once within a single record, the encoder SHOULD overwrite the previously-written value. The final output MUST contain at most one entry per distinct INFO field. The replacement field MUST appear after all non-replaced fields (i.e. it moves to the end of the INFO column). The deduplication tracker SHOULD reuse its allocation across records.
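The overwrite rule can be illustrated with a small tracker keyed by dictionary index. This is a sketch under assumptions, not the encoder's actual data structure: InfoFields and its methods are hypothetical, and storing one String per entry would not satisfy the no-allocation rule of a real encoder.

```rust
/// Tracks the INFO entries of the current record in output order.
#[derive(Default)]
pub struct InfoFields {
    entries: Vec<(u32, String)>, // (dictionary index, rendered "KEY=value")
}

impl InfoFields {
    /// Called at the start of each record; clear() keeps the Vec's capacity.
    pub fn clear(&mut self) {
        self.entries.clear();
    }

    /// Writes a field. A repeated field replaces the earlier value and moves to the end.
    pub fn set(&mut self, dict_idx: u32, text: String) {
        if let Some(pos) = self.entries.iter().position(|(idx, _)| *idx == dict_idx) {
            self.entries.remove(pos);
        }
        self.entries.push((dict_idx, text));
    }

    pub fn render(&self) -> String {
        self.entries.iter().map(|(_, t)| t.as_str()).collect::<Vec<_>>().join(";")
    }
}

pub fn demo_dedup() -> String {
    let mut info = InfoFields::default();
    info.set(1, "DP=10".to_string());
    info.set(2, "AF=0.5".to_string());
    info.set(1, "DP=12".to_string()); // overwrites DP and moves it after AF
    info.render()
}
```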
FormatEncoder Trait
A FormatEncoder trait MUST be provided for encoding FORMAT fields. It MUST be object-safe. It MUST be implemented by RecordEncoder<'_, WithSamples>.
FORMAT methods (format_gt, format_int, format_float) MUST accept a &FieldId and a slice of values — one element per sample. The slice length MUST equal the sample count declared by begin_samples(). These methods MUST be infallible. For single-sample records, callers pass a 1-element slice.
FormatEncoder MUST provide n_allele() and n_alt() methods returning the number of alleles and alternate alleles for the current record.
If the same FORMAT field (identified by FieldId) is encoded more than once within a single record, the encoder SHOULD overwrite the previously-written value. The final output MUST contain at most one entry per distinct FORMAT field. The replacement field MUST appear after all non-replaced fields (i.e. it moves to the end of the FORMAT column). The deduplication tracker SHOULD reuse its allocation across records.
Emit
emit() MUST finalize the record and write it to the output. For BCF, this patches the fixed header and flushes to BGZF. For VCF text, this writes accumulated FORMAT fields and flushes the line buffer. emit() MUST consume the encoder by value, releasing the writer borrow. emit() is the only encoding method that performs I/O and MAY return an error.
emit() MUST be available on both Filtered (for records without samples) and WithSamples (for records with samples).
Custom Type Encoding
An EncodeInfo trait MUST be provided with an associated Key type and an encode_info(&self, enc: &mut dyn InfoEncoder, key: &Self::Key) method. This allows domain types to encapsulate their VCF encoding logic, including multi-field expansion (e.g., a strand-specific type producing separate OT and OB fields).
An EncodeFormat trait MUST be provided with the same pattern as EncodeInfo, using &mut dyn FormatEncoder. Implementations that produce no output for certain values (e.g., unknown methylation status) simply do not call any encoder methods, and the FORMAT key does not appear.
EncodeInfo and EncodeFormat MUST use &mut dyn InfoEncoder and &mut dyn FormatEncoder respectively (not &mut impl) so that implementations do not need to be generic over the encoder type.
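A minimal sketch of the EncodeInfo pattern with multi-field expansion follows. The trait shape is taken from the text above. StrandCounts, StrandKeys, the reduced InfoEncoder trait, and the Recorder test double are hypothetical names introduced for illustration.

```rust
pub struct FieldId { pub dict_idx: u32, pub name: &'static str }

/// Reduced, object-safe encoder trait with only the method this sketch needs.
pub trait InfoEncoder {
    fn info_int(&mut self, id: &FieldId, value: i32);
}

pub trait EncodeInfo {
    type Key;
    /// Takes a trait object, so implementations need no generic parameter.
    fn encode_info(&self, enc: &mut dyn InfoEncoder, key: &Self::Key);
}

/// Hypothetical domain type: a count split by strand.
pub struct StrandCounts { pub top: i32, pub bottom: i32 }

/// Its key bundles the two resolved field identifiers.
pub struct StrandKeys { pub ot: FieldId, pub ob: FieldId }

impl EncodeInfo for StrandCounts {
    type Key = StrandKeys;
    fn encode_info(&self, enc: &mut dyn InfoEncoder, key: &StrandKeys) {
        // One domain value expands into two INFO fields.
        enc.info_int(&key.ot, self.top);
        enc.info_int(&key.ob, self.bottom);
    }
}

/// Test double that records what was encoded.
#[derive(Default)]
pub struct Recorder { pub fields: Vec<String> }

impl InfoEncoder for Recorder {
    fn info_int(&mut self, id: &FieldId, value: i32) {
        self.fields.push(format!("{}={}", id.name, value));
    }
}

pub fn demo_custom() -> Vec<String> {
    let keys = StrandKeys {
        ot: FieldId { dict_idx: 1, name: "OT" },
        ob: FieldId { dict_idx: 2, name: "OB" },
    };
    let mut rec = Recorder::default();
    StrandCounts { top: 7, bottom: 3 }.encode_info(&mut rec, &keys);
    rec.fields
}
```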
Format-Specific Encoding
The BCF arm of the encoder MUST write INFO fields to shared_buf and FORMAT fields to indiv_buf following all rules from the BCF Writer spec: typed value encoding, smallest int type selection, missing sentinels, field-major FORMAT layout, GT binary encoding.
The VCF arm of the encoder MUST write tab-delimited text following all rules from the VCF Text Format spec: semicolon-separated INFO, colon-separated FORMAT keys and values, percent-encoding, float precision.
All internal buffers MUST be reused across records. After warmup, encoding MUST NOT allocate.
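The buffer reuse requirement relies on the fact that clearing a Vec keeps its capacity. The sketch below illustrates that with a hypothetical encode_record function that writes two little-endian integers. The byte layout is invented for the example and is not the BCF record layout.

```rust
/// Encodes one toy record into a caller-owned buffer that is reused across calls.
pub fn encode_record(buf: &mut Vec<u8>, pos: u32, depth: i32) {
    buf.clear(); // resets the length, keeps the allocation
    buf.extend_from_slice(&pos.to_le_bytes());
    buf.extend_from_slice(&depth.to_le_bytes());
}

/// Returns the buffer capacity after warmup and after many more records.
pub fn demo_reuse() -> (usize, usize) {
    let mut buf = Vec::new();
    encode_record(&mut buf, 1, 10); // warmup: the only call that may allocate
    let warm = buf.capacity();
    for pos in 2..1000 {
        encode_record(&mut buf, pos, 10);
    }
    (warm, buf.capacity())
}
```

Every record in this example has the same encoded size, so the capacity reached during warmup is never exceeded. A real encoder with variable-size records would keep growing until it has seen its largest record.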
Equivalence
For the same logical record, encoding as BCF and as VCF text MUST produce output that, when read back, yields identical field values (within floating-point formatting precision).