Redesign context AST: typed NodeBody, Role as grammar roles, tests

Role is now just System/User/Assistant — maps 1:1 to the grammar.
Leaf types are NodeBody variants: Content, Thinking, ToolCall,
ToolResult, Memory, Dmn, Log. Each variant renders itself; no Role
needed on leaves. AstNode is Leaf(NodeLeaf) | Branch{role, children}.
ContextState holds four Vec<AstNode> sections directly.
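
A minimal sketch of how these types might fit together. Variant and type names follow the message above (which names the leaf payload both NodeBody and NodeLeaf; NodeBody is used here), but the payload types and field shapes are assumptions for illustration, not the actual context_new.rs definitions:

```rust
// Sketch of the redesigned AST. Variant names come from the commit
// message; payload types are guesses for illustration.

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Role {
    System,
    User,
    Assistant,
}

// Leaf payloads: each variant knows how to render itself, so no Role
// is stored on leaves.
#[derive(Debug, Clone, PartialEq)]
pub enum NodeBody {
    Content(String),
    Thinking(String),
    ToolCall(String),
    ToolResult(String),
    Memory(String),
    Dmn(String),
    Log(String),
}

#[derive(Debug, Clone, PartialEq)]
pub enum AstNode {
    Leaf(NodeBody),
    Branch { role: Role, children: Vec<AstNode> },
}

fn main() {
    // A user turn containing a single content leaf.
    let turn = AstNode::Branch {
        role: Role::User,
        children: vec![AstNode::Leaf(NodeBody::Content("hello".into()))],
    };
    match &turn {
        AstNode::Branch { role, children } => {
            assert_eq!(*role, Role::User);
            assert_eq!(children.len(), 1);
        }
        AstNode::Leaf(_) => unreachable!(),
    }
    println!("ok");
}
```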

Moved tool call XML parsing from api/parsing.rs into context_new.rs
so all grammar knowledge lives in one place.

Tokenizer encode() now returns empty vec when uninitialized instead
of panicking, so tests work without the tokenizer file.

26 tests: XML parsing, incremental streaming (char-by-char feeding
exposed a lookahead bug, since fixed), rendering for all node types,
and tokenizer round-trip verification.
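
The char-by-char streaming tests presumably follow the pattern sketched below: feed the same input whole and one character at a time, and require identical results. The TagScanner here is a stand-in, not the real context_new.rs parser; an incomplete tag buffered across feeds is exactly the kind of lookahead case such a test catches.

```rust
// Stand-in incremental scanner: extracts complete <...> tags from a
// stream, buffering partial input until a tag can be closed.
#[derive(Default)]
struct TagScanner {
    buf: String,
    tags: Vec<String>,
}

impl TagScanner {
    fn feed(&mut self, chunk: &str) {
        self.buf.push_str(chunk);
        // Extract every complete <...> tag; keep incomplete input
        // buffered so a tag split across chunks still parses
        // (the lookahead case a char-by-char test exercises).
        while let Some(start) = self.buf.find('<') {
            match self.buf[start..].find('>') {
                Some(rel) => {
                    let end = start + rel;
                    self.tags.push(self.buf[start + 1..end].to_string());
                    self.buf.drain(..=end);
                }
                None => break, // incomplete tag: wait for more input
            }
        }
    }
}

fn scan(feeds: &[&str]) -> Vec<String> {
    let mut s = TagScanner::default();
    for f in feeds {
        s.feed(f);
    }
    s.tags
}

fn main() {
    let input = "a<tool_call>b</tool_call>";
    // Feeding the whole string and feeding char-by-char must agree.
    let whole = scan(&[input]);
    let chars: Vec<String> = input.chars().map(|c| c.to_string()).collect();
    let refs: Vec<&str> = chars.iter().map(|s| s.as_str()).collect();
    assert_eq!(whole, scan(&refs));
    println!("ok");
}
```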

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent Overstreet 2026-04-08 13:35:04 -04:00
parent 6730d136d4
commit f1397b7783
2 changed files with 752 additions and 339 deletions

@@ -25,17 +25,21 @@ pub fn init(path: &str) {
     TOKENIZER.set(t).ok();
 }
 
-/// Get the global tokenizer. Panics if not initialized.
-fn get() -> &'static Tokenizer {
-    TOKENIZER.get().expect("tokenizer not initialized — call tokenizer::init() first")
+/// Get the global tokenizer. Returns None if not initialized.
+fn get() -> Option<&'static Tokenizer> {
+    TOKENIZER.get()
 }
 
 /// Tokenize a raw string, returning token IDs.
+/// Returns empty vec if the tokenizer is not initialized.
 pub fn encode(text: &str) -> Vec<u32> {
-    get().encode(text, false)
-        .unwrap_or_else(|e| panic!("tokenization failed: {}", e))
-        .get_ids()
-        .to_vec()
+    match get() {
+        Some(t) => t.encode(text, false)
+            .unwrap_or_else(|e| panic!("tokenization failed: {}", e))
+            .get_ids()
+            .to_vec(),
+        None => vec![],
+    }
 }
 
 /// Tokenize a chat entry with template wrapping:

@@ -59,8 +63,11 @@ pub fn count(text: &str) -> usize {
 /// Decode token IDs back to text.
 pub fn decode(ids: &[u32]) -> String {
-    get().decode(ids, true)
-        .unwrap_or_else(|e| panic!("detokenization failed: {}", e))
+    match get() {
+        Some(t) => t.decode(ids, true)
+            .unwrap_or_else(|e| panic!("detokenization failed: {}", e)),
+        None => String::new(),
+    }
 }
 
 /// Check if the tokenizer is initialized.