Redesign context AST: typed NodeBody, Role as grammar roles, tests
Role is now just System/User/Assistant — maps 1:1 to the grammar.
Leaf types are NodeBody variants: Content, Thinking, ToolCall,
ToolResult, Memory, Dmn, Log. Each variant renders itself; no Role
needed on leaves. AstNode is Leaf(NodeLeaf) | Branch{role, children}.
ContextState holds four Vec<AstNode> sections directly.
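The shape described above can be sketched roughly as follows; everything beyond the names given in the message (the leaf payloads, NodeLeaf's fields, ContextState's four section names) is an assumption, not the actual source:

```rust
#![allow(dead_code)]

// Grammar roles only; no per-leaf role.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    System,
    User,
    Assistant,
}

// Leaf payloads; each variant renders itself. Payload types are guesses.
#[derive(Debug, Clone)]
enum NodeBody {
    Content(String),
    Thinking(String),
    ToolCall(String),
    ToolResult(String),
    Memory(String),
    Dmn(String),
    Log(String),
}

#[derive(Debug, Clone)]
struct NodeLeaf {
    body: NodeBody,
}

// A node is either a leaf or a role-tagged branch over children.
#[derive(Debug, Clone)]
enum AstNode {
    Leaf(NodeLeaf),
    Branch { role: Role, children: Vec<AstNode> },
}

// Four Vec<AstNode> sections held directly; the section names here are guesses.
struct ContextState {
    system: Vec<AstNode>,
    memory: Vec<AstNode>,
    history: Vec<AstNode>,
    scratch: Vec<AstNode>,
}

fn main() {
    let turn = AstNode::Branch {
        role: Role::User,
        children: vec![AstNode::Leaf(NodeLeaf {
            body: NodeBody::Content("hello".into()),
        })],
    };
    if let AstNode::Branch { role, children } = &turn {
        assert_eq!(*role, Role::User);
        assert_eq!(children.len(), 1);
    }
}
```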
Moved tool call XML parsing from api/parsing.rs into context_new.rs
so all grammar knowledge lives in one place.
Tokenizer encode() now returns empty vec when uninitialized instead
of panicking, so tests work without the tokenizer file.
26 tests: XML parsing, incremental streaming via char-by-char feeds
(which exposed a lookahead bug, now fixed), rendering for all node
types, and tokenizer round-trip verification.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
parent 6730d136d4
commit f1397b7783
2 changed files with 752 additions and 339 deletions
File diff suppressed because it is too large
@@ -25,17 +25,21 @@ pub fn init(path: &str) {
     TOKENIZER.set(t).ok();
 }
 
-/// Get the global tokenizer. Panics if not initialized.
-fn get() -> &'static Tokenizer {
-    TOKENIZER.get().expect("tokenizer not initialized — call tokenizer::init() first")
+/// Get the global tokenizer. Returns None if not initialized.
+fn get() -> Option<&'static Tokenizer> {
+    TOKENIZER.get()
 }
 
 /// Tokenize a raw string, returning token IDs.
+/// Returns empty vec if the tokenizer is not initialized.
 pub fn encode(text: &str) -> Vec<u32> {
-    get().encode(text, false)
-        .unwrap_or_else(|e| panic!("tokenization failed: {}", e))
-        .get_ids()
-        .to_vec()
+    match get() {
+        Some(t) => t.encode(text, false)
+            .unwrap_or_else(|e| panic!("tokenization failed: {}", e))
+            .get_ids()
+            .to_vec(),
+        None => vec![],
+    }
 }
 
 /// Tokenize a chat entry with template wrapping:
@@ -59,8 +63,11 @@ pub fn count(text: &str) -> usize {
 
 /// Decode token IDs back to text.
 pub fn decode(ids: &[u32]) -> String {
-    get().decode(ids, true)
-        .unwrap_or_else(|e| panic!("detokenization failed: {}", e))
+    match get() {
+        Some(t) => t.decode(ids, true)
+            .unwrap_or_else(|e| panic!("detokenization failed: {}", e)),
+        None => String::new(),
+    }
 }
 
 /// Check if the tokenizer is initialized.
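The fallback pattern in the diff above can be shown runnable with a stand-in tokenizer; `FakeTokenizer` and its byte-level `encode` are inventions for illustration (the real code wraps an actual tokenizer behind what appears to be a `OnceLock`-style global):

```rust
use std::sync::OnceLock;

// Stand-in for the real tokenizer type; encodes bytes as token IDs.
struct FakeTokenizer;
impl FakeTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }
}

static TOKENIZER: OnceLock<FakeTokenizer> = OnceLock::new();

/// Get the global tokenizer. Returns None if not initialized.
fn get() -> Option<&'static FakeTokenizer> {
    TOKENIZER.get()
}

/// Tokenize a raw string; empty vec when uninitialized, so tests
/// can run without the tokenizer file present.
pub fn encode(text: &str) -> Vec<u32> {
    match get() {
        Some(t) => t.encode(text),
        None => vec![],
    }
}

fn main() {
    assert_eq!(encode("hi"), Vec::<u32>::new()); // before init: empty
    TOKENIZER.set(FakeTokenizer).ok();
    assert_eq!(encode("hi"), vec![104, 105]); // after init: byte IDs
}
```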