Chester: Reimagining LLM Benchmarking Through Programming Language Design
Have you ever wondered how we can truly measure the creative capabilities of AI models? How do we test whether an LLM can think outside the box when faced with unfamiliar programming paradigms? Well, even if you didn’t ask those questions, I had enough free time to build something that might answer them. Bear with me.
Meet Chester: a dual-core system that combines hobbyist programming language design with a unique spin on LLM benchmarking. But this isn’t just another “Hello World” benchmark suite. Chester poses a fundamental question: when faced with a completely foreign programming language grammar, can AI models demonstrate true creative problem-solving?
Btw, if you want to check out the code, here’s the GitHub link. Feel free to tweak the code; feedback and PRs are welcome! Check out the Points of Improvement section for more details on contributing to the project.
The Architecture: Two Engines, One Vision
Chester operates as a dual-core system, each “core” serving a distinct purpose in the benchmarking ecosystem:
The Interpreted Language Core
At its heart, Chester is a clean, dynamically-typed programming language that borrows elegant syntax patterns from Python and JavaScript. The language supports fundamental programming constructs while maintaining simplicity:
```
let fibonacci = func(n)
    if (n <= 1) then
        return n
    else
        return fibonacci(n - 1) + fibonacci(n - 2)
    end
end

let numbers = [1, 2, 3, 4, 5]
for i = 0 to length(numbers) then
    print(numbers/i)
end
```
The language implements dynamic typing with runtime type checking, supporting numbers, strings, and lists as core data types. Functions are first-class citizens, and the syntax prioritizes readability over brevity—a deliberate design choice that makes the grammar more predictable for both humans and AI models.
The C Core
This core uses the good ol’ C programming language as a frame of reference for benchmarking. A nice thing is that the system can be extended to more than just two cores, providing multiple frames of reference and a more elaborate testing suite. However, for this project I am sticking to the one true god - ~~Rust~~ C.
The RAG-Based Transpilation Engine
Here’s where things get fascinating. The binding glue is a Retrieval-Augmented Generation (RAG) transpilation engine that takes C code as input and attempts to convert it into equivalent Chester code. Now you might ask why force AI into this when simple AST-based translations exist; why re-invent the wheel, right? Well, to answer your question: this isn’t simple syntax translation. It requires understanding algorithmic intent and adapting it to Chester’s funky paradigms. While Chester is Turing complete, it is still not stable enough for most programming paradigms, and there is no proper token-to-token matching. That is where the fun lies.
The engine maintains a knowledge base of C-to-Chester translation patterns, but when faced with novel code structures, it must creatively adapt solutions using Chester’s available constructs. This forces AI models to demonstrate genuine problem-solving rather than pattern matching.
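To make this concrete, here’s a minimal sketch (in TypeScript, since that’s what the interpreter is written in) of what a single entry in such a knowledge base could look like. The `TranslationPattern` shape, its field names, and the example pair are my own illustration, not the repo’s actual schema.

```typescript
// Hypothetical shape of one entry in the C-to-Chester knowledge base.
// Field names and the example pair are illustrative, not the repo's schema.
interface TranslationPattern {
  id: string;
  description: string;    // natural-language summary, useful for retrieval
  cSnippet: string;       // the C-side pattern
  chesterSnippet: string; // the equivalent Chester construct
}

const forLoopPattern: TranslationPattern = {
  id: "for-loop-basic",
  description: "Counting for-loop that walks an index from 0 to N",
  cSnippet: 'for (int i = 0; i < n; i++) { printf("%d\\n", arr[i]); }',
  chesterSnippet: [
    "for i = 0 to n then",
    "    printReturn(arr/i)",
    "end",
  ].join("\n"),
};

console.log(forLoopPattern.chesterSnippet);
```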
The Benchmarking Methodology
Traditional LLM benchmarks often test pattern recognition and memorization. Give an AI model a Python problem, and it can draw from millions of similar examples in its training data. But what happens when you present that same algorithmic challenge in a language the model has never encountered?
Chester’s benchmarking methodology works by:
- Input Translation: Taking functionally correct C code and challenging models to produce equivalent Chester implementations
- Iterative Testing: Both the original and generated code are run through multiple rounds, with their outputs compared for accuracy. The iteration count is a notable metric here.
- Multi-Model Comparison: Testing various models against identical challenges to identify true creative capabilities
Here’s a brief sample benchmark run: take a tiny C function (say, one computing factorial or Fibonacci), record its reference output, have each model produce the equivalent Chester code, and summarize the results in a small table like:

| Model | Iterations | ExactMatch | BLEU | Hallucination Index |
|---|---|---|---|---|
| OpenAI-GPT4 | 3 | 1.00 | 0.89 | 0.00 |
| Gemini-3n-4B | 5 | 0.00 | 0.45 | 0.30 |
ExactMatch is a binary check that tells us whether the generated Chester code produces exactly the same output as the reference C implementation; this ensures absolute functional correctness. BLEU (a common machine-translation score) measures n-gram overlap between the model’s output and the reference, so even if variable names differ or minor syntactic choices vary, we still capture partial similarity and fluency. Finally, the Hallucination Index quantifies how many lines in the generated code don’t appear in the reference solution; this reveals whether the model is inventing spurious or unnecessary constructs rather than following the intended algorithm.
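For the curious, here’s a rough TypeScript sketch of how these three numbers could be computed from two output strings and two code listings. This is a simplified illustration of my own (the BLEU here is only a unigram-precision stand-in, not full n-gram BLEU with a brevity penalty), not the exact scoring code in the repo.

```typescript
// Simplified scoring sketch; not the repo's actual implementation.

// ExactMatch: 1 if the generated program's output is identical to the
// reference C program's output (after trimming), else 0.
function exactMatch(referenceOutput: string, generatedOutput: string): number {
  return referenceOutput.trim() === generatedOutput.trim() ? 1 : 0;
}

// Crude BLEU-1 stand-in: fraction of generated tokens that also appear in
// the reference. Real BLEU uses clipped n-gram counts and a brevity penalty.
function unigramPrecision(referenceCode: string, generatedCode: string): number {
  const refTokens = new Set(referenceCode.split(/\s+/).filter(Boolean));
  const genTokens = generatedCode.split(/\s+/).filter(Boolean);
  if (genTokens.length === 0) return 0;
  const hits = genTokens.filter((t) => refTokens.has(t)).length;
  return hits / genTokens.length;
}

// Hallucination index: fraction of generated lines that never appear
// in the reference solution.
function hallucinationIndex(referenceCode: string, generatedCode: string): number {
  const refLines = new Set(
    referenceCode.split("\n").map((l) => l.trim()).filter(Boolean)
  );
  const genLines = generatedCode.split("\n").map((l) => l.trim()).filter(Boolean);
  if (genLines.length === 0) return 0;
  const novel = genLines.filter((l) => !refLines.has(l)).length;
  return novel / genLines.length;
}
```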
Here’s how the transpilation works:
- Knowledge base: The system maintains a curated database of C-to-Chester translation patterns, along with the official grammar for C and a similar one for Chester. This knowledge base serves as the retrieval component of the RAG system, giving the LLM enough context to generate and translate from C to Chester.
- Semantic analysis: When presented with C code, the engine performs semantic analysis to understand the code’s intent, data flow, and algorithmic approach. To be fair, C code isn’t English prose and tends to be deterministic, so there isn’t much left to interpretation; this part of the benchmark should ideally remain constant no matter which model is used.
- Retrieval and adaptation: The RAG system retrieves relevant translation patterns from its knowledge base, but here’s the crucial part: when exact matches don’t exist, it must creatively adapt existing patterns to handle novel situations. The knowledge base contains some examples of C-to-Chester translations, but when faced with advanced paradigms the model will have to make something up on its own - in true developer fashion, dare I say. A rough sketch of the retrieval-and-prompt step follows this list.
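Below is a hedged sketch of what that retrieval-and-prompt step might look like. The embedding function, similarity search, and prompt wording are all placeholders of my own; the engine’s actual vector store, embedding model, and prompt template live in the repo.

```typescript
// Illustrative retrieval + prompt assembly; names and prompt wording are
// placeholders, not the engine's actual vector store or template.

interface PatternEntry {
  description: string;   // what the pattern does, in plain English
  cSnippet: string;
  chesterSnippet: string;
  embedding: number[];   // precomputed embedding of the description
}

// Plain cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// `embed` stands in for whatever embedding model the vector store uses.
async function buildPrompt(
  cSource: string,
  knowledgeBase: PatternEntry[],
  chesterGrammar: string,
  embed: (text: string) => Promise<number[]>,
  topK = 3
): Promise<string> {
  const queryVec = await embed(cSource);

  // Retrieve the top-K most similar translation patterns.
  const retrieved = [...knowledgeBase]
    .sort((x, y) => cosine(queryVec, y.embedding) - cosine(queryVec, x.embedding))
    .slice(0, topK);

  const examples = retrieved
    .map((p) => `C:\n${p.cSnippet}\nChester:\n${p.chesterSnippet}`)
    .join("\n\n");

  // The grammar plus retrieved examples are the model's only context about
  // Chester; anything not covered must be improvised.
  return [
    "You are translating C code into the Chester language.",
    `Chester grammar:\n${chesterGrammar}`,
    `Relevant translation examples:\n${examples}`,
    `Translate the following C code into Chester:\n${cSource}`,
  ].join("\n\n");
}
```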
The Unique Constraint: Forced Creativity
What makes Chester’s benchmarking approach kind of new is its constraint-driven methodology. When an AI model encounters Chester for the first time, it cannot rely on memorized patterns. Instead, it must:
- Parse and understand Chester’s unique syntax rules
- Map familiar algorithmic concepts to unfamiliar language constructs
- Generate creative solutions when direct translations aren’t possible
- Adapt to Chester’s specific paradigms (like its `to` loop syntax or `then`/`end` blocks)
This creates a test of creative problem-solving ability, something traditional benchmarks struggle to measure effectively.
Multi-Model Testing Framework
The benchmarking suite tests multiple AI models simultaneously, including:
- Azure OpenAI
- OpenAI
- Gemini
- DeepSeek V3 Base
- Deepseek R1 0528 Qwen3 8B
- Sarvam AI: Sarvam-M
- Google: Gemma 3n 4B
- Meta: Llama 3.3 8B Instruct
- Microsoft: Phi 4 Reasoning Plus
- THUDM: GLM Z1 32B
The selection of models covers all the bases - from reasoning to code gen - while also being cheap (most are free on OpenRouter :)). Each model attempts the same C-to-Chester translation challenges, and their outputs are compared for correctness and efficiency.
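Conceptually, the harness just runs the same prompt through each model and scores the results. Here’s a hedged sketch of that loop: `ModelClient`, `runChester`, and the result shape are stand-ins of my own (the real harness wires up the actual provider SDKs), and the scoring helpers are the ones sketched in the metrics section above.

```typescript
// Stand-in client interface; the real harness talks to Azure OpenAI,
// OpenRouter, etc. Scoring helpers come from the earlier metrics sketch.

interface ModelClient {
  name: string;
  complete(prompt: string): Promise<string>; // returns generated Chester code
}

interface BenchmarkResult {
  model: string;
  exactMatch: number;
  bleu: number;
  hallucination: number;
}

async function benchmarkModels(
  models: ModelClient[],
  prompt: string,
  referenceOutput: string,  // output of the original C program
  referenceChester: string, // a known-good Chester translation
  runChester: (source: string) => Promise<string> // executes code via the interpreter
): Promise<BenchmarkResult[]> {
  const results: BenchmarkResult[] = [];
  for (const model of models) {
    const generated = await model.complete(prompt);
    const output = await runChester(generated).catch(() => "<runtime error>");
    results.push({
      model: model.name,
      exactMatch: exactMatch(referenceOutput, output),
      bleu: unigramPrecision(referenceChester, generated),
      hallucination: hallucinationIndex(referenceChester, generated),
    });
  }
  return results;
}
```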
Here’s a rough flow of the entire process: C source goes in, the engine retrieves the relevant grammar and translation patterns, the model under test generates Chester code, both programs are executed, and their outputs are compared over multiple iterations to produce the metrics above.
Language Design Philosophy: Simplicity with Purpose
Coming onto the language itself, Chester’s syntax reflects careful design decisions that serve the benchmarking mission:
Explicit Keywords
Where many languages use symbols, Chester uses words. Loop structures use `for i = 0 to N then` instead of `for(int i = 0; i < N; i++)`. This wordiness isn’t accidental—it forces AI models to understand semantic meaning rather than relying on familiar symbolic patterns.
Consistent Block Structure
Every control structure follows the same `keyword -> logic -> then -> body -> end` pattern. This consistency provides enough structure for AI models to learn the pattern while maintaining enough uniqueness to prevent direct code translation.
Dynamic Typing
Variables are dynamically typed but follow predictable rules. The `let` keyword introduces variables, functions are declared with `func`, and operations behave as expected. This balance provides creative freedom while maintaining logical consistency.
The Technical Deep Dive: Implementation Architecture
What can be written in JavaScript, will eventually be written in JavaScript… and then rewritten in TypeScript.
Chester’s interpreter follows classical language implementation patterns (tokenize, parse into an AST, then walk the tree and evaluate) and is built in TypeScript. The architecture consists of several key components.
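As a rough mental model of what “classical” means here, the sketch below shows a toy tree-walking evaluator in TypeScript. It is purely illustrative: the node types, names, and symbol-table shape are mine, not Chester’s actual classes.

```typescript
// Toy tree-walking evaluator, just to illustrate the classical pattern.
// These are NOT Chester's actual node types or class names.

type AstNode =
  | { kind: "number"; value: number }
  | { kind: "var"; name: string }
  | { kind: "binop"; op: "+" | "-" | "*" | "/"; left: AstNode; right: AstNode };

function evaluate(node: AstNode, symbols: Map<string, number>): number {
  switch (node.kind) {
    case "number":
      return node.value;
    case "var": {
      const v = symbols.get(node.name);
      if (v === undefined) throw new Error(`Undefined variable: ${node.name}`);
      return v;
    }
    case "binop": {
      const l = evaluate(node.left, symbols);
      const r = evaluate(node.right, symbols);
      if (node.op === "+") return l + r;
      if (node.op === "-") return l - r;
      if (node.op === "*") return l * r;
      return l / r;
    }
  }
}

// Evaluate (MATH_PI * 2) + 1 with MATH_PI predefined in the symbol table.
const symbols = new Map<string, number>([["MATH_PI", Math.PI]]);
const expr: AstNode = {
  kind: "binop",
  op: "+",
  left: {
    kind: "binop",
    op: "*",
    left: { kind: "var", name: "MATH_PI" },
    right: { kind: "number", value: 2 },
  },
  right: { kind: "number", value: 1 },
};
console.log(evaluate(expr, symbols)); // 7.283185307179586
```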
Built-In Function Library
Chester’s standard library exposes a set of native functions and a handful of predefined global variables, all initialized in `globalSymbolTable`. They are implemented in TypeScript (via the `BuiltInFunction` and `NumberValue` constructors) and made available by name in the interpreter’s top-level scope. Below is the complete list along with a brief description of each, including aliases and common pitfalls; a rough sketch of how this registration might look follows the list.
Predefined Global Variables
- `null`
  - Value: `NumberValue(0)`
  - Represents a null/zero value.
  - Pitfall: Using `null` in arithmetic yields `0`. If you intended an actual “no-value” marker, check explicitly for `null` rather than treating it like an uninitialized variable.
- `false`
  - Value: `NumberValue.false` (internally `NumberValue(0)`)
  - Used as the canonical boolean false.
  - Pitfall: Because `false` is internally `0`, any arithmetic comparison must use `==` or `!=` properly—for example, `if (false) then … end` always skips the branch.
- `true`
  - Value: `NumberValue.true` (internally `NumberValue(1)`)
  - Used as the canonical boolean true.
  - Pitfall: Likewise, `true` is `1`. Mixing `true` with integers in arithmetic (e.g. `true + 2`) coerces `true` to `1`, so the result is `3`.
- `MATH_PI`
  - Value: `NumberValue.MATH_PI` (≈ 3.141592653589793)
  - A constant for π.
  - Pitfall: Since Chester has no built-in trigonometry functions, you must import or implement them yourself if you need `sin()`, `cos()`, etc.
Built-in Functions
- `print(expr)`
  - Signature: `print(Expression) -> void`
  - Description: Serializes and writes the evaluated value of `expr` to standard output (with no trailing newline).
  - Behavior:
    - If `expr` is a number, prints its numeric string.
    - If `expr` is a string, prints the string literally.
    - If `expr` is a list, prints a bracketed, comma-separated representation.
    - If `expr` is `null` or undefined, prints `"null"`.
    - If `expr` is a function, prints something like `<function>`.
  - Pitfall: Doesn’t append a newline. To print with a trailing newline, use `printReturn(expr)` instead.
- `printReturn(expr)`
  - Signature: `printReturn(Expression) -> void`
  - Description: Same as `print(expr)` but appends a newline after printing.
  - Pitfall: Because it always adds `\n`, avoid chaining `printReturn` calls if you want to control spacing manually.
- `input()`
  - Signature: `input() -> String`
  - Description: Reads one line of text from standard input and returns it as a Chester string.
  - Pitfall: Blocks execution until the user enters a line. If you call `input()` inside a loop without a prompt, it may appear hung.
- `inputInt()`
  - Signature: `inputInt() -> Number`
  - Description: Reads one line of text from standard input and attempts to parse it as an integer.
  - Behavior:
    - On valid parse (e.g., `"42"`), returns `NumberValue(42)`.
    - On invalid parse (e.g., `"foo"` or `""`), throws a `RuntimeError: ParseError`.
  - Pitfall: Always validate user input before using it in arithmetic. Wrap `inputInt()` in a conditional or `try/catch`-style check if you expect non-numeric entries.
- `clear()` (alias: `cls()`)
  - Signature: `clear() -> void`
  - Description: Clears the console/terminal screen (host-language behavior).
  - Pitfall: In some environments (e.g., certain IDEs), `clear()` may have no visible effect. Rely on it only in REPLs or terminals known to support ANSI-based clears.
- `isNum(expr)`
  - Signature: `isNum(Expression) -> Boolean`
  - Description: Returns `true` if `expr` is a numeric value (including booleans) at runtime. Otherwise, returns `false`.
  - Pitfall: Since `true` and `false` are internally numbers (`1` and `0`), `isNum(true)` returns `true`. If you want to distinguish strictly integer vs. boolean, you must inspect the raw value yourself.
- `isStr(expr)`
  - Signature: `isStr(Expression) -> Boolean`
  - Description: Returns `true` if `expr` is a Chester string. Otherwise, returns `false`.
  - Pitfall: Chester strings are distinct from lists of characters. If you build a single-element list `["hello"]`, `isStr` returns `false`.
- `isList(expr)`
  - Signature: `isList(Expression) -> Boolean`
  - Description: Returns `true` if `expr` is a list (even if it’s empty). Otherwise, `false`.
  - Pitfall: An empty list `[]` yields `true`. If your code branches on `isList(x)`, ensure you also check `length(x) > 0` if you expect non-empty lists.
- `isFunc(expr)`
  - Signature: `isFunc(Expression) -> Boolean`
  - Description: Returns `true` if `expr` is a function closure. Otherwise, `false`.
  - Pitfall: If you pass a built-in function (e.g., `print`) to `isFunc`, it returns `true`. If you want to differentiate user-defined vs. built-in, you must inspect the function’s metadata.
- `append(list, value)`
  - Signature: `append(List, *) -> List`
  - Description: Returns a new list equal to `list` with `value` appended to its end. Does not mutate the original.
  - Pitfall: If the first argument is not a list, throws `RuntimeError: TypeError`. Because lists are dynamically typed, watch out for mixing types (e.g., `append(5, "x")` errors).
- `pop(list)`
  - Signature: `pop(List) -> *`
  - Description: Removes and returns the last element of the given list. Mutates the original list in place.
  - Behavior:
    - If `list` is non-empty, removes its last element and returns it.
    - If `list` is empty, throws `RuntimeError: IndexError`.
  - Pitfall: Because it mutates, avoid calling `pop` on a shared reference if you need to preserve the original list.
- `concat(list1, list2)`
  - Signature: `concat(List, List) -> List`
  - Description: Returns a new list formed by concatenating `list1` followed by `list2`. Does not mutate either argument.
  - Pitfall: If either `list1` or `list2` is not a list, throws `RuntimeError: TypeError`.
- `length(expr)`
  - Signature: `length(Expression) -> Number`
  - Description: If `expr` is a string or a list, returns its length. Otherwise, throws `RuntimeError: TypeError`.
  - Pitfall: Applying `length` to a number or function triggers a `TypeError`. If in doubt, guard with `isList(expr) or isStr(expr)`.
- `run(filename)`
  - Signature: `run(String) -> *`
  - Description: Attempts to locate and execute a file named `filename.ct`. Returns the value of its last expression.
  - Behavior:
    - If the file does not exist, throws `RuntimeError: FileNotFound`.
    - If the file has syntax errors, throws `RuntimeError: SyntaxError`.
    - If the file executes successfully, returns the result of its last expression, or `null` if empty.
  - Pitfall: Always include the `.ct` extension when calling from within Chester; otherwise, the interpreter still looks for `<filename>.ct`.
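To tie the list above back to the implementation: the predefined globals and built-ins are all seeded into `globalSymbolTable` at startup. The sketch below is a guess at the shape of that wiring, using the `BuiltInFunction` and `NumberValue` names mentioned earlier and assuming a Node.js environment; the repo’s actual constructors and symbol-table API may differ.

```typescript
// Hedged sketch of how globals and built-ins might be seeded into
// globalSymbolTable. Constructor and method names are assumptions based
// on the names above, not the repo's exact API.

class NumberValue {
  constructor(public value: number) {}
  static readonly false = new NumberValue(0);
  static readonly true = new NumberValue(1);
  static readonly MATH_PI = new NumberValue(Math.PI);
}

class BuiltInFunction {
  constructor(
    public name: string,
    public impl: (...args: unknown[]) => unknown
  ) {}
}

class SymbolTable {
  private symbols = new Map<string, unknown>();
  set(name: string, value: unknown): void {
    this.symbols.set(name, value);
  }
  get(name: string): unknown {
    return this.symbols.get(name);
  }
}

const globalSymbolTable = new SymbolTable();

// Predefined globals
globalSymbolTable.set("null", new NumberValue(0));
globalSymbolTable.set("false", NumberValue.false);
globalSymbolTable.set("true", NumberValue.true);
globalSymbolTable.set("MATH_PI", NumberValue.MATH_PI);

// A couple of built-ins registered by name (print omits the newline,
// printReturn adds one, matching the descriptions above).
globalSymbolTable.set(
  "print",
  new BuiltInFunction("print", (v) => process.stdout.write(String(v)))
);
globalSymbolTable.set(
  "printReturn",
  new BuiltInFunction("printReturn", (v) => console.log(String(v)))
);
```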
Points of Improvement
- First of all, one major variable for the scores will of course be the RAG. If the retrieval is not 100% accurate (which is kinda impossible for me at this stage), there is obviously going to be a visible effect on the results obtained. That’s where the open source aspect of this project works (I hope). This project is by no means finished and is constantly on the lookout for contributions, be they computational, code, or methodology related. Feel free to check the code out, add in your tweaks, and let me know what results you get for your particular combination of vector store + embeddings + LLM!
- There are possible improvements for the methodology itself, or even the inferences drawn. While BLEU, iteration count, and hallucination are the major outcomes I had in mind for this particular test, it might be helpful to add other parameters that tie this test into the results of others and create a cohesive final picture.
- Finally, the data might also be lacking here. Feel free to add more instances of C-to-Chester conversions to the data!
Conclusion
Well, in this entire charade a new programming language was written just to test out your models. Was it worth it? I’d say yeah (barely).
While there are established tests for how good a code generator modern LLMs are and how well they can beat humans at math, there still is no standard way of measuring the “creativity” of models when it comes to finding workarounds for fundamental concepts - here, the concept being programming.
As always, the code is open source and available for experimentation. Try challenging your favorite AI models with Chester’s unique constraints.