Chester: Reimagining LLM Benchmarking Through Programming Language Design
Have you ever wondered how we can truly measure the creative capabilities of AI models? How do we test whether an LLM can think outside the box when faced with unfamiliar programming paradigms? Well, even if you didn’t ask those questions, I had enough free time to build something that might answer them. Bear with me.
Meet Chester: a dual-core system that combines hobbyist programming language design with a unique spin on LLM benchmarking. But this isn’t just another “Hello World” benchmark suite. Chester poses a fundamental question: when faced with a completely foreign programming language grammar, can AI models demonstrate true creative problem-solving?
Btw, if you want to check out the code, here’s the GitHub link. Feel free to tweak the code; feedback and PRs are welcome! Check out the Points of Improvement section for more details on contributing to the project.
The Architecture: Two Engines, One Vision
Chester operates as a dual-core system, each “core” serving a distinct purpose in the benchmarking ecosystem:
The Interpreted Language Core
At its heart, Chester is a clean, dynamically-typed programming language that borrows elegant syntax patterns from Python and JavaScript. The language supports fundamental programming constructs while maintaining simplicity:
```
let fibonacci = func(n)
    if (n <= 1) then
        return n
    else
        return fibonacci(n - 1) + fibonacci(n - 2)
    end
end

let numbers = [1, 2, 3, 4, 5]
for i = 0 to length(numbers) then
    print(numbers/i)
end
```
The language implements dynamic typing with runtime type checking, supporting numbers, strings, and lists as core data types. Functions are first-class citizens, and the syntax prioritizes readability over brevity—a deliberate design choice that makes the grammar more predictable for both humans and AI models.
The C Core
This core uses the good ol’ C programming language as a frame of reference for benchmarking. A nice thing is that the system can be extended to more than just two cores, providing multiple frames of reference and a more elaborate testing suite. However, for this project I am sticking to the one true god - ~~Rust~~ C.
The RAG-Based Transpilation Engine
Here’s where things get fascinating. The binding glue is a Retrieval-Augmented Generation (RAG) transpilation engine that takes C code as input and attempts to convert it into equivalent Chester code. Now you might ask why force AI into this when simple AST-based translations exist; why re-invent the wheel, right? Well, to answer your question: this isn’t simple syntax translation. It requires understanding algorithmic intent and adapting it to Chester’s funky paradigms. While Chester is Turing complete, it is still not stable enough for most programming paradigms, and there is no proper token-to-token matching. That is where the fun lies.
The engine maintains a knowledge base of C-to-Chester translation patterns, but when faced with novel code structures, it must creatively adapt solutions using Chester’s available constructs. This forces AI models to demonstrate genuine problem-solving rather than pattern matching.
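To make this concrete, here’s a minimal sketch (in TypeScript, since that’s what the interpreter is written in) of what a single entry in such a knowledge base could look like. The `TranslationPattern` shape, its field names, and the example pair are my own illustration, not the repo’s actual schema.

```typescript
// Hypothetical shape of one entry in the C-to-Chester knowledge base.
// Field names and the example pair are illustrative, not the repo's schema.
interface TranslationPattern {
  id: string;
  description: string;    // natural-language summary, useful for retrieval
  cSnippet: string;       // the C-side pattern
  chesterSnippet: string; // the equivalent Chester construct
}

const forLoopPattern: TranslationPattern = {
  id: "for-loop-basic",
  description: "Counting for-loop that walks an index from 0 to N",
  cSnippet: 'for (int i = 0; i < n; i++) { printf("%d\\n", arr[i]); }',
  chesterSnippet: [
    "for i = 0 to n then",
    "    printReturn(arr/i)",
    "end",
  ].join("\n"),
};

console.log(forLoopPattern.chesterSnippet);
```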
The Benchmarking Methodology
Traditional LLM benchmarks often test pattern recognition and memorization. Give an AI model a Python problem, and it can draw from millions of similar examples in its training data. But what happens when you present that same algorithmic challenge in a language the model has never encountered?
Chester’s benchmarking methodology works by:
- Input Translation: Taking functionally correct C code and challenging models to produce equivalent Chester implementations
- Iterative Testing: Both the original and generated code are run through multiple rounds, with their outputs compared for accuracy. The iteration count is a notable metric here.
- Multi-Model Comparison: Testing various models against identical challenges to identify true creative capabilities
Here’s a brief sample benchmark run: take a tiny C function (say, one computing factorial or Fibonacci), record its reference output, have each model produce the equivalent Chester code, and summarize the results in a small table like:

| Model | Iterations | ExactMatch | BLEU | Hallucination Index |
|---|---|---|---|---|
| OpenAI-GPT4 | 3 | 1.00 | 0.89 | 0.00 |
| Gemini-3n-4B | 5 | 0.00 | 0.45 | 0.30 |
ExactMatch is a binary check that tells us whether the generated Chester code produces exactly the same output as the reference C implementation; this ensures absolute functional correctness. BLEU (a common machine-translation score) measures n-gram overlap between the model’s output and the reference, so even if variable names differ or minor syntactic choices vary, we still capture partial similarity and fluency. Finally, the Hallucination Index quantifies how many lines in the generated code don’t appear in the reference solution; this reveals whether the model is inventing spurious or unnecessary constructs rather than following the intended algorithm.
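For the curious, here’s a rough TypeScript sketch of how these three numbers could be computed from two output strings and two code listings. This is a simplified illustration of my own (the BLEU here is only a unigram-precision stand-in, not full n-gram BLEU with a brevity penalty), not the exact scoring code in the repo.

```typescript
// Simplified scoring sketch; not the repo's actual implementation.

// ExactMatch: 1 if the generated program's output is identical to the
// reference C program's output (after trimming), else 0.
function exactMatch(referenceOutput: string, generatedOutput: string): number {
  return referenceOutput.trim() === generatedOutput.trim() ? 1 : 0;
}

// Crude BLEU-1 stand-in: fraction of generated tokens that also appear in
// the reference. Real BLEU uses clipped n-gram counts and a brevity penalty.
function unigramPrecision(referenceCode: string, generatedCode: string): number {
  const refTokens = new Set(referenceCode.split(/\s+/).filter(Boolean));
  const genTokens = generatedCode.split(/\s+/).filter(Boolean);
  if (genTokens.length === 0) return 0;
  const hits = genTokens.filter((t) => refTokens.has(t)).length;
  return hits / genTokens.length;
}

// Hallucination index: fraction of generated lines that never appear
// in the reference solution.
function hallucinationIndex(referenceCode: string, generatedCode: string): number {
  const refLines = new Set(
    referenceCode.split("\n").map((l) => l.trim()).filter(Boolean)
  );
  const genLines = generatedCode.split("\n").map((l) => l.trim()).filter(Boolean);
  if (genLines.length === 0) return 0;
  const novel = genLines.filter((l) => !refLines.has(l)).length;
  return novel / genLines.length;
}
```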
Here’s how the transpilation works:
- Knowledge base: The system maintains a curated database of C-to-Chester translation patterns, along with the official grammar for C and a similar one for Chester. This knowledge base serves as the retrieval component of the RAG system, giving the LLM enough context to generate and translate from C to Chester.
- Semantic analysis: When presented with C code, the engine performs semantic analysis to understand the code’s intent, data flow, and algorithmic approach. To be fair, C code isn’t English prose and tends to be deterministic, so there isn’t much left to interpretation; this part of the benchmark should ideally remain constant no matter which model is used.
- Retrieval and adaptation: The RAG system retrieves relevant translation patterns from its knowledge base, but here’s the crucial part: when exact matches don’t exist, it must creatively adapt existing patterns to handle novel situations. The knowledge base contains some examples of C-to-Chester translations, but when faced with advanced paradigms the model will have to make something up on its own - in true developer fashion, dare I say. A rough sketch of the retrieval-and-prompt step follows this list.
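Below is a hedged sketch of what that retrieval-and-prompt step might look like. The embedding function, similarity search, and prompt wording are all placeholders of my own; the engine’s actual vector store, embedding model, and prompt template live in the repo.

```typescript
// Illustrative retrieval + prompt assembly; names and prompt wording are
// placeholders, not the engine's actual vector store or template.

interface PatternEntry {
  description: string;   // what the pattern does, in plain English
  cSnippet: string;
  chesterSnippet: string;
  embedding: number[];   // precomputed embedding of the description
}

// Plain cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// `embed` stands in for whatever embedding model the vector store uses.
async function buildPrompt(
  cSource: string,
  knowledgeBase: PatternEntry[],
  chesterGrammar: string,
  embed: (text: string) => Promise<number[]>,
  topK = 3
): Promise<string> {
  const queryVec = await embed(cSource);

  // Retrieve the top-K most similar translation patterns.
  const retrieved = [...knowledgeBase]
    .sort((x, y) => cosine(queryVec, y.embedding) - cosine(queryVec, x.embedding))
    .slice(0, topK);

  const examples = retrieved
    .map((p) => `C:\n${p.cSnippet}\nChester:\n${p.chesterSnippet}`)
    .join("\n\n");

  // The grammar plus retrieved examples are the model's only context about
  // Chester; anything not covered must be improvised.
  return [
    "You are translating C code into the Chester language.",
    `Chester grammar:\n${chesterGrammar}`,
    `Relevant translation examples:\n${examples}`,
    `Translate the following C code into Chester:\n${cSource}`,
  ].join("\n\n");
}
```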
The Unique Constraint: Forced Creativity
What makes Chester’s benchmarking approach kind of new is its constraint-driven methodology. When an AI model encounters Chester for the first time, it cannot rely on memorized patterns. Instead, it must:
- Parse and understand Chester’s unique syntax rules
- Map familiar algorithmic concepts to unfamiliar language constructs
- Generate creative solutions when direct translations aren’t possible
- Adapt to Chester’s specific paradigms (like its `to` loop syntax or `then`/`end` blocks)
This creates a test of creative problem-solving ability, something traditional benchmarks struggle to measure effectively.
Multi-Model Testing Framework
The benchmarking suite tests multiple AI models simultaneously, including:
- Azure OpenAI
- OpenAI
- Gemini
- DeepSeek V3 Base
- Deepseek R1 0528 Qwen3 8B
- Sarvam AI: Sarvam-M
- Google: Gemma 3n 4B
- Meta: Llama 3.3 8B Instruct
- Microsoft: Phi 4 Reasoning Plus
- THUDM: GLM Z1 32B
The selection of models covers all the bases - from reasoning to code gen - while also being cheap (most are free on OpenRouter :)). Each model attempts the same C-to-Chester translation challenges, and their outputs are compared for correctness and efficiency.
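Conceptually, the harness just runs the same prompt through each model and scores the results. Here’s a hedged sketch of that loop: `ModelClient`, `runChester`, and the result shape are stand-ins of my own (the real harness wires up the actual provider SDKs), and the scoring helpers are the ones sketched in the metrics section above.

```typescript
// Stand-in client interface; the real harness talks to Azure OpenAI,
// OpenRouter, etc. Scoring helpers come from the earlier metrics sketch.

interface ModelClient {
  name: string;
  complete(prompt: string): Promise<string>; // returns generated Chester code
}

interface BenchmarkResult {
  model: string;
  exactMatch: number;
  bleu: number;
  hallucination: number;
}

async function benchmarkModels(
  models: ModelClient[],
  prompt: string,
  referenceOutput: string,  // output of the original C program
  referenceChester: string, // a known-good Chester translation
  runChester: (source: string) => Promise<string> // executes code via the interpreter
): Promise<BenchmarkResult[]> {
  const results: BenchmarkResult[] = [];
  for (const model of models) {
    const generated = await model.complete(prompt);
    const output = await runChester(generated).catch(() => "<runtime error>");
    results.push({
      model: model.name,
      exactMatch: exactMatch(referenceOutput, output),
      bleu: unigramPrecision(referenceChester, generated),
      hallucination: hallucinationIndex(referenceChester, generated),
    });
  }
  return results;
}
```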
Here’s a rough flow of the entire process: C source goes in, the engine retrieves the relevant grammar and translation patterns, the model under test generates Chester code, both programs are executed, and their outputs are compared over multiple iterations to produce the metrics above.
Language Design Philosophy: Simplicity with Purpose
Coming onto the language itself, Chester’s syntax reflects careful design decisions that serve the benchmarking mission:
Explicit Keywords
Where many languages use symbols, Chester uses words. Loop structures use `for i = 0 to N then` instead of `for(int i = 0; i < N; i++)`. This wordiness isn’t accidental—it forces AI models to understand semantic meaning rather than relying on familiar symbolic patterns.
Consistent Block Structure
Every control structure follows the same `keyword -> logic -> then -> body -> end` pattern. This consistency provides enough structure for AI models to learn the pattern while maintaining enough uniqueness to prevent direct code translation.
Dynamic Typing
Variables are dynamically typed but follow predictable rules. The `let` keyword introduces variables, functions are declared with `func`, and operations behave as expected. This balance provides creative freedom while maintaining logical consistency.
The Technical Deep Dive: Implementation Architecture
What can be written in JavaScript, will eventually be written in JavaScript… and then rewritten in TypeScript.
Chester’s interpreter follows classical language implementation patterns (tokenize, parse into an AST, then walk the tree and evaluate) and is built in TypeScript. The architecture consists of several key components.
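As a rough mental model of what “classical” means here, the sketch below shows a toy tree-walking evaluator in TypeScript. It is purely illustrative: the node types, names, and symbol-table shape are mine, not Chester’s actual classes.

```typescript
// Toy tree-walking evaluator, just to illustrate the classical pattern.
// These are NOT Chester's actual node types or class names.

type AstNode =
  | { kind: "number"; value: number }
  | { kind: "var"; name: string }
  | { kind: "binop"; op: "+" | "-" | "*" | "/"; left: AstNode; right: AstNode };

function evaluate(node: AstNode, symbols: Map<string, number>): number {
  switch (node.kind) {
    case "number":
      return node.value;
    case "var": {
      const v = symbols.get(node.name);
      if (v === undefined) throw new Error(`Undefined variable: ${node.name}`);
      return v;
    }
    case "binop": {
      const l = evaluate(node.left, symbols);
      const r = evaluate(node.right, symbols);
      if (node.op === "+") return l + r;
      if (node.op === "-") return l - r;
      if (node.op === "*") return l * r;
      return l / r;
    }
  }
}

// Evaluate (MATH_PI * 2) + 1 with MATH_PI predefined in the symbol table.
const symbols = new Map<string, number>([["MATH_PI", Math.PI]]);
const expr: AstNode = {
  kind: "binop",
  op: "+",
  left: {
    kind: "binop",
    op: "*",
    left: { kind: "var", name: "MATH_PI" },
    right: { kind: "number", value: 2 },
  },
  right: { kind: "number", value: 1 },
};
console.log(evaluate(expr, symbols)); // 7.283185307179586
```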
Built-In Function Library
Chester’s standard library exposes a set of native functions and a handful of predefined global variables, all initialized in `globalSymbolTable`. They are implemented in TypeScript (via the `BuiltInFunction` and `NumberValue` constructors) and made available by name in the interpreter’s top-level scope. Below is the complete list along with a brief description of each, including aliases and common pitfalls; a rough sketch of how this registration might look follows the list.
Predefined Global Variables
- `null`
  - Value: `NumberValue(0)`
  - Represents a null/zero value.
  - Pitfall: Using `null` in arithmetic yields `0`. If you intended an actual “no-value” marker, check explicitly for `null` rather than treating it like an uninitialized variable.
- `false`
  - Value: `NumberValue.false` (internally `NumberValue(0)`)
  - Used as the canonical boolean false.
  - Pitfall: Because `false` is internally `0`, any arithmetic comparison must use `==` or `!=` properly—for example, `if (false) then … end` always skips the branch.
- `true`
  - Value: `NumberValue.true` (internally `NumberValue(1)`)
  - Used as the canonical boolean true.
  - Pitfall: Likewise, `true` is `1`. Mixing `true` with integers in arithmetic (e.g. `true + 2`) coerces `true` to `1`, so the result is `3`.
- `MATH_PI`
  - Value: `NumberValue.MATH_PI` (≈ 3.141592653589793)
  - A constant for π.
  - Pitfall: Since Chester has no built-in trigonometry functions, you must import or implement them yourself if you need `sin()`, `cos()`, etc.
Built-in Functions
- `print(expr)`
  - Signature: `print(Expression) -> void`
  - Description: Serializes and writes the evaluated value of `expr` to standard output (with no trailing newline).
  - Behavior:
    - If `expr` is a number, prints its numeric string.
    - If `expr` is a string, prints the string literally.
    - If `expr` is a list, prints a bracketed, comma-separated representation.
    - If `expr` is `null` or undefined, prints `"null"`.
    - If `expr` is a function, prints something like `<function>`.
  - Pitfall: Doesn’t append a newline. To print with a trailing newline, use `printReturn(expr)` instead.
- `printReturn(expr)`
  - Signature: `printReturn(Expression) -> void`
  - Description: Same as `print(expr)` but appends a newline after printing.
  - Pitfall: Because it always adds `\n`, avoid chaining `printReturn` calls if you want to control spacing manually.
- `input()`
  - Signature: `input() -> String`
  - Description: Reads one line of text from standard input and returns it as a Chester string.
  - Pitfall: Blocks execution until the user enters a line. If you call `input()` inside a loop without a prompt, it may appear hung.
- `inputInt()`
  - Signature: `inputInt() -> Number`
  - Description: Reads one line of text from standard input and attempts to parse it as an integer.
  - Behavior:
    - On valid parse (e.g., `"42"`), returns `NumberValue(42)`.
    - On invalid parse (e.g., `"foo"` or `""`), throws a `RuntimeError: ParseError`.
  - Pitfall: Always validate user input before using it in arithmetic. Wrap `inputInt()` in a conditional or `try/catch`-style check if you expect non-numeric entries.
- `clear()` (alias: `cls()`)
  - Signature: `clear() -> void`
  - Description: Clears the console/terminal screen (host-language behavior).
  - Pitfall: In some environments (e.g., certain IDEs), `clear()` may have no visible effect. Rely on it only in REPLs or terminals known to support ANSI-based clears.
- `isNum(expr)`
  - Signature: `isNum(Expression) -> Boolean`
  - Description: Returns `true` if `expr` is a numeric value (including booleans) at runtime. Otherwise, returns `false`.
  - Pitfall: Since `true` and `false` are internally numbers (`1` and `0`), `isNum(true)` returns `true`. If you want to distinguish strictly integer vs. boolean, you must inspect the raw value yourself.
- `isStr(expr)`
  - Signature: `isStr(Expression) -> Boolean`
  - Description: Returns `true` if `expr` is a Chester string. Otherwise, returns `false`.
  - Pitfall: Chester strings are distinct from lists of characters. If you build a single-element list `["hello"]`, `isStr` returns `false`.
- `isList(expr)`
  - Signature: `isList(Expression) -> Boolean`
  - Description: Returns `true` if `expr` is a list (even if it’s empty). Otherwise, `false`.
  - Pitfall: An empty list `[]` yields `true`. If your code branches on `isList(x)`, ensure you also check `length(x) > 0` if you expect non-empty lists.
- `isFunc(expr)`
  - Signature: `isFunc(Expression) -> Boolean`
  - Description: Returns `true` if `expr` is a function closure. Otherwise, `false`.
  - Pitfall: If you pass a built-in function (e.g., `print`) to `isFunc`, it returns `true`. If you want to differentiate user-defined vs. built-in, you must inspect the function’s metadata.
- `append(list, value)`
  - Signature: `append(List, *) -> List`
  - Description: Returns a new list equal to `list` with `value` appended to its end. Does not mutate the original.
  - Pitfall: If the first argument is not a list, throws `RuntimeError: TypeError`. Because lists are dynamically typed, watch out for mixing types (e.g., `append(5, "x")` errors).
- `pop(list)`
  - Signature: `pop(List) -> *`
  - Description: Removes and returns the last element of the given list. Mutates the original list in place.
  - Behavior:
    - If `list` is non-empty, removes its last element and returns it.
    - If `list` is empty, throws `RuntimeError: IndexError`.
  - Pitfall: Because it mutates, avoid calling `pop` on a shared reference if you need to preserve the original list.
- `concat(list1, list2)`
  - Signature: `concat(List, List) -> List`
  - Description: Returns a new list formed by concatenating `list1` followed by `list2`. Does not mutate either argument.
  - Pitfall: If either `list1` or `list2` is not a list, throws `RuntimeError: TypeError`.
- `length(expr)`
  - Signature: `length(Expression) -> Number`
  - Description: If `expr` is a string or a list, returns its length. Otherwise, throws `RuntimeError: TypeError`.
  - Pitfall: Applying `length` to a number or function triggers a `TypeError`. If in doubt, guard with `isList(expr) or isStr(expr)`.
- `run(filename)`
  - Signature: `run(String) -> *`
  - Description: Attempts to locate and execute a file named `filename.ct`. Returns the value of its last expression.
  - Behavior:
    - If the file does not exist, throws `RuntimeError: FileNotFound`.
    - If the file has syntax errors, throws `RuntimeError: SyntaxError`.
    - If the file executes successfully, returns the result of its last expression, or `null` if empty.
  - Pitfall: Always include the `.ct` extension when calling from within Chester; otherwise, the interpreter still looks for `<filename>.ct`.
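To tie the list above back to the implementation: the predefined globals and built-ins are all seeded into `globalSymbolTable` at startup. The sketch below is a guess at the shape of that wiring, using the `BuiltInFunction` and `NumberValue` names mentioned earlier and assuming a Node.js environment; the repo’s actual constructors and symbol-table API may differ.

```typescript
// Hedged sketch of how globals and built-ins might be seeded into
// globalSymbolTable. Constructor and method names are assumptions based
// on the names above, not the repo's exact API.

class NumberValue {
  constructor(public value: number) {}
  static readonly false = new NumberValue(0);
  static readonly true = new NumberValue(1);
  static readonly MATH_PI = new NumberValue(Math.PI);
}

class BuiltInFunction {
  constructor(
    public name: string,
    public impl: (...args: unknown[]) => unknown
  ) {}
}

class SymbolTable {
  private symbols = new Map<string, unknown>();
  set(name: string, value: unknown): void {
    this.symbols.set(name, value);
  }
  get(name: string): unknown {
    return this.symbols.get(name);
  }
}

const globalSymbolTable = new SymbolTable();

// Predefined globals
globalSymbolTable.set("null", new NumberValue(0));
globalSymbolTable.set("false", NumberValue.false);
globalSymbolTable.set("true", NumberValue.true);
globalSymbolTable.set("MATH_PI", NumberValue.MATH_PI);

// A couple of built-ins registered by name (print omits the newline,
// printReturn adds one, matching the descriptions above).
globalSymbolTable.set(
  "print",
  new BuiltInFunction("print", (v) => process.stdout.write(String(v)))
);
globalSymbolTable.set(
  "printReturn",
  new BuiltInFunction("printReturn", (v) => console.log(String(v)))
);
```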
Points of Improvement
- First of all, one major variable for the scores will of course be the RAG. If the retrieval is not 100% accurate (which is kinda impossible for me at this stage), there is obviously going to be a visible effect on the results obtained. That’s where the open source aspect of this project works (I hope). This project is by no means finished and is constantly on the lookout for contributions, be they computational, code, or methodology related. Feel free to check the code out, add in your tweaks, and let me know what results you get for your particular combination of vector store + embeddings + LLM!
- There are possible improvements for the methodology itself, or even the inferences drawn. While BLEU, iteration count, and hallucination are the major outcomes I had in mind for this particular test, it might be helpful to add other parameters that tie this test into the results of others and create a cohesive final picture.
- Finally, the data might also be lacking here. Feel free to add more instances of C-to-Chester conversions to the data!
Conclusion
Well, in this entire charade a new programming language was written just to test out your models. Was it worth it? I’d say yeah (barely).
While there are established tests for how good a code generator modern LLMs are and how well they can beat humans at math, there still is no standard way of measuring the “creativity” of models when it comes to finding workarounds for fundamental concepts - here, the concept being programming.
As always, the code is open source and available for experimentation. Try challenging your favorite AI models with Chester’s unique constraints.