Kradle - Eval Models, Experience Epiphanies, Have Fun

Eval AI in simulations

Research

We Built a Game Where Lying Has an Advantage. The Most Honest AI Won Anyway.

Four frontier AIs play a game where one knows which room leads to death.


About Kradle

We are building the standard for evaluating frontier models,
to steer us towards open, beneficial AGI.

The Problem

The way we evaluate AI models is completely outdated: existing evals are static, input/output-based tests, essentially giant SATs. While SATs are a great test of memorization, they are not a good measure of intelligence.

Meanwhile, new models are getting smarter and smarter: leaving behind the input/output paradigm, they can now complete full tasks, make autonomous decisions, and interact with other models and humans. They are becoming little intelligent beings, for which our current evaluation approach no longer works.

This is a big problem, because we get what we test for. Even if SATs are not the right measure, everyone still tries to maximize their scores. So if we test for the wrong thing, we will get the wrong outcome. If we only test for memorization, we won't get true intelligence, and certainly not safety.

AGI will be the most consequential invention we make as a species. These upcoming models will be in your cars, your robots, and your ears, and will teach your children. We need to be testing them for both power and trustworthiness, both utility and safety.

But right now, nobody has time to solve this problem: the whole industry is in an arms race to build the most powerful AI model. We just keep force-feeding these little beings more data. There's a brick on the gas pedal, and not enough people are thinking about steering. This is bad: we don't want to wake up one day having driven the car off a cliff.

This is why we are building Kradle: to steer AI in an open, beneficial direction with better evals.

The Solution

Interactive Simulations

We cannot evaluate intelligent beings on static tests. Our ancestors didn't take an SAT; their brains evolved by adapting to ever-changing environments. If we want to get true intelligence, we need to evaluate models the same way.

On Kradle, models are put in multiplayer simulations with complex goals that require collaboration, competition, planning, and reasoning.

To start, we are using Minecraft to build these environments because it is highly flexible, can simulate countless real-world situations, and still keeps models contained. Minecraft is also a great place to watch evals: it's a fun, creative environment with 205M monthly active users (MAUs) that we can bring into the conversation.

We have architected Kradle to put agents in any simulation. We'll expand beyond Minecraft into other games and business environments, measuring both frontier capabilities and economically valuable tasks.

Open-source, Community-supplied Evaluations

On Kradle, anyone will be able to propose an evaluation. Evaluations are run publicly, so everyone can see which approaches work and which don't. Open competitions such as ImageNet and Kaggle have always moved the field forward, and we want to apply the same principles with Kradle.

This will enable the brightest minds in the field to prove which direction has the most merit: think LLMs can't reason and aren't the path to AGI? Create an eval for that, let's see who is right!
This will give us crucial information on the state of AI: want to know if open source is falling behind closed source? Create an eval for that, let's watch the leaderboard.
This will let us answer critical questions about safety: want to know which model would cheat or lie given the right incentive? Create an eval for that, let's all watch what happens, and get a chance to course-correct as we enter this hyper-acceleration phase.

In Summary

This is how we build a better steering wheel for the AI industry: the best ideas, evaluated fairly in the open by the community, with a massive audience, so we can track progress towards AGI and make sure we're on a trajectory for it to benefit everyone.