We built AstaBench to give the field a shared, transparent way to measure whether AI can do rigorous scientific work.
We’re pleased to see adoption by @AISecurityInst via Inspect Evals and by @GenReasoning, which added an AstaBench task to OpenReward.
🎉 We're now supporting the Agent Data Protocol as a default agentic trajectory format.
Any trajectories you log to @OpenReward can be exported in the ADP format.
Thanks to @gneubig and @yueqi_song for the collaboration!
🧪 We’re experimenting with new features that allow for easier sampling with popular agentic harnesses.
Core use cases:
- Collecting diverse agentic midtraining data
- Evaluating the latest models on agentic environments
Try it out!
🔥🐴 Firehorse.
Run any model with any harness on any @OpenReward environment.
⚖️ Evaluate the latest models on environment endpoints.
🗂️ Collect agentic data for midtraining and SFT from open models.
🧪 Early experimental library. More support soon.
Link below.
🎲 Introducing KellyBench, a new long-horizon evaluation for frontier models.
KellyBench evaluates models within a year-long sports betting market, a challenging and highly non-stationary environment.
Every frontier model we test loses money. They struggle to design ML
Recently, I integrated @OpenReward into SkyRL (@NovaSkyAI), including an example demonstrating training with @modal. To verify the code, I ran several experiments—which proved to be a highly enriching experience! 😋
github.com/NovaSky-AI/Sky…
timelapse 27 :)
- submitted the rust reasoning algo env to meta rl hack (actually built a python one first, then moved to the rust one); created a rust dataset of around 1000 problems, will grow it to 2.5k next
- defined the whole reward logic (not optimal, i think) and designed the way validation works; will refine it & push to @PrimeIntellect & @OpenReward envs.
- have some other tasks as well, deadline is tomorrow so need to finish this
- this week was pretty rough, like peak locked in, so will chill & just relax for a few days
Introducing GLM-5.1: The Next Level of Open Source
- Top-Tier Performance: #1 in open source and #3 globally across SWE-Bench Pro, Terminal-Bench, and NL2Repo.
- Built for Long-Horizon Tasks: Runs autonomously for 8 hours, refining strategies through thousands of iterations.
🌍 Environments of the Week
The theme this week: environments for science 👩‍🔬.
First up, LLM-SR Bench by @ParshinShojaee et al is an environment for evaluating language model agents on scientific equation discovery tasks.
openreward.ai/parshinsh/llms…
We've had a lot of fun building this benchmark (asking LLMs to run a startup), which gives the clearest signal on LLMs' "long-term coherence" ability. We observe that frontier models have significant variance on this benchmark, showing that long-term execution is still
🪐 Researcher Credits
We’re announcing researcher credits for OpenReward: helping researchers develop the next generation of environments and evaluations.
Read more and apply below.
gr.inc/releases/resea…
🌍 Environments of the Week
It's been a week since we launched @OpenReward. Here are some of our favourite environments this week - some newly added, some heavily used, and some hidden gems.
First, the most used environment of the week is EndlessTerminals by @gandhikanishk with 830k+ tool calls.
openreward.ai/kanishk/Endles…
🧵
Cool idea from @AashaySachdeva: unified environment interfaces like @OpenReward can enable LLM meta-learning research!
Pleased with where things are going, with more parts of the stack accessible publicly. For example, I now look forward to weekly @tinkerapi roundups as much as John Oliver episodes!
Played around with this. This was exactly something I was looking for!
Tried a few things -
Creating an env - pretty dope! end to end claude was able to port it from github with only minor issues. One shotted @ShashwatGoel7 OpenForecaster env here. A lot more people should contribute their own envs. I hope they launch monetisation here.
Running a curator over env tasks during RL - When there are so many tasks, which one should you focus on? This is the auto-curriculum/meta-learning bit. I am still not able to beat random/pass@k, but I think the signals are there, and over the long run this will help with diversity. This obviously follows a power law: every run will have top envs dominating, but I feel those 20% random tasks will give a big boost to any model.
optimise the GEPA optimiser - gepa is great but pretty slow. What if we could teach a model to do this better? This was in my list for so long, finally with openreward was able to attempt it.
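The curator idea above (picking which env tasks to focus on during RL while keeping a share of random tasks) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's implementation and not an OpenReward API: `TaskCurator`, `explore_frac`, and the `p * (1 - p)` score for binary rewards are all hypothetical choices. The 20% random share mirrors the figure in the post.

```python
import random
from collections import defaultdict


class TaskCurator:
    """Hypothetical epsilon-greedy task sampler (illustrative, not an OpenReward API)."""

    def __init__(self, task_ids, explore_frac=0.2, seed=0):
        self.task_ids = list(task_ids)
        self.explore_frac = explore_frac  # share of uniformly random picks
        self.rng = random.Random(seed)
        self.reward_sum = defaultdict(float)
        self.count = defaultdict(int)

    def score(self, task_id):
        # Assumption: rewards are in [0, 1]. p * (1 - p) is highest for tasks
        # the model solves about half the time; unseen tasks score +inf so
        # each one gets tried at least once.
        n = self.count[task_id]
        if n == 0:
            return float("inf")
        p = self.reward_sum[task_id] / n
        return p * (1.0 - p)

    def sample(self):
        # With probability explore_frac pick a random task, else the top-scored one.
        if self.rng.random() < self.explore_frac:
            return self.rng.choice(self.task_ids)
        return max(self.task_ids, key=self.score)

    def update(self, task_id, reward):
        # Record the reward observed for one rollout on this task.
        self.reward_sum[task_id] += reward
        self.count[task_id] += 1
```

In a training loop, one would call `sample()` to choose the next task, run a rollout, and pass the resulting reward to `update()`. Whether a scheme like this beats plain random sampling is exactly the open question raised in the post.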
Introducing OpenReward.
🌍 330+ RL environments through one API
⚡ Autoscaled sandbox compute
🍒 4.5M+ unique RL tasks
🚂 Works like magic with Tinker, Miles, Slime
Link and thread below.
.@benchflow_ai started in 09/24 as unity for benchmarks and a hosting hub with early users from Stanford and Princeton. 4 months before R1 dropped
We stopped after 9 months with 0 traction.
Today our latest work SkillsBench is #1 trending on @OpenReward. Game of eval is just on
OpenReward serves hundreds of RL environments through a single API with autoscaled compute. Plug into Tinker to train agents on millions of tasks from anywhere.
x.com/GenReasoning/s…
🤝 OpenReward is interoperable with any training library.
Here we use the SETA environment by @Eigent_AI. We use @tinkerapi for model compute and @OpenReward for environment compute.
This allows you to run agentic RL training from a laptop.
github.com/OpenRewardAI/o….