
ResearchAudio.io

8 RL Steps. 4.5x Smarter Agent. Zero Labels.

Princeton recovers discarded interaction data as live training signal.

Every time your AI agent handles a conversation, runs a terminal command, or calls an API, the environment responds. A user corrects the output. An error trace appears. A test passes. That response encodes exactly how well the agent performed, and often, what it should have done differently. Then the system throws it away.

A team at Princeton (including Mengdi Wang) calls this systematic waste. Their new framework, OpenClaw-RL, recovers these "next-state signals" as a live, online training source. The result: a personal agent's alignment score jumped from 0.17 to 0.76 in just 8 training steps, with zero human annotations.

The Core Insight: Two Types of Wasted Signal

Existing agentic RL systems treat the response after each action purely as context for the next step. OpenClaw-RL's key observation is that next-state signals are actually universal across all interaction types, and they encode two distinct forms of recoverable information.

Evaluative signals tell the system how well an action performed. A user re-query means the first answer was unsatisfactory. A passing test means the code was correct. A PRM (Process Reward Model) judge converts these into scalar scores using majority voting across multiple evaluations.
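As a minimal sketch of the evaluative path, majority voting over several judge samples can be collapsed into one scalar score. The `good`/`bad` verdict labels and the fraction-based score below are illustrative assumptions, not the paper's exact scoring rule:

```python
from collections import Counter

def prm_majority_score(judgments: list[str]) -> float:
    """Collapse multiple PRM judge verdicts into one scalar score.
    Here each verdict is assumed to be a 'good'/'bad' label, and the
    score is the fraction voting 'good' (illustrative only)."""
    if not judgments:
        return 0.0
    counts = Counter(judgments)
    return counts.get("good", 0) / len(judgments)

# A user re-query usually makes most sampled judgments vote 'bad'.
score = prm_majority_score(["bad", "bad", "good", "bad", "bad"])
```

Sampling the judge several times and voting smooths out the noise of any single evaluation, which matters when the score is the only training signal for that turn.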

Directive signals go deeper. They reveal not just "bad" but "how it should have been different." OpenClaw-RL extracts these through a technique called Hindsight-Guided On-Policy Distillation (OPD). A judge model reads the next-state feedback, distills a 1-to-3 sentence correction hint, and appends that hint to the original prompt. The same model then recalculates its token-level probabilities with this enhanced context, producing a directional advantage signal richer than any scalar score.
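The OPD mechanic can be sketched in a few lines: re-score the same response tokens with the judge's hint appended to the prompt, and treat the per-token log-probability shift as the directional advantage. The toy `token_logprobs` scorer below stands in for a real language model, and the hint wording is invented for illustration:

```python
import math

def token_logprobs(prompt: str, response_tokens: list[str]) -> list[float]:
    """Stand-in for a model's per-token log-probs of the response given
    the prompt. Toy rule: tokens that appear in the prompt are likely."""
    return [math.log(0.9) if t in prompt else math.log(0.1)
            for t in response_tokens]

def opd_advantages(prompt: str, response_tokens: list[str], hint: str) -> list[float]:
    """Hindsight-Guided OPD sketch: score the same response with and
    without the judge's correction hint; the per-token log-prob shift
    is a directional (token-level) advantage signal."""
    base = token_logprobs(prompt, response_tokens)
    hinted = token_logprobs(prompt + " Hint: " + hint, response_tokens)
    return [h - b for h, b in zip(hinted, base)]

adv = opd_advantages(
    prompt="Explain gravity",
    response_tokens=["Certainly", "gravity", "pulls"],
    hint="skip formulaic openers; say plainly that gravity pulls",
)
```

Tokens the hint context makes more likely receive positive advantage, so the update pushes individual tokens in a direction, not just the whole response up or down.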

[Figure] OpenClaw-RL architecture: interaction streams (conversation, terminal, GUI, tool calls) flow into an async RL engine with four decoupled components (Environment, PRM Judge, Megatron Trainer, SGLang Server), producing evaluative and directive signals; results show a 4.5x improvement for personal agents.


Architecture: Four Async Loops, Zero Blocking

OpenClaw-RL decouples the entire training pipeline into four independent asynchronous components built on the Slime framework: an environment server for session-aware rollout collection, a PRM judge for scoring, Megatron for policy training, and SGLang for live policy serving. None of these block each other. The model continues serving live user requests while the PRM evaluates past interactions and the trainer updates weights in the background.
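The decoupling can be pictured as independent loops that synchronize only through queues. This is a toy `asyncio` sketch of the pattern, not the actual Slime-based implementation, and it folds the serving loop into a comment for brevity:

```python
import asyncio

async def environment(rollouts: asyncio.Queue) -> None:
    """Collect session rollouts from live interactions."""
    for i in range(3):
        await rollouts.put(f"trajectory-{i}")
    await rollouts.put(None)  # shutdown sentinel

async def prm_judge(rollouts: asyncio.Queue, scored: asyncio.Queue) -> None:
    """Score each rollout using its next-state evidence (dummy score)."""
    while (traj := await rollouts.get()) is not None:
        await scored.put((traj, 0.5))
    await scored.put(None)

async def trainer(scored: asyncio.Queue, updates: list) -> None:
    """Consume scored rollouts and apply policy updates (recorded here)."""
    while (item := await scored.get()) is not None:
        updates.append(item)

async def main() -> list:
    rollouts, scored, updates = asyncio.Queue(), asyncio.Queue(), []
    # All loops run concurrently; a fourth loop (live policy serving)
    # would run alongside without ever blocking on the others.
    await asyncio.gather(environment(rollouts),
                         prm_judge(rollouts, scored),
                         trainer(scored, updates))
    return updates

updates = asyncio.run(main())
```

Because the only coupling is queue hand-off, a slow judge or trainer backs up its queue rather than stalling the serving path.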

For personal agents, the "environment" is simply the user's device, connecting to the RL server over HTTP. For general agents (terminal, GUI, SWE, tool-call), environments are hosted on cloud services for parallelized scaling. The system automatically organizes multi-turn interactions into session-aware trajectories, classifying each turn as "main-line" (trainable) or "side" (non-trainable).
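A minimal sketch of the session organization step might look like the following. The main-line/side rule here (an agent turn directly answering a user turn is trainable) is a simplifying assumption; the paper's actual classifier is not specified at this level of detail:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str        # "user" | "agent" | "system"
    text: str
    trainable: bool = False

def organize_session(turns: list[Turn]) -> list[Turn]:
    """Toy session organizer: mark agent turns that directly follow a
    user turn as 'main-line' (trainable); system notices and other
    side turns stay non-trainable."""
    prev_role = None
    for turn in turns:
        turn.trainable = turn.role == "agent" and prev_role == "user"
        prev_role = turn.role
    return turns

session = organize_session([
    Turn("user", "fix the failing test"),
    Turn("agent", "patched the assertion"),
    Turn("system", "CI passed"),
])
```

The point of the split is that gradient updates flow only through main-line turns, while side turns still contribute context and next-state evidence.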

Binary RL + OPD: Better Together

The two methods target different aspects of learning. Binary RL provides broad coverage across all interactions using scalar process scores from the PRM judge. OPD delivers precise token-level corrections for cases where the next state contains actionable information. Combining both yields stronger optimization than either alone.
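One plausible way to combine the two signals is to broadcast the scalar PRM advantage across all tokens and add the token-level OPD shifts on top. The baseline and weighting below are assumptions for illustration, not the paper's scheme:

```python
def combined_advantages(prm_score: float, opd_deltas: list[float],
                        baseline: float = 0.5, opd_weight: float = 1.0) -> list[float]:
    """Sketch of Binary RL + OPD combination: a broadcast scalar
    advantage (PRM score minus a baseline) plus per-token OPD shifts.
    Weighting scheme is an illustrative assumption."""
    scalar_adv = prm_score - baseline   # broad, every-interaction signal
    return [scalar_adv + opd_weight * d for d in opd_deltas]

# A passing interaction (score 1.0) whose next state also yielded a
# hint that strongly favors the final token:
adv = combined_advantages(prm_score=1.0, opd_deltas=[0.0, 0.0, 2.2])
```

Binary RL keeps every interaction contributing something, while OPD sharpens the update exactly where the next state says which tokens should change.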

0.76 personal score (from 0.17) | 8 training steps | 0 human labels

Results: Personal and General Agents

The team tested with Qwen3-4B across two simulated personal-agent scenarios. In a "student" setting (where the agent helps with homework but must avoid sounding like an AI), the combined Binary RL + OPD method pushed the personalization score from 0.17 to 0.76 after just 8 training steps. The agent learned to drop formulaic AI phrasing and produce more natural, conversational responses. In a "teacher" setting (where the agent grades homework with friendly, detailed feedback), 24 grading interactions were enough for meaningful style adaptation.

For general agents, the framework was tested across terminal, GUI, SWE, and tool-call scenarios using various Qwen3 models. Tool-call performance improved from 0.17 to 0.30. GUI performance went from 0.31 to 0.33. Accuracy increased across RL steps in all four settings, with terminal and tool-call agents showing the most consistent gains. The addition of step-wise process scoring (via PRM judging each turn using next-state evidence) was particularly helpful for long-horizon tasks where outcome-only signals leave most turns unsupervised.

Key Insight: The framework unifies five interaction types (conversations, terminal, GUI, SWE, tool calls) into a single training loop. This is the first system to combine multiple simultaneous interaction streams for one policy.

Key Insight: OPD recovers richer information than scalar scores because it produces token-level directional advantages. A PRM tells the model "that was bad." OPD tells the model "here is how each token should shift." Combining both methods outperforms either alone.

Key Insight: Process scoring is vital for agentic tasks. Outcome-only signals provide gradient signal only at the final step, leaving most turns in a long trajectory unsupervised. Per-turn PRM scoring using next-state signals solves the credit assignment problem without any annotation cost.
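The credit-assignment contrast is easy to see side by side: outcome-only reward leaves every intermediate turn with zero gradient signal, while per-turn PRM scoring gives each turn its own (here, baseline-centered) advantage. Both functions are toy sketches under that framing:

```python
def outcome_only_advantages(n_turns: int, final_reward: float) -> list[float]:
    """Outcome-only credit: every turn except the last gets no signal."""
    return [0.0] * (n_turns - 1) + [final_reward]

def per_turn_advantages(turn_scores: list[float], baseline: float = 0.5) -> list[float]:
    """Per-turn PRM credit: each turn is scored from its own next-state
    evidence, then centered on a baseline (toy version)."""
    return [s - baseline for s in turn_scores]

sparse = outcome_only_advantages(4, final_reward=1.0)     # 3 of 4 turns unsupervised
dense = per_turn_advantages([0.9, 0.2, 0.8, 1.0])          # every turn supervised
```

On a long trajectory the sparse case leaves almost all turns unsupervised, which is exactly the regime where the step-wise PRM scoring described above pays off.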

The most significant implication: if every deployed agent already generates the training data it needs through normal interactions, the bottleneck for agent improvement is no longer data collection. It is the infrastructure to recover and train on it in real time.


Source: OpenClaw-RL: Train Any Agent Simply by Talking (Wang et al., 2026) | GitHub
