Field Notes · Robotics · 2026

Training a Custom Quadruped from Scratch in Isaac Lab

A short, opinionated walkthrough. From an empty machine to a 12-DoF robot that learns to step. Reference robot: ZHAW Wild V1. Things that differ from the NVIDIA examples are flagged inline.

By Zaniyar Jahany · Contact: jaha@zhaw.ch · Stack: Isaac Sim 5.1 · Isaac Lab · PPO
part 01

1 Sim vs. Lab, in plain words

The two names confuse everyone. Here is the short version.

| Tool | What it is | You touch it when |
| --- | --- | --- |
| Isaac Sim | The simulator: the PhysX physics engine plus a USD scene graph plus a Python API. | You want a 3D world with gravity, collisions, lights, cameras. |
| Isaac Lab | A framework on top of Isaac Sim, focused on RL training: manager-based environments, vectorised envs, reward terms, terrain generators. | You want to train a policy. You almost never script Isaac Sim by hand once you use Lab. |
| rsl_rl | The PPO trainer that ships with Isaac Lab. Other choices: rl_games, skrl. PPO is the default for locomotion. | You hit train.py. |

Mental model if you come from web dev: Isaac Sim is the renderer plus physics. Isaac Lab is the game framework that wires up player input, scoring, level reset, and parallel instances. PPO is the agent that figures out the controls.

part 02

2 Install on Windows

The clean path that worked here is the pip install of Isaac Sim plus Isaac Lab from source, all in one Python 3.11 venv. No Docker, no Visual Studio, no source build of Sim. Big install, but reproducible.

What lives where

Isaac Lab repo : C:\Users\jaha\robotic\IsaacLab
Virtual env    : C:\Users\jaha\env_isaaclab
Python         : 3.11
Isaac Sim      : 5.1.0
torch          : 2.7.0+cu128

The five commands that matter

py -3.11 -m venv C:\Users\jaha\env_isaaclab
git clone https://github.com/isaac-sim/IsaacLab.git C:\Users\jaha\robotic\IsaacLab

C:\Users\jaha\env_isaaclab\Scripts\python.exe -m pip install ^
  "isaacsim[all,extscache]==5.1.0" --extra-index-url https://pypi.nvidia.com

C:\Users\jaha\env_isaaclab\Scripts\python.exe -m pip install -U ^
  torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 ^
  --index-url https://download.pytorch.org/whl/cu128

cd C:\Users\jaha\robotic\IsaacLab
.\isaaclab.bat -i

Path length matters on Windows. Keep everything under C:\Users\jaha\ or enable long paths. Deeply nested install paths break the cache extractor without a useful error.

PowerShell vs cmd.exe. Mixing them silently does the wrong thing. In PowerShell use $env:VAR = "value", not set VAR=value. The line continuation char is a backtick, not a caret.
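
Before moving on, a quick smoke test that the venv actually sees CUDA. This is our own check, not an official one; run it with the venv's python.exe:

import torch

# Expect "2.7.0+cu128 True" on a healthy install.
print(torch.__version__, torch.cuda.is_available())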
part 03

3 First examples to run

Before touching your own robot, run three NVIDIA examples to confirm the install. All three scripts below ship with Isaac Lab and live in the cloned repo under IsaacLab/scripts/ (stock Isaac Lab).

Spawn a primitive

cd C:\Users\jaha\robotic\IsaacLab
C:\Users\jaha\env_isaaclab\Scripts\python.exe scripts\tutorials\00_sim\spawn_prims.py

A box appears in a scene. Read the script top to bottom. This is your scene.add(mesh) equivalent.

Drive an articulation

C:\Users\jaha\env_isaaclab\Scripts\python.exe scripts\tutorials\01_assets\run_articulation.py

A robot appears, joints driven by code. The class to look for is ArticulationAction. Joints are the things RL will eventually control.
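
The tutorial's inner loop, paraphrased. This is a sketch, not the verbatim script; robot, sim, and sim_dt come from the script's setup code, and the method names are Isaac Lab's Articulation API:

import torch

# Drive every joint with small random efforts, one physics tick at a time.
efforts = torch.randn_like(robot.data.joint_pos) * 5.0
robot.set_joint_effort_target(efforts)  # queue targets on the articulation
robot.write_data_to_sim()               # push buffered data into PhysX
sim.step()                              # advance physics one tick
robot.update(sim_dt)                    # refresh robot.data from the sim state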

Train a built-in robot, end to end

C:\Users\jaha\env_isaaclab\Scripts\python.exe scripts\reinforcement_learning\rsl_rl\train.py ^
  --task=Isaac-Velocity-Rough-Anymal-C-v0 --headless

This trains Anymal-C on rough terrain. If the terminal prints rising mean rewards, the install is healthy. You can interrupt at any point and play the latest checkpoint with play.py.

part 04

4 Designing in Fusion 360

Fusion is great for building the body. It is bad at exporting it. Plan for both phases.

Build it as components, not bodies

Each link must be its own Component. Bodies inside one component become a single rigid link in URDF, which is almost never what you want. For the four legs of ZHAW Wild, use the prefixes fl_, fr_, rl_, rr_ on every component name. You will thank yourself three hours from now.

Joint discipline

Three things to set before export:

  1. Set the world origin sensibly. The base link will sit at this origin.
  2. Apply real materials so mass and inertia are not zero. Steel is rarely correct for hobby parts. Use ABS or aluminum.
  3. Pre-orient the robot to Z-up if your exporter offers it. Otherwise you fix this later in code.

For ZHAW Wild specifically the hierarchy was: base_link → 4 × (hip_link, thigh_link, shank_link, foot_link). Three rotational joints per leg: haa, hfe, kfe. That is the Spot and Anymal convention, which means RL configs from those robots transfer cleanly.
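
The payoff of that convention is that all twelve joint names fall out of two short lists, which is exactly what the regex-based configs later rely on (a trivial sketch):

legs = ["fl", "fr", "rl", "rr"]   # front-left, front-right, rear-left, rear-right
joints = ["haa", "hfe", "kfe"]    # hip abduction, hip flexion, knee flexion
names = [f"{leg}_{j}" for leg in legs for j in joints]
# ['fl_haa', 'fl_hfe', 'fl_kfe', 'fr_haa', ...]  -> 12 names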
part 05

5 Exporting to URDF

Fusion has no native URDF export. Use the ACDC4Robot add-in. It writes URDF in the Gazebo flavor, which is exactly the flavor Isaac Lab consumes. Three things to set in the add-in: choose URDF as the format, pick an output folder, click run. The add-in walks the component tree and emits one <link> per component plus one <joint> per Fusion joint.

Why URDF, not the USD that Fusion can spit out

A USD exported straight from CAD looks fine in a viewer and silently fails in Isaac Lab. The same three sins show up every time:

  1. No ArticulationRootAPI on the asset, so Isaac Lab does not recognise it as an articulation.
  2. No contact-reporting hooks, so the foot-contact reward terms read empty.
  3. Missing or garbage inertial tensors, so PhysX simulates nonsense.

The fix is the same every time: regenerate URDF, hand that to Isaac Lab, and let Lab bake a compliant USD itself. Isaac Lab adds the ArticulationRootAPI, contact-reporting hooks, and inertial tensors during conversion. Going through URDF is the only path that gives you these for free.

What to fix in the URDF after export

Naming convention is not optional. If your joints come out as revolute_joint_body_uleg_fl, every single regex in your actuator config and reward config has to match that. The Spot pattern of fl_haa, fl_hfe, fl_kfe matches a one-line regex like ".*_hfe". Worth a 10-minute rename pass right after export.
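
To see why the rename pays off, match both naming styles against the one-line regex. Plain Python, nothing Isaac-specific:

import re

names = ["fl_hfe", "fr_hfe", "revolute_joint_body_uleg_fl"]
print([n for n in names if re.fullmatch(r".*_hfe", n)])
# ['fl_hfe', 'fr_hfe'] -- the auto-generated Fusion name never matches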
part 06

6 Loading into Isaac Lab

Two files own the robot side: an asset config and a task registration. For ZHAW Wild they sit at:

IsaacLab\source\isaaclab_assets\isaaclab_assets\robots\zhawild.py
IsaacLab\source\isaaclab_tasks\...\velocity\config\zhawild\__init__.py
IsaacLab\source\isaaclab_tasks\...\velocity\config\zhawild\flat_env_cfg.py

Open the live files: zhawild.py (asset) · __init__.py (task registration) · flat_env_cfg.py (env) · rsl_rl_ppo_cfg.py (PPO).

The asset config

ZHAWILD_CFG = ArticulationCfg(
    spawn=sim_utils.UrdfFileCfg(
        asset_path="path/to/quadruped_zhawild_v1/quadruped2.urdf",
        rigid_props=...,
        articulation_props=...,
    ),
    init_state=ArticulationCfg.InitialStateCfg(
        pos=(0.0, 0.0, 0.75),
        rot=(0.7071068, 0.7071068, 0.0, 0.0),    # Y-up to Z-up
        joint_pos={".*": 0.0},
    ),
    actuators={
        "legs": ImplicitActuatorCfg(
            joint_names_expr=[".*_haa", ".*_hfe", ".*_kfe"],
            stiffness=80.0, damping=2.0,
        ),
    },
)
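
The two ... placeholders above are elided in this post. For orientation, the property configs that UrdfFileCfg expects have this shape; the values here are illustrative, not the ZHAW Wild ones:

import isaaclab.sim as sim_utils

rigid_props = sim_utils.RigidBodyPropertiesCfg(
    disable_gravity=False,
    max_depenetration_velocity=1.0,    # illustrative value
)
articulation_props = sim_utils.ArticulationRootPropertiesCfg(
    enabled_self_collisions=False,
    solver_position_iteration_count=4,
    solver_velocity_iteration_count=0,
)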
[Interactive viewer in the original post: a live URDF preview of quadruped2.urdf · 17 links · 12 revolute joints. The viewer parses the same STL meshes Isaac Lab loads, in the same joint hierarchy, with sliders that drive the hfe (hip flexion) and kfe (knee) joints across all four legs. Source: quadruped2.urdf + 11 STL meshes.]

Why URDF is passed in, not USD

Isaac Lab attaches contact-reporting APIs to the foot links during URDF conversion. That is what makes the gait, air-time, slip, and body-contact reward terms work. If you pre-convert to USD outside the loop, you skip that step and your foot contacts will be empty. Pass the URDF, let Lab convert it once, cache it.

Wheel-legged robots need a second actuator group. Wheels are continuous joints meant to spin freely. If they sit on a stiff position-PD with target zero, the robot fights its own rolling motion. The fix is two groups: legs with normal gains, and wheels with stiffness=0 plus light damping. ZHAW Wild has no wheels, so one actuator group is enough.
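
What that second group would look like if ZHAW Wild grew wheels. A hedged sketch: the .*_wheel pattern and the damping value are hypothetical, the legs group matches the asset config above:

actuators={
    # Stiff position-PD for the legs, as in the asset config above.
    "legs": ImplicitActuatorCfg(
        joint_names_expr=[".*_haa", ".*_hfe", ".*_kfe"],
        stiffness=80.0, damping=2.0,
    ),
    # Zero stiffness for wheels so the PD never fights the rolling motion.
    "wheels": ImplicitActuatorCfg(
        joint_names_expr=[".*_wheel"],   # hypothetical joint names
        stiffness=0.0, damping=0.5,      # light damping only
    ),
}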
part 07

7 Building the training environment

An Isaac Lab env is a Python class made of seven configs. Read this once, then it is just lego.

| Config | What it defines | Game-dev analogue |
| --- | --- | --- |
| SceneCfg | Ground, lights, robot, sensors. | The level. |
| CommandsCfg | Target velocity vector, resampled every N seconds. | The joystick input. |
| ActionsCfg | What the policy outputs. Usually joint position targets. | The buttons. |
| ObservationsCfg | Input tensor: base velocity, gravity vector, joint state, last action, optional height scan. | What the player sees. |
| RewardsCfg | Plus and minus terms accumulated every step. | The scoring. |
| TerminationsCfg | When to end the episode and respawn. | Game over. |
| EventCfg | Domain randomisation: friction, mass, push events. | Random chaos so it generalises. |

The simulation loop, end to end

RESET? (timeout / fall) → POLICY (obs → action) → WRITE targets to PhysX → STEP (sim.step()) → back to RESET?

One iteration of the inner loop is one physics tick across all 1024 parallel envs. PPO collects 24 ticks per env, runs a gradient update, and starts the next iteration. That is the heartbeat.
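
The same heartbeat as schematic Python. The names are invented for illustration; the real loop lives inside rsl_rl's OnPolicyRunner:

for iteration in range(max_iterations):
    for _ in range(24):                                   # rollout horizon per env
        actions = policy.act(obs)                         # actor forward pass + exploration noise
        obs, rewards, dones, extras = env.step(actions)   # write targets, step PhysX, score, auto-reset
        storage.add(obs, actions, rewards, dones)
    policy.update(storage)                                # one PPO gradient phase on 24 x 1024 transitions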

Terrain

Isaac Lab ships a procedural terrain generator. Three shapes you actually want:
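
For example flat patches, random rough, and pyramid stairs, wired into one generator config. A hedged sketch: the class names come from isaaclab.terrains, the proportions and ranges are illustrative:

import isaaclab.terrains as terrain_gen
from isaaclab.terrains import TerrainGeneratorCfg

TERRAIN_CFG = TerrainGeneratorCfg(
    size=(8.0, 8.0),             # metres per sub-terrain patch
    num_rows=10, num_cols=20,    # difficulty rows x variation columns
    sub_terrains={
        "flat": terrain_gen.MeshPlaneTerrainCfg(proportion=0.2),
        "rough": terrain_gen.HfRandomUniformTerrainCfg(
            proportion=0.4, noise_range=(0.02, 0.06), noise_step=0.02,
        ),
        "stairs": terrain_gen.MeshPyramidStairsTerrainCfg(
            proportion=0.4, step_height_range=(0.05, 0.15),
            step_width=0.3, platform_width=2.0,
        ),
    },
)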

Obstacles

Movable obstacles are just rigid bodies you spawn into the scene with collision and mass. Boxes and cylinders work fine. They are not part of the policy observation, the robot just bumps into them. Useful for stress testing and for the kick-disturbance scenario described later.

Domain randomisation

The single thing that makes a sim-trained policy survive in the real world. In ZHAW Wild we randomise the three knobs listed under EventCfg above: contact friction, base mass, and random push events.

Each parallel env gets a different draw. With 1024 envs, the policy sees thousands of variations every minute.
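
What those three knobs look like as event terms, modeled on the stock Anymal events config. The ranges here are illustrative, not the ZHAW Wild values:

from isaaclab.managers import EventTermCfg as EventTerm
from isaaclab.managers import SceneEntityCfg
import isaaclab.envs.mdp as mdp

physics_material = EventTerm(       # friction: drawn once per env at startup
    func=mdp.randomize_rigid_body_material,
    mode="startup",
    params={
        "asset_cfg": SceneEntityCfg("robot", body_names=".*"),
        "static_friction_range": (0.6, 1.2),
        "dynamic_friction_range": (0.4, 1.0),
        "restitution_range": (0.0, 0.2),
        "num_buckets": 64,
    },
)
add_base_mass = EventTerm(          # mass: +/- payload on the torso
    func=mdp.randomize_rigid_body_mass,
    mode="startup",
    params={
        "asset_cfg": SceneEntityCfg("robot", body_names="base_link"),
        "mass_distribution_params": (-1.0, 1.0),
        "operation": "add",
    },
)
push_robot = EventTerm(             # pushes: random base-velocity jolts during the episode
    func=mdp.push_by_setting_velocity,
    mode="interval",
    interval_range_s=(10.0, 15.0),
    params={"velocity_range": {"x": (-0.5, 0.5), "y": (-0.5, 0.5)}},
)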

part 08

8 Reward shaping & the crawl trap

PPO does exactly what your reward function says, never what you meant. This was the most expensive lesson of the whole project. On Robot20, our custom robot with two DoF per leg, 15 hours of GPU time produced a policy that maximised every term we wrote, while crawling on its lower-leg shins like skis. No standing, no foot clearance, healthy gait reward, full speed reward.

Three failures that compound

  1. Gait reward checks alternation, not posture. A crawling robot with alternating shin contact also satisfies it. Combine the gait term with a base-height target.
  2. Termination only on torso and thigh. The lower-leg segment was supposed to be the foot. Lying flat on it was perfectly legal. Add bottom-link termination, or a strong slip penalty when the bottom is in flat contact.
  3. Geometry locks you in. A 2-joint leg has no abduction and no ankle. Standing is hard, crawling is easy. No reward function can overcome that. ZHAW Wild with its three rotational DoF per leg does not have this problem.

Reward terms that work for a classical quadruped

| Term | Weight | Reads as |
| --- | --- | --- |
| track_lin_vel_xy_exp | +1.0 | Match the commanded forward speed. Main objective. |
| track_ang_vel_z_exp | +0.5 | Match the commanded turn rate. |
| feet_air_time | +0.125 | Lift feet off the ground. Encourages stepping over sliding. |
| lin_vel_z_l2 | −2.0 | Do not bounce vertically. |
| ang_vel_xy_l2 | −0.05 | Do not wobble. |
| action_rate_l2 | −0.01 | Do not jerk joints around. |
| undesired_contacts | −1.0 | Do not touch the ground with thighs or torso. |
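
How the table maps to config, as a trimmed sketch covering four of the seven terms. The command name, body-name patterns, and std/threshold values are illustrative, not the full ZHAW Wild RewardsCfg:

from isaaclab.managers import RewardTermCfg as RewTerm
from isaaclab.managers import SceneEntityCfg
from isaaclab.utils import configclass
import isaaclab_tasks.manager_based.locomotion.velocity.mdp as mdp

@configclass
class RewardsCfg:
    track_lin_vel_xy_exp = RewTerm(
        func=mdp.track_lin_vel_xy_exp, weight=1.0,
        params={"command_name": "base_velocity", "std": 0.5},
    )
    feet_air_time = RewTerm(
        func=mdp.feet_air_time, weight=0.125,
        params={
            "command_name": "base_velocity",
            "sensor_cfg": SceneEntityCfg("contact_forces", body_names=".*_foot"),
            "threshold": 0.5,   # seconds of air time that still earn reward
        },
    )
    lin_vel_z_l2 = RewTerm(func=mdp.lin_vel_z_l2, weight=-2.0)
    undesired_contacts = RewTerm(
        func=mdp.undesired_contacts, weight=-1.0,
        params={
            "sensor_cfg": SceneEntityCfg("contact_forces", body_names=[".*_thigh", "base_link"]),
            "threshold": 1.0,   # newtons of contact force that count as touching
        },
    )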

A second example from this run: the bunny hop (recording · 9 s)

ZHAW Wild · earlier checkpoint of the same run. The robot moves forward, but it bounces. All four legs leave the ground in sync and land in sync. Net forward motion happens, the velocity reward is collected, no body contact occurs. The policy has discovered hopping, not walking.

Why it hops & what would fix it

The bunny-hop is the exact same kind of reward exploit as the crawl, just one rung higher. The policy satisfies every term we wrote without any of the things that make a gait look right.

| Why it works for PPO | What it tells us is missing |
| --- | --- |
| Velocity reward only checks the base. Hopping forward at 0.5 m/s scores the same as walking forward at 0.5 m/s. | Add a foot-contact-time term that rewards at least N feet on the ground at any time. Hopping has zero ground contacts mid-flight. |
| Gait reward checks alternation per pair. Synchronised four-foot lift-off looks like alternation, just with phase 0 between pairs. | Penalise contact-pattern symmetry across diagonals. A trot has fl+rr down while fr+rl swing. A hop has all four matched. |
| The vertical-velocity penalty lin_vel_z_l2 exists, but the weight is small (−2.0 against a +1.0 forward reward). At training time it gets paid like a tax. | Bump the vertical-velocity penalty, or add a base-height tracking term (target ~0.55 m, hard-clip the deviation). |
| The action-rate penalty is per-step. Hopping uses one big synchronised extension followed by relaxation: few action transitions per second. | Add an air-time-per-foot term that rewards individual feet being in the air briefly, not all four at once. |

In short: the policy has no incentive to stagger the legs. Every reward we wrote is satisfiable by a perfectly in-phase bounce. Adding a single "never less than two feet on the ground" contact constraint, either as a hard penalty or a contact-schedule reward, breaks the hop and forces a real trot. NVIDIA's Anymal and Spot configs do this with a feet_air_time term tuned per-foot and a flat_orientation_l2 term that punishes the pitch wobble that a hop produces on landing.
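
A hedged sketch of that contact constraint as a custom reward term. The ContactSensor net_forces_w buffer is real Isaac Lab API; the function itself, its name, and the 1 N threshold are ours:

import torch
from isaaclab.envs import ManagerBasedRLEnv
from isaaclab.managers import SceneEntityCfg

def too_few_feet_down(
    env: ManagerBasedRLEnv, sensor_cfg: SceneEntityCfg, min_feet: int = 2
) -> torch.Tensor:
    """1.0 for every env with fewer than `min_feet` feet in contact; pair with a negative weight."""
    contact_sensor = env.scene.sensors[sensor_cfg.name]
    # Net contact force per foot body, shape (num_envs, num_feet, 3).
    forces = contact_sensor.data.net_forces_w[:, sensor_cfg.body_ids]
    in_contact = forces.norm(dim=-1) > 1.0          # newtons
    return (in_contact.sum(dim=-1) < min_feet).float()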

Rule of thumb. A reward term should fail loudly when the behaviour you want is absent. If a crawling robot can satisfy your gait reward, your gait reward is broken. Cross-check every term against a degenerate behaviour before running 14 hours of training.
part 09

9 Running PPO & reading the log

train.py and play.py ship with Isaac Lab, stock and unmodified. The custom piece is the task ID Isaac-Velocity-Flat-ZHAWild-v0, which gets registered when our zhawild_task package is on the import path.

Train

cd C:\Users\jaha\robotic\IsaacLab
C:\Users\jaha\env_isaaclab\Scripts\python.exe ^
  scripts\reinforcement_learning\rsl_rl\train.py ^
  --task=Isaac-Velocity-Flat-ZHAWild-v0 --headless --num_envs=1024

Play a checkpoint

C:\Users\jaha\env_isaaclab\Scripts\python.exe ^
  scripts\reinforcement_learning\rsl_rl\play.py ^
  --task=Isaac-Velocity-Flat-ZHAWild-Play-v0 --num_envs=1 --real-time ^
  --checkpoint logs\rsl_rl\zhawild_flat\2026-04-26_17-46-09\model_1500.pt

Five numbers to watch

| Metric | What it means | Healthy direction |
| --- | --- | --- |
| Mean reward | Sum of all terms, averaged across envs and steps. | Climbs, then plateaus. The absolute number is meaningless across reward configs. |
| Mean episode length | Steps survived before fall or timeout. | Climbs toward 1000 (the 20-second cap). |
| Body-contact fraction | Episodes that ended because torso or thigh hit the floor. | Drops toward zero. This is balance. |
| Mean noise std | Exploration noise on the action distribution. | Shrinks slowly. Stays positive. |
| Value-function loss | Critic fit quality. | Stable. Sudden growth is a divergence warning. |

For ZHAW Wild, after 2 h 11 min and 37 M timesteps, mean reward went from 0.9 to 48.7, episode length from 14 to 292, body-contact from 100 percent to 73 percent. The policy is learning. It still falls in three out of four episodes. That is the next training target.

The numbers (run 2026-04-26_17-46-09)

| Iter | Reward | Ep len | Timeout | Body-cont | XY err | Yaw err | Std | Gait | Lin-vel | Orient |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.89 | 14 | 0.00 | 0.43 | 0.03 | 0.02 | 0.60 | 0.04 | 0.02 | −0.03 |
| 50 | 1.37 | 18 | 0.00 | 1.00 | 0.05 | 0.03 | 0.42 | 0.06 | 0.03 | −0.05 |
| 100 | 1.90 | 22 | 0.00 | 1.00 | 0.06 | 0.04 | 0.36 | 0.06 | 0.04 | −0.07 |
| 500 | 42.53 | 251 | 0.20 | 0.80 | 0.50 | 0.39 | 0.19 | 1.67 | 0.56 | −0.76 |
| 1000 | 37.24 | 221 | 0.28 | 0.72 | 0.42 | 0.32 | 0.14 | 1.55 | 0.47 | −0.64 |
| 1500 | 48.70 | 292 | 0.27 | 0.73 | 0.54 | 0.55 | 0.12 | 2.39 | 0.77 | −0.95 |
| 1510 | 43.45 | 261 | 0.27 | 0.73 | 0.39 | 0.32 | 0.12 | 1.47 | 0.46 | −0.60 |

The iter 1500 row is the checkpoint model_1500.pt shown in the kick video below. The last row is the final iteration before the run was stopped, slightly worse than 1500. That is why we never replay the last checkpoint.

Charts

Fig 9.1 · Mean reward and mean episode length over training iterations. Both climb steeply between iter 100 and 500, the "policy figured out balance" phase. The plateau after ~iter 600 is the new ceiling under the current reward setup.

Fig 9.2 · Termination causes over iterations. Body-contact (the robot fell) starts near 100 % and drops to ~73 %. Timeout (survived the 20 s episode cap) climbs to ~27 %. The sum is always 1: every episode ends one of two ways. Driving body-contact further down is the next training target.

Fig 9.3 · Joystick tracking errors. XY velocity error and yaw error rise initially. Once the robot stops collapsing it actually attempts to follow the commanded velocity, and missing the target shows up as error. Lower would be better, but non-zero is normal: the policy trades exact velocity tracking against staying upright.

Fig 9.4 · Selected reward terms. The gait reward (alternating-foot bonus) climbing from 0.04 to 2.4 is the strongest single signal that the policy discovered coordinated foot timing. The linear-velocity reward climbs alongside it. The orientation penalty stays mildly negative throughout: the robot wobbles during steps, which matches the high body-contact rate.
How to read these together. Reward up + episode length up + gait up + std down = healthy training. Body-contact still high + orientation penalty still negative = the limiting factor is balance, not speed-tracking. The next iteration of training should target balance specifically: stronger orientation penalty, base-height term, or the contact-pattern fixes from Section 8.
Two failure modes I have hit. First, RuntimeError: normal expects all elements of std >= 0.0 when the action-noise std drifts negative. Patch it in PPO by clamping std to a small positive minimum. Second, NaN drift in value-function loss after thousands of healthy iterations. Save checkpoints every 50 iterations, and never replay the last one. Pick a mid-run model that still has good metrics.
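
For the first failure, the clamp goes where rsl_rl builds its Normal action distribution. The exact file and line vary by rsl_rl version, and the 1e-4 floor is our choice:

import torch
from torch.distributions import Normal

# Patched version of ActorCritic.update_distribution:
def update_distribution(self, observations):
    mean = self.actor(observations)
    std = torch.clamp(self.std, min=1e-4)   # floor the learned std at a small positive value
    self.distribution = Normal(mean, mean * 0.0 + std)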
part 10

10 The kick test

Reward curves do not tell you whether the policy is robust. A kick test does. The training-time push event covers the small disturbances. For a controlled scientific test you want repeatable kicks with known direction and magnitude.

Why keyboard, not mouse

Mouse-click kicks need a viewport raycast. In Isaac Sim 5.1.0 the viewport mouse events are intercepted by the selection and manipulator system before any custom subscriber sees them. Left-click on the robot just selects its prim. Keyboard events go through carb.input which is not affected by UI focus and works as long as the Sim window itself is active.

Bindings

| Key | Action |
| --- | --- |
| I | Kick forward (+X) |
| K | Kick backward (−X) |
| J | Kick left (+Y) |
| L | Kick right (−Y) |
| U | Upward boost (+Z) |
| R | Reset env 0 |
| P | Print pose & velocity |

I/J/K/L instead of WASD because WASD is reserved for the viewport fly-mode camera.
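
The hookup itself, sketched. The carb.input and omni.appwindow calls are standard Omniverse API; queue_kick is a hypothetical handler consumed by the play loop:

import carb.input
import omni.appwindow

keyboard = omni.appwindow.get_default_app_window().get_keyboard()
input_iface = carb.input.acquire_input_interface()

KICK_DIRS = {
    carb.input.KeyboardInput.I: ( 1.0,  0.0, 0.0),   # forward +X
    carb.input.KeyboardInput.K: (-1.0,  0.0, 0.0),   # backward -X
    carb.input.KeyboardInput.J: ( 0.0,  1.0, 0.0),   # left +Y
    carb.input.KeyboardInput.L: ( 0.0, -1.0, 0.0),   # right -Y
    carb.input.KeyboardInput.U: ( 0.0,  0.0, 1.0),   # upward +Z
}

def on_keyboard_event(event, *_):
    # Fires regardless of UI focus, as long as the Sim window is active.
    if event.type == carb.input.KeyboardEventType.KEY_PRESS and event.input in KICK_DIRS:
        queue_kick(KICK_DIRS[event.input])   # hypothetical: picked up by the play loop
    return True

subscription = input_iface.subscribe_to_keyboard_events(keyboard, on_keyboard_event)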

Run play_kickable.py (custom script)

C:\Users\jaha\env_isaaclab\Scripts\python.exe ^
  path\to\code\scripts\play_kickable.py ^
  --task=Isaac-Velocity-Flat-ZHAWild-Play-v0 --num_envs=1 --real-time ^
  --checkpoint logs\rsl_rl\zhawild_flat\2026-04-26_17-46-09\model_1500.pt ^
  --kick_speed 2.0

What it looks like (recording · 2:25)

ZHAW Wild · model_1500.pt · flat-plane playback. The robot has learned that staying on its feet beats touching the ground with the base. Watch the recovery after lateral kicks: it puts a foot out sideways, catches the impulse, and resumes tracking the commanded velocity.

What the kicks tell you

| Test | Sequence | Reads off |
| --- | --- | --- |
| Side push | walk forward, press J or L | Recovery time and tracking error after the impulse. |
| Forward shove | standing, press I | Whether it absorbs the shove or topples. |
| Backward shove | standing, press K | Backwards stability. Almost always the weakest direction. |
| Vertical drop | press U, observe landing | Foot-impact handling. |

The implementation pushes via robot.write_root_velocity_to_sim(...), the same call the training-time push event uses. Disturbances are physically equivalent to what the policy was trained against, just on demand.
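
The kick itself, sketched around that call. The helper function is ours; root_vel_w and write_root_velocity_to_sim are real Articulation API:

import torch

def apply_kick(robot, direction, kick_speed: float = 2.0, env_id: int = 0):
    """Add a velocity impulse along `direction` to one env's floating base."""
    root_vel = robot.data.root_vel_w.clone()    # (num_envs, 6): linear + angular
    kick = torch.tensor(direction, device=root_vel.device) * kick_speed
    root_vel[env_id, :3] += kick                # bump the linear part only
    robot.write_root_velocity_to_sim(root_vel)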

part 11

11 What I would do differently

The custom-robot RL pipeline works end to end. URDF in, policy out, kick test for robustness. The remaining work is reward design and physical design. Both of those compound. Both of those are also where the interesting research lives.

Zaniyar Jahany · ZHAW · jaha@zhaw.ch
appendix

A Source files

Every Python file referenced in this post is bundled next to it. Click to read in the browser, right-click to download.

Robot & assets

Isaac Lab asset config

Training task package

Drop the whole code/zhawild_task/ folder into IsaacLab/source/isaaclab_tasks/isaaclab_tasks/manager_based/locomotion/velocity/config/zhawild/. Isaac Lab auto-imports task packages on startup, which is when gym.register(...) runs.
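
The registration that runs at that point looks like this. A hedged sketch: the entry-point pattern is Isaac Lab's standard one, the config class names are illustrative:

import gymnasium as gym

gym.register(
    id="Isaac-Velocity-Flat-ZHAWild-v0",
    entry_point="isaaclab.envs:ManagerBasedRLEnv",
    disable_env_checker=True,
    kwargs={
        "env_cfg_entry_point": f"{__name__}.flat_env_cfg:ZHAWildFlatEnvCfg",
        "rsl_rl_cfg_entry_point": f"{__name__}.agents.rsl_rl_ppo_cfg:ZHAWildFlatPPORunnerCfg",
    },
)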

Runtime scripts

This page