Not every foundation model needs to be gigantic. We trained a 1.5M-parameter neural network to control the body of a humanoid robot.

It takes a lot of subconscious processing for us humans to walk, maintain balance, and maneuver our arms and legs into desired positions. We capture this “subconsciousness” in HOVER, a single model that learns how to coordinate the motors of a humanoid robot to support locomotion and manipulation.

We trained HOVER in NVIDIA Isaac, a GPU-powered simulation suite that accelerates physics to 10,000x faster than real time. To put that number in perspective: the robots undergo 1 year of intense training in a virtual “dojo”, yet it takes only ~50 minutes of wall-clock time on a single GPU. The neural net then transfers zero-shot to the real world without finetuning.

HOVER can be *prompted* with various types of high-level motion instructions that we call “control modes”. To name a few (see the sketch after this post):

- Head and hand poses - captured by XR devices like Apple Vision Pro.
- Whole-body poses - via MoCap or an RGB camera.
- Whole-body joint angles - via an exoskeleton.
- Root velocity commands - via joysticks.

What HOVER enables:

- A unified interface to control the robot with whichever input devices are convenient at hand.
- An easier way to collect whole-body teleoperation data for training.
- An upstream Vision-Language-Action model can provide motion instructions, which HOVER translates to low-level motor signals at high frequency.

HOVER supports any humanoid that can be simulated in Isaac. Bring your own robot, and watch it come to life!

It's a big team effort from NVIDIA GEAR Lab and collaborators: Tairan He, Wenli Xiao, Toru Lin, Zhengyi Luo, Zhenjia Xu, Zhenyu Jiang, Jan Kautz, Changliu Liu, Guanya Shi, Xiaolong Wang

Team leads: Jim Fan, Yuke Zhu

Website: https://lnkd.in/g6WrJyRC
Paper: https://lnkd.in/g99yWTPa
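[Editor's illustration] A minimal sketch of the "control mode" idea described above: one policy consumes all command channels, and a binary mask zeroes out the inactive ones so the same network serves every input device. This is not the actual HOVER API; all names, dimensions, and the architecture here are assumptions for illustration only.

```python
# Hypothetical sketch of a multi-mode humanoid policy (not HOVER's real code).
# A mask selects which high-level command channels the policy attends to.
import torch
import torch.nn as nn

COMMAND_DIMS = {
    "head_hand_pose": 18,   # e.g. 3 keypoints x (position + rotation)
    "whole_body_pose": 69,  # e.g. 23 link keypoints x 3D position
    "joint_angles": 23,     # one target per actuated joint
    "root_velocity": 3,     # vx, vy, yaw rate
}

class MaskedCommandPolicy(nn.Module):
    def __init__(self, proprio_dim=48, num_actions=23, hidden=512):
        super().__init__()
        cmd_dim = sum(COMMAND_DIMS.values())
        mask_dim = len(COMMAND_DIMS)
        self.net = nn.Sequential(
            nn.Linear(proprio_dim + cmd_dim + mask_dim, hidden),
            nn.ELU(),
            nn.Linear(hidden, hidden),
            nn.ELU(),
            nn.Linear(hidden, num_actions),  # joint position targets
        )

    def forward(self, proprio, commands, mask):
        # Zero out inactive command channels so one network handles all modes.
        masked = []
        for i, (name, dim) in enumerate(COMMAND_DIMS.items()):
            masked.append(commands[name] * mask[:, i:i + 1])
        x = torch.cat([proprio] + masked + [mask], dim=-1)
        return self.net(x)
```

A "joystick-only" prompt would then set the mask to activate just the root-velocity channel and zero out the rest, while a teleoperation prompt from a Vision Pro would activate only the head-and-hand-pose channel.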
Jim Fan But why a humanoid? Why constrain a robot's capability to the limits of human ergonomics? Try putting on wheels and adding a third arm, observe the efficiency, and please delete the head; it only raises the centre of gravity. If you need somewhere to place the robot's controller, try the bottom of the robot. I believe Lucas was a genius for making C-3PO the stupid one. Why do you think the ideal form of a universal robot is a human being?
For embodied agents / reinforcement learning models, we can train to SOTA with far fewer parameters than models for image generation or language modeling. It's interesting how a seemingly more complex task requires much less capacity to solve. An image model might need billions of params to generate in-distribution data from a PB-scale dataset, but for a highly complex physical problem the policies can be pennies in comparison. Scaling laws sort of apply, but returns diminish quickly after a certain point. I saw similar limits at Quilter when training RL placer agents: anything beyond a million parameters was unnecessarily large.
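[Editor's illustration] A rough back-of-the-envelope showing why control policies stay tiny. The layer sizes below are made up for illustration (not HOVER's or Quilter's actual architecture); the point is only that a typical MLP policy lands well under a couple of million parameters.

```python
# Parameter count for an illustrative MLP control policy (assumed sizes).
def mlp_params(layer_sizes):
    # weights + biases for each fully connected layer
    return sum(i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))

obs_dim, act_dim = 138, 23
policy = [obs_dim, 512, 512, 256, act_dim]
print(mlp_params(policy))  # ~0.47M parameters, orders of magnitude below an LLM
```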
There is no use case for humanoid robots. As a matter of fact, they are a physical danger to real people (think babies, the handicapped, etc.). This is just sci-fi fantasy work... how about doing something useful?
I love seeing more focus on robotics! It has so much potential to transform and improve lives, such as for the disabled. Using AI to solve unsolved problems should be the top focus for the industry.
Progress is promising. We’re researching analog technology to train localized network-like systems that improve kinetic response in robotic joints. Additionally, we’re developing feedback sensors designed to relay information to local networks within nanoseconds, before it reaches the central AI. Reducing latency in a robot's wiring is critical, as every millisecond saved allows the AI to respond within the necessary timeframe, similar to the peripheral nervous system’s role in the human body, where rapid, localized responses enhance overall reaction speed. Analog systems operate in real time, and analog AI is expected to emerge within the next five years, further advancing these capabilities.
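[Editor's illustration] A minimal digital sketch of the two-tier idea described in the comment above: a fast local "reflex" loop handles joint feedback every step, while a slower central policy only refreshes targets occasionally. All rates, gains, and dynamics here are assumptions for illustration, not the commenter's analog hardware.

```python
# Two-rate control loop: 1 kHz local PD reflex, 50 Hz central target updates.
# Analogous to peripheral vs. central nervous system; all numbers are made up.
import numpy as np

def local_reflex(q, qd, q_target, kp=30.0, kd=1.0):
    # Fast local loop: PD torque from joint feedback, no central round-trip.
    return kp * (q_target - q) - kd * qd

def central_policy(q, qd, t):
    # Slow central loop: stand-in for the "AI" choosing new joint targets.
    return 0.3 * np.sin(0.5 * t) * np.ones_like(q)

dt_fast, slow_every = 0.001, 20          # 1 kHz reflex, central update every 20 steps
q, qd, q_target = np.zeros(3), np.zeros(3), np.zeros(3)
for step in range(1000):
    if step % slow_every == 0:
        q_target = central_policy(q, qd, step * dt_fast)
    tau = local_reflex(q, qd, q_target)
    qd += tau * dt_fast                  # toy unit-inertia dynamics
    q += qd * dt_fast
```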
I've been saying this for a couple of years now, and with results: vertically trained (small models), then horizontally chained.
This is actually a more "native" use of GPUs than making them do transformer math for LLMs
Humans' key advantage is our one-shot ability to learn to grasp and interface with the world. How fast and precise is this in the real world? How close do you think we are to a foundation model that achieves fast, one-shot reactions to the real world? It would be a good time to make a benchmark!
Jim Fan so does it work in actual robots?