Singapore-Hong Kong Team Speeds Up AI for Energy Robotics

Researchers from the National University of Singapore and the University of Hong Kong have developed a new framework aimed at improving the real-time performance of large-scale Vision-Language-Action (VLA) models. These models, which jointly process visual input, natural-language instructions, and robot actions, offer significant potential for applications such as robotics and autonomous systems in the energy sector. However, their high inference latency has confined them to low-frequency batch-and-execute paradigms, which can fail in dynamic environments where targets move during the execution window.

The team, led by Yuteng Sun and including Haoran Wang, Ruofei Bai, Zhengguo Li, Jun Li, Meng Yee Chuah, and Wei Yun Yau, introduced TIDAL, a hierarchical framework designed to decouple semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture to redistribute the computational budget. This means it can be integrated into existing systems without requiring a complete overhaul.

TIDAL features a low-frequency macro-intent loop that caches semantic embeddings and a high-frequency micro-control loop that interleaves single-step flow integration with execution. This design allows for approximately 9 Hz control updates on edge hardware, a significant improvement over the approximately 2.4 Hz baselines. To address the latency shift, the researchers introduced a temporally misaligned training strategy where the policy learns predictive compensation using stale semantic intent alongside real-time proprioception. Additionally, they incorporated a differential motion predictor to handle the insensitivity of static vision encoders to velocity.
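The dual-frequency idea can be illustrated with a minimal sketch: a slow thread periodically refreshes a cached semantic embedding (the macro-intent loop), while a fast loop consumes the possibly stale cache together with fresh proprioception to produce actions (the micro-control loop). This is not the authors' implementation; the class, function names, and rates below are hypothetical placeholders for the expensive VLA forward pass (`encode_intent`) and the cheap single-step flow integration (`flow_step`).

```python
import threading
import time


class DualFrequencyController:
    """Illustrative two-rate control loop (hypothetical, not TIDAL's code):
    a slow 'macro-intent' thread refreshes a cached semantic embedding,
    and a fast 'micro-control' loop reads the (possibly stale) cache
    alongside real-time proprioception to emit actions."""

    def __init__(self, encode_intent, flow_step, slow_hz=2.0, fast_hz=9.0):
        self.encode_intent = encode_intent  # expensive semantic reasoning pass
        self.flow_step = flow_step          # cheap single-step flow integration
        self.slow_period = 1.0 / slow_hz
        self.fast_period = 1.0 / fast_hz
        self.cached_intent = None
        self.lock = threading.Lock()
        self.running = False

    def _macro_loop(self, get_observation):
        # Low-frequency loop: refresh the cached semantic embedding.
        while self.running:
            intent = self.encode_intent(get_observation())
            with self.lock:
                self.cached_intent = intent
            time.sleep(self.slow_period)

    def run(self, get_observation, get_proprioception, send_action, steps):
        self.running = True
        # Prime the cache once before starting the fast loop.
        with self.lock:
            self.cached_intent = self.encode_intent(get_observation())
        worker = threading.Thread(
            target=self._macro_loop, args=(get_observation,), daemon=True
        )
        worker.start()
        actions = []
        for _ in range(steps):
            # High-frequency loop: act on stale intent plus fresh state.
            with self.lock:
                intent = self.cached_intent
            action = self.flow_step(intent, get_proprioception())
            send_action(action)
            actions.append(action)
            time.sleep(self.fast_period)
        self.running = False
        return actions
```

Because the fast loop never waits on the expensive encoder, its rate is bounded by the flow step alone, which is what lets this pattern trade a small amount of semantic staleness for a several-fold increase in control frequency.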

The practical applications of TIDAL in the energy sector are promising. For instance, it could enhance the performance of robotic systems used in inspection, maintenance, and repair of energy infrastructure, such as offshore wind turbines or pipelines. The improved real-time control could also benefit autonomous systems in dynamic environments, such as drone-based inspections or robotic systems operating in hazardous conditions.

Experiments conducted by the researchers showed a 2x performance gain over open-loop baselines in dynamic interception tasks. While there was a marginal regression in static success rates, the approach yielded a 4x increase in feedback frequency and extended the effective horizon of semantic embeddings beyond the native action chunk size. Under non-paused inference protocols, TIDAL remained robust where standard baselines failed due to latency.

The research was published in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), a premier international conference in the fields of computer vision and pattern recognition. The findings represent a significant step forward in the development of real-time, high-performance VLA models, with potential applications across various industries, including energy.

Source: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

This article is based on research available at arXiv.
