Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on either evaluating over stateless web services (RESTful APIs), based on a single-turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox are challenging even for the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities.