
OpenAI has launched GPT-5.1-Codex-Max, its next-generation agentic coding model, which replaces GPT-5.1-Codex as the default model in the Codex integrated interface. The upgrade significantly improves long-horizon reasoning, interaction efficiency, and real-time performance, and it outperforms Google's Gemini 3 Pro on several benchmarks, drawing wide attention in the AI development community.
On key coding benchmarks, Codex-Max matches or beats Gemini 3 Pro: it scores 77.9% on SWE-Bench Verified (which measures fixes for real-world software issues), slightly ahead of Gemini's 76.2%; it leads 58.1% to 54.2% on Terminal-Bench 2.0; and the two models tie at an Elo rating of 2439 on the highly competitive LiveCodeBench Pro. The results suggest that OpenAI continues to hold an edge in AI-assisted programming.
Technically, Codex-Max introduces a mechanism called "Compaction," which preserves critical context and discards redundant detail, allowing the model to work continuously across millions of tokens without performance degradation. With this technique, the model worked on complex tasks such as multi-step code refactoring for more than 24 hours in internal testing, while improving token efficiency by 30% and reducing latency and cost. The model is now integrated into OpenAI's Codex CLI, internal code review tools, and other development environments, and it supports real-time interactive scenarios such as reinforcement learning training.
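The article describes Compaction only at a high level, and OpenAI's actual mechanism is not detailed here. As a rough illustration of the general idea (keep what matters, drop the rest once the context grows too large), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the Message class, the critical flag, the word-count tokenizer, and the placeholder summary line are hypothetical and are not part of any OpenAI API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    text: str
    critical: bool = False  # hypothetical flag: this message must survive compaction


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def total_tokens(history: List[Message]) -> int:
    return sum(count_tokens(m.text) for m in history)


def compact(history: List[Message], budget: int, keep_recent: int = 2) -> List[Message]:
    """Return the history unchanged if it fits the budget. Otherwise keep the
    critical messages and the most recent ones, and replace everything else
    with a single placeholder line. The result is not guaranteed to fit the
    budget; a real system would summarize with the model and could compact again."""
    if total_tokens(history) <= budget:
        return history
    cutoff = max(len(history) - keep_recent, 0)
    kept: List[Message] = []
    dropped = 0
    for i, msg in enumerate(history):
        if msg.critical or i >= cutoff:
            kept.append(msg)
        else:
            dropped += 1
    if dropped == 0:
        return kept
    summary = Message(f"[summary: {dropped} earlier messages omitted]", critical=True)
    return [summary] + kept


# Illustrative session: one critical goal, two bulky tool outputs, two recent steps.
history = [
    Message("Goal: refactor the payment module without changing behaviour", critical=True),
    Message("ls output: a.py b.py c.py d.py e.py f.py g.py h.py"),
    Message("Full test log with hundreds of passing lines elided here"),
    Message("Test test_refund fails with KeyError"),
    Message("Patched refund handler; rerunning tests"),
]
compacted = compact(history, budget=30)
for msg in compacted:
    print(msg.text)
```

In this toy run the two bulky tool outputs are replaced by a single placeholder line, while the task goal and the two most recent steps are retained. A production system would presumably generate a real summary instead of a fixed placeholder; that part is deliberately left out of the sketch.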
Despite these capabilities, OpenAI emphasizes that Codex-Max is a coding "assistant," not a replacement for human developers. By default the model runs in a sandbox with network access disabled and generates detailed logs that developers can use to verify its work. Individual users need a ChatGPT Plus, Pro, or Enterprise subscription to use it; the public API is not yet available. OpenAI says that 95% of its internal engineers use Codex weekly and that the average number of pull requests has risen by 70% since adoption.