Evaluating OpenAI’s New o1 Models on Coding Performance

Joao Fiadeiro

OpenAI's recent announcement of its o1 family of models, particularly o1-preview and o1-mini, has sent ripples through the AI community. As the team behind Crosshatch, a platform offering specialized AI model mixes for developers, we were naturally intrigued by these new additions to the AI landscape. Our ongoing work on Mixture of Agents (MOA) and our goal of building the best possible coding assistant made us eager to put these models to the test.

With Crosshatch, we've been focusing on creating a coding MOA that generates the most performant and clean code possible. The arrival of OpenAI's o1 models presented an excellent opportunity to benchmark their performance against existing models and our own MOA approach.

We conducted thorough evaluations using the hard subset of BigCodeBench-Instruct, a challenging benchmark for code generation tasks. Our findings revealed some interesting insights:

  1. o1-preview demonstrated impressive capabilities, achieving a Pass@1 score of 26.84 on the hard subset of BigCodeBench-Instruct. This slightly edges out GPT-4 Turbo's score of 26.35 and shows a notable improvement over GPT-4o's 25.00 and Claude 3.5 Sonnet's 24.32.
  2. Interestingly, o1-mini also performed remarkably well, matching GPT-4o with a score of 25.00. What makes this particularly noteworthy is o1-mini's more favorable pricing. At $3.00 per 1M input tokens and $12.00 per 1M output tokens, it offers a compelling alternative to GPT-4o ($5.00 / 1M input, $15.00 / 1M output) and Claude 3.5 Sonnet ($3.00 / 1M input, $15.00 / 1M output).

These results suggest that o1-mini could be a strong contender for everyday code generation tasks, offering an attractive balance of performance and cost-effectiveness. Meanwhile, o1-preview sets a new high-water mark for single-model performance on this challenging benchmark.
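
To put those prices in context, here is a rough back-of-the-envelope cost comparison in Python. The per-token rates come from the list above; the request size (2,000 input tokens, 800 output tokens) is an arbitrary assumption for illustration, and the estimate ignores the hidden reasoning tokens that o1 models bill as output, which would raise o1-mini's effective cost somewhat.

```python
# Rough cost-per-request comparison using the list prices quoted above.
# The 2,000-input / 800-output request size is a hypothetical example.
PRICES = {  # (USD per 1M input tokens, USD per 1M output tokens)
    "o1-mini": (3.00, 12.00),
    "gpt-4o": (5.00, 15.00),
    "claude-3.5-sonnet": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 800):.4f} per request")
# o1-mini: $0.0156, gpt-4o: $0.0220, claude-3.5-sonnet: $0.0180
```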

In this article, we'll dive deeper into our evaluation process, discuss the implications of these findings for developers, and explore how these new models might influence the future of AI-assisted coding. We'll also share insights on how our coding MOA approach compares to these standalone models and what it means for the future of AI in software development.

Methodology and Findings

To evaluate the performance of OpenAI's new o1-preview and o1-mini models, we utilized BigCodeBench, specifically its "Instruct" subset. We chose this benchmark because it closely mimics chat-based code generation workflows, providing a realistic assessment of the models' capabilities in practical scenarios.

BigCodeBench consists of both normal and hard tasks. We focused our analysis on the hard tasks, as these better represent the complex workflows typically encountered in real-world programming scenarios. Here are examples illustrating the difference between normal and hard tasks:

  • Normal task example: "Perform a linear regression analysis on a given DataFrame."
  • Hard task example: "Download all files from a specific directory on an FTP server using wget in a subprocess. Handle various exceptions including connection failures and authentication errors."

The hard tasks are characterized by their multi-step nature, their need for advanced error handling, and their interaction with external systems; they often involve file-system operations or complex data manipulation.

Methodology:

  1. We used the docker containers provided by the BigCodeBench team to ensure reproducibility.
  2. Each instruct_prompt was passed as the user message to the model.
  3. For o1-preview and o1-mini, no system prompt was used as they don't support it. For other models, we used the system prompt: "You are a helpful assistant good at coding".
  4. The generated responses underwent a sanitization step to extract only the Python code.
  5. The extracted code was then executed in a sandbox against a unit test to determine pass/fail status.
  6. We used the Pass@1 metric, meaning each model had a single attempt to produce a correct solution for each task (a minimal sketch of this harness appears below).
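
For readers who want to approximate this pipeline outside the official tooling, the sketch below shows its overall shape: one completion per task, a regex-based sanitizer that keeps only the Python code, and a subprocess standing in for the sandbox. It is a simplified illustration rather than the BigCodeBench harness itself; the task fields (instruct_prompt, test), the unittest-style runner, and the timeout value are assumptions based on our setup.

```python
import re
import subprocess
import tempfile

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a helpful assistant good at coding"

def generate(model: str, instruct_prompt: str) -> str:
    """Request a single completion; the o1 models reject system prompts, so we omit one."""
    messages = []
    if not model.startswith("o1"):
        messages.append({"role": "system", "content": SYSTEM_PROMPT})
    messages.append({"role": "user", "content": instruct_prompt})
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def sanitize(response: str) -> str:
    """Keep only the Python code: prefer fenced code blocks, else use the raw response."""
    blocks = re.findall(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return "\n".join(blocks) if blocks else response

def passes_unit_test(code: str, test_code: str, timeout: int = 120) -> bool:
    """Run the candidate solution plus its (unittest-based) test in a subprocess.
    The real harness executes inside the BigCodeBench docker sandbox instead."""
    runner = "\n\nif __name__ == '__main__':\n    import unittest\n    unittest.main()\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code + runner)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def pass_at_1(model: str, tasks: list[dict]) -> float:
    """Pass@1 with one sample per task is simply the percentage of tasks solved."""
    solved = sum(
        passes_unit_test(sanitize(generate(model, t["instruct_prompt"])), t["test"])
        for t in tasks
    )
    return 100 * solved / len(tasks)
```

In our actual runs, generation and execution happened inside the BigCodeBench docker image, which pins the environment the unit tests expect.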

Findings:

Our evaluation revealed the following Pass@1 scores on the hard subset of BigCodeBench-Instruct:

  1. o1-preview: 26.84
  2. GPT-4 Turbo: 26.35
  3. o1-mini: 25.00
  4. GPT-4o: 25.00
  5. Claude 3.5 Sonnet: 24.32
  6. DeepSeek 2: 23.00

These scores differ slightly from the official BigCodeBench leaderboard because certain tasks failed consistently in our environment, even when using the provided docker container. To keep the comparison apples-to-apples, we re-ran every model on the same docker/machine setup.

The results show that o1-preview slightly outperforms GPT-4 Turbo, while o1-mini matches GPT-4o's performance. This demonstrates that the new o1 models are competitive with existing top-tier models in complex coding tasks.

Comparison with Mixture of Agents (MOA):

We also compared these results with our Mixture of Agents (MOA) approach:

  1. Our best-performing MOA, which uses Claude 3.5 Sonnet and GPT-4 Turbo as proposers and GPT-4o as the aggregator, achieved a score of 31.1, outperforming all single models, including o1-preview (a sketch of this proposer/aggregator pattern follows this list).
  2. The MOA approach is priced at $6 per 1M input tokens and $20 per 1M output tokens, making it about one-third the cost of o1-preview.
  3. Even a simpler MOA using GPT-4o to improve GPT-4 Turbo's responses achieved a score of 29.1, still higher than o1-preview.
  4. In both MOA cases, the results were obtained faster than with single models.
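
To make the proposer/aggregator pattern concrete, here is a minimal sketch of how such a mix can be wired together. It is illustrative rather than our production pipeline: the aggregator prompt, the exact model identifiers, and the sequential calls are assumptions made to keep the example short.

```python
# Simplified proposer/aggregator sketch; the prompt and model IDs are illustrative.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

AGGREGATOR_PROMPT = """You are given a coding task and several candidate solutions.
Combine their strengths into one final, correct, clean Python solution.
Return only the final code.

Task:
{task}

Candidate solutions:
{candidates}"""

def propose_openai(model: str, prompt: str) -> str:
    resp = openai_client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def propose_claude(model: str, prompt: str) -> str:
    resp = anthropic_client.messages.create(
        model=model, max_tokens=4096, messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text

def moa_generate(task: str) -> str:
    # Proposers each draft an independent solution to the task.
    proposals = [
        propose_claude("claude-3-5-sonnet-20240620", task),
        propose_openai("gpt-4-turbo", task),
    ]
    # The aggregator (GPT-4o) synthesizes the proposals into a single final answer.
    candidates = "\n\n---\n\n".join(proposals)
    return propose_openai("gpt-4o", AGGREGATOR_PROMPT.format(task=task, candidates=candidates))
```

In practice the proposer calls can be issued concurrently, which helps explain how an ensemble of standard models can still return results quickly.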

These findings suggest that while the new o1 models show impressive performance, there's still significant value in ensemble approaches like MOA. The MOA method not only outperforms single models but also offers cost advantages and faster response times.

The superior performance of MOA underscores the potential of combining multiple models to tackle complex coding tasks. It demonstrates that leveraging the strengths of different models can yield better results than even the most advanced single models, providing a promising direction for future developments in AI-assisted coding.

Discussion and Next Steps

The introduction of OpenAI's o1-preview and o1-mini models marks a significant step forward in AI-assisted coding. Our evaluation using the BigCodeBench-Instruct subset reveals promising results, with o1-preview slightly outperforming GPT-4 Turbo and o1-mini matching GPT-4o's performance. However, these improvements come with considerations regarding cost and latency.

While the o1 models show marginal improvements in quality as measured by this benchmark, it's crucial to remember that BigCodeBench is just one way of assessing code output quality. It provides valuable insights but doesn't capture the full spectrum of real-world coding scenarios.

In practical applications, particularly within chat-based IDEs like Cursor, the ideal code generation model must strike a delicate balance between quality, speed, and cost. Given our findings, we believe o1-mini is well-positioned to become a go-to code generation model for developers. It offers a compelling combination of performance and efficiency that could significantly enhance coding workflows.

However, the superior performance of our Mixture of Agents (MOA) approach suggests that there's still considerable value in ensemble methods. The MOA not only outperformed single models but also offered cost advantages and faster response times. This indicates that the future of AI-assisted coding may lie in intelligently combining multiple models rather than relying on a single, albeit advanced, model.

Next Steps:

  1. Comprehensive Evaluation: We plan to conduct evaluations on the entire BigCodeBench dataset, including tasks not labeled as "hard" and code completion tasks. This will provide a more holistic view of the models' capabilities across various coding scenarios.
  2. Explore Additional Benchmarks: We aim to assess these models using other coding benchmarks to understand their performance across different programming languages and frameworks. This broader evaluation will help identify strengths and weaknesses in various coding contexts.
  3. New MOA Mixes: The impressive performance of o1-preview and o1-mini opens up exciting possibilities for new MOA combinations. We're eager to explore how these models can be integrated into our existing MOA approach to potentially achieve even better results.
  4. Intelligent Routing: We're developing a routing system that directs simpler tasks to faster models while channeling more challenging tasks to an MOA. This approach could optimize the trade-off between speed and capability, ensuring efficient handling of a wide range of coding tasks (see the sketch after this list).
  5. Community Engagement: We value community feedback and insights. We invite developers to join our Discord and engage with us on X (Twitter) to share their thoughts and experiences. We're particularly interested in learning what other mixes beyond code generation would be useful, such as documentation generation or pull request reviews.

As we continue to refine our approach and integrate new models, we remain committed to enhancing the AI-assisted coding experience. The rapid advancements in language models, combined with innovative approaches like MOA, promise to revolutionize how developers interact with AI in their coding workflows. We're excited about the potential these developments hold and look forward to collaborating with the community to shape the future of AI-assisted software development.