Building a Coding Agent to Solve SWE-Bench

Zain ul Abideen
January 17, 2025

Learn how we improved our approach to solving SWE-bench problems by flipping the process—making code changes first and then generating patches.

In our first attempt to solve SWE-bench problems, we ran into a lot of issues because the patches were being generated by an LLM before the fixes were actually applied to the files. This caused problems like inconsistent formatting and errors slipping through. So, we decided to flip the process: make the changes first and then generate the patches.


Workflow Overview

The workflow improves upon earlier methods by introducing a structured, tool-integrated pipeline. Each agent handles a specific task, leveraging GPT-4o (via LangChain) for repository navigation and diagnosis, then employing Qwen/QwQ-32B-Preview for context-aware edits. By breaking the process into clear steps (diagnosis, code editing, edit application, patch generation, and evaluation), the system minimizes conflicts, reduces formatting errors, and maintains consistency across the repository.


Repository Cloning

The first step is straightforward: clone the target repository and extract the relevant problem statement from the SWE-bench dataset. This involves setting up the same environment as in the previous approach, including the project structure and all dependencies needed to run the subsequent diagnostic and editing scripts.
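As a concrete sketch, the setup can be as simple as pulling one instance from the dataset and pinning the repository to its base commit. This assumes the Hugging Face datasets library and a local git install; the workspace path is illustrative:

```python
import subprocess
from datasets import load_dataset

# Pull one SWE-bench problem: each instance carries the repo name, the
# base commit the issue was reported against, and the problem statement.
swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
instance = swe_bench[0]

repo_url = f"https://github.com/{instance['repo']}.git"
repo_path = f"workspace/{instance['instance_id']}"  # illustrative path

subprocess.run(["git", "clone", repo_url, repo_path], check=True)
# Pin the working tree to the exact commit the issue refers to.
subprocess.run(["git", "checkout", instance["base_commit"]],
               cwd=repo_path, check=True)

print(instance["problem_statement"][:500])  # issue text for the Diagnoser
```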

Diagnoser Agent

Once the repository is in hand, the Diagnoser Agent steps in. It uses tools integrated with GPT-4o (via LangChain) to navigate the repository, open files, traverse folders, and load contents into memory. This agent analyzes the codebase and outputs structured JSON data detailing:
  • Problem Descriptions
  • Affected File Names
  • Line Numbers
Below is an example illustrating how the Diagnoser Agent identifies specific line ranges and issues within a file:
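The exact schema is internal to our pipeline, and the file name, line numbers, and issue text below are hypothetical, but the sketch captures the three pieces of information listed above:

```python
# Hypothetical example of the Diagnoser Agent's structured JSON output.
diagnosis = [
    {
        "file": "src/units/quantity.py",  # affected file (illustrative)
        "start_line": 512,
        "end_line": 534,                  # line range to inspect
        "issue": "Incompatible operands raise TypeError instead of "
                 "returning NotImplemented.",
    },
]
```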

Code Editor

This component is actually a chain of agents that takes the diagnosis from the Diagnoser Agent and systematically addresses each affected file. It works like this:
  1. Read the first chunk of diagnostic information.
  2. Open the corresponding file.
  3. Feed both the file's content and the diagnosis details (line range, description, and issue) into a grounded LLM (no tools given).
  4. Suggest an edit to resolve the problem.
An example of a suggested edit from the LLM:
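The field names here are again illustrative rather than our exact format; the point is that the model returns a targeted replacement for just the diagnosed lines:

```python
# Hypothetical shape of an edit proposed by the grounded LLM for the
# diagnosis shown earlier: line numbers are 1-indexed and inclusive.
suggested_edit = {
    "file": "src/units/quantity.py",
    "start_line": 520,
    "end_line": 521,
    "original": [
        "        if not self._is_compatible(other):",
        "            raise TypeError('incompatible operand')",
    ],
    "replacement": [
        "        if not self._is_compatible(other):",
        "            return NotImplemented",
    ],
}
```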

Evaluating the Effectiveness of Edits

A key concern in any automated fixing workflow is verifying that the changes truly resolve the underlying problem. Our approach tackles this on two fronts. First, once a fix is applied, we can run the previously failing tests to confirm whether the issue has been resolved in practice. This direct feedback from the test suite offers a definitive "pass" or "fail," leaving little doubt as to the efficacy of the edit.
Second, we cross-check the diagnosis itself. The Diagnoser Agent (which has access to the full repository) identifies the issue by scanning large code chunks; then a second, more focused LLM (which does not see the entire repository) reviews the same lines and issue description. This extra review confirms that the suggested change makes sense before we edit the file.
By combining direct feedback from tests and a second review of the proposed fix, we gain more confidence that our solution is correct and that we aren't introducing new problems.
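A minimal sketch of the test-based check, continuing the snippet from the cloning step and assuming the dataset's FAIL_TO_PASS field (covered in more detail below) is a JSON-encoded list of pytest identifiers:

```python
import json
import subprocess

# Tests that failed before the fix and should now pass.
fail_to_pass = json.loads(instance["FAIL_TO_PASS"])

# Re-run exactly those tests against the edited working tree.
result = subprocess.run(
    ["python", "-m", "pytest", *fail_to_pass, "-q"],
    cwd=repo_path, capture_output=True, text=True,
)
edit_verified = result.returncode == 0  # definitive pass/fail signal
```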

Edit Applier Agent

After the Code Editor proposes an edit, the Edit Applier Agent sets out to implement it. Initial attempts replaced entire code blocks, which often caused problems with indentation, overlapping comment sections, and other structural issues, such as accidentally altering code above or below the edited block. To address this, the Edit Applier Agent now follows this approach:
  • Generates suggested edits line by line to preserve the file's structure.
  • Updates each line individually, respecting function boundaries and comment blocks.
  • Runs a formatter afterward (e.g., Black) to catch syntax or formatting errors.
  • Loops back to the Suggestion Agent for a revised fix if any issues are found.
This iterative approach greatly reduces collisions and maintains integrity between functions, comments, and surrounding lines. A sketch of line-by-line editing follows below.
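A minimal sketch of how such an applier might work, assuming the suggested_edit structure shown earlier and Black as the post-edit formatter:

```python
import subprocess

def apply_edit(repo_path: str, edit: dict) -> bool:
    """Rewrite only the diagnosed lines, then check the file with Black."""
    path = f"{repo_path}/{edit['file']}"
    with open(path) as f:
        lines = f.readlines()

    # Replace the targeted lines with the suggested ones
    # (line numbers are 1-indexed, end_line inclusive).
    start, end = edit["start_line"] - 1, edit["end_line"]
    lines[start:end] = [line + "\n" for line in edit["replacement"]]

    with open(path, "w") as f:
        f.writelines(lines)

    # Black in check mode flags syntax/formatting problems; a non-zero
    # exit sends us back to the Suggestion Agent for a revised fix.
    check = subprocess.run(["black", "--check", path], capture_output=True)
    return check.returncode == 0
```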

Patch Generation Agent

Once the code is successfully updated, the Patch Generation Agent is invoked to create Git-based patches. Because these patches are produced by Git's internal diff commands rather than by an LLM, they are free of the formatting errors we encountered in our previous approach.
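Because Git does the work, the sketch is short (continuing from the earlier snippets):

```python
import subprocess

# Diff the edited working tree against the pinned base commit; Git
# guarantees a well-formed unified diff by construction.
diff = subprocess.run(
    ["git", "diff", instance["base_commit"]],
    cwd=repo_path, capture_output=True, text=True, check=True,
)
model_patch = diff.stdout
```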

Submission and Evaluation

Finally, the newly generated patch, along with repository metadata and logs, is submitted for SWE-bench evaluation.
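To the best of our knowledge, the SWE-bench evaluation harness expects three fields per prediction; a sketch of how a submission line could be assembled (the model label is illustrative):

```python
import json

prediction = {
    "instance_id": instance["instance_id"],
    "model_name_or_path": "queryloop-agent",  # illustrative label
    "model_patch": model_patch,               # the Git-generated diff
}

# One JSON record per solved instance.
with open("predictions.jsonl", "a") as f:
    f.write(json.dumps(prediction) + "\n")
```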

Key Improvements in Our New Strategy

Building on lessons from the previous workflows, our new strategy introduces several critical enhancements to reduce errors.

Enhanced Precision

  • Line-by-line replacements: By updating code on a line-by-line basis, our system minimizes disruption to the surrounding logic and structure. This reduces the risk of errors where indentation, comments, and boundary conditions could otherwise be inadvertently modified, and keeps each change scoped to only what is actually faulty.
  • Intelligent formatting: Post-edit linting and formatting checks ensure that each code segment maintains structural integrity, preventing new errors from being introduced while fixing another.

Optimized Workflow

  • Iterative loops: Each proposed fix is rigorously validated. If an issue arises, the system loops back to refine the recommendation, ensuring each solution is validated before proceeding.
  • Direct file edits and Git-based patches: By working directly with the repository and generating patches through Git commands, the process is streamlined and produces a precise record of changes, so our patches don't fail at SWE-bench evaluation.

Accurate Diagnostics

  • Structured JSON outputs: The Diagnoser Agent's clearly formatted outputs pin down the exact location of each issue, enabling precise fixes. These structured diagnostics reduce guesswork: because we know where the issues are, we can pinpoint the affected files and apply a systematic, grounded fix to each one, rather than blindly traversing files in a loop.

How Our Approach Stands Out in Autonomous Software Engineering

Granular Diagnostics

  • Comprehensive file issue coverage: While many competitors rely purely on LLM-generated patches, our system goes deeper — it identifies and addresses issues in a comprehensive, file-specific manner.
  • Manual verification: We incorporate opportunities for human oversight to ensure that even the most subtle bugs are caught before finalizing any fix.

Repository-Centric Approach

  • Tailored for repositories: Our solution is designed around repository-level diagnostics, focusing on how files interrelate and interact within a larger codebase.
  • Optimized tools and workflows: With specialized modules for reading, writing, and linting code, the entire pipeline is highly precise in addressing repository requirements.

Current Operations

The development of new agents and modules continues to refine this automated workflow, pushing the boundaries of both reliability and agility.

Issue Recreator Agent

The SWE-bench dataset provides a FAIL_TO_PASS field listing the failing test files and the associated test functions. At this stage, our focus is on recreating these errors to validate the fixes:
  • Error Recreation: Using the SWE dataset, this agent identifies the failed test files and re-runs them to check if the errors have been resolved after applying the fix.
  • Sandboxed Validation: Since dependencies can vary across repositories, we are developing a virtual sandbox module to isolate each test environment. Within this sandbox, the repository will be installed and the test suite re-run to determine whether the error persists or has been resolved, as sketched below.
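A sketch of what that sandboxed run could look like on a POSIX system, using Python's built-in venv module and continuing from the earlier snippets:

```python
import json
import subprocess
import venv

# Create a throwaway virtual environment so the repository's
# dependencies stay isolated from the host interpreter.
env_dir = f"{repo_path}/.sandbox"
venv.create(env_dir, with_pip=True)
py = f"{env_dir}/bin/python"  # POSIX layout; Scripts\python.exe on Windows

# Install the repository into the sandbox, then re-run the
# previously failing tests inside it.
subprocess.run([py, "-m", "pip", "install", "-e", "."],
               cwd=repo_path, check=True)
tests = json.loads(instance["FAIL_TO_PASS"])
result = subprocess.run([py, "-m", "pytest", *tests, "-q"], cwd=repo_path)
print("resolved" if result.returncode == 0 else "error persists")
```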