The Inner Workings of GPT’s file_search Tool

Reverse Engineering

The file_search tool allows GPT models to extract precise information from uploaded documents using structured queries and provides citations for verification.

When you upload a document to a GPT model and ask a question, the artificial intelligence uses a specialized tool called file search to find the exact answers you need.

Behind the scenes, the model takes your question and turns it into a structured search, generating up to five different queries. The first query matches your original question, rewritten to resolve any ambiguity. The other queries use alternative phrasing or synonyms to ensure no relevant details are missed. For instance, if you ask about employee turnover in a certain year, the tool will also search for terms like attrition statistics or staff turnover figures.

Once the tool finds the information, it cites the source. This citation references the specific search result and the name of your document, allowing you to easily verify the facts.

This tool is most useful when you need exact data points or direct quotes, because every answer can be traced back to a cited passage in your document. Keep in mind that file search is strictly for retrieving existing data. It does not broadly summarize or interpret content, and it relies on specific queries to avoid pulling in irrelevant noise. Ultimately, it is a practical way to get reliable, verifiable answers directly from your files.

The `file_search` tool enables GPT models to extract specific information directly from documents uploaded by users. This feature is essential when user queries require precise answers based explicitly on the contents of these documents.

The exact hidden system instruction is as follows:

{
  "Purpose": "Use `file_search.msearch` to answer user questions based on uploaded files.",

  "Structure": {
    "Format": {
      "queries": [
        "first query",
        "second query",
        "... up to five queries"
      ]
    },
    "Requirements": [
      "One query must match the user's original question, rewritten only to resolve ambiguity or complete missing context.",
      "Avoid overly broad or short queries that return noise."
    ]
  },

  "Examples": {
    "User Question": "What was Kevin's age?",
    "Queries": [
      "What was Kevin's age?",
      "Kevin age",
      "How old is Kevin?",
      "Kevin birth year",
      "Kevin date of birth"
    ]
  },

  "Citing Results": {
    "Format": "【3:13†Filename】",
    "Explanation": {
      "3": "Tool message index",
      "13": "Query result index",
      "Filename": "Source document title (no extension)"
    }
  }
}

How the Tool Functions

When a user uploads a file, such as a PDF, CSV, or plain text document, the GPT model calls the method file_search.msearch to query the document's contents. Each call sends a single JSON object containing up to five distinct queries, each crafted to retrieve the exact information requested.

Query Format

Queries must adhere to the following JSON structure:

{
  "queries": [
    "User's exact original question (mandatory)",
    "Alternative phrasing or synonyms (optional)",
    "... additional related queries (up to five total)"
  ]
}

The first query should match the user’s original request, rewritten only to resolve ambiguity or fill in missing context. Additional queries restate it with alternative phrasing or synonyms so that relevant passages are not missed.
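The rules above can be sketched as a small helper that assembles the request body. This is an illustrative sketch only: the function name `build_msearch_payload` and the de-duplication step are assumptions made for the example, not part of the tool itself. It simply keeps the original question first and caps the list at five queries.

```python
import json

MAX_QUERIES = 5  # the article states a limit of five queries per call


def build_msearch_payload(original_question, alternatives=()):
    """Return a {"queries": [...]} dict with the original question first.

    Hypothetical helper for illustration; not an official API.
    """
    question = original_question.strip()
    if not question:
        raise ValueError("original question must not be empty")

    queries = [question]
    seen = {question.lower()}
    for alt in alternatives:
        alt = alt.strip()
        # Skip blanks and case-insensitive duplicates.
        if alt and alt.lower() not in seen:
            queries.append(alt)
            seen.add(alt.lower())

    # Keep the original question plus at most four alternatives.
    return {"queries": queries[:MAX_QUERIES]}


payload = build_msearch_payload(
    "What is the employee turnover rate for 2024?",
    [
        "2024 employee turnover rate",
        "Employee attrition statistics 2024",
        "Staff turnover figures 2024",
    ],
)
print(json.dumps(payload, indent=2))
```

Placing the original question first and truncating afterwards means the mandatory query can never be dropped, even if too many alternatives are supplied.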

Example

If a user asks:

“What is the employee turnover rate for 2024?”

The GPT model would send the following structured request:

{
  "queries": [
    "What is the employee turnover rate for 2024?",
    "2024 employee turnover rate",
    "Employee attrition statistics 2024",
    "Staff turnover figures 2024"
  ]
}

Result Citation

Answers retrieved by the file_search tool include structured citations of the form 【4:7†HR_Report】, where:

4: Index of the response message from the file_search tool.
7: Specific result number within that response.
HR_Report: The name of the original document source (without file extension).

This citation format facilitates direct verification of information by referencing the source document.
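To make the three parts of a citation concrete, here is a minimal parsing sketch. It assumes the citation is rendered as `【4:7†HR_Report】`, that is, message index, colon, result index, dagger, source name, all inside lenticular brackets; those delimiters are an assumption of this example. The `Citation` type and `parse_citations` function are hypothetical names, and the sample answer string is made up.

```python
import re
from typing import NamedTuple


class Citation(NamedTuple):
    message_index: int  # index of the file_search tool message
    result_index: int   # result number within that message
    source: str         # document name without extension


# Assumed rendering: 【<message index>:<result index>†<source name>】
CITATION_RE = re.compile(r"【(\d+):(\d+)†([^】]+)】")


def parse_citations(text):
    """Extract every citation found in a model answer."""
    return [
        Citation(int(msg), int(res), src.strip())
        for msg, res, src in CITATION_RE.findall(text)
    ]


# Made-up answer text, used only to exercise the parser.
answer = "The turnover rate is stated in the report 【4:7†HR_Report】."
for citation in parse_citations(answer):
    print(citation)
```

A parser like this lets a reader or a downstream script map each claim back to a specific search result and file.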

Applications and Advantages

The GPT model uses file_search when:

User questions require exact data points or direct quotes from uploaded files.
Responses need factual accuracy grounded explicitly in provided documentation.
Source transparency is crucial for user validation.

By integrating this tool, the GPT model significantly improves the precision, transparency, and reliability of its responses.

Limitations and Best Practices

Queries must be specific; overly general or ambiguous queries yield irrelevant results.
The tool strictly retrieves existing data; it does not summarize or interpret content broadly.
Adherence to the prescribed citation format is essential for clarity and source traceability.
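The first point in the list above, query specificity, could be enforced with a simple pre-flight check before queries are sent. The sketch below is a rough heuristic under stated assumptions: the two-word minimum and the name `is_specific_enough` are hypothetical choices for illustration, and the article does not describe any such threshold.

```python
MIN_WORDS = 2  # hypothetical threshold; tune for your own documents


def is_specific_enough(query, min_words=MIN_WORDS):
    """Reject empty or very short queries, which tend to return noise."""
    return len(query.split()) >= min_words


candidates = ["turnover", "2024 employee turnover rate", "  ", "Kevin age"]
kept = [q for q in candidates if is_specific_enough(q)]
print(kept)  # ['2024 employee turnover rate', 'Kevin age']
```

Word count is only a crude proxy for specificity; a query can be long and still vague, so a check like this is best treated as a first filter.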

In summary, file_search is a practical retrieval mechanism that allows GPT models to precisely extract and present factual information from user-uploaded documents, ensuring responses are accurate and clearly sourced.

Dan Petrovic · May 27, 01:08