Harnessing GPT-OSS Built-in Tools
Learn how to properly set up vLLM with GPT-OSS built-in tools and integrate it with LibreChat to leverage powerful capabilities.
The OpenAI gpt-oss models come with built-in tools (python and browser) that are deeply integrated into the model’s training. Since these tools are built-in, the inference engine itself must handle them, not your application code.
This blog post explains how to properly set up vLLM with these tools and integrate it with LibreChat to leverage these powerful capabilities.
Built-in tool basics
The gpt-oss series are OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. They are available in two flavors: gpt-oss-120b (117B parameters) for production use and gpt-oss-20b (21B parameters) for lower latency applications.
The models come with two built-in tools that are deeply integrated into their training:
- python tool: Executes Python code in a stateful Jupyter environment for calculations, data analysis, and generating visualizations. Code runs in the model’s chain of thought and isn’t shown to users unless explicitly intended.
- browser tool: Searches for information, opens web pages, and finds specific content. Includes functions for search, open, and find operations with proper citation formatting.
Using these built-in tools is advantageous: they are trained directly into the model’s behavior, they use the proper analysis channel for seamless reasoning integration, and they have been optimized for accuracy and reliability compared to custom functions that would duplicate this functionality. The reason the inference engine must be involved is that gpt-oss uses the Harmony response format with three distinct channels, and the built-in tools are trained to output to a specific channel. The inference engine must parse these channels, route messages appropriately, and filter content correctly:
- analysis: Contains the model’s chain-of-thought (CoT) reasoning. This channel should not be shown to end users, as it doesn’t adhere to the same safety standards as the final output.
- commentary: Used for custom function tool calls and occasional preambles when calling multiple functions.
- final: User-facing messages that represent the actual responses intended for end users.
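To make the routing concrete, here is a hypothetical sketch of what an inference engine must do once Harmony messages are parsed. The simplified dict-based message shape is an assumption for illustration; real parsing is handled by the openai_harmony library inside the engine.

```python
# Hypothetical sketch: routing parsed Harmony messages by channel.
# The message shape is simplified for illustration only.

def route_harmony_messages(messages):
    """Split parsed Harmony messages by channel:
    - 'final' content goes to the user,
    - 'analysis' stays hidden (chain of thought),
    - 'commentary' carries tool calls and preambles."""
    user_visible, hidden_cot, tool_calls = [], [], []
    for msg in messages:
        channel = msg["channel"]
        if channel == "final":
            user_visible.append(msg["content"])
        elif channel == "analysis":
            hidden_cot.append(msg["content"])
        elif channel == "commentary":
            tool_calls.append(msg["content"])
    return user_visible, hidden_cot, tool_calls

msgs = [
    {"channel": "analysis", "content": "User asks about weather..."},
    {"channel": "commentary", "content": "browser.search(query='Seattle weather')"},
    {"channel": "final", "content": "It is currently 12°C in Seattle."},
]
visible, cot, calls = route_harmony_messages(msgs)
print(visible)  # only the 'final' message reaches the user
```

An engine that skips this filtering would leak raw chain of thought to users, which is exactly what the channel separation is designed to prevent.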
Install vLLM
For my setup, I run the gpt-oss-20b model on 2× NVIDIA RTX 3090 GPUs (24GB VRAM each), which provides good performance for the smaller variant while maintaining reasonable inference speeds for interactive applications.
Create a new Python environment:
uv venv --python 3.12 --seed
source .venv/bin/activate
Install vLLM with automatic torch backend detection:
uv pip install vllm --torch-backend=auto
Start the MCP tool servers
OpenAI provides reference implementations for the built-in tools through the gpt-oss package. To make the browser tool available to vLLM, we’ll create an MCP (Model Context Protocol) server that wraps the reference implementation.
Create a file named browser_server.py:
import os
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager
from dataclasses import dataclass, field
from typing import Union, Optional

from mcp.server.fastmcp import Context, FastMCP
from gpt_oss.tools.simple_browser import SimpleBrowserTool
from gpt_oss.tools.simple_browser.backend import YouComBackend, ExaBackend


@dataclass
class AppContext:
    browsers: dict[str, SimpleBrowserTool] = field(default_factory=dict)

    def create_or_get_browser(self, session_id: str) -> SimpleBrowserTool:
        if session_id not in self.browsers:
            tool_backend = os.getenv("BROWSER_BACKEND", "exa")
            if tool_backend == "youcom":
                backend = YouComBackend(source="web")
            elif tool_backend == "exa":
                backend = ExaBackend(source="web")
            else:
                raise ValueError(f"Invalid tool backend: {tool_backend}")
            self.browsers[session_id] = SimpleBrowserTool(backend=backend)
        return self.browsers[session_id]

    def remove_browser(self, session_id: str) -> None:
        self.browsers.pop(session_id, None)


@asynccontextmanager
async def app_lifespan(_server: FastMCP) -> AsyncIterator[AppContext]:
    yield AppContext()


# Pass lifespan to server
mcp = FastMCP(
    name="browser",
    instructions=r"""
Tool for browsing.
The `cursor` appears in brackets before each browsing display: `[{cursor}]`.
Cite information from the tool using the following format:
`【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.
Do not quote more than 10 words directly from the tool output.
sources=web
""".strip(),
    lifespan=app_lifespan,
    port=8001,
)


@mcp.tool(
    name="search",
    title="Search for information",
    description="Searches for information related to `query` and displays `topn` results.",
)
async def search(ctx: Context,
                 query: str,
                 topn: int = 10,
                 source: Optional[str] = None) -> str:
    """Search for information related to a query"""
    browser = ctx.request_context.lifespan_context.create_or_get_browser(
        ctx.client_id)
    messages = []
    async for message in browser.search(query=query, topn=topn, source=source):
        if message.content and hasattr(message.content[0], 'text'):
            messages.append(message.content[0].text)
    return "\n".join(messages)


@mcp.tool(
    name="open",
    title="Open a link or page",
    description="""
Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.
Valid link ids are displayed with the formatting: `【{id}†.*】`.
If `cursor` is not provided, the most recent page is implied.
If `id` is a string, it is treated as a fully qualified URL associated with `source`.
If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.
Use this function without `id` to scroll to a new location of an opened page.
""".strip(),
)
async def open_link(ctx: Context,
                    id: Union[int, str] = -1,
                    cursor: int = -1,
                    loc: int = -1,
                    num_lines: int = -1,
                    view_source: bool = False,
                    source: Optional[str] = None) -> str:
    """Open a link or navigate to a page location"""
    browser = ctx.request_context.lifespan_context.create_or_get_browser(
        ctx.client_id)
    messages = []
    async for message in browser.open(id=id,
                                      cursor=cursor,
                                      loc=loc,
                                      num_lines=num_lines,
                                      view_source=view_source,
                                      source=source):
        if message.content and hasattr(message.content[0], 'text'):
            messages.append(message.content[0].text)
    return "\n".join(messages)


@mcp.tool(
    name="find",
    title="Find pattern in page",
    description="Finds exact matches of `pattern` in the current page, or the page given by `cursor`.",
)
async def find_pattern(ctx: Context, pattern: str, cursor: int = -1) -> str:
    """Find exact matches of a pattern in the current page"""
    browser = ctx.request_context.lifespan_context.create_or_get_browser(
        ctx.client_id)
    messages = []
    async for message in browser.find(pattern=pattern, cursor=cursor):
        if message.content and hasattr(message.content[0], 'text'):
            messages.append(message.content[0].text)
    return "\n".join(messages)
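Note that AppContext caches one SimpleBrowserTool per client session (keyed by ctx.client_id), so each conversation keeps its own page history and cursors. The create-or-get pattern it relies on can be sketched in isolation; SessionCache below is a hypothetical stand-in for illustration, not part of gpt_oss or mcp:

```python
# Hypothetical stand-in for the per-session caching used by AppContext above:
# the first request for a session creates the object, later requests reuse it.
class SessionCache:
    def __init__(self, factory):
        self._items = {}
        self._factory = factory

    def get(self, session_id):
        if session_id not in self._items:
            self._items[session_id] = self._factory()
        return self._items[session_id]

cache = SessionCache(factory=object)
a = cache.get("client-1")
b = cache.get("client-1")  # same object as `a`
c = cache.get("client-2")  # a fresh object for the other session
print(a is b, a is c)  # True False
```

Without this per-session isolation, browsing cursors from concurrent conversations would interfere with each other.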
Create a file named python_server.py:
from mcp.server.fastmcp import FastMCP
from gpt_oss.tools.python_docker.docker_tool import PythonTool
from openai_harmony import Message, TextContent, Author, Role

mcp = FastMCP(
    name="python",
    instructions=r"""
Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).
When you send a message containing python code to python, it will be executed in a stateless docker container, and the stdout of that process will be returned to you.
""".strip(),
    port=8002,
)


@mcp.tool(
    name="python",
    title="Execute Python code",
    description="""
Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).
When you send a message containing python code to python, it will be executed in a stateless docker container, and the stdout of that process will be returned to you.
""",
    annotations={
        # The Harmony format doesn't include this schema in the prompt
        # because the tool is simple text in, text out.
        "include_in_prompt": False,
    })
async def python(code: str) -> str:
    tool = PythonTool()
    messages = []
    async for message in tool.process(
            Message(author=Author(role=Role.TOOL, name="python"),
                    content=[TextContent(text=code)])):
        messages.append(message)
    return "\n".join([message.content[0].text for message in messages])
Before you start the browser server, you need to get an API key from exa.ai, which the server uses to browse the web.
Start the MCP server using the fastmcp CLI:
export EXA_API_KEY=YOUR-EXA-KEY-HERE
mcp run -t sse browser_server.py:mcp
This starts the browser tool server on port 8001. You should see output similar to:
INFO: Started server process [730909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)
We can similarly start the Python server:
cd gpt-oss-tests/mcp-servers && mcp run -t sse python_server.py:mcp
Start vLLM with browser tool integration
Now we can launch vLLM and configure it to use the MCP servers we just started. The key parameter is --tool-server, which points to our MCP servers on localhost:8001 and localhost:8002.
vllm serve openai/gpt-oss-20b \
    --tensor-parallel-size 2 \
    --max-num-seqs 1 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.85 \
    --tool-call-parser openai \
    --reasoning-parser openai_gptoss \
    --enable-auto-tool-choice \
    --host 0.0.0.0 \
    --port 8000 \
    --tool-server localhost:8001,localhost:8002
Once vLLM initializes, you should see output indicating the server is running:
(APIServer pid=732684) INFO: Started server process [732684]
(APIServer pid=732684) INFO: Waiting for application startup.
(APIServer pid=732684) INFO: Application startup complete.
If you scroll up a bit in that terminal window, you can verify that the tools were loaded successfully:
(APIServer pid=860267) INFO 10-11 14:56:42 [entrypoints/tool_server.py:140] MCPToolServer initialized with tools: ['browser', 'python']
Test from Python
Create a file named test_builtin_tools.py, which uses the OpenAI SDK:
from openai import OpenAI
import json
from pprint import pprint

from rich.console import Console
from rich.panel import Panel
from rich.text import Text
from rich.syntax import Syntax
from rich import box


def display_response_flow(response):
    """
    Display the response in a nice, structured format showing:
    - Each reasoning step
    - Each tool call with its action
    - The final assistant message
    - Token usage statistics
    """
    console = Console()

    # Header with status
    status_color = {
        'completed': 'green',
        'in_progress': 'yellow',
        'failed': 'red',
        'cancelled': 'red'
    }.get(response.status, 'cyan')

    console.print()
    console.print(Panel.fit(
        f"[bold cyan]Response Flow: {response.model}[/bold cyan]\n"
        f"[bold]Status:[/bold] [{status_color}]{response.status}[/{status_color}]",
        border_style="cyan"
    ))
    console.print()

    if not hasattr(response, 'output') or not response.output:
        console.print("[yellow]No output in response[/yellow]")
        return

    reasoning_count = 0
    tool_call_count = 0
    has_final_message = False

    for output_item in response.output:
        # Display reasoning blocks
        if output_item.type == 'reasoning':
            reasoning_count += 1
            for content in output_item.content:
                if content.type == 'reasoning_text':
                    console.print(Panel(
                        Text(content.text, style="italic dim"),
                        title=f"[bold yellow]💭 Reasoning #{reasoning_count}[/bold yellow]",
                        border_style="yellow",
                        box=box.ROUNDED,
                        padding=(1, 2)
                    ))
                    console.print()

        # Display tool calls
        elif output_item.type == 'web_search_call':
            tool_call_count += 1
            action = output_item.action
            status = output_item.status or "unknown"

            # Format action details - action is a Pydantic model, not a dict
            action_type = getattr(action, 'type', 'N/A')
            action_text = f"[bold]Type:[/bold] {action_type}\n"
            if hasattr(action, 'query') and action.query:
                action_text += f"[bold]Query:[/bold] {action.query}\n"
            if hasattr(action, 'url') and action.url:
                action_text += f"[bold]URL:[/bold] {action.url}\n"
            if hasattr(action, 'pattern') and action.pattern:
                action_text += f"[bold]Pattern:[/bold] {action.pattern}\n"
            action_text += f"[bold]Status:[/bold] {status}"

            # Choose icon based on action type
            icon = {
                'search': '🔍',
                'open_page': '📄',
                'find': '🔎',
            }.get(action_type, '🔧')

            console.print(Panel(
                action_text,
                title=f"[bold green]{icon} Tool Call #{tool_call_count}[/bold green]",
                border_style="green",
                box=box.ROUNDED,
                padding=(1, 2)
            ))
            console.print()

        # Display final message
        elif output_item.type == 'message' and output_item.role == 'assistant':
            has_final_message = True
            for content in output_item.content:
                if content.type == 'output_text':
                    console.print(Panel(
                        Text(content.text),
                        title="[bold blue]📨 Final Response[/bold blue]",
                        border_style="blue",
                        box=box.DOUBLE,
                        padding=(1, 2)
                    ))
                    console.print()

    # Warning if no final message
    if not has_final_message:
        console.print(Panel(
            "[bold yellow]⚠️ No final response message found![/bold yellow]\n"
            "The model may have been cut off or encountered an error.",
            border_style="yellow",
            box=box.ROUNDED
        ))
        console.print()

    # Usage statistics
    if hasattr(response, 'usage') and response.usage:
        usage = response.usage
        stats_text = f"""[bold]Input tokens:[/bold] {usage.input_tokens:,}
[bold]Output tokens:[/bold] {usage.output_tokens:,}
[bold]Total tokens:[/bold] {usage.total_tokens:,}"""
        if hasattr(usage, 'input_tokens_details') and usage.input_tokens_details:
            if hasattr(usage.input_tokens_details, 'cached_tokens'):
                cached = usage.input_tokens_details.cached_tokens
                stats_text += f"\n[bold]Cached tokens:[/bold] {cached:,} ({cached/usage.input_tokens*100:.1f}%)"
        if hasattr(usage, 'output_tokens_details') and usage.output_tokens_details:
            details = usage.output_tokens_details
            if hasattr(details, 'reasoning_tokens'):
                stats_text += f"\n[bold]Reasoning tokens:[/bold] {details.reasoning_tokens:,}"
            if hasattr(details, 'tool_output_tokens'):
                stats_text += f"\n[bold]Tool output tokens:[/bold] {details.tool_output_tokens:,}"

        console.print(Panel(
            stats_text,
            title="[bold magenta]📊 Token Usage[/bold magenta]",
            border_style="magenta",
            box=box.ROUNDED,
            padding=(1, 2)
        ))
        console.print()

    # Summary
    console.print(Panel.fit(
        f"[bold]Summary:[/bold] {reasoning_count} reasoning steps, {tool_call_count} tool calls",
        border_style="cyan"
    ))
    console.print()


client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

response = client.responses.create(
    model="openai/gpt-oss-20b",
    input="How is the weather in Seattle, WA?",
    tools=[
        {
            "type": "code_interpreter",
            "container": {
                "type": "auto"
            }
        },
        {
            "type": "web_search_preview"
        }
    ],
    reasoning={
        "effort": "medium",  # "low", "medium", or "high"
        "summary": "detailed"  # "auto", "concise", or "detailed"
    },
    temperature=1.0,
)

# Display the response in a nice format
display_response_flow(response)

# Show raw response for debugging if response looks incomplete
show_raw_debug = response.status != 'completed'
if show_raw_debug:  # Set to False to hide raw response
    print("\n" + "=" * 80)
    print("FULL RAW RESPONSE (for debugging)")
    print("=" * 80)
    response_dict = response.model_dump() if hasattr(response, 'model_dump') else dict(response)
    pprint(response_dict, width=120, depth=10)
The crucial part of the code is telling vLLM to use the built-in tools. Note how we just reference each tool by type without supplying an implementation; vLLM internally routes the call to the corresponding built-in tool.
tools=[
    {
        "type": "code_interpreter",
        "container": {
            "type": "auto"
        }
    },
    {
        "type": "web_search_preview"
    }
]
When you reference these two tools in your tools array, vLLM’s _construct_input_messages_with_harmony function in serving_responses.py translates these OpenAI API names to GPT-OSS’s internal tool names: "browser" and "python" respectively. It checks if the configured tool server (either DemoToolServer for built-in tools via --tool-server demo, or MCPToolServer for external servers via --tool-server localhost:8001,localhost:8002) has these tools available, then constructs the system message in the Harmony format that GPT-OSS was trained on.
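The translation described above can be pictured as a simple lookup. This is a hypothetical illustration only; the real logic lives in vLLM’s _construct_input_messages_with_harmony and also builds the full Harmony system message:

```python
# Hypothetical illustration of the tool-name translation described above.
# The actual implementation is in vLLM's serving_responses.py.

OPENAI_TO_GPT_OSS_TOOL = {
    "web_search_preview": "browser",  # routed to the browser MCP server
    "code_interpreter": "python",     # routed to the python MCP server
}

def internal_tool_names(tools):
    """Map each requested tool's `type` to its GPT-OSS built-in name,
    skipping types that aren't built-in tools."""
    return [OPENAI_TO_GPT_OSS_TOOL[t["type"]] for t in tools
            if t["type"] in OPENAI_TO_GPT_OSS_TOOL]

requested = [{"type": "code_interpreter", "container": {"type": "auto"}},
             {"type": "web_search_preview"}]
print(internal_tool_names(requested))  # ['python', 'browser']
```

If a requested built-in tool is not available on the configured tool server, vLLM cannot include it in the Harmony system message, so the model will not attempt to call it.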
Now run the code:
python3 ./test_builtin_tools.py
Here is the output for the query "How is the weather in Seattle, WA?":
And here it is for the query "Multiply 64548*15151 using builtin python interpreter.":
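If you only need the final answer text rather than the rich display above, you can pull it out of the response payload directly. The sketch below assumes the response has been dumped to a plain dict (e.g. via model_dump()); the field names follow the Responses API output items used in the test script, and the sample payload is illustrative:

```python
def extract_final_text(response_dict):
    """Collect the user-facing text from a Responses API payload:
    keep only assistant 'message' items and their 'output_text' parts,
    skipping reasoning and tool-call items."""
    parts = []
    for item in response_dict.get("output", []):
        if item.get("type") == "message" and item.get("role") == "assistant":
            for content in item.get("content", []):
                if content.get("type") == "output_text":
                    parts.append(content.get("text", ""))
    return "\n".join(parts)

# Illustrative sample payload mimicking the multiplication query above.
sample = {
    "output": [
        {"type": "reasoning",
         "content": [{"type": "reasoning_text", "text": "Need to compute..."}]},
        {"type": "message", "role": "assistant",
         "content": [{"type": "output_text", "text": "64548 * 15151 = 977,966,748"}]},
    ]
}
print(extract_final_text(sample))  # 64548 * 15151 = 977,966,748
```

This mirrors the filtering rule from the Harmony channels: only final-channel content, surfaced as assistant message items, should ever reach the user.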
Install, configure, and start LibreChat
LibreChat is an open-source AI chat platform that provides a unified interface for interacting with multiple AI models and services, including custom endpoints like our vLLM server.
git clone https://github.com/danny-avila/LibreChat.git
cd LibreChat
cp .env.example .env
Create a file named docker-compose.override.yml:
services:
  api:
    volumes:
      - type: bind
        source: ./librechat.yaml
        target: /app/librechat.yaml
Create the librechat.yaml file:
cp librechat.example.yaml librechat.yaml
Then, inside the custom endpoints section of this file (endpoints: custom:), add the following vLLM configuration:
- name: 'vLLM'
  apiKey: 'EMPTY'
  baseURL: 'http://host.docker.internal:8000/v1'
  models:
    default: ['openai/gpt-oss-20b']
    fetch: true
  titleConvo: true
  titleModel: 'current_model'
  titleMessageRole: 'user'
  summarize: false
  summaryModel: 'current_model'
  forcePrompt: false
  modelDisplayLabel: 'vLLM'
  addParams:
    web_search: true
    tools:
      - type: 'web_search_preview'
Then start the containers:
docker compose up -d
Note for Ubuntu 25.04 users: If LibreChat can’t connect to vLLM on the host, you may need to add iptables rules to allow Docker containers to access port 8000:
sudo iptables -I INPUT -i docker0 -p tcp --dport 8000 -j ACCEPT
sudo iptables -I DOCKER-USER -i docker0 -j ACCEPT
To make these rules permanent across reboots, save them with sudo iptables-save > /etc/iptables/rules.v4 (after installing iptables-persistent), add them to a startup script, or configure UFW: sudo ufw allow from 172.17.0.0/16 to any port 8000.
Here are a couple of examples of using the tools in LibreChat:
Happy chatting!