Azure OpenAI reasoning models (classic)
Azure OpenAI reasoning models are designed to tackle reasoning and problem-solving tasks with increased focus and capability. These models spend more time processing and understanding the user's request, making them exceptionally strong in areas like science, coding, and math compared to previous iterations.
Key capabilities of reasoning models:
- Complex Code Generation: Capable of generating algorithms and handling advanced coding tasks to support developers.
- Advanced Problem Solving: Ideal for comprehensive brainstorming sessions and addressing multifaceted challenges.
- Complex Document Comparison: Perfect for analyzing contracts, case files, or legal documents to identify subtle differences.
- Instruction Following and Workflow Management: Particularly effective for managing workflows requiring shorter contexts.
Prerequisites
An Azure OpenAI reasoning model deployed.
If you use the REST examples:
Install the Azure CLI. For more information, see Install the Azure CLI.
Sign in with `az login`, then generate a bearer token and store it in the `AZURE_OPENAI_AUTH_TOKEN` environment variable:
export AZURE_OPENAI_AUTH_TOKEN=$(az account get-access-token --resource https://cognitiveservices.azure.com --query accessToken -o tsv)
Usage
These models don't currently support the same set of parameters as other models that use the chat completions API.
Chat completions API
using Azure.Identity;
using OpenAI;
using OpenAI.Chat;
using System.ClientModel.Primitives;
#pragma warning disable OPENAI001 //currently required for token based authentication
BearerTokenPolicy tokenPolicy = new(
new DefaultAzureCredential(),
"https://ai.azure.com/.default");
ChatClient client = new(
model: "o4-mini",
authenticationPolicy: tokenPolicy,
options: new OpenAIClientOptions()
{
Endpoint = new Uri("https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1")
}
);
ChatCompletionOptions options = new ChatCompletionOptions
{
MaxOutputTokenCount = 100000
};
ChatCompletion completion = client.CompleteChat(
    [
        new DeveloperChatMessage("You are a helpful assistant"),
        new UserChatMessage("Tell me about the bitter lesson")
    ],
    options // pass the options so MaxOutputTokenCount is applied
);
Console.WriteLine($"[ASSISTANT]: {completion.Content[0].Text}");
Microsoft Entra ID:
If you're new to using Microsoft Entra ID for authentication, see How to configure Azure OpenAI in Microsoft Foundry Models with Microsoft Entra ID authentication.
from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
token_provider = get_bearer_token_provider(
DefaultAzureCredential(), "https://ai.azure.com/.default"
)
client = OpenAI(
base_url = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
api_key=token_provider,
)
response = client.chat.completions.create(
model="YOUR-DEPLOYMENT-NAME", # replace with your model deployment name
messages=[
{"role": "user", "content": "What steps should I think about when writing my first Python API?"},
],
max_completion_tokens = 5000
)
print(response.model_dump_json(indent=2))
API Key:
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("AZURE_OPENAI_API_KEY"),
base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)
response = client.chat.completions.create(
model="YOUR-DEPLOYMENT-NAME", # replace with your model deployment name
messages=[
{"role": "user", "content": "What steps should I think about when writing my first Python API?"},
],
max_completion_tokens = 5000
)
print(response.model_dump_json(indent=2))
curl -X POST "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $AZURE_OPENAI_AUTH_TOKEN" \
-d '{
"model": "gpt-5",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What steps should I think about when writing my first Python API?"}
],
"max_completion_tokens": 1000
}'
Python Chat Completions API Output:
{
"id": "chatcmpl-AEj7pKFoiTqDPHuxOcirA9KIvf3yz",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "Writing your first Python API is an exciting step in developing software that can communicate with other applications. An API (Application Programming Interface) allows different software systems to interact with each other, enabling data exchange and functionality sharing. Here are the steps you should consider when creating your first Python API...truncated for brevity.",
"refusal": null,
"role": "assistant",
"function_call": null,
"tool_calls": null
},
"content_filter_results": {
"hate": {
"filtered": false,
"severity": "safe"
},
"protected_material_code": {
"filtered": false,
"detected": false
},
"protected_material_text": {
"filtered": false,
"detected": false
},
"self_harm": {
"filtered": false,
"severity": "safe"
},
"sexual": {
"filtered": false,
"severity": "safe"
},
"violence": {
"filtered": false,
"severity": "safe"
}
}
}
],
"created": 1728073417,
"model": "o1-2024-12-17",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": "fp_503a95a7d8",
"usage": {
"completion_tokens": 1843,
"prompt_tokens": 20,
"total_tokens": 1863,
"completion_tokens_details": {
"audio_tokens": null,
"reasoning_tokens": 448
},
"prompt_tokens_details": {
"audio_tokens": null,
"cached_tokens": 0
}
},
"prompt_filter_results": [
{
"prompt_index": 0,
"content_filter_results": {
"custom_blocklists": {
"filtered": false
},
"hate": {
"filtered": false,
"severity": "safe"
},
"jailbreak": {
"filtered": false,
"detected": false
},
"self_harm": {
"filtered": false,
"severity": "safe"
},
"sexual": {
"filtered": false,
"severity": "safe"
},
"violence": {
"filtered": false,
"severity": "safe"
}
}
}
]
}
Reasoning effort
Note
Reasoning models have reasoning_tokens as part of completion_tokens_details in the model response. These are hidden tokens that aren't returned as part of the message response content but are used by the model to help generate a final answer to your request. reasoning_effort can be set to low, medium, or high for all reasoning models except o1-mini. The higher the effort setting, the longer the model will spend processing the request, which will generally result in a larger number of reasoning_tokens.
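To see how these fields relate, here's a small sketch that splits `completion_tokens` into hidden reasoning tokens and visible output tokens, using the `usage` values from the example output above. Reasoning tokens are billed as completion tokens even though they never appear in the message content.

```python
# Sketch: separating hidden reasoning tokens from the tokens that actually
# appear in the returned message, using the usage values shown above.
usage = {
    "completion_tokens": 1843,
    "prompt_tokens": 20,
    "total_tokens": 1863,
    "completion_tokens_details": {"reasoning_tokens": 448},
}

reasoning_tokens = usage["completion_tokens_details"]["reasoning_tokens"]
visible_tokens = usage["completion_tokens"] - reasoning_tokens
print(f"hidden reasoning tokens: {reasoning_tokens}")  # 448
print(f"visible response tokens: {visible_tokens}")    # 1395
```

Because reasoning tokens count toward `max_completion_tokens`, setting that limit too low can exhaust the budget on reasoning before any visible answer is produced.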
Developer messages
Developer messages ("role": "developer") are functionally the same as system messages.
Adding a developer message to the previous code example would look as follows:
using Azure.Identity;
using OpenAI;
using OpenAI.Chat;
using System.ClientModel.Primitives;
#pragma warning disable OPENAI001 //currently required for token based authentication
BearerTokenPolicy tokenPolicy = new(
new DefaultAzureCredential(),
"https://ai.azure.com/.default");
ChatClient client = new(
model: "o4-mini",
authenticationPolicy: tokenPolicy,
options: new OpenAIClientOptions()
{
Endpoint = new Uri("https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1")
}
);
ChatCompletionOptions options = new ChatCompletionOptions
{
ReasoningEffortLevel = ChatReasoningEffortLevel.Low,
MaxOutputTokenCount = 100000
};
ChatCompletion completion = client.CompleteChat(
    [
        new DeveloperChatMessage("You are a helpful assistant"),
        new UserChatMessage("Tell me about the bitter lesson")
    ],
    options // pass the options so ReasoningEffortLevel and MaxOutputTokenCount are applied
);
Console.WriteLine($"[ASSISTANT]: {completion.Content[0].Text}");
Microsoft Entra ID:
If you're new to using Microsoft Entra ID for authentication, see How to configure Azure OpenAI with Microsoft Entra ID authentication.
from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
token_provider = get_bearer_token_provider(
DefaultAzureCredential(), "https://ai.azure.com/.default"
)
client = OpenAI(
base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
api_key=token_provider,
)
response = client.chat.completions.create(
model="YOUR-DEPLOYMENT-NAME", # replace with your model deployment name
messages=[
{"role": "developer", "content": "You are a helpful assistant."},
{"role": "user", "content": "What steps should I think about when writing my first Python API?"},
],
max_completion_tokens=5000,
reasoning_effort="medium", # low, medium, or high
)
print(response.model_dump_json(indent=2))
API Key:
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("AZURE_OPENAI_API_KEY"),
base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
)
response = client.chat.completions.create(
model="gpt-5-mini", # replace with your model deployment name
messages=[
{"role": "developer","content": "You are a helpful assistant."}, # optional equivalent to a system message for reasoning models
{"role": "user", "content": "What steps should I think about when writing my first Python API?"},
],
max_completion_tokens = 5000,
reasoning_effort = "medium" # low, medium, or high
)
print(response.model_dump_json(indent=2))
curl -X POST "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $AZURE_OPENAI_AUTH_TOKEN" \
-d '{
"model": "gpt-5",
"messages": [
{"role": "developer", "content": "You are a helpful assistant."},
{"role": "user", "content": "What steps should I think about when writing my first Python API?"}
],
"max_completion_tokens": 1000,
"reasoning_effort": "medium"
}'
Python Chat Completions API Output:
{
"id": "chatcmpl-CaODNsQOHoRLcb9JVSKYY1e2Iss5s",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "Here’s a practical, beginner‑friendly checklist to guide you through writing your first Python API, from idea to production.
1) Clarify goals and constraints
- Who will use it (internal team, public), what problems it solves, expected traffic, latency requirements.
- Resources you’ll expose (users, orders, etc.) and core operations.
- Non‑functional needs: security, compliance, uptime, scalability.
2) Choose your API style
- REST (most common for CRUD and simple integrations).
- GraphQL (flexible queries, more complex to secure/monitor).
- gRPC (high‑performance, strongly typed, good for service‑to‑service).
- For a first API, REST + JSON is usually best.
3) Design the contract first
- Draft an OpenAPI/Swagger spec: endpoints, request/response schemas, status codes, error model.
- Decide naming conventions, pagination, filtering, sorting.
- Define consistent time/date format (ISO‑8601, UTC), ID format, and field casing.
- Plan versioning strategy (e.g., /v1) and deprecation policy.
4) Plan security and auth
- Pick auth: API keys for simple internal use; OAuth2/JWT for user auth; mTLS for service‑to‑service.
- CORS policy for browsers; HTTPS everywhere; security headers.
- Validate all inputs; avoid leaking stack traces; define rate limits and quotas.
5) Pick your Python stack
- Frameworks: FastAPI (great typing, validation, auto docs), Flask (minimal), Django REST Framework (batteries included).
- ASGI/WSGI server: Uvicorn or Gunicorn.
- Data layer: PostgreSQL + SQLAlchemy/Django ORM; migrations with Alembic/Django migrations.
- Caching: Redis (optional).
- Background jobs: Celery/RQ (if needed).
6) Set up the project
- Create a virtual environment; choose dependency management (pip, Poetry).
- Establish project structure (app, api, models, services, tests).
- Add linting/formatting/type checks: black, isort, flake8, mypy; pre‑commit hooks.
- Configuration via environment variables; secrets via a manager (not in code).
7) Implement core functionality
- Build endpoints that match your spec; keep business logic in a service layer, not in route handlers.
- Schema validation (Pydantic with FastAPI, Marshmallow for Flask).
- Consistent responses and errors; use clear status codes (201 create, 204 no content, 400/404/409/422, 500).
- Pagination and filtering; idempotency for certain POST operations; ETags/conditional requests if useful.
8) Error handling and an error model
- Define a standard error body (code, message, details, correlation_id).
- Log errors with context; don’t expose internal details to clients.
9) Testing strategy
- Unit tests for services/validators.
- Integration tests for endpoints (pytest + httpx/requests) with a test database.
- Contract tests to assert the API matches the OpenAPI spec.
- Mock external services; measure coverage and focus on critical paths.
10) Documentation and developer experience
- Auto‑generated docs (FastAPI provides Swagger/ReDoc).
- Write examples for each endpoint; onboarding and usage notes.
- Keep a changelog and release notes.
11) Observability and reliability
- Structured logging (JSON), include request IDs/correlation IDs.
- Metrics (requests, latency, error rates), health/readiness endpoints.
- Tracing (OpenTelemetry) if you have multiple services.
- Error reporting (Sentry or similar).
12) Deployment and operations
- Containerize with Docker; follow 12‑factor app principles.
- CI/CD pipeline: run tests, build image, deploy, run migrations.
- Choose hosting (Render, Fly.io, Railway, Heroku, AWS/GCP/Azure).
- Configure scaling, connection pools, and timeouts; use a reverse proxy if needed.
13) Performance and data concerns
- Index your database; avoid N+1 queries; use connection pooling.
- Load test key endpoints; profile hotspots.
- Caching strategies where appropriate; consider async I/O for high‑concurrency workloads.
14) Versioning and lifecycle management
- Keep backward compatibility for minor changes; add fields rather than changing semantics.
- Communicate deprecations; sunset old versions with a timeline.
15) Governance, compliance, and safety
- Handle PII correctly; data retention and audit logs if required.
- Least‑privilege DB access; rotate secrets; review third‑party dependencies.
Beginner‑friendly defaults
- FastAPI + Pydantic + Uvicorn
- PostgreSQL + SQLAlchemy + Alembic
- pytest + httpx + coverage
- black, isort, flake8, mypy, pre‑commit
- Docker + simple CI (GitHub Actions) + a managed host
Common pitfalls to avoid
- Inconsistent status codes or error formats.
- Weak input validation and missing authentication.
- Business logic inside route handlers (hard to test/maintain).
- No migrations or tests; no logging/metrics.
- Ignoring pagination and timezones; returning unbounded lists.
If you share whether it’s public vs internal, expected traffic, and preferred framework, I can tailor this to a concrete starter plan and recommended tools.",
"refusal": null,
"role": "assistant",
"annotations": [],
"audio": null,
"function_call": null,
"tool_calls": null
},
"content_filter_results": {
"hate": {
"filtered": false,
"severity": "safe"
},
"protected_material_code": {
"filtered": false,
"detected": false
},
"protected_material_text": {
"filtered": false,
"detected": false
},
"self_harm": {
"filtered": false,
"severity": "safe"
},
"sexual": {
"filtered": false,
"severity": "safe"
},
"violence": {
"filtered": false,
"severity": "safe"
}
}
}
],
"created": 1762788925,
"model": "gpt-5-2025-08-07",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": null,
"usage": {
"completion_tokens": 2919,
"prompt_tokens": 29,
"total_tokens": 2948,
"completion_tokens_details": {
"accepted_prediction_tokens": 0,
"audio_tokens": 0,
"reasoning_tokens": 1792,
"rejected_prediction_tokens": 0
},
"prompt_tokens_details": {
"audio_tokens": 0,
"cached_tokens": 0
}
},
"prompt_filter_results": [
{
"prompt_index": 0,
"content_filter_results": {
"hate": {
"filtered": false,
"severity": "safe"
},
"jailbreak": {
"filtered": false,
"detected": false
},
"self_harm": {
"filtered": false,
"severity": "safe"
},
"sexual": {
"filtered": false,
"severity": "safe"
},
"violence": {
"filtered": false,
"severity": "safe"
}
}
}
]
}
Reasoning summary
When using the latest reasoning models with the Responses API, you can use the reasoning summary parameter to receive summaries of the model's chain-of-thought reasoning.
Important
Attempting to extract raw reasoning through methods other than the reasoning summary parameter is not supported, may violate the Acceptable Use Policy, and may result in throttling or suspension when detected.
using OpenAI;
using OpenAI.Responses;
using System.ClientModel.Primitives;
using Azure.Identity;
#pragma warning disable OPENAI001 //currently required for token based authentication
BearerTokenPolicy tokenPolicy = new(
new DefaultAzureCredential(),
"https://ai.azure.com/.default");
OpenAIResponseClient client = new(
model: "o4-mini",
authenticationPolicy: tokenPolicy,
options: new OpenAIClientOptions()
{
Endpoint = new Uri("https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1")
}
);
OpenAIResponse response = await client.CreateResponseAsync(
userInputText: "What's the optimal strategy to win at poker?",
new ResponseCreationOptions()
{
ReasoningOptions = new ResponseReasoningOptions()
{
ReasoningEffortLevel = ResponseReasoningEffortLevel.High,
ReasoningSummaryVerbosity = ResponseReasoningSummaryVerbosity.Auto,
},
});
// Get the reasoning summary from the first OutputItem (ReasoningResponseItem)
Console.WriteLine("=== Reasoning Summary ===");
foreach (var item in response.OutputItems)
{
if (item is ReasoningResponseItem reasoningItem)
{
foreach (var summaryPart in reasoningItem.SummaryParts)
{
if (summaryPart is ReasoningSummaryTextPart textPart)
{
Console.WriteLine(textPart.Text);
}
}
}
}
Console.WriteLine("\n=== Assistant Response ===");
// Get the assistant's output
Console.WriteLine(response.GetOutputText());
You'll need to upgrade your OpenAI client library for access to the latest parameters.
pip install openai --upgrade
Microsoft Entra ID:
from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
token_provider = get_bearer_token_provider(
DefaultAzureCredential(), "https://ai.azure.com/.default"
)
client = OpenAI(
base_url = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
api_key=token_provider,
)
response = client.responses.create(
input="Tell me about the curious case of neural text degeneration",
model="gpt-5", # replace with model deployment name
reasoning={
"effort": "medium",
"summary": "auto" # auto, concise, or detailed; gpt-5 series models don't support concise
},
text={
"verbosity": "low" # New with GPT-5 models
}
)
print(response.model_dump_json(indent=2))
API Key:
import os
from openai import OpenAI
client = OpenAI(
base_url = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
api_key=os.getenv("AZURE_OPENAI_API_KEY")
)
response = client.responses.create(
input="Tell me about the curious case of neural text degeneration",
model="gpt-5", # replace with model deployment name
reasoning={
"effort": "medium",
"summary": "auto" # auto, concise, or detailed; gpt-5 series models don't support concise
},
text={
"verbosity": "low" # New with GPT-5 models
}
)
print(response.model_dump_json(indent=2))
curl -X POST "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/responses" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $AZURE_OPENAI_AUTH_TOKEN" \
-d '{
"model": "gpt-5",
"input": "Tell me about the curious case of neural text degeneration",
"reasoning": {"summary": "auto"},
"text": {"verbosity": "low"}
}'
{
"id": "resp_689a0a3090808190b418acf12b5cc40e0fc1c31bc69d8719",
"created_at": 1754925616.0,
"error": null,
"incomplete_details": null,
"instructions": null,
"metadata": {},
"model": "gpt-5",
"object": "response",
"output": [
{
"id": "rs_689a0a329298819095d90c34dc9b80db0fc1c31bc69d8719",
"summary": [],
"type": "reasoning",
"encrypted_content": null,
"status": null
},
{
"id": "msg_689a0a33009881909fe0fcf57cba30200fc1c31bc69d8719",
"content": [
{
"annotations": [],
"text": "Neural text degeneration refers to the ways language models produce low-quality, repetitive, or vacuous text, especially when generating long outputs. It’s “curious” because models trained to imitate fluent text can still spiral into unnatural patterns. Key aspects:
- Repetition and loops: The model repeats phrases or sentences (“I’m sorry, but...”), often due to high-confidence tokens reinforcing themselves.
- Loss of specificity: Vague, generic, agreeable text that avoids concrete details.
- Drift and contradiction: The output gradually departs from context or contradicts itself over long spans.
- Exposure bias: During training, models see gold-standard prefixes; at inference, they must condition on their own imperfect outputs, compounding errors.
- Likelihood vs. quality mismatch: Maximizing token-level likelihood doesn’t align with human preferences for diversity, coherence, or factuality.
- Token over-optimization: Frequent, safe tokens get overused; certain phrases become attractors.
- Entropy collapse: With greedy or low-temperature decoding, the distribution narrows too much, causing repetitive, low-entropy text.
- Length and beam search issues: Larger beams or long generations can favor bland, repetitive sequences (the “likelihood trap”).
Common mitigations:
- Decoding strategies:
- Top-k, nucleus (top-p), or temperature sampling to keep sufficient entropy.
- Typical sampling and locally typical sampling to avoid dull but high-probability tokens.
- Repetition penalties, presence/frequency penalties, no-repeat n-grams.
- Contrastive decoding (and variants like DoLa) to filter generic continuations.
- Min/max length, stop sequences, and beam search with diversity/penalties.
- Training and alignment:
- RLHF/DPO to better match human preferences for non-repetitive, helpful text.
- Supervised fine-tuning on high-quality, diverse data; instruction tuning.
- Debiasing objectives (unlikelihood training) to penalize repetition and banned patterns.
- Mixture-of-denoisers or latent planning to improve long-range coherence.
- Architectural and planning aids:
- Retrieval-augmented generation to ground outputs.
- Tool use and structured prompting to constrain drift.
- Memory and planning modules, hierarchical decoding, or sentence-level control.
- Prompting tips:
- Ask for concise answers, set token limits, and specify structure.
- Provide concrete constraints or content to reduce generic filler.
- Use “say nothing if uncertain” style instructions to avoid vacuity.
Representative papers/terms to search:
- Holtzman et al., “The Curious Case of Neural Text Degeneration” (2020): nucleus sampling.
- Welleck et al., “Neural Text Degeneration with Unlikelihood Training.”
- Li et al., “A Contrastive Framework for Decoding.”
- Su et al., “DoLa: Decoding by Contrasting Layers.”
- Meister et al., “Typical Decoding.”
- Ouyang et al., “Training language models to follow instructions with human feedback.”
In short, degeneration arises from a mismatch between next-token likelihood and human preferences plus decoding choices; careful decoding, training objectives, and grounding help prevent it.",
"type": "output_text",
"logprobs": null
}
],
"role": "assistant",
"status": "completed",
"type": "message"
}
],
"parallel_tool_calls": true,
"temperature": 1.0,
"tool_choice": "auto",
"tools": [],
"top_p": 1.0,
"background": false,
"max_output_tokens": null,
"max_tool_calls": null,
"previous_response_id": null,
"prompt": null,
"prompt_cache_key": null,
"reasoning": {
"effort": "minimal",
"generate_summary": null,
"summary": "detailed"
},
"safety_identifier": null,
"service_tier": "default",
"status": "completed",
"text": {
"format": {
"type": "text"
}
},
"top_logprobs": null,
"truncation": "disabled",
"usage": {
"input_tokens": 16,
"input_tokens_details": {
"cached_tokens": 0
},
"output_tokens": 657,
"output_tokens_details": {
"reasoning_tokens": 0
},
"total_tokens": 673
},
"user": null,
"content_filters": null,
"store": true
}
Note
Even when enabled, reasoning summaries are not guaranteed to be generated for every step/request. This is expected behavior.
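Because a summary isn't guaranteed, check for it defensively before reading it. A minimal sketch, assuming the JSON shape shown in the example output above (where the reasoning item's "summary" list came back empty):

```python
# Sketch: defensively extracting reasoning summary text from a Responses API
# result. `output` mirrors the "output" array in the JSON example above; the
# summary list can be empty even when summaries are requested.
output = [
    {"type": "reasoning", "summary": []},  # no summary generated this time
    {"type": "message", "content": [{"type": "output_text", "text": "..."}]},
]

summaries = [
    part["text"]
    for item in output
    if item.get("type") == "reasoning"
    for part in item.get("summary", [])
    if part.get("type") == "summary_text"
]

if summaries:
    print("\n".join(summaries))
else:
    print("No reasoning summary was returned for this request.")
```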
Python lark
GPT-5 series reasoning models can call a new custom tool named lark_tool. This tool is based on the Python Lark parsing library and can be used to constrain model output more flexibly.
Responses API
{
"model": "gpt-5-2025-08-07",
"input": "please calculate the area of a circle with radius equal to the number of 'r's in strawberry",
"tools": [
{
"type": "custom",
"name": "lark_tool",
"format": {
"type": "grammar",
"syntax": "lark",
"definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\n?]{1,200}\\?/\nNEWLINE: /\n/\nANSWER: /[^\n!]{1,200}!/"
}
}
],
"tool_choice": "required"
}
Microsoft Entra ID:
from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
token_provider = get_bearer_token_provider(
DefaultAzureCredential(), "https://ai.azure.com/.default"
)
client = OpenAI(
base_url = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
api_key=token_provider,
)
response = client.responses.create(
model="gpt-5", # replace with your model deployment name
tools=[
{
"type": "custom",
"name": "lark_tool",
"format": {
"type": "grammar",
"syntax": "lark",
"definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\n?]{1,200}\\?/\nNEWLINE: /\n/\nANSWER: /[^\n!]{1,200}!/"
}
}
],
input=[{"role": "user", "content": "Please calculate the area of a circle with radius equal to the number of 'r's in strawberry"}],
)
print(response.model_dump_json(indent=2))
API Key:
import os
from openai import OpenAI
client = OpenAI(
base_url = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
api_key=os.getenv("AZURE_OPENAI_API_KEY")
)
response = client.responses.create(
model="gpt-5", # replace with your model deployment name
tools=[
{
"type": "custom",
"name": "lark_tool",
"format": {
"type": "grammar",
"syntax": "lark",
"definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\n?]{1,200}\\?/\nNEWLINE: /\n/\nANSWER: /[^\n!]{1,200}!/"
}
}
],
input=[{"role": "user", "content": "Please calculate the area of a circle with radius equal to the number of 'r's in strawberry"}],
)
print(response.model_dump_json(indent=2))
Output:
{
"id": "resp_689a0cf927408190b8875915747667ad01c936c6ffb9d0d3",
"created_at": 1754926332.0,
"error": null,
"incomplete_details": null,
"instructions": null,
"metadata": {},
"model": "gpt-5",
"object": "response",
"output": [
{
"id": "rs_689a0cfd1c888190a2a67057f471b5cc01c936c6ffb9d0d3",
"summary": [],
"type": "reasoning",
"encrypted_content": null,
"status": null
},
{
"id": "msg_689a0d00e60c81908964e5e9b2d6eeb501c936c6ffb9d0d3",
"content": [
{
"annotations": [],
"text": "“strawberry” has 3 r’s, so the radius is 3.
Area = πr² = π × 3² = 9π ≈ 28.27 square units.",
"type": "output_text",
"logprobs": null
}
],
"role": "assistant",
"status": "completed",
"type": "message"
}
],
"parallel_tool_calls": true,
"temperature": 1.0,
"tool_choice": "auto",
"tools": [
{
"name": "lark_tool",
"parameters": null,
"strict": null,
"type": "custom",
"description": null,
"format": {
"type": "grammar",
"definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\n?]{1,200}\\?/\nNEWLINE: /\n/\nANSWER: /[^\n!]{1,200}!/",
"syntax": "lark"
}
}
],
"top_p": 1.0,
"background": false,
"max_output_tokens": null,
"max_tool_calls": null,
"previous_response_id": null,
"prompt": null,
"prompt_cache_key": null,
"reasoning": {
"effort": "medium",
"generate_summary": null,
"summary": null
},
"safety_identifier": null,
"service_tier": "default",
"status": "completed",
"text": {
"format": {
"type": "text"
}
},
"top_logprobs": null,
"truncation": "disabled",
"usage": {
"input_tokens": 139,
"input_tokens_details": {
"cached_tokens": 0
},
"output_tokens": 240,
"output_tokens_details": {
"reasoning_tokens": 192
},
"total_tokens": 379
},
"user": null,
"content_filters": null,
"store": true
}
Chat Completions
{
"messages": [
{
"role": "user",
"content": "Which one is larger, 42 or 0?"
}
],
"tools": [
{
"type": "custom",
"name": "custom_tool",
"custom": {
"name": "lark_tool",
"format": {
"type": "grammar",
"grammar": {
"syntax": "lark",
"definition": "start: QUESTION NEWLINE ANSWER\nQUESTION: /[^\n?]{1,200}\\?/\nNEWLINE: /\n/\nANSWER: /[^\n!]{1,200}!/"
}
}
}
}
],
"tool_choice": "required",
"model": "gpt-5-2025-08-07"
}
Availability
Region availability
| Model | Region | Limited access |
|---|---|---|
| `gpt-5.4-mini` | **Global Standard:** East US2, Sweden Central, South Central US, Poland Central | No access request needed. |
| `gpt-5.4-nano` | **Global Standard:** East US2, Sweden Central, South Central US, Poland Central<br>**Data Zone Standard:** East US2, South Central US | No access request needed. |
| `gpt-5.4-pro` | Model availability | Request access: Limited access model application. If you already have access to a limited access model, no request is required. |
| `gpt-5.4` | Model availability | Request access: Limited access model application. If you already have access to a limited access model, no request is required. |
| `gpt-5.3-codex` | Model availability | Request access: Limited access model application. If you already have access to a limited access model, no request is required. |
| `gpt-5.2-codex` | Model availability | Request access: Limited access model application. If you already have access to a limited access model, no request is required. |
| `gpt-5.2` | Model availability | Request access: Limited access model application. If you already have access to a limited access model, no request is required. |
| `gpt-5.1-codex-max` | Model availability | Access is no longer restricted for this model. |
| `gpt-5.1` | Model availability | Access is no longer restricted for this model. |
| `gpt-5.1-chat` | Model availability | No access request needed. |
| `gpt-5.1-codex` | Model availability | Access is no longer restricted for this model. |
| `gpt-5.1-codex-mini` | Model availability | No access request needed. |
| `gpt-5-pro` | Model availability | Access is no longer restricted for this model. |
| `gpt-5-codex` | Model availability | Access is no longer restricted for this model. |
| `gpt-5` | Model availability | Access is no longer restricted for this model. |
| `gpt-5-mini` | Model availability | No access request needed. |
| `gpt-5-nano` | Model availability | No access request needed. |
| `o3-pro` | Model availability | Request access: Limited access model application. If you already have access to a limited access model, no request is required. |
| `codex-mini` | Model availability | No access request needed. |
| `o4-mini` | Model availability | No access request needed to use the core capabilities of this model. Request access: o4-mini reasoning summary feature. |
| `o3` | Model availability | Request access: Limited access model application. |
| `o3-mini` | Model availability | Access is no longer restricted for this model. |
| `o1` | Model availability | Access is no longer restricted for this model. |
API & feature support
| Feature | gpt-5.4-nano, 2026-03-17 | gpt-5.4-mini, 2026-03-17 | gpt-5.4-pro | gpt-5.4, 2026-03-05 | gpt-5.3-codex, 2026-02-24 | gpt-5.2-codex, 2026-01-14 | gpt-5.2, 2025-12-11 | gpt-5.1-codex-max, 2025-12-04 | gpt-5.1, 2025-11-13 | gpt-5.1-chat, 2025-11-13 | gpt-5.1-codex, 2025-11-13 | gpt-5.1-codex-mini, 2025-11-13 | gpt-5-pro, 2025-10-06 | gpt-5-codex, 2025-09-011 | gpt-5, 2025-08-07 | gpt-5-mini, 2025-08-07 | gpt-5-nano, 2025-08-07 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Developer Messages | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Structured Outputs | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Context Window | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 1,050,000<br>Input: 922,000<br>Output: 128,000 | 1,050,000<br>Input: 922,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 128,000<br>Input: 111,616<br>Output: 16,384 | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 | 400,000<br>Input: 272,000<br>Output: 128,000 |
| Reasoning effort7 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅6 | ✅4 | ✅ | ✅ | ✅ | ✅5 | ✅ | ✅ | ✅ | ✅ |
| Image input | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Chat Completions API | ✅ | ✅ | - | ✅ | - | - | ✅ | - | ✅ | ✅ | - | - | - | - | ✅ | ✅ | ✅ |
| Responses API | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Functions/Tools | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| Parallel Tool Calls1 | ✅ | ✅ | - | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - | ✅ | ✅ | ✅ | ✅ |
max_completion_tokens 2 |
✅ | ✅ | - | ✅ | - | - | ✅ | - | ✅ | ✅ | - | - | - | - | ✅ | ✅ | ✅ |
| System Messages 3 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Reasoning summary | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Streaming | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - | ✅ | ✅ | ✅ | ✅ |
<sup>1</sup> Parallel tool calls aren't supported when `reasoning_effort` is set to `minimal`.
<sup>2</sup> Reasoning models only work with the `max_completion_tokens` parameter when using the Chat Completions API. Use `max_output_tokens` with the Responses API.
<sup>3</sup> The latest reasoning models support system messages to make migration easier. Don't use both a developer message and a system message in the same API request.
<sup>4</sup> gpt-5.1 `reasoning_effort` defaults to `none`. When upgrading from previous reasoning models to gpt-5.1, keep in mind that you may need to update your code to explicitly pass a `reasoning_effort` level if you want reasoning to occur.
<sup>5</sup> gpt-5-pro only supports `reasoning_effort` `high`, and this is the default value even when not explicitly passed to the model.
<sup>6</sup> gpt-5.1-codex-max adds support for a new `reasoning_effort` level, `xhigh`, which is the highest level that reasoning effort can be set to.
<sup>7</sup> gpt-5.2, gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-max, and gpt-5.1-codex-mini support `none` as a value for the `reasoning_effort` parameter. To use these models to generate responses without reasoning, set `reasoning_effort="none"`. This setting can increase speed.
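As a concrete illustration of footnotes 2 and 7, the sketch below builds a Chat Completions request body for a reasoning model. The helper function is hypothetical (not part of any SDK), and the deployment name `gpt-5.2` is a placeholder for your own deployment; only the field names `max_completion_tokens` and `reasoning_effort` come from the table above.

```python
import json

# Valid reasoning_effort levels per the table above (model support varies).
VALID_EFFORTS = {"none", "minimal", "low", "medium", "high", "xhigh"}

def build_chat_request(deployment, prompt, reasoning_effort="none",
                       max_completion_tokens=4096):
    """Hypothetical helper: Chat Completions body for a reasoning model."""
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(f"unknown reasoning_effort: {reasoning_effort!r}")
    return {
        "model": deployment,
        "messages": [{"role": "user", "content": prompt}],
        # Reasoning models reject max_tokens; the output cap must be sent
        # as max_completion_tokens on the Chat Completions API (footnote 2).
        "max_completion_tokens": max_completion_tokens,
        # reasoning_effort="none" skips reasoning for lower latency on the
        # models that support it (footnote 7).
        "reasoning_effort": reasoning_effort,
    }

body = build_chat_request("gpt-5.2", "Summarize this contract clause.")
print(json.dumps(body, indent=2))
```

With the Responses API you would send `max_output_tokens` instead; the rest of the shape differs as well, so this sketch applies to Chat Completions only.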
NEW GPT-5 reasoning features
| Feature | Description |
|---|---|
| `reasoning_effort` | Options: `none`, `minimal`, `low`, `medium`, `high`, `xhigh`. `xhigh` is only supported with gpt-5.1-codex-max. `minimal` is only supported with the original GPT-5 reasoning models\* and is not supported with gpt-5.1 or later. |
| `verbosity` | A new parameter providing more granular control over how concise the model's output is. Options: `low`, `medium`, `high`. |
| `preamble` | GPT-5 series reasoning models can spend extra time "thinking" before executing a function/tool call. While this planning occurs, the model can provide insight into the planning steps in the model response via a new object called the `preamble` object. Generation of preambles in the model response isn't guaranteed, though you can encourage the model by using the `instructions` parameter and passing content like "You MUST plan extensively before each function call. ALWAYS output your plan to the user before calling any function". |
| allowed tools | You can specify multiple tools under `tool_choice` instead of just one. |
| custom tool type | Enables raw text (non-JSON) outputs. |
| `lark_tool` | Allows you to use some of the capabilities of Python lark for more flexible constraining of model responses. |

\* gpt-5-codex also does not support `reasoning_effort` `minimal`.
For more information, we also recommend reading OpenAI's GPT-5 prompting cookbook guide and their GPT-5 feature guide.
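The features above can be exercised together in a single request. The sketch below shows a Responses API request body combining `reasoning_effort`, `verbosity`, and the preamble-encouraging instruction; the nesting of the fields (`reasoning.effort`, `text.verbosity`) follows OpenAI's Responses API shape and should be verified against the current API reference, and the deployment name is a placeholder.

```python
# Sketch of a Responses API request body using the GPT-5 features above.
# Field nesting (reasoning.effort, text.verbosity) is an assumption based
# on OpenAI's Responses API; check the current API reference before use.
request = {
    "model": "gpt-5",  # placeholder for your deployment name
    "input": "Refactor this function and explain your plan first.",
    "reasoning": {"effort": "low"},   # one of the reasoning_effort options
    "text": {"verbosity": "low"},     # keep the final answer concise
    "instructions": (
        "You MUST plan extensively before each function call. "
        "ALWAYS output your plan to the user before calling any function"
    ),  # encourages (but doesn't guarantee) preamble generation
}
```

Lowering both effort and verbosity trades answer depth for latency and cost; the `instructions` string is the exact example given in the preamble row above.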
| Feature | codex-mini, 2025-05-16 | o3-pro, 2025-06-10 | o4-mini, 2025-04-16 | o3, 2025-04-16 | o3-mini, 2025-01-31 | o1, 2024-12-17 |
|---|---|---|---|---|---|---|
| Developer Messages | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Structured Outputs | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Context Window | Input: 200,000<br>Output: 100,000 | Input: 200,000<br>Output: 100,000 | Input: 200,000<br>Output: 100,000 | Input: 200,000<br>Output: 100,000 | Input: 200,000<br>Output: 100,000 | Input: 200,000<br>Output: 100,000 |
| Reasoning effort | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Image input | ✅ | ✅ | ✅ | ✅ | - | ✅ |
| Chat Completions API | - | - | ✅ | ✅ | ✅ | ✅ |
| Responses API | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Functions/Tools | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Parallel Tool Calls | - | - | - | - | - | - |
| `max_completion_tokens` <sup>1</sup> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| System Messages <sup>2</sup> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Reasoning summary | ✅ | - | ✅ | ✅ | - | - |
| Streaming <sup>3</sup> | ✅ | - | ✅ | ✅ | ✅ | - |
<sup>1</sup> Reasoning models only work with the `max_completion_tokens` parameter when using the Chat Completions API. Use `max_output_tokens` with the Responses API.
<sup>2</sup> The latest o\* series models support system messages to make migration easier. When you use a system message with o4-mini, o3, o3-mini, and o1, it's treated as a developer message. Don't use both a developer message and a system message in the same API request.
<sup>3</sup> Streaming for o3 is limited access only.
Note
- To avoid timeouts, background mode is recommended for `o3-pro`.
- `o3-pro` doesn't currently support image generation.
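The note above can be put into practice with a background request. The sketch below shows a Responses API body for `o3-pro` with background mode enabled; the `background` field name follows OpenAI's Responses API and is an assumption to verify against the current API reference.

```python
# Sketch: Responses API body for o3-pro using background mode, per the
# note above. The "background" field name is an assumption based on
# OpenAI's Responses API; verify it against the current API reference.
request = {
    "model": "o3-pro",  # placeholder for your deployment name
    "input": "Compare these two contracts and list every material difference.",
    # Background mode returns immediately with a response you poll for,
    # instead of holding one long-lived connection that may time out.
    "background": True,
    # Output cap for the Responses API (o3-pro output limit is 100,000).
    "max_output_tokens": 100000,
}
```

A background request is retrieved later by its response ID rather than read from the initial HTTP response, which is what makes it robust to long o3-pro reasoning times.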
Not Supported
The following are currently unsupported with reasoning models:
`temperature`, `top_p`, `presence_penalty`, `frequency_penalty`, `logprobs`, `top_logprobs`, `logit_bias`, `max_tokens`
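If you're migrating existing code, a small filter can keep these parameters out of requests to reasoning models. The helper below is hypothetical (not part of any SDK); the parameter list is exactly the one given above.

```python
# Parameters rejected by reasoning models, per the list above.
UNSUPPORTED = {
    "temperature", "top_p", "presence_penalty", "frequency_penalty",
    "logprobs", "top_logprobs", "logit_bias", "max_tokens",
}

def strip_unsupported(params):
    """Hypothetical helper: return a copy of `params` without options
    that reasoning models reject, so the service doesn't error out."""
    return {k: v for k, v in params.items() if k not in UNSUPPORTED}

clean = strip_unsupported(
    {"temperature": 0.7, "max_tokens": 256, "max_completion_tokens": 256}
)
# clean == {"max_completion_tokens": 256}
```

Note that `max_tokens` is dropped while `max_completion_tokens` survives, matching the Chat Completions guidance earlier in this article.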
Markdown output
By default, the o3-mini and o1 models don't attempt to produce output that includes markdown formatting. A common use case where this behavior is undesirable is when you want the model to output code contained within a markdown code block. When the model generates output without markdown formatting, you lose features like syntax highlighting and copyable code blocks in interactive playground experiences. To override this default behavior and encourage markdown inclusion in model responses, add the string `Formatting re-enabled` to the beginning of your developer message.
Adding `Formatting re-enabled` to the beginning of your developer message doesn't guarantee that the model will include markdown formatting in its response; it only increases the likelihood. We have found from internal testing that `Formatting re-enabled` is less effective by itself with the o1 model than with o3-mini.
To improve the performance of `Formatting re-enabled`, you can further augment the beginning of the developer message, which will often result in the desired output. Rather than just adding `Formatting re-enabled` to the beginning of your developer message, you can experiment with adding a more descriptive initial instruction like one of the examples below:
- `Formatting re-enabled - please enclose code blocks with appropriate markdown tags.`
- `Formatting re-enabled - code output should be wrapped in markdown.`
Depending on your expected output, you may need to customize your initial developer message further to target your specific use case.
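The technique above amounts to prefixing your developer message before sending the request. The helper below is a hypothetical sketch that does exactly that, using one of the descriptive variants suggested above as the default prefix.

```python
# One of the descriptive "Formatting re-enabled" variants suggested above.
MARKDOWN_PREFIX = ("Formatting re-enabled - please enclose code blocks "
                   "with appropriate markdown tags.")

def developer_message(instructions, prefix=MARKDOWN_PREFIX):
    """Hypothetical helper: build a developer message whose content starts
    with the markdown opt-in string, followed by your own instructions."""
    return {"role": "developer", "content": f"{prefix}\n{instructions}"}

msg = developer_message("Write a Python function that parses ISO dates.")
print(msg["content"])
```

Remember that this only raises the likelihood of markdown output; as noted above, it's less effective by itself with o1 than with o3-mini, so inspect responses for your specific use case.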
- Last updated on 2026-03-17
