What is GLM‑4.5 Air?
GLM‑4.5 Air is the cost‑efficient tier of Zhipu AI’s GLM‑4.5 series. Zhipu positions GLM‑4.5 and GLM‑4.5 Air as foundational models for agent‑oriented applications, meaning they are built to handle tool invocation, web browsing, and complex workflows rather than only casual chat. GLM‑4.5 Air keeps the same overall design intent as the flagship GLM‑4.5 model but trims the parameter footprint to make the model faster and significantly cheaper to run at scale.
The GLM‑4.5 family focuses on three themes: reasoning, coding, and agentic execution. For teams that need to deliver these capabilities across many users, GLM‑4.5 Air provides the most accessible entry point. It is intended for production systems where throughput and cost predictability are as important as accuracy.
Official model ID and core specifications
The Zhipu API documentation lists the model identifier as `glm-4.5-air`. It is a text‑only model with a 128K context window and a maximum output length of 96K tokens. These constraints define the upper bound for long‑context applications such as document analysis, multi‑turn planning, and agent workflows.
| Parameter | Official value |
|---|---|
| Model ID | glm-4.5-air |
| Context window | 128K tokens |
| Max output tokens | 96K tokens |
| Language support | English and Chinese |
| Input modality | Text |
| Output modality | Text |
Architecture and training pipeline
GLM‑4.5 Air uses a Mixture‑of‑Experts (MoE) architecture. Zhipu reports that the model has 106B total parameters with 12B active parameters per forward pass. This allows GLM‑4.5 Air to retain high capability while dramatically lowering inference cost compared to dense models. The MoE design is a key reason the model can scale to long‑context workloads without the same hardware footprint as the full GLM‑4.5 tier.
The training pipeline follows the same pattern as GLM‑4.5: pretraining on 15 trillion tokens of general‑domain data, followed by targeted fine‑tuning for reasoning, coding, and agent‑specific tasks. Zhipu also applies reinforcement learning to strengthen reasoning quality and improve the model’s reliability when handling tool calls and multi‑step workflows.
Core capability set in the GLM‑4.5 series
Zhipu’s GLM‑4.5 overview emphasizes a common capability set across the series. GLM‑4.5 Air inherits these foundations, which means the model is more than a fast chat engine. It is designed to integrate with real systems and operate as part of tool‑driven workflows.
- Hybrid reasoning modes for complex versus lightweight requests.
- Tool invocation APIs for function‑style calls.
- Web browsing support for data‑backed answers.
- Streaming output for lower perceived latency.
- Strong performance in software engineering and front‑end development tasks.
These capabilities make GLM‑4.5 Air suitable for agent pipelines where you need a balance of reasoning, structured tool use, and fast responsiveness. In practice, you can combine streaming output with concise prompts to build UIs that feel immediate without sacrificing accuracy.
Agent‑oriented capabilities
Zhipu frames GLM‑4.5 Air as a foundation model for agents. That focus shows up in several official capability areas: tool invocation, web browsing, software engineering, and front‑end development. In practice, these features allow GLM‑4.5 Air to coordinate multi‑step tasks such as fetching external data, generating code, or orchestrating multi‑tool pipelines in an agent system.
If you are building an agent that must select tools, write code, or plan a workflow, GLM‑4.5 Air is designed to fit those constraints. It provides a cost‑efficient option that still benefits from the GLM‑4.5 family’s agent‑first training emphasis.
Hybrid reasoning modes
GLM‑4.5 Air supports hybrid reasoning modes. The documentation describes two execution paths: Thinking Mode for complex reasoning or tool usage, and Non‑Thinking Mode for faster, lightweight responses. These modes can be toggled via the `thinking.type` parameter, using `enabled` or `disabled` values. Dynamic thinking is enabled by default, which means the model decides when to reason more deeply based on the prompt.
In production, this gives teams a simple trade‑off knob. When latency matters most, disable thinking. When reliability matters more than speed, enable thinking explicitly or accept the default dynamic behavior. Because GLM‑4.5 Air is already optimized for efficiency, hybrid reasoning helps you fine‑tune the balance between cost and quality.
Long‑context design and output budgeting
The 128K context window makes GLM‑4.5 Air useful for long‑form tasks such as contract review, multi‑document synthesis, or large codebase analysis. Its 96K maximum output length is also unusually high, allowing detailed reports or structured outputs without aggressive truncation. Still, long outputs are expensive and can be slower to generate, so output budgeting matters in production.
A practical pattern is to adopt a staged workflow. First summarize or extract key points from each source document, then synthesize across those summaries. This approach keeps token usage predictable and reduces the risk of exceeding the output limit. It also makes evaluation easier because you can validate intermediate steps instead of reviewing one massive response.
Language support and localization
The official model list specifies English and Chinese language support. This makes GLM‑4.5 Air a practical choice for bilingual teams or products that need to serve both global and Chinese‑language users. When a workflow must produce mixed outputs, explicitly segment the prompt and specify which parts should be in which language. Clear language instructions are especially important in long‑context prompts where multiple sources might be in different languages.
For multilingual knowledge bases, a common pattern is to retrieve in the user’s language and then ask the model to translate or summarize into the target language. Because GLM‑4.5 Air operates on text only, it is straightforward to apply these language transformations without mixing modalities or extra preprocessing layers.
Official pricing and cost structure
Zhipu’s public pricing lists GLM‑4.5 Air at $0.2 per million input tokens, $0.03 per million cached input tokens, and $1.1 per million output tokens. Cached input storage is listed as limited‑time free in the official pricing table. These values make GLM‑4.5 Air one of the most cost‑efficient options in the GLM‑4.5 family while still supporting full long‑context and agent‑oriented capabilities.
Zhipu also lists a per‑use charge for web search in its built‑in tools catalog. If your agent relies on web browsing, budget for the extra $0.01 per use alongside token costs.
| Model | Input (per 1M) | Cached input (per 1M) | Output (per 1M) |
|---|---|---|---|
| GLM‑4.5 | $0.6 | $0.11 | $2.2 |
| GLM‑4.5‑X | $2.2 | $0.45 | $8.9 |
| GLM‑4.5 Air | $0.2 | $0.03 | $1.1 |
| GLM‑4.5 AirX | $1.1 | $0.22 | $4.5 |
| GLM‑4.5 Flash | Free | Free | Free |
Where GLM‑4.5 Air fits in the GLM‑4.5 family
The GLM‑4.5 family includes multiple tiers, each optimized for a different trade‑off between capability, latency, and cost. GLM‑4.5 is the flagship reasoning model. GLM‑4.5‑X pushes performance and responsiveness further for complex reasoning. GLM‑4.5 Air is the cost‑effective workhorse, while GLM‑4.5 AirX and GLM‑4.5 Flash emphasize speed and affordability.
GLM‑4.5 Air is the best fit for production systems that need high throughput but cannot justify flagship pricing. It preserves the same 128K context and hybrid reasoning framework while delivering lower cost per token.
| Model | Positioning | Why choose it |
|---|---|---|
| GLM‑4.5 | Flagship reasoning model | Best overall capability for demanding agent workflows |
| GLM‑4.5‑X | High‑performance reasoning tier | Faster, more capable responses for complex workflows |
| GLM‑4.5 Air | Cost‑effective tier | Balanced capability and price for high‑volume workloads |
| GLM‑4.5 AirX | Lightweight high‑speed tier | Optimized for responsiveness with higher cost than Air |
| GLM‑4.5 Flash | Free lightweight tier | Best for experimentation and low‑cost demos |
Cost control strategies
Even with low per‑token pricing, costs can spike when agents generate long outputs or perform many tool calls. GLM‑4.5 Air’s cached input pricing is useful for repeated system prompts or shared retrieval context. If your workflow reuses the same policies, schemas, or long tool instructions, caching those tokens can significantly lower spend.
You can also reduce cost by keeping outputs structured and concise. When the response must be long, break it into separate steps or sections and enforce a maximum length. Because GLM‑4.5 Air supports 96K output tokens, it is tempting to request very large answers; in production settings, it is better to cap output and only expand when the user explicitly asks.
Latency, streaming, and throughput planning
GLM‑4.5 Air supports streaming output, which lets you render partial responses while the model is still generating. This is especially useful for chat UIs and agent dashboards where user experience is sensitive to perceived latency. When combined with lower‑temperature prompts and concise output formats, streaming can make the model feel significantly faster even in long‑context sessions.
For high‑throughput systems, plan for concurrency and back‑pressure. Long‑context requests and large outputs can still be slow, so it’s wise to reserve GLM‑4.5 Air for requests that truly need large context. Use smaller prompts or lightweight models for simple classification and routing tasks, then escalate to GLM‑4.5 Air only when the workload benefits from deeper reasoning or long context.
Prompting guidance for GLM‑4.5 Air
GLM‑4.5 Air works best with clear task framing and explicit constraints. Start with a concise system or instruction header that defines the goal, then add sections for requirements, context, and desired output format. The model’s long context window means you can provide background materials, but clear structure is still essential to reduce ambiguity.
For example, if you are asking for a report, specify the audience, structure, and length: “Write a 6‑section report for an engineering manager, 1‑2 paragraphs per section, with a final risk summary.” In bilingual workflows, explicitly state the output language; GLM‑4.5 Air supports Chinese and English, and clear language instructions prevent mixed outputs.
Production use cases
GLM‑4.5 Air is particularly strong in high‑volume workloads: customer support assistants, document summarization pipelines, meeting‑note synthesis, and internal knowledge base search. Its agent‑oriented focus also makes it suitable for workflow automation, where the model can plan tool calls, retrieve data, and draft structured outputs.
In software teams, GLM‑4.5 Air can perform code review assistance, generate test plans, or draft documentation. In operations, it can classify tickets and recommend resolution steps. For the most complex reasoning tasks, you can route only the high‑stakes steps to GLM‑4.5 or GLM‑4.5‑X while keeping day‑to‑day traffic on GLM‑4.5 Air.
Workflow blueprint: long‑document review
A common production pattern is to use GLM‑4.5 Air as the main engine for long‑document review while keeping deterministic checks outside the model. This balances cost and reliability and makes the system easier to audit. A typical workflow looks like this:
- Chunk large documents into sections that fit safely within the 128K context window.
- Ask GLM‑4.5 Air to extract key facts or risks for each chunk.
- Run rule‑based checks or domain validators on the extracted facts.
- Send the validated summaries back to GLM‑4.5 Air for synthesis.
- Produce a final report with structured headings and an executive summary.
This pattern works well for compliance reviews, contract analysis, or policy audits. It also lets you control costs because each step is bounded. If the synthesis step needs higher reasoning accuracy, you can route only that final step to GLM‑4.5 or GLM‑4.5‑X without rewriting the rest of the pipeline.
Reliability and evaluation practices
Like all large language models, GLM‑4.5 Air can produce incorrect details or hallucinated content. For high‑stakes outputs, build review workflows or automated checks. Retrieval‑augmented generation and explicit citations within your own system context can reduce error rates, especially for factual or regulated content.
Because GLM‑4.5 Air supports tool invocation and web browsing, you can design agents that verify information before responding. Treat the model as a reasoning engine that can propose actions, and then validate those actions with deterministic systems or trusted data sources.
Implementation notes for API usage
GLM‑4.5 Air is accessed through Zhipu’s chat completion APIs. The `max_tokens` parameter controls response length, and for the GLM‑4.5 series the maximum supported output is 96K. Temperature and top‑p are available to tune creativity versus determinism, but for production automation it is common to use lower temperature settings and rely on structure in the prompt.
When enabling tool calls, treat the model output as a plan rather than a final answer. You can parse tool invocations, run the tools, and then call the model again with results. This is the standard pattern for building robust agents on top of GLM‑4.5 Air.
FAQ
What is the official model ID for GLM‑4.5 Air?
The official API model identifier is `glm-4.5-air`.
How large is the context window?
GLM‑4.5 Air supports a 128K token context window.
What is the maximum output length?
The GLM‑4.5 series supports up to 96K output tokens, which applies to GLM‑4.5 Air as well.
What are the official prices?
Zhipu lists $0.2 per million input tokens, $0.03 per million cached input tokens, and $1.1 per million output tokens for GLM‑4.5 Air.
When should I choose GLM‑4.5 Air over GLM‑4.5?
Choose GLM‑4.5 Air when you need high volume and cost efficiency. Choose GLM‑4.5 or GLM‑4.5‑X when you need the strongest possible reasoning or the fastest performance on complex tasks.