Everything an AI agent can do with the Cohere API.

A reference guide for building AI agents: every method, how to authenticate, and the permissions each one needs.

Endpoints20
API versionv2
Last updated23 June 2026
Orientation

How the Cohere API works.

The Cohere API is how an app or AI agent works with Cohere's models: generating a chat reply, turning text or images into embeddings, reranking a list of documents against a query, or classifying text into labels. Access is granted through a single API key sent as a bearer token, and that key reaches every method the account is entitled to, since Cohere does not offer per-method permissions. Each key is either a Trial key for evaluation or a Production key for paid use, which sets the rate limits and the models it can reach rather than which methods it can call.

20Endpoints
8Capability groups
11Read
9Write
0Permissions
Authentication
Cohere authenticates every call with an API key sent as a bearer token in the Authorization header. There is no OAuth flow for first-party calls and no per-method permission system, so the key reaches every endpoint the account is entitled to. A key can be confirmed with the check-api-key method, which returns whether it is valid and the organization it belongs to.
Permissions
Cohere does not scope a key to specific endpoints. The only distinction is the key type: a Trial key for free evaluation, capped at 1,000 calls a month with lower per-minute limits, or a Production key for paid use with higher limits. Some of the newest models additionally require a Production key or sales approval. There is no Read-only or write-only key.
Versioning
Cohere runs two numbered API versions at once. The current v2 covers Chat, Embed, Rerank and Classify, where model is a required parameter and chat history is a single messages array. The older v1 still serves the rest of the API, like Tokenize, Datasets, Embed Jobs and Fine-tuning. Models carry their own dated names, like command-a-03-2025 and embed-v4.0.
Data model
Most calls are stateless generation: text in, generated output back, with nothing stored. The exceptions are stored resources, datasets uploaded for training, embed jobs that run over a dataset, and fine-tuned models, each of which can be created, listed, retrieved and deleted. Long-running work like an embed job or a fine-tuning run is tracked by polling its status, since Cohere does not push events.
Connect & authenticate

Connection & authentication methods.

How an app or AI agent connects to Cohere determines what it can reach. There is one route for calling the models, and the key behind it carries the same access across every method.

Ways to connect

REST API

The REST API takes JSON request bodies and returns JSON, at https://api.cohere.com. The current v2 path serves Chat, Embed, Rerank and Classify; the v1 path serves Tokenize, Detokenize, Datasets, Embed Jobs and Fine-tuning. Every call authenticates with an API key sent as a bearer token in the Authorization header. Official SDKs exist for Python, TypeScript, Java and Go.

Best forConnecting an app or AI agent to Cohere's models.
Governed byThe API key, which reaches every method the account is entitled to.
Docs ↗
Authentication

API key (bearer)

Cohere authenticates every first-party call with an API key sent as a bearer token in the Authorization header. The key is not scoped to specific endpoints, so it reaches every method the account is entitled to. A key can be confirmed with the check-api-key method, which returns whether it is valid and the organization and owner it belongs to.

TokenBearer API key
Best forServer-side calls to Cohere's models.
Docs ↗

Key type (Trial or Production)

Every key is one of two types. A Trial key is free for evaluation, capped at 1,000 calls a month with lower per-minute limits, and cannot reach the newest models. A Production key is paid, with much higher per-minute limits and the full model lineup. The type sets rate limits and model access, not which methods can be called.

TokenTrial or Production API key
Best forChoosing between free evaluation and paid production use.
Docs ↗
Capability map

What an AI agent can do in Cohere.

The Cohere API is split into areas an agent can act on, like generating chat replies, turning text into embeddings, reranking search results, and classifying text. Some methods only return generated output, while others create and delete stored resources like datasets and fine-tuned models.

Endpoint reference

Every Cohere API method.

Filter by method, access, or permission, or search any path. Select a row for version detail, rate limits, the related webhook event, and the source.

MethodEndpointWhat it doesAccessPermissionVersion

Generate (Chat & Classify)

Methods that generate text replies or classify text into labels.2

No per-method permission; any valid key can call it. Trial keys cannot use the newest models.

Acts onchat
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitTrial: 20 req/min. Production: 500 req/min (per model).

No per-method permission; any valid key can call it.

Acts onclassify
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitTrial and Production: 500 req/min (default tier).

Represent (Embed & Rerank)

Methods that turn text or images into embeddings, or reorder documents by relevance.2

No per-method permission; any valid key can call it. Generation only, nothing is stored.

Acts onembed
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitText: 2,000 inputs/min. Images: Trial 5/min, Production 400/min.

No per-method permission; any valid key can call it. Each document is truncated at max_tokens_per_doc (default 4,096).

Acts onrerank
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitTrial: 10 req/min. Production: 1,000 req/min.

Tokens

Methods that convert text to tokens and back.2

Text must be 1 to 65,536 characters. Returns token data only.

Acts ontoken
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitTrial: 100 req/min. Production: 2,000 req/min.

Returns text only; nothing is stored.

Acts ontoken
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Models

Methods that list and retrieve the models available to the account.2

Read-only; paginated with a next_page_token.

Acts onmodel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only.

Acts onmodel
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Datasets

Methods for uploading, listing, retrieving and deleting datasets used for training and embed jobs.4

Creates a stored dataset on the account.

Acts ondataset
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only.

Acts ondataset
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only.

Acts ondataset
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Irreversible; removes the stored dataset.

Acts ondataset
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Embed jobs

Methods for running and tracking batch embedding jobs over a dataset.4

Starts a long-running job; track it by polling its status.

Acts onembed-job
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitTrial: 5 req/min. Production: 50 req/min.

Read-only.

Acts onembed-job
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only; poll this to learn when a job completes.

Acts onembed-job
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Stops a running job.

Acts onembed-job
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Fine-tuning

Methods for creating, listing, retrieving and deleting fine-tuned models.3

Requires name and settings (base_model and dataset_id); starts a training run.

Acts onfinetuned-model
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only.

Acts onfinetuned-model
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Read-only.

Acts onfinetuned-model
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply

Account

Methods for checking the API key.1

Returns whether the key is valid and the organization and owner it belongs to.

Acts onapi-key
Permission (capability)None required
VersionAvailable since the API’s base version
Webhook eventNone
Rate limitStandard limits apply
No endpoints match those filters.
Webhooks

Webhook events.

Cohere does not push events. An app or AI agent learns the outcome of a long-running job, like an embed job or a fine-tuning run, by polling its status method.

EventWhat it signalsTriggered by
No events match that search.
Rate limits & pagination

Rate limits, pagination & request size.

Cohere limits how fast an app can call each endpoint, with separate ceilings for trial keys and production keys, and a monthly cap on trial-key calls.

Request rate

Cohere meters requests per minute, per endpoint, with separate ceilings for Trial keys and Production keys. The free Trial key is also capped at 1,000 API calls a month across all endpoints. The per-minute limits differ widely by endpoint and key type, so they are listed on each method's row rather than summarized here, for example Chat at 20 requests per minute on a Trial key and 500 on a Production key, and Rerank at 10 versus 1,000. Going over returns HTTP 429, and the fix is to slow down or move to a Production key.

Pagination

List methods that can return many items, like List Embed Jobs, List Datasets, List Models and List Fine-tuned Models, page with a page_size parameter and return a next_page_token to fetch the following page. Generation methods like Chat, Embed and Rerank return their full result in one response and do not paginate.

Request size

Per-call input limits apply by method. Embed takes up to 96 texts per call. Tokenize accepts text of 1 to 65,536 characters. Rerank truncates each document at max_tokens_per_doc, which defaults to 4,096 tokens. Beyond these, the model's context length sets how much text a single Chat or Embed call can process.

Errors

Status codes & error handling.

The status codes an agent should handle, and what to do about each.

StatusCodeMeaningWhat to do
400Bad RequestThe request body is not valid, for example a required field is missing or a value is out of range.Check the request against the API spec, fix the fields or values, and resend.
401UnauthorizedThe API key is missing, invalid, or has expired.Send a valid API key in the Authorization header, and rotate the key if it was compromised.
402Payment RequiredThe account has reached a billing or spending limit.Add or update the payment method in the dashboard to continue.
404Not FoundThe requested resource does not exist, for example a wrong or deleted model, dataset, or job ID.Verify the resource identifier and confirm it belongs to this account.
429Too Many RequestsThe rate limit for the endpoint and key type was exceeded.Back off and retry, smooth the request rate, or move from a Trial key to a Production key.
499Request CancelledThe client cancelled the request before it completed.Retry the request when ready.
500Internal Server ErrorAn unexpected error occurred on Cohere's side.Retry with backoff, and contact Cohere support with the request details if it persists.
Versioning & freshness

Version history.

Cohere runs two numbered API versions side by side, an older v1 and a current v2, and ships dated model and API changes through its release notes.

Version history

What changed, and when

Latest versionv2
v2Current version
Current major version (Chat, Embed, Rerank, Classify on v2)

The v2 API reworked the four model endpoints. Chat combines message and chat history into a single messages array and uses JSON-schema tool definitions; Embed and Rerank made model a required parameter, and Embed added a required embedding_types parameter; streaming moved to server-sent events and citation handling was consolidated. The rest of the API stays on v1.

What changed
  • Chat v2: messages array replaces separate message and chat_history.
  • Embed v2 and Rerank v2: model is now a required parameter.
  • Embed v2: embedding_types is now a required parameter.
  • Streaming switched to server-sent events; citations consolidated under citation_options.
  • v1-only features removed from v2: connectors, server-side conversation management, prompt truncation.
2025-09-15Requires migration
Major Command model deprecations

Older Command models were deprecated in favour of the current lineup.

What changed
  • Deprecated: command-r-03-2024, command-r-plus-04-2024, command-light, command, and summarize.
  • Recommended replacements: command-r-08-2024, command-r-plus-08-2024, or command-a-03-2025.
2025-04-15Feature update
Embed Multimodal v4

Embed v4.0 embeds interleaved text and images in the same vector space, so screenshots of PDFs, slides, figures and tables can be indexed alongside text.

What changed
  • embed-v4.0 released, with multimodal (text and image) embeddings in one model.
  • Available on the Cohere platform and major cloud providers.
2025-03
Command A

command-a-03-2025 is Cohere's most performant Command model, built for enterprise tool use, retrieval-augmented generation, agents and multilingual tasks.

What changed
  • command-a-03-2025 released as the flagship Chat model.
2024-12-02Requires migration
Rerank-v3.5 and the v2 Rerank API

Rerank-v3.5 shipped alongside the v2 Rerank API, with a 4,096-token context length and stronger multilingual retrieval. The Rerank-v2.0 model family was deprecated.

What changed
  • rerank-v3.5 released; v2 Rerank API introduced.
  • model made a required parameter; max_chunks_per_doc replaced by max_tokens_per_doc.
  • Rerank-v2.0 model family deprecated.

Call the v2 endpoints for chat, embed, rerank and classify; the rest of the API stays on v1.

Cohere release notes ↗
Questions

Cohere API, answered.

What is the difference between a Trial key and a Production key?+
A Trial key is free and meant for evaluation. It is capped at 1,000 API calls a month and has lower per-minute rate limits, and the newest models are not available on it. A Production key is paid, has much higher per-minute limits, and unlocks the full model lineup. Both keys authenticate the same way and call the same endpoints; the difference is rate limits and model access, not which methods are allowed.
Does Cohere support OAuth or per-method permissions?+
No. First-party calls authenticate with a single API key sent as a bearer token, and that key reaches every endpoint the account is entitled to. Cohere does not offer a Read-only key, a write-only key, or a way to scope a key to specific methods. To limit what an agent can do, the access has to be controlled in front of the key, which is what Bollard does.
What is the difference between the v1 and v2 APIs?+
Cohere runs both at once. The v2 endpoints cover Chat, Embed, Rerank and Classify, where model is a required parameter, chat history is a single messages array, and streaming uses server-sent events. The v1 endpoints still serve the rest of the API, like Tokenize, Detokenize, Datasets, Embed Jobs and Fine-tuning. New integrations should call v2 for the four model APIs and v1 for everything else.
How does an agent track a long-running job?+
Cohere does not push events or send webhooks. An embed job or a fine-tuning run returns an ID when it is created, and the agent polls the matching get method to read its status until it completes or fails. There is no callback or event stream for job completion.
How are rate limits enforced?+
Limits are per minute, per endpoint, and differ by key type. For example Chat allows 20 requests per minute on a Trial key and 500 on a Production key, while Rerank allows 10 versus 1,000. Trial keys also have a hard cap of 1,000 calls a month across all endpoints. Exceeding a limit returns HTTP 429; the response should be retried with backoff or the key upgraded to Production.
What does the v2 Chat endpoint do?+
It generates a text response to a conversation. The request carries a model and a messages array holding the conversation so far, and the response is the assistant's reply plus usage metrics and, when documents are supplied, citations. It also supports tool use, structured outputs, and streaming the reply token by token over server-sent events.
Related

More ai API guides for agents

What is Bollard AI?

Control what every AI agent can do in Cohere.

Bollard AI sits between a team's AI agents and Cohere. Grant each agent exactly the access it needs, method by method, and every call is checked and logged.

  • Allow only the methods an agent needs, never a shared Cohere key.
  • Denied by default, so an agent reaches only what has been explicitly allowed.
  • Every call recorded in plain English: who, what, where, and the decision.
Cohere
Search Agent
Generate chat replies ActionOffReadFull use
Create embeddings ActionOffReadFull use
Delete fine-tuned models ActionOffReadFull use
Per-agent access, set in Bollard AI, not in Cohere