LLM APIs, SDKs, and Model Selection
Learning Objectives
- You know the main concepts involved in calling an LLM API from a program.
- You understand common model-selection trade-offs such as quality, latency, cost, and context size.
- You can read and write a simple raw fetch call for a Responses-style API.
LLM APIs as ordinary HTTP APIs
An LLM API is still an HTTP API. A program sends a request, waits for a response, parses JSON, and then decides what to do next.
In this course, we use raw fetch rather than a provider SDK. This keeps the surrounding logic visible and makes the examples easier to adapt across providers.
Many current APIs, including OpenAI’s Responses API, accept a role-based conversation as ordinary JSON input. A typical request body therefore contains a model name, an input field that carries the conversation items, and sometimes extra control parameters, such as a parameter that controls the reasoning effort or one that controls the temperature.
Older examples and provider-compatible services may still use a chat-completions style shape with messages on the way in and choices[0].message.content on the way out. That older shape is worth recognizing, but from this point onward the raw API examples in this course use the Responses API style.
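For recognition purposes, the two shapes can be compared side by side as plain objects. The response bodies below are abbreviated illustrations of the documented field layout, not real API output:

```javascript
// Chat-completions style: "messages" in, "choices[0].message.content" out.
const chatCompletionsRequest = {
  model: "some-model",
  messages: [{ role: "user", content: "Hello!" }],
};
const chatCompletionsResponse = {
  choices: [{ message: { role: "assistant", content: "Hi!" } }],
};
const chatText = chatCompletionsResponse.choices[0].message.content;

// Responses style: "input" in, an "output" array of items out.
const responsesRequest = {
  model: "some-model",
  input: [{ role: "user", content: "Hello!" }],
};
const responsesResponse = {
  output: [
    { type: "message", content: [{ type: "output_text", text: "Hi!" }] },
  ],
};
const responsesText = responsesResponse.output[0].content[0].text;
```

In both cases the conversation itself is the same list of role-based messages; only the surrounding field names differ.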
This is worth emphasizing because it keeps the model call from looking magical. The application is still assembling a normal JSON request. The difference is only in what the remote service does with it.
A small raw fetch example
The following example uses the Responses API shape.
const apiUrl = Deno.env.get("LLM_API_URL");
const apiKey = Deno.env.get("LLM_API_KEY");
const model = Deno.env.get("LLM_MODEL");
const response = await fetch(apiUrl, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${apiKey}`,
  },
  body: JSON.stringify({
    model,
    input: [
      { role: "system", content: "You are a concise assistant." },
      { role: "user", content: "Explain what a command-line application is." },
    ],
  }),
});
const data = await response.json();
console.log(data);
Run this kind of program with network and environment access:
$ deno run --allow-net --allow-env app.js
The examples in this course use raw fetch on purpose. This keeps the request body, the response shape, and the validation logic visible. In a production codebase that has already standardized on one provider, an official SDK can be more ergonomic. Useful references include OpenAI’s JavaScript SDK and Anthropic’s SDK.
Running the above with a valid API key and model should print the model’s response to the console. The exact request shape depends on the provider, but the overall structure of sending a JSON request and parsing a JSON response is common across APIs.
For instance, with the Responses API, the response has an output property that contains an array of items. Not every output item is ordinary assistant text: some items may represent tool calls or other response parts. Assistant text is typically stored in a message item whose content array contains an object where the type is "output_text".
Concretely, the response parsing might look like this:
const data = await response.json();
const outputText = data.output?.find((item) => item.type === "message")
  ?.content?.find((part) => part.type === "output_text");
console.log(outputText?.text ?? "No message found in response.");
For OpenAI’s reasoning models, it is also common to add a reasoning block; here we use "low" for the reasoning effort:
const response = await fetch(apiUrl, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${apiKey}`,
  },
  body: JSON.stringify({
    model,
    reasoning: {
      effort: "low",
    },
    input: [
      { role: "system", content: "You are a concise assistant." },
      { role: "user", content: "Explain what a command-line application is." },
    ],
  }),
});

const data = await response.json();
const outputText = data.output?.find((item) => item.type === "message")
  ?.content?.find((part) => part.type === "output_text");
console.log(outputText?.text ?? "No message found in response.");
After setting the environment variables and running the program with the appropriate permissions, the output of the application should look something like this:
$ export LLM_API_URL=https://api.openai.com/v1/responses
$ export LLM_API_KEY=your-api-key
$ export LLM_MODEL=gpt-5-nano-2025-08-07
$ deno run --allow-net --allow-env app.js
A command-line application (CLI) is a program designed to be run from
a text-based interface—your terminal or console—rather than through
graphical windows and menus. You type a command, plus optional
options/flags and arguments, and the program prints text to the screen
or reads input from the keyboard or redirected streams.
Key points:
- Invocation: Open a terminal and type the command.
- Options/arguments: Flags like -h or --version modify behavior;
positional arguments provide data (e.g., a filename).
- Output: Text sent to standard output (stdout) or errors to
standard error (stderr); can be piped or redirected.
- Interaction: May be interactive (prompting for input) or
non-interactive (scriptable).
- Use cases: Automation, scripting, remote management, batch tasks.
Examples: ls, grep, curl, git, python, ssh.
In short, a CLI app is a text-driven tool designed for fast,
scriptable control from a terminal.
The important point is that the LLM API is still an API. It has a request shape, a response shape, and failure conditions. There may be different APIs for different models, but the high-level structure of sending a request and parsing a response remains the same.
The fact that it produces text does not change the fact that it is a remote service with a contract. The application still needs to assemble the request, send it, parse the response, and decide what to do next based on the result.
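Part of honoring that contract is checking the HTTP status before trusting the body. As a minimal sketch of that idea (a hypothetical helper, not part of any provider SDK; real error payloads vary by provider):

```javascript
// Check the HTTP status before parsing, and surface failures explicitly
// instead of silently parsing an error body as if it were a model reply.
async function readResponseOrThrow(response) {
  if (!response.ok) {
    // Read the error body as text; providers format errors differently.
    const body = await response.text();
    throw new Error(`LLM API request failed: ${response.status} ${body}`);
  }
  return await response.json();
}
```

With a helper like this, the earlier `const data = await response.json();` line becomes `const data = await readResponseOrThrow(response);`, and a failed request produces a clear error instead of a confusing parse result.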
In the examples in this course, we use a Responses-style API with explicit request and response shapes. The conversation still looks like a list of role-based messages, but the surrounding contract is easier to inspect directly.
Important API concepts
When working with LLM APIs, a few terms appear repeatedly. Messages define the interaction history or current prompt context. Tokens are the units used internally for input and output length. Context window describes how much input and conversation history the model can consider at once. Latency matters because the user waits for the response. Cost matters because longer prompts and longer responses typically increase API usage.
These are provider-specific details, but they also effectively shape the architecture of the surrounding application.
For example, if latency matters, a CLI tool may need shorter prompts or a smaller model. If cost matters, the application may need to trim message history instead of sending everything every time. If the context window is limited, the application may need summarization or retrieval instead of a long raw prompt. These are software design choices.
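The history-trimming idea can be sketched as a small pure function. Here a rough character budget stands in for real token counting, which is provider-specific; the function and its budget are illustrative assumptions, not a provider API:

```javascript
// Keep any system messages plus the most recent other messages that fit
// within a rough character budget (a stand-in for real token counting).
function trimHistory(messages, maxChars) {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  const kept = [];
  let used = 0;
  // Walk backwards so the most recent messages survive.
  for (let i = rest.length - 1; i >= 0; i--) {
    const length = rest[i].content.length;
    if (used + length > maxChars) break;
    kept.unshift(rest[i]);
    used += length;
  }
  return [...system, ...kept];
}
```

A real implementation would use the provider’s tokenizer or token counts from previous responses, but the design choice is the same: the application, not the API, decides what history is worth sending.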
Choosing a model
Model selection is an engineering decision. The “best” model depends on the task.
Questions to consider include whether the task is simple enough for a smaller, faster, cheaper model; whether the application needs a long context window; whether structured output reliability matters more than raw creativity; and whether low latency is important for the user experience.
For many CLI tools, a smaller model is enough if the task is narrow and the prompts are well structured. For more open-ended reasoning or more demanding output formats, a stronger model may be worth the cost.
Suppose that you are building a command-line classifier that reads a short text and assigns one label from a small fixed set. That task may work well with a smaller model if the prompt is explicit and the output is validated. By contrast, a multi-step assistant that has to reason over longer histories, follow tool schemas, and produce stable structured output may justify a stronger model. The important point is that model choice should follow application requirements, not prestige or generic benchmark ratings alone.
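The validation side of such a classifier can be sketched in a few lines. The label set and the normalization rule below are illustrative assumptions, not part of any API:

```javascript
// A fixed label set for a hypothetical sentiment classifier.
const LABELS = ["positive", "negative", "neutral"];

// Validate a raw model reply against the fixed label set.
// Returns the matching label, or null if the reply is not usable.
function parseLabel(rawReply) {
  const normalized = rawReply.trim().toLowerCase();
  return LABELS.includes(normalized) ? normalized : null;
}
```

Because the output space is this small and checked, a smaller model that occasionally wraps the label in extra text simply produces a null result that the application can retry or reject, which is part of why the task tolerates a cheaper model.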
Another useful comparison is between interactive and batch workflows. A command-line helper that a developer uses repeatedly during one task may need very low latency, because even short delays become irritating when they happen many times in a row. A nightly batch job that summarizes logs or drafts release notes can tolerate a slower model if that model produces better structured output. The surrounding workflow changes what counts as a good trade-off.
It is also worth distinguishing between “largest available model” and “most suitable model”. If an application has strong validation, narrow prompts, and a small output space, an expensive model may not improve the result enough to justify the extra cost.
Good software engineering means choosing the model that fits the task, not the model with the most impressive name or best score in generic benchmarks.
Read the API like any other dependency
The LLM API is only one part of the application. It should be read like any other external system: check what goes in, check what comes back, and avoid assuming more than the contract actually guarantees.
That mindset becomes especially important when we move from raw responses to structured outputs.
In other words, do not treat the API as if it were an intelligent coworker who will automatically “understand what you meant”. Treat it as a dependency with a contract. The request shape, the response shape, and the failure conditions all matter.
The programming exercise for this chapter keeps that contract visible. Its tests stub fetch, so you can concentrate on request construction, headers, and content validation without needing a real third-party API key.
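The fetch-stubbing technique itself looks roughly like this. This is a minimal sketch of the idea, not the exercise’s actual test code:

```javascript
// Replace the global fetch with a stub that records each request and
// returns a canned Responses-style body, so tests can inspect request
// construction without any network access.
function stubFetch(cannedBody) {
  const calls = [];
  globalThis.fetch = async (url, options) => {
    calls.push({ url, options });
    return {
      ok: true,
      json: async () => cannedBody,
    };
  };
  return calls;
}
```

A test can then run the application code, inspect `calls` to check the URL, headers, and JSON body that were assembled, and assert on how the application handled the canned response.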