Basalt Studio logo
Basalt Studio.Basalt Studio.
Back

LLM Tool Calling: How It Works and How To Implement It

Eliott Ardisson

Eliott Ardisson

Founder & CEO - Basalt Studio

Updated
tutorials

A practitioner's guide to LLM tool calling: how structured function calls work, how to implement them safely, and what SMBs should know before deploying AI agents.

ai agents
automation
programmatic

TL;DR

  • LLM tool calling lets AI models generate structured requests to external APIs, databases, and services — moving them from text generators to active participants in your business workflows.
  • The implementation process has four layers that all need to work: tool schema definition, execution layer, response handling, and orchestration for multi-step processes.
  • Security is not an afterthought. Giving an AI model programmatic access to your systems requires authentication, input validation, rate limiting, and audit logging from day one.
  • Start with two or three well-defined tools before expanding. Over-engineering the schema early is the most common mistake teams make.
  • Production readiness typically takes two to four weeks, not days. The concept is simple; the hardwork is in error handling, testing, and integration with existing systems.

What Tool Calling Actually Is

Large language models are, at their core, very good at reading and generating text. That’s genuinely useful — but it hits a hard ceiling the moment a user asks something like “What’s the status of my invoice?” or “Can you reschedule that appointment?” The model doesn’t know. It can’t look it up. It can only approximate.

Tool calling solves this. Instead of generating a natural language guess, the model generates a structured request — typically a JSON object — that your application can interpret and execute against a real system. The result comes back, the model incorporates it, and the user gets an accurate answer grounded in live data.

The term “function calling” gets used interchangeably with “tool calling” in most documentation. Historically, “function calling” referred specifically to OpenAI’s original JSON schema format for invoking external functions. “Tool calling” is the broader, more current term that covers the same mechanism plus capabilities like code execution, retrieval from vector stores, and web search. Most modern implementations use “tool calling” as the standard.

Think of it this way: without tool calling, an AI model is a well-read advisor who can only draw on what they already know. With tool calling, that advisor can pick up the phone, pull a file, or check a live system before answering.


How the Mechanism Works, Step by Step

The process follows a consistent sequence regardless of which model or framework you’re using.

1. Context preparation The system assembles the conversation history, the current user input, and the tool definitions. These definitions are JSON schemas that describe each available function: its name, what it does, what parameters it accepts, and which parameters are required. The model must see these definitions to know what’s available.

2. Intent analysis The model reads the user’s input and decides whether to answer from its own knowledge or invoke a tool. “What are your office hours?” probably doesn’t need a tool call. “What’s the outstanding balance on account 4821?” almost certainly does.

3. Function generation When a tool is needed, the model doesn’t generate natural language — it generates a structured tool call. This includes the function name and the parameter values extracted from the user’s input.

4. Tool execution Your application receives the tool call, validates the parameters, runs the actual function (an API request, a database query, a file read), and returns the result to the model.

5. Response synthesis The model reads the tool output and generates a response in plain language that incorporates the result, confirms the action, or explains what happened.

This loop can repeat. If the result of one tool call informs the next step, the model can issue another call before generating its final response. That’s where multi-step orchestration starts.


Two Modes of Tool Invocation

Automatic invocation gives the model discretion over when to use tools. The model decides, based on the user’s request, whether external data is needed. This works well in conversational interfaces where some questions can be answered from context and others require a live lookup. A legal intake bot might answer general questions about the firm’s practice areas without any tool calls, but invoke a calendar API the moment someone asks to book a consultation.

Forced invocation requires the model to use a specific tool on every request, regardless of what it thinks. This is the right approach for structured extraction workflows — document processing, form completion, lead qualification — where you always want output in a defined format and can’t afford the model deciding to wing it.

Most production systems use a combination: automatic invocation for conversational flows, forced invocation for specific pipeline steps.


Step 1: Define Your Tool Schema

Before writing a line of code, map out what your agent actually needs to do. This is less a technical exercise and more a business process audit.

Questions worth asking:

  • What repetitive tasks consume the most time per week for your team?
  • Which of those tasks require looking something up in a live system?
  • Which tasks involve writing to a system — creating a record, sending a message, updating a status?
  • Where do errors happen most in current manual workflows?

Once you have answers, translate them into function definitions. A recruitment firm might need tools for checking candidate pipeline status, logging interview notes, and sending calendar invitations. An accounting practice might need tools for pulling client ledger data, generating draft invoices, and flagging overdue accounts.

Keep the initial schema small. Two or three well-scoped tools, clearly described, will outperform ten vaguely defined ones. Models make better decisions when their options are limited and clearly differentiated.

Descriptions matter more than most teams expect. The model uses your description text to decide when to call a function. “Retrieve customer information” is worse than “Retrieve full customer profile including contact details, account status, and last three transactions. Call this when the user asks about a specific customer by name or account number.” That level of specificity reduces misuse.


Step 2: Build the Execution Layer

The execution layer is the bridge between a tool call the model generates and the actual system interaction that happens as a result. This is where most of the engineering work lives.

Parameter validation should run before any external call is made. Check that types match, that required fields are present, that values fall within acceptable ranges. A model might generate a valid-looking tool call with a malformed ID or an out-of-range date. Catch that before it hits your database.

Error handling needs to cover the full failure surface: APIs that are temporarily down, queries that return nothing, systems that return malformed responses. Your execution layer should return structured error information that the model can interpret and communicate to the user. “I wasn’t able to retrieve that record right now — the CRM returned an error. You can check directly at [link] or try again in a moment” is significantly more useful than a silent failure.

Security controls are non-negotiable. Tool calling gives an AI model programmatic access to your systems. Without proper controls, that’s a meaningful attack surface. Implement authentication for every external call. Use role-based access control so users can only trigger tools appropriate to their permissions. Sanitize all inputs before passing them to databases or APIs. Log every tool execution with timestamp, user context, and outcome. For businesses handling sensitive data — legal, financial, HR — audit logging isn’t optional.

Rate limiting prevents both accidental loops and deliberate abuse. Set limits per user session and monitor for unusual patterns.

In our work helping founder-led businesses deploy AI agents, the execution layer is where implementations most often break down in production. The model logic is usually fine. The integration code — error handling, edge cases, auth flows — is where teams underestimate the effort.


Step 3: Handle Responses Properly

Once a tool executes, the model needs the result in a form it can reason about. How you structure tool responses affects the quality of what comes back to the user.

A consistent response envelope helps:

  • A status field (success, error, partial) tells the model how to frame its response.
  • A data field contains the actual payload.
  • A human-readable message field gives the model something to reference when translating technical results into plain language.

When multiple tool calls happen in sequence, the model has to synthesize potentially complex outputs into a coherent response. This is where prompt design and response formatting matter. Test what the model actually says when tools return edge cases — an empty result set, a partial match, a permission error. These scenarios will happen in production and users will notice if the model handles them poorly.


Step 4: Orchestrate Multi-Step Workflows

Single tool calls are useful. Multi-step workflows are where tool calling earns its keep in business processes.

Sequential chaining runs tools in order, where each step’s output informs the next. A customer service flow might: verify the customer’s identity, retrieve their account status, check any open cases, then generate a response that addresses their specific situation.

Conditional branching lets the model take different paths based on intermediate results. A recruitment agent that checks a candidate’s pipeline stage might route to a different set of actions depending on whether the candidate is at screening, interview, or offer stage.

Parallel execution runs independent tool calls simultaneously, cutting response time. Fetching a client’s contact record and their billing history at the same time, when neither depends on the other, halves the wait.

As workflows get more complex, you need infrastructure to match: proper state management for multi-turn processes, retry logic for transient failures, and observability tooling so you can trace what happened when something goes wrong. Tools like n8n are well-suited for building these orchestration layers without requiring a full custom backend, particularly for SMB contexts where development resources are limited.


Common Mistakes Worth Avoiding

Starting with too many tools. Teams sometimes define a dozen functions before testing any of them. The model gets confused, selects the wrong tool, or calls tools unnecessarily. Start with two or three and add more based on actual gaps you observe.

Skipping error handling until production. External APIs fail. Databases timeout. Data comes back in unexpected shapes. If your execution layer doesn’t handle these gracefully, the model will either produce nonsense or fall silent. Build error handling into your first version, not your third.

Underspecifying tool descriptions. A vague description leads to incorrect tool selection. Write descriptions the way you’d write documentation for a junior developer — be explicit about when to use this function, what the parameters mean, and what the output represents.

No observability. When a tool calling workflow fails, you need to know whether the issue was in the model’s decision-making, the execution layer, or the response handling. Logging each step separately is the only way to diagnose this quickly.

Ignoring permission boundaries. Not every user should be able to trigger every tool. A customer-facing chatbot probably shouldn’t be able to delete records or access other customers’ data. Define permission tiers early and enforce them in the execution layer, not just in the prompt.


Security and Compliance in Practice

Tool calling introduces a specific class of risk that standard web application security doesn’t fully address: an AI model making decisions about what actions to take on behalf of users.

The key controls:

  • Authentication for every downstream API call. Don’t pass raw user input as authentication credentials.
  • Input sanitization to prevent injection attacks through tool parameters — SQL injection is still possible if you’re constructing queries from model-generated strings.
  • Approval workflows for high-stakes or irreversible actions. Sending an email, creating a contract, or deleting a record may warrant a confirmation step before execution.
  • Audit logs that capture who triggered what, when, with what parameters, and what the result was. This matters for internal accountability and, in regulated industries, for compliance.
  • Circuit breakers that prevent runaway tool calls if something in the orchestration logic produces a loop.

For SMBs in legal, financial services, or HR — sectors where Basalt’s clients tend to operate — these controls are the baseline, not advanced features.


What to Expect from Implementation

A realistic timeline for a production tool calling system:

  • Week 1: Tool schema definition, basic execution layer, connection to one or two external systems.
  • Week 2: Error handling, security controls, initial testing with real scenarios.
  • Weeks 3–4: Edge case testing, monitoring setup, user acceptance testing, staged rollout.

What changes once it’s running: routine data lookups move out of your team’s queue. Customer-facing response times improve for anything that required a manual check. Data entry errors decrease when agents are writing to systems based on structured inputs rather than human transcription.

The efficiency gains are real, but they’re incremental and workflow-specific. Teams that define clear success metrics before implementation — ticket deflection rate, time-to-response on a specific query type, error rate in a data entry process — get better results than teams that deploy broadly and hope for improvement.


Frequently Asked Questions

What’s the difference between function calling and tool calling? Function calling was OpenAI’s original term for their JSON schema format for invoking external functions. Tool calling is the broader term now used across providers — Anthropic, Google, and others — and encompasses additional capabilities like code execution and retrieval. They refer to the same core mechanism.

How reliable is tool calling in production? Well-implemented systems perform reliably for clearly defined functions. The biggest source of failures is typically inadequate error handling in the execution layer, not model errors. Thorough testing against edge cases and proper observability are what separate production-ready implementations from prototypes.

When does tool calling make more sense than traditional automation? Traditional workflow automation — rule-based triggers and actions — works well when the logic is simple and predictable. Tool calling adds value when you need natural language interfaces, dynamic decision-making about which action to take, or reasoning across multiple steps with variable inputs. The two approaches complement each other in most mature implementations.

What security risks should I be aware of? The main risks are unauthorized access to business systems, injection attacks through tool parameters, and unintended actions triggered by model errors or adversarial inputs. Address these through proper authentication, input validation, rate limiting, permission controls, and audit logging. Never give a tool calling system access to irreversible actions without a human approval step.


Where to Go From Here

LLM tool calling is the technical foundation of any serious AI agent deployment. The concept isn’t complicated, but getting it right in production — with proper error handling, security controls, and observability — requires deliberate engineering.

If you’re evaluating where tool calling fits in your business processes, the right starting point is a workflow audit: identify the two or three highest-value repetitive tasks that involve a live data lookup or a system action, and scope the tool schema from there. That’s a more tractable problem than trying to automate everything at once.

If you’d like to talk through what this looks like for your specific context, book an AI strategy call with the Basalt team — we work with founder-led businesses across professional services, real estate, and trades to scope and implement agent systems that hold up in production.