How to Use ZBrain Monitor: Scope, Key Metrics, Configuration, and Best Practices
AI solutions are transforming enterprise operations, driving automation, elevating
customer experiences, and enabling smarter decision-making across industries. As these
agentic AI systems become more autonomous and interconnected, real-time monitoring
is no longer optional; it’s essential for maintaining quality, reliability, and business value.
Recent research highlights the urgency: a May 2024 study, “Visibility into AI Agents,”
warns that a lack of transparency into agent activities and decision-making poses serious
operational and governance risks. According to PwC’s 2024 AI Business Survey, only
11% of executives report having fully implemented fundamental responsible AI
capabilities. Monitoring and auditing are highlighted as one of the most crucial categories
for Responsible AI, yet more than 80% of organizations indicate they are still progressing
toward full implementation.
Security surveys indicate that the risk is escalating. SailPoint’s research found that while
98% of companies plan to expand the use of AI agents, 96% view them as a growing
security threat, and 54% specifically report risks related to the information accessed by AI
agents. These gaps in observability and control are red flags signaling the need for real-
time, agent-level monitoring.
ZBrain’s Monitor feature meets this need, enabling continuous, automated oversight of AI
agents and applications. By combining advanced evaluation metrics with actionable
insights, it empowers organizations to detect issues early and optimize AI performance at
scale. This article provides an in-depth overview of AI agent and application monitoring,
with step-by-step guidance on implementing effective oversight using ZBrain's Monitor
feature.
AI agent and application monitoring: An overview
Building AI agents is an exciting challenge, but simply deploying them is not enough to
ensure consistent and reliable results in real-world settings. Once in production, AI agents
and applications become part of dynamic, often unpredictable business environments. To
maintain performance, prevent failures, and continuously improve outcomes,
organizations need robust monitoring, evaluation and observability practices.
AI agent and application monitoring is the ongoing process of observing and evaluating
the real-world behavior, performance, and impact of autonomous systems within an
enterprise. It involves the systematic tracking and analysis of an agent's inputs, outputs,
resource usage (such as tokens and credits), response quality, and operational health. It
monitors each agent’s outcomes, success and failure trends, and cost metrics, providing
end-to-end visibility into the behavior and performance of autonomous systems in
production. This goes well beyond checking if the service is “up or down.” It means
capturing and analyzing signals such as logs and metrics that reflect how agents interact
with data.
A new frontier in this field is Large Language Model (LLM) observability. Unlike traditional
software, LLM-powered agents are probabilistic, context-dependent, and sometimes
unpredictable. LLM observability empowers teams to:
Track which prompts, contexts, and actions led to specific responses
Debug and trace unexpected behaviors
Bootstrap robust test sets from real user data
Compare key metrics, like accuracy, latency, and cost, across different model
versions in production.
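To make this concrete, below is a minimal sketch of the kind of structured record such observability typically captures for each prompt/response exchange. The field names and the cost formula are illustrative assumptions, not ZBrain's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LLMCallRecord:
    """One observability record for a single prompt/response exchange (illustrative schema)."""
    prompt: str
    context: str
    response: str
    model_version: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def cost_estimate(self, usd_per_1k_tokens: float) -> float:
        # Rough estimate; real pricing usually differs for input vs. output tokens.
        return (self.input_tokens + self.output_tokens) / 1000 * usd_per_1k_tokens

# Example: compare latency and cost across two model versions from collected records.
records = [
    LLMCallRecord("What is our refund policy?", "policy doc", "Refunds within 30 days.", "v1", 420, 180, 45),
    LLMCallRecord("What is our refund policy?", "policy doc", "Refunds within 30 days.", "v2", 310, 180, 38),
]
for r in records:
    print(r.model_version, r.latency_ms, round(r.cost_estimate(0.002), 5))
```

Collecting records like these per query is what makes the comparisons above (accuracy, latency, and cost across model versions) possible in the first place.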
Comprehensive monitoring provides real-time visibility into the operational status and
performance of AI agents and applications. Effective monitoring enables organizations to:
Confirm AI solution health: Verify that agents and applications are live,
responsive, and able to produce valid outputs through automated health checks.
Evaluate response quality: Continuously assess outputs using metrics such as
response relevancy, faithfulness to context, exact match to reference answers, and
other configurable evaluation criteria.
Track success and failure trends: Monitor patterns of successful and failed
responses to quickly identify recurring issues and breakdowns in agent and app
workflows.
Drive operational insights: Analyze key operational metrics, including latency,
token usage, and resource costs, to inform ongoing optimization.
Support reliability: Detect abnormal behaviors, errors, or unintended entity actions
that could indicate failures.
Enable troubleshooting: Maintain detailed logs and query-level monitoring to
support rapid root-cause analysis and continuous improvement cycles.
Simply put, AI agent and application monitoring transforms opaque, self-learning systems
into accountable business assets, empowering enterprises to maximize value while
minimizing risk.
Streamline your operational workflows with AI agents designed to address enterprise
challenges.
Explore Our AI Agents
Key challenges addressed by AI monitoring
As AI agents and applications take on increasingly critical roles in enterprise operations,
robust, real-time monitoring has become essential for addressing a range of operational
and strategic challenges.
Detecting silent failures
AI apps and agents may seem to perform as expected, yet silently deliver low-quality or
inconsistent results. Monitoring enables organizations to catch responses that appear
valid but are actually incorrect, irrelevant, or low quality. With query-level visibility and
configurable evaluation metrics, silent failures are identified early, preserving business
accuracy and user trust.
Granular quality assessment
Monitoring allows every response to be automatically assessed against custom criteria for
relevance, faithfulness, and accuracy. This ensures that AI agents and applications
consistently meet enterprise benchmarks for quality.
Framework diversity and integration overhead
AI solutions are built using a wide range of frameworks, each with distinct abstractions,
control patterns, and internal states. Achieving unified visibility often requires manual
instrumentation (adding trace metadata, wrapping agent logic, custom logging), which is
tedious, error-prone, and can still miss key behaviors like retries or fallbacks.
Fragmented view of agent workflows
In complex environments, agents interact with multiple tools, APIs, or models. Without
centralized monitoring, understanding the complete flow of data and decisions across
these components is challenging, making root-cause analysis a time-consuming process.
Visualization gaps in traditional tools
Most observability dashboards are designed for linear, synchronous applications, not for
the nonlinear, parallel, and branching logic typical of agentic systems. This makes it
difficult to reconstruct execution paths, trace agent decision-making, or identify where and
why failures occurred.
Enabling timely issue detection
With real-time alerts and thresholds, monitoring quickly surfaces performance
bottlenecks, error spikes, or sudden drops in accuracy. This supports rapid diagnosis and
minimizes the impact on users and operations.
Vendor lock-in
Relying on monitoring tools that store logs and metrics in proprietary formats can trap
organizations with a single vendor. If teams later need to switch providers or integrate
with other platforms, migrating historical monitoring data becomes costly and technically
challenging, directly limiting flexibility and slowing processes.
Integration complexity and technical debt
Monitoring AI agents and apps built on diverse frameworks often demands custom
connectors and manual scripts for each environment. As systems evolve, maintaining
these integrations increases support costs, slows development, and compounds technical
debt, raising the risk of operational failures.
By addressing these challenges, comprehensive monitoring enables AI agents and
applications to transform from opaque, unpredictable systems into transparent,
manageable, and reliable enterprise assets.
Metrics used for monitoring AI agents and applications
Robust monitoring of AI agents and applications relies on a diverse set of metrics to
evaluate response quality, operational health, and alignment with business objectives.
The ZBrain Monitor module supports a rich selection of evaluation metrics that fall into
three primary categories:
LLM-based metrics
Non-LLM-based metrics
LLM-as-a-judge metrics
LLM-based metrics
LLM-based metrics evaluate the quality and relevance of agent and app responses using
large language models. The LLM-based metrics are specific to the Ragas library, which
uses internal prompt templates and language models to perform automated evaluation.
These are especially valuable when output quality cannot be reliably measured by simple
string matching or basic logic. These metrics capture the nuanced, semantic qualities of
language and are particularly valuable for open-ended, context-rich outputs. Common
LLM-based metrics include:
Response relevancy: This metric assesses how well the AI solution’s response
addresses the user’s query. A higher score indicates a more relevant and
contextually appropriate answer, supporting better user experiences and business
outcomes.
Faithfulness: Evaluates whether the generated response accurately reflects the
provided context, data, or source information. High faithfulness scores indicate that
the output minimizes hallucinations or unsupported statements, which is crucial for
ensuring trustworthy AI in enterprise settings.
Example:
For a customer support agent, LLM-based metrics help ensure that responses not only
sound reasonable but are factually accurate and tailored to the actual customer question.
LLM-based metrics help overcome the limitations of simple text matching by leveraging
semantic understanding, making them more aligned with human judgment.
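Since these metrics are backed by the Ragas library, the sketch below shows how such scores are typically computed. It is illustrative only: exact imports, metric names, and dataset column names vary across Ragas versions, and the underlying judge model requires an LLM provider key.

```python
# Illustrative only: classic Ragas evaluate() API; requires an LLM key (e.g., OPENAI_API_KEY).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness  # "answer_relevancy" ~ response relevancy

data = {
    "question": ["What is the parental leave policy?"],
    "answer": ["Employees are entitled to 12 weeks of paid parental leave."],
    "contexts": [["Full-time employees receive 12 weeks of paid parental leave."]],
}

scores = evaluate(Dataset.from_dict(data), metrics=[answer_relevancy, faithfulness])
print(scores)  # e.g. {'answer_relevancy': 0.97, 'faithfulness': 1.0}
```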
Non-LLM-based metrics
Non-LLM-based metrics rely on deterministic, algorithmic comparisons, making them
fast, objective, and well suited to structured outputs and scenarios where the expected
result is known or tightly controlled. Common non-LLM-based metrics include:
Health check: Verifies if the agent or application is operational and capable of
producing a valid response. If the health check fails, no further metric evaluations
are performed for that execution, helping teams quickly pinpoint systemic issues
and ensuring focus on root-cause diagnosis.
Exact match: Compares the agent's or app's response to an expected result for
an exact, character-by-character match. This metric is particularly valuable for tasks
that require deterministic outputs, such as code generation, database lookups, or
standardized answers.
F1 score: Balances precision and recall, measuring how much of the response is
relevant (precision) and how much of the expected content it covers (recall).
Widely used for classification and extraction tasks.
Levenshtein similarity: Calculates the minimal number of single-character edits
needed to change one string into another. This provides a measure of similarity
between the generated and reference responses, which is useful for tracking the
closeness of the output to the desired answer.
ROUGE-L score: Evaluates the similarity between generated and reference
responses by identifying the longest common subsequence of words. ROUGE-L is
commonly used in natural language generation tasks, such as summarization.
Example:
For document automation or data extraction agents, these metrics ensure the outputs
match ground truth records with high accuracy and reliability. These metrics are fast,
objective, and especially valuable for rule-based validation and deterministic use cases.
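For illustration, standard reference implementations of these deterministic metrics are sketched below; ZBrain's own scoring may differ in normalization details such as tokenization and casing.

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 only when the strings match character for character (after trimming whitespace)."""
    return float(pred.strip() == ref.strip())

def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens, ref_tokens = pred.lower().split(), ref.lower().split()
    common = sum(min(pred_tokens.count(t), ref_tokens.count(t)) for t in set(pred_tokens))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_tokens), common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def levenshtein_similarity(pred: str, ref: str) -> float:
    """Similarity in [0, 1] derived from the minimum number of single-character edits."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))                      # edit distance from "" to ref[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution or match
            prev = cur
    return 1.0 - dp[n] / max(m, n, 1)

def rouge_l(pred: str, ref: str) -> float:
    """F-measure over the longest common subsequence (LCS) of tokens."""
    a, b = pred.lower().split(), ref.lower().split()
    if not a or not b:
        return 0.0
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if ta == tb else max(table[i - 1][j], table[i][j - 1])
    lcs = table[-1][-1]
    precision, recall = lcs / len(a), lcs / len(b)
    return 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)

print(exact_match("42", "42"))                                   # 1.0
print(round(token_f1("the cat sat", "the cat sat down"), 2))     # 0.86
print(round(levenshtein_similarity("color", "colour"), 2))       # 0.83
print(round(rouge_l("refunds within 30 days", "refunds are issued within 30 days"), 2))  # 0.8
```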
LLM-as-a-judge metrics
LLM-as-a-judge metrics simulate human-like evaluation by using an LLM to judge
qualitative aspects of a response. By “judging” aspects of a response in natural language,
these metrics help organizations quantify subjective qualities that are important for user
engagement and satisfaction. This approach provides a scalable, consistent, and cost-
effective alternative to human review, enabling continuous quality monitoring at
production scale:
Creativity: Rates the originality and imagination in the response while addressing
the prompt, which is valuable for generative content or brainstorming agents.
Helpfulness: Measures how well the response guides or supports the user in
resolving their question or problem, ensuring the AI solution provides actionable and
valuable information.
Clarity: Assesses how clearly the message is communicated, reflecting whether the
response is easily understandable and effectively delivers its intended meaning.
Example:
For executive summary generators or creative writing assistants, these metrics help
teams ensure that AI outputs are not only accurate but also engaging and easy to
understand.
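A minimal sketch of how an LLM-as-a-judge score can be collected is shown below. Here, `call_llm` is a placeholder for whichever chat-completion client you use, and the 1-5 rubric is an assumption rather than ZBrain's exact judging prompt.

```python
# `call_llm` is a stand-in: any function that takes a prompt string and returns the model's reply.
JUDGE_PROMPT = """You are a strict evaluator. Rate the RESPONSE to the QUESTION on a 1-5 scale
for the criterion "{criterion}". Reply with only the number.

QUESTION: {question}
RESPONSE: {response}"""

def judge_score(call_llm, question: str, response: str, criterion: str) -> int:
    """Ask an LLM to rate one qualitative criterion (e.g., creativity, helpfulness, clarity)."""
    raw = call_llm(JUDGE_PROMPT.format(criterion=criterion, question=question, response=response))
    return int(raw.strip())

# Example: collect all three judge criteria for a single exchange.
# scores = {c: judge_score(call_llm, question, answer, c)
#           for c in ("creativity", "helpfulness", "clarity")}
```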
Why use multiple metrics?
Combine multiple metrics to define composite and reliable evaluation rules. By leveraging
a mix of LLM-based, non-LLM-based, and LLM-as-a-judge metrics, organizations gain a
comprehensive and nuanced view of operational health, output quality, and user
experience. This enables continuous optimization, facilitates faster troubleshooting, and
enhances confidence in AI-powered workflows.
LLM-based metrics bring semantic intelligence, non-LLM-based metrics ensure objective
comparability, and LLM-as-a-judge delivers scalable, human-like evaluation. By
combining these, the Monitor module provides end-to-end visibility, allowing enterprises
to:
Detect subtle, context-specific errors invisible to simple algorithms
Quantify output quality across diverse use cases and formats
Benchmark model performance over time and across versions
Automate routine evaluation while reserving human attention for high-impact issues
With configurable evaluation conditions and flexible metric selection, modern monitoring
practices empower enterprises to maintain the highest standards of accuracy, reliability,
and user satisfaction across all ZBrain AI agents and applications.
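As a simple illustration of such composite rules, the sketch below folds one score from each metric family into a single pass/fail verdict. The thresholds and scoring scales are hypothetical, not ZBrain's internal logic.

```python
def composite_verdict(relevancy: float, exact: float, clarity: float) -> bool:
    """Hypothetical composite rule combining one metric from each family.

    relevancy: LLM-based score in [0, 1]
    exact:     non-LLM exact-match score (0.0 or 1.0)
    clarity:   LLM-as-a-judge rating on a 1-5 scale
    """
    # Pass if the answer is semantically relevant AND (matches the reference OR is judged clear).
    return relevancy >= 0.7 and (exact == 1.0 or clarity >= 4)

print(composite_verdict(relevancy=0.82, exact=0.0, clarity=4))  # True
print(composite_verdict(relevancy=0.55, exact=1.0, clarity=5))  # False (fails the relevancy gate)
```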
How ZBrain Monitor enables real-time oversight of AI agents and
applications
For enterprises relying on AI solutions, maintaining high standards of reliability, quality,
and compliance is non-negotiable. The ZBrain Monitor module is purpose-built for this
challenge, empowering teams to automate granular evaluation and performance tracking
across all AI agents and applications.
The ZBrain Monitor module delivers end-to-end visibility and control of AI agents and
applications, as well as prompts specific to apps, by automating both evaluation and
performance tracking. With real-time tracking, ZBrain Monitor ensures response quality
by continuously monitoring for emerging issues based on configured evaluation criteria.
This enables users to be alerted in real-time, helping maintain optimal performance
across all deployed solutions.
Monitor captures inputs and outputs from your AI solutions and continuously evaluates
responses against defined metrics at scheduled intervals. This process delivers real-time
insights into operational performance, tracking both success and failure rates while
highlighting trends that require attention. All results are presented in an intuitive interface,
enabling rapid identification and resolution of issues to ensure consistent, high-quality AI
interactions across your enterprise.
Automated evaluation: Leverage flexible evaluation frameworks with LLM-based,
non-LLM-based, and LLM-as-a-judge metrics. This enables scenario-specific
monitoring tailored to your specific use cases.
Performance tracking: Identify trends in agent and application performance
through dynamic visual logs and comprehensive reports.
Query-level monitoring: Configure evaluations at the individual query level within
each session. This granular approach provides precise oversight of agent and app
behaviors and outputs.
Agent and app support: Achieve end-to-end visibility by monitoring both AI agents
and AI applications across your enterprise landscape.
Input flexibility: Evaluate responses across a wide range of file types, including
PDF, text files, images, and other supported formats, ensuring broad applicability
across diverse workflows.
Notification alerts: Enable real-time notifications to receive event status updates
via multiple channels or email.
The Monitor module includes these main sections, accessible from the left navigation
panel:
Events: View and manage all configured monitoring events in a centralized list.
Monitor logs: Review detailed execution results and evaluation metrics, making it
simple to trace, audit, and troubleshoot.
Event settings: Configure evaluation metrics, thresholds, and notification
parameters for each event, enabling tailored monitoring strategies.
User management: ZBrain Monitor supports role-based user permissions.
Administrators can assign access and management rights to specific users,
including Builders and Operators, ensuring secure and controlled oversight of
monitoring activities.
By automating the monitoring process and surfacing actionable insights in real time,
ZBrain Monitor enables teams to maintain continuous, high-quality AI operations, achieve
faster issue resolution, and drive ongoing performance improvements.
A practical guide to configure ZBrain Monitor for apps and agents
This section provides a practical walkthrough for configuring ZBrain Monitor across both
apps and agents. You will also find insights on setting thresholds and selecting the right
metrics to align monitoring with your operational goals. Let’s explore how to achieve
precision monitoring in practice.
How to configure a monitoring event for apps using ZBrain Monitor
Step 1: Access the monitoring configuration
To set up monitoring for an application:
Access the app session:
Navigate to the Apps page.
Click on your desired application.
Go to the query history section of the app.
Select a specific user session from the list; details such as Session ID, user
information, session time and prompt count are displayed for each session.
Review the conversation:
View session details and chat history for the selected user session.
Configure monitoring events at the individual query level; each query within a
conversation can have its dedicated monitoring event.
Access conversation logs:
Click ‘Conversation Log’ to see the interaction details of a specific query
Review status, time, and token usage
Check the input, output, and metadata
Step 2: Configure event settings
You will be redirected to the Events > Monitor page. In the 'Last Status' column, click
'Configure' to open the event settings page. On the event settings screen:
1. Review entity information
Entity name: Confirm the name of the application being monitored. For
example, it is the HR Policy Query App
Entity type: The type of entity being monitored (e.g., App)
2. Verify monitored content
Monitored input: Review the query or prompt that will be evaluated.
Monitored output: Confirm the corresponding response to be assessed.
3. Set evaluation frequency
Click the dropdown menu under “Frequency of evaluation”
Select the desired interval (Hourly, Every 30 minutes, Every 6 hours, Daily,
Weekly, or Monthly) for monitor event execution.
4. Configure evaluation conditions
Click ‘Add metric’ in the Evaluation Conditions section
Select a metric type. The following metrics are currently available for
configuration, and additional options are being continuously added to enhance
monitoring capabilities.
LLM-based metrics
Response relevancy: Checks how well the response answers the user's question; higher
scores indicate better alignment with the query. Example: Use for FAQ bots or chat
assistants to validate on-topic answers.
Faithfulness: Measures whether the response accurately reflects the given context,
minimizing hallucinations or incorrect information. Example: Essential for RAG or
context-driven LLM apps to prevent factual errors.
Non-LLM metrics
Health check: Determines if the app or agent is operational and capable of producing a
valid response; further checks halt on failure. Example: Run at the start of every app or
agent execution for operational monitoring.
Exact match: Compares the app's or agent's response to the expected output for an exact
character-by-character match. Example: Use for deterministic output tasks (e.g., structured
data extraction).
F1 score: Balances precision and recall to assess how well the response captures the
expected content. Example: Useful in QA, classification, or multi-label tasks with expected
answers.
Levenshtein similarity: Calculates how closely two strings match, based on the number of
edits needed to transform one into the other. Example: Detects typos or near-matches in
text or code outputs.
ROUGE-L score: Evaluates similarity by identifying the longest common sequence of words
between the generated and reference text. Example: Effective for summarization and
paraphrase evaluation.
LLM-as-a-judge metrics
Creativity: Rates how original and imaginative the response is in addressing the prompt.
Example: Use for brainstorming, content generation, or marketing copy.
Helpfulness: Assesses how well the response guides or supports the user in resolving their
query. Example: Evaluate customer support or advisory agent responses.
Clarity: Measures how easy the response is to understand and how clearly it communicates
the intended message. Example: Review for user-facing documentation or explanations.
Select evaluation method: Choose how the metric should be applied (e.g., is less
than, is greater than, or equals to).
Set threshold value: Enter the appropriate threshold (between 0.1 and 5.0) for
your metric. The threshold represents the cutoff point at which the monitoring event
will trigger an alert or action. Example: For a response relevancy metric, you might
set a threshold of 1.0 with "is less than", meaning the event will be flagged if
relevancy drops below 1.0. Likewise, if the condition is "is less than 0.5", the
evaluation is marked as fail whenever the score falls below 0.5 (see the sketch
after this step).
Add metric: Click ‘Add’ to include the metric in your monitoring configuration.
Set the “Mark evaluation as” dropdown to ‘Fail’ or ‘Success’
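To make the semantics concrete, here is a minimal illustration of how a single evaluation condition of this kind (a metric score, a comparison operator, a threshold, and a "mark evaluation as" outcome) can be expressed and applied. It is a sketch, not ZBrain's actual evaluation engine.

```python
import operator

# Comparison methods as they appear in the event settings (illustrative mapping).
OPERATORS = {
    "is less than": operator.lt,
    "is greater than": operator.gt,
    "equals to": operator.eq,
}

def apply_condition(score: float, method: str, threshold: float, mark_as: str = "Fail") -> str:
    """Return 'Fail' or 'Success' for one evaluation condition.

    If the condition holds (e.g., score is less than 0.5), the evaluation is marked
    with the configured outcome; otherwise it receives the opposite outcome.
    """
    holds = OPERATORS[method](score, threshold)
    other = "Success" if mark_as == "Fail" else "Fail"
    return mark_as if holds else other

# Example from the text: relevancy below 0.5 marks the evaluation as Fail.
print(apply_condition(0.42, "is less than", 0.5, mark_as="Fail"))  # Fail
print(apply_condition(0.81, "is less than", 0.5, mark_as="Fail"))  # Success
```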
5. Configure notifications
Toggle the 'Send Notification' option to enable alerting for this monitoring event.
Click ‘+ Add a Flow from the Flow Library’. This opens the Add a Flow panel.
In the panel, search for the desired notification flow and select it.
Click the Play ▶️ button to run a delivery test.
Confirmation: If the flow succeeds, a “Flow Succeeded” message appears.
Error handling: If the flow fails, you will see inline error messages and a link to
“Edit Flow” for troubleshooting.
6. Test your configuration
Test the setup: Click the ‘Test’ button, enter a test message if required, and review
the test results.
Reset option: Click ‘Reset’ to try again or adjust your configuration as needed.
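The notification and test steps above rely on ZBrain Flows. For comparison, teams that also maintain their own alerting outside the Flow Library can run a hand-rolled delivery test with just a few lines, as sketched below; the URL is a placeholder and the payload assumes a Slack-style incoming webhook.

```python
import json
import urllib.request

def send_alert(webhook_url: str, event_id: str, status: str, detail: str) -> int:
    """POST a simple JSON alert to an incoming webhook (Slack-style payload assumed)."""
    payload = {"text": f"Monitor event {event_id} finished with status {status}: {detail}"}
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 200 indicates the delivery test succeeded

# send_alert("https://hooks.example.com/services/XXX", "f9dab8", "Failed",
#            "Response relevancy dropped below 0.5")
```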
7. Save your configuration
Click 'Update' to save and activate your monitoring event.
These structured steps enable you to establish robust monitoring for your AI applications,
driving operational excellence and reliability.
How to configure a monitoring event for agents using ZBrain Monitor
This section provides a step-by-step walkthrough for configuring a monitoring event for an
AI agent. For example, we will demonstrate how to set up comprehensive monitoring for
the Article Headline Optimizer Agent, illustrating each stage of the process for clarity and
actionable insights.
Step 1: Access the monitoring setup
To set up a monitoring event for an agent:
Access the agent dashboard:
Go to the Agents page and select the deployed agent (e.g., Article Headline
Optimizer Agent).
Enable monitoring:
To activate monitoring for an AI agent, follow these steps:
Open the full-screen view of ‘Agent Activity’ for your chosen agent execution.
Click the ‘Monitor’ button associated with the relevant execution.
When prompted, select ‘Configure Now’ to proceed with setting up the monitoring
parameters.
Step 2: Configure event settings
When you click 'Configure Now', you will be redirected to the 'Monitor' page. Click
‘Configure’ in the ‘Last Status’ column to open the Event Settings page.
On the Event Settings screen, review the following settings:
Entity name and type: Agent name (Article Headline Optimizer Agent) and type (Agent)
Monitored input (e.g., a prompt or document): This shows the input text/file details.
Monitored output: This shows the response generated by the agent for the specified
input.
After reviewing these details, proceed to configure the key monitoring parameters for the
event:
Frequency of evaluation: Select how often you want the agent to be evaluated.
Choose from intervals such as every 30 minutes, hourly, daily, or weekly to match
your operational needs.
Metrics: Under Evaluation Conditions, use the 'Add Metric' option to select from
LLM-based, non-LLM-based, or LLM-as-a-judge metrics. For example, response
relevancy and health check metrics are used in this execution. Combine multiple
metrics using AND/OR concatenators as needed.
Once you select the evaluation conditions, set the threshold as per business needs.
Optimal threshold values differ by industry, application, and business priority. The
following practices help in setting appropriate thresholds:
Industry standards and regulatory requirements:
Highly regulated sectors such as healthcare or finance typically demand
stricter thresholds for accuracy, faithfulness, and reliability to protect users and
comply with laws. For example, a financial AI solution handling loan approvals
or fraud detection may require a near-perfect accuracy threshold, whereas a
customer support chatbot can allow for more flexibility.
Business objectives and risk tolerance:
Consider what matters most for your use case: accuracy, recall, speed, or
user satisfaction. If missing a positive case is unacceptable (e.g., financial
fraud detection), set high recall thresholds even at the cost of increased false
positives. For applications where rapid response or cost savings are critical
(e.g., retail chatbots), optimize thresholds accordingly.
Operational impact:
Evaluate how threshold sensitivity affects operations. Excessively strict
thresholds may result in frequent false alarms and alert fatigue, while loose
thresholds may let serious issues slip through undetected.
Iterative adjustment:
Start with benchmark values based on industry norms or pilot studies (e.g.,
0.7–0.9 for accuracy in critical domains) and refine over time using feedback
from real-world monitoring data. Engage both technical and business
stakeholders in this review process.
By taking these considerations into account, you can define metric thresholds that both
align with your organization’s goals and ensure your AI systems operate safely, efficiently,
and reliably in production.
After setting thresholds, set the Mark Evaluation as either ‘Success’ or ‘Fail’
Toggle the ‘Send Notification’ option to enable alerting for this monitoring event.
Add a notification flow: Click '+ Add a Flow from the Flow Library'. This opens the
Add a Flow panel. Search for and select the notification flow that suits your alerting
needs, such as Gmail alerts, Slack notifications, MS Teams messages, or other
available options.
Test your monitor event: Use the ‘Test’ button to validate your setup. Enter a test
message if required and review the results to ensure correct behavior. If
adjustments are needed, click ‘Reset’ to modify and retest as necessary.
Save your configuration: Click ‘Update’ to save and activate your monitoring
event.
Once activated, the monitoring event runs automatically at the specified evaluation
frequency, such as every 30 minutes, daily, or weekly.
After configuring metrics and other settings, you can test the monitor and check results as
shown in the image below. By resetting the test message, you can test evaluation settings
for multiple scenarios.
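Conceptually, an activated event behaves like the simplified loop sketched below: the configured conditions are evaluated at the chosen interval and each run is logged. The interval table and the placeholder evaluator are illustrative assumptions, not ZBrain internals.

```python
import time

# Illustrative mapping from the frequency options to seconds.
FREQUENCIES = {"Every 30 minutes": 1800, "Hourly": 3600, "Every 6 hours": 21600, "Daily": 86400}

def run_monitor(evaluate_once, frequency: str, max_runs: int = 3) -> None:
    """Run the configured evaluation at the chosen interval and log each outcome."""
    interval = FREQUENCIES[frequency]
    for run in range(1, max_runs + 1):
        outcome = evaluate_once()          # returns "Success" or "Fail" for this execution
        print(f"run {run}: {outcome}")
        time.sleep(interval)

# Example wiring with a placeholder evaluator:
# run_monitor(lambda: "Success" if get_relevancy_score() >= 0.5 else "Fail",
#             frequency="Every 30 minutes")
```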
Monitor logs
Once monitoring is enabled, ZBrain captures a comprehensive activity record for every
configured event, making it easy to analyze and troubleshoot agent behavior over time.
Monitor Log for an event comprises the following components:
Event information header
Event ID: The unique identifier assigned to each monitoring event (e.g., Event ID:
f9dab8).
Entity name and type: Shows which agent is being monitored—for instance, the
Article Headline Optimizer Agent with type as Agent.
Frequency: Indicates how often the monitoring is performed (e.g., hourly, daily).
Metric: Performance criteria being measured across selected metrics.
Log status visualization
Below the event information, a status visualization uses color-coded bars to signal
outcomes instantly:
Colored bars provide a quick visual indicator of recent execution results
Green for successful evaluations
Red for failures
This visual summary helps teams quickly spot trends or anomalies.
Filtering options
Status dropdown: Filter by All/Success/Failed/Error status
Log time dropdown: Filter by active/inactive
Log details table
A detailed breakdown of each event with essential context:
Log ID: Unique identifier for each log entry. E.g., f9dfe8, f9dfe1, etc.
Log time/date: Time when the evaluation occurred
LLM response: The output that the LLM provides
Credits: Credits used for each execution.
Cost: Corresponding expense, deducted based on credits used.
Metrics: Success/Fail results for each metric under review.
Status: Final outcome, clearly color-coded for immediate clarity
(Success/Failed/Error).
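Because each execution is logged with credits, cost, and a final status, exported logs are straightforward to post-process. The sketch below uses assumed field names (not ZBrain's export schema) to compute a success rate, total cost, and the failed log IDs.

```python
# Assumed export format: one dict per log entry with status, credits, and cost fields.
logs = [
    {"log_id": "f9dfe8", "status": "Success", "credits": 2, "cost": 0.004},
    {"log_id": "f9dfe1", "status": "Failed",  "credits": 2, "cost": 0.004},
    {"log_id": "f9dfd7", "status": "Success", "credits": 3, "cost": 0.006},
]

success_rate = sum(entry["status"] == "Success" for entry in logs) / len(logs)
total_cost = sum(entry["cost"] for entry in logs)
failed_ids = [entry["log_id"] for entry in logs if entry["status"] == "Failed"]

print(f"success rate: {success_rate:.0%}, total cost: ${total_cost:.3f}, failed logs: {failed_ids}")
```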
User management
ZBrain Monitor is designed for enterprise environments and supports robust role-based
user permissions. Administrators can assign specific access rights to users, ensuring that
only authorized team members can view or configure monitoring events.
Select Monitoring Event: From the Monitor page, select the desired monitoring
event.
Open User Management option: Navigate to the ‘User Management’ tab.
Review Entity Details: A user management panel opens with the following
elements:
Entity name: Name of the agent.
Entity type: AGENT
Builder: A Builder is someone who can add, update, or use ZBrain knowledge
bases, apps, flows, and agents. Select the builder you want to invite. The
options include custom or everyone.
When selecting the ‘Custom’ option, you can use the search field to
search for builders to invite. Enter the user’s name and click the ‘Invite’
button to assign them the Builder role.
User access: Once accepted, the user will see the assigned event in the main
interface under the Monitoring tab upon login and will be able to manage it.
This role-based approach ensures security, accountability, and flexible collaboration for
large teams operating critical AI solution deployments.
By integrating ZBrain Monitor, enterprises ensure their AI solutions consistently meet
defined standards for accuracy, performance, and business value.
Streamline your operational workflows with AI agents designed to address enterprise
challenges.
Explore Our AI Agents
Best practices for monitoring AI agents and applications
Robust monitoring is essential for ensuring the accuracy, quality, and compliance of both
AI agents and applications at scale. By leveraging ZBrain Monitor’s full capabilities,
organizations can implement the following best practices to maximize oversight and drive
continuous improvement:
Align monitoring to business outcomes
Define success criteria: Specify what constitutes “success” for each AI app and
agent—whether it’s task accuracy, user satisfaction, compliance, or operational
efficiency. When defining success criteria, use SMART (Specific, Measurable,
Achievable, Relevant, Time-bound) goals so that each metric and KPI is actionable.
For example, a customer support agent's response-time goal might be to reduce
average response latency by 20% within a quarter.
Map metrics to business impact: Select evaluation metrics (LLM-based, non-
LLM-based, and LLM-as-a-judge) that reflect both technical and business priorities
for each monitored entity.
Utilize multi-layered metrics and establish baselines
Comprehensive metric coverage: Monitor a range of metrics (quality, latency, cost,
etc.) for all apps and agents, providing a holistic view of entity behavior. Move
beyond single-metric (LLM-based or non-LLM-based) evaluation by employing
composite or multi-objective metrics, ensuring agents and apps are assessed on
quality, speed, and cost.
Baseline comparisons: Regularly compare performance data against historical
benchmarks or defined targets to identify improvements, degradations, or
anomalies.
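One lightweight way to operationalize baseline comparisons is to compare a recent window of scores against a historical benchmark and flag drops beyond a tolerance, as sketched below with illustrative numbers.

```python
from statistics import mean

def regressed(recent_scores: list[float], baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a regression if the recent average falls more than `tolerance` below the baseline."""
    return mean(recent_scores) < baseline - tolerance

recent_relevancy = [0.78, 0.74, 0.71, 0.69]          # latest monitored evaluations
print(regressed(recent_relevancy, baseline=0.82))    # True -> investigate the drop
```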
Set and tune effective thresholds
Context-aware thresholds: Start with conservative thresholds and refine them
based on real-world usage and the risk profile of each application or agent.
Criticality-based segmentation: Set more stringent thresholds (e.g., higher
accuracy) for business-critical or customer-facing apps and agents. For internal or
less critical use cases, thresholds can be adjusted as needed to strike a balance
between resource usage and business risk.
Continuously refine metrics and thresholds
Iterative improvement: Revisit your metrics and thresholds regularly, especially
after major product updates, scaling events, or as your understanding of business
impact matures.
Stakeholder feedback: Incorporate input from both technical and business teams
to ensure metrics remain aligned with operational and strategic goals.
Blend automated and human-in-the-loop evaluation
Human oversight for subjective criteria: Supplement automated metrics with
regular human review, particularly for subjective factors such as clarity, tone, or
regulatory risk.
User feedback integration: Collect and utilize user feedback on both apps and
agents to continuously improve outcomes.
Drive iterative improvement through regular review
Periodic data reviews: Establish a schedule for reviewing monitoring results and
updating evaluation strategies.
Evolve metrics and policies: Add or revise metrics as business needs and AI
system roles evolve; adjust thresholds based on observed trends.
Maintain comprehensive monitoring documentation
Policy documentation: Maintain detailed records that outline what is monitored,
which metrics are tracked, how those metrics are calculated, and the rationale
behind the selected thresholds.
Response playbooks: Document escalation paths, notification recipients, and
clear, actionable steps for responding to different types of incidents or alerts.
Regular reviews: Periodically update documentation to reflect system changes,
new regulatory requirements, or evolving business needs.
Track credits and cost transparency
Monitor credit usage: Ensure that every execution for a monitoring event, whether
by an AI agent or application, logs the exact number of credits consumed. Tracking
credit consumption at the session and query level helps organizations understand
usage patterns and optimize allocation.
Enable cost visibility: Provide real-time visibility into the cost associated with each
monitored event. Make it easy for stakeholders to view cost breakdowns, both per
query and in aggregate, directly from the monitoring interface.
Set budget and usage alerts: Implement thresholds or automated alerts for credit
usage and overall spending. E.g., ZBrain Monitor can support automated alerts for
cost-specific metrics. Notify teams proactively if usage approaches or exceeds
predefined limits to prevent cost overruns.
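A budget check of the kind described above can be as simple as the sketch below; the credit figures and warning ratio are placeholders.

```python
def check_budget(credits_used: int, credit_limit: int, warn_ratio: float = 0.8) -> str:
    """Return an alert level based on how much of the credit budget has been consumed."""
    ratio = credits_used / credit_limit
    if ratio >= 1.0:
        return "exceeded: pause non-critical monitoring events"
    if ratio >= warn_ratio:
        return "warning: approaching the credit limit"
    return "ok"

print(check_budget(credits_used=850, credit_limit=1000))  # warning: approaching the credit limit
```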
Foster a monitoring-first culture
Training: Educate teams on the importance of monitoring, how to interpret metrics,
and the process for responding to alerts.
Transparency: Make monitoring dashboards, alert histories, and performance
reports available to all relevant stakeholders.
Accountability: Assign clear ownership for each monitored entity and for
maintaining the health of the monitoring infrastructure itself.
Key benefits of monitoring AI agents and applications with ZBrain
Monitor
Effective monitoring isn’t just about compliance or operational health; it unlocks business
value across multiple dimensions. Here’s how ZBrain Monitor delivers measurable
benefits:
Actionable performance insights
ZBrain Monitor offers real-time visibility into agent and app performance through metrics
such as response relevance, faithfulness, and health checks. For example, if an agent’s
relevance score drops below a set threshold, teams can immediately investigate and
resolve the issue before it affects business outcomes.
Faster troubleshooting and root-cause analysis
Granular, query-level monitoring enables teams to quickly pinpoint the cause of errors or
suboptimal performance, reducing investigation time and accelerating resolution.
Stronger alignment with business objectives
By aligning monitoring metrics with business KPIs, organizations ensure that their AI
systems remain focused on delivering measurable value, whether it’s operational
efficiency, risk reduction, or enhanced customer experience.
Enhanced debugging and troubleshooting
By logging detailed execution traces, including inputs, outputs, and metric outcomes,
ZBrain Monitor makes it simple to isolate root causes. For example, if an agent’s F1 score
or Levenshtein similarity consistently falls short for specific input types (like PDF files), the
team can review the logs, replicate the scenario, and roll out targeted fixes.
Elevated user experience
Monitoring satisfaction and relevance metrics helps product owners identify when AI-
generated responses become less helpful or clear, triggering timely improvements. In
customer-facing scenarios, maintaining high clarity and helpfulness scores ensures
continued trust and engagement.
Greater agility and innovation
With visibility into performance and the ability to iterate quickly, product teams can launch,
test, and refine new AI-powered features with greater confidence. Early detection of
issues accelerates feedback cycles, reducing the risk and cost of failed experiments.
Improved customer trust and retention
Consistent, high-quality AI interactions lead to enhanced end-user experiences. When
customers see that your applications deliver reliable, accurate, and timely responses,
their trust and loyalty to your brand increase, directly impacting retention and revenue.
With ZBrain Monitor, organizations achieve real-time, automated oversight that drives
higher performance, faster innovation, and ongoing business value from every AI solution
deployment.
Endnote
As AI agents and applications rapidly reshape enterprise operations, robust monitoring is
not just a technical necessity; it is a driver of business resilience and competitive edge.
ZBrain’s monitoring module empowers organizations to move beyond surface-level
oversight, delivering automated, granular, and actionable intelligence for every AI agent
and application. By enabling real-time performance tracking, accelerating root-cause
resolution, and aligning AI outcomes with business KPIs, ZBrain transforms AI from an
experimental tool into a reliable, scalable business asset. In today’s landscape,
embracing advanced monitoring is essential to unlock the full value and longevity of your
AI investments.
Ready to unlock the full potential of AI agents and applications? Explore how ZBrain helps you build,
deploy, and continuously monitor enterprise-grade AI solutions with seamless integration,
real-time visibility, and actionable insights at every step.
Author's Bio
Akash Takyar
CEO LeewayHertz
Frequently Asked Questions
What is AI agent/application monitoring, and why is it critical for enterprises?
AI agent and application monitoring is the systematic process of tracking, evaluating, and
analyzing the real-time behavior, responses, and performance of AI-powered systems in
production. Unlike basic uptime monitoring, it includes a granular evaluation of response
quality, operational health, compliance, and cost efficiency.
For enterprises, this practice is crucial because AI agents and applications frequently
manage complex, business-critical workflows that demand both trustworthiness and
transparency. Proactive monitoring enables the early detection of errors, operational
bottlenecks, and compliance risks, while facilitating continuous optimization and
accountability for business outcomes.