| Internet-Draft | agent-sketch-com | May 2026 |
| Cui, et al. | Expires 19 November 2026 | [Page] |
This document describes a framework for efficient and reliable communication among AI-driven agents and between agents and network devices in the context of network operations (NetOps). As large language model (LLM)-based agents are increasingly deployed to automate network management tasks — including fault localization, configuration verification, traffic engineering, and attack mitigation — they must exchange large volumes of network state information across multiple administrative domains. Existing protocols are not designed to jointly satisfy the reliability requirements of operational commands and the efficiency requirements of network state dissemination at scale.¶
This document motivates the need for a new communication framework, defines requirements, and proposes an architecture that combines the Constrained Application Protocol (CoAP) for reliable message delivery with distributed probabilistic data structures (Sketch) for compact, mergeable network state representation. Bindings between CoAP and emerging agent protocols (MCP and A2A) are outlined. Representative use cases, including DDoS detection and mitigation, are described to validate the applicability of the framework.¶
This note is to be removed before publishing as an RFC.¶
The latest revision of this draft can be found at https://xmzzyo.github.io/nmop-agent-sketch-com/draft-cui-nmop-agent-sketch-com.html. Status information for this document may be found at https://datatracker.ietf.org/doc/draft-cui-nmop-agent-sketch-com/.¶
Discussion of this document takes place on the Network Management Operations Working Group mailing list (mailto:nmop@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/nmop/. Subscribe at https://www.ietf.org/mailman/listinfo/nmop/.¶
Source for this draft and an issue tracker can be found at https://github.com/xmzzyo/nmop-agent-sketch-com.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 19 November 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The operational complexity of modern networks has grown substantially. Networks now span multiple autonomous systems (ASes), administrative domains, and technology layers. Network management tasks — such as detecting and mitigating distributed denial-of-service (DDoS) attacks, localizing faults across domains, verifying configuration consistency, and optimizing traffic engineering — require the timely collection, synthesis, and reasoning over large amounts of network state.¶
Traditional approaches to network operations relied on human operators and rule-based automation. These approaches do not scale to the demands of large, dynamic, multi-domain networks. The widespread availability of high-quality large language models (LLMs) in the 2020s opened a new paradigm: AI-driven network operations, in which autonomous LLM-based agents perform complex reasoning tasks — root cause analysis, multi-step remediation planning, policy synthesis — that previously required significant human expertise.¶
A fundamental architectural insight is that a single LLM agent cannot maintain complete, real-time visibility over a large multi-domain network. The scale of telemetry data, the diversity of device types, and administrative separation between domains each impose hard limits on what any single agent can observe or control. A multi-agent architecture is therefore necessary: Orchestration Agents maintain a global view and coordinate responses across domains; Domain Agents aggregate state from devices within their domain; Device Agents (or device-side CoAP servers) maintain local state and execute instructions. This three-tier structure allows each level to operate with appropriate granularity, eliminating the information overload that would result from a flat, fully-connected agent topology.¶
Deploying cooperating agents in a production network introduces a fundamental communication challenge with two conflicting pressures.¶
The first is a reliability requirement. Network operations involve consequential actions: changing routing policies, applying access control lists, rate-limiting traffic, rolling back configurations. These actions must be executed with confirmation. An agent that issues a mitigation command to a border router and receives no confirmation cannot know whether the network is protected. Silent delivery failures are operationally dangerous. Commands must be acknowledged, retransmission must be idempotent, and failures must be reported.¶
The second is an efficiency requirement. The primary information currency between agents is network state — link utilization, flow statistics, routing tables, interface health, configuration parameters — and the volume of this data is enormous. A single edge router may generate millions of flow records per minute; a domain of hundreds of routers generates billions. Agents do not need raw data: they need actionable summaries — answers to questions such as "Which source prefix is sending the most traffic?" or "How many unique source IPs are observed across this domain?" Moreover, in multi-operator or multi-AS scenarios, raw flow records cannot be shared across administrative boundaries due to privacy, legal, and competitive constraints.¶
Existing protocols do not simultaneously address both requirements. NETCONF/YANG [RFC6241] and gNMI [GNMI] provide reliable, schema-driven management but produce verbose, full-fidelity output not suitable for agent-to-agent state exchange. IPFIX [RFC7011] provides efficient flow export but offers no reliability guarantees or agent-interaction semantics. The Model Context Protocol (MCP) [MCP] and Agent-to-Agent (A2A) protocol [A2A] provide the right agent-native semantics — tool invocation, task delegation, artifact exchange — but are currently defined over HTTP/SSE, which is ill-suited to constrained network device management planes, and define no mechanism for compressing the network state that agents must exchange.¶
| Protocol | Reliable Delivery | Efficient State | Agent-Native | Assessment |
|---|---|---|---|---|
| NETCONF/YANG | Yes | No (full XML) | No | Too verbose; no agent semantics |
| gNMI/gRPC | Yes | Partial | No | No summary layer; heavy stack |
| IPFIX/NetFlow | No | Partial | No | Export only; no agent interaction |
| MCP (HTTP) | Partial | No | Yes | No transport guarantees; no compression |
| A2A (HTTP/SSE) | Partial | No | Yes | SSE not suitable for device management planes |
| This framework | Yes (CoAP CON) | Yes (Sketch) | Yes (MCP/A2A bindings) | Addresses both requirements |
This document proposes a framework that resolves the reliability-efficiency tension through a two-layer communication design, with each layer addressing one dimension of the problem and the two layers combining cleanly.¶
The Reliable Layer is based on the Constrained Application Protocol (CoAP) [RFC7252]. CoAP operates over UDP and provides a REST-style subset of HTTP semantics with a compact 4-byte binary header. Its Confirmable (CON) message type implements acknowledged delivery with exponential-backoff retransmission, directly satisfying the reliability requirement for operational commands. CoAP's Non-confirmable (NON) messages and Observe extension [RFC7641] provide loss-tolerant push notifications for high-frequency telemetry streams. DTLS [RFC9147] provides mutual authentication and encryption. CoAP is already widely implemented on network equipment and is the basis of IETF management work such as CORECONF (COMI), which uses the YANG-to-CBOR encoding specified in [RFC9254], giving the framework both the right transport properties and an established deployment footprint.
The Efficiency Layer is based on distributed probabilistic data structures — collectively referred to in this document as Sketch. Sketches provide compact, fixed-size representations of streaming network observations with provable bounded-error guarantees. A Count-Min Sketch summarizing per-flow traffic rates across a domain may occupy a few hundred kilobytes, compared to gigabytes of raw flow records. A HyperLogLog estimating the number of unique source IPs occupies 64 bytes regardless of how many distinct addresses are observed. Critically, Sketches support a merge operation: structures from multiple devices or domains can be combined into a structure representing the union of all observations, without access to the underlying raw data. This mergeability property enables cross-domain state aggregation and cross-domain sharing without exposing privacy-sensitive information.¶
The two layers are orthogonal and complementary: CoAP governs how messages are delivered; Sketch governs what they contain. Neither alone is sufficient — CoAP without Sketch would transmit raw telemetry and fail on efficiency and privacy; Sketch without CoAP would have no mechanism for reliable command delivery. Together, they allow each component to be evolved independently while providing a clean interface that both the network management and AI agent communities can implement.¶
Beyond CoAP and Sketch, the framework defines normative bindings between CoAP and MCP/A2A. MCP and A2A represent an emerging consensus on how AI agents communicate. Without defined bindings, each deployment builds its own translation layer, leading to fragmentation. This document defines those bindings to standardize a single, interoperable interface.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The following terms are used throughout this document:¶
Agent: A software entity capable of autonomous reasoning and action in a network management context. An agent may be driven by a large language model (LLM), a rule engine, or a combination of both.¶
Orchestration Agent: A high-level agent responsible for decomposing complex network management tasks, dispatching sub-tasks to Domain Agents, and synthesizing results into operational decisions. Typically LLM-driven.¶
Domain Agent: An agent responsible for a specific administrative domain (e.g., an autonomous system or a geographic region). It collects and summarizes network state from devices within its domain and cooperates with other Domain Agents and the Orchestration Agent.¶
Device Agent: A lightweight agent co-located with or embedded in a network device (router, switch). It manages local Sketch structures and exposes them via a CoAP server interface.¶
Sketch: A probabilistic data structure that provides a compact, bounded-error summary of a multiset or set of network observations. Examples include Count-Min Sketch, HyperLogLog, DDSketch, MinHash, and Bloom Filter.¶
Sketch Node: A network device or software component that maintains one or more Sketch structures updated from the local data plane (e.g., via P4, eBPF, or software sampling).¶
Sketch Merge: The operation of combining two Sketch structures of the same type into a single structure whose estimates reflect the union of the underlying observation sets.¶
XOR-Delta: An incremental transmission scheme in which only the changed cells of a Sketch array are transmitted between synchronization points, computed as the bitwise XOR of the current and baseline Sketch arrays.¶
CoAP: The Constrained Application Protocol [RFC7252], a lightweight RESTful protocol operating over UDP with optional reliability via Confirmable (CON) messages.¶
MCP: The Model Context Protocol [MCP], an open protocol that standardizes how applications provide context, tools, and resources to LLM-based agents.¶
A2A: The Agent-to-Agent protocol [A2A], a protocol for task-level communication and coordination between autonomous agents.¶
CON message: A CoAP Confirmable message that requires an acknowledgment (ACK) from the recipient. Used for reliable delivery.¶
NON message: A CoAP Non-confirmable message sent without requiring an acknowledgment. Used for high-frequency, loss-tolerant data streams.¶
Observe: A CoAP extension [RFC7641] that allows a client to register interest in a resource and receive notifications when the resource changes.¶
Network operations involve consequential actions: changing routing policies, applying ACLs, rate-limiting traffic, rolling back configurations. An agent that issues a mitigation command to a border router and does not receive confirmation of execution cannot know whether the network is protected. Silent failures — commands lost in transit — are operationally dangerous.¶
The reliability requirement has several dimensions:¶
Delivery guarantee: Operational commands MUST be delivered to the target device and acknowledged.¶
Idempotency: Retransmitted commands MUST NOT cause duplicate or inconsistent state changes on the device.¶
Ordering: Related commands (e.g., a sequence of configuration steps) MUST be executed in the correct order.¶
Failure notification: When a command cannot be delivered after retransmission, the issuing agent MUST be notified.¶
Existing approaches that rely on UDP-based telemetry streams without acknowledgment do not satisfy this requirement. TCP-based protocols satisfy it but introduce head-of-line blocking and connection overhead, both of which are problematic for constrained devices and lossy management-plane paths.
Network state is voluminous. Transmitting raw telemetry between agents at operational timescales is impractical for several reasons:¶
Volume: A single domain may generate terabytes of raw telemetry per day. Agents cannot buffer or transmit this at the latency required for real-time operations.¶
Cross-domain privacy: In multi-operator or multi-AS scenarios, raw flow records cannot be shared across administrative boundaries due to legal, regulatory, and competitive constraints.¶
Inference latency: LLM agents reasoning over gigabytes of raw input incur unacceptable latency. Compact, structured summaries are required.¶
Management plane bandwidth: The management plane of network devices is deliberately rate-limited to protect the control plane. High-volume telemetry export over this plane is not feasible.¶
The efficiency requirement demands a state representation that is compact, supports cross-domain sharing without exposing raw data, and can be incrementally updated to avoid full retransmission on every synchronization cycle.¶
No existing protocol simultaneously satisfies the reliability requirement for operational commands, the efficiency requirement for network state dissemination, and the agent-native communication semantics required for LLM-driven NetOps. This three-way gap is the core problem this framework addresses.¶
The reliability-efficiency tension is not resolvable by adjusting parameters of any single existing protocol — it requires a deliberate two-layer design. The agent-native gap requires new bindings between emerging agent protocols (MCP, A2A) and network device interfaces. Both design choices are developed in Sections 5 through 8.¶
This section defines the requirements that the framework is designed to satisfy. Requirements are stated using the key words defined in Section 2.¶
The framework MUST provide a mechanism for delivering operational commands from an agent to a target device or agent with guaranteed delivery and acknowledgment.¶
The framework MUST support retransmission of unacknowledged commands with configurable backoff.¶
The framework MUST support idempotent command execution, such that a retransmitted command does not cause duplicate state changes on the target.¶
The framework MUST notify the sending agent when a command cannot be delivered after the maximum number of retransmission attempts.¶
The framework MUST support a compact representation of network state that can be exchanged between agents with substantially lower bandwidth than raw telemetry data.¶
The compact representation MUST provide provable, configurable error bounds (epsilon, delta) on the accuracy of estimates derived from it.¶
The compact representation MUST support merging of instances from multiple sources to produce a combined representation without access to the underlying raw data.¶
The framework MUST support incremental (delta) transmission of network state updates, such that only changes since the last synchronization point are transmitted.¶
The incremental transmission mechanism MUST support fallback to full-state transmission when the delta exceeds a configurable threshold or when a gap in the update sequence is detected.¶
The framework MUST define normative bindings between the agent communication primitives of MCP and A2A and the CoAP message types and resource model used by the framework.¶
The bindings MUST cover at minimum: tool invocation (MCP tools/call), resource subscription (MCP resources/subscribe), task submission (A2A tasks/send), and task status subscription (A2A tasks/subscribe).¶
The framework MUST ensure that the error bounds (epsilon, delta) of Sketch estimates are communicated alongside the estimates themselves, so that agents can incorporate uncertainty into their reasoning and decision-making.¶
The framework SHOULD define how error bounds propagate through Sketch merge operations across multiple domains or devices.¶
The compact state representation used by the framework MUST NOT require the transmission of raw flow records, raw IP addresses, or other privacy-sensitive data to satisfy cross-domain state sharing requirements.¶
The representation MUST allow agents in different administrative domains to derive useful aggregate estimates without exposing the underlying observations.¶
The framework SHOULD be deployable on existing network infrastructure without requiring hardware upgrades.¶
The framework SHOULD define a software-based implementation path (e.g., using Linux eBPF or user-space sampling) as a fallback to hardware-accelerated implementations (e.g., P4-based data plane Sketch updates), with documented performance trade-offs.¶
The framework SHOULD be interoperable with existing network management infrastructure, including YANG data models [RFC7950] and NETCONF [RFC6241].¶
The framework SHOULD define a YANG module for Sketch node configuration and state, allowing Sketch nodes to be managed via existing NETCONF/RESTCONF tooling.¶
The framework defines a three-tier agent architecture connected by a two-layer communication stack:¶
┌──────────────────────────────────────────────┐
│          Orchestration Agent (LLM)           │
│ Global reasoning · Task dispatch · Decision  │
└───────────────┬──────────────────────────────┘
                │ A2A over CoAP
      ┌─────────┼─────────┐
      v         v         v
┌──────────┐ ┌──────────┐ ...
│  Domain  │ │  Domain  │
│  Agent   │ │  Agent   │
└────┬─────┘ └────┬─────┘
     │ CoAP       │ CoAP
┌────v────────────v──────┐
│    Network Devices     │
│     (Sketch Nodes)     │
└────────────────────────┘
Figure 1: The two-layer communication stack of the framework
Reliable Layer (CoAP): Carries operational commands, task coordination messages, and large Sketch payloads with guaranteed delivery.
Efficiency Layer (Sketch): Provides the data representation in all network state exchanges. Sketch structures are generated at Sketch Nodes, transmitted via CoAP to Domain Agents, merged at the domain level, and aggregated at the orchestration level.¶
Orchestration Agent:
- Receives NetOps task requests from operators or automated systems.
- Decomposes tasks into sub-tasks and delegates them to Domain Agents via A2A task messages.
- Aggregates Sketch summaries from multiple domains to derive global network state estimates.
- Makes operational decisions based on Sketch-derived estimates and LLM reasoning.
- Issues operational commands to Domain Agents or Sketch Nodes via CoAP CON messages.

Domain Agent:
- Subscribes to Sketch updates from all Sketch Nodes within its domain via CoAP Observe.
- Maintains a domain-level merged Sketch representing the aggregate state of its domain.
- Responds to Sketch sharing requests from the Orchestration Agent or peer Domain Agents.
- Executes sub-tasks assigned by the Orchestration Agent.

Device Agent / Sketch Node:
- Maintains one or more Sketch structures updated from the local data plane.
- Exposes Sketch resources via a CoAP server at well-known resource paths.
- Pushes incremental Sketch updates to subscribed Domain Agents via CoAP Observe NON messages.
- Receives and executes operational commands delivered via CoAP CON messages.
| Relationship | Protocol | Primary Use |
|---|---|---|
| Orchestration Agent ↔ Domain Agent | A2A over CoAP | Task delegation, Sketch aggregation, decision dissemination |
| Domain Agent ↔ Domain Agent | A2A over CoAP | Cross-domain Sketch sharing, peer coordination |
| Domain Agent ↔ Sketch Node | CoAP (direct) | Sketch subscription, command delivery, status reporting |
This framework uses CoAP [RFC7252] as the transport substrate for all agent-to-device and agent-to-agent communication. The following features are used:¶
Confirmable (CON) messages for operational commands, task messages, and large Sketch transfers, with ACK and exponential-backoff retransmission.¶
Non-confirmable (NON) messages for high-frequency incremental Sketch updates via Observe. Loss-tolerant; a missed update is recovered at the next synchronization cycle.¶
Observe [RFC7641] for Domain Agent subscriptions to device Sketch resources, receiving push notifications when Sketch state changes beyond a configured threshold.¶
Block-Wise Transfer [RFC7959] for Sketch payloads exceeding the maximum CoAP message size.¶
CBOR encoding (Content-Format 60) for all Sketch payloads [RFC8949], reducing payload size by 30–50% compared to JSON.¶
Sketch Nodes and Device Agents MUST expose the following CoAP resource tree:¶
coap://<device>/
├── ops/sketch/
│ ├── ops/sketch/cms (Count-Min Sketch)
│ ├── ops/sketch/hll (HyperLogLog)
│ ├── ops/sketch/ddsketch (DDSketch)
│ └── ops/sketch/minhash (MinHash / Bloom Filter)
├── ops/agent/
│ ├── ops/agent/task (Receive agent tasks via POST)
│ └── ops/agent/status (Report agent status via GET/Observe)
└── ops/config/
├── ops/config/apply (Apply configuration via CON POST)
└── ops/config/rollback (Rollback configuration via CON POST)
Devices MUST expose at minimum ops/sketch/cms and ops/sketch/hll.¶
Retransmission: CON messages not acknowledged within ACK_TIMEOUT (default: 2 seconds per RFC 7252) are retransmitted with exponential backoff up to MAX_RETRANSMIT (default: 4) attempts. After MAX_RETRANSMIT failures, the sending agent MUST be notified of delivery failure.¶
Idempotency: Every CON command message MUST carry a unique Token (4 bytes) and a SequenceID in the payload. Receiving devices MUST maintain an idempotency cache keyed by (Token, SequenceID) with a configurable TTL (default: 300 seconds). Duplicate messages MUST return the cached response without re-executing the command.¶
Gap detection: Domain Agents MUST monitor Observe sequence numbers on subscribed Sketch resources. If a gap larger than a configurable threshold (default: 5 missed updates) is detected, the Domain Agent MUST issue a CON GET to retrieve the full current Sketch state and resynchronize the baseline.¶
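The idempotency rule above can be illustrated with a minimal sketch of a device-side cache. The class name `IdempotencyCache`, the `handle` method, and the string response codes are hypothetical names chosen for illustration; only the (Token, SequenceID) key and the 300-second default TTL come from the text.

```python
import time
from typing import Callable, Dict, Tuple

Key = Tuple[bytes, int]  # (CoAP Token, SequenceID)


class IdempotencyCache:
    """Replays cached responses for duplicate commands within a TTL window."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries: Dict[Key, Tuple[float, str]] = {}

    def handle(self, token: bytes, sequence_id: int,
               execute: Callable[[], str]) -> str:
        now = self.clock()
        # Evict expired entries so the cache stays bounded by the TTL.
        self._entries = {k: v for k, v in self._entries.items()
                         if now - v[0] < self.ttl}
        key = (token, sequence_id)
        if key in self._entries:
            return self._entries[key][1]  # duplicate: replay, do not re-execute
        response = execute()
        self._entries[key] = (now, response)
        return response


# Example: a retransmitted rate-limit command is applied only once.
applied = []


def apply_rate_limit() -> str:
    applied.append("rate-limit 203.0.113.0/24")
    return "2.04 Changed"


cache = IdempotencyCache()
first = cache.handle(b"\x01\x02\x03\x04", 7, apply_rate_limit)
retry = cache.handle(b"\x01\x02\x03\x04", 7, apply_rate_limit)  # retransmission
```

Both calls return the same response, while the command body runs a single time. A production cache would also need a size limit and persistence policy, which this sketch leaves out.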
In this framework, Sketch structures serve as the primary representation of network state exchanged between agents. Rather than transmitting raw flow records, routing tables, or interface statistics, agents exchange Sketch summaries — compact structures that answer specific queries about network state with bounded error.¶
A Sketch is not a detection tool or anomaly detector. It is a data representation format — the network state analog of a compressed file format, but one that supports meaningful queries and cross-domain merging. The intelligence — detection, reasoning, and decision-making — resides in the agents that query and interpret Sketch structures.¶
The appropriate Sketch type depends on the nature of the network state being represented and the queries agents need to answer:¶
| NetOps Task | Query Type | Recommended Sketch | Key Property Used |
|---|---|---|---|
| Flow rate analysis | "What is the traffic rate from prefix X?" | Count-Min Sketch (CMS) | Frequency estimation with epsilon-delta bounds |
| Source diversity analysis | "How many unique source IPs are there?" | HyperLogLog (HLL) | Cardinality estimation, cross-domain mergeable |
| Latency / jitter analysis | "What is the p99 latency on path P?" | DDSketch | Quantile estimation with relative error bounds |
| Configuration consistency | "Is device A's config consistent with peers?" | MinHash | Set similarity estimation (Jaccard index) |
| Affected flow marking | "Is flow F affected by fault X?" | Bloom Filter | Set membership with configurable false positive rate |
Sketch parameters SHOULD be configured based on the expected observation cardinality and the desired accuracy level (epsilon, delta).¶
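As an illustration of the cardinality row in the table above, the following is a minimal HyperLogLog sketch with 64 one-byte registers. It assumes the standard HyperLogLog construction (register-wise maximum for merging, harmonic-mean estimator with small-range correction); the class and helper names, the choice of BLAKE2b as the hash, and the example addresses are assumptions made for this example, not part of the framework.

```python
import hashlib
import math
from typing import Iterable

_ALPHA = {16: 0.673, 32: 0.697, 64: 0.709}  # standard bias-correction constants


class HyperLogLog:
    def __init__(self, precision: int = 6) -> None:
        if precision < 4:
            raise ValueError("precision must be at least 4")
        self.p = precision
        self.m = 1 << precision              # 64 registers when precision is 6
        self.registers = bytearray(self.m)   # one byte per register

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        index = h >> (64 - self.p)                 # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1 bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = HyperLogLog(self.p)
        merged.registers = bytearray(max(a, b) for a, b in zip(self.registers, other.registers))
        return merged

    def estimate(self) -> float:
        alpha = _ALPHA.get(self.m, 0.7213 / (1 + 1.079 / self.m))
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


def sketch_of(addresses: Iterable[str]) -> HyperLogLog:
    hll = HyperLogLog()
    for address in addresses:
        hll.add(address)
    return hll


# Two domains observe overlapping source addresses (illustrative data only).
as1 = sketch_of(f"10.0.{i // 256}.{i % 256}" for i in range(0, 5000))
as2 = sketch_of(f"10.0.{i // 256}.{i % 256}" for i in range(2500, 7500))
merged = as1.merge(as2)
union = sketch_of(f"10.0.{i // 256}.{i % 256}" for i in range(0, 7500))
```

The merged registers are identical to those of a sketch built directly over the union of both address sets, which is why domains can share the 64-byte structure instead of the addresses. With only 64 registers the standard error is roughly 13%, so the estimate is coarse.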
Sketch structures are fixed-size arrays of counters or registers. Full retransmission at every synchronization interval is wasteful when only a small fraction of cells change. The XOR-Delta scheme provides efficient incremental updates:¶
The Sketch Node maintains the current array S[t] and the baseline S[t0] (state at last synchronization).¶
The delta is computed as D = S[t] XOR S[t0].¶
Only the non-zero entries of D are transmitted as (index, value) pairs.¶
The receiving agent reconstructs the current Sketch: S[t] = S[t0] XOR D.¶
When the fraction of changed cells exceeds a threshold (default: 20%), or a gap in the Observe sequence is detected, full Sketch retransmission is triggered.¶
Under typical steady-state conditions, incremental deltas are expected to represent 1–5% of the full Sketch size, reducing management plane bandwidth consumption proportionally.¶
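The XOR-Delta steps above can be sketched as follows. The function names and the example counter values are hypothetical; the 20% fallback threshold is the default stated in the text, and Sketch arrays are modeled as plain lists of non-negative integers.

```python
from typing import List, Tuple

Delta = List[Tuple[int, int]]  # (cell index, XOR of current and baseline value)

FULL_RESYNC_THRESHOLD = 0.20   # default fraction of changed cells from the text


def compute_delta(current: List[int], baseline: List[int]) -> Delta:
    """Return the non-zero entries of D = S[t] XOR S[t0]."""
    if len(current) != len(baseline):
        raise ValueError("sketch arrays must have the same length")
    return [(i, c ^ b) for i, (c, b) in enumerate(zip(current, baseline)) if c != b]


def needs_full_resync(delta: Delta, size: int,
                      threshold: float = FULL_RESYNC_THRESHOLD) -> bool:
    """True when so many cells changed that sending the full array is preferable."""
    return size > 0 and len(delta) / size > threshold


def apply_delta(baseline: List[int], delta: Delta) -> List[int]:
    """Reconstruct S[t] = S[t0] XOR D on the receiving side."""
    updated = list(baseline)
    for index, value in delta:
        updated[index] ^= value
    return updated


baseline = [5, 0, 12, 7, 0, 0, 3, 9, 0, 1]
current = [5, 0, 14, 7, 0, 0, 3, 9, 0, 1]
delta = compute_delta(current, baseline)          # [(2, 2)] since 12 XOR 14 == 2
rebuilt = apply_delta(baseline, delta)
resync = needs_full_resync(delta, len(current))   # 1 of 10 cells changed: False
```

Because XOR is its own inverse, the receiver needs only the baseline and the transmitted pairs; it must hold exactly the same baseline as the sender, which is why gap detection triggers a full retransmission.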
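When Sketch structures from multiple sources are merged, the error bounds of the merged structure can be computed analytically for most Sketch types. For example, when two Count-Min Sketches with the same dimensions (w, d), the same hash functions, and the same error parameters (epsilon, delta) are merged via element-wise addition of their counters, the merged structure retains the same error parameters: each estimate exceeds the true count by at most epsilon times the total count of the combined stream, with probability at least 1 - delta. HyperLogLog, by contrast, merges by taking the element-wise maximum of its registers.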
The framework requires that error bound parameters (epsilon, delta) be included in all Sketch messages so that receiving agents can propagate them correctly. Implementations SHOULD validate that Sketch structures being merged have compatible parameters before performing the merge operation.¶
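A minimal sketch of parameter-checked merging for a Count-Min Sketch follows. It assumes the usual additive merge (cell-wise sum of counters) and the standard sizing w = ceil(e / epsilon), d = ceil(ln(1 / delta)); the class layout, the BLAKE2b-based row hashes, and the example prefixes are assumptions made for illustration.

```python
import hashlib
import math
from typing import List


class CountMinSketch:
    def __init__(self, epsilon: float, delta: float) -> None:
        self.epsilon, self.delta = epsilon, delta
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.total = 0  # total count inserted; scales the absolute error bound
        self.rows: List[List[int]] = [[0] * self.width for _ in range(self.depth)]

    def _column(self, row: int, key: str) -> int:
        # One salted hash per row; all sketches must share these to be mergeable.
        digest = hashlib.blake2b(key.encode(), digest_size=8,
                                 salt=row.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        self.total += count
        for r in range(self.depth):
            self.rows[r][self._column(r, key)] += count

    def estimate(self, key: str) -> int:
        return min(self.rows[r][self._column(r, key)] for r in range(self.depth))

    def error_bound(self) -> float:
        """Overestimate is at most this value with probability >= 1 - delta."""
        return self.epsilon * self.total

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        if (self.epsilon, self.delta) != (other.epsilon, other.delta):
            raise ValueError("incompatible (epsilon, delta) parameters")
        merged = CountMinSketch(self.epsilon, self.delta)
        merged.total = self.total + other.total
        merged.rows = [[a + b for a, b in zip(ra, rb)]
                       for ra, rb in zip(self.rows, other.rows)]
        return merged


router_a = CountMinSketch(epsilon=0.01, delta=0.01)
router_b = CountMinSketch(epsilon=0.01, delta=0.01)
router_a.add("203.0.113.0/24", 500)
router_b.add("203.0.113.0/24", 300)
router_b.add("198.51.100.0/24", 50)
domain = router_a.merge(router_b)
rate = domain.estimate("203.0.113.0/24")
```

The merged sketch keeps the same epsilon and delta, while the absolute bound grows with the combined total, so a receiving agent needs both the parameters and the total count to interpret an estimate.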
The bindings defined in this section map MCP and A2A semantic primitives onto CoAP methods, message types, and resource paths. The guiding principles are:¶
Reliability follows semantics: MCP/A2A primitives with operational consequences (state-modifying tool invocations, task submissions) MUST map to CoAP CON messages. Observational primitives (subscriptions, status updates) SHOULD map to CoAP NON with Observe.¶
Encoding efficiency: All payloads SHOULD use CBOR encoding (Content-Format 60). JSON (Content-Format 50) MAY be used for diagnostic purposes.¶
Path stability: CoAP resource paths defined in this framework MUST NOT change between protocol versions.¶
| MCP Primitive | CoAP Method | Message Type | Resource Path |
|---|---|---|---|
| tools/list | GET | CON | /ops/mcp/tools |
| tools/call | POST | CON | /ops/mcp/tools/call |
| resources/read | GET | CON | /ops/mcp/resources/{name} |
| resources/subscribe | GET + Observe | CON (register) / NON (notify) | /ops/sketch/{type} |
| resources/unsubscribe | RST | N/A | CoAP RST to cancel Observe |
| prompts/get | GET | CON | /ops/mcp/prompts/{name} |
For tools/call, the MCP JSON-RPC 2.0 request body is carried as the CoAP payload. The CoAP Token field serves as the correlation identifier and MUST be unique per outstanding request.¶
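The tools/call binding can be sketched as a small translation step. The `CoapRequest` container, the `bind_tools_call` helper, and the tool name `query_sketch` are hypothetical; the method, message type, and path come from the mapping above, and the payload follows the MCP JSON-RPC 2.0 request shape. The sketch models the request as data only and does not send anything on the wire.

```python
import json
import secrets
from dataclasses import dataclass
from typing import Any, Dict, Set, Tuple

# (CoAP method, message type, resource path) per MCP primitive, from the mapping above.
MCP_TO_COAP: Dict[str, Tuple[str, str, str]] = {
    "tools/list": ("GET", "CON", "/ops/mcp/tools"),
    "tools/call": ("POST", "CON", "/ops/mcp/tools/call"),
    "resources/read": ("GET", "CON", "/ops/mcp/resources/{name}"),
    "prompts/get": ("GET", "CON", "/ops/mcp/prompts/{name}"),
}


@dataclass(frozen=True)
class CoapRequest:
    method: str
    message_type: str
    path: str
    token: bytes
    payload: bytes


def bind_tools_call(request_id: int, tool: str, arguments: Dict[str, Any],
                    outstanding: Set[bytes]) -> CoapRequest:
    """Wrap an MCP tools/call request in a CoAP CON POST description."""
    method, message_type, path = MCP_TO_COAP["tools/call"]
    token = secrets.token_bytes(4)
    while token in outstanding:          # Token must be unique per outstanding request
        token = secrets.token_bytes(4)
    outstanding.add(token)
    body = {"jsonrpc": "2.0", "id": request_id, "method": "tools/call",
            "params": {"name": tool, "arguments": arguments}}
    return CoapRequest(method, message_type, path, token, json.dumps(body).encode())


outstanding_tokens: Set[bytes] = set()
request = bind_tools_call(1, "query_sketch", {"type": "cms", "key": "203.0.113.0/24"},
                          outstanding_tokens)
```

The caller removes the Token from `outstanding_tokens` once the matching response arrives. JSON is used here for readability; the framework's preferred CBOR encoding would replace `json.dumps` in a real binding.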
| A2A Primitive | CoAP Method | Message Type | Notes |
|---|---|---|---|
| tasks/send (sync) | POST | CON | Response carries task result directly |
| tasks/send (async) | POST | CON | Response is 2.31 Continue; task ID in Location-Path |
| tasks/get | GET | CON | Resource path: /a2a/tasks/{task-id} |
| tasks/cancel | DELETE | CON | Resource path: /a2a/tasks/{task-id} |
| tasks/subscribe | GET + Observe | CON (register) / NON (notify) | Replaces HTTP SSE for task status streaming |
For asynchronous tasks (the common case for complex NetOps tasks such as fault localization), the interaction proceeds as follows:¶
The initiating agent sends tasks/send via CON POST and receives a 2.31 Continue response with the task ID.¶
The initiating agent registers an Observe subscription on the task resource.¶
The executing agent sends NON Observe notifications as the task progresses; the final notification carries the task result artifacts.¶
The initiating agent cancels the Observe subscription by sending a CoAP RST.¶
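The four steps above can be modeled from the initiating agent's side as a small state tracker. This is a minimal sketch: the class `AsyncTaskTracker`, its method names, and the task identifier are hypothetical, messages are recorded as strings rather than sent, and the accepted response code is simply the one named in the text.

```python
from enum import Enum
from typing import List, Optional


class Phase(Enum):
    IDLE = "idle"
    OBSERVING = "observing"   # task accepted and Observe registration issued
    DONE = "done"             # final notification received, Observe cancelled


class AsyncTaskTracker:
    ACCEPTED_CODE = "2.31"    # response code named in the text for async tasks/send

    def __init__(self) -> None:
        self.phase = Phase.IDLE
        self.task_id: Optional[str] = None
        self.progress: List[str] = []
        self.result: Optional[str] = None
        self.sent: List[str] = []   # log of messages this agent would transmit

    def submit(self, description: str) -> None:
        # Step 1: tasks/send via CON POST.
        self.sent.append(f"CON POST tasks/send {description}")

    def on_submit_response(self, code: str, task_id: str) -> None:
        if code != self.ACCEPTED_CODE:
            raise RuntimeError(f"task not accepted: {code}")
        self.task_id = task_id
        # Step 2: register an Observe subscription on the task resource.
        self.sent.append(f"CON GET+Observe /a2a/tasks/{task_id}")
        self.phase = Phase.OBSERVING

    def on_notification(self, status: str, final: bool = False) -> None:
        if self.phase is not Phase.OBSERVING:
            raise RuntimeError("notification outside an active subscription")
        if final:
            # Steps 3 and 4: keep the result, then cancel Observe with RST.
            self.result = status
            self.sent.append("RST")
            self.phase = Phase.DONE
        else:
            self.progress.append(status)


tracker = AsyncTaskTracker()
tracker.submit("fault-localization AS-1 to AS-3")
tracker.on_submit_response("2.31", "t-42")
tracker.on_notification("collecting DDSketches")
tracker.on_notification("loss deviates from baseline in AS-2", final=True)
```

Because progress notifications are NON messages, a real client would also need a timeout or a tasks/get poll to recover when the final notification is lost; that is omitted here.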
Agents supporting this framework MUST include the following additional fields in their A2A AgentCard:¶
{
"coap_extensions": {
"endpoint": "coap://<host>[:<port>]",
"dtls_required": true,
"observe_supported": true,
"cbor_encoding": true,
"max_payload_bytes": 1024,
"block_transfer": true,
"sketch_types": ["cms", "hll", "ddsketch", "minhash"]
}
}
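A receiving agent might check these AgentCard fields before using a peer's CoAP endpoint. The following is a minimal sketch, assuming every field shown above is required and typed as in the example; the function name, the error strings, and the acceptance of the `coaps` scheme are assumptions made for illustration.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = {
    "endpoint": str,
    "dtls_required": bool,
    "observe_supported": bool,
    "cbor_encoding": bool,
    "max_payload_bytes": int,
    "block_transfer": bool,
    "sketch_types": list,
}
KNOWN_SKETCH_TYPES = {"cms", "hll", "ddsketch", "minhash"}


def validate_coap_extensions(card: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the card looks usable."""
    ext = card.get("coap_extensions")
    if not isinstance(ext, dict):
        return ["coap_extensions object is missing"]
    problems: List[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in ext:
            problems.append(f"missing field: {name}")
            continue
        value = ext[name]
        # bool is a subclass of int in Python, so reject it for integer fields.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            problems.append(f"wrong type for field: {name}")
    endpoint = ext.get("endpoint")
    if isinstance(endpoint, str) and not endpoint.startswith(("coap://", "coaps://")):
        problems.append("endpoint must use the coap or coaps scheme")
    types = ext.get("sketch_types")
    if isinstance(types, list):
        unknown = [t for t in types if not isinstance(t, str) or t not in KNOWN_SKETCH_TYPES]
        if unknown:
            problems.append(f"unknown sketch types: {unknown}")
    return problems


good_card = {"coap_extensions": {
    "endpoint": "coap://agent.example:5683", "dtls_required": True,
    "observe_supported": True, "cbor_encoding": True, "max_payload_bytes": 1024,
    "block_transfer": True, "sketch_types": ["cms", "hll"]}}
bad_card = {"coap_extensions": {
    "endpoint": "http://agent.example", "dtls_required": True,
    "observe_supported": True, "cbor_encoding": True, "max_payload_bytes": "1024",
    "sketch_types": ["cms", "tdigest"]}}
good_problems = validate_coap_extensions(good_card)
bad_problems = validate_coap_extensions(bad_card)
```

The second card is rejected for a wrong payload-size type, a missing `block_transfer` field, a non-CoAP endpoint, and an unrecognized sketch type.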
Scenario: A volumetric DDoS attack is directed at a destination prefix within AS-1. The attack traffic originates from a large botnet distributed across multiple ASes.¶
Participating entities: Border routers in AS-1 (Sketch Nodes), Domain Agent for AS-1, peer Domain Agents for AS-2 and AS-3, Orchestration Agent.¶
Sketch usage: Border routers maintain Count-Min Sketch structures updated by eBPF programs on the data plane, tracking per-source-prefix packet rates. Each Domain Agent maintains a HyperLogLog to estimate the cardinality of unique source IP addresses across its domain.¶
Protocol flow:¶
Domain Agent AS-1 receives CMS incremental updates (XOR-Delta, NON Observe) from border routers. It queries the merged CMS and detects that traffic from source prefix 203.0.113.0/24 has exceeded a configured threshold.¶
Domain Agent AS-1 sends an A2A tasks/send (DDOS_SUSPECT) message via CON POST to the Orchestration Agent, attaching its domain HLL (64 bytes) as a task artifact.¶
The Orchestration Agent sends A2A SKETCH_SYNC_REQUEST tasks to Domain Agents for AS-2 and AS-3, requesting their HLL Sketches.¶
The Orchestration Agent merges the three HLLs to estimate the total unique source IPs across all domains. A cardinality above 10^5 indicates a distributed botnet; below 100 suggests a single-source amplification attack requiring a different response.¶
The Orchestration Agent determines the appropriate mitigation action (e.g., FlowSpec [RFC8955] rate-limit rule) and delivers it to border routers in AS-1 via CON POST to /ops/config/apply. CON guarantees delivery; idempotency ensures retransmission does not cause duplicate ACL entries.¶
Routers acknowledge application (2.04 Changed). Domain Agent AS-1 continues pushing CMS deltas; the Orchestration Agent monitors whether traffic normalizes and issues a CON POST to remove the rule when the attack subsides.¶
Requirements addressed: REQ-1 (reliable mitigation delivery), REQ-2 (HLL/CMS compact representation), REQ-6 (cross-domain sharing without raw IP exposure).¶
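The HLL merge and classification step in this flow can be sketched as follows. This is a minimal illustration, not the wire format or the algorithm mandated by this document: the `HyperLogLog` class, the use of SHA-256 as the hash, the unpacked register list, and the `classify` helper and its labels are all assumptions made for the example. Only the two thresholds (10^5 and 100) come from the text above.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog; b index bits give m = 2**b registers."""

    def __init__(self, b: int = 12):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash: the top b bits pick the register, the rest give the rank.
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.b)
        rest = h & ((1 << (64 - self.b)) - 1)
        rank = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        # The sketch of a union is the register-wise maximum.
        if self.b != other.b:
            raise ValueError("register counts differ")
        out = HyperLogLog(self.b)
        out.registers = [max(x, y) for x, y in zip(self.registers, other.registers)]
        return out

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


def classify(cardinality: float) -> str:
    """Hypothetical labels; thresholds taken from the protocol flow above."""
    if cardinality > 1e5:
        return "distributed-botnet"
    if cardinality < 100:
        return "single-source-amplification"
    return "indeterminate"
```

Because merging is a register-wise maximum, the Orchestration Agent can combine the three domain sketches in any order, and sources seen in more than one domain are counted once. No raw source addresses leave a domain.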
Scenario: Users in AS-1 report packet loss to a destination in AS-3. The fault may lie in any of the transit ASes.¶
Protocol flow: Domain Agents for each AS query their DDSketch structures tracking per-path latency and loss distributions, and share merged DDSketches via A2A tasks. The Orchestration Agent compares per-hop quantile estimates to identify the AS where latency or loss deviates from baseline, then queries the relevant Domain Agent for Bloom Filter data marking affected flows to narrow down the faulty link.¶
Requirements addressed: REQ-2, REQ-3, REQ-4 (async A2A task binding), REQ-6.¶
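The quantile comparison in this flow can be illustrated with a small DDSketch. The class below is a simplified sketch for positive values only, with unbounded bucket storage. The `deviating_domains` helper, its `factor` parameter, and the choice of the 99th percentile are assumptions introduced for the example and are not defined by this document.

```python
import math
from collections import Counter


class DDSketch:
    """Simplified DDSketch for positive values with relative accuracy alpha."""

    def __init__(self, alpha: float = 0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = Counter()
        self.count = 0

    def add(self, value: float) -> None:
        # Bucket i covers the interval (gamma**(i-1), gamma**i].
        idx = math.ceil(math.log(value) / math.log(self.gamma))
        self.buckets[idx] += 1
        self.count += 1

    def merge(self, other: "DDSketch") -> "DDSketch":
        # Sketches with the same alpha merge by adding bucket counts.
        if self.alpha != other.alpha:
            raise ValueError("alpha differs")
        out = DDSketch(self.alpha)
        out.buckets = self.buckets + other.buckets
        out.count = self.count + other.count
        return out

    def quantile(self, q: float) -> float:
        if self.count == 0:
            raise ValueError("empty sketch")
        rank = q * (self.count - 1)
        seen = 0
        for idx in sorted(self.buckets):
            seen += self.buckets[idx]
            if seen > rank:
                # Representative value of the bucket.
                return 2 * self.gamma ** idx / (self.gamma + 1)


def deviating_domains(baseline, current, q=0.99, factor=1.5):
    """Return domains whose q-quantile exceeds factor times their baseline."""
    return [name for name in current
            if current[name].quantile(q) > factor * baseline[name].quantile(q)]
```

The Orchestration Agent would hold one baseline sketch and one current sketch per AS, both keyed by a domain name, and follow up only with the Domain Agents that `deviating_domains` returns.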
Scenario: A network-wide audit is required to verify that all border routers are running consistent BGP policy configurations.¶
Protocol flow: The Orchestration Agent requests MinHash Sketches from all Domain Agents. Each Domain Agent computes MinHash structures representing the set of active configuration items on its devices. The Orchestration Agent computes pairwise Jaccard similarity estimates to identify devices whose configurations have diverged. Divergent devices are flagged, and corrective configurations are pushed via CON POST.¶
Requirements addressed: REQ-1, REQ-2, REQ-5 (MinHash similarity bounds).¶
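The similarity estimate used in this audit can be sketched as below. The signature length of 128, the use of salted BLAKE2b to simulate independent hash functions, and the representation of configuration items as strings are assumptions made for illustration; the document does not specify them.

```python
import hashlib

NUM_PERM = 128  # assumed signature length; more values give a tighter estimate


def minhash(items, num_perm: int = NUM_PERM):
    """Return a MinHash signature for a non-empty collection of strings."""
    signature = []
    for i in range(num_perm):
        salt = i.to_bytes(4, "big")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, salt=salt).digest(), "big")
            for x in items))
    return signature


def jaccard_estimate(sig_a, sig_b) -> float:
    """Fraction of matching positions estimates |A ∩ B| / |A ∪ B|."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signature lengths differ")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two devices with identical configuration item sets produce identical signatures and an estimate of 1.0. A divergence threshold (for example, flagging pairs below 0.95) would be an operator policy choice and is not defined here.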
Scenario: The Orchestration Agent needs to optimize inter-domain traffic routing based on current load and latency conditions.¶
Protocol flow: Domain Agents continuously aggregate per-flow CMS structures from their devices into domain-level traffic matrices. DDSketch structures capture latency distributions on inter-domain links. The Orchestration Agent periodically collects these structures via A2A tasks (using Block-Wise Transfer for large matrices), constructs a compressed global traffic matrix, and uses LLM reasoning to generate updated routing policy recommendations. Approved policies are pushed via CON POST.¶
Requirements addressed: REQ-1, REQ-2, REQ-3, REQ-7.¶
The framework is designed for incremental deployment without simultaneous upgrades across all devices:¶
Phase 1 — Software-based Sketch: Sketch structures are maintained by user-space or eBPF programs on existing device CPUs. CoAP servers run as software daemons. No hardware changes required; deployable immediately on Linux-based equipment (REQ-7).¶
Phase 2 — eBPF-accelerated Sketch: eBPF XDP programs update Sketch structures at several million packets per second on commodity NICs, providing near-line-rate collection for high-traffic edge devices without programmable forwarding hardware.¶
Phase 3 — P4-based hardware Sketch: On P4-programmable forwarding hardware, Sketch updates occur at full line rate (100 Gbps and beyond) entirely within the data plane. P4 programs export Sketch deltas to the device's CoAP server process for transmission to Domain Agents.¶
A companion document defines a YANG module ietf-sketch-node modeling Sketch type configuration, Sketch state (current values, timestamps, error bounds), CoAP server configuration, and subscription management. This allows operators to configure and monitor Sketch Nodes using existing NETCONF/RESTCONF tooling (REQ-8).¶
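Count-Min Sketch: Set width w = ceil(e / epsilon) and depth d = ceil(ln(1/delta)), where epsilon bounds the overestimate of any item's count as a fraction of the total count N (the estimate exceeds the true count by at most epsilon × N with probability at least 1 − delta) and delta is the probability that this bound is exceeded.¶
HyperLogLog: Relative standard error ≈ 1.04 / √m where m = 2^b is the number of registers. For a target error of 2%, use b = 12 (m = 4096, giving about 1.6% error), which occupies 2 KB with 4-bit registers.¶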
DDSketch: The relative accuracy parameter α (default: 0.01) determines the maximum relative error on quantile estimates.¶
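The Count-Min Sketch and HyperLogLog sizing formulas above can be checked with a few lines of arithmetic. The function names are illustrative only.

```python
import math


def cms_dimensions(epsilon: float, delta: float):
    """Width and depth for additive error epsilon * N with failure probability delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1 / delta))
    return width, depth


def hll_registers(target_error: float):
    """Smallest b (and m = 2**b) with 1.04 / sqrt(m) at or below target_error."""
    m_needed = (1.04 / target_error) ** 2
    b = math.ceil(math.log2(m_needed))
    return b, 1 << b


def hll_bytes(m: int, bits_per_register: int) -> int:
    """Storage for m densely packed registers."""
    return m * bits_per_register // 8
```

For example, `cms_dimensions(0.001, 0.01)` gives a 2719 × 5 counter array, `hll_registers(0.02)` gives b = 12 and m = 4096, and `hll_bytes(4096, 4)` is 2048 bytes.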
All CoAP communication MUST be protected by DTLS 1.3 [RFC9147] when operating over untrusted networks:¶
Certificate mode: Used for agent-to-agent communication. Each agent presents an X.509 certificate whose Subject or SAN field identifies its CoAP URI. Certificates are issued by a management-plane PKI.¶
Pre-shared key (PSK) mode: Used for agent-to-device communication where devices have limited computational resources. PSK values are provisioned out-of-band and can be rotated by the Orchestration Agent via CON POST.¶
Sketch structures transmitted between agents and devices could be tampered with to influence agent decision-making. DTLS encryption prevents eavesdropping; DTLS authentication prevents impersonation. Additionally, each Sketch message SHOULD carry an HMAC-SHA256 integrity tag over the payload, keyed with a secret negotiated during the DTLS handshake.¶
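A minimal sketch of the integrity tag follows. How the HMAC key is obtained from the DTLS handshake is not specified in this document, so the example takes the key as an opaque byte string; the function names are hypothetical.

```python
import hashlib
import hmac


def tag_sketch(payload: bytes, key: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over a serialized Sketch payload."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify_sketch(payload: bytes, tag: bytes, key: bytes) -> bool:
    """Check the tag in constant time; reject the Sketch if this returns False."""
    return hmac.compare_digest(tag_sketch(payload, key), tag)
```

The receiver recomputes the tag over the payload it received and discards the Sketch on mismatch, before merging it into any domain-level structure.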
An adversary with write access to a Sketch Node could manipulate Sketch structures to cause incorrect agent decisions. Defenses include:¶
Using keyed hash functions (e.g., SipHash [SIPHASH]) for Sketch index computation, preventing predictable collision attacks.¶
Cross-validating Sketch estimates from multiple independent Sketch Nodes before acting on them.¶
Monitoring for statistically anomalous Sketch patterns (e.g., a single cell accounting for an implausibly large fraction of total counts).¶
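The first and third defenses can be sketched as below. Python's standard library does not expose SipHash as a keyed hash function, so this example substitutes keyed BLAKE2b to show the same idea: without the key, an adversary cannot predict which cells an item maps to. The function names and the per-row salt are assumptions made for the example.

```python
import hashlib


def cms_indexes(item: bytes, key: bytes, width: int, depth: int):
    """Return one column index per row, derived from a secret-keyed hash."""
    indexes = []
    for row in range(depth):
        digest = hashlib.blake2b(
            item, digest_size=8, key=key, salt=row.to_bytes(2, "big")).digest()
        indexes.append(int.from_bytes(digest, "big") % width)
    return indexes


def dominant_cell_fraction(row_counts) -> float:
    """Share of a row's total held by its largest cell (0.0 for an empty row)."""
    total = sum(row_counts)
    return max(row_counts) / total if total else 0.0
```

A Domain Agent could compare `dominant_cell_fraction` against an operator-chosen limit and treat a row that exceeds it as suspect. The appropriate limit depends on the traffic mix and is not defined here.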
CoAP servers on network devices are resource-constrained and could be overwhelmed by floods of CON messages. Implementations SHOULD enforce per-source rate limits on incoming CON messages and SHOULD use CoAP's built-in congestion control mechanisms (ACK_TIMEOUT, NSTART).¶
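One way to enforce a per-source limit is a token bucket keyed by the sender's address. The sketch below is an assumption about how an implementation might do this, not a mechanism defined by CoAP or by this document; the class name and the `rate` and `burst` parameters are hypothetical.

```python
import time


class PerSourceLimiter:
    """Token bucket per source: `rate` messages per second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable for testing
        self.state = {}     # source -> (tokens, time of last update)

    def allow(self, source: str) -> bool:
        now = self.clock()
        tokens, last = self.state.get(source, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[source] = (tokens - 1, now)
            return True
        self.state[source] = (tokens, now)
        return False
```

A CoAP server would call `allow()` with the peer address for each incoming CON message and drop or reject the message when it returns False. A production implementation would also need to expire idle entries so that the state table itself cannot be exhausted.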
This document has no IANA actions.¶
TODO acknowledge.¶