Building an AI-Powered Platform Operations Agent#

Platform engineering teams handle a constant stream of repetitive requests: onboarding users, managing API keys, checking service health, rotating credentials. Most of these tasks follow well-defined procedures that a human executes step by step. What if an AI agent could handle them instead?

In this post, I’ll walk through the architecture of an AI-powered operations agent that automates common platform tasks by giving an LLM access to your internal tools through a structured tool-calling interface.

The Problem#

A typical day on a platform team looks like this:

  • “Can you create a GitLab repo for project X?”
  • “I need AWS access for the new developer joining next week.”
  • “What’s the current status of the LLM Gateway?”
  • “Can you rotate the API key for service Y?”
  • “Add these users to the Jira project.”

Each request is straightforward, but collectively they eat hours of engineering time. The procedures are documented, the APIs exist, the permissions are in place. What’s missing is an intelligent dispatch layer that can understand the request and execute the right steps.

Architecture Overview#

The agent follows a simple loop:

User Request -> LLM (understands intent) -> Tool Selection -> Tool Execution -> Response

The LLM acts as the “brain” that:

  1. Understands what the user wants
  2. Selects the right tool(s)
  3. Passes the correct parameters
  4. Interprets the results
  5. Decides if more steps are needed
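Stripped of any framework, that loop is small. Here's a minimal, self-contained sketch in which a keyword matcher stands in for the LLM (the tool and service names are illustrative):

```python
# Minimal agent loop: a stubbed "LLM" maps a request to a tool call,
# the dispatcher executes it, and the result is summarized.
# In the real agent, steps 1-5 are delegated to the model via tool calling.

def check_health(service: str) -> str:
    # Illustrative tool; a real one would query the service.
    return f"{service}: healthy"

TOOLS = {"check_health": check_health}

def fake_llm(request: str) -> tuple[str, dict]:
    """Stand-in for the model: picks a tool and parameters by keyword."""
    if "status" in request.lower() or "health" in request.lower():
        return "check_health", {"service": "llm-gateway"}
    raise ValueError("no matching tool for request")

def agent_loop(request: str) -> str:
    tool_name, params = fake_llm(request)   # understand intent, select tool
    result = TOOLS[tool_name](**params)     # execute with chosen parameters
    return f"{tool_name} -> {result}"       # interpret the result
```

The real version replaces fake_llm with the model's tool-calling ability and loops until the model decides no more steps are needed.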

Wiring this up with the strands library looks like this:

from strands import Agent
from strands.models import BedrockModel

from .tools import (
    aws_tools,
    gitlab_tools,
    jira_tools,
    keycloak_tools,
    email_tools,
)

def create_agent() -> Agent:
    model = BedrockModel(
        model_id="anthropic.claude-sonnet-4-20250514-v1:0",
        region_name="eu-central-1",
    )

    tools = [
        *aws_tools,
        *gitlab_tools,
        *jira_tools,
        *keycloak_tools,
        *email_tools,
    ]

    system_prompt = """You are a platform operations assistant for the engineering team.
    You help with:
    - AWS account and user management
    - GitLab repository operations
    - Jira project administration
    - Keycloak user management
    - Service credential rotation

    Always confirm destructive operations before executing them.
    Never expose secrets in your responses.
    When in doubt, ask for clarification rather than guessing."""

    return Agent(
        model=model,
        tools=tools,
        system_prompt=system_prompt,
    )

Defining Tools#

Each tool is a Python function with a clear docstring that tells the LLM what it does and what parameters it needs:

from strands.tools import tool

@tool
def dynamodb_get_item(
    table_name: str,
    key: dict,
    profile: str = "default",
) -> dict:
    """Retrieve an item from a DynamoDB table.

    Args:
        table_name: The DynamoDB table name
        key: The primary key of the item to retrieve (e.g., {"pk": {"S": "value"}})
        profile: AWS profile to use

    Returns:
        The item if found, or an empty dict
    """
    import boto3

    session = boto3.Session(profile_name=profile)
    client = session.client("dynamodb", region_name="eu-central-1")

    response = client.get_item(TableName=table_name, Key=key)
    return response.get("Item", {})


@tool
def dynamodb_scan(
    table_name: str,
    filter_expression: str | None = None,
    expression_values: dict | None = None,
    profile: str = "default",
) -> list[dict]:
    """Scan a DynamoDB table with an optional filter.

    Args:
        table_name: The DynamoDB table name
        filter_expression: Optional filter expression
        expression_values: Optional expression attribute values
        profile: AWS profile to use

    Returns:
        List of matching items
    """
    import boto3

    session = boto3.Session(profile_name=profile)
    client = session.client("dynamodb", region_name="eu-central-1")

    params: dict = {"TableName": table_name}
    if filter_expression:
        params["FilterExpression"] = filter_expression
    if expression_values:
        params["ExpressionAttributeValues"] = expression_values

    items = []
    while True:
        response = client.scan(**params)
        items.extend(response.get("Items", []))
        if "LastEvaluatedKey" not in response:
            break
        params["ExclusiveStartKey"] = response["LastEvaluatedKey"]

    return items
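
The low-level client returns items in DynamoDB's typed attribute format (the {"S": ...} wrapping shown in the key example above). A small helper like the following (hypothetical, not part of the toolset) flattens them into plain values before they reach the model:

```python
def unwrap_item(item: dict) -> dict:
    """Convert DynamoDB typed attributes to plain Python values.
    Handles S (string), N (number), and BOOL; other types pass through."""
    plain = {}
    for name, typed in item.items():
        (dtype, value), = typed.items()  # each attribute is {"TYPE": value}
        if dtype == "N":
            plain[name] = float(value)
        elif dtype == "BOOL":
            plain[name] = bool(value)
        else:
            plain[name] = value
    return plain
```

Flattened items also keep tool outputs smaller, which matters once results are fed back into the model's context window.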

Secrets Manager Integration#

Managing secrets is a common platform task. The agent needs to list, retrieve, and rotate secrets:

@tool
def secrets_manager_list(
    prefix: str | None = None,
    profile: str = "default",
) -> list[str]:
    """List secrets in AWS Secrets Manager, optionally filtered by prefix.

    Args:
        prefix: Optional prefix to filter secret names
        profile: AWS profile to use

    Returns:
        List of secret names
    """
    import boto3

    session = boto3.Session(profile_name=profile)
    client = session.client("secretsmanager", region_name="eu-central-1")

    paginator = client.get_paginator("list_secrets")
    filters = []
    if prefix:
        filters.append({"Key": "name", "Values": [prefix]})

    secrets = []
    for page in paginator.paginate(Filters=filters):
        for secret in page["SecretList"]:
            secrets.append(secret["Name"])

    return secrets


@tool
def secrets_manager_get(
    secret_name: str,
    profile: str = "default",
) -> str:
    """Retrieve a secret value from AWS Secrets Manager.

    Args:
        secret_name: The name or ARN of the secret
        profile: AWS profile to use

    Returns:
        The secret value as a string
    """
    import boto3

    session = boto3.Session(profile_name=profile)
    client = session.client("secretsmanager", region_name="eu-central-1")

    response = client.get_secret_value(SecretId=secret_name)
    return response["SecretString"]


@tool
def secrets_manager_put(
    secret_name: str,
    secret_value: str,
    description: str | None = None,
    profile: str = "default",
) -> str:
    """Create or update a secret in AWS Secrets Manager.

    Args:
        secret_name: The name of the secret
        secret_value: The secret value to store
        description: Optional description for the secret
        profile: AWS profile to use

    Returns:
        The ARN of the created/updated secret
    """
    import boto3

    session = boto3.Session(profile_name=profile)
    client = session.client("secretsmanager", region_name="eu-central-1")

    try:
        response = client.update_secret(
            SecretId=secret_name,
            SecretString=secret_value,
            **({"Description": description} if description else {}),
        )
    except client.exceptions.ResourceNotFoundException:
        response = client.create_secret(
            Name=secret_name,
            SecretString=secret_value,
            **({"Description": description} if description else {}),
        )

    return response["ARN"]

Jira Integration#

Most user requests arrive as Jira service desk tickets, so the agent needs to read and update issues:

import requests
from strands.tools import tool


@tool
def jira_get_issue(issue_key: str) -> dict:
    """Get details of a Jira issue.

    Args:
        issue_key: The issue key (e.g., GAP-123)

    Returns:
        Issue details including summary, status, assignee, and description
    """
    import os

    base_url = os.environ["ATLASSIAN_BASE_URL"]
    auth = (os.environ["ATLASSIAN_USER"], os.environ["ATLASSIAN_TOKEN"])

    response = requests.get(
        f"{base_url}/rest/api/2/issue/{issue_key}",
        auth=auth,
        timeout=30,
    )
    response.raise_for_status()

    data = response.json()
    return {
        "key": data["key"],
        "summary": data["fields"]["summary"],
        "status": data["fields"]["status"]["name"],
        "assignee": (data["fields"].get("assignee") or {}).get("displayName"),
        "description": data["fields"].get("description", ""),
        "issue_type": data["fields"]["issuetype"]["name"],
    }


@tool
def jira_transition_issue(issue_key: str, status: str) -> str:
    """Transition a Jira issue to a new status.

    Args:
        issue_key: The issue key (e.g., GAP-123)
        status: The target status name (e.g., "Done", "In Progress")

    Returns:
        Confirmation message
    """
    import os

    base_url = os.environ["ATLASSIAN_BASE_URL"]
    auth = (os.environ["ATLASSIAN_USER"], os.environ["ATLASSIAN_TOKEN"])

    # Get available transitions
    response = requests.get(
        f"{base_url}/rest/api/2/issue/{issue_key}/transitions",
        auth=auth,
        timeout=30,
    )
    response.raise_for_status()

    transitions = response.json()["transitions"]
    target = next(
        (t for t in transitions if t["name"].lower() == status.lower()),
        None,
    )

    if not target:
        available = [t["name"] for t in transitions]
        return f"Status '{status}' not available. Options: {available}"

    response = requests.post(
        f"{base_url}/rest/api/2/issue/{issue_key}/transitions",
        auth=auth,
        json={"transition": {"id": target["id"]}},
        timeout=30,
    )
    response.raise_for_status()

    return f"Issue {issue_key} transitioned to {status}"

Email Notifications#

After completing a request, the agent can send email notifications:

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

from strands.tools import tool


@tool
def send_email(
    to: list[str],
    subject: str,
    body_html: str,
    cc: list[str] | None = None,
) -> str:
    """Send an email notification.

    Args:
        to: List of recipient email addresses
        subject: Email subject line
        body_html: Email body in HTML format
        cc: Optional list of CC recipients

    Returns:
        Confirmation message
    """
    import os

    msg = MIMEMultipart("alternative")
    msg["Subject"] = subject
    msg["From"] = os.environ["SMTP_FROM"]
    msg["To"] = ", ".join(to)
    if cc:
        msg["Cc"] = ", ".join(cc)

    msg.attach(MIMEText(body_html, "html"))

    with smtplib.SMTP(os.environ["SMTP_HOST"], int(os.environ["SMTP_PORT"])) as server:
        server.starttls()
        server.login(os.environ["SMTP_USER"], os.environ["SMTP_PASSWORD"])
        recipients = to + (cc or [])
        server.sendmail(os.environ["SMTP_FROM"], recipients, msg.as_string())

    return f"Email sent to {', '.join(to)}"
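
The HTML body is usually assembled from the outcome of the request. A small builder like this (hypothetical; the markup is illustrative) keeps the formatting consistent rather than leaving it to the LLM:

```python
def build_notification(requester: str, action: str, details: list[str]) -> str:
    """Assemble a simple HTML body for the send_email tool."""
    items = "".join(f"<li>{d}</li>" for d in details)
    return (
        f"<p>Hi {requester},</p>"
        f"<p>Your request has been completed: {action}.</p>"
        f"<ul>{items}</ul>"
    )
```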

Safety: Confirmation for Destructive Operations#

The system prompt tells the agent to confirm destructive operations, but you should also enforce this at the tool level:

@tool
def s3_delete_objects(
    bucket: str,
    prefix: str,
    dry_run: bool = True,
    profile: str = "default",
) -> dict:
    """Delete objects from an S3 bucket matching a prefix.

    Args:
        bucket: The S3 bucket name
        prefix: The key prefix to match for deletion
        dry_run: If True, only list objects that would be deleted without deleting them
        profile: AWS profile to use

    Returns:
        Dict with count of objects deleted (or that would be deleted in dry-run mode)
    """
    import boto3

    session = boto3.Session(profile_name=profile)
    s3 = session.client("s3", region_name="eu-central-1")

    paginator = s3.get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            objects.append({"Key": obj["Key"]})

    if dry_run:
        return {
            "mode": "dry_run",
            "would_delete": len(objects),
            "sample_keys": [o["Key"] for o in objects[:10]],
        }

    if not objects:
        return {"mode": "live", "deleted": 0}

    # Delete in batches of 1000 (S3 API limit)
    deleted = 0
    for i in range(0, len(objects), 1000):
        batch = objects[i : i + 1000]
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": batch},
        )
        deleted += len(batch)

    return {"mode": "live", "deleted": deleted}

The dry_run=True default means the agent must explicitly opt into destructive behavior. In practice, the LLM does a dry run first, shows the results to the user, and only proceeds with dry_run=False after confirmation.

Running the Agent#

The agent can be exposed as a CLI, a Slack bot, or a web API. Here’s a simple CLI loop:

def main():
    agent = create_agent()

    print("Platform Operations Agent")
    print("Type 'quit' to exit\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ("quit", "exit"):
            break

        response = agent(user_input)
        print(f"\nAgent: {response}\n")


if __name__ == "__main__":
    main()

Example interaction:

You: List all API keys in the LLM Gateway DynamoDB table

Agent: I found 23 active API keys in the LLM Gateway table. Here's a summary:
- 8 keys for production services
- 10 keys for non-production services
- 5 keys for external consumers

Would you like me to show details for any specific category?

You: Show me the production keys

Agent: Here are the 8 production API keys:
1. service-claims-ai-prd (last used: 2 hours ago)
2. service-document-classifier-prd (last used: 15 minutes ago)
...

Lessons Learned#

After running this agent for several months, a few lessons stand out:

  1. Tool descriptions are critical. The LLM selects tools based on their docstrings. Vague descriptions lead to wrong tool selection. Be specific about what each tool does and when to use it.

  2. Default to safe. Every destructive tool should default to dry-run mode. The agent will naturally show the user what it plans to do before executing.

  3. Keep tools focused. One tool should do one thing. A secrets_manager_get tool is better than a secrets_manager_manage tool that tries to do everything based on an “action” parameter.

  4. Log everything. Every tool invocation should be logged with its parameters and result. This creates an audit trail and helps debug when the agent makes mistakes.

  5. Start small. Don’t try to automate everything at once. Start with the three most common requests, prove the pattern works, then add more tools over time.

  6. The agent makes mistakes. It will occasionally select the wrong tool or pass wrong parameters. The dry-run defaults and confirmation steps catch most errors before they cause damage.
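
The logging advice (point 4) is easy to enforce mechanically. One option, sketched here with the standard library (a hypothetical wrapper, not part of the strands API), is a decorator applied to every tool:

```python
import functools
import json
import logging

logger = logging.getLogger("agent.audit")

def audited(func):
    """Wrap a tool so every invocation is logged with its parameters
    and result, building the audit trail described above."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logger.info(json.dumps({
            "tool": func.__name__,
            "args": [repr(a) for a in args],
            "kwargs": {k: repr(v) for k, v in kwargs.items()},
            "result": repr(result)[:500],  # truncate large payloads
        }))
        return result
    return wrapper
```

Because functools.wraps preserves the function's name and docstring, the wrapper doesn't interfere with how the tool is presented to the LLM.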

What’s Next#

The natural evolution is to connect the agent to your ticketing system. When a new Jira service desk request comes in, the agent can:

  1. Read the request
  2. Classify it
  3. Execute the standard procedure
  4. Update the ticket with results
  5. Send a notification to the requester

This turns your platform team from a ticket queue into an automated service, freeing up engineers for the work that actually requires human judgment.
