How Our AI Agent Squad Slayed Duplicate Content and Mastered End‑to‑End Traceability

In our fast‑paced, content‑driven world, staying ahead means automating as much as possible—without sacrificing quality. Every 30 minutes, our AI “agent squad” springs into action, long before most of us have even taken that first sip of coffee. This post walks through how we built an autonomous, keyword‑aware pipeline that pulls from hundreds of RSS feeds, how we conquered the menace of duplicate content, and how we gave every article a unique UUID so we can trace it from start to finish.


The Agentic Orchestra at Dawn

Our pipeline is triggered by a simple cron job every 30 minutes. From that heartbeat, a chain of specialized agents takes over (a rough sketch of the wiring follows the list):

  • Fetch Agent:
    Monitors hundreds of RSS feeds, looking specifically for our target keywords. As soon as a story matches, it grabs the headline, body, and any associated metadata.

    This agent ensures we never miss breaking news or niche updates—no manual intervention required.

  • Clean Agent:
    Raw HTML can be riddled with inline styles, malformed tags, or unwanted scripts. Our cleaner strips out everything but the essentials, producing a standardized document that downstream agents can rely on.

    Think of it as our digital housekeeping team.

  • Draft Agent:
    Feeds the sanitized text into our LLM, instructing it to rewrite the piece in our brand voice. What was once a dry feed entry emerges as a polished, magazine‑style article—complete with engaging intros and natural transitions.

    This step transforms raw data into reader‑friendly storytelling without any hand‑editing.

  • Audit Agent:
    Runs an SEO checklist: keyword density, H1/H2 structure, meta descriptions, alt text for images, and more. It flags any missing or weak elements so we publish with confidence.

    Every post rolls out fully compliant with our SEO guidelines.

  • Image Agent:
    Dynamically generates charts, diagrams, or simple illustrations based on article content. Whether it’s a performance graph or a concept diagram, each image aligns perfectly with the text.

    Visual consistency at scale—100% automated.

  • Publish Agent:
    Packages the final draft, images, tags, categories, and publish timestamp, then drops it into WordPress via the REST API. Posts go live on schedule, with zero manual steps.

    Our calendar stays full, our team stays focused.
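
Wired together, that heartbeat and the hand-offs between agents look roughly like the sketch below. It is a minimal sketch, assuming node-cron for scheduling; the agent modules and the WordPress endpoint are illustrative stand-ins rather than our exact implementation.

```typescript
// pipeline.ts: sketch of the 30-minute heartbeat and the agent hand-offs.
// The ./agents/* helpers and the WordPress URL are hypothetical placeholders.
import cron from 'node-cron';
import { fetchMatchingArticles } from './agents/fetch';
import { cleanHtml } from './agents/clean';
import { draftInBrandVoice } from './agents/draft';
import { auditSeo } from './agents/audit';
import { generateImages } from './agents/image';

cron.schedule('*/30 * * * *', async () => {
  // Fetch Agent: keyword-matching stories from the RSS watchlist.
  const stories = await fetchMatchingArticles();

  for (const story of stories) {
    // Clean Agent: strip scripts, inline styles, and malformed markup.
    const cleaned = cleanHtml(story.html);

    // Draft Agent: rewrite the sanitized text in our brand voice.
    const draft = await draftInBrandVoice(cleaned, story.title);

    // Audit Agent: run the SEO checklist; hold back anything that fails it.
    const audit = await auditSeo(draft);
    if (!audit.passed) continue;

    // Image Agent: generate charts or illustrations keyed to the content.
    const images = await generateImages(draft);

    // Publish Agent: package everything and post via the WordPress REST API.
    await fetch('https://example.com/wp-json/wp/v2/posts', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // WP application-password credentials, pre-encoded as base64(user:password)
        Authorization: `Basic ${process.env.WP_AUTH}`,
      },
      body: JSON.stringify({
        title: draft.title,
        content: draft.html,
        status: 'publish',
        featured_media: images[0]?.mediaId,
      }),
    });
  }
});
```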

On paper, this six‑stage factory ran like a dream. In reality, a silent saboteur lurked…


The Duplicate‑Content Dragon

Stage 1: Fingerprint Fumble

We began with a classic: hashing the full article text and comparing digests. Exact duplicates were caught instantly—but near‑duplicates, where one site rewrote a paragraph or shuffled sentences, slipped right through. We discovered this the hard way when our SEO dashboard signaled penalties and our embedding costs skyrocketed as we processed the same story twice.

  • Wasted tokens: Every duplicate triggered fresh LLM calls and image generation.
  • CPU cycles: Downstream agents reran identical content.
  • SEO damage: Google penalties for thin or duplicate content.
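
The exact-match pass itself was only a handful of lines, which is part of why it felt so safe. Here is a minimal sketch of the idea (the real version persisted digests in a shared datastore rather than an in-memory set):

```typescript
// Stage 1 in miniature: hash the normalized article text and skip repeats.
import { createHash } from 'crypto';

const seenDigests = new Set<string>(); // in production this lived in a shared store

function isExactDuplicate(articleText: string): boolean {
  // Collapse whitespace and lowercase so trivial formatting changes don't matter.
  const normalized = articleText.replace(/\s+/g, ' ').trim().toLowerCase();
  const digest = createHash('sha256').update(normalized).digest('hex');

  if (seenDigests.has(digest)) return true;
  seenDigests.add(digest);
  return false;
}
```

Change a single sentence, though, and the digest changes completely, so a lightly rewritten copy sails straight through.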

Stage 2: The Semantic Swoop

One late night, fueled by espresso and determination, we reimagined our approach. Instead of hashing the entire text, we distilled each article to its essence:

  1. Summarize Agent:
    Takes ~1,500 tokens of draft text and produces a 200‑word synopsis (≈300 tokens).

    This step cuts noise and focuses on core meaning.

  2. Embed Agent:
    Feeds the synopsis into text-embedding-ada-002, yielding a 1,536‑dimensional vector—a semantic fingerprint, not just a text hash.
  3. Vector Vault (Qdrant):
    Stores these vectors in sharded indexes by topic. With Rust‑powered speed, native cosine similarity, and a 30‑day TTL, Qdrant automatically evicts stale entries.

Now, on fetch, the pipeline asks: “Do I look ≥85% similar to any recent vector?” If yes, the article is flagged and we skip all downstream steps.
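
In code, the check that now runs on every fetch looks roughly like this. It is a sketch that assumes the official openai and @qdrant/js-client-rest Node clients; the collection name, the threshold constant, and the summarizeAgent helper are illustrative rather than our exact implementation.

```typescript
// Semantic dedup sketch: synopsis -> embedding -> Qdrant similarity check.
import OpenAI from 'openai';
import { QdrantClient } from '@qdrant/js-client-rest';
import { randomUUID } from 'crypto';

// Hypothetical stand-in for the Summarize Agent's LLM call (~200-word synopsis).
declare function summarizeAgent(text: string): Promise<string>;

const openai = new OpenAI();
const qdrant = new QdrantClient({ url: process.env.QDRANT_URL ?? 'http://localhost:6333' });
const COLLECTION = 'article-fingerprints'; // illustrative collection name
const SIMILARITY_THRESHOLD = 0.85;         // the "≥85% similar" cutoff

async function isSemanticDuplicate(articleText: string): Promise<boolean> {
  // Summarize Agent: distill the article to its essence before embedding.
  const synopsis = await summarizeAgent(articleText);

  // Embed Agent: 1,536-dimensional fingerprint from text-embedding-ada-002.
  const response = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: synopsis,
  });
  const vector = response.data[0].embedding;

  // Vector Vault: is any recent fingerprint at or above the cosine threshold?
  const hits = await qdrant.search(COLLECTION, {
    vector,
    limit: 1,
    score_threshold: SIMILARITY_THRESHOLD,
  });
  if (hits.length > 0) return true; // near-duplicate: skip all downstream agents

  // New story: store its fingerprint, timestamped for the 30-day eviction policy.
  await qdrant.upsert(COLLECTION, {
    points: [{ id: randomUUID(), vector, payload: { fetchedAt: Date.now() } }],
  });
  return false;
}
```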

  • Accuracy: 98% of near‑duplicates caught
  • Latency: <120 ms per similarity check
  • Cost savings: 70–80% reduction in embedding spend
  • Throughput: 600 semantic checks per minute

By distilling meaning before matching, we turned duplicate detection into a lean, reliable process.


The Trace‑ID Tether

With duplicates slain, we faced a new challenge: tracing an article’s journey across the six‑stage pipeline. Log files were siloed, and a single error meant spelunking through multiple systems.

  1. UUID Minting: At the very first fetch job, we generate a standard UUID—our article’s mission badge—ensuring uniqueness across every run and service.

    This ID becomes the single source of truth for that story’s path.

  2. Automatic Propagation: A lightweight wrapper around our job orchestration embeds the UUID into each task’s payload. When the Fetch Agent passes data to the Clean Agent, the same UUID tags along.
  3. Logging with Context: We configured nestjs-pino to intercept every job execution. It extracts the UUID and stamps every log entry—info, warnings, errors—with that ID.
  4. Metrics & Alerts: In Prometheus, every metric (job duration, error count, success rate) is labeled with the UUID. We set up alerts for any stage exceeding 30 seconds; the alert payload includes a link keyed to the UUID so we can click straight into detailed logs and dashboards.
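
Here is a minimal sketch of how the badge travels and shows up everywhere, using pino child loggers and prom-client directly; our production setup routes this through nestjs-pino and the job orchestration wrapper described above, and the stage names here are illustrative.

```typescript
// Trace-ID sketch: one UUID per article, stamped on payloads, logs, and metrics.
import { randomUUID } from 'crypto';
import pino from 'pino';
import { Histogram } from 'prom-client';

const logger = pino();
const stageDuration = new Histogram({
  name: 'pipeline_stage_duration_seconds',
  help: 'Wall-clock time per pipeline stage',
  labelNames: ['stage', 'article_id'], // labeled with the trace UUID, as described above
});

interface JobPayload {
  articleId: string; // minted once by the Fetch Agent, then passed along verbatim
  data: unknown;
}

// Wrapper the orchestration layer applies to every stage, so no individual
// agent has to remember to forward the ID or tag its own logs and metrics.
async function runStage<T>(stage: string, payload: JobPayload, work: () => Promise<T>): Promise<T> {
  const log = logger.child({ articleId: payload.articleId, stage });
  const endTimer = stageDuration.startTimer({ stage, article_id: payload.articleId });
  log.info('stage started');
  try {
    const result = await work();
    log.info('stage finished');
    return result;
  } catch (err) {
    log.error({ err }, 'stage failed'); // alerting keys off these labeled failures
    throw err;
  } finally {
    endTimer();
  }
}

// At the very first fetch job we mint the badge; every later stage reuses it.
const payload: JobPayload = { articleId: randomUUID(), data: null };
```

Because the wrapper owns both the child logger and the metric labels, no individual agent has to think about tracing at all.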

Why it matters:

  • Instant Visibility: One UUID, one timeline—no more guessing where the pipeline hiccupped.
  • Rapid Resolution: Debugging that used to take hours now wraps up in minutes.
  • Cross‑Team Alignment: Engineers, SREs, and PMs all reference the same ID, making collaboration seamless.

From Challenge to Champion

By blending semantic embeddings with a robust trace‑ID system, we transformed our pipeline from a black box into a transparent, cost‑efficient engine:

  • Confidence Boost: Our team trusts that duplicates won’t slip through, so they can focus on new features and innovation.
  • Shared Ownership: With every agent speaking the same UUID, accountability spans the entire workflow.
  • Motivation Surge: Celebrating real‑time wins—98% fewer duplicates, dramatic cost savings, lightning‑fast debugging—has become our new morning ritual.

Our AI agentic pipeline isn’t just automation; it’s a testament to what happens when human creativity meets thoughtful engineering. And every 30 minutes, when the cron fires again, we’re reminded that our creation is humming along—always vigilant, always improving, and always hungry for the next challenge.