AI crawlers are hammering CMS search — every platform is exposed
On-site and faceted search endpoints are being traversed at scale by AI bots. A platform-agnostic look at rate-limiting and caching.
A pattern has emerged across our managed estate over the past six months: a significant and growing proportion of server load is coming from AI training crawlers hammering on-site search endpoints. This isn't SEO bot traffic hitting cached pages. It's systematic traversal of search URLs — including faceted search with parameter combinations — that bypasses every caching layer and hits the application server directly.
The problem is not unique to any platform. We're seeing it on Drupal, WordPress, Magento and headless front ends alike. Here's what's happening and what you can do about it.
Why search endpoints are the target
AI training crawlers are optimised to find unique content. A site's main pages are likely already in training datasets. Search and faceted browse endpoints generate unique URLs for every parameter combination — platform, price range, date range, category — and each URL returns different content. A faceted product catalogue with ten filter dimensions can generate millions of unique URLs. For a crawler seeking to maximise unique content per request, this is very attractive.
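To make the scale concrete, here is a rough back-of-the-envelope sketch. It assumes single-select filters, each of which can also be left unset; the figure of four values per filter is a hypothetical chosen for illustration, not a measurement from any site we manage.

```python
from math import prod


def facet_url_count(values_per_facet: list[int]) -> int:
    """Count distinct filter combinations for single-select facets.

    Each facet contributes (number of values + 1) choices:
    one per selectable value, plus "filter not applied".
    """
    return prod(n + 1 for n in values_per_facet)


# Hypothetical catalogue: ten facets with four selectable values each.
ten_facets = [4] * 10
print(facet_url_count(ten_facets))  # 9765625
```

Multi-select filters, sort orders and pagination are left out of this count, and each of them multiplies the total further.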
Critically, these endpoints are almost never cached. Search results are dynamic, vary by query, and are typically excluded from Varnish or Redis page caches. Every request is a full application-layer round-trip — often including a database query or Elasticsearch call.
What it looks like in practice
The traffic signature is distinctive: a large number of requests from a small number of IP ranges (known AI crawler ASNs), hitting search and filter URLs with systematic parameter enumeration. The User-Agent strings are often undisguised (GPTBot, ClaudeBot, Bytespider and others), but some crawlers rotate User-Agents or use residential proxy networks, making UA-based blocking unreliable.
The endpoints most often targeted, by platform:
- Drupal: /search/node?keys=... and Views-generated faceted browse URLs
- WordPress: /?s= and WooCommerce product filter query strings
- Magento: /catalogsearch/result/ and layered navigation filter combinations
- Headless front ends: /api/search, /api/products with filter parameters
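One way to look for this signature in your own access logs is sketched below. It assumes the logs have already been parsed into (ip, url) pairs. The path prefixes come from the list above, while the /24 and /48 grouping, the thresholds and the function names are our own illustrative choices rather than a standard.

```python
import ipaddress
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit

# Path prefixes and query keys drawn from the platform list above.
# This is a coarse match; extend it for your own routes and filter parameters.
SEARCH_PATH_PREFIXES = ("/search/", "/catalogsearch/result", "/api/search", "/api/products")
SEARCH_QUERY_KEYS = ("s",)  # WordPress /?s=


def is_search_url(url: str) -> bool:
    """Return True if the URL looks like a search or filter endpoint."""
    parts = urlsplit(url)
    if parts.path.startswith(SEARCH_PATH_PREFIXES):
        return True
    query = parse_qs(parts.query, keep_blank_values=True)
    return any(key in query for key in SEARCH_QUERY_KEYS)


def network_of(ip: str) -> str:
    """Group an address into its /24 (IPv4) or /48 (IPv6) network."""
    prefix = 24 if ipaddress.ip_address(ip).version == 4 else 48
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def enumerating_networks(records, min_requests=100, min_unique_ratio=0.9):
    """Flag networks sending many search requests that are almost all distinct URLs.

    records: iterable of (ip, url) pairs.
    Returns {network: (search_requests, unique_search_urls)}.
    """
    hits = defaultdict(int)
    unique = defaultdict(set)
    for ip, url in records:
        if is_search_url(url):
            net = network_of(ip)
            hits[net] += 1
            unique[net].add(url)
    return {
        net: (count, len(unique[net]))
        for net, count in hits.items()
        if count >= min_requests and len(unique[net]) / count >= min_unique_ratio
    }
```

A high ratio of unique URLs to requests is what separates parameter enumeration from ordinary visitors, who tend to repeat a small set of popular searches. Traffic routed through residential proxies will be spread across many networks and will not show up in this grouping.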
The mitigation stack
There is no single fix. Effective mitigation requires layering several controls:
- robots.txt: Disallow search and filter URL patterns for known AI crawlers. Not all honour it, but reputable operators do.
- Rate limiting at CDN/WAF layer: Apply per-IP rate limits to search endpoints, set well below the limits you allow for ordinary pages. Most CDN providers support this natively.
- Crawl budget capping: Some AI crawler operators offer crawl-rate controls in their respective webmaster tools.
- Canonical URLs: Ensure all faceted URLs include rel=canonical pointing to the base category/search page, reducing their crawl appeal. Treat this as a hint rather than a control: a crawler still has to fetch the page to read it.
- Cache the uncacheable: Where search results are semi-static (e.g., faceted browse without user personalisation), explore short-TTL caching at the CDN level.
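As a starting point for the robots.txt step, the sketch below builds a rule group for the crawlers named in this article and checks it with Python's standard-library parser. The helper names, the example.com URLs and the exact path list are illustrative assumptions, so adjust the paths to your platform. Python's parser does plain prefix matching, so it says nothing about how any given crawler treats wildcards or query-string patterns such as /?s=.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens for the crawlers named in this article.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Bytespider"]
# Illustrative path prefixes; adjust to your platform's search routes.
SEARCH_PATHS = ["/search/", "/catalogsearch/", "/api/search"]


def build_robots_txt(agents, disallowed_paths) -> str:
    """Build one rule group: the listed agents may not fetch the listed prefixes."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check a URL against the rules using the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots = build_robots_txt(AI_CRAWLERS, SEARCH_PATHS)
print(robots)
print(is_allowed(robots, "GPTBot", "https://example.com/search/node?keys=drupal"))
print(is_allowed(robots, "GPTBot", "https://example.com/about"))
```

Static pages stay crawlable under these rules, and agents not listed in the group are unaffected. Merge the output into your existing robots.txt rather than replacing it.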
A word of caution on blanket blocking
Blocking AI crawlers entirely is tempting but has downsides. Some crawlers also feed into AI assistants that cite and link to web content — there is some traffic value to being indexed. The right posture is usually to allow crawling of static content while aggressively rate-limiting or blocking access to expensive dynamic endpoints.
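The posture described above, generous on ordinary pages and strict on search, can be sketched as a per-client token bucket with two budgets. In production this logic normally lives in your CDN or WAF rules; the class name, path prefixes and default rates here are hypothetical values chosen only to show the mechanism.

```python
import time
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float
    updated: float


class EndpointRateLimiter:
    """Per-client token buckets with a stricter budget for search endpoints."""

    def __init__(self, search_prefixes, search_rate=0.5, search_burst=5,
                 page_rate=10.0, page_burst=60, clock=time.monotonic):
        self.search_prefixes = tuple(search_prefixes)
        # (refill rate in tokens per second, maximum burst size) per endpoint kind.
        self.limits = {
            "search": (search_rate, search_burst),
            "page": (page_rate, page_burst),
        }
        self.clock = clock
        self.buckets: dict[tuple[str, str], Bucket] = {}

    def allow(self, client_ip: str, path: str) -> bool:
        kind = "search" if path.startswith(self.search_prefixes) else "page"
        rate, burst = self.limits[kind]
        now = self.clock()
        # A new client starts with a full bucket for this endpoint kind.
        bucket = self.buckets.setdefault((client_ip, kind), Bucket(float(burst), now))
        # Refill in proportion to elapsed time, never exceeding the burst size.
        bucket.tokens = min(burst, bucket.tokens + (now - bucket.updated) * rate)
        bucket.updated = now
        if bucket.tokens >= 1:
            bucket.tokens -= 1
            return True
        return False


limiter = EndpointRateLimiter(["/search/", "/api/search"])
if not limiter.allow("203.0.113.5", "/search/node"):
    print("429 Too Many Requests")
```

Search and page budgets are tracked separately, so a client that exhausts its search allowance can still load ordinary pages. The bucket map grows with every new client, so a real deployment would need to expire idle entries.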
If your search and filter endpoints are showing unexplained load spikes, we can help identify the source and implement rate limiting and caching controls without disrupting legitimate user traffic. This is included in scope for Managed and Enterprise support clients.