Like several other Negative SEO or adverse SEO attacks, Canonical Link Attacks aren’t really about single strikes doing massive damage.
Instead, such approaches are about wasting resources, complicating analysis and dirtying the victim’s data.
In this case, it means causing Search Engine Bots to request, crawl and index a pile of URLs that shouldn't really be accessible or indexed. The result: search engines spend more time on duplicate pages than on your new ones, you have to wade through heaps of superfluous rows in Google Search Console, and you may even see odd ranking fluctuations (impacting traffic and business).
Breakdown of a Canonical Link Attack
Technical SEO is often downplayed. Links are the sexy side of SEO, and get all the glory.
Content is King, and Google loves to push the idea that quality content is what you should focus on.
But on-site SEO isn't just about laying strong foundations, or about maximizing the gains from off-site and on-page SEO.
It can also help you avoid several nasty surprises, and even offer some protection from certain Negative SEO attacks, such as Canonical Link Attacks.
This NSEO vector relies on your site not having proper canonicalisation implemented, leaving it open to internal duplication issues.
The worse your site's canonical handling, the more URIs can be "created", and the more turbulence and data pollution you'll face if/when a competitor decides to strike at you.
People tend not to grasp how many variants of a URI there can be.
For example:
- http://www.example.com/somepage
- https://www.example.com/somepage
- http://example.com/somepage
- https://example.com/somepage
- http://www.example.com/SomePage
- https://www.example.com/SomePage
- http://example.com/SomePage
- https://example.com/SomePage
- http://www.example.com/somepage?parameter=value
- https://www.example.com/somepage?parameter=value
- http://example.com/somepage?parameter=value
- https://example.com/somepage?parameter=value
- http://www.example.com/SomePage?parameter=value
- https://www.example.com/SomePage?parameter=value
- http://example.com/SomePage?parameter=value
- https://example.com/SomePage?parameter=value
That’s one single page, accessible from fifteen additional URLs.
And to a Search Engine you’ve got sixteen different pages, with the same content.
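If it helps to see the combinatorics, here's a rough Python sketch (using the example.com placeholders from the list above) of how each independent axis of variation multiplies the URI count:

```python
from itertools import product

# Each independent axis of variation multiplies the number of URIs
# that can all serve the exact same content.
schemes = ["http", "https"]                    # protocol
hosts   = ["example.com", "www.example.com"]   # subdomain
paths   = ["/somepage", "/SomePage"]           # letter case
queries = ["", "?parameter=value"]             # optional query string

variants = [s + "://" + h + p + q
            for s, h, p, q in product(schemes, hosts, paths, queries)]

print(len(variants))   # 16: the "real" page plus fifteen duplicates
for url in variants:
    print(url)
```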
The upshot is that you may face:
- (a) Links going to different URIs (instead of all links pointing to a single page)
- (b) Google potentially ranking a weaker variant (lower rankings)
- (c) Google crawling/ranking the variants
From an SEO perspective, (a) and (b) generally get the attention,
which is understandable (links and rankings are kind of important to SEOs).
(c) gets some attention too, but it's usually mislabelled as a "crawl budget" problem.
In most cases, people don't need to worry about whether Google has enough time/resources to crawl their site (few sites are big enough for it to be an issue).
But when every page suddenly exists at fifteen extra URLs, things can get messy.
Your 500-page site, which usually gets a hundred or so pages crawled at a time, suddenly has 8,000 URLs!
Google has to try to figure out which of those it's supposed to crawl, and will spend some of its time on what it thinks are 7,500 new pages (as it hasn't crawled or indexed them before!).
Now imagine trying to launch a new product, or push out content for the start of sales season, or announce offers, and Googlebot not having a clue what it should be crawling.
Then there's the joy of digging through Google Search Console and all the additional URIs (not to mention the skew on Impressions and CTR).
And it’s not just the Protocol (http/s), or Subdomain (www or not), or multiple domains (.com and .net, hyphenated and non-hyphenated etc.).
Parameters and Values (query strings) can also be used:
- /somepage?param1=aaa
- /somepage?param2=bbb
- /somepage?Param1=aaa
- /somepage?param1=aaa&param2=bbb
- /somepage?Param1=aaa&Param2=bbb
- /somepage?param2=bbb&param1=aaa
- /somepage?Param2=bbb&Param1=aaa
So your 500 pages may not stop at 8,000, but could easily be made into 16,000 or more!
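Taming the query-string side largely comes down to normalisation, which we'll come back to in the defence section below. As a rough, minimal sketch of the idea (the parameter names are just the placeholders from the list above, and the allow-list is hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

ALLOWED_PARAMS = {"param1", "param2"}   # hypothetical allow-list of parameters you actually use

def canonical_query(url: str) -> str:
    """Collapse query-string variants: lowercase parameter names,
    drop anything not on the allow-list, then sort what remains."""
    parts = urlsplit(url)
    pairs = [(k.lower(), v) for k, v in parse_qsl(parts.query, keep_blank_values=True)]
    pairs = sorted(p for p in pairs if p[0] in ALLOWED_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))

# Order and case variants all collapse to one canonical URI:
print(canonical_query("https://example.com/somepage?param2=bbb&param1=aaa"))
print(canonical_query("https://example.com/somepage?Param1=aaa&Param2=bbb"))
# both print: https://example.com/somepage?param1=aaa&param2=bbb
```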
Then there are the risks of infinite crawl spaces, which often occur with Site Search and Pagination systems.
Pagination requests beyond your real page count may return 200 responses (i.e. "this page exists and is ok"). These may get crawled and indexed, and if the pagination links are generated automatically,
bots will see "links" to further pages and add them to the crawl queue (one canonical link attack can automatically spawn multiple others!).
Some sites also allow search URIs to be crawled/indexed; a simple dictionary-style attack can lead to tens or hundreds of thousands of URIs being requested!
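The simplest way to close that particular door is to stop out-of-range pages looking "ok" in the first place. A minimal sketch, assuming a hypothetical paginated listing with a fixed page size:

```python
import math

PER_PAGE = 20   # hypothetical page size

def page_status(requested_page: int, total_items: int) -> int:
    """Return the HTTP status for a pagination request: 404 for pages
    beyond the real range, rather than an empty but '200 ok' page."""
    last_page = max(1, math.ceil(total_items / PER_PAGE))
    if requested_page < 1 or requested_page > last_page:
        return 404   # nothing here: don't give bots a reason to keep crawling
    return 200       # a genuine page

print(page_status(3, 45))      # 200 - pages 1 to 3 exist for 45 items
print(page_status(9999, 45))   # 404 - kills the infinite crawl space
```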
Obviously, we're not going to lay out how to actually run such a Canonical Link Negative SEO campaign.
Instead, we're hoping this information is enough for you to see where you may be vulnerable,
and how you can protect yourself against such an attack.
How to Defend Your Website Against a Canonical Link Attack
Sadly, there is no way to stop people from making weird links to your site, or from pointing search engine bots at sitemaps full of junk URLs.
Instead, you are forced to rely on handling the bad requests once they arrive.
Deploying proper canonicalisation across your site will go a long way towards reducing the impact
after the initial requests have been made.
It doesn't matter whether you use the canonical link element in the page head, a canonical Link response header, or canonicals set via your sitemaps; just make sure proper canonicalisation is deployed, at the very least.
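For reference, both "soft" formats are trivial to generate; a quick sketch (the URL is just the placeholder from the examples above):

```python
def canonical_link_tag(canonical_url: str) -> str:
    """Canonical link element for the page <head>."""
    return '<link rel="canonical" href="' + canonical_url + '" />'

def canonical_link_header(canonical_url: str) -> str:
    """Equivalent HTTP response header (handy for non-HTML resources such as PDFs)."""
    return 'Link: <' + canonical_url + '>; rel="canonical"'

print(canonical_link_tag("https://www.example.com/somepage"))
# <link rel="canonical" href="https://www.example.com/somepage" />
print(canonical_link_header("https://www.example.com/somepage"))
# Link: <https://www.example.com/somepage>; rel="canonical"
```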
Using "hard" canonicalisation methods (such as redirects) and rejecting certain types of request outright can be more effective (but hardly anyone goes that far!).
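As a rough illustration of the "hard" approach, here's a minimal sketch that forces one scheme and host, lower-cases the path, and 301s any request that doesn't match the canonical form (the host is a placeholder, and lower-casing paths only makes sense if your URLs really are case-insensitive):

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_SCHEME = "https"
CANONICAL_HOST   = "www.example.com"   # placeholder: whichever host you've standardised on

def hard_canonical(url: str):
    """Return (status, location): 301 to the canonical form when the
    requested URL differs from it, otherwise 200 to serve the page."""
    parts = urlsplit(url)
    canonical = urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST,
                            parts.path.lower(),   # only if paths are genuinely case-insensitive
                            parts.query, ""))
    if url != canonical:
        return 301, canonical   # redirect the variant to the one true URI
    return 200, url

print(hard_canonical("http://example.com/SomePage"))
# (301, 'https://www.example.com/somepage')
```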
If you’re not sure whether you’re exposed, or want more advanced ways of handling this sort of attack – you can reach out to us.