Duplicate Content

Tackling Technical Causes of Duplicate Content

Perhaps the biggest and most common on-page SEO error that I see across the web has to be Duplicate Content. The sad thing is that this problem is common across all kinds of websites, from small-time bloggers to large media websites and (perhaps most unforgivable of all) SEO agencies.

Duplicate Content is also probably the single SEO topic with the most confusion and, for lack of a better word, bullshit written about it.
The truth is that Duplicate Content is actually quite a simple concept. In fact, there is pretty much one decisive rule that encompasses the whole idea of Duplicate Content:

One Piece of Content – One URL – No Exceptions

What this means is simply that you should provide Google with one unique URL for each of your pieces of content.
For example, this article is available under the URL “http://www.danclarkie.co.uk/duplicate-content.html”.

One Piece of Content – One URL.

Perhaps the biggest myth/misunderstanding surrounding duplicate content is that another website copying your content and republishing it is a Duplicate Content issue. It isn’t.

Duplicate Content refers specifically to content duplicated on other URLs on your own website.

Other websites copying or scraping your content is a different issue and thankfully Google is now pretty good at working out which website is the true origin of the content and ranking them accordingly.

Why is Duplicate Content a problem?

We have to remember, no matter how much we might sometimes like to think otherwise, that Google is just a dumb computer. The GoogleBot crawls the web, reads each page’s content and then stores it in a big database.
OK, there is more to it than that, but essentially that is it. So when the GoogleBot comes across the same content under different URLs, it doesn’t know which one to present to people in the search results.
Equally, you might be tempted to copy the same (or very similar) content onto many different pages of your website in a (misguided) effort to increase your relevancy for your chosen keywords.
Also, if you think about it from a non-Google perspective: if you were building a website, there are very few legitimate reasons to put the same content on several URLs.

All of these reasons together add up to one conclusion: Duplicate Content is bad.

Dealing With Duplicate Content Issues
So once you have identified a Duplicate Content issue, how can you go about fixing it?
There are basically two main ways to fix Duplicate Content problems: either redirect duplicate URLs to the main source, or canonicalise your content to signal the main source.
URL Redirects

Anyone visiting "PAGE1.HTML" or "PAGE2.HTML" or "PAGE3.HTML" is redirected to "PAGE.HTML"


If you have the same content on several URLs then by far the best way to fix this is to set up URL redirects sending all of the duplicate URLs to one single URL.

For example, say we have a page called “PAGE.HTML” which we want people to visit, but the same content is also available under “PAGE1.HTML”, “PAGE2.HTML”, and “PAGE3.HTML”.

We can set up URL redirects to send anyone visiting “PAGE1.HTML”, “PAGE2.HTML”, or “PAGE3.HTML” to “PAGE.HTML”.

One thing to remember here is to be careful to use a 301 redirect, not a 302 redirect.

301 redirects tell Google that the content has moved and will now permanently be available under the new URL, so the new URL should be indexed and served in the SERPs.

302 redirects tell Google that the content has only moved temporarily, so the old URL stays in the index, owing to the “temporary” nature of the redirect.
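In .htaccess terms the difference is just the status code. A minimal sketch, assuming an Apache server with mod_alias, and with made-up page names:

# permanent: the new URL should be indexed in place of the old one
Redirect 301 /old-page.html http://www.danclarkie.co.uk/page.html
# temporary: the old URL stays in Google's index
Redirect 302 /holding-page.html http://www.danclarkie.co.uk/page.html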

If you are running on an Apache server, then the best way to do URL redirects is in the .htaccess file. The .htaccess file lies at the root of your website and tells the Apache server how to handle URL requests.
Most modern blogging CMSs include a .htaccess file with the standard install.

A simple 301 redirect in the .htaccess file looks like this:

# "Redirect" comes from Apache's mod_alias module, so no
# mod_rewrite directives are needed alongside it
Redirect 301 /page1.html http://www.danclarkie.co.uk/page.html

In this example we have page.html with content that is duplicated on page1.html. So we redirect anyone visiting danclarkie.co.uk/page1.html to danclarkie.co.uk/page.html.

This applies to real human visitors as well as search robots, so page1.html ceases to be available and the content is only accessible under one URL.

You can redirect any URL on your domain to any internal or external URL in the .htaccess file. A good example of this: since Google+ currently doesn’t support “pretty URLs”, a lot of tech-savvy webmasters set up “danclarkie.co.uk/+” to redirect to their Google+ profile.
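As a sketch, such a vanity redirect is a one-liner in the .htaccess file (the target profile URL here is a made-up placeholder, not a real profile):

# hypothetical vanity URL pointing at a Google+ profile
Redirect 301 /+ https://plus.google.com/your-profile-id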
Canonical Tags

By designating one URL using the Canonical Tags we tell Google "This one is 'The Daddy' "

The second option for dealing with pesky duplicated content, when redirecting URLs is not possible or wanted, is to canonicalise the “daddy” URL as the main source of the duplicated content.

This is the best option where redirecting URLs isn’t possible, for example if each item in your webstore needs a unique URL for sales tracking but a bunch of items have identical descriptions (ignoring that this is a sloppy, bad URL structure in the first place!).

If you allow all of the pages to be indexed by the GoogleBot then you have a Duplicate Content issue.
If you delete or redirect the duplicated pages then you will make fewer sales and offer fewer products.

Canonical Tags to the rescue!
First we need to choose one page as “The Daddy”; all other pages that show the same content can then be mapped back to it. For example: in the diagram we set “PAGE.HTML” as “The Daddy” and canonicalise all of the duplicate pages back to that URL.
It is also important to notice that we place a self-referencing canonical tag on “PAGE.HTML” itself; there is some discussion as to whether or not this is needed, but in my opinion it is certainly best practice.

Setting canonical tags is easy; just add the following in the <head> section of your page’s code.

<link rel='canonical' href='http://www.danclarkie.co.uk/page.html' />

The GoogleBot will read the canonical tag and know that the page’s contents belong to that URL, simple.

One of the worst things you can do is blindly generate the canonical tags dynamically.
Doing that sets every page’s canonical tag to its own current URL, telling the GoogleBot “this page is the Daddy”. Doing this on multiple pages with duplicate content is a sure-fire way to look stupid and fall out of favour with the GoogleBot, so don’t do it.

Typical Duplicate Content Errors

Index Files

By far the most common Duplicate Content error that I see all over the internet is the index file problem.
This is where you have your website set up in the usual manner, with an index file resting at the root level of your website.
For example www.example.com/index.html where the websites main page is the index.html file.
This is typically how all websites work: when you visit bbc.co.uk, your browser loads the BBC website by requesting the index file from its root level.
This stems from way back in the early days of the internet, when the index file would literally be an index of the website’s contents and would look something like this: http://…/debian/index.html.
If you take away the index.html part of that URL you still load the same contents http://…/debian/
As you can see, this example shows why the index file problem exists.
If a “folder” URL is requested, the default behaviour is to serve the index.html file contained in that folder, but the index file can also be requested directly.

Same Content – Two URLs – a classic Duplicate Content problem

This is a classic case of Duplicate Content that can be fixed with a URL redirect.
Simply redirect the index file to the root domain and the problem is fixed! (For example, http://www.danclarkie.co.uk/index.php redirects to the root of this site.)
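If your site runs on Apache, a minimal .htaccess sketch for this fix looks something like the following (the THE_REQUEST condition makes sure we only act on the URL the visitor actually typed, not on Apache’s own internal index lookups, which would otherwise cause a redirect loop):

RewriteEngine On
RewriteBase /
# only match explicit requests for the index file
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.(html|php)
RewriteRule ^index\.(html|php)$ / [R=301,L]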

This is a typical example that you can find all over the internet, and it isn’t just limited to small blogs; you can find it on bigger, more serious websites.
For example, http://www.thelocal.de and http://www.thelocal.de/index.php.
The problem would be less serious with appropriate canonical tags, but there are none.

Sub Domains

Another classic cause of Duplicate Content is poorly handled subdomains, most commonly “www.” versus the root-level domain.

For a good example we can look to my charmingly idyllic hometown and its town centre website:
http://www.mansfieldtowncentre.co.uk & http://mansfieldtowncentre.co.uk.
In this case the internal links are all relative, so the entire website is duplicated, making each page accessible under both a www. and a non-www. version.

Also, the main page suffers from the previously mentioned “index file” problem: http://www.mansfieldtowncentre.co.uk/default.aspx.
In this case we have the main page of the website accessible under 4 different URLs, with no Canonical Tag in sight:

  • http://www.mansfieldtowncentre.co.uk
  • http://mansfieldtowncentre.co.uk
  • http://www.mansfieldtowncentre.co.uk/default.aspx
  • http://mansfieldtowncentre.co.uk/default.aspx
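If a site like this were running on an Apache server, the www/non-www half of the problem could be fixed with a couple of lines in the .htaccess file (a sketch only; swap in your own hostname and pick whichever version you want as the master):

RewriteEngine On
# send all non-www requests to the www. version
RewriteCond %{HTTP_HOST} ^mansfieldtowncentre\.co\.uk$ [NC]
RewriteRule ^(.*)$ http://www.mansfieldtowncentre.co.uk/$1 [R=301,L]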

Worse still is when a website really mishandles its sub-domains and just shows the root domain’s contents regardless of the sub-domain.
Probably the worst example I have seen of this comes from Google itself:

This quite surprised me. The next logical step was to try those domains with “index.html” at the end…

Obviously Google are probably all like “We’re Google, we do not give one single fuck”. Indeed, I tweeted Matt Cutts to ask him about this, but he was too busy to respond.


Trailing Slash

The third and final classic example of duplicate content is what’s known as the “trailing slash” problem. This is where the same content is accessible by visiting both “example.com/hello” and “example.com/hello/”.
The difference in the URLs is subtle, but it still breaks the golden rule of Duplicate Content.

This is a problem that can easily be fixed in the .htaccess file by redirecting all URLs without a slash to URLs with a slash, or vice versa.
To redirect all URLs with a trailing slash to URLs without one, your .htaccess file should include something like the following:

RewriteEngine On
RewriteBase /
# real directories need their trailing slash, so leave them alone
RewriteCond %{REQUEST_FILENAME} !-d
# capture the path without the trailing slash and redirect to it
RewriteRule ^(.*)/$ http://domain.com/$1 [L,R=301]
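To go in the opposite direction and append a slash where one is missing, a similar sketch works (the extra condition leaves real files such as images and .html pages alone):

RewriteEngine On
RewriteBase /
# don't append a slash to real files
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*[^/])$ http://domain.com/$1/ [L,R=301]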

That’s it!

Just remember “one piece of content – one URL”: either redirect or canonicalise duplicate pages to a master URL and you can avoid a duplicate content slap from Google.
Setting appropriate canonical tags is always a great idea; even if you are not aware of any URLs causing a duplicate content issue, setting canonical tags for all of your content keeps you safe in the event that you accidentally start producing duplicate URLs.
It’s also just a good habit to get into, and good SEO!