Can the Googlebot Read JavaScript, Ajax & Cookies?

An Experiment into Google's Handling of Client-Side Scripting

Having recently read the fascinating post from iPullRank over at SEOmoz (Just How Smart Are Search Robots?), I really started to question everything we think we know about the Googlebot.
To make matters worse, I was set to teach a session on “Understanding the Googlebot” the very next day to my team of 40 or so interns.

Bollocks, I thought.

Suddenly the idea that Google can only see the source code of a page, and that any client-side trickery happens right under their dumb noses, was no longer true. To be honest, SEOs have been lazy, or stupid, or both, to think for so long that the Googlebot was some simple bot that just crawled over pages chewing up the output source code. We are talking about the biggest internet company in the world; did we seriously think they were resting on their laurels, happy with the massive failings of the classic Google crawler? I could write a simple crawling robot, and I did of sorts (Link Checker), and I’m pretty sure that the people who write the code behind Google are a little bit better at it than I am :D…

If you look at it, all the signs were there: back in 2009 Google put forward a plan for how to crawl Ajax content, and more recently the interwebs was abuzz with the whisper that Google was starting to index Facebook comments, which are hauled in over an Ajax request. More rumours floated around of Flash pages being indexed and ranking for their content…

You Mad, Bro?

Something strange was afoot, no mistaking.

So, it’s no big deal, right? This only changes… everything.

With this raft of new rumours, I really wanted some real-world evidence of exactly what the Googlebot is able to read and what it isn’t.
So in this post I will use a few of the classic techniques that have traditionally been used to obfuscate text from the Googlebot.

To do this in a pragmatic way we need to take a string of text that can’t be found anywhere else on the web. That way, when we search for that string and find this page, we can safely say that Google has read and understood the text, rather than returning this page as a result owing to external factors.

Secondly we need to use a unique string of text for each different method of obfuscation in order to know, specifically, which methods are being read and which aren’t.

The text strings I will use are inspired by one of my favourite songs: The Age of the Understatement.
I really love this song/album for its fantastic use of the English language, so we will use some of the more poetic lines as our test strings, with the letters all shifted by one place (a = b, b = c, etc.).
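
Just to make the encoding concrete, here’s a throwaway PHP sketch of the shift (shift_one is my own name for it, and the actual test strings aren’t shown shifted here, so as not to contaminate the experiment):

<?php
// shift every letter one place along the alphabet (z wraps round to a),
// leaving spaces and punctuation alone
function shift_one($text) {
	return preg_replace_callback('/[a-z]/i', function($m) {
		$c = $m[0];
		if($c === 'z') return 'a';
		if($c === 'Z') return 'A';
		return chr(ord($c) + 1);
	}, $text);
}

echo shift_one('hello world'); // ifmmp xpsme
?>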

Test 1 – Simple JavaScript
Relentless Marauder
Here we will use a really simple bit of JavaScript to output a string of text.

<script type="text/javascript">
  // write the shifted test string straight into the page
  // ("The Secret Keyword" is a placeholder so the real string isn't indexed from this write-up)
  document.write("The Secret Keyword");
</script>

so… Relentless Marauder becomes…


Hurrah!

It’s highly likely that the Googlebot will pick up on this text though, as it is inline JS and the string is sitting right there in the source code…

Test 2 – jQuery AJAX request
Endearingly Bedraggled
Here we are going to use a little bit of jQuery to pull in the contents of an external file with an AJAX request.
The contents of the external file will be pulled into the page and shown in the <p> tag.

Traditional thinking here is that the Googlebot WILL NOT be able to see the text, as it is rendered client-side and is not visible in the source code…

However, the address of the file is visible in the source code. The Googlebot could simply send out a spider to that file, read the contents, then do some magic to allocate that content to the assigned <p> tag (there’s a toy sketch of this idea just after the test below).

<p id="test2">Endearingly Bedraggled</p>
<script>
  $("#test2").load("/gbottest2.html");
</script>

so… Endearingly Bedraggled becomes…

Endearingly Bedraggled
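
As an aside, here is a purely speculative toy sketch of the stitching idea mentioned above: find the .load() target in a page’s source, fetch that file, and splice the response into the matching element. The regex and the whole approach are made up for illustration; this is obviously not how Google actually does it.

<?php
// toy crawler: spot a jQuery .load() call, fetch its target,
// and splice the fetched contents into the matching element
$page = file_get_contents('http://www.danclarkie.co.uk/can-the-googlebot-read-javascript-ajax-cookies.html');

if(preg_match('/\$\("#(\w+)"\)\.load\("([^"]+)"\)/', $page, $m)){
	list(, $id, $url) = $m;
	$fragment = file_get_contents('http://www.danclarkie.co.uk' . $url);
	// crude splice: swap the element's current contents for the fetched fragment
	$page = preg_replace('/(<p id="' . $id . '">).*?(<\/p>)/s', '$1' . $fragment . '$2', $page);
}
echo $page;
?>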

Test 3 – jQuery AJAX request with Referrer Filtering
Affection to Rent
In this test we will use the same AJAX request to pull in the data from an external file, but the key difference is that this time we will run some referrer filtering on the external file, so that the correct output will only be shown if it is called into this page. Visiting the file directly will output the message
“no direct access, go awai”.

Hopefully this can counteract the idea outlined above, that the Googlebot could spider out to the external file, read the contents, then jigsaw them back into place on the page making the AJAX call…

The jQuery remains almost exactly the same with just the target file name changing.

<p id="test3">Affection to Rent</p>
<script>
  $("#test3").load("/gbottest3.php");
</script>

But rather than just being an HTML file, the target is a PHP file containing the following filtering:

<?php
// only serve the real content when the request claims to come from the test page;
// direct visits (no referrer, or the wrong one) get fobbed off
if(!isset($_SERVER['HTTP_REFERER']) || $_SERVER['HTTP_REFERER'] != "http://www.danclarkie.co.uk/can-the-googlebot-read-javascript-ajax-cookies.html"){
	echo "no direct access, go awai";
}
else{
?>
<p><mark>The Secret Keyword!</mark></p>
<?php
}
?>
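
Worth noting: the Referer header is supplied by the client, so any crawler that chose to send a matching one would sail straight through this filter. Here’s a quick, hypothetical way to check the filter yourself from PHP:

<?php
// fetch gbottest3.php while pretending we came from the blog post
// (a sketch for testing the filter; any HTTP client that can set headers would do)
$context = stream_context_create(array(
	'http' => array('header' => "Referer: http://www.danclarkie.co.uk/can-the-googlebot-read-javascript-ajax-cookies.html\r\n")
));
echo file_get_contents('http://www.danclarkie.co.uk/gbottest3.php', false, $context);
?>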

so… Affection to Rent becomes…

Affection to Rent

Test 4 – Cookies
Attraction Ferments
In this final test we will set a cookie using PHP that contains the keyword, and then call the keyword from the cookie…

Again, traditional thinking has been that in this case the Googlebot WILL NOT be able to see the text, as storing cookies is something only a browser does. And given that all of this code is server-side, it will never be visible to the Googlebot unless it accepts the cookies.

Personally I think this text will remain hidden from the bot as dealing with cookies would be quite a big task.

Ok so first of all I edited the code right at the top of the header.php file of my template.

Initially I set it up to check whether you were visiting this blog post (as opposed to any other page on my blog), and if so, whether the cookie had already been set. If it had: do nothing, the page loads and you see the contents of the cookie. If not: set the cookie and reload the page instantly so that the browser can send it back.
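
Roughly, that first attempt looked something like this (a loose reconstruction, not the exact code; the URL check in particular is just for illustration):

<?php
// abandoned first attempt (sketch): on this post only, set the cookie and
// bounce the browser straight back so the cookie arrives on the next request
if(strpos($_SERVER['REQUEST_URI'], 'can-the-googlebot-read-javascript-ajax-cookies') !== false){
	if(!isset($_COOKIE['test4'])){
		setcookie('test4', 'the secret keyword');
		header('Location: ' . $_SERVER['REQUEST_URI']); // instant reload
		exit;
		// anything that refuses cookies gets bounced round this forever...
	}
}
?>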

This was a bit of a ham-fisted hack (even by my standards). I also considered pulling the cookie in via AJAX, but then there would be two possible points of failure, which is why I opted to do it this way.

But I then realised this would set up a redirection loop for anyone that can’t handle cookies, and if the Googlebot can’t handle cookies then it would loop around like crazy and never see the content…
I ain't even mad


In the end I decided to just set the cookie in the header. Sadly the contents won’t be available until after a reload, but by setting the cookie on every page, any visitor who views any other page before this one will already have the cookie by the time they get here. Likewise, if the Googlebot crawls the site it will be offered the cookie at some point, so when it crawls this page (if it stores cookies) the text will be visible.

At the top of the header.php file I added the following:

<?php
// runs at the very top of header.php, before any output is sent
// (setcookie() has to be called before anything else goes to the browser)
if(isset($_COOKIE["test4"])){
	// ok, it's already set - nothing to do
}
else{
	$value = 'the secret keyword'; // placeholder: the live version uses the shifted string
	setcookie("test4", $value);
}
?>

Then on the page we just call that cookie…

<?php
// echo the cookie's contents - blank on a first-ever visit, since the
// browser hasn't sent the cookie back yet
if(isset($_COOKIE["test4"])){
	echo $_COOKIE["test4"];
}
?>

so… Attraction Ferments becomes…

So there we go.
Now, we wait for the Googlebot :D
Results to follow {^^,}