I should probably write something original, lest the entire blog be relegated to the depths of "supplemental results". Over the past couple of months, there's been a lot of that going around, apparently. Perfectly good (at least in the owner's opinion) content is being shoved off to supplemental results or, worse, vanishing completely, with no apparent rhyme or reason. Granted, expecting rhyme or reason from a search engine algorithm, though nice, is probably wishful thinking. Besides, assuming there actually is either rhyme or reason involved, no one is actually going to reveal this information to you.
I mean, if they did that, someone might exploit it...
In the last few days an interesting story broke in the dark corners of the internet known as SEO forums, one that may provide a partial explanation as to why so many results may have gone supplemental. Namely that Google started running out of room to index an internet that, to their spider-eyes, had suddenly grown by astronomical leaps and bounds. Yes, over the course of a couple weeks, Google's intrepid spiders had suddenly uncovered uncounted billions (yes, with a 'b') of new pages, all with keyworded backlinks and, fancy this, all sporting PPC ads.
It seems someone found a chink in that shiny armor of theirs. Someone with the will and resources to exploit that chink in a monumental fashion. Anyone who's surprised can please leave the room now.
This particular chink has everything to do with how kindly Google treats subdomains. A subdomain is, basically, a third-level domain, something like this: subdomain.domain.com. This humble blog is a subdomain, in fact: blog.apollohosting.com. You may have noticed them strewn across the net; they're popular ways of organizing information on a large site without resorting to directories or separate domain names. Technically there can be more sublevels, but one generally gets the job done.
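For the curious, the "levels" in a hostname are just the dot-separated labels, counted from the right. Here's a minimal sketch of that idea in Python. The function name `split_host` is my own invention, and it naively assumes a one-label ending like ".com"; it would get something like ".co.uk" wrong, which is why serious tools consult a list of known public suffixes instead.

```python
def split_host(host: str) -> dict:
    """Split a hostname into its levels, counting labels from the right.

    Assumption: the top-level domain is a single label (e.g. "com").
    Multi-label suffixes such as "co.uk" are not handled by this sketch.
    """
    labels = host.lower().rstrip(".").split(".")
    return {
        "top_level": labels[-1],
        "second_level": labels[-2] if len(labels) >= 2 else None,
        # Everything to the left of the second-level label is a subdomain.
        "subdomains": labels[:-2],
    }


print(split_host("blog.apollohosting.com"))
print(split_host("subdomain.domain.com"))
```

Anything that lands in the `subdomains` list is a third level or deeper, which is why "technically there can be more sublevels" simply means a longer list.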
Google's spider treats subdomains like unique domains, and, even though it might not "deep crawl" a subdomain immediately, it will add the main index page to Google quickly. How it remembers to return for a "deep crawl" is irrelevant. Perhaps it scribbles a note and attaches it to the fridge, trusting that Mrs. GoogleBot won't cover it up with their kid's latest crayon rendering of DMOZ.
So, you've got a kind of page that Google will index, basically "no questions asked." Surely no one will exploit this behavior... Altogether now: "Of course someone will exploit this behavior, and don't call me Shirley."
Enter some fine, enterprising Eastern European gentleman, his servers, his spambots, his content scrapers, and his PPC accounts. Granted, as the poor attempt at humor above illustrates, the fact that someone exploited this isn't surprising at all. What has everyone buzzing is just how massively he did it.
I'll try to give a quick thumbnail as I understand the "process". It's not something anyone "off the street" could accomplish, but the scripting knowledge required supposedly isn't all that great, either. It also takes a little investment capital to run some decent servers, since, you know, you'll soon be getting massive amounts of Google traffic and you don't want those puppies to bog down.
Once you've got your servers and your magic scripts, it's time to unleash the old staple: spambots targeting blogs and any comment/testimonial forms. Since this blog is a repeated target of these self-same automated scourges (to the point that I've basically shut down all interactivity), I already wish this guy would get run over by a bus. The spambots only serve to sow the seeds, though, and it only takes a few, because that's where the scripts come in.
Once GoogleBot has been put on the scent via the referrals section of a lazy blogger, the scripts do the rest. They create, on the fly, an essentially endless array of subdomains, each with a single index page. That single index page, via the kindness of Google mentioned earlier, is added to the index quickly. Sure, GoogleBot intends to go back and look deeper later, but for now, it trusts the index page is "okay." Each of these subdomains is linked via keywords, contains scraped content relating to said keywords, and a variety of PPC ads targeted at the keywords through which it's linked. All of this is accomplished by the scripts on the fly.
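Seen from the crawler's side, the pattern above has a fairly obvious footprint: one parent domain suddenly sprouting an enormous number of distinct hostnames. Below is a minimal sketch of how that footprint could be spotted. To be clear, this is my own illustration and not anything Google has described; the function names, the threshold, and the "last two labels" rule for finding the parent domain are all assumptions.

```python
from collections import defaultdict


def parent_domain(host: str) -> str:
    """Return the last two labels of a hostname (naive; assumes a ".com"-style ending)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def flag_subdomain_floods(hosts, threshold=3):
    """Map each parent domain to its distinct hostname count, if above the threshold.

    The threshold here is a hypothetical, toy-sized value for illustration.
    """
    seen = defaultdict(set)
    for host in hosts:
        seen[parent_domain(host)].add(host.lower().rstrip("."))
    return {
        parent: len(names)
        for parent, names in seen.items()
        if len(names) > threshold
    }


crawled = [
    "cheap-widgets.spam.example",
    "best-widgets.spam.example",
    "buy-widgets.spam.example",
    "widget-deals.spam.example",
    "blog.apollohosting.com",
]
print(flag_subdomain_floods(crawled))
```

With the toy list above, only the parent with four distinct hostnames is flagged. A real index would be dealing with billions of them, which gives some idea why cleaning up by hand is slow going.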
Bingo... billions of new pages, all indexed, and many showing up quite well in the SERPs. Which is why they all have PPC ads on them. Reports claimed they were AdSense ads at first, but they may have changed later on to different networks. Granted, with a scheme like this, they wouldn't need to be there long to make a killing. The ultimate irony is that Google makes money off the deal too, since all those AdSense impressions on all those billions of pages are coming out of the pockets of advertisers paying Google to display the ads in the first place.
There's a more detailed explanation of the process here.
Google has referred to the cause of the problem as a "bad data push". This has been met with everything from amusement to derision from the SEO community. Google's immediate response to the issue has been to "put troops on the ground", so to speak; manually deleting the domains and subdomains from the index over the course of the past few days. The sheer number of them makes this a slow process, as ripping out that much data over multiple datasets can't be done haphazardly.
Google is at least paying lip service to the idea that they are attempting to fix the hole in the algorithm that caused the issue, but at the moment the only sure work being done is on the manual side. Naturally, now that one enterprising individual has shown the way, others are leaping on the bandwagon in order to cash in before the "loophole" is actually closed algorithmically. More examples of similar subdomain schemes are popping up, no doubt with more to come.
This story has yet to penetrate the "mainstream" media in any significant way. It exists in that strange, eccentric place known as the "blogosphere." Though perhaps it isn't the best place to look, there were only three sources for the story available on Google News as of the morning of the 20th (ought-six), and only then by searching.
Whether it will crack any large news outlet is unclear at this point. Google is probably concerned enough about all those savvy SEOs out there gossiping about their misfortune without having to save face with the general public. Then again, this is a story whose significance relies on background knowledge that is mostly lacking in the general public, especially among those without websites.