Isolated, hidden webpages could be indexed by Google

Did you know that Google could index pages that have no inbound links? I mean pages that have not been submitted to any search engine, that are not linked from any visible, indexed page, that are not listed in any sitemap, and that are hosted on servers with directory listing disabled.

In the following lines I will explain the complete scenario that led me to investigate whether such supposedly hidden pages could be indexed by Google.

More importantly, I will also explain how I think you could be authorizing Google to discover such webpages.

Lastly, I will explain how Google's distributed datacenters may lead to confusing results, or even to valuable hints if you take every source of information into account.


Why I thought about how Google could find pages without inbound links

While I was creating a website for a customer, I developed an under-construction version of the webpage. Then I uploaded this version to my own website for testing purposes.

Let's say that my customer's main URL was:

http://my-Customer-Domain-Name/

And that I uploaded the test webpage version to:

http://my-website-domain/my-Customer-Domain-Name/index.htm

So I thought that the only way to access that under-development version of the webpage was to type the exact, complete URL, right? Wrong! I was really shocked when I performed a search in Google, using my customer's domain name as the main search keyword, and the second entry in the Google search results was this test webpage hosted on my own website.

Yes, a page that I considered a hidden webpage, supposedly without any inbound link, was indexed by Google!


Was that webpage really hidden from search engines?

Okay, the fact is that Google found this "hidden" webpage. It was an under-construction page after all, so nobody was meant to take a look at it (not yet!). So I was really interested in finding out how Google managed to index this hidden and isolated page.

The first question is: was this webpage really hidden from search engines? Here is a list of all the basics that should be checked (and that I checked) to verify that a webpage is a hidden webpage.

  • The indexed webpage had no inbound links (according to Google's link search). It wasn't linked from any page on my website. And, at that moment, it wasn't linked from any other webpage on the whole Internet.
  • The testing webpage was never submitted to any search engine. I'm not sure whether search engines would really crawl a suggested directory instead of trying to crawl from the main domain index. But I'm completely sure that I didn't submit the URL of my website under development to Google.
  • Nobody else could have submitted such a URL. None of my co-workers knew about this test version. And I think hackers have much more interesting things to do.
  • The webpage was not included in a sitemap. Yes, I have Google sitemaps and Yahoo URL lists on my website. But I checked the included URLs, and the mysteriously indexed URL was not on any sitemap or list.
  • Directory listing was disabled. If you don't type the exact URL, and if there isn't an index page in the target directory, my web server won't return a list of the hosted webpages - just a lean, mean "directory listing disabled" message.
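
The sitemap check in the list above can be automated instead of done by eye. The following is only a minimal sketch: it assumes a sitemap in the standard sitemaps.org XML format, and the sample sitemap content (including the about.htm entry) and the function names are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content. A real check would read the sitemap file
# that is actually published on the server.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://my-website-domain/</loc></url>
  <url><loc>http://my-website-domain/about.htm</loc></url>
</urlset>
"""

# Namespace used by <urlset>, <url> and <loc> in the sitemaps.org format.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_in_sitemap(sitemap_xml):
    """Return the set of URLs found in the <loc> elements of a sitemap."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def is_listed(sitemap_xml, url):
    """Tell whether the exact URL appears in the sitemap."""
    return url in urls_in_sitemap(sitemap_xml)


test_page = "http://my-website-domain/my-Customer-Domain-Name/index.htm"
print("listed in sitemap:", is_listed(SAMPLE_SITEMAP, test_page))
```

The comparison is an exact string match, so a URL that differs only in letter case or a trailing slash would be reported as not listed.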

So I was pretty sure that this webpage was really hidden from search engines... And then it occurred to me that I might be giving some additional information to Google (and only to that search engine, the powerful Google) in a somewhat indirect, unconscious way.

So here are the two Google services that Google could have used to discover and index the new webpage:


Pages with Google Analytics code may be discovered and indexed by Google

I realized that I had Google Analytics code already installed on the test webpage. I had created a profile for a new website (in fact, my customer's final website), and I had embedded the tracking code (Google's latest code, ga.js) in the under-construction version of the webpage.

On the other hand, the installation of this tracking code wasn't complete: I never uploaded the webpage under construction to the customer's domain, and Google displayed the message "tracking not installed". And the stats graph didn't show any visits at all.

So I think that it's possible that Google could use Google Analytics data to index new web pages (or even to tweak search results in order to improve relevance).


Anonymous navigation data sent by Google Chrome could be used to index new webpages in Google

The only other Google application that knew about the existence of my hidden test webpage was the brand new, lightning fast web browser Google Chrome. I used this web browser to try the under-construction version of the webpage. And then I found out that I had enabled the Google Chrome option that allows the browser to send anonymous data about navigation bar search suggestions and autocompletion.

I thought that this anonymous data would be used by Google to provide more relevant and more accurate search results. What I didn't expect is that anonymous navigation data gathered by Google Chrome could be used to index new web pages in Google. Yes, after discarding every other option, I thought at first that Google had found my hidden webpage because I typed the URL directly into the navigation bar of Google Chrome several times.


But the hidden webpage had an inbound link!

Just when I was sure that Google had indexed a hidden webpage, I received a very interesting message from Google Webmaster Tools after renaming the "hidden webpage".

The message was a URL crawl error about a missing URL. Checking the details of this error, I found that the "hidden webpage" had been linked from my customer's website (but was not linked from there anymore).

That reminded me of some of the basic principles in web programming:

  • Customers will use your website in really unexpected ways (even when the website is not yet finished).
  • Google has many datacenters with different information, so you cannot rely on a single search result. While a normal link query may report that there aren't any inbound links, you can obtain different results by checking the stats of your website through Google Webmaster Tools.
  • Google Webmaster Tools now reports broken links even when the only link to your webpage comes from an external website. After all, if they linked to you, you should keep some content at that URL.
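
Besides Google Webmaster Tools, your own server logs can hint at unexpected inbound links. This is not what I did at the time; it is only a sketch that assumes the web server writes its access log in the common "combined" format (which records the Referer header), and the sample log lines and the function name are hypothetical. It can only reveal links that somebody actually followed.

```python
import re

# Matches the request, status, size and Referer fields of a "combined" log line:
# ... "METHOD path PROTOCOL" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)

# Hypothetical log lines, invented for this example.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2008:13:55:36 +0000] "GET /my-Customer-Domain-Name/index.htm HTTP/1.1" 200 2326 "http://my-Customer-Domain-Name/" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Oct/2008:14:01:02 +0000] "GET /my-Customer-Domain-Name/index.htm HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2008:14:05:10 +0000] "GET /index.htm HTTP/1.1" 200 512 "http://my-website-domain/about.htm" "Mozilla/5.0"',
]


def referrers_for(log_lines, path_prefix):
    """Collect the non-empty Referer values of requests under path_prefix."""
    found = set()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # Not a line in the expected format; skip it.
        referer = match.group("referer")
        if match.group("path").startswith(path_prefix) and referer not in ("", "-"):
            found.add(referer)
    return found


for referer in sorted(referrers_for(SAMPLE_LOG, "/my-Customer-Domain-Name/")):
    print("visited via link from:", referer)
```

Direct visits (a typed URL or a bookmark) show "-" as the referrer, so they are ignored here.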

Conclusions

It seems that you cannot rely on keeping an isolated web page secret, even without any inbound link, if you plan to use some Google applications with it. Either Google Chrome or Google Analytics could be using gathered navigation data to index new webpages and add them to the main Google search engine: you could be authorizing them to do so.

That's not a security or privacy issue. In fact, if you really don't want search engines to crawl part of your website, tell them so in your robots.txt file. Just keep in mind that robots.txt files are public, and that human users may easily find complete lists of your 'secret hidden URLs' by reading these files. But that's a different story.
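
As an illustration, here is a minimal sketch of such a robots.txt rule, checked with Python's standard urllib.robotparser module. The rule itself is an assumption: it is what I would write for the test directory used in this article, not something taken from my real server, and the helper name is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that asks every crawler to stay out of
# the test directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /my-Customer-Domain-Name/
"""


def build_parser(robots_txt):
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
hidden = "http://my-website-domain/my-Customer-Domain-Name/index.htm"
public = "http://my-website-domain/index.htm"

# can_fetch() answers: may this user agent crawl this URL?
print("Googlebot may crawl hidden page:", parser.can_fetch("Googlebot", hidden))
print("Googlebot may crawl public page:", parser.can_fetch("Googlebot", public))
```

Note that a rule like this only asks well-behaved crawlers not to fetch the directory, and it also tells every human reader of the file that the directory exists.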

After all, Google Chrome and Google Analytics could just be helping Google do its work: discovering and indexing as many webpages as possible (even when those webpages seem to be really hidden!).

Finally, if you want accurate results about the links to your website, don't rely on a search query alone. Google has many datacenters with subtly different content. So if you want serious results, perform link queries through the main Google search engine, but check the link stats of your website through Google Webmaster Tools as well.
