Apache Information Disclosure Issues or, “How to detect cloaking”
Well, we made it to our second SEO blog post without a major hitch. This one is about the Apache issue I was talking about, which is probably one of the nastier ones out there when it comes to detecting SEO (Search Engine Optimization) IP cloaking from the search engine’s perspective. I doubt things will keep rolling this fast and furious once we get some of these initial projects out of the way, but thus far I am cranking away.
Anyway, on to the problem. Again, putting on my black hat, I would assume based on the fact that there are so many SEO companies out there that one or two of them may be IP cloaking. Call me crazy. For anyone not in the know, IP cloaking is where you give a search engine (like Google or Yahoo, etc…) spam and give real users legitimate content, or vice versa depending on the application. All this for the eventual goal of raising natural search ranking as opposed to paying for advertising. Eventually I’m going to build an ROI tool to show people why natural search is so valuable, but I digress.
Well, there are really a ton of ways to do IP cloaking but the most common under Apache are using mod_rewrite or using a ScriptAlias. First you provide a link to a search engine and then you direct it to a script to deliver different content depending on IP matching (there are lots of problems with this technique beyond this, which I’ll go into in another blog post).
Okay, so what? Google and Yahoo see something different than everyone else and they can’t tell that they’ve been duped, right? Well, sorta. While I was playing around with some server headers I came across something odd when connecting to scripts versus normal HTML files:
Normal file headers under Apache 2.0:
HTTP/1.1 200 OK
Date: Fri, 07 Apr 2006 08:46:54 GMT
Server: Apache 2
Last-Modified: Fri, 07 Apr 2006 07:52:33 GMT
ETag: "1b0979-777-a5636e40"
Accept-Ranges: bytes
Content-Length: 1911
Connection: close
Content-Type: text/html; charset=ISO-8859-1
CGI Script headers under Apache 2.0:
HTTP/1.1 200 OK
Date: Fri, 07 Apr 2006 08:26:37 GMT
Server: Apache 2
Content-Length: 2616
Connection: close
Content-Type: text/html; charset=ISO-8859-1
Well, that’s kinda interesting I guess, but the fact that the file is named “.cgi” would probably tip you off before anything else so it’s not that interesting. But then I attempted cloaking the file with something like this:
ScriptAlias /cloak.html "/usr/local/www/htdocs/cloak.cgi"
Which would give the user the appearance that they were going to an HTML file while they were actually visiting a dynamic page. This is where it gets interesting. Here is the resultant header:
ScriptAliased file headers under Apache 2.0:
HTTP/1.1 200 OK
Date: Fri, 07 Apr 2006 08:32:47 GMT
Server: Apache 2
Connection: close
Content-Type: text/html; charset=ISO-8859-1
Notice anything different between that header and the normal file’s? I’ll give you a hint: it’s the ETag. In particular, it’s nonexistent on CGI scripts altogether (as are Last-Modified and Accept-Ranges). Why’s that? The ETag header as defined by RFC 2616 provides the current value of the entity tag for the requested variant. In English, that means it gives you a unique value for the file being requested. By default Apache builds it from the file’s inode number, its size, and its last-modified time, each written in hex (note that 777 in hex is 1911, the Content-Length of the normal file). A CGI script’s output has no file behind it to take those values from, so unless the script sends an ETag itself there isn’t one. Okay, that’s pretty interesting but let’s come back to it in a second.
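To make the ETag less mysterious, here is a minimal Python sketch that splits one apart. It assumes the server uses Apache’s default FileETag setting (INode MTime Size), which produces three hex fields joined by dashes; the function name parse_apache_etag is made up for illustration, and the modification-time field is left as a raw number because its units depend on the Apache build.

```python
# Sketch only: assumes Apache's default "FileETag INode MTime Size",
# which yields three hex fields in the order inode-size-mtime.
# parse_apache_etag is a hypothetical helper name.

def parse_apache_etag(etag):
    """Split a three-field Apache ETag into integer values."""
    value = etag.strip().strip('"')
    inode_hex, size_hex, mtime_hex = value.split("-")
    return {
        "inode": int(inode_hex, 16),
        "size": int(size_hex, 16),
        # Raw value only; the time units vary between builds.
        "mtime_raw": int(mtime_hex, 16),
    }

fields = parse_apache_etag('"1b0979-777-a5636e40"')
print(fields["size"])  # 1911, matching Content-Length on the normal file
```

The size field lining up with Content-Length is a quick sanity check that the tag really came from a file on disk.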
Now what about mod_rewrite? Mod_rewrite is the cloaker’s tool of choice because of its flexibility. Let’s say you wanted to send any URLs with the word “seo” in them to a script, e.g. www.whatever.com/seo or www.whatever.com/blah/seo/blah etc…. You’d use mod_rewrite simply because it is easy and scalable. Here’s an example that would do just that:
Example .htaccess file with mod_rewrite:
RewriteEngine on
RewriteBase /
RewriteRule seo /cloak.html
In the example above I am rewriting to an HTML file (the same HTML file as in the very first example), not a CGI script. Now, this is a pretty good cloaking technique because again it is scalable. However, it suffers from a different but similar flaw to what we saw before. Here’s an example:
Mod_rewrite to original HTML file headers on Apache 2.0:
HTTP/1.1 200 OK
Date: Fri, 07 Apr 2006 08:46:15 GMT
Server: Apache 2
Last-Modified: Fri, 07 Apr 2006 08:52:33 GMT
ETag: "1b0979-777-a5636e40;2bd1c700"
Accept-Ranges: bytes
Content-Length: 1911
Connection: close
Content-Type: text/html; charset=ISO-8859-1
Let’s look at those two ETag signatures side by side:
ETag: "1b0979-777-a5636e40"
ETag: "1b0979-777-a5636e40;2bd1c700"
It looks like Apache has told us two things. It has told us that the original file is the same, and it has told us that it is accessing it in a different way (in this case via mod_rewrite). But wait, there’s more. What if we use mod_rewrite to access a CGI script (the most common use of mod_rewrite for SEO cloaking anyway)? Let’s check it out:
Mod_rewrite forwarding to a CGI script headers:
HTTP/1.1 200 OK
Date: Fri, 07 Apr 2006 08:28:11 GMT
Server: Apache 2
Content-Length: 1911
Connection: close
Content-Type: text/html; charset=ISO-8859-1
Okay, but does that really help us? I mean, there’s no ETag at all, right? Well, yes, and that’s exactly the point. Because there is no ETag in the header and there is one for a confirmed normal file, you can tell that the page is dynamically created using mod_rewrite or a ScriptAlias. But now you’re asking, “What if you don’t know if it normally has the ETag at all, or more specifically what if the entire htdocs directory is dynamic?” How about trying a file that is always there and lives outside of the htdocs directory? The Apache logo that is included with the base install inside the /icons directory definitely qualifies. By getting /icons/apache_pb.gif we see the following:
GET /icons/apache_pb.gif HTTP/1.0
HTTP/1.1 200 OK
Date: Fri, 07 Apr 2006 08:32:37 GMT
Server: Apache 2
Last-Modified: Tue, 21 Apr 2004 14:35:21 GMT
ETag: "1818d7-916-a64a7c40"
Accept-Ranges: bytes
Content-Length: 2326
Connection: close
Content-Type: image/gif
That’s even true if the .htaccess file would seem to disallow it with something extremely restrictive like the next example, which tries to make anything with a slash in it redirect to cloak.cgi:
RewriteEngine on
RewriteBase /
RewriteRule "/" /cloak.cgi
The reason is that the /icons directory lives outside of the directory the .htaccess file governs (by default it is mapped in with an Alias in httpd.conf), so those rewrite rules never apply to it. So unless the webmaster takes specific action to remove the /icons directory, remove the Apache Alias for it in httpd.conf, or otherwise add cloaking to all the files on the system, there is a high risk of cloak detection.
And there you have it, folks. Using a static file as a baseline, a search engine can tell what else on your system is dynamically built and therefore more likely to be cloaking, thereby raising red flags. I tested this under Apache 2.x primarily but it should work on all forms of Apache that use the ETag header (versions 1.3.23 and later). Black-hat SEOs beware. Your mod_rewrites are vulnerable to information disclosure and the search engines of the world can tell what you are doing if this is ever implemented as a detection mechanism. I wonder what Matt Cutts and Jeremy Zawodny will think of this.
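For anyone who wants to play with the idea, here is a rough Python sketch of the baseline comparison. It is only an illustration of the heuristic described in this post, not a tested detection tool: the names fetch_headers and classify_response and the labels they return are invented for the example, and the “;” check simply mirrors the mod_rewrite ETag seen earlier.

```python
# Sketch of the baseline check described in this post.
# fetch_headers, classify_response and the returned labels are
# hypothetical names chosen for illustration.
import http.client


def fetch_headers(host, path):
    """Send a HEAD request and return the response headers as a dict."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        return dict(conn.getresponse().getheaders())
    finally:
        conn.close()


def classify_response(headers, baseline_headers):
    """Compare a page's headers with those of a known static file."""
    page = {k.lower(): v for k, v in headers.items()}
    base = {k.lower(): v for k, v in baseline_headers.items()}
    if "etag" not in base:
        return "unknown"          # server sends no ETag even for static files
    if "etag" not in page:
        return "dynamic"          # likely a script behind the URL
    if ";" in page["etag"]:
        return "static-indirect"  # same shape as the rewritten file's ETag
    return "static"


# Headers copied from the examples in this post.
baseline = {"ETag": '"1818d7-916-a64a7c40"', "Content-Type": "image/gif"}
normal = {"ETag": '"1b0979-777-a5636e40"', "Content-Type": "text/html"}
rewritten = {"ETag": '"1b0979-777-a5636e40;2bd1c700"'}
cgi = {"Content-Length": "1911", "Content-Type": "text/html"}

print(classify_response(normal, baseline))     # static
print(classify_response(rewritten, baseline))  # static-indirect
print(classify_response(cgi, baseline))        # dynamic
```

Against a live server you would fill baseline from something like fetch_headers(host, "/icons/apache_pb.gif") and the page headers from the URL under suspicion. A “dynamic” result only says the page is generated by a script, which plenty of honest sites do too, so treat it as a red flag rather than proof of cloaking.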
Now, back to work!
