Thursday, February 08, 2007

Search Engine Optimization with Apache and mod_rewrite

I've recently been using mod_rewrite to modify the URLs on a client's website. mod_rewrite is a powerful tool that lets you turn "ugly" URLs like

http://example.com/search.cgi?searchType=pie&searchTerm=pumpkin%20pie

into cleaner URLs like

http://www.example.com/pie/pumpkin_pie

This is useful for a couple of reasons: not only is the URL cleaner to look at, but it can also help with search engine indexing. In this case, because "pumpkin_pie" is part of the URL path rather than the query string, many search engines give the keyword more weight.

Let's say we have an application that returns search results for various categories, and we want the URLs to have the format "http://www.example.com/(category)/(search term)". We also want a landing page when the URL is simply "http://www.example.com/(category)". We want to make this as generic as possible, so that httpd.conf does not need to be edited every time a category is added.

This can be configured in a number of ways, but the way I have it installed here is with Apache running on port 80 and the application, a Java servlet container, running on a different port, say port 8000. Apache handles the requests for static, on-disk content itself and uses its proxy mechanism to send dynamic requests to the servlet container. Let's break down the relevant sections of the Apache configuration file:

First, it can be useful to funnel all traffic for your site through a single hostname, rather than having links to both "example.com" and "www.example.com". This rule forces requests for "example.com" over to "www.example.com" with an HTTP 301 redirect:

RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^/(.*) http://www.example.com/$1 [L,R=301]
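
To make the behaviour of this rule concrete, here is a minimal Python sketch (not part of the Apache configuration) that mimics the condition and the redirect. The function name canonical_redirect is made up for illustration, the dot in the hostname pattern is escaped so that it only matches a literal ".", and the query string, which Apache passes through on its own, is ignored:

```python
import re

CANONICAL = "http://www.example.com"  # hostname all traffic should end up on

def canonical_redirect(host, path):
    """Return (status, location) for a bare-domain request, else None."""
    # Mirrors: RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
    if re.match(r"^example\.com$", host, re.IGNORECASE) is None:
        return None
    # Mirrors: RewriteRule ^/(.*) http://www.example.com/$1 [L,R=301]
    match = re.match(r"^/(.*)", path)
    if match is None:
        return None
    return 301, CANONICAL + "/" + match.group(1)

print(canonical_redirect("example.com", "/pie/pumpkin_pie"))
print(canonical_redirect("www.example.com", "/pie/pumpkin_pie"))
```

The first call yields a 301 to the "www" hostname with the path preserved; the second returns None because the request is already on the canonical hostname.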

Now let's map the static page elements and HTML to the local filesystem, so that they don't get remapped to a search query and are served by Apache instead of being proxied through another layer. Note that we need to map favicon.ico to the local filesystem; otherwise you can end up sending searches to your application when the browser requests /pie/pumpkin_pie/favicon.ico! The [L] flag tells the rewrite engine to stop processing rules at this point, so the file is served directly from disk.

RewriteRule ^/js/(.*) /opt/static/js/$1 [L]
RewriteRule ^/pictures/(.*) /opt/static/pictures/$1 [L]
RewriteRule ^/images/(.*) /opt/static/images/$1 [L]
RewriteRule ^/css/(.*) /opt/static/css/$1 [L]
RewriteRule /favicon\.ico$ /opt/static/html/favicon.ico [L]
RewriteRule ^/robots\.txt$ /opt/static/html/robots.txt [L]

Another useful trick is to remap underscores to %20 in the search parameters, so that a term like "pumpkin_pie" reaches the backend application as "pumpkin%20pie". This rule matches any URL that contains an underscore, rewrites one underscore to %20, and then, because of the [N] flag, restarts processing from the first rewrite rule, so the underscores are replaced one at a time until none are left. This is necessary because we don't know how many underscores the URL might contain, and mod_rewrite has no "replace all" modifier like the "/g" in a normal Unix search and replace. Note the "QSA" in the rule flags; this means "Query String Append" and keeps any existing query string intact through the processing:

RewriteCond %{REQUEST_URI} ^/.*_
RewriteRule ^/(.*)_(.*) /$1\%20$2 [N,QSA]
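
The one-at-a-time behaviour is easier to see outside Apache. The following Python sketch is only an emulation of the loop, not how mod_rewrite is implemented; underscore_passes and max_passes are made-up names, and the cap on passes is a safety net for the sketch rather than an Apache setting:

```python
import re

# Same pattern as the RewriteRule: both groups are greedy, so the first
# group swallows as much as it can and the LAST underscore is replaced first.
UNDERSCORE = re.compile(r"^/(.*)_(.*)")

def underscore_passes(path, max_passes=50):
    """Return the path after each pass, emulating the restart done by [N]."""
    passes = []
    for _ in range(max_passes):  # hypothetical safety cap for this sketch
        match = UNDERSCORE.match(path)
        if match is None:
            break  # no underscore left, so the rule stops matching
        path = "/" + match.group(1) + "%20" + match.group(2)
        passes.append(path)
    return passes

for step in underscore_passes("/pie/deep_dish_apple_pie"):
    print(step)
```

For "/pie/deep_dish_apple_pie" this prints three intermediate paths, replacing the right-most underscore first and ending with "/pie/deep%20dish%20apple%20pie".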

Now let's say there are a couple of URL paths we want to treat differently; for example, the "buy" section of the site. Because of the way we map the general search cases later in this file, anything that needs special treatment has to be rewritten into a form that the generic match will not pick up:

RewriteRule ^/buy/(.*) /purchase.jsp?cat=$1 [QSA]

Now for the "/(category)" landing page. We have to limit categories to letters only (the pattern [a-z] combined with the case-insensitive [NC] flag), so that things like "purchase.jsp" are not treated as categories! We also prevent any request that contains a query string from being treated as a category, so that servlets and the like continue to work:

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^/([a-z]*)$ /landingPage.jsp?category=$1 [NC]

Now for the generic /(category)/(searchterm) mapping.

RewriteRule ^/([a-z]*)/(.*) /search.jsp?category=$1&search=$2 [NC,QSA]

We are now at the end of the line, so we proxy the resulting rewritten URL to our application:

RewriteRule ^/(.*) http://127.0.0.1:8000/$1 [P]
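
To check how the last four rules interact, it can help to trace a few requests by hand. The Python sketch below is a simplified model under stated assumptions: it covers only the "buy", landing page, generic search and proxy rules (not the redirect, the static files or the underscore loop), it ignores any escaping Apache may apply, and the names route and join_query are invented for this example:

```python
import re

BACKEND = "http://127.0.0.1:8000"  # proxy target from the final rule

def join_query(new, old, qsa):
    """A new query string replaces the old one unless QSA appends it."""
    return new + "&" + old if (qsa and old) else new

def route(path, query=""):
    """Return the backend URL a request would be proxied to."""
    # RewriteRule ^/buy/(.*) /purchase.jsp?cat=$1 [QSA]
    m = re.match(r"^/buy/(.*)", path)
    if m:
        path, query = "/purchase.jsp", join_query("cat=" + m.group(1), query, True)

    # RewriteCond %{QUERY_STRING} ^$
    # RewriteRule ^/([a-z]*)$ /landingPage.jsp?category=$1 [NC]
    m = re.match(r"^/([a-z]*)$", path, re.IGNORECASE)
    if m and query == "":
        path, query = "/landingPage.jsp", "category=" + m.group(1)

    # RewriteRule ^/([a-z]*)/(.*) /search.jsp?category=$1&search=$2 [NC,QSA]
    m = re.match(r"^/([a-z]*)/(.*)", path, re.IGNORECASE)
    if m:
        new = "category=%s&search=%s" % (m.group(1), m.group(2))
        path, query = "/search.jsp", join_query(new, query, True)

    # RewriteRule ^/(.*) http://127.0.0.1:8000/$1 [P]
    return BACKEND + path + ("?" + query if query else "")

for request in ("/pie", "/pie/pumpkin%20pie", "/buy/pie"):
    print(request, "->", route(request))
```

In this model "/buy/pie" becomes "/purchase.jsp?cat=pie" and then slips past both later patterns, because "purchase.jsp" contains a dot and has no second path segment; that is what lets the special case bypass the generic match.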

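And if you run into any trouble, you can turn on rewrite logging with the following directives (Apache 2.4 and later removed them in favor of the LogLevel directive, for example "LogLevel rewrite:trace3"):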

RewriteLog /opt/app/logs/rewrite.log
RewriteLogLevel 9

Now of course, these remappings only map INCOMING URLs to our application. The application is still responsible for emitting this URL format in the links it sends back to the user, so that anyone who links to your site uses the optimized URL format. Another way to get these URLs to search engines is with a sitemap file; see www.sitemaps.org for details.
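
As a rough illustration of the outgoing side, here is a Python sketch of a helper that builds the public URL from a category and a search term. The application in this post is a Java servlet, so treat this as a language-neutral outline; the name clean_url is hypothetical, and it assumes search terms never contain a literal underscore, since the rewrite rules would turn it into a space on the way back in:

```python
from urllib.parse import quote

SITE = "http://www.example.com"  # canonical hostname used by the rules

def clean_url(category, search_term=None):
    """Build the public URL for a landing page or a search result."""
    # The landing page and search rules only accept letters in the category.
    if not (category.isascii() and category.isalpha()):
        raise ValueError("category must contain letters only")
    url = SITE + "/" + category
    if search_term is not None:
        # Spaces become underscores; everything else unsafe is percent-encoded.
        url += "/" + quote(search_term.replace(" ", "_"), safe="")
    return url

print(clean_url("pie"))
print(clean_url("pie", "pumpkin pie"))
```

The two calls produce "http://www.example.com/pie" and "http://www.example.com/pie/pumpkin_pie", which are the same URLs the rewrite rules expect to receive.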

