Fixing WordPress 404 Problems for Google Sitemaps

I had a problem with my Google Sitemap, which was not being recognized by Google because my “404 (file not found) error page returns a status of 200 (Success) in the header.” So I dug around to fix my 404 page setup, which never really worked. Geeky notes follow, so I don’t have to look this up again.

Setting up a Custom 404 Page

I had noticed some time ago that non-existent pages on my site which should have generated 404 pages were instead delivering “post not found” pages. This was right after I upgraded to WordPress 2.0 from 1.5, so I figured it was just some change to the way it worked.

As I was researching Google’s 404 verification requirements and WordPress, I realized that the actual cause was that my custom theme didn’t have a 404.php template. So I added one, following the directions. Still no go on Google verification. I used a web page header display tool to check that the 404 status was being sent. It was, but when I told Google to verify the site again, it failed. Weirdness.
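If you would rather check the status from a script than from a header display tool, something like the following might do. It is only a sketch: the function name status_code_from_headers is made up for illustration, the URL in the usage comment is a placeholder, and it assumes PHP 7.1 or later. It reads the array that PHP's get_headers() returns and pulls out the last status line, which is the one that counts after any redirects.

```php
<?php
// Hypothetical helper (not part of WordPress): extract the numeric HTTP
// status code from the array returned by PHP's get_headers().
function status_code_from_headers(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 404 Not Found". When redirects
        // are followed there is one per hop, so keep the last one seen.
        if (is_string($line) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Usage (needs network access; the URL is a placeholder):
// $headers = get_headers('http://example.com/no-such-page');
// var_dump($headers !== false && status_code_from_headers($headers) === 404);
```

Run it twice in a row against the same bad URL. As it turned out for me, the second request is the one that matters.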

Caching

After some digging, I tracked it down to WP-Cache 2.0.17, the plugin I use to reduce the load on my shared server. What happened: the first time someone requests a non-existent page, WordPress properly delivers a 404 page with the right headers set. However, this output is CACHED by WP-Cache, so the *next time* the bad page is requested, the cached error page is delivered! And of course, that’s not a 404, but a successful delivery.

WP-Cache 2.0.19 fixes this by no longer caching 404 errors. Google Sitemaps verified my site, and everything seems to be working again.
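To make the failure mode concrete, here is a toy page cache. It is NOT WP-Cache's actual code, and I am assuming the simplest possible design: the cache stores only the page body, so a cache hit goes out with a plain 200. The class name ToyPageCache and the skipErrors flag are invented for this sketch.

```php
<?php
// Toy page cache for illustration only; not how WP-Cache is implemented.
final class ToyPageCache
{
    private $bodies = [];   // cached page bodies, keyed by URL
    private $skipErrors;    // when true, 404 responses are never stored

    public function __construct(bool $skipErrors)
    {
        $this->skipErrors = $skipErrors;
    }

    // $render is a callable that returns [status, body] for a URL.
    public function serve(string $url, callable $render): array
    {
        if (isset($this->bodies[$url])) {
            // Cache hit: only the body was stored, so the original
            // status is lost and the page goes out as a 200.
            return [200, $this->bodies[$url]];
        }
        list($status, $body) = $render($url);
        if (!($this->skipErrors && $status === 404)) {
            $this->bodies[$url] = $body;
        }
        return [$status, $body];
    }
}

// Stand-in for WordPress rendering a missing page.
$render = function (string $url): array {
    return [404, 'Not found: ' . $url];
};

$buggy = new ToyPageCache(false);
$buggy->serve('/missing', $render);                // first request: 404
$second = $buggy->serve('/missing', $render);      // second request: cached, 200

$fixed = new ToyPageCache(true);
$fixed->serve('/missing', $render);                // first request: 404
$secondFixed = $fixed->serve('/missing', $render); // second request: still 404
```

The first request looks fine in both cases, which is why a single check with a header tool passed while a later verification attempt failed.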

Spiffing up the 404 Page

I came across the A Perfect 404 article as I was figuring out what was going on, and cleaned up my 404.php file to be friendlier. If the $_SERVER['HTTP_REFERER'] variable exists, the page emits it as part of the error message and provides a link back. If it doesn’t exist, it prints a more generic message. I was thinking of implementing a check of the referring link to customize the message for search engine traffic, but I’ll leave that for another day. The A Perfect 404 article has some instructions if you’re interested.

SECURITY UPDATE

In the comments, reader “epc” points out that printing out the value of Referer without some escaping is not a safe practice. I added a test that checked whether the referer value begins with https://davidseah.com or http://www.davidseah.com, and further escaped the output using the htmlspecialchars() function. I’m not sure what can really be done with the 404 page that might be dangerous, but thinking about issues like this is a good habit to get into. This article on Top 7 PHP Security Blunders was helpful in understanding some of the other issues. Thanks epc!
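For reference, the check might look roughly like the sketch below. This is not my exact 404.php code: the function name back_link_message and the message text are made up, and the pattern is a slightly stricter variant that escapes the dots and requires the host name to end right after ".com", so that a look-alike host such as davidseah.com.evil.example doesn't slip through. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: build the 404 message, adding a "back" link only
// when the Referer points at this site.
function back_link_message($referer): string
{
    $generic = 'Sorry, that page was not found.';

    // Accept only http://davidseah.com or http://www.davidseah.com,
    // followed by a slash or the end of the string.
    $pattern = '#^http://(www\.)?davidseah\.com(?:/|\z)#';

    if (!is_string($referer) || preg_match($pattern, $referer) !== 1) {
        return $generic;
    }

    // Escape before embedding in HTML, even though the pattern matched:
    // the rest of the URL can still carry quotes or angle brackets.
    $safe = htmlspecialchars($referer, ENT_QUOTES, 'UTF-8');

    return $generic . ' <a href="' . $safe . '">Back to ' . $safe . '</a>';
}

// In the template (PHP 7+):
// echo back_link_message($_SERVER['HTTP_REFERER'] ?? null);
```

The Referer header comes straight from the browser, so it gets the same treatment as any other user input: check it first, then escape it on output.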

4 Comments

  1. Kyle Korleski 17 years ago

    Nice. I am going to be redoing my 404 page with the redesign of my website.

  2. epc 17 years ago

    Make sure you’re escaping the value of Referer before embedding it in your page.  It appears that you’re using mod_security or something else to trap nasty referers (like <script>something), but be paranoid and escape the string, just in case a configuration change trips up the filtering and ends up allowing an XSS/CSRF referer through.

    I would drop the Expires: header, it’s redundant with the cache-control headers.  It doesn’t do any harm either.

  3. Dave Seah 17 years ago

    epc: Thanks for the advice! I’m clearly playing with things I don’t know that much about, so I’m glad you’ve taken the time to point out the way.

    I applied the following preg_match check:

      |^http://(www.)?davidseah.com|

    to just allow referer from within my site to emit the “back to” page. If there’s no match, then it just prints the generic not-found text.

  4. Kyle Korleski 17 years ago

    Yeah, that’s some good advice. I am working on my blog’s 404 page so it can work with all the navigational stuff I am working on.