Webforumz Newsletter - November 2007
Articles
$_GET['ting'] Pages (Safely) With PHP
Introduction
I will never forget the first assignment of my apprenticeship. My new employer had a website with over two-hundred pages, and wanted to add a new item to the navigation menu. It was my first day on the job and I was determined and eager to please as I opened the first page, added the new link and, after saving, closed the file. Then the next one and the next one and the next one.... After opening, editing, saving and closing fifty or so files, I was starting to wonder if there wasn't a better way to do this. After opening, editing, saving and closing the one-hundred-and-twenty-fourth file, I knew that there had to be a better way because I was not about to trade my sanity for an apprenticeship. Sure enough, after a bit of searching, I found the better way: PHP and query strings.
The Basics
The layout and menu were identical on all the pages and the only things that were different were the title and the content. I took one of the pages, replaced the title and content with two PHP variables and saved it as "index.php." I then defined the two PHP variables in an external file which was included at the top of my new index.
An old page looked like this:
<html>
<head>
<title>Products</title>
</head>
<body>
<div class="menu">
<ul>
<li><a href="home.html">HOME</a></li>
<li><a href="contact.html">CONTACT</a></li>
<li><a href="links.html">LINKS</a></li>
<li><a href="products.html">PRODUCTS</a></li>
</ul>
</div>
<div class="content">
<h1>Our Products</h1>
<p>Bla bla bla blub.</p>
</div>
</body>
</html>
And the new template (index.php) where the title and content get parsed in using PHP variables looked like this:
<?php
include('page.php');
?>
<html>
<head>
<title><?php echo($title); ?></title>
</head>
<body>
<div class="menu">
<ul>
<li><a href="index.php?page=home">HOME</a></li>
<li><a href="index.php?page=contact">CONTACT</a></li>
<li><a href="index.php?page=links">LINKS</a></li>
<li><a href="index.php?page=products">PRODUCTS</a></li>
</ul>
</div>
<div class="content">
<?php echo($content); ?>
</div>
</body>
</html>
And the external file ("page.php") which was included at the top of the template:
<?php
$title="Products";
$content="\t<h1>Our Products</h1>\n\t<p>Bla bla bla blub.</p>";
?>
The HTML source code generated by both the old and the new pages was identical but now I could use PHP to change the content of the $title and $content variables. I then created a sub-folder called "pages" and some files for testing: "home.php", "contact.php", "links.php" and "products.php". By modifying the PHP at the top of index.php, I could dynamically load these pages using the value in the query string:
<?php
include('./pages/'.$_GET['page'].'.php');
?>
A short explanation: The $_GET array holds the values that are passed in the query string, which follows the "?" question mark; the individual name=value pairs are separated by "&" ampersands. So a URL like "http://www.mysite.com/index.php?page=start&id=236&name=john_doe" stores the following values in the $_GET array:
$_GET['page'] contains "start"
$_GET['id'] contains "236"
$_GET['name'] contains "john_doe"
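To see the same splitting outside of a web request, here is a minimal sketch that takes the example URL apart by hand. It uses the built-in parse_url() and parse_str() functions; the $url and $params names are only for illustration, and in a real request PHP fills $_GET for you.

```php
<?php
// Example URL from the text above (hypothetical site).
$url = 'http://www.mysite.com/index.php?page=start&id=236&name=john_doe';

// parse_url() returns the part after the "?" ...
$query = parse_url($url, PHP_URL_QUERY);

// ... and parse_str() splits it at the "&" into name/value pairs,
// much as PHP does when it fills $_GET.
parse_str($query, $params);

echo $params['page'], "\n"; // start
echo $params['id'], "\n";   // 236
echo $params['name'], "\n"; // john_doe
```

Note that every value arrives as a string, including the "236".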
So, now I still had to convert all the old pages to .php files but, in the future, changes to the layout would mean only having to change my template instead of having to change two-hundred pages.
Security Issues
This one little line of PHP that I used to include the pages is completely inadequate. Not only can it quickly cause interpreter errors, but it is also a huge security risk. For example, what happens if someone enters a URL like "http://www.mysite.com/index.php?page=bla"? The PHP interpreter will try to include "pages/bla.php" and will raise a warning because that file does not exist. Another common problem is that the "page" variable in the query string is not set ("http://www.mysite.com/"). This also results in a PHP error when the interpreter tries to include "pages/.php". The worst-case scenario is that on a Unix server a malicious user could walk up the directory tree with a request like "http://www.mysite.com/index.php?page=../../etc/passwd" and try to output the contents of your "passwd" file. Because ".php" is appended to the value, such a request needs a further trick to succeed (older PHP versions, for example, could be fooled by a null byte at the end of the value), but if your server security is not configured properly, files far outside your "pages" folder, including your list of user accounts, could be readable from any browser in the world! So, how do we go about preventing this?
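The three cases are easier to judge when you look at the path the interpreter ends up with. The following sketch only builds the path string and does not include anything; naive_include_path() is a hypothetical helper that mirrors the unvalidated include line, assuming the pages live in "./pages/".

```php
<?php
// Hypothetical helper: builds the path exactly as the unvalidated
// include line would, without actually including the file.
function naive_include_path($page)
{
    return './pages/' . $page . '.php';
}

echo naive_include_path('bla'), "\n";              // ./pages/bla.php
echo naive_include_path(''), "\n";                 // ./pages/.php
echo naive_include_path('../../etc/passwd'), "\n"; // ./pages/../../etc/passwd.php
```

The last path no longer points into the "pages" folder at all, which is the real problem.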
Handling 404s
The first step towards making this method secure is preventing clients from requesting pages that do not exist. First we check if the $_GET['page'] variable is valid and then we check if the requested file exists before we include it. I use the ternary operator (an inline if statement) to check the $_GET['page'] variable. If the $_GET['page'] variable is set and contains a value, then it is written into the $page variable; otherwise $page will equal 'home'. Then I check if the requested file exists and, if not, set the $page variable to 'error':
<?php
// check the $_GET['page'] variable
$page = ((isset($_GET['page']) && $_GET['page'] != '') ? $_GET['page'] : 'home');
// check if file exists
$page = (file_exists('./pages/'.$page.'.php') ? $page : 'error');
include('./pages/'.$page.'.php');
?>
Of course there has to be an "error.php" page in your "pages" folder that tells the user the requested page is not available. A redirect to the standard 404 page is also a good option.
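If you keep the error page inside your own layout, it is still polite to tell browsers and search engines that the page was not found. Here is one possible sketch: resolve_page() is a hypothetical helper (not part of the original script) that returns the page name together with the status code to send. It does not clean up the requested value, so that still has to happen separately.

```php
<?php
// Hypothetical helper: picks the page to load and the HTTP status to send.
// $requested is the raw value from the query string (or null if not set),
// $pagesDir is the folder that holds the page files.
function resolve_page($requested, $pagesDir)
{
    // fall back to the home page if nothing was requested
    $page = ($requested !== null && $requested !== '') ? $requested : 'home';

    if (file_exists($pagesDir . '/' . $page . '.php')) {
        return array($page, 200);
    }
    // unknown page: show the error page and report a 404
    return array('error', 404);
}

// In index.php this might be used as follows:
//   list($page, $status) = resolve_page(isset($_GET['page']) ? $_GET['page'] : null, './pages');
//   if ($status === 404) { header('HTTP/1.0 404 Not Found'); }
//   include('./pages/' . $page . '.php');
```

The header() call has to come before any HTML is sent, which is why it belongs at the very top of the template.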
Prevent Browsing of the File Structure
It is also important that the clients are only allowed to request pages from your "pages" folder. That means the value stored in the $_GET['page'] variable is only allowed to be a string containing alphanumeric characters plus "-", "_", "." and spaces. In other words, a valid filename without the .php extension. There are various ways to check this: Perl-compatible regular expressions (PCRE), POSIX extended regular expressions, character type functions, or the normal string functions. I will use the PCRE functions preg_match and preg_replace to check for malicious requests and remove illegal characters. First I catch requests up or down the folder structure by checking for ".." or "/" in the $page variable. If there is anything suspicious in the query string, I just set the $page variable to "home". Then I remove all illegal characters from the $page variable by replacing them with an empty string. This code comes right after the line where the $_GET['page'] variable is validated and before the file_exists function is called:
// prevent file browsing
$page = (preg_match('/(\.\.|\/)/i', $page) ? 'home' : $page);
// replace illegal characters
$page = preg_replace('/[^a-zA-Z0-9 \._-]/', '', $page);
Now the user can only request valid files from the pages folder. All other requests are caught and dealt with.
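To get a feel for what these two lines do, you can wrap them in a small function and feed it a few requests. This is only a test sketch: sanitize_page() is a hypothetical name, and the sample inputs are made up.

```php
<?php
// Hypothetical wrapper around the two checks described above.
function sanitize_page($page)
{
    // prevent file browsing: anything containing ".." or "/" becomes "home"
    $page = (preg_match('/(\.\.|\/)/', $page) ? 'home' : $page);
    // replace illegal characters with an empty string
    return preg_replace('/[^a-zA-Z0-9 \._-]/', '', $page);
}

echo sanitize_page('products'), "\n";         // products
echo sanitize_page('../../etc/passwd'), "\n"; // home
echo sanitize_page('con<tact>'), "\n";        // contact
echo sanitize_page("links\0"), "\n";          // links (the null byte is removed)
```

Whatever comes out of the function still has to pass the file_exists check before it is included.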
Finished!
<?php
// check the $_GET['page'] variable
$page = ((isset($_GET['page']) && $_GET['page'] != '') ? $_GET['page'] : 'home');
// prevent file browsing
$page = (preg_match('/(\.\.|\/)/i', $page) ? 'home' : $page);
// replace illegal characters
$page = preg_replace('/[^a-zA-Z0-9 \._-]/', '', $page);
// check if the requested file exists
$page = (file_exists('./pages/'.$page.'.php') ? $page : 'error');
// and include the page
include('./pages/'.$page.'.php');
?>
<html>
<head>
<title><?php echo($title); ?></title>
</head>
<body>
<div class="menu">
<ul>
<li><a href="index.php?page=home">HOME</a></li>
<li><a href="index.php?page=contact">CONTACT</a></li>
<li><a href="index.php?page=links">LINKS</a></li>
<li><a href="index.php?page=products">PRODUCTS</a></li>
</ul>
</div>
<div class="content">
<?php echo($content); ?>
</div>
</body>
</html>
An example of an included file:
<?php
$title="Welcome";
$content="\t<h1>Welcome</h1>\n\t<p>to our super cool website!</p>";
?>
Of course the included files can contain more than just two variable declarations....
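For instance, a page file could build its content from a list instead of typing out every line by hand. The following is just a sketch of what a "pages/products.php" might look like; the product names are made up for the example.

```php
<?php
// Hypothetical pages/products.php: builds $content from an array.
$title = 'Products';
$products = array('Widget', 'Gadget'); // made-up sample data

// turn every product into a list item, escaping any special characters
$items = '';
foreach ($products as $name) {
    $items .= "\t\t<li>" . htmlspecialchars($name) . "</li>\n";
}

$content = "\t<h1>Our Products</h1>\n\t<ul>\n" . $items . "\t</ul>";
```

The template does not care how $title and $content were put together, as long as both are set by the time it echoes them.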
A Bit of Cleanup
There are some drawbacks to this method. The first one is that query strings do not make very pretty URLs. They also let people know what kind of technology you're using to generate your pages. This can be a plus or a minus; personally, I see it as a minus. There is a way of cleaning up your URLs using mod_rewrite. Getting mod_rewrite to work is too complicated a process for me to describe here, but if you've got it up and running on your site already, then you can use this rewrite rule to get rid of those query strings:
RewriteRule ^([A-Za-z0-9_]+)/?$ index.php?page=$1 [L]
Now a nice clean URL like http://www.yoursite.com/contact will be silently rewritten on the server to http://www.yoursite.com/index.php?page=contact, while the visitor's address bar still shows the clean URL.
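If you want to check which addresses the pattern will catch before touching your server configuration, you can try it out with preg_match, since mod_rewrite also uses Perl-compatible regular expressions. This is only a rough sketch: rewrite_target() is a hypothetical helper, and it assumes the rule sits in a per-directory context where the path arrives without a leading slash.

```php
<?php
// Hypothetical helper: applies the pattern from the rewrite rule to a
// request path and returns the rewritten target, or null if no match.
function rewrite_target($path)
{
    // '#' is used as the delimiter so the "/" needs no escaping
    if (preg_match('#^([A-Za-z0-9_]+)/?$#', $path, $match)) {
        return 'index.php?page=' . $match[1];
    }
    return null;
}

foreach (array('contact', 'contact/', 'our-products', 'pages/contact') as $path) {
    $target = rewrite_target($path);
    echo $path, ' -> ', ($target === null ? 'not rewritten' : $target), "\n";
}
// contact -> index.php?page=contact
// contact/ -> index.php?page=contact
// our-products -> not rewritten
// pages/contact -> not rewritten
```

Note that the pattern only allows letters, digits and underscores, so page names with a "-" or "." would need a wider character class in the rule.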
So, before you copy your layout fifty times into fifty new .html files, at least consider $_GET['ting'] pages (safely) with PHP.