Phase 1 leads to phase 2

Around Easter last year I wrote down about 20 different ideas I’d had for a while for new websites, plus add-ons and upgrades to the ones I already manage. My first project was scheduled to take around 2 weeks. It ended up taking 4 months. This was my long overdue upgrade to my Phishing Scams website, and it was really the first time in a long while that I’d had a good stab at advanced (and ultimately automatic) server-side data manipulation. I learned a heck of a lot doing it, and probably most importantly I realised the power of server-side scripting: the power of combining “real time” processing, which happens when webpage content is computed by the likes of PHP or ASP, with latent behind-the-scenes data processing using cron jobs, for example. Anyway, cutting a long story short, I saw the potential for a site that could take advantage of this data processing on a large scale. MillerSmiles told me it was possible, and so with that I moved onto project 2. Somewhere down the list of 20 potential projects was “a site containing useful UK placename data such as postcodes, dialling codes and local pubs and shops”. And so the idea of Big Red Directory was born.

As usual with my website development I kind of just jumped in at the deep end. Phase 1 was to get something live, and get it live soon. I started work on my very first web crawling bot at the end of the summer. This would be the information gatherer, taking bits of the web back to the site’s back end where they could be processed, parsed and pushed into a directory useful for the site’s visitors. It took a while. From spending my nights on holiday in Vancouver last November programming in the beautiful Bacchus bar at the Vancouver Wedgwood, to spending long nights over the Christmas period entranced in PHP wonderland, I managed to get something up and running.

It’s really taken me by surprise how useful this data is. Having told friends and family about the site, it’s already found its way onto bookmark lists. Over the last 2 months I’ve upgraded the web crawler to start searching for opening times and daily opening hours for each of the venues listed. This is still a work in progress and is ultimately going to form a large part of my work over the next couple of months: making sure the data is watertight. At the moment it’s a little shaky but ultimately still useful. Some opening times that are being drawn in are just plain wrong. For example, the opening times for R Gilbertson antiques in Hastings are plain out of whack. I won’t go into detail about how the bot works, but it essentially scours the web looking for keywords and venue names, and once it hits a website it deems an authority on the venue (which is very hopefully the domain name of the pub, shop or whatever) it looks for opening time data and sends it back to the site. I really need to get some kind of logging for the bot up and running soon; at the moment I liken the bot to a firecracker out of control. Well, to an extent.
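
I won’t reproduce the bot here, but the extraction step is the sort of thing you can sketch in a few lines. This is purely illustrative — the function name and regex below are made up for this post, not the actual bot code — but it shows the general idea of pulling “day: time – time” fragments out of a fetched page:

```php
<?php
// Illustrative sketch only: a simplified version of the kind of
// pattern-matching an opening-times crawler might do. Function name
// and regex are hypothetical, not the real bot.

// Pull "Monday: 9am - 5.30pm"-style fragments out of a page's HTML.
function extract_opening_times($html)
{
    $text = strip_tags($html);
    $pattern = '/\b(Mon|Tue|Wed|Thu|Fri|Sat|Sun)[a-z]*\s*:?\s*'
             . '(\d{1,2}(?:[:.]\d{2})?\s*(?:am|pm)?)\s*(?:-|to)\s*'
             . '(\d{1,2}(?:[:.]\d{2})?\s*(?:am|pm)?)/i';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $times = array();
    foreach ($matches as $m) {
        $times[] = array(
            'day'    => $m[1],
            'opens'  => trim($m[2]),
            'closes' => trim($m[3]),
        );
    }
    return $times;
}
```

The hard part, of course, isn’t the regex — it’s deciding which pages to trust, which is where the “authority” check comes in.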

What I am really pleased with recently is a bolt-on I made for the bot last week. The extra bit of code enables the bot to successfully find all the branches and opening times of UK post offices. Now this really does work well. A couple of examples:

  • Ore, Hastings post office opening hours
  • Pett, Hastings post office opening hours

Look at the opening times for Pett post office. They even have a lunchtime closing time! This is something now ready to take to the next level. I’m pretty pleased with the results for various supermarkets and big restaurant chains, e.g. Aintree McDonald’s. I think the next stage is to start looking more at retail chains. I had a go at Boots and WHSmith, and they seem to have worked OK. Next up is things like New Look, Homebase, B&Q etc.

Anyway, so that’s really the summary of phase 1. Phase 2 is hopefully going to be a bit smoother. Phase 2 is acknowledging that phase 1 involved getting the site up quickly to see whether there was any merit in aggressively pushing the site forward on a longer-term basis. Well, the traffic stats suggest there is. People obviously find the website useful and informative, and it clearly offers information that can’t be readily seen elsewhere. I’m hoping that phase 2 development follows something like a 10 degree trajectory, as opposed to the 60 degrees of phase 1.

There are quite a few things I want to sort out, a lot of them actually on-page rather than big back end coding issues. I’m going to start moving forward with these this weekend. Some things which readily come to mind:

  • The big one is the structure of Big Red Directory. At the moment, at its deepest (and most pages are in this category), the directory is 5 directories deep. If the page is a pub or other venue it follows /rp/Pub/Place_Name/Pub_Name/Unique_ID. By the end of phase 2 I will have got rid of /rp and /Unique_ID. These are lazy programming pointers: /rp lets me know it’s a venue as opposed to a placename, and the Unique_ID acts as a checksum — well, a unique ID, exactly as it says. We don’t need either. Similarly, I want each directory to have a base page, and the base shouldn’t just be stuffed with 4,000 links. They should be sorted alphabetically into pages of, say, 200 links each. At the moment /rp/Pub/Place_Name gives a 404. This is bad. It should give a list of pubs for Place_Name. The real problem here is how to handle and distribute the directory structure over mod_rewrite via the .htaccess. What I currently envisage is a general handler page for all mod_rewrite requests. So… all the directory names are turned into an array, and the handler goes through the array and determines whether the request is a venue, a placename or otherwise. I don’t really think this should be too hard, although using a page handler might cause some ill side effects, which brings me onto point 2.
  • Page load time. It’s not massive at the moment, but it’s slower than I’d like. Most pages are taking about 1.5–2 seconds to load. I think this is probably too slow, and my worry is that using a page handler for mod_rewrite page redistribution might add to the issue. I know what the problem is in general here: there’s a lot of database usage on every page of the website. In other words, there’s a lot of on-page processing as opposed to back-end processing. Now, I could hardcode more things, but that would destroy the much-needed flexibility of the website. The solution is going to have to be more efficient code, and that’s definitely possible. Phase 1 was always just about getting the site online and seeing whether the internet community had a use for it, so there are portions of code that can be streamlined or cut altogether.
  • Simpler one – make a proper 404 error page. Needs a link back to homepage. We’ve got a big directory here, there’s plenty of room for mistyped URLs.
  • Sitemap. Need to create an XML sitemap. Shouldn’t be hard at all. In fact I think there are web based tools that can spew one out for you in a few seconds. Obviously need to get point 1 done first – sorting out the directory structure.
  • More efficient page interlinking. I like what I’ve done at the moment: for every pub I list other pubs in the area. It’s really the anchor text I’m on about here. Instead of putting the name and full address of the place inside the link’s anchor text, we can probably just use the name plus a high-level location (i.e. not street number and name) and then give the full address after the link. I think that should make things a bit easier on the eye, and it’s obviously very simple to implement across the site.
  • Start looking at how we can collate reviews and pictures of places, almost like Google Images. Now, this is not really a concrete part of the plan. The last thing I want to do is just copy reviews and spam my own directory of useful information. I think the image thing might be useful though. Anyway, I need to think about this; I don’t know whether it’s worth doing at the moment or not.
  • More generally, and probably the biggest point – keep improving the robot. I have no specific aims here. It does a good job and has shown its worth with the Post Office data. I’ll just keep building it.
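
To make point 1 a bit more concrete, here’s roughly how I picture the general handler working. Everything below is a sketch under my own assumptions — the rewrite rule, file name `handler.php` and the routing logic are hypothetical, not the live site’s code:

```php
<?php
// Sketch of the single-handler idea from point 1 (names hypothetical).
// An .htaccess fragment would funnel every request to this script:
//
//   RewriteEngine On
//   RewriteCond %{REQUEST_FILENAME} !-f
//   RewriteRule ^(.*)$ handler.php?path=$1 [L,QSA]

// Split the requested path into segments and decide what it is.
function route($path)
{
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));

    if (count($segments) === 0) {
        return array('type' => 'home');
    }
    // e.g. Pub/Hastings/The_Anchor -> a single venue page
    if (count($segments) >= 3) {
        return array(
            'type'  => 'venue',
            'kind'  => $segments[0],   // Pub, Shop, ...
            'place' => $segments[1],
            'venue' => $segments[2],
        );
    }
    // e.g. Pub/Hastings -> a paginated listing page, not a 404
    if (count($segments) === 2) {
        return array('type'  => 'listing',
                     'kind'  => $segments[0],
                     'place' => $segments[1]);
    }
    return array('type' => 'category', 'kind' => $segments[0]);
}
```

The nice side effect is that the two-segment case falls out naturally as the alphabetically paginated listing page, which kills the current 404 problem at the same time.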
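
On the page load worry in point 2, one cheap option short of hardcoding is to cache expensive query results between requests. A minimal sketch, assuming a file-based cache (the function name, key and lifetime here are made up for illustration):

```php
<?php
// Sketch of one way to cut per-page database work (point 2): cache a
// query result or page fragment to a file and reuse it until stale.
// Paths, keys and lifetimes are illustrative, not the site's code.

function cached($key, $lifetime, $build)
{
    $file = sys_get_temp_dir() . '/brd_' . md5($key) . '.cache';

    // Fresh enough? Serve the stored copy and skip the database.
    if (file_exists($file) && (time() - filemtime($file)) < $lifetime) {
        return unserialize(file_get_contents($file));
    }
    $value = $build();   // the expensive bit: DB query, HTML fragment...
    file_put_contents($file, serialize($value));
    return $value;
}
```

Something like `cached('pubs_Hastings', 3600, ...)` around the “other pubs in the area” query would mean the database only gets hit once an hour per place, while keeping the flexibility that hardcoding would destroy.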
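
And the XML sitemap really should be trivial once the directory structure is sorted — it’s just the standard sitemaps.org format wrapped around a list of URLs. A rough sketch (the URLs would really come out of the database; the example domain is just a placeholder):

```php
<?php
// Sketch of the sitemap idea from the list above: given a list of page
// URLs, emit a minimal sitemaps.org-format XML sitemap.

function build_sitemap(array $urls)
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        $xml .= '  <url><loc>' . htmlspecialchars($url) . "</loc></url>\n";
    }
    $xml .= '</urlset>';
    return $xml;
}
```

Run nightly from the same cron setup the bot uses, and the sitemap stays in step with the directory for free.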

Right, getting late now. Plenty to think about.

Good night.
