Going from a Live Site to a Database

3 03 2009

We all know that it’s a web developer’s job to turn a database into a live web site (and often to design the database, too). But what about when a client wants to take the data on a live site and turn it back into a database or data file?

  1. Stage #1 – download the site
    Retrieving the site requires planning and foresight for the next step. Using wget, my first approach was to mirror the entire site:

    wget -m -w2 -x -E -k -D "www.thesite.com" -p http://www.thesite.com

    I was greeted with a restricted message, which immediately led me to believe that wget was violating a robots.txt or similar restriction. I added the following to the above command:

    -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)"

    which set the user agent to match my own browser.

  2. Stage #2 – prepare the files for pattern matching
    The easiest method I could imagine was to combine all the (relevant) files into one file. I’m no Unix expert, but I was slightly familiar with cat. After some research, I combined cat with find, and later learned some xargs to make cat work for the job. But I kept running into problems: sometimes with syntax, sometimes with argument list size, and sometimes with other unclear file errors. (A PHP sketch of this combining step appears after this list.)
  3. Stage #1 revisited – download a second site
    I decided to try the next site by downloading the pages directly into one file, using:

    wget -r -l inf --no-remove-listing -w2 -D "www.site2.com" http://www.site2.com -np -O site2.html -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)"
  4. Stage #3 – the script
    My approach was to use a simple script with regular expressions to convert the data into a CSV file. JavaScript would have run pretty slowly, so I modified an existing PHP script for our purposes.
    I gzipped the files on my Unix box and copied them to my local Windows machine.
    After getting the right regular expression using The Regex Coach, I pushed the data file through, but hit some problems. As it turns out, one large file (exactly what I had been trying to achieve in Stage #2) didn’t fare well with preg_match_all. It seems the function hit a limit (likely PCRE’s backtracking limit) and stopped returning results.
    The previous script was equipped to read a directory and process each file independently. So I found a class to emulate the Unix split function, dividing the huge string into smaller, tolerable files. Of course, an arbitrary split might have cut right through the blocks my regular expression was matching. So instead I split the multi-file file into single-file strings

    $ar = explode('<html', $str);

    and handled each original “file” independently. (A rough PHP sketch of this approach appears after this list.)
    Voila!

  5. Stage #2 revisited – what to do with a complete site
    Now I have a site with lots of pages in lots of directories. What to do? I tried using Windows Search/Explorer to find the relevant files among the many directories and copy them into a single directory to be processed by our now flexible script. But Search only returned 5,000 results (I found out later the file count was closer to 70,000), and when I tried to copy/move the files into a single directory, Windows hung for a few hours, gobbled the CPU and did nothing. A quick search found me XXCOPY, which I installed easily; it flattened the directories, and our script ran without a hitch. (A PHP-only alternative that skips the flattening step is sketched after this list.)
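
If you’re curious what I meant in Stage #2, the combining step can also be done from PHP itself rather than with cat, find and xargs. This is only a minimal sketch with made-up names (site_pages, combined.html), not part of my actual script:

    <?php
    // A minimal sketch (made-up paths): append every .html file in a
    // directory into one combined file.
    $dir = 'site_pages';
    $out = fopen('combined.html', 'w');

    foreach (glob($dir . '/*.html') as $file) {
        // Streaming one file at a time avoids the "argument list too long"
        // style limits a huge cat command can run into.
        fwrite($out, file_get_contents($file));
        fwrite($out, "\n");
    }

    fclose($out);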
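
The heart of Stage #3 looks roughly like this: split the single big download into per-page strings, run a regular expression over each one, and write the captures out as CSV rows. Again, this is only a sketch; the pattern, the field names and data.csv are placeholders, not what my script really matches:

    <?php
    // A minimal sketch: explode the combined download into per-page chunks,
    // match each chunk with a placeholder pattern, and write CSV rows.
    $str   = file_get_contents('site2.html');
    $pages = explode('<html', $str);   // one entry per original page
    $csv   = fopen('data.csv', 'w');

    foreach ($pages as $page) {
        if (trim($page) === '') {
            continue;                  // skip the empty leading chunk
        }
        // Placeholder pattern: grab a title and a price from each page.
        $pattern = '/<h2>(.*?)<\/h2>.*?<span class="price">(.*?)<\/span>/s';
        if (preg_match_all($pattern, $page, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $m) {
                fputcsv($csv, array($m[1], $m[2]));
            }
        }
    }

    fclose($csv);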
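
And for Stage #2 revisited, a PHP-only alternative would have been to skip the flattening entirely and let the script walk the mirrored directory tree itself, using the SPL iterators. Here process_file() is just a stand-in for the real per-page work, and the mirror path is made up:

    <?php
    // A minimal sketch: recurse through wget's mirror directory and hand
    // every .html file to a processing routine, no flattening required.
    $root = 'www.thesite.com';   // made-up mirror directory

    // Stand-in for the real per-page processing (e.g. the regex-and-CSV
    // step from the previous sketch).
    function process_file($path) {
        echo $path, "\n";
    }

    $it = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root));
    foreach ($it as $file) {
        if ($file->isFile() && preg_match('/\.html?$/i', $file->getFilename())) {
            process_file($file->getPathname());
        }
    }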

I’m sure those Unix experts out there have quicker ways to do it, but now I have a flexible script that can operate on a number of input types fairly easily, so I’m happy.

Let me know if you’re interested in seeing this PHP script.