Going from a Live Site to a Database

3 03 2009

We all know that it’s a web developer’s job to turn a database into a live web site (and often to design the database too). But what about when a client wants to take the data on a live site and turn it into a database or data file?

  1. Stage #1 – download the site
    Retrieving the site requires planning and foresight for the next step. Using wget, my first approach was to mirror the entire site:

    wget -m -w2 -x -E -k -D  "www.thesite.com" -p http://www.thesite.com

    I was greeted with a restricted message, which led me immediately to believe that wget was violating a robots.txt or similar restriction. I added to the above command:

    -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv: Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)"

    which set the user agent to match my browser’s.

  2. Stage #2 – prepare the files for pattern matching
    The easiest method I could imagine using was to combine all the (relevant) files into one file. I’m no Unix expert, but I was slightly familiar with cat. After some research, I combined cat with find, and later learned some xargs to make cat work for the job. But I kept running into problems: sometimes with syntax, sometimes with argument list size, and sometimes with other unclear file errors.
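
A minimal sketch of the find, xargs and cat combination described above, not the commands actually used. The function name combine_html and the '*.html' pattern are illustrative assumptions, and -print0 / -0 are GNU/BSD extensions rather than strict POSIX.

```shell
#!/bin/sh
# Hypothetical helper: concatenate every .html file under SRC_DIR into OUT_FILE.
# Usage: combine_html SRC_DIR OUT_FILE
combine_html() {
    src_dir=$1
    out_file=$2
    # -print0 and -0 pass file names NUL-separated, so names containing
    # spaces or quotes do not trip up xargs. xargs also splits the list
    # into batches, which avoids "argument list too long" errors.
    find "$src_dir" -type f -name '*.html' -print0 | xargs -0 cat > "$out_file"
}
```

Something like `combine_html www.thesite.com all-pages.txt` would then produce the single file. The output file should live outside the directory being searched (or not match the pattern), otherwise cat may end up reading its own output. On systems without -print0, `find "$src_dir" -type f -name '*.html' -exec cat {} +` is a POSIX alternative that batches arguments the same way.
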
  3. Stage #1 revisited – download a second site
    I decided to try the next site by downloading the pages directly into one file, using:

    wget -r -l inf --no-remove-listing -w2 -D "www.site2.com" http://www.site2.com -np -O site2.html -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv: Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)"
  4. Stage #3 – the script
    My approach was to use a simple script with regular expressions to convert the data into a CSV file. JavaScript would run pretty slowly, so I modified an existing PHP script for our purposes.
    I gzipped the files on my Unix box and copied them to my local Windows machine.
    After getting the right regular expression using The Regex Coach, I pushed the data file through, but hit some problems. As it turns out, one large file, which is what I was trying to achieve in Stage #2, didn’t fare well with preg_match_all. It seems the function hit a limit and stopped returning results.
    The previous script was equipped to read a directory and process each file independently. So I found a class to emulate the Unix split command, dividing the huge string into smaller, tolerable files. Of course, splitting at arbitrary points might cut a block in two and break the regular expression match. So instead I split the multi-file file into single-file strings

    $ar = explode('<html', $str);

    and handled each original “file” independently.
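
The same split can be roughed out on the Unix side before copying anything over. This is only a sketch under assumptions: split_pages and the page_NNNNN.html naming are made up for illustration, and like the explode call it matches the literal, lowercase string <html only.

```shell
#!/bin/sh
# Hypothetical helper: cut one combined download into one file per page,
# starting a new file at every occurrence of the literal string "<html".
# Usage: split_pages COMBINED_FILE OUT_DIR
split_pages() {
    combined=$1
    out_dir=$2
    mkdir -p "$out_dir"
    awk -v dir="$out_dir" '
        BEGIN { n = 0; out = sprintf("%s/page_%05d.html", dir, n) }
        {
            line = $0
            # A line may hold the tail of one page and the start of the next.
            while ((pos = index(line, "<html")) > 0) {
                printf "%s", substr(line, 1, pos - 1) > out
                close(out)                 # keep the number of open files small
                n++
                out = sprintf("%s/page_%05d.html", dir, n)
                printf "%s", "<html" > out # keep the marker with its page
                line = substr(line, pos + 5)
            }
            print line > out
        }
    ' "$combined"
}
```

Running `split_pages site2.html pages` would leave anything before the first <html (a DOCTYPE line, say) in page_00000.html, just as the first element of the exploded array holds it. Unlike explode, the sketch keeps the <html marker at the start of each page.
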

  5. Stage #2 revisited – what to do with a complete site
    Now I have a site with lots of pages in lots of directories. What to do? I tried using Windows Search/Explorer to find the relevant files in the many directories and copy them into a single directory to be processed by our now flexible script. But Search only returned 5,000 results (I found out later the file count was closer to 70,000), and when I tried to copy/move the files into a single directory, Windows hung for a few hours, gobbled the CPU and did nothing. A quick search found me XXCOPY, which installed easily. I used it to flatten the directories, then ran our script without a hitch.
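
For comparison, the flattening step could also be done on the Unix box before the files ever reach Windows. The sketch below is not what XXCOPY does internally. The name flatten_html, the '*.html' filter and the numeric prefix (used so that same-named files from different directories do not overwrite each other) are all assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy every .html file under SRC_DIR into the single
# directory DEST_DIR, prefixing each copy with a counter to keep names unique.
# Usage: flatten_html SRC_DIR DEST_DIR   (DEST_DIR must be outside SRC_DIR)
flatten_html() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir"
    n=0
    # Reading one path per line copes with spaces in names, but not with
    # newlines in names, which should be rare in a mirrored site.
    find "$src_dir" -type f -name '*.html' | while IFS= read -r path; do
        n=$((n + 1))
        cp "$path" "$dest_dir/$(printf '%06d' "$n")_$(basename "$path")"
    done
}
```

For example, `flatten_html www.thesite.com flat` would turn a/index.html and b/index.html into two differently numbered files ending in _index.html inside flat.
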

I’m sure the Unix experts know quicker ways to do it, but now I have a flexible script that can operate on a number of input types fairly easily, so I’m happy.

Let me know if you’re interested in seeing this PHP script.

