We all know that it’s a web developer’s job to turn a database into a live website (and often to design the database too). But what about when a client wants to take the data on a live site and turn it into a database or data file?
- Stage #1 – download the site
Retrieving the site requires planning and foresight for the next step. Using wget, my first approach was to mirror the entire site:
wget -m -w2 -x -E -k -D "www.thesite.com" -p http://www.thesite.com
I was greeted with a restricted message, which immediately led me to believe that wget was violating a robots.txt or similar restriction. I added to the above command:
-U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)"
which set the user agent to match my browser’s.
- Stage #2 – prepare the files for pattern matching
The easiest method I could imagine was to combine all the (relevant) files into one file. I’m no Unix expert, but I was slightly familiar with cat. After some research, I combined cat with find, and later learned enough xargs to make cat work for the job. But I kept running into problems: sometimes with syntax, sometimes with argument list size, and sometimes with other unclear file errors.
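For reference, here is a minimal sketch of the find/xargs approach, not the exact commands I ran. The function name combine_pages and the *.html pattern are assumptions (wget's -E option adds an .html extension to saved pages, so the pattern should catch them). Passing file names from find to xargs means cat is run in batches, which is what avoids the "argument list too long" error.

```shell
# combine_pages SRC_DIR OUT_FILE   (hypothetical helper name)
# Concatenate every .html file under SRC_DIR into OUT_FILE.
# Keep OUT_FILE outside SRC_DIR so cat never reads its own output.
combine_pages() {
  src=$1
  out=$2
  : > "$out"   # start with an empty output file
  # -print0 and -0 keep file names with spaces intact; xargs calls cat
  # in batches, so the argument list never exceeds the system limit.
  find "$src" -type f -name '*.html' -print0 | xargs -0 cat >> "$out"
}
```

Something like `combine_pages www.thesite.com all-pages.html` would then produce the single file. Note that -print0 and -0 are widely supported extensions (GNU and BSD), so this assumes a reasonably modern find and xargs.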
- Stage #1 revisited – download a second site
I decided to try the next site by downloading the pages directly into one file, using:
wget -r -l inf --no-remove-listing -w2 -D "www.site2.com" http://www.site2.com -np -O site2.html -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)"
- Stage #3 – the script
I gzipped the files on my Unix box and copied them to my local Windows machine.
After getting the right regular expression using The Regex Coach, I pushed the data file through, but hit some problems. As it turns out, one large file, which is what I was trying to achieve in Stage #2, didn’t fare well with preg_match_all. It seems the function hit a limit and stopped returning results.
The previous script was equipped to read a directory and process each file independently. So I found a class to emulate the Unix split command, dividing the huge string into smaller, tolerable files. Of course, splitting at arbitrary points could cut a block in two and break the regular expression match. So instead I split the multi-file string back into single-file strings:
$ar = explode('<html', $str);
and handled each original “file” independently.
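The same idea can be sketched on the command line, for anyone who would rather split the combined file before it reaches the script. This is an awk-based stand-in for the PHP explode call, not the code I used; the function name split_pages and the page-NNNNN.html naming are assumptions. Unlike explode, it works line by line, so it assumes each page's `<html` tag starts on its own line and is lowercase.

```shell
# split_pages COMBINED_FILE OUT_DIR   (hypothetical helper name)
# Write each page of COMBINED_FILE to its own file in OUT_DIR,
# starting a new file at every line that contains "<html".
split_pages() {
  combined=$1
  out=$2
  mkdir -p "$out"
  awk -v dir="$out" '
    /<html/ {                    # a new page starts on this line
      if (f != "") close(f)      # avoid running out of open files
      n++
      f = sprintf("%s/page-%05d.html", dir, n)
    }
    f != "" { print > f }        # lines before the first <html are dropped
  ' "$combined"
}
```

Running `split_pages site2.html pages` would leave one file per page in the pages directory, ready for a script that processes a directory file by file.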
- Stage #2 revisited – what to do with a complete site
Now I have a site with lots of pages in lots of directories. What to do? I tried using Windows Search/Explorer to find the relevant files from the many directories and copy them into a single directory to be processed by our now flexible script. But Search only returned 5,000 results (I found out later the file count was closer to 70,000), and when I tried to copy/move the files into a single directory, Windows hung for a few hours, gobbled the CPU and did nothing. A quick search turned up XXCOPY, which installed easily. I used it to flatten the directories, then ran our script without a hitch.
I’m sure the Unix experts know quicker ways to do it, but now I have this flexible script that can operate on a number of input types fairly easily, so I’m happy.
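On a Unix box, the flattening step could probably be done with find instead. The following is a rough sketch under my own assumptions, not a description of what XXCOPY does: the function name flatten_pages is made up, only *.html files are copied, and each copy is named after its original path with the slashes replaced by underscores so that two index.html files from different directories don't overwrite each other.

```shell
# flatten_pages SRC_DIR DEST_DIR   (hypothetical helper name)
# Copy every .html file under SRC_DIR into the single directory DEST_DIR.
# DEST_DIR should live outside SRC_DIR so find does not revisit the copies.
flatten_pages() {
  src=${1%/}     # drop a trailing slash, if any
  dest=$2
  mkdir -p "$dest"
  find "$src" -type f -name '*.html' | while IFS= read -r path; do
    rel=${path#"$src"/}                        # path relative to SRC_DIR
    flat=$(printf '%s' "$rel" | tr '/' '_')    # a/b/index.html -> a_b_index.html
    cp "$path" "$dest/$flat"
  done
}
```

For example, `flatten_pages www.thesite.com flat` would fill the flat directory with one file per page. Because the loop handles one file at a time, it should cope with tens of thousands of files, though it will not be fast, and file names containing newlines are not handled.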
Let me know if you’re interested in seeing this PHP script.