Going from a Live Site to a Database

3 03 2009

We all know that it’s a web developer’s job to turn a database into a live web site (and often to design the database, too). But what about when a client wants to take the data on a live site and turn it back into a database or data file?

  1. Stage #1 – download the site
    Retrieving the site requires planning and foresight for the next step. Using wget, my first approach was to mirror the entire site:

    wget -m -w2 -x -E -k -D "www.thesite.com" -p http://www.thesite.com

    I was greeted with a restricted message, which immediately led me to believe that wget was violating a robots.txt or similar restriction. I added the following to the above command:

    -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)"

    which set the user agent to match my own browser.

  2. Stage #2 – prepare the files for pattern matching
    The easiest method I could imagine was to combine all the (relevant) files into one file. I’m no Unix expert, but I was slightly familiar with cat. After some research, I combined cat with find, and later learned some xargs to make cat work for the job. But I kept running into problems: sometimes with syntax, sometimes with argument list size, and sometimes with other unclear file errors. (A PHP sketch of this combining step appears after this list.)
  3. Stage #1 revisited – download a second site
    I decided to try the next site by downloading the pages directly into one file, using:

    wget -r -l inf --no-remove-listing -w2 -D "www.site2.com" http://www.site2.com -np -O site2.html -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)"
  4. Stage #3 – the script
    My approach was to use a simple script with regular expressions to convert the data into a CSV file. JavaScript would have run pretty slowly, so I modified an existing PHP script for our purposes.
    I gzipped the files on my Unix box and copied them to my local Windows machine.
    After getting the right regular expression using The Regex Coach, I pushed the data file through, but hit some problems. As it turns out, one large file (exactly what I had been trying to achieve in Stage #2) didn’t fare well with preg_match_all. It seems the function hit a limit (likely PCRE’s backtracking limit) and stopped returning results.
    The previous script was equipped to read a directory and process each file independently. So I found a class to emulate the Unix split function, dividing the huge string into smaller, tolerable files. Of course, an arbitrary split might have cut right through the blocks my regular expression was matching. So instead I split the multi-file file into single-file strings

    $ar = explode('<html', $str);

    and handled each original “file” independently. (A rough PHP sketch of this approach appears after this list.)
    Voila!

  5. Stage #2 revisited – what to do with a complete site
    Now I have a site with lots of pages in lots of directories. What to do? I tried using Windows Search/Explorer to find the relevant files among the many directories and copy them into a single directory to be processed by our now flexible script. But Search only returned 5,000 results (I found out later the file count was closer to 70,000), and when I tried to copy/move the files into a single directory, Windows hung for a few hours, gobbled the CPU and did nothing. A quick search found me XXCOPY, which I installed easily; it flattened the directories, and our script ran without a hitch. (A PHP-only alternative that skips the flattening step is sketched after this list.)
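
If you’re curious what I meant in Stage #2, the combining step can also be done from PHP itself rather than with cat, find and xargs. This is only a minimal sketch with made-up names (site_pages, combined.html), not part of my actual script:

    <?php
    // A minimal sketch (made-up paths): append every .html file in a
    // directory into one combined file.
    $dir = 'site_pages';
    $out = fopen('combined.html', 'w');

    foreach (glob($dir . '/*.html') as $file) {
        // Streaming one file at a time avoids the "argument list too long"
        // style limits a huge cat command can run into.
        fwrite($out, file_get_contents($file));
        fwrite($out, "\n");
    }

    fclose($out);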
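
The heart of Stage #3 looks roughly like this: split the single big download into per-page strings, run a regular expression over each one, and write the captures out as CSV rows. Again, this is only a sketch; the pattern, the field names and data.csv are placeholders, not what my script really matches:

    <?php
    // A minimal sketch: explode the combined download into per-page chunks,
    // match each chunk with a placeholder pattern, and write CSV rows.
    $str   = file_get_contents('site2.html');
    $pages = explode('<html', $str);   // one entry per original page
    $csv   = fopen('data.csv', 'w');

    foreach ($pages as $page) {
        if (trim($page) === '') {
            continue;                  // skip the empty leading chunk
        }
        // Placeholder pattern: grab a title and a price from each page.
        $pattern = '/<h2>(.*?)<\/h2>.*?<span class="price">(.*?)<\/span>/s';
        if (preg_match_all($pattern, $page, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $m) {
                fputcsv($csv, array($m[1], $m[2]));
            }
        }
    }

    fclose($csv);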
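
And for Stage #2 revisited, a PHP-only alternative would have been to skip the flattening entirely and let the script walk the mirrored directory tree itself, using the SPL iterators. Here process_file() is just a stand-in for the real per-page work, and the mirror path is made up:

    <?php
    // A minimal sketch: recurse through wget's mirror directory and hand
    // every .html file to a processing routine, no flattening required.
    $root = 'www.thesite.com';   // made-up mirror directory

    // Stand-in for the real per-page processing (e.g. the regex-and-CSV
    // step from the previous sketch).
    function process_file($path) {
        echo $path, "\n";
    }

    $it = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root));
    foreach ($it as $file) {
        if ($file->isFile() && preg_match('/\.html?$/i', $file->getFilename())) {
            process_file($file->getPathname());
        }
    }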

I’m sure those Unix experts out there have quicker ways to do it, but now I have a flexible script that can operate on a number of input types fairly easily, so I’m happy.

Let me know if you’re interested in seeing this PHP script.