OFAC compliance with ElasticSearch

19 01 2015

The mission: To create a search engine capable of detecting possible matches between our contact entries and the OFAC Specially Designated Nationals (SDN) List.

The background: I was hired to develop highly compliant software that tracks and facilitates the movement of funds between the U.S. and Israel. Recently, the U.S. government has taken a hard line against corporations moving US dollars offshore on behalf of US citizens. Compliance demands that customers wishing to move funds be checked against the OFAC SDN list, as well as the new consolidated non-SDN list, which contains smaller lists such as the Palestinian Legislative Council (NS-PLC) list.

OFAC does not provide an API for querying this list, but it has launched a search page alongside its publication (and regular updates) of plain text files. The search page description provides adequate detail regarding the matching algorithms and basic expectations: character and word matching, plus phonetics. Phonetic matching allows for basic spelling mistakes or discrepancies, particularly when dealing with non-English names.
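
To get a feel for what phonetic matching buys us, PHP’s built-in soundex() and metaphone() functions are a quick illustration (Double Metaphone, which we configure below, is a refinement of the latter). This is just a sketch of the idea, not OFAC’s exact algorithm:

// Spelling variants of the same name tend to share a phonetic code,
// which is what lets a search forgive typos and transliteration drift.
foreach (array('Mohammed', 'Muhamad', 'Mohamed') as $name) {
    echo str_pad($name, 10), ' soundex: ', soundex($name), '  metaphone: ', metaphone($name), "\n";
}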

Given our development phase and low starting volume, I developed the front end and hooked it up to a free API available online. Our testing, however, showed that the fuzzy logic it implemented was not sufficient for our purposes (and it raised other concerns, such as alias matching, list updates, and its closed source). A cup of coffee and some Google/Stackoverflow searches later, and I was sold on ElasticSearch.

Getting started: I put my frugality aside and decided to test development on a new AWS instance. I was happy to find that Bitnami had just the machine image I needed to get up and running in a snap.

For some reason, Pageant wasn’t working for me, so I loaded my private key directly into PuTTY and was able to log in with the “bitnami” user.

Going a bit out of development order, let’s first create our script to download the text files:

vi /opt/bitnami/elasticsearch/scripts/fetch_ofac.sh

#! /bin/sh
cd /opt/bitnami/elasticsearch/scripts/
wget http://www.treasury.gov/ofac/downloads/sdn.csv -O sdn_prim.csv
wget http://www.treasury.gov/ofac/downloads/add.csv -O sdn_add.csv
wget http://www.treasury.gov/ofac/downloads/alt.csv -O sdn_alt.csv
wget http://www.treasury.gov/ofac/downloads/sdn_comments.csv -O sdn_comments.csv
wget http://www.treasury.gov/ofac/downloads/consolidated/cons_prim.csv -O cons_prim.csv
wget http://www.treasury.gov/ofac/downloads/consolidated/cons_add.csv -O cons_add.csv
wget http://www.treasury.gov/ofac/downloads/consolidated/cons_alt.csv -O cons_alt.csv
wget http://www.treasury.gov/ofac/downloads/consolidated/cons_comments.csv -O cons_comments.csv
echo "Done fetching OFAC files"

We are downloading the files right into the ES scripts folder, and renaming some so we get a consistent “list_type.ext” file name.

Configuring ElasticSearch

The first step was installing the plugins I needed: namely, the CSV-river plugin and the phonetic plugin. The former allows us to run our download script and index our CSV files, repeating as often as we need. The latter lets us conduct searches using our choice of the most popular “sounds-like” algorithms, not just the Soundex used by the OFAC search tool.

cd /opt/bitnami/elasticsearch/
sudo bin/plugin -install elasticsearch/elasticsearch-analysis-phonetic/2.4.1
sudo bin/plugin -install river-csv -url https://github.com/AgileWorksOrg/elasticsearch-river-csv/releases/download/2.1.1/elasticsearch-river-csv-2.1.1.zip

*As of the time of this writing, elasticsearch-river-csv 2.1.1 did not support parent-child relationships, but luckily I found an old fork of the code which did (without breaking anything else in my setup), so I later ran:

sudo /opt/bitnami/elasticsearch/bin/plugin -uninstall river-csv
sudo /opt/bitnami/elasticsearch/bin/plugin -install river-csv -url https://github.com/Trax-air/elasticsearch-river-csv/releases/download/2.1.1/elasticsearch-river-csv-2.1.1-SNAPSHOT.zip

To modify the ES settings and make phonetic search the default, I ran:
sudo vi /opt/bitnami/elasticsearch/config/elasticsearch.yml

Bitnami had the “network.publish_host” setting configured to some IP, so I just commented it out so that the setting is derived automatically. Then I added to the end of the file:

index.analysis.analyzer.default.type: custom
index.analysis.analyzer.default.tokenizer: standard
index.analysis.analyzer.default.filter: [standard, lowercase, my_metaphone]
index.analysis.filter.my_metaphone.type: phonetic
index.analysis.filter.my_metaphone.encoder : doublemetaphone
index.analysis.filter.my_metaphone.replace: false

which is probably equivalent to the nested YAML below, but I wasn’t sure whether the nested form was being picked up, so I figured I’d best stick to the flat convention already used in the file.

index :
  analysis :
    analyzer :
      default :
        type : custom
        tokenizer : standard
        filter : [standard, lowercase, my_metaphone]
    filter :
      my_metaphone :
        type : phonetic
        encoder : doublemetaphone
        replace : false

Restart the server using
sudo /opt/bitnami/ctlscript.sh restart

or just ES with
sudo /opt/bitnami/elasticsearch/scripts/ctl.sh stop
sudo /opt/bitnami/elasticsearch/scripts/ctl.sh start

Since the SDN and non-SDN lists share the same format and do not reuse unique IDs, I was able to load both lists into the same types.
My mappings are:

curl -XPOST 'http://localhost:9200/ofac' -d '{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0
  },
  "mappings" : {
    "prim" : {
      "properties" : {
        "ent_num" : {"type" : "integer"},
        "SDN_Name" : {
          "type" : "string",
          "fields" : {
            "raw" : {"type" : "string", "index" : "not_analyzed"}
          }
        },
        "SDN_Type" : {"type" : "string", "index" : "not_analyzed"}
      }
    },
    "alt" : {
      "_parent" : {
        "type" : "prim"
      },
      "properties" : {
        "ent_num" : {"type" : "integer", "index" : "not_analyzed"},
        "alt_num" : {"type" : "integer"},
        "alt_type" : {"type" : "string"},
        "alt_name" : {
          "type" : "string",
          "fields" : {
            "raw" : {"type" : "string", "index" : "not_analyzed"}
          }
        },
        "alt_remarks" : {"type" : "string"}
      }
    },
    "add" : {
      "_parent" : {
        "type" : "prim"
      },
      "properties" : {
        "ent_num" : {"type" : "integer", "index" : "not_analyzed"},
        "add_num" : {"type" : "integer"},
        "address" : {"type" : "string"},
        "citystatezip" : {"type" : "string"},
        "country" : {"type" : "string"},
        "add_remarks" : {"type" : "string"}
      }
    },
    "comments" : {
      "_parent" : {
        "type" : "prim"
      },
      "properties" : {
        "ent_num" : {"type" : "integer", "index" : "not_analyzed"},
        "comment" : {"type" : "string"}
      }
    }
  }
}'

A few points here on the mappings:

  1. I’m using a single index, “ofac”, and the different files get their own types: prim (primary records), alt (aliases), add (address details), and comments.
  2. All types besides prim have prim set as their parent. This tells ES that we will be specifying a parent ID for each of those documents as they are indexed by the forked version of CSV-River (see the sketch after this list).
  3. The prim.SDN_Name field has a sub-field called “raw” that holds the non-tokenized, non-analyzed name data. We will use this for our aggregation grouping later.
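
To make point 2 concrete, indexing a single alias document against its parent entity with the official PHP client would look roughly like the sketch below. The record is made up; in practice the forked CSV-River does this for us:

require 'vendor/autoload.php';

$client = new Elasticsearch\Client(array('hosts' => array('localhost:9200')));

// A fictional alias record, routed to its parent prim document by ent_num,
// so that has_child queries and the children aggregation can find it.
$client->index(array(
    'index'  => 'ofac',
    'type'   => 'alt',
    'id'     => 501,       // alt_num
    'parent' => 12345,     // ent_num of the parent prim document
    'body'   => array(
        'ent_num'     => 12345,
        'alt_num'     => 501,
        'alt_type'    => 'aka',
        'alt_name'    => 'Example Trading Co.',
        'alt_remarks' => '',
    ),
));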

I then set up my CSV rivers to fetch and index the files. Since the river renames the files after it indexes them, further downloads will not cause collisions.

curl -XPUT localhost:9200/_river/prim/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/opt/bitnami/elasticsearch/scripts",
        "filename_pattern" : ".*prim\\.csv$",
        "poll" : "6h",
        "fields" : [ "ent_num","SDN_Name","SDN_Type","Program","Title","Call_Sign","Vess_type","Tonnage","GRT","Vess_flag","Vess_owner","Remarks" ],
        "first_line_is_header" : "false",
        "field_id" : "ent_num",
        "script_before_all" : "/opt/bitnami/elasticsearch/scripts/fetch_ofac.sh"
    },
    "index" : {
        "index" : "ofac",
        "type" : "prim"
    }
}'
curl -XPUT localhost:9200/_river/alt/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/opt/bitnami/elasticsearch/scripts",
        "filename_pattern" : ".*alt\\.csv$",
        "poll" : "6h",
        "fields" : [ "ent_num","alt_num","alt_type","alt_name","alt_remarks" ],
        "first_line_is_header" : "false",
        "field_id" : "alt_num",
        "field_parent" : "ent_num"
    },
    "index" : {
        "index" : "ofac",
        "type" : "alt"
    }
}'
curl -XPUT localhost:9200/_river/add/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/opt/bitnami/elasticsearch/scripts",
        "filename_pattern" : ".*add\\.csv$",
        "poll" : "6h",
        "fields" : [ "ent_num","add_num","address","citystatezip","country","add_remarks" ],
        "first_line_is_header" : "false",
        "field_id" : "add_num",
        "field_parent" : "ent_num"
    },
    "index" : {
        "index" : "ofac",
        "type" : "add"
    }
}'
curl -XPUT localhost:9200/_river/comments/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/opt/bitnami/elasticsearch/scripts",
        "filename_pattern" : ".*comments\\.csv$",
        "poll" : "6h",
        "fields" : [ "ent_num","comment" ],
        "first_line_is_header" : "false",
        "field_parent" : "ent_num"
    },
    "index" : {
        "index" : "ofac",
        "type" : "comments"
    }
}'

For better or worse, I think script_before_all runs asynchronously, which is a strange choice for a “before” hook; it means the downloads usually complete after the data is already indexed. But whether my data is actually refreshed every 6 or 12 hours doesn’t make a big difference for our needs.

The Search

There is much of the Query DSL that I didn’t learn in my crash course in ElasticSearch, but the docs and examples online finally got me to the correct result. The most challenging part was returning the alias data along with the matching results. The latest stable release at the time, which I was using, was version 1.4. I believe 1.5 will add some JOIN-like functionality, but for now I had to choose between recreating the index with the aliases, addresses, and comments nested in the primary data, or using aggregates to pull out the information I needed. I chose aggregates, and the result looked like this:

{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must_not": {
            "term": {
              "SDN_Type": "individual"
            }
          }
        }
      },
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "prim.SDN_Name": {
                  "query": "cubanator",
                  "operator": "and"
                }
              }
            },
            {
              "has_child": {
                "type": "alt",
                "query": {
                  "match": {
                    "alt.alt_name": {
                      "query": "cubanator",
                      "operator": "and"
                    }
                  }
                }
              }
            }
          ],
          "minimum_number_should_match": "65%"
        }
      }
    }
  },
  "aggs": {
    "fullmatches": {
      "terms": {
        "field": "prim.SDN_Name.raw",
        "size": "10"
      },
      "aggs": {
        "aliases": {
          "children": {
            "type": "alt"
          },
          "aggs": {
            "names": {
              "top_hits": {
                "size": "10"
              }
            }
          }
        }
      }
    }
  }
}

The filter dynamically removes either individual (“must_not”) or non-individual (“must”) results, depending on our needs. Then we run our tokenized, phonetic matching against the names and aliases. Finally, we also return a top-hits aggregate of at most 10 matching alias records per name match.

Our PHP code using the official ElasticSearch client looks like:

$q = strtolower( trim($_REQUEST['last_name'].' '.$_REQUEST['first_name'].' '.$_REQUEST['name']));
if(empty($q)) exit('[]');

require 'vendor/autoload.php';

$params = array();
$params['hosts'] = array ( 'YOUR IP HERE');
$client = new Elasticsearch\Client($params);

$searchParams = array();
$searchParams['index'] = 'ofac';
$searchParams['type']  = 'prim';
$searchParams['body']['query']['filtered']['filter']['bool'][($_REQUEST['api']=='companySearch'?'must_not':'must')]['term']['prim.SDN_Type'] = 'individual';
$searchParams['body']['query']['filtered']['query']['bool']['should'][0]['match']['prim.SDN_Name']=array('operator'=>'and', 'query'=>$q);
$searchParams['body']['query']['filtered']['query']['bool']['should'][1]['has_child'] = 
	array('type'=>'alt', 'query'=>
		array('match'=>
			array('alt.alt_name'=>
				array('operator'=>'and', 'query'=>$q) 
			)
		)
	);
$searchParams['body']['query']['filtered']['query']['bool']['minimum_number_should_match'] = '65%';

$searchParams['body']['aggs']['fullmatches'] = 
	array('terms' =>
			array('field' => "prim.SDN_Name.raw",
				"size"=> "10"
			),
		  'aggs' =>
			array('aliases' => array(
				'children' => array('type'=>'alt'),
				'aggs' => array(
					'names' => array('top_hits'=> array("size"=> "10")),
					//'top_hit' => array('max' => array('script' => '_score'))
				)
			  )
			)
	);


$retDoc = $client->search($searchParams);

if($retDoc['hits']['total'] > 0) {
	
	// Build a lookup of alias names keyed by the primary SDN name
	if(count($retDoc['aggregations']['fullmatches']['buckets'])>0){
		$aliases=array();
		foreach($retDoc['aggregations']['fullmatches']['buckets'] as $i=>$b){
			if($b['aliases']['doc_count'] > 0) {
				$prim_aliases = array();
				foreach($b['aliases']['names']['hits']['hits'] as $j=> $alt) {
					$prim_aliases[] = $alt['_source']['alt_name'];
				}
				$aliases[$b['key']] = implode(';',$prim_aliases);
			}
		}

	}
	
	// Build the response in the same structure as the free API we replaced
	$json = array();
	foreach($retDoc['hits']['hits'] as $i=>$result) {
		$s = str_replace('-0- ','',$result['_source']); // strip OFAC's "-0-" placeholder for empty fields
		$json[] = array('ofac_sdn_ent_num'	=> $result['_id'],
						'ofac_sdn_name' 	=> $s['SDN_Name'],
						'ofac_sdn_type'		=> $s['SDN_Type'],
						'ofac_sdn_program'	=> $s['Program'],
						'ofac_sdn_title'	=> $s['Title'],
						'ofac_sdn_call_sign'=> $s["Call_Sign"],
						'ofac_sdn_vess_type'=> $s["Vess_type"],
						'ofac_sdn_tonnage'	=> $s["Tonnage"],
						'ofac_sdn_grt'		=> $s["GRT"],
						'ofac_sdn_vess_flag'=> $s["Vess_flag"],
						'ofac_sdn_vess_owner'=>$s["Vess_owner"],
						'ofac_sdn_remarks'	=> trim($s["Remarks"].(isset($aliases[$s['SDN_Name']])?'     AKAs: '.$aliases[$s['SDN_Name']]:'')),
						'type'				=> 'Ofac'.($result['_type']=='prim' ? "Sdn":"Alt")
				);
	}
	exit(json_encode($json));
}		
else
	exit('[]');

We set up our client, pass in the query, run through the aliases to create an easy local lookup, and return the same response structure as the free API. And my front end was never the wiser.





Automatically update your php_browscap.ini

19 10 2010

My website uses the PHP function get_browser to see if a client is Javascript enabled. (I use this primarily to tell whether the visitor is a search engine bot that should see a long list of data, or a real person who can handle AJAX-loaded data in tabs – most real people these days are using Javascript-enabled browsers.)

get_browser relies on the php_browscap.ini file, and the problem is that keeping php_browscap.ini up to date can be a chore; it should certainly be done automatically. (I notice the problem now that I use Chrome: with every major release, the browscap file has no clue about my new browser, which is silly.)
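
For context, the check itself is only a couple of lines. The “javascript” and “crawler” capabilities come from the browscap data, so treat the field names below as an assumption that may vary with the ini you install:

$browser = get_browser(null, true); // true => return an array rather than an object

if (!empty($browser['crawler'])) {
    // Search engine bot: render the full, crawlable list of data
    $use_ajax_tabs = false;
} else {
    // Real visitor: use the AJAX-loaded tabs if the browser supports Javascript
    $use_ajax_tabs = !empty($browser['javascript']);
}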

So I finally sat down and wrote a basic script to fetch the updated file and restart my web server, which I’ll run weekly. This works for my FreeBSD box – change the paths and download command to fit your installation.

/usr/local/bin/wget -Ophp_browscap.ini "http://browsers.garykeith.com/stream.asp?PHP_BrowsCapINI"
mv -f /usr/local/etc/php/php_browscap.ini /usr/local/etc/php/php_browscap.old.ini
mv -f php_browscap.ini /usr/local/etc/php/
/usr/local/etc/rc.d/apache22 restart

I saved it to: /usr/local/www/mysite.com/get_php_browscap.sh
Then add it to your crontab:

# php_browscap update
5       4       *       *       6       root    /usr/local/www/mysite.com/get_php_browscap.sh >> /var/log/cron.log

I run it Saturday morning.
Any questions?





How to browse PHP’s print_r output the right way

16 06 2010

I’ve had the pleasure recently of working with some complex object-oriented PHP with massive objects or lists of objects. The easiest way to have a look at your data for analyzing and debugging is to print_r it. Unfortunately, skimming through this data can be tedious, especially if you want to skip a couple of large, nested objects that are irrelevant to your data analysis.
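
For reference, the output I’m talking about comes from something like this (data made up). Every nested level becomes its own Array ( ... ) block, and those parentheses are what the bracket matching and folding below latch onto:

$order = array(
    'id'       => 1234,
    'customer' => array('name' => 'Acme Corp', 'country' => 'US'),
    'items'    => array(
        array('sku' => 'A-1', 'qty' => 2),
        array('sku' => 'B-7', 'qty' => 1),
    ),
);
print_r($order);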

When programming, I always keep Notepad++ open so I can keep a bunch of data accessible in tabs, such as notes, texts, data files, etc. (and my tabs of saved files are seamlessly preserved between sessions, which is critical, of course). Notepad++ can handle bunches of different file formats out of the box, but PHP print_r output isn’t one of them.

So I’ve pasted my print_r output into Notepad++, and the new file language is “Normal Text”. Immediately, bracket matching/highlighting works, which is great. I can go to an opening paren and see where it ends. So that’s it, right? Well, Notepad++ can also do code folding, like when you’re browsing a class and you want to see all the class functions without all that pesky function code cluttering your view. So if you’ve got a PHP file open, for example, you can either click the [-] to the left of the code to fold that function, or go to View->Fold All, and then just click [+] to open the class and see its top-level items in plain view.

So what about the print_r? Matlab! I haven’t used Matlab since my first or second year in engineering school, and never thought it would be useful, but here it is. I don’t remember what the code looks like, and I don’t care. All I know is that when I select Language->Matlab, it lets me do code folding on my print_r output, and that’s all that matters. A bunch of the other languages work well too, or the same as Matlab, but it was the first and best for my needs.

FYI, this breaks if you have variable data containing a chunk of text with newlines and whatnot, but so it goes.





(PHP) Timestamp for a WORKING date/time a number of minutes/days from a given start date

3 12 2009

That title long enough? It’s hard to describe this function, but you get it now, right?

Based on this function here.

/**
* getTimePlusOffsetDuringWork
*
* The function returns the timestamp of a date + business minutes
* @param timestamp|string startDate Beginning date/time as base for offset
* @param numeric offsetMinutes Work minutes to add to the startDate
* @param array holidays Array of holiday dates, preferably in the 'MM-DD' format
* @return integer Timestamp for the resulting time during Work hours,
*     given the simple equation StartDate + offsetMinutes.
*     Ignores after-hours, weekends, and holidays [for the first year after the startDate].
* Limitations: too many to list entirely.
*     9am-5pm work day only
*     1 year of holidays only
*     no week/weekday dependent holidays
*     Need to set timezone prior to calling this function if used outside of the server's timezone
*     Relies on English locale setting
* @author Zvi Landsman
**/
function getTimePlusOffsetDuringWork($startDate,$offsetMinutes,$holidays=array("12-25","12-26","01-01","07-04")){
    $j = $i = 0; //changed this to 0 or you were always starting one day ahead
    $tmp1 = is_numeric($startDate)?$startDate:strtotime($startDate); //work out a timestamp to start with (the original used a validation::is_timestamp() helper)
    $offsetMinutes = round($offsetMinutes);
    $simple_time = strtotime("+$offsetMinutes minutes", $tmp1);
    if($simple_time <= strtotime('17:00', $tmp1))
    {
        //offset is still today
        return $simple_time;
    }
    //Check if start time is after-hours 
    if ($tmp1 > strtotime('17:00', $tmp1))
    {
        $tmp1 = strtotime("09:00 +1 day", $tmp1);
    }
    //This checks if the minute offset puts us into tomorrow
    $newdate = $tmp1;
    $minutes_left = $offsetMinutes%480;
    $newtime = strtotime("+ $minutes_left minutes", $newdate);
    if($newtime > strtotime('17:00', $newdate))
    {
        $j++;
        $minutes_left = round(($newtime - strtotime('17:00', $newdate))/60);
        $tmp1 = strtotime("+ $minutes_left minutes", strtotime("09:00 +1 day", $tmp1));
    }
    else
    {
        $tmp1 = strtotime("+ $minutes_left minutes", $tmp1);
    }

    //create the holiday list for the first upcoming occurrences only
    $holidayList = array();
    foreach ($holidays as $holiday)
    {
        $time_stamp=strtotime($holiday, $tmp1);

        $date=($time_stamp < $tmp1)?
            strftime("%d-%m-%Y",strtotime($holiday.' +1 year')):
            strftime("%d-%m-%Y",$time_stamp);
        $holidayList[] = $date;
    }

    $days_to_offset = floor($offsetMinutes/480);
    while($i < $days_to_offset)
    {
        $tmp2 = strtotime("+$j day", $tmp1);
        $day = strftime("%A",$tmp2);

        $tmp = strftime("%d-%m-%Y",$tmp2);
        if(($day != "Sunday") && ($day != "Saturday" )&&(!in_array($tmp, $holidayList)))
        {
            $i = $i + 1;
            $j = $j + 1;

        }
        else
        {
            $j = $j + 1;

        }
    }
    //$j = $j -1;
    return strtotime("+$j days",$tmp1);
}
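
A quick usage sketch (the date and timezone are arbitrary): 600 working minutes starting Monday at 9am should roll one full 8-hour work day plus two hours forward, landing on Tuesday morning.

date_default_timezone_set('Asia/Jerusalem'); // set your timezone before calling, per the limitations above
$ts = getTimePlusOffsetDuringWork('2009-12-07 09:00', 600); // Monday 09:00 + 600 work minutes
echo date('l Y-m-d H:i', $ts), "\n"; // should print Tuesday 2009-12-08 11:00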

I’m completely open to improvements in the comments.





Going from a Live Site to a Database

3 03 2009

We all know that it’s a web developer’s job to turn a database into a live web site (and often to design the database too). But what about when a client wants to take the data on a live site and turn it into a database or data file?

  1. Stage #1 – download the site
    Retrieving the site requires planning and foresight for the next step. Using wget, my first approach was to mirror the entire site

    wget -m -w2 -x -E -k -D  "www.thesite.com" -p http://www.thesite.com

    I was greeted with a restricted message, which led me immediately to believe that wget was violating a robots.txt or similar restriction. I added to the above command:

    -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)"

    which set the user agent to my browser settings.

  2. Stage #2 – prepare the files for pattern matching
    The easiest method I could imagine was to combine all the (relevant) files into one file. I’m no Unix expert, but I was slightly familiar with cat. After some research, I combined cat with find, and later learned some xargs to make cat work for the job. But I kept running into problems, sometimes with syntax, sometimes with argument list size, and sometimes with other unclear file errors.
  3. Stage #1 revisited – download a second site
    I decided to try the next site by downloading the pages directly into one file, using:

    wget -r -l inf --no-remove-listing -w2 -D "www.site2.com" http://www.site2.com -np -O site2.html -U "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)"
  4. Stage #3 – the script
    My approach was to write a simple script that uses regular expressions to convert the data into a CSV file. Javascript would run pretty slowly, so I modified an existing PHP script for our purposes.
    I gzipped the files on my unix box, and copied them to my local Windows machine.
    After getting the right regular expression using The Regex Coach, I pushed the data file through, but hit some problems. As it turns out, one large file (what I was trying to achieve in Stage #2) didn’t fare well with preg_match_all. It seems the function hit a limit and stopped returning results.
    The previous script was equipped to read a directory and process each file independently. So I found a class to emulate the unix split function, dividing the huge string into smaller, tolerable files. Of course, that might have split blocks mid-match and broken the regular expression matches. So instead I split the multi-file file into single-file strings

    $ar = explode('<html', $str);

    and handled each original “file” independently (see the sketch after this list).
    Voila!

  5. Stage #2 revisited – what to do with a complete site
    Now I have a site with lots of pages in lots of directories. What to do? I tried using Windows Search/Explorer to find the relevant files from the many directories, and copy them into a single directory to be processed by our now flexible script. But Search only returned 5,000 results (I found out later the file count was closer to 70,000), and when I tried to copy/move the files into a single directory, Windows hung for a few hours, gobbled the CPU and did nothing. A quick search found me XXCOPY, which I easily installed, flattened the directories, and ran our script without a hitch.
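
For the curious, the per-chunk processing from Stage #3 boiled down to something like the sketch below. The file name, regular expression, and CSV fields are placeholders (the real pattern came out of The Regex Coach); it just shows the explode-then-match loop:

// Split the concatenated wget output back into per-page strings and run the
// regex on each chunk separately, so preg_match_all never chokes on one huge string.
$str   = file_get_contents('site2.html');
$pages = explode('<html', $str);

$out     = fopen('site2.csv', 'w');
$pattern = '#<td class="name">(.*?)</td>\s*<td class="phone">(.*?)</td>#s'; // placeholder pattern

foreach ($pages as $page) {
    if (preg_match_all($pattern, $page, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            fputcsv($out, array(trim($m[1]), trim($m[2])));
        }
    }
}
fclose($out);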

I’m sure there are quicker ways to do it by those Unix experts, but now I have this flexible script that can operate on a number of input types fairly easily, so I’m happy.

Let me know if you’re interested in seeing this PHP script.





Google Talk Status API in PHP

5 02 2009

I just put together this API in PHP for retrieving the online status of a Google Talk user. It’s based on the Google Talk badge but allows for server-side access to a user’s status.
