OFAC compliance with ElasticSearch

19 01 2015

The mission: To create a search engine capable of detecting possible matches between our contact entries and the OFAC Specially Designated Nationals (SDN) List.

The background: I was hired to develop highly compliant software that tracks and facilitates the movement of funds between the U.S. and Israel. Recently, the U.S. government has taken a hard line against corporations moving US dollars offshore on behalf of US citizens. Compliance demands that customers wishing to move funds be checked against the OFAC SDN list, as well as the new consolidated non-SDN list, which contains smaller lists such as the Palestinian Legislative Council (NS-PLC) list.

OFAC does not provide an API for querying this list, but it has launched a search page alongside its publication (and regular updates) of plain text files. The search page description provides adequate detail about their matching algorithms and basic expectations: character and word matching, plus phonetics. Phonetic matching tolerates basic spelling mistakes and discrepancies, particularly when dealing with non-English names.
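
To get a feel for what phonetic matching buys us, note that PHP (our application language) ships with soundex() and metaphone() built in. This is just an illustration; PHP’s metaphone() implements the original Metaphone algorithm, not the Double Metaphone encoder we’ll configure in ElasticSearch below:

$a = 'Hussein';	// two common transliterations of the same name
$b = 'Husayn';

// Soundex (the algorithm used by the OFAC search tool) collapses both to "H250".
var_dump(soundex($a) === soundex($b));		// bool(true)

// Metaphone maps both to "HSN", so the two spellings match phonetically.
var_dump(metaphone($a) === metaphone($b));	// bool(true)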

Given our development phase and low starting volume, I developed the front-end against a free API available online. Our testing, however, showed that the fuzzy logic it implemented was not sufficient for our purposes (and there were perhaps other issues: alias matching, list updates, closed source). A cup of coffee and some Google/StackOverflow searches later, and I was sold on ElasticSearch.

Getting started: I put my frugality aside and decided to test development on a new AWS instance. I was happy to find that Bitnami had just the machine image I needed to get up and running in a snap.

For some reason, Pageant wasn’t working for me, so I loaded my private key directly into PuTTY and was able to log in with the “bitnami” user.

Going a bit out of development order, let’s first create our script to download the text files:

vi /opt/bitnami/elasticsearch/scripts/fetch_ofac.sh

#!/bin/sh
# Pull the latest SDN and consolidated (non-SDN) CSVs from OFAC.
cd /opt/bitnami/elasticsearch/scripts/
wget http://www.treasury.gov/ofac/downloads/sdn.csv -O sdn_prim.csv
wget http://www.treasury.gov/ofac/downloads/add.csv -O sdn_add.csv
wget http://www.treasury.gov/ofac/downloads/alt.csv -O sdn_alt.csv
wget http://www.treasury.gov/ofac/downloads/sdn_comments.csv -O sdn_comments.csv
wget http://www.treasury.gov/ofac/downloads/consolidated/cons_prim.csv -O cons_prim.csv
wget http://www.treasury.gov/ofac/downloads/consolidated/cons_add.csv -O cons_add.csv
wget http://www.treasury.gov/ofac/downloads/consolidated/cons_alt.csv -O cons_alt.csv
wget http://www.treasury.gov/ofac/downloads/consolidated/cons_comments.csv -O cons_comments.csv
echo "Done fetching OFAC files"

We download the files right into the ES scripts folder, renaming some of them so we get consistent “list_type.ext” file names.

Configuring ElasticSearch

The first step was installing the plugins I needed: the CSV-river plugin and the phonetic plugin. The former lets us run our download script and index our CSV files, repeating as often as we need. The latter lets us conduct searches using our choice of the most popular “sounds-like” algorithms, not just the Soundex used by the OFAC search tool.

cd /opt/bitnami/elasticsearch/
sudo bin/plugin -install elasticsearch/elasticsearch-analysis-phonetic/2.4.1
sudo bin/plugin -install river-csv -url https://github.com/AgileWorksOrg/elasticsearch-river-csv/releases/download/2.1.1/elasticsearch-river-csv-2.1.1.zip

*As of this writing, elasticsearch-river-csv-2.1.1 did not support parent-child relationships, but luckily I found an old fork of the code which did (and which didn’t break my code), so I later ran:

sudo /opt/bitnami/elasticsearch/bin/plugin -uninstall river-csv
sudo /opt/bitnami/elasticsearch/bin/plugin -install river-csv -url https://github.com/Trax-air/elasticsearch-river-csv/releases/download/2.1.1/elasticsearch-river-csv-2.1.1-SNAPSHOT.zip

To modify the ES settings and make the phonetic search the default, I ran:
sudo vi /opt/bitnami/elasticsearch/config/elasticsearch.yml

Bitnami had the “network.publish_host” setting configured to a specific IP, so I just commented it out to let ES derive that setting automatically. Then I added to the end of the file:

index.analysis.analyzer.default.type: custom
index.analysis.analyzer.default.tokenizer: standard
index.analysis.analyzer.default.filter: [standard, lowercase, my_metaphone]
index.analysis.filter.my_metaphone.type: phonetic
index.analysis.filter.my_metaphone.encoder : doublemetaphone
index.analysis.filter.my_metaphone.replace: false

That flattened form is probably the same as this YAML, but I wasn’t sure whether the YAML was getting picked up, and figured I’d best stick to the conventions already in the file.

index :
  analysis :
    analyzer :
      default :
        type : custom
        tokenizer : standard
        filter : [standard, lowercase, my_metaphone]
    filter :
      my_metaphone :
        type : phonetic
        encoder : doublemetaphone
        replace : false

Restart the server using
sudo /opt/bitnami/ctlscript.sh restart

or just ES with
sudo /opt/bitnami/elasticsearch/scripts/ctl.sh stop
sudo /opt/bitnami/elasticsearch/scripts/ctl.sh start
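
After the restart, a quick request to the root endpoint confirms ES is back up and which version is answering (a trivial PHP check before moving on; curl works just as well):

$info = json_decode(file_get_contents('http://localhost:9200/'), true);
echo $info['version']['number'], "\n"; // e.g. 1.4.x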

Since the SDN and non-SDN lists share the same format and their unique IDs do not overlap, I was able to load both lists into the same types.
My mappings are:

curl -XPOST 'http://localhost:9200/ofac' -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
    },
    "mappings" : {
      "prim" : {
        "properties" : {
          "ent_num":{"type":"integer"},
          "SDN_Name":{
			  "type":"string",
			  "fields": {
				  "raw": {"type": "string", "index": "not_analyzed" }
			  }
		  },
		  "SDN_Type":{"type" : "string", "index" : "not_analyzed"}
        }
      },
      "alt" : {
        "_parent" : {
          "type" : "prim"
        },
        "properties" : {
          "ent_num":{"type" : "integer", "index" : "not_analyzed"},
		  "alt_num":{"type" : "integer"},
		  "alt_type": {"type": "string"},
		  "alt_name":{
			  "type":"string",
			  "fields": {
				  "raw": {"type": "string", "index": "not_analyzed" }
			  }
		  },
		  "alt_remarks":{"type": "string"}
        }
      },
      "add" : {
        "_parent" : {
          "type" : "prim"
        },
        "properties" : {
          "ent_num":{"type" : "integer", "index" : "not_analyzed"},
		  "add_num":{"type" : "integer"},
		  "address":{"type" : "string"},
		  "citystatezip":{"type": "string"},
		  "country":{"type": "string"},
		  "add_remarks": {"type": "string"}
        }
      },
      "comments" : {
        "_parent" : {
          "type" : "prim"
        },
        "properties" : {
          "ent_num":{"type" : "integer", "index" : "not_analyzed"},
		  "comment":{"type" : "string"}
        }
      }
    }
}'    

A few points here on the mappings:

  1. I’m using a single index, “ofac”, and the different files get their own types: prim (primary records), alt (aliases), add (address details), and comments.
  2. All types besides prim have prim set as their parent. This tells ES that we will be specifying a parent ID for all those documents indexed by the forked version of CSV-River.
  3. The prim.SDN_Name field has a sub-field called “raw” holding the non-tokenized, non-analyzed name data. We will use this for our Aggregation grouping later; the quick check below shows the phonetic tokens that this raw sub-field skips.
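
With the index created, the _analyze API is a handy way to confirm that the Double Metaphone filter really is the default analyzer. A minimal check in PHP (assuming ES on localhost; with replace set to false you should see the phonetic token alongside the lowercased original):

$url = 'http://localhost:9200/ofac/_analyze?text=' . urlencode('Husayn');
$res = json_decode(file_get_contents($url), true);
foreach ($res['tokens'] as $t) {
	echo $t['token'], "\n"; // e.g. "HSN" (Double Metaphone) and "husayn" (original)
}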

I then set up my CSV-Rivers to download and index the files. Since the River renames each file after indexing it, subsequent downloads will not cause collisions.

curl -XPUT localhost:9200/_river/prim/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/opt/bitnami/elasticsearch/scripts",
        "filename_pattern" : ".*prim\\.csv$",
        "poll":"6h",
        "fields" : [ "ent_num","SDN_Name","SDN_Type","Program","Title","Call_Sign","Vess_type","Tonnage","GRT","Vess_flag","Vess_owner","Remarks" ],
		"first_line_is_header" : "false",
		"field_id" : "ent_num",
		"script_before_all": "/opt/bitnami/elasticsearch/scripts/fetch_ofac.sh"
	}, 
	"index" : { 
		"index" : "ofac", 
		"type": "prim" 
	}
}'
curl -XPUT localhost:9200/_river/alt/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/opt/bitnami/elasticsearch/scripts",
        "filename_pattern" : ".*alt\\.csv$",
        "poll":"6h",
        "fields" : [ "ent_num","alt_num","alt_type","alt_name","alt_remarks"],
		"first_line_is_header" : "false",
		"field_id" : "alt_num",
		"field_parent": "ent_num"
	}, 
	"index" : { 
		"index" : "ofac", 
		"type": "alt" 
	}
}'
curl -XPUT localhost:9200/_river/add/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/opt/bitnami/elasticsearch/scripts",
        "filename_pattern" : ".*add\\.csv$",
        "poll":"6h",
        "fields" : [ "ent_num","add_num","address","citystatezip","country","add_remarks"],
		"first_line_is_header" : "false",
		"field_id" : "add_num",
		"field_parent": "ent_num"
	}, 
	"index" : { 
		"index" : "ofac", 
		"type": "add" 
	}
}'
curl -XPUT localhost:9200/_river/comments/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/opt/bitnami/elasticsearch/scripts",
        "filename_pattern" : ".*comments\\.csv$",
        "poll":"6h",
        "fields" : [ "ent_num","comment"],
		"first_line_is_header" : "false",
		"field_parent": "ent_num"
	}, 
	"index" : { 
		"index" : "ofac", 
		"type": "comments" 
	}
}'
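
Once the rivers have completed their first poll, a quick document count per type verifies that the data actually landed. A minimal sanity check in PHP, assuming ES on localhost:

foreach (array('prim', 'alt', 'add', 'comments') as $type) {
	$res = json_decode(file_get_contents("http://localhost:9200/ofac/{$type}/_count"), true);
	echo $type, ': ', $res['count'], " documents\n";
}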

For better or worse, I think script_before_all runs asynchronously, which is a strange choice for a “before” hook; it means the downloads usually complete after the data has already been indexed. But whether my data is actually updated every 6 hours or every 12 doesn’t make a big difference to our needs.

The Search

There is much of the Query DSL that I didn’t learn in my crash course in ElasticSearch, but the docs and examples online finally got me to the correct result. The most challenging part was returning the alias data along with the matching results. The latest stable release I was using was version 1.4. I believe 1.5 will have some JOIN functionality, but for now I had to choose between recreating the index with the aliases, addresses, and comments nested in the primary data, or using aggregates to pull out the information I needed. I chose aggregates, and the result looked like this:

{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must_not": {
            "term": {
              "SDN_Type": "individual"
            }
          }
        }
      },
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "prim.SDN_Name": {
                  "query": "cubanator",
                  "operator": "and"
                }
              }
            },
            {
              "has_child": {
                "type": "alt",
                "query": {
                  "match": {
                    "alt.alt_name": {
                      "query": "cubanator",
                      "operator": "and"
                    }
                  }
                }
              }
            }
          ],
          "minimum_number_should_match": "65%"
        }
      }
    }
  },
  "aggs": {
    "fullmatches": {
      "terms": {
        "field": "prim.SDN_Name.raw",
        "size": "10"
      },
      "aggs": {
        "aliases": {
          "children": {
            "type": "alt"
          },
          "aggs": {
            "names": {
              "top_hits": {
                "size": "10"
              }
            }
          }
        }
      }
    }
  }
}

The filter dynamically removes either individual (“must”) or non-individual (“must_not”) results, depending on our needs. Then we run our tokenized, phonetic matching against the names and aliases. Finally, we also return a top-hits aggregate of at most 10 matching alias records per name match.

Our PHP code using the official ElasticSearch client looks like:

$q = strtolower( trim($_REQUEST['last_name'].' '.$_REQUEST['first_name'].' '.$_REQUEST['name']));
if(empty($q)) exit('[]');

require 'vendor/autoload.php';

$params = array();
$params['hosts'] = array ( 'YOUR IP HERE');
$client = new Elasticsearch\Client($params);

$searchParams = array();
$searchParams['index'] = 'ofac';
$searchParams['type']  = 'prim';
$searchParams['body']['query']['filtered']['filter']['bool'][($_REQUEST['api']=='companySearch'?'must_not':'must')]['term']['prim.SDN_Type'] = 'individual';
$searchParams['body']['query']['filtered']['query']['bool']['should'][0]['match']['prim.SDN_Name']=array('operator'=>'and', 'query'=>$q);
$searchParams['body']['query']['filtered']['query']['bool']['should'][1]['has_child'] = 
	array('type'=>'alt', 'query'=>
		array('match'=>
			array('alt.alt_name'=>
				array('operator'=>'and', 'query'=>$q) 
			)
		)
	);
$searchParams['body']['query']['filtered']['query']['bool']['minimum_number_should_match'] = '65%';

$searchParams['body']['aggs']['fullmatches'] = 
	array('terms' =>
			array('field' => "prim.SDN_Name.raw",
				"size"=> "10"
			),
		  'aggs' =>
			array('aliases' => array(
				'children' => array('type'=>'alt'),
				'aggs' => array(
					'names' => array('top_hits'=> array("size"=> "10")),
					//'top_hit' => array('max' => array('script' => '_score'))
				)
			  )
			)
	);


$retDoc = $client->search($searchParams);

if($retDoc['hits']['total'] > 0) {
	
	if(count($retDoc['aggregations']['fullmatches']['buckets'])>0){
		$aliases=array();
		foreach($retDoc['aggregations']['fullmatches']['buckets'] as $i=>$b){
			if($b['aliases']['doc_count'] > 0) {
				$prim_aliases = array();
				foreach($b['aliases']['names']['hits']['hits'] as $j=> $alt) {
					$prim_aliases[] = $alt['_source']['alt_name'];
				}
				$aliases[$b['key']] = implode(';',$prim_aliases);
			}
		}

	}
	
	$json = array();
	foreach($retDoc['hits']['hits'] as $i=>$result) {
		$s = str_replace('-0- ','',$result['_source']);
		$json[] = array('ofac_sdn_ent_num'	=> $result['_id'],
						'ofac_sdn_name' 	=> $s['SDN_Name'],
						'ofac_sdn_type'		=> $s['SDN_Type'],
						'ofac_sdn_program'	=> $s['Program'],
						'ofac_sdn_title'	=> $s['Title'],
						'ofac_sdn_call_sign'=> $s["Call_Sign"],
						'ofac_sdn_vess_type'=> $s["Vess_type"],
						'ofac_sdn_tonnage'	=> $s["Tonnage"],
						'ofac_sdn_grt'		=> $s["GRT"],
						'ofac_sdn_vess_flag'=> $s["Vess_flag"],
						'ofac_sdn_vess_owner'=>$s["Vess_owner"],
						'ofac_sdn_remarks'	=> trim($s["Remarks"].(isset($aliases[$s['SDN_Name']])?'     AKAs: '.$aliases[$s['SDN_Name']]:'')),
						'type'				=> 'Ofac'.($result['_type']=='prim' ? "Sdn":"Alt")
				);
	}
	exit(json_encode($json));
}		
else
	exit('[]');

We set up our client, pass in the query, run through the aliases to build an easy local lookup, and return the same response structure as the free API. And my front-end was never the wiser.






[Fixed] jQuery News Ticker doesn’t show overflow text

10 07 2013

The ticker at http://www.jquerynewsticker.com/ is great, but what do you do when your text extends beyond the viewing window of the ticker? Shouldn’t the ticker scroll so you can see the rest of the text?

Yup!


Github Fork: https://github.com/IsraelWebDev/jQuery-News-Ticker

Demo: http://jsfiddle.net/israelwebdev/YsSpY/8/





Local JavaScript Scripting with Node.JS

11 03 2013

It’s not much, but it’s my first little scriptlet with Node.JS. Its specific purpose is to move files from one directory tree into an empty directory, one at a time, while another script processes each file as it enters the empty directory.





BuddyPress Metas & Title using xprofile field data

17 08 2011

Populating the BuddyPress user Profile page’s title, meta description, and meta keywords with some of the user-generated content, delivered via xprofile fields, seems so much better than the default options, or even what’s available in seopress, the only current BuddyPress SEO plugin.

While we’re at it, setting the on-page SEO data for Activities based on the content is a great move too.

You’ll have to modify the fields based on the data you have available, and what fields feed into which page data, but this code, placed in your theme’s functions.php file, is a good start.
Want to make a plugin based on this code for the layman? Feel free.

Note: extractCommonWords function based on the function here.

function bp_xprofile_title($title=null, $without_site_name=null){
	global $bp, $current_blog;

	if(defined('BP_ENABLE_MULTIBLOG')){
		$blog_title=get_blog_option($current_blog->blog_id,'blogname');
	}else {
		$blog_title=get_blog_option(BP_ROOT_BLOG,'blogname');
	}

	if(bp_is_user_profile()){
		$title_array = array();
		if($data = xprofile_get_field_data( 'Company Name', $bp->displayed_user->id))
			$title_array[] = "$data ({$bp->displayed_user->fullname})";
		if($data = xprofile_get_field_data( 'Business Overview', $bp->displayed_user->id))
			$title_array[] = $data;
		$title_array[] = $blog_title;
		$title = mb_strimwidth(strip_tags( esc_attr( implode( ' | ', $title_array))), 0, 70, '...');
	}
	if (bp_is_activity_component() && is_numeric( $bp->current_action )) {
		$activity = bp_activity_get_specific( array( 'activity_ids' => $bp->current_action ) );
		if ( $activity = $activity['activities'][0])
			if(!empty($activity->content))
				$title = mb_strimwidth(preg_replace("/[^A-Za-z0-9\s\s+\-]/", "", strip_tags( $activity->content)), 0, 70-3-3-strlen($blog_title), '...') . " | $blog_title";
	}
	return $title;
}
add_filter('bp_page_title', 'bp_xprofile_title',0);

function extractCommonWords($string){
	$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
    
	$string = strip_tags($string);
	$string = trim($string); // trim the string
	$string = preg_replace('/[^a-zA-Z0-9\s\s+\-]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
	$string = strtolower($string); // make it lowercase
    
	preg_match_all('/\b.*?\b/i', $string, $matchWords);
	$matchWords = $matchWords[0];
       
	foreach ( $matchWords as $key=>$item ) {
		if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
			unset($matchWords[$key]);
		}
	}   
	$wordCountArr = array();
	if ( is_array($matchWords) ) {
		foreach ( $matchWords as $key => $val ) {
			$val = strtolower($val);
			if ( isset($wordCountArr[$val]) ) {
				$wordCountArr[$val]++;
			} else {
				$wordCountArr[$val] = 1;
			}
		}
	}
	arsort($wordCountArr);
	$wordCountArr = array_slice($wordCountArr, 0, 10);
	return array_keys($wordCountArr);
}
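
For example, a quick illustration of the keyword extraction (the ordering among equal-count words isn’t guaranteed across PHP versions):

print_r(extractCommonWords('Search engines rank search results; search ranking matters.'));
// "search" (3 occurrences) comes back first, followed by the single-occurrence words.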

function bp_xprofile_meta()
{
	global $bp;

	if(bp_is_user_profile()){
		$array = array();
		if($data = xprofile_get_field_data( 'Business Category', $bp->displayed_user->id))
			$array[] = htmlspecialchars_decode($data);
		if($data = xprofile_get_field_data( 'Years of Experience', $bp->displayed_user->id))
			$array[] = "$data years experience";
		if($data = xprofile_get_field_data( 'Location', $bp->displayed_user->id))
			$array[] = $data;
		if($data = xprofile_get_field_data( 'Business Overview', $bp->displayed_user->id))
			$array[] = $data;

		$desc = mb_strimwidth(preg_replace("/[^A-Za-z0-9\s\s+\-&]/", "", strip_tags( implode( ' - ', $array))), 0, 150, '...');

		echo '<meta name="description" content="'.$desc.'" />';

		if($data = xprofile_get_field_data( 'Profile Details', $bp->displayed_user->id))
			$array[] = $data;
		if($data = xprofile_get_field_data( 'Company Name', $bp->displayed_user->id))
			$array[] = $data;
		$array[] = $bp->displayed_user->fullname;

		$key_text = preg_replace("/[^A-Za-z0-9\s\s+\-]/", "", strip_tags( implode( ' ', $array)));
		$keys = extractCommonWords($key_text);
		echo '<meta name="keywords" content="'.implode(',', $keys).'" />';
	}

	if (bp_is_activity_component() && is_numeric( $bp->current_action )) {
		$activity = bp_activity_get_specific( array( 'activity_ids' => $bp->current_action ) );
		if ( $activity = $activity['activities'][0])
			if(!empty($activity->content)){
				$content = preg_replace("/[^A-Za-z0-9\s\s+\-]/", "", strip_tags( $activity->content));
				echo '<meta name="description" content="'. mb_strimwidth($content, 0, 150, '...').'" />';
	
				$keys = extractCommonWords($content);
				echo '<meta name="keywords" content="'.implode(',', $keys).'" />';
			}
	}
}
add_action('wp_head', 'bp_xprofile_meta',1);




WordPress Network store in an instant!

28 03 2011

I’m getting really excited to use WPMU‘s MarketPress e-commerce plugin on the new JobShuk website.

As I wrote in this Tweet to the guy behind WordPress’ most popular e-commerce plugin for standalone installs, the 2 most exciting features of MarketPress are the unified marketplace across all users and revenue sharing. That means visitors can search/browse products contained in all the individual stores. Additionally, the Marketplace admin (me) can take a small percentage of the purchase price without getting into drop-shipping, affiliate payments, etc. That’s some easy money.

Now if only I could get MarketPress free for life. That would be a dream…





Android Spinner, ArrayAdapter, and NullPointerException

12 01 2011

This might be obvious if you’re bothering to read this, but when populating an Android Spinner with an ArrayAdapter backed by an array, the source array must not contain uninitialized elements.

For example, if you’re doing this:

// Assuming colors_json is an org.json.JSONArray of objects like {"label":"Red"}
String[] colors = new String[colors_json.length() + 1];
for (int m = 1; m <= colors_json.length(); m++) {
	JSONObject item = colors_json.getJSONObject(m - 1);
	if (item.has("label")) {
		colors[m] = item.getString("label"); // elements without a label stay null
	}
}

then you might encounter a case where colors contains uninitialized (null) elements.

Instead, you can try something like this:

// Compact copy: keep only the labels that actually exist.
String[] colors0 = new String[colors_json.length() + 1];
int index = 1;
for (int m = 1; m <= colors_json.length(); m++) {
	JSONObject item = colors_json.getJSONObject(m - 1);
	if (item.has("label")) {
		colors0[index] = item.getString("label");
		index++;
	}
}
String[] colors = new String[index];
System.arraycopy(colors0, 1, colors, 1, index - 1); // colors[0] is reserved for the prompt below

You can then continue safely to set:

colors[0] = "Color";
color_adapter = new ArrayAdapter<String>(this, android.R.layout.simple_spinner_item, colors);
color_adapter.setDropDownViewResource(android.R.layout.simple_spinner_dropdown_item);
color_spinner.setAdapter(color_adapter);

without concern of hitting a nasty NullPointerException.





Automatically update your php_browscap.ini

19 10 2010

My website uses the PHP function get_browser to see if a client is Javascript-enabled. (I use this primarily to tell whether the visitor is a search engine bot that should see a long list of data, or a real person who can handle AJAX-loaded data in tabs – most real people these days are using Javascript-enabled browsers.)
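
For reference, the check itself looks something like this (a sketch; "javascript" and "crawler" are standard capability flags in the browscap data, and get_browser needs the browscap path configured in php.ini):

$browser = get_browser(null, true); // null => use HTTP_USER_AGENT; true => return an array

if (!empty($browser['crawler']) || empty($browser['javascript'])) {
	// A bot (or a very old browser): render the full, flat list of data.
} else {
	// A real, JS-capable visitor: serve the AJAX-loaded tabs.
}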

get_browser relies on the php_browscap.ini file, and the problem is that keeping php_browscap.ini up-to-date can be a chore – one that should certainly be automated. (I notice the problem now that I use Chrome: with every major release, the browscap file has no clue about my new browser, which is silly.)

So I finally sat down and wrote a basic script to fetch the updated file and restart my web server, which I’ll run weekly. This works for my FreeBSD box – change the paths and download command to fit your installation.

#!/bin/sh
# Work in a scratch directory so the relative paths behave under cron.
cd /tmp
/usr/local/bin/wget -Ophp_browscap.ini "http://browsers.garykeith.com/stream.asp?PHP_BrowsCapINI"
# Keep the old file around, then swap in the fresh download.
mv -f /usr/local/etc/php/php_browscap.ini /usr/local/etc/php/php_browscap.old.ini
mv -f php_browscap.ini /usr/local/etc/php/
/usr/local/etc/rc.d/apache22 restart

I saved it to: /usr/local/www/mysite.com/get_php_browscap.sh
Then add it to your crontab:

# php_browscap update
5       4       *       *       6       root    /usr/local/www/mysite.com/get_php_browscap.sh >> /var/log/cron.log

I run it Saturday morning.
Any questions?







