Apr. 4th, 2009 09:28 am
lederhosen: (Default)
[personal profile] lederhosen
Last week: co-worker needed to join two datasets (staff experience to calls made by staff). Unfortunately the variable that SHOULD be the match variable (employee number) isn't 100% reliable - some people have numbers beginning '7595...' on one file and '7959...' on the other, that sort of thing. The other possible matching variable is name, but that's not 100% reliable either - there are people who are 'Anne' on one file and 'Patricia' on the other (legal name vs preferred name), variant spellings, name changes due to marriage and so on. We also have two employees with the exact same name.

So I set up a two-stage join: match by ID number, find unmatched records, match those by name, recombine the two files, etc etc. Messy and complicated, took quite some time to write and debug.

Then the other day, while poking around looking for something else, I discovered that I could have done it much more elegantly in a couple of lines. I'm using SAS EG, which has a point-and-click interface that allows SQL joins on equality etc; while I know how to write those in SQL for myself, I hadn't realised that I could also write a few things that weren't in the point-and-click options*. The trick is to write a fuzzy-logic join:

where (
(a.emp_id=b.emp_id)*3 >= &match_tolerance)

In practice, probably slightly fiddlier than that, but that's the basic idea. Wish I'd known earlier that I could do that.

Oh well, probably not the last time I'll need to do that, and I'm sure some of my co-workers can use this trick too. I'm trying to encourage them to shift to SQL joins instead of the SAS merge operation, because it seems to cause problems for anything other than one-to-one joins.

*My knowledge of SQL and SAS code is 'what I've picked up along the job'; while I know quite a few tricks, without formal training it's easy to miss important basics, as here.
Identity URL: 
Account name:
If you don't have an account you can create one now.
HTML doesn't work in the subject.


If you are unable to use this captcha for any reason, please contact us by email at

Notice: This account is set to log the IP addresses of everyone who comments.
Links will be displayed as unclickable URLs to help prevent spam.


lederhosen: (Default)

April 2017

2345 678

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Apr. 25th, 2017 10:40 am
Powered by Dreamwidth Studios