Gah!

Apr. 4th, 2009 09:28 am
lederhosen: (Default)
[personal profile] lederhosen
Last week: co-worker needed to join two datasets (staff experience to calls made by staff). Unfortunately the variable that SHOULD be the match variable (employee number) isn't 100% reliable - some people have numbers beginning '7595...' on one file and '7959...' on the other, that sort of thing. The other possible matching variable is name, but that's not 100% reliable either - there are people who are 'Anne' on one file and 'Patricia' on the other (legal name vs preferred name), variant spellings, name changes due to marriage and so on. We also have two employees with the exact same name.

So I set up a two-stage join: match by ID number, find unmatched records, match those by name, recombine the two files, etc etc. Messy and complicated, took quite some time to write and debug.

Then the other day, while poking around looking for something else, I discovered that I could have done it much more elegantly in a couple of lines. I'm using SAS EG, which has a point-and-click interface that allows SQL joins on equality etc; while I know how to write those in SQL for myself, I hadn't realised that I could also write a few things that weren't in the point-and-click options*. The trick is to write a fuzzy-logic join:

where (
(a.firstname=b.firstname)*1+
(a.lastname=b.lastname)*1+
(a.statename=b.statename)*1+
(a.emp_id=b.emp_id)*3 >= &match_tolerance)

In practice, probably slightly fiddlier than that, but that's the basic idea. Wish I'd known earlier that I could do that.

Oh well, probably not the last time I'll need to do that, and I'm sure some of my co-workers can use this trick too. I'm trying to encourage them to shift to SQL joins instead of the SAS merge operation, because it seems to cause problems for anything other than one-to-one joins.

*My knowledge of SQL and SAS code is 'what I've picked up along the job'; while I know quite a few tricks, without formal training it's easy to miss important basics, as here.
From:
Anonymous
OpenID
Identity URL: 
User
Account name:
Password:
If you don't have an account you can create one now.
Subject:
HTML doesn't work in the subject.

Message:

If you are unable to use this captcha for any reason, please contact us by email at support@dreamwidth.org


 
Notice: This account is set to log the IP addresses of everyone who comments.
Links will be displayed as unclickable URLs to help prevent spam.

Profile

lederhosen: (Default)
lederhosen

April 2017

S M T W T F S
      1
2345 678
9101112131415
16171819202122
23242526272829
30      

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated May. 30th, 2017 05:34 am
Powered by Dreamwidth Studios