![[personal profile]](https://www.dreamwidth.org/img/silk/identity/user.png)
Last week: co-worker needed to join two datasets (staff experience to calls made by staff). Unfortunately the variable that SHOULD be the match variable (employee number) isn't 100% reliable - some people have numbers beginning '7595...' on one file and '7959...' on the other, that sort of thing. The other possible matching variable is name, but that's not 100% reliable either - there are people who are 'Anne' on one file and 'Patricia' on the other (legal name vs preferred name), variant spellings, name changes due to marriage and so on. We also have two employees with the exact same name.
So I set up a two-stage join: match by ID number, find unmatched records, match those by name, recombine the two files, etc etc. Messy and complicated, took quite some time to write and debug.
Then the other day, while poking around looking for something else, I discovered that I could have done it much more elegantly in a couple of lines. I'm using SAS EG, which has a point-and-click interface that allows SQL joins on equality etc; while I know how to write those in SQL for myself, I hadn't realised that I could also write a few things that weren't in the point-and-click options*. The trick is to write a fuzzy-logic join:
where (
(a.firstname=b.firstname)*1+
(a.lastname=b.lastname)*1+
(a.statename=b.statename)*1+
(a.emp_id=b.emp_id)*3 >= &match_tolerance)
In practice, probably slightly fiddlier than that, but that's the basic idea. Wish I'd known earlier that I could do that.
Oh well, probably not the last time I'll need to do that, and I'm sure some of my co-workers can use this trick too. I'm trying to encourage them to shift to SQL joins instead of the SAS merge operation, because it seems to cause problems for anything other than one-to-one joins.
*My knowledge of SQL and SAS code is 'what I've picked up along the job'; while I know quite a few tricks, without formal training it's easy to miss important basics, as here.
So I set up a two-stage join: match by ID number, find unmatched records, match those by name, recombine the two files, etc etc. Messy and complicated, took quite some time to write and debug.
Then the other day, while poking around looking for something else, I discovered that I could have done it much more elegantly in a couple of lines. I'm using SAS EG, which has a point-and-click interface that allows SQL joins on equality etc; while I know how to write those in SQL for myself, I hadn't realised that I could also write a few things that weren't in the point-and-click options*. The trick is to write a fuzzy-logic join:
where (
(a.firstname=b.firstname)*1+
(a.lastname=b.lastname)*1+
(a.statename=b.statename)*1+
(a.emp_id=b.emp_id)*3 >= &match_tolerance)
In practice, probably slightly fiddlier than that, but that's the basic idea. Wish I'd known earlier that I could do that.
Oh well, probably not the last time I'll need to do that, and I'm sure some of my co-workers can use this trick too. I'm trying to encourage them to shift to SQL joins instead of the SAS merge operation, because it seems to cause problems for anything other than one-to-one joins.
*My knowledge of SQL and SAS code is 'what I've picked up along the job'; while I know quite a few tricks, without formal training it's easy to miss important basics, as here.
no subject
Date: 2009-04-04 11:40 pm (UTC)