Data preparation and fuzzy matching techniques for improved statistical modeling

Abstract

Data comes in all forms, shapes, sizes and complexities. SAS® users know all too well that data, whether stored in files or data sets, can be, and often is, problematic and plagued with a variety of issues. Although today’s statistical software programs are extremely powerful, they are typically not designed to overcome poor-quality data. This paper describes and recommends a comprehensive data preparation and fuzzy matching process that enables improved statistical modeling. Statistical techniques are also available for comparing the results of the process.

Most statistical software users are aware that two or more data files can be joined, or combined, without a problem when the data files have identifiers with unique and reliable values. However, many files do not have unique identifiers, or “keys”, and need to be joined using character values, such as names or email addresses. To add to the difficulty, these identifiers might be spelled differently, or follow different abbreviation or capitalization conventions. This paper describes a versatile six-step approach to handling data preparation and fuzzy matching issues for improved statistical modeling. The steps are: identifying and understanding potential matching scenarios; exploring data values and data types; cleaning and validating data; transforming data; applying traditional merge and join techniques; and applying an assortment of techniques to merge, join and match less-than-perfect, or “messy”, data through phonetic matching with special-purpose character-handling functions such as the SOUNDEX algorithm, and through the SPEDIS, COMPLEV, and COMPGED fuzzy matching functions. Although the programming techniques described in this paper are illustrated using SAS code, many, if not most, of them can be applied on any software platform that supports character-handling capabilities.
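As a rough illustration of how edit-distance matching can carry over to another platform, the following minimal Python sketch joins two lists of names that differ in spelling and capitalization. It is not the paper's SAS code: the function names `normalize`, `levenshtein`, and `fuzzy_join`, the sample names, and the cutoff of 2 edits are all assumptions made for this example. The distance computed is the classic Levenshtein distance, the measure the abstract associates with COMPLEV; no claim is made that it reproduces the scores returned by SPEDIS or COMPGED.

```python
def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not counted as edits."""
    return " ".join(value.lower().split())


def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance: each insertion, deletion, or substitution costs 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def fuzzy_join(left, right, max_distance=2):
    """Pair values whose normalized forms are within max_distance edits.

    max_distance=2 is an arbitrary, hypothetical cutoff chosen for illustration.
    """
    matches = []
    for left_value in left:
        for right_value in right:
            distance = levenshtein(normalize(left_value), normalize(right_value))
            if distance <= max_distance:
                matches.append((left_value, right_value, distance))
    return matches


if __name__ == "__main__":
    # Hypothetical sample data; real keys would come from the files being joined.
    print(fuzzy_join(["Jon Smith"], ["john smith", "Mary Jones"]))
```

In practice the cutoff involves a trade-off: a tighter value misses legitimate variants, while a looser one admits false matches, so candidate pairs near the boundary generally deserve manual review.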

Keywords

SAS, fuzzy matching, character-handling functions, phonetic matching, SOUNDEX, SPEDIS, edit distance, Levenshtein, COMPLEV, COMPGED
