Pioneers in NLP
“The historical sense involves a perception, not only of the pastness of the past, but of its presence” ― T.S. Eliot.
In 1965, Russian mathematician Vladimir Levenshtein published an algorithm for computing the “edit” distance between two strings. Fred Damerau was also working in this area at IBM Research in the U.S.; he published a ground-breaking paper in 1964 on using computers to correct spelling errors. In the original Levenshtein algorithm, distance is computed as the number of deletions, insertions, and substitutions needed to transform one string into the other. The Damerau-Levenshtein distance measure extended the original algorithm to include transpositions of adjacent characters. This work from over 50 years ago is the foundation of modern auto-correct and search routines!
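The classic dynamic-programming formulation is compact enough to sketch directly. The version below is a minimal illustration (not production code from any library mentioned later): it computes plain Levenshtein distance, and an optional flag adds the adjacent-transposition step, giving the "optimal string alignment" variant of Damerau-Levenshtein:

```python
def edit_distance(a, b, transpositions=False):
    """Edit distance between strings a and b via dynamic programming.

    With transpositions=False this is plain Levenshtein distance;
    with transpositions=True it is the optimal-string-alignment
    variant of Damerau-Levenshtein.
    """
    m, n = len(a), len(b)
    # dp[i][j] = distance between the prefixes a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # swap of two adjacent characters counts as one edit
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[m][n]
```

With transpositions enabled, a swapped pair such as "teh" vs. "the" costs one edit instead of two.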
Record Linkage Exercise
Algorithms based on Levenshtein distance are commonly used in record linkage that involves matching based on textual identifying information (e.g., names, addresses). Good examples abound in healthcare. The Centers for Medicare & Medicaid Services (CMS) sometimes links healthcare provider data from internal and external sources (e.g., Medicare with state Medicaid provider data). The outside sources may not carry National Provider Identifiers (NPIs), so identity resolution based on names and addresses is necessary.
Softrams’ App Foundry recently examined the effectiveness of different text matching algorithms currently available in Python. The problem we chose was to match and link two datasets containing information about hospitals, using name and address. This required a technique that identifies strings that match “approximately” (e.g., matching “Mercy Hospital Inc.” with “Mercy Hospital”). Most text matching routines return a similarity score.
We examined several distance methods: cosine similarity, Jaro-Winkler, and “fuzzy matching” using the Levenshtein algorithm. Fuzzy matching proved effective due to its speed and reliability; it also works well when the strings are long.
We installed and loaded several Python libraries, including FuzzyWuzzy to provide the matching algorithm and NLTK, whose lemmatizers and stemmers help clean and prepare the data before analysis. As part of that pre-processing, we performed the following steps:
- Remove or replace a few special characters (e.g., the ‘&’ symbol is replaced by ‘and’).
- Tokenization: break down the text into sentences and the sentences into words. Convert all words to lowercase and remove punctuation.
- Lemmatization: remove inflection (e.g., grammatical tense) and transform each word into its dictionary form.
- Stemming: extract the root form of the words.
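As a rough illustration of those steps, here is a pure-Python stand-in. The project used NLTK's tokenizers, lemmatizers, and stemmers; this sketch approximates them with simple string operations, and the crude plural-stripping rule is only a placeholder for a real stemmer:

```python
import re

def preprocess(text):
    # Step 1: replace special characters (e.g., '&' becomes 'and')
    text = text.replace("&", "and")
    # Step 2: tokenize, lowercase, and strip punctuation
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Steps 3-4: crude suffix stripping stands in for NLTK's
    # lemmatizer and stemmer in this illustration
    stems = [t[:-1] if t.endswith("s") and len(t) > 3 else t
             for t in tokens]
    return " ".join(stems)
```

For example, `preprocess("Mercy Hospital & Clinics Inc.")` yields `"mercy hospital and clinic inc"`.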
After this text preparation, we implemented four variant scoring functions to match address and name separately between the datasets:
- Ratio: perfect when the strings have similar lengths and order.
- Partial_ratio: allows us to perform substring matching. It is used for sentences with different lengths.
- Token_sort_ratio: Used when the sentences have the same meaning, but their order is different.
- Token_set_ratio: similar to token_sort_ratio, except it separates out the tokens common to both strings before calculating the similarity score. It is used when the two sentences have different lengths.
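To make these scores concrete, here is a standard-library sketch of the first three using difflib (FuzzyWuzzy's ratio is likewise built on SequenceMatcher, though its exact scores can differ slightly from this illustration):

```python
import difflib
import re

def ratio(a, b):
    # straight character-sequence similarity, scaled to 0-100
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

def partial_ratio(a, b):
    # best score of the shorter string against any same-length
    # window of the longer one (substring matching)
    short, long_ = sorted((a, b), key=len)
    return max(ratio(short, long_[i:i + len(short)])
               for i in range(len(long_) - len(short) + 1))

def token_sort_ratio(a, b):
    # sort the words alphabetically first, so word order is ignored
    norm = lambda s: " ".join(sorted(re.findall(r"\w+", s.lower())))
    return ratio(norm(a), norm(b))
```

For example, `token_sort_ratio("Mercy Hospital Inc", "Inc Hospital Mercy")` returns 100, because both strings normalize to the same sorted token sequence.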
The token_sort_ratio appeared to be the most appropriate for our use case, and after further examination of the outputs, we decided to use this function.
The function returns a score between 0 and 100, with 0 being not similar at all and 100 being identical. We were matching hospitals among thousands across the country, so we needed rules for determining the best matches. We found it effective to keep the hospital names and addresses separate and produce a similarity score for each.
We kept records with similarity scores above defined thresholds, excluding low-similarity matches. We then had to set rules for accepting matches when we had (a) high address similarity scores with low name scores, and (b) low address similarity scores. Addresses that matched well were easier to identify as true matches; after all, many hospitals across the country have similar names, and some hospitals have several different locations.
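A rule set of that shape can be captured in a small helper function. The thresholds and cut-offs below are illustrative placeholders, not the values used in the project:

```python
def accept_match(name_score, address_score,
                 name_thresh=90, addr_thresh=85):
    """Decide whether a candidate pair is a match.

    Thresholds are hypothetical, chosen only to illustrate the
    two rule cases described in the text.
    """
    # Case (a): a strong address match can outweigh a weaker
    # name match, since many hospitals share similar names
    if address_score >= addr_thresh and name_score >= name_thresh - 20:
        return True
    # Case (b): with a weak address match (e.g., a multi-site
    # hospital), demand a near-exact name match instead
    if name_score >= 98:
        return True
    return False
```

The key design point is that name and address are scored separately, so the rules can trade one off against the other instead of relying on a single combined score.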
With the rule criteria in place, we were able to match >99% of the hospitals correctly!
When performing fuzzy matching with Levenshtein distance, each string is compared against every other string in the dataset. This pairwise comparison runs in quadratic time and quickly becomes a barrier on even modest datasets of more than a few thousand records. A solution comes from a well-known NLP technique, Term Frequency-Inverse Document Frequency (TF-IDF): strings are converted into TF-IDF vectors (often over character n-grams), and close matches are then found with a K-Nearest Neighbors (KNN) search using cosine similarity. This approach has been shown to reduce computation time dramatically.
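A self-contained sketch of the idea, using TF-IDF over character trigrams with smoothed IDF weights (libraries such as scikit-learn provide optimized versions; this is only to show the mechanics):

```python
import math
from collections import Counter

def ngrams(text, n=3):
    # character n-grams are a common choice for matching short strings
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=3):
    """Return one sparse {ngram: weight} vector per document."""
    tfs = [Counter(ngrams(d, n)) for d in docs]
    df = Counter()                 # document frequency per n-gram
    for tf in tfs:
        df.update(tf.keys())
    N = len(docs)
    # smoothed IDF keeps ubiquitous n-grams from zeroing out
    idf = {t: math.log((1 + N) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in tf.items()} for tf in tfs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

In practice, each record is compared only against its K nearest neighbors in this vector space rather than against every other record, which is where the speedup over pairwise fuzzy matching comes from.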