Merging datasets in Stata on long strings and less precise matching criterion

  • 0

Merging datasets in Stata on long strings and less precise matching criterion

Category:Uncategorized

The Problem

Consider the following two datasets where company names are not exactly the same in both the datasets, still we want to merge them using the name variable as the merging criterion. How that can be done in Stata.

name in dataset 1 name dataset 2 symbol data set2
The Pakistan General Insurance Co. Ltd. The Pakistan General Insurance PKGI
Security Leasing Corporation Ltd. Security Leasing Corporation L SLCL
NATIONAL BANK OF PAKISTAN NATIONAL BANK OF PAKISTAN NBP
THE BANK OF KHYBER THE BANK OF KHYBER BOK
THE BANK OF PUNJAB Ltd THE BANK OF PUNJAB BOP
SILKBANK LIMITED . SILKBANK LIMITED SILK
Awwal Modaraba Comay Awwal Modaraba AWWAL
B.F. Modaraba. B.F. Modaraba BFMOD

The above table shows that the company names not only differ in terms of different number of characters but also in terms of capitalization. However, there is a general patterns of similarity in characters in the first few characters, starting from left to right. Given that, we can split the problem into two parts.

 

1.Extract similar characters

Extract the first few characters that are similar in both the dataset and merge the data using those similar characters. For example, in the second row of the above table, “The Pakistan General Insurance” part is similar in both the tables. If we count these characters, they are 30. So how exactly are we going to find the matching number of characters in each case? We shall not do that. Instead, we shall take the iterative path where:


1. we start with extracting the first n-number of characters from both the key variables in the two datasets and merge using the extracted (truncated) variables. If the merge succeeds, we shall save the merged data separately. Also, we shall delete, from the initial file, those records which were successfully merged, and further process those that did not merge. In this first step, we shall normally start with extracting a large number of characters, for example, up to 30 characters in the case of row 2 of the above table.

Please note that the relevant Stata function is substr() for extracting a given number of characters from a variable. We shall discuss it further as we proceed in this article.

2. In the next iteration, we shall further process those records which did not merge. This time we shall extract one character less than what we used in the preceeding step. The idea is that if two variables did not have the first 30 characters in common, they might have the first 29 characters in common. But this time, we need to be careful as we reduce the extraction of the initial number of characters, the chances of matching incorrect records increases. For example, consider the following two records where the first 12 characters are exactly the same in both the records i.e. “THE BANK OF “, therefore, if we merge the two data sets using only the first 12 characters, we shall incorrectly merge THE BANK OF KHYBER into THE BANK OF PUNJAB

THE BANK OF KHYBERTHE BANK OF PUNJAB

So using the first 29 characters of the two variables, we shall proceed to merge. Again, we shall retain the successfully merged records, append it to the already saved successful merged data, and delete it from the initial dataset, just we did in the first step above.

3. The iterative process will continue in the same fashion, and each time we need to pay more attention to the merged data to identify any incorrect merges as explained in step 2.

4. We shall stop at a point where the data seems to have a reasonable number of similar observations based on the initial characters.

 

2. Use lower case

The second part of the problem is the dissimilarity in the capitalization of the names. This part of the problem is easy to handle as we can use the function lower() to convert all the names to lower cases and then use it in the merging process.

 

An Example

Let’s use a simple example to implement what we have read so far. In this example, we are going to use the same data as given in the above table. From that data, we shall create two datasets. One of the dataset will remain in the Stata memory, we shall call it data_memory. The other data set will be save to a file, we shall call it data_file. The data_file has two variables, name and symbol. We shall merge the data_memory into data_file using variable name as the merging criterion. To create the two dataset, we can copy and paste the following code to Stata do editor and run it.

* Copy the following code and run from Stata do editor

*--------------------------------------------------------------------------
* Create the data_memory
*--------------------------------------------------------------------------
clear
input str30 name str5 symbol
"The Pakistan General Insurance" "PKGI"
"Security Leasing Corporation L" "SLCL"
"NATIONAL BANK OF PAKISTAN" "NBP"
"THE BANK OF KHYBER" "BOK"
"THE BANK OF PUNJAB" "BOP"
"SILKBANK LIMITED" "SILK"
"Awwal Modaraba" "AWWAL"
"B.F. Modaraba" "BFMOD"
end
save data_memory, replace

*--------------------------------------------------------------------------
* Now to create the data_file
*--------------------------------------------------------------------------
clear
input str39 name
"The Pakistan General Insurance Co. Ltd."
"Security Leasing Corporation Ltd."
"NATIONAL BANK OF PAKISTAN"
"THE BANK OF KHYBER"
"THE BANK OF PUNJAB Ltd"
"SILKBANK LIMITED ."
"Awwal Modaraba Comay"
"B.F. Modaraba."
end

save data_file, replace

Now that we have created the two datasets, let’s start the process step by step. In the following code box, the code perform the first step as discussed above. I shall then explain what actually the code does.

use data_file, clear
gen name2=lower(substr(name,1,30))
clonevar name_old = name
save data_file, replace
use data_memory
gen name2=lower(substr(name,1,30))
merge 1:1 name2 using data_file
save temporary, replace
keep if _merge == 3
save matched, replace
list name name2 symbol
+----------------------------------------------------------------------+
name name2 |
---------------------------------------------------------------------- |
NATIONAL BANK OF PAKISTAN national bank of pakistan |
Security Leasing Corporation Ltd. security leasing corporation |
THE BANK OF KHYBER the bank of khyber |
The Pakistan General Insurance Co. Ltd. the pakistan general insurance
-----------------------------------------------------------------------


use temporary, clear
keep if _merge == 2
keep name
save remaining, replace
list

 +--------------------+
| name |
+--------------------+
| Awwal Modaraba |
| B.F. Modaraba |
| SILKBANK LIMITED |
| THE BANK OF PUNJAB |
+--------------------+

 

Explanation

In the above code block, we loaded the first dataset i.e. data_memory and created a truncated variable (name2) from the original variable using the following line of code. The first function in this line is lower() that converts all cases to lower cases. The second function is substr() that extracts bytes 1 to 30 from the string variable name. So this line will convert “Security Leasing Corporation Ltd.” to “security leasing corporation

gen name2=lower(substr(name,1,30))

After creating the name2 variable, we saved and replaced the data_file. We then loaded the other dataset, that is, data_memory and created a similar truncated variable as above, and then merged the two datasets using this truncated variable. The merged dataset has all useful information and we saved it in a temporary.dta file for further processing.

The first step in this processing is to isolate successful merges, that is flagged by the _merge == 3 code, therefore, we kept all such observations and saved them to matched.dta file. The list command shows that we have 4 successful merges in this go. We then reload the temporary.dta file to isolate those records that did not merge. Therefore, those records are saved to another file, which we named as remaining.dta.

In our next iteration, we shall take all of the above steps using the remaining.dta file, instead of data_file.dta file.

 
*-------------------------------------------------------------------------
* Iteration 2 - using first 29 characters
*-------------------------------------------------------------------------
use data_file, clear
drop name2
gen name2=lower(substr(name,1,29))
save data_file, replace
use remaining, clear
gen name2=lower(substr(name,1,29))
merge 1:1 name2 using data_file
save temporary, replace
keep if _merge == 3
save matched, replace
list name name2 symbol

 +--------------------------------------------------+
 | name name2 symbol |
 |--------------------------------------------------|
| Awwal Modaraba awwal modaraba AWWAL |
| B.F. Modaraba b.f. modaraba BFMOD |
| SILKBANK LIMITED silkbank limited SILK |
| THE BANK OF PUNJAB the bank of punjab BOP |
+--------------------------------------------------+

save remaining, replace
 

Is there a simple way to do this?

 

smerge Program

The above tutorial is best for a learning the merging process, however, it is too lengthy and is not optimal for quick applications. I have managed to get the code in a standard Stata ado program. I call this program as smerge, that is merge based on sub-string.

 

Installation

The program can be installed by typing the following line in Stata command window

net install smerge, from(http://fintechprofessor.com) replace

The syntax of the program is given below:

 smerge varname, use(file_name) [lower() higher()]

smerge is the program name, followed by the varname, that is the variable name which should exist in both the files, i.e. the file currently loaded in the memory of Stata and the file on disk. After the comma, the program options are listed. Option use(file_name) is a required option. In the brackets, we have to provide the file name with which w want to merge the data. It can be a full directory path with the file name or simply the file name if the file is in the current directory.

options enclosed in brackets show optional options in Stata. There are two optional options in smerge program. If these options are not specified, then the default values are used, which are 30 for the higher and 10 for the lower. The higher() option sets the initial value of the number of string bites to be extracted from the varname to make a truncated variable that will be used for the first iteration in the merge process. The initial value is decremented by one in the next iteration until it reaches the limit set by the option lower().

 

An Example using smerge

Let use the same dataset which we created above. We shall first load the file data_memory and merge it with the file data_file.dta. Both the datasets have the variable name.

use data_memory.dta, clear
smerge name, use(data_file) lower(10) higher(30)
list

. list
+---------------------------------------------------------+
| name symbol nstr |
|---------------------------------------------------------|
| The Pakistan General Insurance Co. Ltd. PKGI 30 |
| Security Leasing Corporation Ltd. SLCL 30 |
| NATIONAL BANK OF PAKISTAN NBP 30 |
| THE BANK OF KHYBER BOK 30 |
| THE BANK OF PUNJAB Ltd BOP 18 |
|---------------------------------------------------------|
| SILKBANK LIMITED . SILK 16 |
| Awwal Modaraba Comay AWWAL 14 |
| B.F. Modaraba. BFMOD 13 |
+---------------------------------------------------------+
 

Explanations

Read the second line in the above code block. We typed smerge and then the variable name. The name variable is the key variable that we are using as the merging criterion. Option use(data_file) tells smerge program to merge the current data in memeory with the file data_file. And option higher(30) and lower(10) sets the limits for sub-string extraction for creating the truncated variables that will be used for merging. And that’s it. We just need one line of code to do all the steps that we went through in the earlier tutorial.

The list command shows three variables, one of them is nstr. This variable records the number of strings from left to right of the varname. This number is recorded when the merge is successful. So the list command shows that the four records were merged in the first iteration i.e when we extracted the first 30 characters. The 5th record was merged when we extracted 18 characters i.e. The full entry is THE BANK OF PUNJAB Ltd, and the first 18 characters are THE BANK OF PUNJAB


Leave a Reply

two × five =