The Problem
Consider the following two datasets where company names are not exactly the same in both the datasets,
name in dataset 1 | name dataset 2 | symbol data set2 |
The Pakistan General Insurance Co. Ltd. | The Pakistan General Insurance | PKGI |
Security Leasing Corporation Ltd. | Security Leasing Corporation L | SLCL |
NATIONAL BANK OF PAKISTAN | NATIONAL BANK OF PAKISTAN | NBP |
THE BANK OF KHYBER | THE BANK OF KHYBER | BOK |
THE BANK OF PUNJAB Ltd | THE BANK OF PUNJAB | BOP |
SILKBANK LIMITED . | SILKBANK LIMITED | SILK |
Awwal Modaraba Comay | Awwal Modaraba | AWWAL |
B.F. Modaraba. | B.F. Modaraba | BFMOD |
The above table shows that the company names not only differ in terms of different number of characters but also in terms of capitalization. However, there is a general patterns of similarity in the first few characters, starting from left to right. Given that, we can split the problem into two parts.
1.Extract similar characters
Extract the first few characters that are similar in both the dataset and merge the data using those similar characters. For example, in the second row of the above table, “The Pakistan General Insurance” part is similar in both the tables. If we count these characters, they are 30. So how exactly are we going to find the matching number of characters in each case? We shall not do that. Instead, we shall take the iterative path where:
1. we start with extracting the first n-number of characters from both the key variables in the two datasets and merge using the extracted (truncated) variables. If the merge succeeds, we shall save the merged data separately. Also, we shall delete, from the initial file, those records which were successfully merged, and further process those that did not merge. In this first step, we shall normally start with extracting a large number of characters, for example, up to 30 characters in the case of row 2 of the above table.
Please note that the relevant Stata function is substr() for extracting a given number of characters from a variable. We shall discuss it further as we proceed in this article.
2. In the next iteration, we shall further process those records which did not merge. This time we shall extract one character less than what we used in the
THE BANK OF KHYBER | THE BANK OF PUNJAB |
So using the first 29 characters of the two variables, we shall proceed to merge. Again, we shall retain the successfully merged records, append it to the already saved successful merged data, and delete it from the initial dataset, just we did in the first step above.
3. The iterative process will continue in the same fashion, and each time we need to pay more attention to the merged data to identify any incorrect merges as explained in step 2.
4. We shall stop at a point where the data seems to have a reasonable number of similar observations based on the initial characters.
2. Use lower case
The second part of the problem is the dissimilarity in the capitalization of the names. This part of the problem is easy to handle as we can use the function lower() to convert all the names to lower cases and then use it in the merging process.
An Example
Let’s use a simple example to implement what we have read so far. In this example, we are going to use the same data as given in the above table. From that data, we shall create two datasets. One of the dataset will remain in the Stata memory, we shall call it data_memory. The other data set will be saved to a file, we shall call it data_file. The data_file has two variables, name and symbol. We shall merge the data_memory into data_file using variable name as the merging criterion. To create the two dataset, we can copy and paste the following code to Stata do editor and run it.
* Copy the following code and run from Stata do editor
*--------------------------------------------------------------------------
* Create the data_memory
*--------------------------------------------------------------------------
clear
input str30 name str5 symbol
"The Pakistan General Insurance" "PKGI"
"Security Leasing Corporation L" "SLCL"
"NATIONAL BANK OF PAKISTAN" "NBP"
"THE BANK OF KHYBER" "BOK"
"THE BANK OF PUNJAB" "BOP"
"SILKBANK LIMITED" "SILK"
"Awwal Modaraba" "AWWAL"
"B.F. Modaraba" "BFMOD"
end
save data_memory, replace
*--------------------------------------------------------------------------
* Now to create the data_file
*--------------------------------------------------------------------------
clear
input str39 name
"The Pakistan General Insurance Co. Ltd."
"Security Leasing Corporation Ltd."
"NATIONAL BANK OF PAKISTAN"
"THE BANK OF KHYBER"
"THE BANK OF PUNJAB Ltd"
"SILKBANK LIMITED ."
"Awwal Modaraba Comay"
"B.F. Modaraba."
end
save data_file, replace
Now that we have created the two datasets, let’s start the process step by step. In the following code box, the code perform the first step as discussed above. I shall then explain what actually the code does.
use data_file, clear
gen name2=lower(substr(name,1,30))
clonevar name_old = name
save data_file, replace
use data_memory
gen name2=lower(substr(name,1,30))
merge 1:1 name2 using data_file
save temporary, replace
keep if _merge == 3
save matched, replace
list name name2 symbol
+----------------------------------------------------------------------+
name name2 |
---------------------------------------------------------------------- |
NATIONAL BANK OF PAKISTAN national bank of pakistan |
Security Leasing Corporation Ltd. security leasing corporation |
THE BANK OF KHYBER the bank of khyber |
The Pakistan General Insurance Co. Ltd. the pakistan general insurance
-----------------------------------------------------------------------
use temporary, clear
keep if _merge == 2
keep name
save remaining, replace
list
+--------------------+
| name |
+--------------------+
| Awwal Modaraba |
| B.F. Modaraba |
| SILKBANK LIMITED |
| THE BANK OF PUNJAB |
+--------------------+
Explanation
In the above code block, we loaded the first dataset i.e. data_memory and created a truncated variable (name2) from the original variable using the following line of code. The first function in this line is lower() that converts all cases to lower cases. The second function is substr() that extracts bytes 1 to 30 from the string variable name. So this line will convert “Security Leasing Corporation Ltd.” to “security leasing corporation“
gen name2=lower(substr(name,1,30))
After creating the name2 variable, we saved and replaced the data_file. We then loaded the other dataset, that is, data_memory and created a similar truncated variable as above, and then merged the two datasets using this truncated variable. The merged dataset has all useful information and we saved it in a temporary.dta file for further processing.
The first step in this processing is to isolate successful merges, that is flagged by the _merge == 3 code, therefore, we kept all such observations and saved them to
In our next iteration, we shall take all of the above steps using the remaining.dta file, instead of data_file.dta file.
*-------------------------------------------------------------------------
* Iteration 2 - using first 29 characters
*-------------------------------------------------------------------------
use data_file, clear
drop name2
gen name2=lower(substr(name,1,29))
save data_file, replace
use remaining, clear
gen name2=lower(substr(name,1,29))
merge 1:1 name2 using data_file
save temporary, replace
keep if _merge == 3
save matched, replace
list name name2 symbol+--------------------------------------------------+
| name name2 symbol |
|--------------------------------------------------|
| Awwal Modaraba awwal modaraba AWWAL |
| B.F. Modaraba b.f. modaraba BFMOD |
| SILKBANK LIMITED silkbank limited SILK |
| THE BANK OF PUNJAB the bank of punjab BOP |
+--------------------------------------------------+
save remaining, replace
Is there a simple way to do this?
smerge Program
The above tutorial is best for a learning the merging process, however, it is too lengthy and is not optimal for quick applications. I have managed to get the code in a standard Stata ado program. I call this program as smerge, that is merge based on sub-string.
Installation
The program can be installed by typing the following line in Stata command window
net install smerge, from(http://fintechprofessor.com) replace
The syntax of the program is given below:
smerge varname, use(file_name) [lower() higher()]
smerge is the program name, followed by the
options enclosed in brackets [] show optional options in Stata. There are two optional options in smerge program. If these options are not specified, then the default values are used, which are 30 for the higher and 10 for the lower. The higher() option sets the initial value of the number of string bites to be extracted from the varname to make a truncated variable that will be used for the first iteration in the merge process. The initial value is decremented by one in the next iteration until it reaches the limit set by the option lower().
An Example using smerge
Let use the same dataset which we created above. We shall first load the file data_memory and merge it with the file data_file.dta. Both the datasets have the variable name.
use data_memory.dta, clear
smerge name, use(data_file) lower(10) higher(30)
list
. list
+---------------------------------------------------------+
| name symbol nstr |
|---------------------------------------------------------|
| The Pakistan General Insurance Co. Ltd. PKGI 30 |
| Security Leasing Corporation Ltd. SLCL 30 |
| NATIONAL BANK OF PAKISTAN NBP 30 |
| THE BANK OF KHYBER BOK 30 |
| THE BANK OF PUNJAB Ltd BOP 18 |
|---------------------------------------------------------|
| SILKBANK LIMITED . SILK 16 |
| Awwal Modaraba Comay AWWAL 14 |
| B.F. Modaraba. BFMOD 13 |
+---------------------------------------------------------+
Explanations
Read the second line in the above code block. We typed smerge and then the variable name. The name variable is the key variable that we are using as the merging criterion. Option use(data_file) tells smerge program to merge the current data in
The list command shows three variables, one of them is
Hello, trying to smerge 2 datasets, when I get the following msg: file remaining.dta not found. Any ideas why is that? Thanks in advance
Dennis
Can you please send the code that you are using. Are you using the example dataset given above or you are using your own dataset. If it is your own, can send it to attaullah.shah@imsciences.edu.pk so that I can diagnose the problem. Thanks
Hello, trying to smerge the same date you use. I get the following msg: _merge not found. Any ideas why is that? Thanks in advance
Massa
Thanks for pointing out. I have updated the smerge program. The new version of asdoc can be installed from my site. Copy and paste the following line in Stata and press enter.
Please note that the above line has to be copied in full. After installation of the new version, then restart Stata.
Dear Assalam,
Thank you very much for your answer.
I did modified this to make it work:
Please, tell me if I am right.
Thank you again.
PS: Your program is very important to me. Better than matchit.