Author Archives: Attaullah Shah

  • 0

asdoc – custom table title | MS Word | Stata


For those who are not yet familiar with asdoc, asdoc can be downloaded from SSC and can be used with almost all Stata commands. Here is a short blog post that shows how asdoc can be used with any Stata command. You can also watch several YouTube videos that show the use of asdoc

asdoc installation

 ssc install asdoc

Customizing the table title

The focus of this blog entry is to highlight some of the tricks that can be used in asdoc to customize the table titles. asdoc produces some generic table titles with many Stata commands. However, the default table titles can be replaced by using the option title(). In the following text, I discuss this and several other features of asdoc to modify the table titles:

1. Custom titles

If custom titles are needed, just use option title( title text). See the following example.

sysuse auto
asdoc tab rep78 , title(My cutom table title) replace

2. Title not required

We can use an empty title if the table title is not required. In the following example, I am using the backslash (\) inside the title option. This will produce empty title in the document.

asdoc tab rep78 , title(\) replace

3. Title is needed in italics

We can use many RTF directives inside the title option. So to get the title in italics font, the code would be:

asdoc tab rep78 , title(\i My custom table title) replace

4. Title with bigger font size

The default in asdoc is to report table titles in boldface and font size of 10pt. If bigger or smaller font size is needed, again we can pass the RTF directive \fs# to the document. So to produce a title with 14 pt, we shall type:

asdoc tab rep78 , title(\fs28 My custom table title) replace

  • 0

The Error option matrow() not allowed” when using asdoc with tabulate command


Ronald asked the following question: I am getting the error when using asdoc with tabulate command.
option matrow() not allowed

I asked Ronald to send his dataset for a closer look. Upon inspection, it turned out that Ronald was trying to tabulate values of a string variable, and asdoc had problem with that. As a temporary solution, I suggested the following two steps.

Step 1: Convert the string variable to numeric variable with value labels. The good news is that we can do that in one line of code using the encode command. So assume that the string variable name is country, we shall encode it:

encode country, gen(country_new)

where country_new is a new is a new variable with value labels.

Step 2: Now we can use it with asdoc, so to tabulate it and send the output to MS Word, we type:

asdoc tab country_new, replace

  • 0

Reporting odd ratios and Chi2 with asdoc


Richard Makurumidze has asked:It seems asdoc does not work with the chi (chi -square) and or (odds ratio) in logistic regression. Is this correct or am making some error?

Richard is referring to the nest option of asdoc that creates the nested regression tables. Without the nest option, asdoc produces detailed regression tables and exports odds ratio as a default option. However, with nest option, users must explicitly declare that they are interested in the odd ratios. This declaration is done using the eform option. In the following examples, I show how to get odd ratios with both the detailed and the nested regressions.

Reporting the odd ratios

We shall use the example data that is available on the Stata web server. The data can be downloaded by typing the following in the Stata command window.

webuse lbw, clear

Getting odd ratios in the detailed regression tables

 asdoc logistic low age lwt i.race smoke ptl ht ui, replace 

Getting odd ratios in the nested regression tables

 asdoc logistic low age lwt i.race smoke ptl ht ui, replace nest eform

This is how the output looks like.

Reporting the Chi2

Richard’s second querry is related to reporting the Chi2 test value. Since asdoc tries to find the r-squared values in regression commands, it is possible that this value is not available in some commands such as in the case of the logisitc regression. Users can add additional statistics to the regression table by using the option add(). There is a detailed discussion on this option in the asdoc help file, which we can access by typing:

help asdoc

Below, I show how we can use this option for reporting the Chi2 test value. Please note that Stata regression commands leaves behind several statistics in the e() macro which we can report with asdoc.

 asdoc logistic low age lwt i.race smoke  ui, replace nest add(Chi2, `e(chi2)')

* Add another regression

asdoc logistic low age lwt i.race smoke ptl ht ui, nest add(Chi2, `e(chi2)')


Option add() has two elements – the text Chi2 and the macro `e(chi2)’. These two elements are separated by the comma. This is how option add works. The inputs of option add() should be added in pairs of two, each one separated by a comma.

  • 0

Reshape data in Stata – An easy to understand tutorial

Category:Blog,Uncategorized Tags : 

From wide to long format

Suppose we have the data in the following format

 | id   sex   inc80   inc81   inc82   ue80   ue81   ue82 |
 |  1     0    5000    5500    6000      0      1      0 |
 |  2     1    2000    2200    3300      1      0      0 |
 |  3     0    3000    2000    1000      0      0      1 |

The above structure is known as the wide format. If we wish to convert it to a long format, such as the one given below,

| id year sex inc ue |
| 1 80 0 5000 0 |
| 1 81 0 5500 1 |
| 1 82 0 6000 0 |
| 2 80 1 2000 1 |
| 2 81 1 2200 0 |
| 2 82 1 3300 0 |
| 3 80 0 3000 0 |
| 3 81 0 2000 0 |
| 3 82 0 1000 1 |

We shall just type

reshape long inc ue, i(id) j(year)


Since we need to convert the data from a wide format to a long format, this is why the command that we wrote was reshape long. After that, we have to specify the names of the variables which are in the wide format. In our dataset, there are 2 variables which are INC and UE. Both of these variables have a numeric part. That numeric part is what we call the variable J. We specify this J variable in the option j(new variable). In our dataset, there is no variable with the name year, however, we wrote the option j(year) so that a new variable is created for the numeric values of 80, 81, and 82. We also specified option i(id), where option i needs an existing variable that is a unique panel identifier.

To practice the above yourself, here is the source data and code.

reshape long inc ue, i(id) j(year)

Reshape long to wide

Continuing from the previous example, we can reshape the data back to wide format by

reshape wide inc ue, i(id) j(year)

email-subscribers-form id=”{form-id}”

  • 0

asdoc version 2.3.3 : New Features

Category:asdoc,Stata Programs Tags : 

Version, dated Feb 23, 2019, of asdoc bring significant improvements to existing routines and introduces few new features. Details are given below. If you have not used asdoc previously, I would encourage you to read this half page quick start to asdoc.

New Features

1.Font style

2. Formatting the header row and header column

3. Revamped the tabulation commands

4. Revamped the table command

5. Extending the detailed regression tables [ Read further details here]

  • 5.1 Added confidence intervals to the detailed regression tables
  • 5.2 Added an option for customizing the significance starts
  • 5.3 Added an option for suppressing significance stars
  • 5.4 Added an option for suppressing confidence intervals

6. Adding support for macOS

6. Improving the output from proportion command

7. Support added for logistic family of regressions

8. Improving table outputs of non-standard outputs i.e. multilevel models

9. eform() option added to nested tables

Detailed discussion and examples are provided in the help file accompanying the new version of asdoc. However, I would like to discuss the first two features in some details below.

1. Setting font style

The default font style is Garamond in the latest version of asdoc. Option font(font_name) can be used to change the font face to any desired font style. In the brackets, we have to write the full name of the font, which is currently installed in the operating system. For example, to set the font face to Arial, we shall type: 


To produce summary statistics in Times New Roman font, let us use the auto dataset from the system directory

sysuse auto, clear
asdoc sum, font(Times New Roman) replace

Please note that the font() option can be used only at the start of the document. Therefore, it cannot change from table to table when using option append of asdoc.

2. Formatting table headers

In this new version of asdoc, we can easily pass RTF formatting control words to the header row and header columns of the ouput tables. For this purpose, option fhr() is used to format the row headers, i.e. the data given in the first column of each row. Similarly, option fhc() is used to format the column headers, i.e., the data given in the top cells of each column. Both the fhr() and fhc() will pass RTF control words to the final document. See the
following examples.

Objective                                    Code to use
Format column headers as bold               fhc(\b)
Format column headers as italic             fhc(\i)
Format column headers as bold and italic    fhc(\b \i)
Format row headers as bold                  fhr(\b)
Format row headers as italic                fhr(\b)
Format row headers as bold and italic       fhr(\b \i)

So to make a table of descriptive statistics with column headers in bold and row headers in italic font, the code would be:

sysuse auto, clear
asdoc sum, fhr(\i) fhc(\b) replace

  • 4

Getting p-values and t-values with asreg

Category:Blog,Stata Programs

Xi asked the following quesiton:

How can I get p-values and t-values using asreg program?

Introduction to asreg

asreg is a Stata program, written by Dr. Attaullah Shah. The program is available for free and can be downloaded from SSC by typing the following on the Stata command window:

ssc install asreg

asreg was primarily written for rolling/moving / sliding window regressions. However, with the passage of time, several useful ideas were conceived by its creator and users. Therefore, more features were added to this program. 

Getting t-values after asreg

Consider the following example where we use the grunfeld dataset from the Stata web server. The dataset has 20 companies and 20 years of data for each company. We shall estimate the following regression model where the dependent variable is invest and independent variables are mvalue and kstock. Let’s estimate the regression model separately for each company, denoted by i .

In the following lines of code, the letters se after comma causes asreg to report the standard errors for each regression coefficient.

webuse grunfeld, clear
bys company: asreg invest mvalue kstock, se

asreg generates the regression coefficients, r-squared, adjusted r-squared, number of observations (_Nobs) and standard errors for each coefficient. This is enough information for producing t-values and p-values for the regression coefficients. The t-values can be generated by:

gen t_values_Cons = _b_cons / _se_cons
gen t_values_mvalue = _b_mvalue / _se_mvalue
gen t_values_kstock = _b_kstock / _se_kstock

Getting p-values after asreg

Getting p-values is just one step away. We need one additional bit of information from the regression estimates, that is the degrees of freedom. This is usually equal to the number of observations minus the number of parameters being estimated. Since we have two independent variables and one constant, the number of parameters being estimated are 3. asreg returns the number of observation in the variable _Nobs. Therefore, the term _Nobs – 3 in the following lines of code is a way to get the degrees of freedom.

gen p_values_Cons = (2 * ttail(_Nobs-3), abs( _b_cons / _se_cons ))
gen p_values_mvalue = (2 * ttail(_Nobs-3), abs( _b_mvalue / _se_mvalue ))
gen p_values_kstock = (2 * ttail(_Nobs-3), abs( _b_kstock / _se_kstock ))

Verify the Results

Let’s estimate a regression for the first company and compare our estimates with those produced by the Stata regress command.

 reg invest mvalue kstock if company == 1
list _b_cons _se_cons p_values_Cons in 1

Full Code

 webuse grunfeld, clear
bys company: asreg invest mvalue kstock, se
gen t_values_Cons = _b_cons / _se_cons
gen t_values_mvalue = _b_mvalue / _se_mvalue
gen t_values_kstock = _b_kstock / _se_kstock
gen p_values_Cons = (2 * ttail(_Nobs-3), abs( _b_cons / _se_cons ))
gen p_values_mvalue = (2 * ttail(_Nobs-3), abs( _b_mvalue / _se_mvalue ))
gen p_values_kstock = (2 * ttail(_Nobs-3), abs( _b_kstock / _se_kstock ))
reg invest mvalue kstock if company == 1
list _b_cons _se_cons t_values_Cons p_values_Cons in 1

  • 8

Ordering variables in a nested regression table of asdoc in Stata

Category:asdoc,Blog Tags : 

In this blog entry, I shall highlight one important, yet less known, feature of the option keep() in nested regression tables of asdoc. If you have not used asdoc previously, this half-page introduction will put on fast track. And for a quick start of regression tables with asdoc, you can also watch this YouTube video.


Option keep()

There are almost a dozen options in controlling the output of a regression table in asdoc. One of them is the option keep(list of variable names). This option is primarily used for reporting coefficient of the desired variables. However, this option can also be used for changing the order of the variables in the output table. I explore these with relevant examples below.


1. Changing the order of variables

Suppose we want to report our regression variables in a specific order, we shall use option keep() and list the variable names in the desired order inside the brackets of option keep(). It is important to note that we have to list all variables which we want to report as omitting any variable from the list will cause asdoc to omit that variable from the output table.


An example

Let us use the auto dataset from the system folder and estimate two regressions. As with any other Stata command, we need to add asdoc to the beginning of the command line. We shall nest these regressions in one table, hence we need to use the option nest. Also, we shall use option replace in the first regression to replace any existing output file in the current directory. Let’s say we want to variables to appear in this order in the output file _cons trunk weight turn. Therefore, the variables are listed in this order inside the keep() option. The code and output file are shown below.

sysuse auto, clear
asdoc reg mpg turn, nest replace
asdoc reg mpg turn weight trunk, nest keep(_cons trunk weight turn)



2. Reporting only needed variables

Option keep is also used for reporting only needed variables, for example, we might not be interested in reporting coefficients of year or industry dummies. In such cases, we shall list the desired variable names inside the brackets of the keep() option. In the above example, if we wish to report only _cons trunk weight , we would just skip the variable turn from the keep option. Again, the variables will be listed in the order in which they are listed inside the keep option.  

sysuse auto, clear
asdoc reg mpg turn, nest replace
asdoc reg mpg turn weight trunk, nest keep(_cons trunk weight)



Off course, we could also have used option drop(turn) instead of option keep(_cons trunk weight) for dropping variable turn from the output table.



  • 17

Exporting ttest results from Stata to Word using asdoc

Category:asdoc,Blog,Stata Programs,Uncategorized


asdoc installation

If you have not already studied the features of asdoc, you can visit this page that lists the table of contents of what asdoc can do. You can also read this one paragraph introduction to asdoc. The following line of code will install asdoc from SSC

ssc install asdoc
help asdoc


Reporting t-tests with asdoc

Before we make the t-test results table for our example data, let us breifly explore the options available in asdoc for making a t-test results table.

Whether it is one-sample t-test or two-sample or other forms, asdoc manages to report the results line by line for each test. asdoc also allows accumulating results from different runs of t-tests. For this purpose, the option rowappend of asdoc really comes handy. With the sub-command ttest , we can use the following options of asdoc to control asdoc behavior.

(1) replace / append

(2) save(filename)

(3) title(text)

(4) fs(#)

(5) hide.

(6) stats()

(7) rowappend.

These options are discussed in detail in Section 1 of asdoc help file. Option stats and rowappend are discussed below:


Option stat()

Without stat() option, asdoc reports the number of observations (obs), mean, standard error, t-value, and p-value with t-tests. However, we can select all or few statistics using the stat option. The following table lists the keywords and their details for reporting the desired statistics.

keyword details
n Number of observations
mean Arithmetic mean
se Standard error
df degrees of freedom
obs Number of observations
t t-value
p p-value
sd standard deviation
dif difference in means if two-sample t-test


Option rowappned

ttest tables can be constructed in steps by adding results of different t-tests to an existing table one by one using option rowappend. There is only one limitation that the t-tests are performed and asdoc command applied without writing any other results to the file in-between.


An example

Suppose we have the following data set with variables r0, r1, r2, r3, and y. The data can be downloaded into Stata by

use, clear

The variables ro-r3 are the numeric variables for which we would like to conduct one-sample ttest whereas variable y is a numeric date variable that tracks years. We wish to conduct a ttest for each of the r0-r3 variables and in each year and make one table from all such tests.


Without using a loop

 asdoc ttest R0==0 if Y==2009, replace title(One Sample t-test Results)
asdoc ttest R1==0 if Y==2009, rowappend
asdoc ttest R2==0 if Y==2009, rowappend
asdoc ttest R3==0 if Y==2009, rowappend
asdoc ttest R0==0 if Y==2010, rowappend
asdoc ttest R1==0 if Y==2010, rowappend
asdoc ttest R2==0 if Y==2010, rowappend
asdoc ttest R3==0 if Y==2010, rowappend
asdoc ttest R0==0 if Y==2011, rowappend
asdoc ttest R1==0 if Y==2011, rowappend
asdoc ttest R2==0 if Y==2011, rowappend
asdoc ttest R3==0 if Y==2011, rowappend
asdoc ttest R0==0 if Y==2012, rowappend
asdoc ttest R1==0 if Y==2012, rowappend
asdoc ttest R2==0 if Y==2012, rowappend
asdoc ttest R3==0 if Y==2012, rowappend

And appreciate the results



1.In the first line of code, we wrote asdoc ttest in the beggining of the line. This is how we use asdoc with Stata commands. We just add asdoc to the beggining of any Stata command and that’s all.

2. We used two options of asdoc in the first line of code: the replace and title(). Replace will replace any existing file with the name Myfile.doc and title will add the specific test as a title to the output file.

3. In the second line of code, we added option rowappend() that will append the results to the existing table in the file Myfile.doc

4. And the process continues untill all ttests are estimated.


  • 19

asdoc: Cutomizing the regression output | MS Word from Stata | Confidence Interval, adding stars, etc.

Category:asdoc,Blog,Uncategorized Tags : 


Version 2.3 of asdoc adds the following features for reporting detailed regression tables.

1. Reporting confidence interval

2. Suppressing confidence intervals

3. Suppressing the stars which are used to show significance level

4. Customization of significance level for stars

These features are discussed in details below. If you have not already studied the features of asdoc, you can visit this page that lists the table of contents of what asdoc can do. You can also read this one paragraph introduction to asdoc. The following line of code will install this beta version of asdoc from our website

net install asdoc, from( replace
help asdoc


Details of the new features

The new features related to creating detailed regression tables with asdoc are discussed below with details. 


1. Confidence interval

I received several emails and comments on blog posts suggesting the addition of confidence intervals (CI) to the detailed regression tables created by asdoc. In version 2.3 onwards, confidence intervals are shown by default. This means that we do not have to add an additional option to report CI. See the following example. 

sysuse auto, clear
asdoc reg price mpg rep78 headroom trunk weight length turn , replace



2. Suppressing the confidence interval

If confidence intervals are not needed, we can use option noci. For example

asdoc reg price mpg rep78 headroom trunk weight length turn , replace noci


3. Suppressing stars

Similarly, if we are not interested in reporting significance stars, we can use option nostars. For example, 


4. Setting custom significance level

The default significance levels for reporting stars are set at : *** for p-values <=0.01; ** for p-values <=0 .05, and * for p-values <=0.1. However, now we can set our own levels for statistical significance using option setstars. An example of setstars option looks like:

setstars(***@.01, **@.05, *@.1)

As we can see from the above line, setstars separates each argument by a comma. Each argument has three components. The first component is the symbol (in our case it is *) which will be reported for the given significance elve. The second component is the @ sign that connects the significance level with the symbol. And the third component is the value at which the significance level is set. So if we want to report stars such that

* for p-value .001 or less
** for p-value .01 or less
*** for p-value .05 or less

We shall write the option setstars as

setstars(*@.001, **@.01, ***@.05)

Continuing with our example, let us use the above option to report our defined level of stars.

asdoc reg price mpg rep78 headroom trunk weight length turn , replace setstars(*@.001, **@.01, ***@.05)



  • 2

Merging datasets in Stata on long strings and less precise matching criterion


The Problem

Consider the following two datasets where company names are not exactly the same in both the datasets, still we want to merge them using the name variable as the merging criterion. How that can be done in Stata.

name in dataset 1 name dataset 2 symbol data set2
The Pakistan General Insurance Co. Ltd. The Pakistan General Insurance PKGI
Security Leasing Corporation Ltd. Security Leasing Corporation L SLCL
Awwal Modaraba Comay Awwal Modaraba AWWAL
B.F. Modaraba. B.F. Modaraba BFMOD

The above table shows that the company names not only differ in terms of different number of characters but also in terms of capitalization. However, there is a general patterns of similarity in the first few characters, starting from left to right. Given that, we can split the problem into two parts.


1.Extract similar characters

Extract the first few characters that are similar in both the dataset and merge the data using those similar characters. For example, in the second row of the above table, “The Pakistan General Insurance” part is similar in both the tables. If we count these characters, they are 30. So how exactly are we going to find the matching number of characters in each case? We shall not do that. Instead, we shall take the iterative path where:

1. we start with extracting the first n-number of characters from both the key variables in the two datasets and merge using the extracted (truncated) variables. If the merge succeeds, we shall save the merged data separately. Also, we shall delete, from the initial file, those records which were successfully merged, and further process those that did not merge. In this first step, we shall normally start with extracting a large number of characters, for example, up to 30 characters in the case of row 2 of the above table.

Please note that the relevant Stata function is substr() for extracting a given number of characters from a variable. We shall discuss it further as we proceed in this article.

2. In the next iteration, we shall further process those records which did not merge. This time we shall extract one character less than what we used in the preceeding step. The idea is that if two variables did not have the first 30 characters in common, they might have the first 29 characters in common. But this time, we need to be careful as we reduce the extraction of the initial number of characters, the chances of matching incorrect records increases. For example, consider the following two records where the first 12 characters are exactly the same in both the records i.e. “THE BANK OF “, therefore, if we merge the two data sets using only the first 12 characters, we shall incorrectly merge THE BANK OF KHYBER into THE BANK OF PUNJAB


So using the first 29 characters of the two variables, we shall proceed to merge. Again, we shall retain the successfully merged records, append it to the already saved successful merged data, and delete it from the initial dataset, just we did in the first step above.

3. The iterative process will continue in the same fashion, and each time we need to pay more attention to the merged data to identify any incorrect merges as explained in step 2.

4. We shall stop at a point where the data seems to have a reasonable number of similar observations based on the initial characters.


2. Use lower case

The second part of the problem is the dissimilarity in the capitalization of the names. This part of the problem is easy to handle as we can use the function lower() to convert all the names to lower cases and then use it in the merging process.


An Example

Let’s use a simple example to implement what we have read so far. In this example, we are going to use the same data as given in the above table. From that data, we shall create two datasets. One of the dataset will remain in the Stata memory, we shall call it data_memory. The other data set will be saved to a file, we shall call it data_file. The data_file has two variables, name and symbol. We shall merge the data_memory into data_file using variable name as the merging criterion. To create the two dataset, we can copy and paste the following code to Stata do editor and run it.

* Copy the following code and run from Stata do editor

* Create the data_memory
input str30 name str5 symbol
"The Pakistan General Insurance" "PKGI"
"Security Leasing Corporation L" "SLCL"
"Awwal Modaraba" "AWWAL"
"B.F. Modaraba" "BFMOD"
save data_memory, replace

* Now to create the data_file
input str39 name
"The Pakistan General Insurance Co. Ltd."
"Security Leasing Corporation Ltd."
"Awwal Modaraba Comay"
"B.F. Modaraba."

save data_file, replace

Now that we have created the two datasets, let’s start the process step by step. In the following code box, the code perform the first step as discussed above. I shall then explain what actually the code does.

use data_file, clear
gen name2=lower(substr(name,1,30))
clonevar name_old = name
save data_file, replace
use data_memory
gen name2=lower(substr(name,1,30))
merge 1:1 name2 using data_file
save temporary, replace
keep if _merge == 3
save matched, replace
list name name2 symbol
name name2 |
---------------------------------------------------------------------- |
NATIONAL BANK OF PAKISTAN national bank of pakistan |
Security Leasing Corporation Ltd. security leasing corporation |
THE BANK OF KHYBER the bank of khyber |
The Pakistan General Insurance Co. Ltd. the pakistan general insurance

use temporary, clear
keep if _merge == 2
keep name
save remaining, replace

| name |
| Awwal Modaraba |
| B.F. Modaraba |



In the above code block, we loaded the first dataset i.e. data_memory and created a truncated variable (name2) from the original variable using the following line of code. The first function in this line is lower() that converts all cases to lower cases. The second function is substr() that extracts bytes 1 to 30 from the string variable name. So this line will convert “Security Leasing Corporation Ltd.” to “security leasing corporation

gen name2=lower(substr(name,1,30))

After creating the name2 variable, we saved and replaced the data_file. We then loaded the other dataset, that is, data_memory and created a similar truncated variable as above, and then merged the two datasets using this truncated variable. The merged dataset has all useful information and we saved it in a temporary.dta file for further processing.

The first step in this processing is to isolate successful merges, that is flagged by the _merge == 3 code, therefore, we kept all such observations and saved them to matched.dta file. The list command shows that we have 4 successful merges in this go. We then reload the temporary.dta file to isolate those records that did not merge. Therefore, those records are saved to another file, which we named as remaining.dta.

In our next iteration, we shall take all of the above steps using the remaining.dta file, instead of data_file.dta file.

* Iteration 2 - using first 29 characters
use data_file, clear
drop name2
gen name2=lower(substr(name,1,29))
save data_file, replace
use remaining, clear
gen name2=lower(substr(name,1,29))
merge 1:1 name2 using data_file
save temporary, replace
keep if _merge == 3
save matched, replace
list name name2 symbol

 | name name2 symbol |
| Awwal Modaraba awwal modaraba AWWAL |
| B.F. Modaraba b.f. modaraba BFMOD |
| SILKBANK LIMITED silkbank limited SILK |
| THE BANK OF PUNJAB the bank of punjab BOP |

save remaining, replace

Is there a simple way to do this?


smerge Program

The above tutorial is best for a learning the merging process, however, it is too lengthy and is not optimal for quick applications. I have managed to get the code in a standard Stata ado program. I call this program as smerge, that is merge based on sub-string.



The program can be installed by typing the following line in Stata command window

net install smerge, from( replace

The syntax of the program is given below:

 smerge varname, use(file_name) [lower() higher()]

smerge is the program name, followed by the varname, that is the variable name which should exist in both the files, i.e. the file currently loaded file, i.e. the file in the memory of Stata and the file on disk. After the comma, the program options are listed. Option use(file_name) is a required option. In the brackets, we have to provide the file name with which we want to merge the data. It can be a full directory path with the file name or simply the file name if the file is in the current directory.

options enclosed in brackets [] show optional options in Stata. There are two optional options in smerge program. If these options are not specified, then the default values are used, which are 30 for the higher and 10 for the lower. The higher() option sets the initial value of the number of string bites to be extracted from the varname to make a truncated variable that will be used for the first iteration in the merge process. The initial value is decremented by one in the next iteration until it reaches the limit set by the option lower().


An Example using smerge

Let use the same dataset which we created above. We shall first load the file data_memory and merge it with the file data_file.dta. Both the datasets have the variable name.

use data_memory.dta, clear
smerge name, use(data_file) lower(10) higher(30)

. list
| name symbol nstr |
| The Pakistan General Insurance Co. Ltd. PKGI 30 |
| Security Leasing Corporation Ltd. SLCL 30 |
| Awwal Modaraba Comay AWWAL 14 |
| B.F. Modaraba. BFMOD 13 |


Read the second line in the above code block. We typed smerge and then the variable name. The name variable is the key variable that we are using as the merging criterion. Option use(data_file) tells smerge program to merge the current data in memeory with the file data_file. And option higher(30) and lower(10) sets the limits for sub-string extraction for creating the truncated variables that will be used for merging. And that’s it. We just need one line of code to do all the steps that we went through in the earlier tutorial.

The list command shows three variables, one of them is nstr. This variable records the number of strings from left to right of the varname. This number is recorded when the merge is successful. So the list command shows that the four records were merged in the first iteration i.e when we extracted the first 30 characters. The 5th record was merged when we extracted 18 characters i.e. The full entry is THE BANK OF PUNJAB Ltd, and the first 18 characters are THE BANK OF PUNJAB