Author Archives: Attaullah Shah

  • 0

Using the bysort prefix with tab commands in asdoc | Stata


Recently, Scott Siegal asked for the possibility of adding the bysort prefix with tabulate, tab, tab1, and tab2 commands to asdoc. Honoring his request, I have added the bysort support to asdoc.

The new version of asdoc can be installed from my site. Copy and paste the following line in Stata and press enter.

net install asdoc, from( replace

Please note that the above line has to be copied in full. After installation of the new version, then restart Stata.

Once installed, we can use asdoc by adding the word asdoc to any Stata Command. Let us start with the basic tabulation commands

asdoc with the tab command

The tab command of Stata is used to create a table of frequencies. When adding one variable with tab command, it creates a one-way tabulation. To export the results to MS Word, just add asdoc to the beginning of the command. Let us use the nslw88.dta dataset from the Stata installation.

sysuse nlsw88, clear

*Tabule the race variable
asdoc tab race

The above code generates the following table.

Tab with bysort variable

The bysort prefix can be used to repeat the one-way tabulation across groups of another variable. So if you care about repeating the tabulation of race variable across the categories of the married variable, we shall write the code as:

bys married: asdoc tab race , replace

Two-way tables

Two-way frequency tables are created by adding two variables after the tab word. Let’s create a two-way table of race and smsa

Using bysort with two-way tables

The bysort prefix can be used to repeat the two-way tabulations across groups of another variable. So if we care about repeating the tabulation of race and smsa variables across the categories of the married variable, we shall write the code as:

bys married: asdoc tab race smsa , replace

  • 1

What is a Python Dictionary

Category:Blog Tags : 
Python Dictionary

Python Dictionary

Dictionary is a method in which data is stored in pairs of keys and values. These are also called Associative Arrays in other programming languages.

What is key-value pair?

key is a unique identifier for a given record. Values are data stored in that identifier. For example, Let us say that Muneer is a student, and we want to create a dictionary containing his details. The first key in his record is name and the value for this key is ‘Muneer’. He has a weight of 75 Kg, therefore, the second key in this record is weight, and the value of this key is 75. His height is 6ft, and has age of 35 years. In this record, following are the key value pairs

  |   keys   values |
  |   name   Muneer |
  | weight       75 |
  | height        6 |
  |    age       35 |

How to create a dictionary

A dictionary is created using curly brackets. The first item is always the key followed by a full colon, the second item is the value. Next key-value pair is created using a comma.

In [2]:
student = {'name': 'Muneer', 'weight': 75, 'height': 6, 'age': 35}
In [3]:
{'name': 'Muneer', 'weight': 75, 'height': 6, 'age': 35}
In [13]:
In [14]:
In [ ]:

  • 0

Getting Started with Data Visualization in Python Pandas

Category:Blog Tags : 


To download the datasets used in this tutorial, pleas see the following links
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv

TED Talk Dataset Excercises

In [5]:
# Change directory
In [6]:
cd "D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment"
D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment
In [7]:
import pandas as pd
In [8]:
ted = pd.read_csv('ted.csv')

1: Explore the Data attributes

In [11]:
comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object
In [12]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB
In [13]:
(2550, 17)

2. Which talk has the highest comments

In [77]:
ted.sort_values('comments')[['comments', 'duration','main_speaker']].tail()
comments duration main_speaker
1787 2673 1117 David Chalmers
201 2877 1099 Jill Bolte Taylor
644 3356 1386 Sam Harris
0 4553 1164 Ken Robinson
96 6404 1750 Richard Dawkins

3 Find top 5 talks that have the highest views to comments ratio

In [16]:
ted['view_to_comment'] = ted['views'] / ted['comments']
In [17]:
2545    26495.882353
2546    69578.333333
2547    37564.700000
2548    13103.406250
2549    48965.125000
Name: view_to_comment, dtype: float64

4 . Create a histogram of comments

In [19]:
import matplotlib.pyplot as plot
ted['comments'].plot(kind = 'hist')
<matplotlib.axes._subplots.AxesSubplot at 0x14a233e3978>

5. Create histogram of comments where comments are less than 1000

In [35]:
# Get index of those row which have less than 1000 comments 
index = ted['comments']<1000
In [38]:
# Get only the comments column from these filtered rows
com1000 = ted[index]['comments']
In [39]:
# Make a plot of these filtered comments
com1000.plot(kind = 'hist')
<matplotlib.axes._subplots.AxesSubplot at 0x14a236dc7b8>
In [40]:
# When you expert, you can do the above just in one line
<matplotlib.axes._subplots.AxesSubplot at 0x14a2375cac8>
In [44]:
# How many rows were excluded from the above graph
ted[ted['comments'] >=1000].shape
(32, 18)

6. Do the same as in 5, but using a query method

In [68]:
# Filter the whole dataset where comments are less than 1000
ted1000 = ted.query('comments <1000')
In [69]:
# Get only the comments column from the reduced dataset
comment1000 = ted1000['comments']
In [70]:
# Plot the filtered comments
comment1000.plot(kind = 'hist')
<matplotlib.axes._subplots.AxesSubplot at 0x14a238fb630>

7. How to add more bins to the histogram

In [71]:
comment1000.plot(kind = 'hist', bins = 20)
<matplotlib.axes._subplots.AxesSubplot at 0x14a23953278>

8. Make a box plot and identify outliers

In [73]:
comment1000.plot(kind = 'box')
<matplotlib.axes._subplots.AxesSubplot at 0x14a23a4ba20>

The black dots show outliers

In [ ]:

  • 0

asrol Cumulative Returns Over 12-months, skipping the Current month


Anomaly Hunters has asked: How Can I find cumulative returns from month T-1 to T-12, i.e. not including the current month in the calculation. I want to do it on a rolling window for each firm.

Let us use a small dataset. We shall create this dataset in Stata with the following code

 input byte firm_id str6 date float ret
 1 "200310"         .
 1 "200311"  -.049505
 1 "200312"  .0121528
 1 "200401"  .0360206
 1 "200402" -.0480132
 1 "200403"  .0434783
 1 "200404"  .0933333
 1 "200405"  .0121951
 1 "200406"  .0060241
 1 "200407"  .0883234
 1 "200408" -.0261348
 1 "200409" -.0819209
 1 "200410"      -.02
 1 "200411" -.0062794
 1 "200412"  .0458136
 1 "200501"  .0891239
 1 "200502"  .0443828
 1 "200503"

* Convert the date to Stata format

tostring date, replace
gen month = mofd(date(date, "YM"))
format month %tm

* Find cumulative returns in a 12-month rolling window

bys firm_id : asrol ret, stat(product) add(1) window(month 12)

* Lag the returns by one month to skip the current month returns from the calculation

tsset firm_id month
gen ret_skip1 = L.product12_ret
list month ret product12_ret ret_skip1

*And appreciate the results


     |   month        ret   product1~t   ret_skip1 |
  1. | 2003m10          .            .           . |
  2. | 2003m11   -.049505     -.049505           . |
  3. | 2003m12    .012153   -.03795382    -.049505 |
  4. |  2004m1    .036021   -.00330034   -.0379538 |
  5. |  2004m2   -.048013   -.05115508   -.0033003 |
  6. |  2004m3    .043478   -.00990092   -.0511551 |
  7. |  2004m4    .093333    .08250829   -.0099009 |
  8. |  2004m5    .012195    .09570959    .0825083 |
  9. |  2004m6    .006024    .10231025    .0957096 |
 10. |  2004m7    .088323    .19967004    .1023103 |
 11. |  2004m8   -.026135     .1683169      .19967 |
 12. |  2004m9   -.081921    .07260733    .1683169 |
 13. | 2004m10       -.02    .05115519    .0726073 |
 14. | 2004m11   -.006279     .0989585    .0511552 |
 15. | 2004m12    .045814    .13550617    .0989585 |
 16. |  2005m1    .089124     .1937088    .1355062 |
 17. |  2005m2    .044383    .30956536    .1937088 |
 18. |  2005m3   -.005312     .2483334    .3095654 |

  • 0

Step-by-Step: Portfolio Risk in Stata and Excel


Portfolio Risk in Excel

To build our concept of the portfolio risk, we shall calculate it first manually in EXCEL, then we shall replicate the results using matrix notations in Stata. 
Consider the following set of returns for two assets, i.e asset A and B.    

  A        B   
.249917 .819483
.739069 .821416
.895491 .276843
.902722 .001586
.078344 .714815
.429804 .027261
.239556 .736011

As we know, the portfolio standard deviation under the modern portfolio theory is calculated as = 

Thus, we need the standard deviation of the two assets, the proportion of investment in each asset (weights), and the covariance term between the two assets. In Excel, we can find the standard deviation by


And the formula for covariance is

=COVARIANCE.S(range1, range2)

Thus, standard deviation of asset A = 0.3387         

the standard deviation of asset B = 0.3713         

Covariance between the two = -0.0683

And if we invest equally in both assets, then

the weight of A = 0.5

and weight of B = 0.5

Portfolio SD =( (0.5^2*0.3387^2)+(0.5^2*0.3713^2)+(2*0.5*0.5*-0.0683))^0.5

Excel sheet showing the above example can be downloaded from here.

Portfolio Risk in Stata

Finding portfolio standard deviation under the Modern Portfolio theory using matrix algebra requires three matrices
1. Weights of the assets in the portfolio, in row format = W
2. Variance-Covariance matrix of assets returns = S
3. Weights of the assets in the portfolio, in column format = W’

Portfolio SD = W * S * W'

NOTE: In order to find the variance-covariance matrix, you can install varrets program from ssc with:

 ssc install mvport 

Step 1 : Copy the example data to stata

You can do it either by copying the data from the excel file and pasting it to the stata editor. Alternatively, you can copy the following and past it in the stata command line, or download the data file from here.

 input a b
.249917 .819483
.739069 .821416
.895491 .276843
.902722 .001586
.078344 .714815
.429804 .027261
.239556 .736011

Step 2: Make variance-covariance matrix

varrets a bmat S = r(cov) 

Step 3: Make a weight matrix

Assuming that we assign equal weights, we define matrix W

mat W = (0.5, 0.5) 

Step 4 : Multiply the weight and variance-covariance matrix

mat VAR = W * S * W' 

Step 5: Show variance of the portfolio

mat list VAR 

Complete .do file of the example can be downloaded from here.

  • 1

How Fama and French June to July Portfolios are Constructed?

Category:Asset Pricing Research,Blog Tags : 

The description of portfolios’ construction given in various Fama and Fench papers is usually confusing for many researchers, especially those who are new to asset pricing models. The typical language used in Fama and French papers reads like this

The size breakpoint for year t is the median NYSE market equity at the end of June of year t. BE/ME for June of year t is the book equity for the last fiscal year end in t-1 divided by ME for December of t-1.

This blog post aims at explaining the above paragraph with some examples.

Break-points for Portfolio Construction

The size-breakpoints

As mentioned in the above paragraph, the size-breakpoints are based on the market capitalization of firms at the end of June of the current year. This means while making two groups of firms:

  1. First, we need to reduce the data to keep the market capitalization of each firm at the end of June.
  2. Also, we need to further reduce the data to keep only firms listed at the NYSE stock exchange.

The BE/ME-breakpoints

The BE/ME variable uses lagged values of the book equity and market equity. However, the way the lagged values are obtained for both the variables differs from one another. The book equity is the last fiscal year’s available book equity. Since the assumption is that the financial year ends in June, therefore, the last June’s book equity is called book equity for the last fiscal year end in t-1

Consider the following monthly data where we have observations for a single firm over three years period. The variable year represents the calendar year that starts in January and ends in December. The variable fyear represents the fiscal year, that starts in July and ends in June.

From these observations, we need ME in December of yeat t-1. In our dataset, the first December appears in the calendar year 2016. The ME on that date is 958. For the calendar year 2006, the corresponding BE value for the fiscal year is 467, that is the book equity for the last fiscal year end in t-1
We are able to calculate the BE/ME ratio in June 2017 as = 467 / 958. This value will be used for finding the breakpoints and making the three BE/ME portfolios, which are then held from July of year t to June of year t+1, as shown in the following snapshot.

The Yearly Portfolios

The portfolios for July of year t to June of t+1 include all NYSE, AMEX, and NASDAQ stocks for which we have market equity data for December of t-1 and June of t, and (positive) book equity data for t-1.


Once the breakpoints for size and

How to Do it Programatically?

There are more than a dozen steps to fully implement the Fama and French model. Entry-level researchers might try to do all these steps in MS Excel. However, doing these steps in Excel is not only cumbersome but also prone to errors. Further, the process is manual, therefore, it cannot be easily replicated.

We have developed codes in Stata to construct the three factors of the Fama and French model as well as the 25 RHS (right-hand side) portfolios. Our codes generate factors that have over 97% correlation with the Fama and French factors.

Why buy codes for Fama and French Model?

There are several reasons that you should consider using the codes of a professional. These reasons include but are not limited to the accuracy of the code, quick learning, replicability of the codes in the same project or other projects, and validation of your own code if you have written a code yourself.

Pricing Options


  • Source Code
  • Comments
  • Email Support


Most Popular

  • Source Code
  • Data Handling
  • Comments
  • Email Support


  • Source Code
  • Example Dataset
  • Comments
  • Email Support

  • 7

fillmissing: Fill Missing Values in Stata

Category:Blog Tags : 

This post presents a quick tutorial on how to fill missing values in variables in Stata. This tutorial uses fillmissing program which can be downloaded by typing the following command in Stata command window

net install fillmissing, from( replace


Important Note: This post does not imply that filling missing values is justified by theory. Users should make their own decisions and follow appropriate theory while filling missing values.


After the installation of the fillmissing program, we can use it to fill missing values in numeric as well as string variables. Also, this program allows the bysort prefix to fill missing values by groups. We shall see several examples of using bysort prefix to perform by-groups calculations. But let us first quickly go through the different options of the program.


Program Options

The fillmissing program offers the following options to fill missing values

  1. with(any)
  2. with(previous)
  3. with(next)
  4. with(first)
  5. with(last)
  6. with(mean)
  7. with(max)
  8. with(min)
  9. with(median)

Let us quickly go through these options. Please note that options starting from serial number 6 are applicable only in the case of numerical variables.


1. with(any)

Option with() is used to specify the source from where the missing values will be filled. Option with(any) is an optional option and hence if not specified, will automatically be invoked by the fillmissing program. This option is best to fill missing values of a constant variable, i.e. a variable that has all similar values, however, due to some reason, some of the values are missing. Option with(any) will try to fill the missing values from any available non-missing values of the given variable.

Example 1: Fill missing values with(any)

Let us first create a sample dataset of one variable having 10 observations. You can copy-paste the following code to Stata Do editor to generate the dataset

clear all
set obs 10
gen symbol = "AABS"
replace symbol = "" in 5
replace symbol = "" in 8

The above dataset has missing values on row 5 and 8. To fill the missing values from any other available non-missing values, let us use the with(any) option.

fillmissing symbol, with(any)

Since with(any) is the default option of the program, we could also write the above code as

fillmissing symbol


2. with(previous)

Option with(previous) is used to fill the current missing value with the preceding or previous value of the same variable. Please note that if the previous value is also missing, the current value will remain missing. Further, this option does not sort the data, so whatever the current sort of the data is, fillmissing will use that sort and identify the current and previous observation.

Example 2: Fill missing values with(previous)

Let’s create a dummy dataset first.

clear all
set obs 10
gen symbol = "AABS" 
replace symbol = "AKBL" in 1
replace symbol = "" in 2 

The dataset looks like this

 | symbol |
 |   AKBL |
 |        |
 |   AABS |
 |   AABS |
 |   AABS |
 |   AABS |
 |   AABS |
 |   AABS |
 |   AABS |
 |   AABS |

To fill the missing value in observation number 2 with AKBL, i.e. from previous observation, we would type:

fillmissing symbol, with(previous)


What’s Next

In the next blog post, I shall talk about other options of the fillmissing program. Specifically, I shall discuss the use of by and bys with fillmissing program. Therefore, you may visit the blog section of this site or subscribe to updates from this site.


  • 2

Export output of Table command from Stata to Word using asdoc


Exporting tables from table command was the most challenging part in asdoc programming. Nevertheless, asdoc does a pretty good job in exporting table from table command. asdoc accepts almost all options with table command, except cellwidth(#), stubwidth(#), and csepwidth(#).


7.1 One-way table

Example 54 : One-way table; frequencies shown by default

sysuse auto, clear
asdoc table rep78, title(Table of Freq. for Repairs) replace


Example 55 : One-way table; show count of non-missing observations for mpg}

asdoc table rep78, contents(n mpg) replace

Example 56 : One-way table; multiple statistics on mpg requested

asdoc table rep78, c(n mpg mean mpg sd mpg median mpg) replace


Example 57 : Add formatting – 2 decimals

asdoc table rep78, c(n mpg mean mpg sd mpg median mpg) dec(2) replace


7.2 Two-way table

Example 58 : Two-way table; frequencies shown by default

asdoc table rep78 foreign, replace


Example 59 : Two-way table; show means of mpg for each cell

asdoc table rep78 foreign, c(mean mpg) replace


Example 60 : Add formatting

asdoc table rep78 foreign, c(mean mpg) dec(2) center replace


Example 61 : Add row and column totals

asdoc table rep78 foreign, c(mean mpg) dec(2) center row col replace


7.3 Three-way table

Example 62 : Three-way table

webuse byssin, clear
asdoc table workplace smokes race [fw=pop], c(mean prob) replace

7.4 Four-way table

Example 65 : Four-way table with by()

webuse byssin1, clear
asdoc table workplace smokes race [fw=pop], by(sex) c(mean prob) replace


Example 66 : Four-way table with supercolumn, row, and column totals

asdoc table workplace smokes race [fw=pop], by(sex) c(mean prob) sc col row replace

  • 0

Customized tables using option row() of asdoc – Stata

Category:asdoc,Stata Programs

This is rather a quick example of how to use option row() of asdoc for creating highly customized tables. We are interested in a table that is given bellow.

* Load example dataset
sysuse auto,clear

*Write the header row of the table with table title
asdoc, row(Dependent variable:domestic or foreign, Domestic mean/frequency, Domestic SD, Foreign mean/frequency, Foreign SD, t-test) title(Summary staticis) save(myfile) replace

*Add the second row : \i, adds an empty cell
asdoc, row( Model independent variables, \i, \i, \i, \i, \i) append

*Use a loop over each variable that include price, mpg, ...
foreach var of varlist price mpg rep78 headroom trunk weight length turn{
  *First summarize each variable for a given sample, that is if foregin is   zero
  qui sum `var' if foreign==0

  *Obtain the mean divided by frequency
  local mf=`r(mean)'/`r(N)'

   *Store the mf and standard deviation variable in accum macro
  asdoc, accum(`mf', `r(sd)')

*now repeat the same for the second sample, ie. when foreign is 1
  qui sum `var' if foreign==1
  local mf=`r(mean)'/`r(N)'
  asdoc, accum(`mf', `r(sd)')

*Conduct a two-sample ttest using foreign as a grouping variable
  ttest `var', by(foreign)

*Obtain the t-statistics
  local t : di %9.3f = abs(`r(t)')

*Create significance stars
  if `r(p)'<=0.01 {
    local star "***" 
  else if `r(p)'<=0.05{ 
    local star "**" 
  else if `r(p)'<=0.1{
   local star "*"
  else {
   local star " " 
  local tstar `t'`star' 

*Add the t-value and stars to the accum macro
  asdoc, accum(`tstar') 

* Finally write this complete row where we first write the variable name
*and then all accumulated variables that are present in $accum macro.
asdoc, row(`var', $accum) 

  • 8

tabstat with asdoc in Stata


asdoc makes some elegant tables when used with tabstat command. There are several custom-made routines in asdoc that creates clean tables from tabstat command. asdoc fully supports the command structure and options of tabstat. And, yes asdoc allows one additional statistics, that is, t-statistics alongside the allowed statistics in tabstat. For reporting purposes, asdoc categorizes tabstat commands in two groups:

(1) stats without a grouping variable

(2) stats over a grouping variable.


Tabstat Without-by

If statistics are less than variables, the table is transposed, i.e. statistics are shown in columns, while variables are shown in rows


Example 49 : One variable, many stats, including t-statistics

sysuse auto, clear  
asdoc tabstat price , stat(min max mean sd median p1 p99 tstat) replace 


Example 50 : Many variables, one statistic

asdoc tabstat price mpg rep78 headroom trunk weight length foreign , stat( mean) replace


Example 51 : Many variables, many statistics

asdoc tabstat price mpg rep78 headroom trunk weight length foreign , /// 
stat( max mean sd median p1 p99 tstat) replace


Tabstat with-by


Example 52 :

bysort foreign: asdoc tabstat price mpg rep78 headroom trunk weight length, stat(mean) replace


asdoc tabstat price mpg rep78 headroom trunk weight length, ///
stat(mean) by(foreign) replace


Example 53 : By with many variables and many statistics

bysort foreign: asdoc tabstat price mpg rep78 headroom trunk weight length, ///
stat(mean sd p1 p99 tstat) replace