Category Archives: Blog


What is a Python Dictionary

Category: Blog | Tags: Python Dictionary

Python Dictionary

A dictionary is a data structure in which data is stored in pairs of keys and values. Dictionaries are also called associative arrays in other programming languages.

What is a key-value pair?

A key is a unique identifier for a given record. A value is the data stored under that identifier. For example, suppose Muneer is a student and we want to create a dictionary containing his details. The first key in his record is name, and the value for this key is ‘Muneer’. He weighs 75 kg, so the second key in this record is weight, and its value is 75. His height is 6 ft and his age is 35 years. This record therefore contains the following key-value pairs:

  |   keys   values |
  |   name   Muneer |
  | weight       75 |
  | height        6 |
  |    age       35 |

How to create a dictionary

A dictionary is created using curly brackets. Each item consists of a key, followed by a colon, followed by the value. Key-value pairs are separated by commas.

In [2]:
student = {'name': 'Muneer', 'weight': 75, 'height': 6, 'age': 35}
In [3]:
student
{'name': 'Muneer', 'weight': 75, 'height': 6, 'age': 35}
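Once a dictionary exists, values are retrieved with square brackets and the key, and new key-value pairs can be added by plain assignment. A minimal sketch using the same student record (the city key and its value are hypothetical additions for illustration):

```python
# Recreate the student dictionary from above
student = {'name': 'Muneer', 'weight': 75, 'height': 6, 'age': 35}

# Look up a value by its key
name = student['name']          # 'Muneer'

# Add a new key-value pair by assignment (hypothetical key)
student['city'] = 'Peshawar'

# Update an existing value the same way
student['age'] = 36
```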


Getting Started with Data Visualization in Python Pandas

Category:Blog Tags : 


To download the datasets used in this tutorial, please see the following links:
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv

TED Talk Dataset Exercises

In [5]:
# Change directory
In [6]:
cd "D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment"
D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment
In [7]:
import pandas as pd
In [8]:
ted = pd.read_csv('ted.csv')

1. Explore the data attributes

In [11]:
ted.dtypes
comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object
In [12]:
ted.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB
In [13]:
ted.shape
(2550, 17)

2. Which talk has the highest number of comments?

In [77]:
ted.sort_values('comments')[['comments', 'duration','main_speaker']].tail()
      comments  duration       main_speaker
1787      2673      1117     David Chalmers
201       2877      1099  Jill Bolte Taylor
644       3356      1386         Sam Harris
0         4553      1164       Ken Robinson
96        6404      1750    Richard Dawkins

3. Find the top 5 talks that have the highest views-to-comments ratio

In [16]:
ted['view_to_comment'] = ted['views'] / ted['comments']
In [17]:
ted['view_to_comment'].tail()
2545    26495.882353
2546    69578.333333
2547    37564.700000
2548    13103.406250
2549    48965.125000
Name: view_to_comment, dtype: float64
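Note that tail() only shows the last five rows of the new column, not the five highest ratios. To actually rank the talks, sort by the new column first. A minimal sketch with toy numbers (not the real TED data):

```python
import pandas as pd

# Toy stand-in for the TED dataset
ted = pd.DataFrame({'views':    [1000, 4000, 9000],
                    'comments': [100,  20,   30]})

# Same ratio column as in the tutorial
ted['view_to_comment'] = ted['views'] / ted['comments']

# Top talks by ratio: sort descending and take the head
top = ted.sort_values('view_to_comment', ascending=False).head(2)
```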

4. Create a histogram of comments

In [19]:
import matplotlib.pyplot as plot
ted['comments'].plot(kind = 'hist')
<matplotlib.axes._subplots.AxesSubplot at 0x14a233e3978>

5. Create a histogram of comments where comments are less than 1000

In [35]:
# Get a boolean index of rows that have fewer than 1000 comments
index = ted['comments']<1000
In [38]:
# Get only the comments column from these filtered rows
com1000 = ted[index]['comments']
In [39]:
# Make a plot of these filtered comments
com1000.plot(kind = 'hist')
<matplotlib.axes._subplots.AxesSubplot at 0x14a236dc7b8>
In [40]:
# Once you are an expert, you can do the above in just one line
ted[ted['comments'] < 1000]['comments'].plot(kind = 'hist')
<matplotlib.axes._subplots.AxesSubplot at 0x14a2375cac8>
In [44]:
# How many rows were excluded from the above graph
ted[ted['comments'] >=1000].shape
(32, 18)

6. Do the same as in 5, but using the query method

In [68]:
# Filter the whole dataset where comments are less than 1000
ted1000 = ted.query('comments < 1000')
In [69]:
# Get only the comments column from the reduced dataset
comment1000 = ted1000['comments']
In [70]:
# Plot the filtered comments
comment1000.plot(kind = 'hist')
<matplotlib.axes._subplots.AxesSubplot at 0x14a238fb630>

7. How to add more bins to the histogram

In [71]:
comment1000.plot(kind = 'hist', bins = 20)
<matplotlib.axes._subplots.AxesSubplot at 0x14a23953278>

8. Make a box plot and identify outliers

In [73]:
comment1000.plot(kind = 'box')
<matplotlib.axes._subplots.AxesSubplot at 0x14a23a4ba20>

The black dots show outliers



How Are the Fama and French June-to-July Portfolios Constructed?

Category: Asset Pricing Research, Blog

The description of portfolio construction given in various Fama and French papers is usually confusing for many researchers, especially those who are new to asset pricing models. The typical language used in Fama and French papers reads like this:

The size breakpoint for year t is the median NYSE market equity at the end of June of year t. BE/ME for June of year t is the book equity for the last fiscal year end in t-1 divided by ME for December of t-1.

This blog post aims at explaining the above paragraph with some examples.

Break-points for Portfolio Construction

The size-breakpoints

As mentioned in the above paragraph, the size breakpoints are based on the market capitalization of firms at the end of June of the current year. This means that when splitting firms into two size groups:

  1. First, we need to reduce the data to keep the market capitalization of each firm at the end of June.
  2. Also, we need to further reduce the data to keep only firms listed at the NYSE stock exchange.

The BE/ME-breakpoints

The BE/ME variable uses lagged values of both book equity and market equity. However, the lagged values are obtained differently for the two variables. The book equity is the book equity of the last available fiscal year. Since the example assumes that the financial year ends in June, the book equity from the most recent June is the “book equity for the last fiscal year end in t-1”.

Consider the following monthly data, where we have observations for a single firm over a three-year period. The variable year represents the calendar year, which starts in January and ends in December. The variable fyear represents the fiscal year, which starts in July and ends in June.

From these observations, we need ME in December of year t-1. In our dataset, the first December appears in the calendar year 2016. The ME on that date is 958. For the calendar year 2016, the corresponding BE value for the fiscal year is 467, that is, the book equity for the last fiscal year end in t-1.
We are now able to calculate the BE/ME ratio in June 2017 as 467 / 958. This value will be used for finding the breakpoints and making the three BE/ME portfolios, which are then held from July of year t to June of year t+1, as shown in the following snapshot.
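The timing logic reduces to a single division once the two lagged values are in hand. Using the numbers from the example above (BE = 467 from the fiscal year ending in t-1, ME = 958 from December of t-1):

```python
# Book equity for the last fiscal year end in t-1
be_last_fiscal = 467

# Market equity in December of t-1
me_december = 958

# BE/ME ratio used for the June of year t sort
be_me = be_last_fiscal / me_december
```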

The Yearly Portfolios

The portfolios for July of year t to June of t+1 include all NYSE, AMEX, and NASDAQ stocks for which we have market equity data for December of t-1 and June of t, and (positive) book equity data for t-1.


Once the breakpoints for size and BE/ME are determined, stocks are allocated to the corresponding portfolios.

How to Do it Programmatically?

There are more than a dozen steps to fully implement the Fama and French model. Entry-level researchers might try to do all these steps in MS Excel. However, doing these steps in Excel is not only cumbersome but also prone to errors. Further, because the process is manual, it cannot be easily replicated.

We have developed codes in Stata to construct the three factors of the Fama and French model as well as the 25 RHS (right-hand side) portfolios. Our codes generate factors that have over 97% correlation with the Fama and French factors.

Why buy codes for Fama and French Model?

There are several reasons to consider using a professional’s code, including but not limited to the accuracy of the code, quicker learning, replicability of the code in the same or other projects, and validation of your own code if you have written one yourself.


fillmissing: Fill Missing Values in Stata

Category: Blog

This post presents a quick tutorial on how to fill missing values in variables in Stata. It uses the fillmissing program, which can be installed by typing the following command in the Stata command window:

net install fillmissing, from( replace


Important Note: This post does not imply that filling missing values is justified by theory. Users should make their own decisions and follow appropriate theory while filling missing values.


After the installation of the fillmissing program, we can use it to fill missing values in numeric as well as string variables. Also, this program allows the bysort prefix to fill missing values by groups. We shall see several examples of using bysort prefix to perform by-groups calculations. But let us first quickly go through the different options of the program.


Program Options

The fillmissing program offers the following options to fill missing values

  1. with(any)
  2. with(previous)
  3. with(next)
  4. with(first)
  5. with(last)
  6. with(mean)
  7. with(max)
  8. with(min)
  9. with(median)

Let us quickly go through these options. Please note that the options from serial number 6 onward are applicable only to numeric variables.
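As a quick preview of the numeric options, a with(mean) call follows the same pattern as the examples below. This sketch uses the auto dataset shipped with Stata and blanks out a few prices purely for illustration:

```stata
sysuse auto, clear
replace price = . in 1/3
* Fill the blanked-out prices with the mean of the non-missing prices
fillmissing price, with(mean)
```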


1. with(any)

Option with() specifies the source from which missing values are filled. Option with(any) is the default and is invoked automatically by fillmissing if no with() option is specified. This option is best suited to filling missing values of a constant variable, i.e. a variable that should have the same value in all observations but where, for some reason, some values are missing. Option with(any) fills the missing values from any available non-missing value of the given variable.

Example 1: Fill missing values with(any)

Let us first create a sample dataset of one variable with 10 observations. You can copy-paste the following code into the Stata Do-file editor to generate the dataset:

clear all
set obs 10
gen symbol = "AABS"
replace symbol = "" in 5
replace symbol = "" in 8

The above dataset has missing values in rows 5 and 8. To fill the missing values from any other available non-missing value, let us use the with(any) option.

fillmissing symbol, with(any)

Since with(any) is the default option of the program, we could also write the above code as

fillmissing symbol


2. with(previous)

Option with(previous) fills the current missing value with the preceding value of the same variable. Please note that if the previous value is also missing, the current value will remain missing. Further, this option does not sort the data: whatever the current sort order is, fillmissing uses it to identify the current and previous observations.

Example 2: Fill missing values with(previous)

Let’s create a dummy dataset first.

clear all
set obs 10
gen symbol = "AABS" 
replace symbol = "AKBL" in 1
replace symbol = "" in 2 

The dataset looks like this

 | symbol |
 |   AKBL |
 |        |
 |   AABS |
 |   AABS |
 |   AABS |
 |   AABS |
 |   AABS |
 |   AABS |
 |   AABS |
 |   AABS |

To fill the missing value in observation number 2 with AKBL, i.e. from the previous observation, we would type:

fillmissing symbol, with(previous)
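For readers who also work in pandas, with(previous) behaves like a forward fill: each missing value inherits the last non-missing value above it in the current sort order. A minimal sketch:

```python
import pandas as pd

# Same shape as the symbol example: a gap right after 'AKBL'
symbol = pd.Series(['AKBL', None, 'AABS', 'AABS'])

# Forward fill: the missing entry takes the previous non-missing value
filled = symbol.ffill()
```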


What’s Next

In the next blog post, I shall talk about other options of the fillmissing program. Specifically, I shall discuss the use of by and bys with fillmissing program. Therefore, you may visit the blog section of this site or subscribe to updates from this site.



Export output of Table command from Stata to Word using asdoc


Exporting tables from the table command was the most challenging part of asdoc programming. Nevertheless, asdoc does a pretty good job of exporting tables from the table command. asdoc accepts almost all options of the table command, except cellwidth(#), stubwidth(#), and csepwidth(#).


7.1 One-way table

Example 54 : One-way table; frequencies shown by default

sysuse auto, clear
asdoc table rep78, title(Table of Freq. for Repairs) replace


Example 55 : One-way table; show count of non-missing observations for mpg

asdoc table rep78, contents(n mpg) replace

Example 56 : One-way table; multiple statistics on mpg requested

asdoc table rep78, c(n mpg mean mpg sd mpg median mpg) replace


Example 57 : Add formatting – 2 decimals

asdoc table rep78, c(n mpg mean mpg sd mpg median mpg) dec(2) replace


7.2 Two-way table

Example 58 : Two-way table; frequencies shown by default

asdoc table rep78 foreign, replace


Example 59 : Two-way table; show means of mpg for each cell

asdoc table rep78 foreign, c(mean mpg) replace


Example 60 : Add formatting

asdoc table rep78 foreign, c(mean mpg) dec(2) center replace


Example 61 : Add row and column totals

asdoc table rep78 foreign, c(mean mpg) dec(2) center row col replace


7.3 Three-way table

Example 62 : Three-way table

webuse byssin, clear
asdoc table workplace smokes race [fw=pop], c(mean prob) replace

7.4 Four-way table

Example 65 : Four-way table with by()

webuse byssin1, clear
asdoc table workplace smokes race [fw=pop], by(sex) c(mean prob) replace


Example 66 : Four-way table with supercolumn, row, and column totals

asdoc table workplace smokes race [fw=pop], by(sex) c(mean prob) sc col row replace


Quick setup of Python with Stata 16

Category: Blog, Stata Programs

With the announcement of Stata 16, Python commands can be executed directly from the Stata command prompt, do files or ado programs. That would definitely expand the possibilities of doing extraordinary things without leaving the Stata environment. However, this integration exposes Stata to all the problems of Python installations and its packages.

First of all, Python does not come as part of the Stata installation. Stata depends on an already installed version of Python. That would definitely make Stata-Python code less portable. One solution might be a portable version of Python. Only time will tell what works best in such situations.

In this short post, I am going to outline a few basic steps to get started with Python from Stata. These steps are mentioned below:

1. What Version of Python to Install

A number of options are available for installing Python. Over the past 12 months, I have found installation via Anaconda to be the least problematic, and with Stata 16 this again proved true. The stand-alone version of Python did not work with Stata. Each time I typed python at the Stata command prompt, the error message generated by Stata was:

initialized          no

What I did was uninstall the other version of Python and keep only the Anaconda installation.

2. Set the Installation path

Stata can search for any available Python installation, including an installation through Anaconda. To search for and associate Python with Stata, I typed the following at the Stata command prompt:

python search 
set python_exec  D:\Anaconda\python.exe, permanently

The first line of code finds the directory path and the Python executable file. The second line sets which Python version to use. The option permanently saves this path for future use as well. And that’s all.

3. Using Python

Once the above steps complete without an error, we are ready to use Python. In the Stata command window, we enter the Python environment by typing python, and the familiar three greater-than symbols (>>>) will appear on the screen:

 . python
 --------- python (type end to exit) ------- 
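From there, ordinary Python statements run line by line until end returns control to Stata. A minimal session might look like this (the exact banner and version output depend on your installation):

```
. python
--------- python (type end to exit) -------
>>> import sys
>>> sys.version
>>> end
-------------------------------------------
```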


Reporting odds ratios and Chi2 with asdoc


Richard Makurumidze has asked: It seems asdoc does not work with the chi2 (chi-square) and OR (odds ratio) in logistic regression. Is this correct or am I making some error?

Richard is referring to the nest option of asdoc that creates nested regression tables. Without the nest option, asdoc produces detailed regression tables and exports odds ratios by default. However, with the nest option, users must explicitly declare that they are interested in odds ratios. This declaration is done using the eform option. In the following examples, I show how to get odds ratios with both the detailed and the nested regressions.

Reporting the odds ratios

We shall use the example data that is available on the Stata web server. The data can be downloaded by typing the following in the Stata command window.

webuse lbw, clear

Getting odds ratios in the detailed regression tables

 asdoc logistic low age lwt i.race smoke ptl ht ui, replace 

Getting odds ratios in the nested regression tables

 asdoc logistic low age lwt i.race smoke ptl ht ui, replace nest eform

This is how the output looks.

Reporting the Chi2

Richard’s second query is related to reporting the Chi2 test value. Since asdoc tries to find the r-squared values in regression commands, this value may not be available in some commands, as in the case of logistic regression. Users can add additional statistics to the regression table using the option add(). There is a detailed discussion of this option in the asdoc help file, which we can access by typing:

help asdoc

Below, I show how we can use this option to report the Chi2 test value. Please note that Stata regression commands leave behind several statistics in the e() macros, which we can report with asdoc.

 asdoc logistic low age lwt i.race smoke  ui, replace nest add(Chi2, `e(chi2)')

*Add another regression

asdoc logistic low age lwt i.race smoke ptl ht ui, nest add(Chi2, `e(chi2)')


Option add() has two elements: the text Chi2 and the macro `e(chi2)’. These two elements are separated by a comma. This is how option add() works: its inputs should be added in pairs, each element separated by a comma.


Reshape data in Stata – An easy-to-understand tutorial

Category: Blog, Uncategorized

From wide to long format

Suppose we have the data in the following format

 | id   sex   inc80   inc81   inc82   ue80   ue81   ue82 |
 |  1     0    5000    5500    6000      0      1      0 |
 |  2     1    2000    2200    3300      1      0      0 |
 |  3     0    3000    2000    1000      0      0      1 |

The above structure is known as the wide format. If we wish to convert it to a long format, such as the one given below,

 | id   year   sex    inc   ue |
 |  1     80     0   5000    0 |
 |  1     81     0   5500    1 |
 |  1     82     0   6000    0 |
 |  2     80     1   2000    1 |
 |  2     81     1   2200    0 |
 |  2     82     1   3300    0 |
 |  3     80     0   3000    0 |
 |  3     81     0   2000    0 |
 |  3     82     0   1000    1 |

We shall just type

reshape long inc ue, i(id) j(year)


Since we need to convert the data from a wide format to a long format, the command starts with reshape long. After that, we specify the stub names of the wide-format variables. In our dataset, there are two such stubs, inc and ue. Each wide variable name ends with a numeric part; that numeric part is what goes into the new variable specified in the option j(newvariable). Our dataset has no variable named year, but we wrote the option j(year) so that a new variable is created to hold the numeric suffixes 80, 81, and 82. We also specified the option i(id), where i needs an existing variable that uniquely identifies the panels.

To practice the above yourself, here is the source data and code.

reshape long inc ue, i(id) j(year)
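For completeness, the wide dataset from the table above can be entered directly with input, so the example is fully reproducible in a Do-file:

```stata
clear
input id sex inc80 inc81 inc82 ue80 ue81 ue82
1 0 5000 5500 6000 0 1 0
2 1 2000 2200 3300 1 0 0
3 0 3000 2000 1000 0 0 1
end
reshape long inc ue, i(id) j(year)
```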

Reshape long to wide

Continuing from the previous example, we can reshape the data back to wide format by

reshape wide inc ue, i(id) j(year)



Getting p-values and t-values with asreg

Category: Blog, Stata Programs

Xi asked the following question:

How can I get p-values and t-values using asreg program?

Introduction to asreg

asreg is a Stata program, written by Dr. Attaullah Shah. The program is available for free and can be downloaded from SSC by typing the following on the Stata command window:

ssc install asreg

asreg was primarily written for rolling/moving/sliding window regressions. However, with the passage of time, several useful ideas were conceived by its creator and users, and more features were added to the program.

Getting t-values after asreg

Consider the following example, which uses the grunfeld dataset from the Stata web server. The dataset has 20 companies and 20 years of data for each company. We shall estimate a regression model where the dependent variable is invest and the independent variables are mvalue and kstock, estimating the regression separately for each company.

In the following lines of code, the option se after the comma causes asreg to report the standard errors for each regression coefficient.

webuse grunfeld, clear
bys company: asreg invest mvalue kstock, se

asreg generates the regression coefficients, r-squared, adjusted r-squared, number of observations (_Nobs), and standard errors for each coefficient. This is enough information to produce t-values and p-values for the regression coefficients. The t-values can be generated by:

gen t_values_Cons = _b_cons / _se_cons
gen t_values_mvalue = _b_mvalue / _se_mvalue
gen t_values_kstock = _b_kstock / _se_kstock
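The division above is all there is to a t-statistic: the coefficient divided by its standard error. A quick check in Python with hypothetical numbers (not the grunfeld estimates):

```python
# Hypothetical coefficient and standard error
b_mvalue = 0.5
se_mvalue = 0.25

# t-statistic: coefficient divided by its standard error
t_mvalue = b_mvalue / se_mvalue   # 2.0
```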

Getting p-values after asreg

Getting p-values requires just one more step. We need one additional piece of information from the regression estimates: the degrees of freedom. This is usually the number of observations minus the number of parameters estimated. Since we have two independent variables and one constant, the number of parameters estimated is 3. asreg returns the number of observations in the variable _Nobs. Therefore, the term _Nobs - 3 in the following lines of code gives the degrees of freedom.

gen p_values_Cons = 2 * ttail(_Nobs-3, abs( _b_cons / _se_cons ))
gen p_values_mvalue = 2 * ttail(_Nobs-3, abs( _b_mvalue / _se_mvalue ))
gen p_values_kstock = 2 * ttail(_Nobs-3, abs( _b_kstock / _se_kstock ))

Verify the Results

Let’s estimate a regression for the first company and compare our estimates with those produced by the Stata regress command.

 reg invest mvalue kstock if company == 1
list _b_cons _se_cons p_values_Cons in 1

Full Code

 webuse grunfeld, clear
bys company: asreg invest mvalue kstock, se
gen t_values_Cons = _b_cons / _se_cons
gen t_values_mvalue = _b_mvalue / _se_mvalue
gen t_values_kstock = _b_kstock / _se_kstock
gen p_values_Cons = 2 * ttail(_Nobs-3, abs( _b_cons / _se_cons ))
gen p_values_mvalue = 2 * ttail(_Nobs-3, abs( _b_mvalue / _se_mvalue ))
gen p_values_kstock = 2 * ttail(_Nobs-3, abs( _b_kstock / _se_kstock ))
reg invest mvalue kstock if company == 1
list _b_cons _se_cons t_values_Cons p_values_Cons in 1


Ordering variables in a nested regression table of asdoc in Stata

Category: asdoc, Blog

In this blog entry, I shall highlight one important, yet less known, feature of the option keep() in nested regression tables of asdoc. If you have not used asdoc previously, this half-page introduction will put you on the fast track. And for a quick start on regression tables with asdoc, you can also watch this YouTube video.


Option keep()

There are almost a dozen options for controlling the output of a regression table in asdoc. One of them is the option keep(list of variable names). This option is primarily used for reporting coefficients of only the desired variables. However, it can also be used to change the order of the variables in the output table. I explore both uses with relevant examples below.


1. Changing the order of variables

Suppose we want to report our regression variables in a specific order. We shall use option keep() and list the variable names in the desired order inside its brackets. It is important to note that we have to list all variables that we want to report, as omitting any variable from the list will cause asdoc to omit it from the output table.


An example

Let us use the auto dataset from the system folder and estimate two regressions. As with any other Stata command, we add asdoc to the beginning of the command line. We shall nest these regressions in one table, hence we need the option nest. Also, we use the option replace in the first regression to replace any existing output file in the current directory. Let’s say we want the variables to appear in the output file in this order: _cons trunk weight turn. Therefore, the variables are listed in this order inside the keep() option. The code and output file are shown below.

sysuse auto, clear
asdoc reg mpg turn, nest replace
asdoc reg mpg turn weight trunk, nest keep(_cons trunk weight turn)



2. Reporting only needed variables

Option keep() is also used for reporting only the needed variables; for example, we might not be interested in reporting coefficients of year or industry dummies. In such cases, we list the desired variable names inside the brackets of the keep() option. In the above example, if we wish to report only _cons trunk weight, we just skip the variable turn from the keep() option. Again, the variables will be reported in the order in which they are listed inside the keep() option.

sysuse auto, clear
asdoc reg mpg turn, nest replace
asdoc reg mpg turn weight trunk, nest keep(_cons trunk weight)



Of course, we could also have used the option drop(turn) instead of keep(_cons trunk weight) to drop the variable turn from the output table.