# Author Archives: Attaullah Shah

## Using the bysort prefix with tab commands in asdoc | Stata

Category:Uncategorized

Recently, Scott Siegal asked for the possibility of adding the bysort prefix with tabulate, tab, tab1, and tab2 commands to asdoc. Honoring his request, I have added the bysort support to asdoc.

The new version of asdoc can be installed from my site. Copy and paste the following line in Stata and press enter.

```net install asdoc, from(http://fintechprofessor.com) replace
```

Please note that the above line has to be copied in full. After installation of the new version, then restart Stata.

Once installed, we can use asdoc by adding the word asdoc to any Stata Command. Let us start with the basic tabulation commands

## asdoc with the tab command

The tab command of Stata is used to create a table of frequencies. When adding one variable with tab command, it creates a one-way tabulation. To export the results to MS Word, just add asdoc to the beginning of the command. Let us use the nslw88.dta dataset from the Stata installation.

```sysuse nlsw88, clear

*Tabule the race variable
asdoc tab race```

The above code generates the following table.

## Tab with bysort variable

The bysort prefix can be used to repeat the one-way tabulation across groups of another variable. So if you care about repeating the tabulation of race variable across the categories of the married variable, we shall write the code as:

`bys married: asdoc tab race , replace`

## Two-way tables

Two-way frequency tables are created by adding two variables after the tab word. Let’s create a two-way table of race and smsa

## Using bysort with two-way tables

The bysort prefix can be used to repeat the two-way tabulations across groups of another variable. So if we care about repeating the tabulation of race and smsa variables across the categories of the married variable, we shall write the code as:

`bys married: asdoc tab race smsa , replace`

## What is a Python Dictionary

Category:Blog
Python Dictionary

# Python Dictionary¶

Dictionary is a method in which data is stored in pairs of keys and values. These are also called Associative Arrays in other programming languages.

### What is key-value pair?¶

key is a unique identifier for a given record. Values are data stored in that identifier. For example, Let us say that Muneer is a student, and we want to create a dictionary containing his details. The first key in his record is name and the value for this key is ‘Muneer’. He has a weight of 75 Kg, therefore, the second key in this record is weight, and the value of this key is 75. His height is 6ft, and has age of 35 years. In this record, following are the key value pairs

```  +-----------------+
|   keys   values |
|-----------------|
|   name   Muneer |
| weight       75 |
| height        6 |
|    age       35 |
+-----------------+

```

### How to create a dictionary¶

A dictionary is created using curly brackets. The first item is always the key followed by a full colon, the second item is the value. Next key-value pair is created using a comma.

In [2]:
```student = {'name': 'Muneer', 'weight': 75, 'height': 6, 'age': 35}
```
In [3]:
```student
```
Out[3]:
`{'name': 'Muneer', 'weight': 75, 'height': 6, 'age': 35}`
In [13]:
```student.get('name')
```
Out[13]:
`'Muneer'`
In [14]:
```student.get('age')
```
Out[14]:
`35`
In [ ]:
```
```

Category:Blog

1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv

# TED Talk Dataset Excercises¶

In [5]:
```# Change directory
```
In [6]:
```cd "D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment"
```
```D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment
```
In [7]:
```import pandas as pd
```
In [8]:
```ted = pd.read_csv('ted.csv')
```

## 1: Explore the Data attributes¶

In [11]:
```ted.dtypes
```
Out[11]:
```comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object```
In [12]:
```ted.info()
```
```<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB
```
In [13]:
```ted.shape
```
Out[13]:
`(2550, 17)`

## 2. Which talk has the highest comments¶

In [77]:
```ted.sort_values('comments')[['comments', 'duration','main_speaker']].tail()
```
Out[77]:
1787 2673 1117 David Chalmers
201 2877 1099 Jill Bolte Taylor
644 3356 1386 Sam Harris
0 4553 1164 Ken Robinson
96 6404 1750 Richard Dawkins

# 3 Find top 5 talks that have the highest views to comments ratio¶

In [16]:
```ted['view_to_comment'] = ted['views'] / ted['comments']
```
In [17]:
```ted['view_to_comment'].tail()
```
Out[17]:
```2545    26495.882353
2546    69578.333333
2547    37564.700000
2548    13103.406250
2549    48965.125000
Name: view_to_comment, dtype: float64```

# 4 . Create a histogram of comments¶

In [19]:
```import matplotlib.pyplot as plot
```
Out[19]:
`<matplotlib.axes._subplots.AxesSubplot at 0x14a233e3978>`

# 5. Create histogram of comments where comments are less than 1000¶

In [35]:
```# Get index of those row which have less than 1000 comments
```
In [38]:
```# Get only the comments column from these filtered rows
```
In [39]:
```# Make a plot of these filtered comments
com1000.plot(kind = 'hist')
```
Out[39]:
`<matplotlib.axes._subplots.AxesSubplot at 0x14a236dc7b8>`
In [40]:
```# When you expert, you can do the above just in one line
```
Out[40]:
`<matplotlib.axes._subplots.AxesSubplot at 0x14a2375cac8>`
In [44]:
```# How many rows were excluded from the above graph
```
Out[44]:
`(32, 18)`

# 6. Do the same as in 5, but using a query method¶

In [68]:
```# Filter the whole dataset where comments are less than 1000
```
In [69]:
```# Get only the comments column from the reduced dataset
```
In [70]:
```# Plot the filtered comments
comment1000.plot(kind = 'hist')
```
Out[70]:
`<matplotlib.axes._subplots.AxesSubplot at 0x14a238fb630>`

# 7. How to add more bins to the histogram¶

In [71]:
```comment1000.plot(kind = 'hist', bins = 20)
```
Out[71]:
`<matplotlib.axes._subplots.AxesSubplot at 0x14a23953278>`

# 8. Make a box plot and identify outliers¶

In [73]:
```comment1000.plot(kind = 'box')
```
Out[73]:
`<matplotlib.axes._subplots.AxesSubplot at 0x14a23a4ba20>`

The black dots show outliers

In [ ]:
```
```

## asrol Cumulative Returns Over 12-months, skipping the Current month

Category:Uncategorized

Anomaly Hunters has asked: How Can I find cumulative returns from month T-1 to T-12, i.e. not including the current month in the calculation. I want to do it on a rolling window for each firm.

Let us use a small dataset. We shall create this dataset in Stata with the following code

```clear
input byte firm_id str6 date float ret
1 "200310"         .
1 "200311"  -.049505
1 "200312"  .0121528
1 "200401"  .0360206
1 "200402" -.0480132
1 "200403"  .0434783
1 "200404"  .0933333
1 "200405"  .0121951
1 "200406"  .0060241
1 "200407"  .0883234
1 "200408" -.0261348
1 "200409" -.0819209
1 "200410"      -.02
1 "200411" -.0062794
1 "200412"  .0458136
1 "200501"  .0891239
1 "200502"  .0443828
1 "200503"
end```

* Convert the date to Stata format
Code:

```tostring date, replace
gen month = mofd(date(date, "YM"))
format month %tm```

* Find cumulative returns in a 12-month rolling window
Code:

`bys firm_id : asrol ret, stat(product) add(1) window(month 12)`

* Lag the returns by one month to skip the current month returns from the calculation
Code:

```tsset firm_id month
gen ret_skip1 = L.product12_ret
list month ret product12_ret ret_skip1```

*And appreciate the results

Code:

```     +---------------------------------------------+
|   month        ret   product1~t   ret_skip1 |
|---------------------------------------------|
1. | 2003m10          .            .           . |
2. | 2003m11   -.049505     -.049505           . |
3. | 2003m12    .012153   -.03795382    -.049505 |
4. |  2004m1    .036021   -.00330034   -.0379538 |
5. |  2004m2   -.048013   -.05115508   -.0033003 |
|---------------------------------------------|
6. |  2004m3    .043478   -.00990092   -.0511551 |
7. |  2004m4    .093333    .08250829   -.0099009 |
8. |  2004m5    .012195    .09570959    .0825083 |
9. |  2004m6    .006024    .10231025    .0957096 |
10. |  2004m7    .088323    .19967004    .1023103 |
|---------------------------------------------|
11. |  2004m8   -.026135     .1683169      .19967 |
12. |  2004m9   -.081921    .07260733    .1683169 |
13. | 2004m10       -.02    .05115519    .0726073 |
14. | 2004m11   -.006279     .0989585    .0511552 |
15. | 2004m12    .045814    .13550617    .0989585 |
|---------------------------------------------|
16. |  2005m1    .089124     .1937088    .1355062 |
17. |  2005m2    .044383    .30956536    .1937088 |
18. |  2005m3   -.005312     .2483334    .3095654 |
+---------------------------------------------+```

## Step-by-Step: Portfolio Risk in Stata and Excel

Category:Uncategorized

## Portfolio Risk in Excel

To build our concept of the portfolio risk, we shall calculate it first manually in EXCEL, then we shall replicate the results using matrix notations in Stata.
Consider the following set of returns for two assets, i.e asset A and B.

```  A        B
.249917 .819483
.739069 .821416
.895491 .276843
.902722 .001586
.078344 .714815
.429804 .027261
.239556 .736011

```

As we know, the portfolio standard deviation under the modern portfolio theory is calculated as =

Thus, we need the standard deviation of the two assets, the proportion of investment in each asset (weights), and the covariance term between the two assets. In Excel, we can find the standard deviation by

``=STDEV(range)``

And the formula for covariance is

``=COVARIANCE.S(range1, range2)``

`Thus, standard deviation of asset A = 0.3387         `

the standard deviation of asset B = 0.3713

Covariance between the two = -0.0683

And if we invest equally in both assets, then

the weight of A = 0.5

and weight of B = 0.5

```Portfolio SD =( (0.5^2*0.3387^2)+(0.5^2*0.3713^2)+(2*0.5*0.5*-0.0683))^0.5

=(0.02897)^0.5
=0.1702       ```

## Portfolio Risk in Stata

Finding portfolio standard deviation under the Modern Portfolio theory using matrix algebra requires three matrices
1. Weights of the assets in the portfolio, in row format = W
2. Variance-Covariance matrix of assets returns = S
3. Weights of the assets in the portfolio, in column format = W’

`Portfolio SD = W * S * W'`

NOTE: In order to find the variance-covariance matrix, you can install varrets program from ssc with:

`` ssc install mvport` `

### Step 1 : Copy the example data to stata

You can do it either by copying the data from the excel file and pasting it to the stata editor. Alternatively, you can copy the following and past it in the stata command line, or download the data file from here.

``` `input a b`
`.249917 .819483`
`.739069 .821416`
`.895491 .276843`
`.902722 .001586`
`.078344 .714815`
`.429804 .027261`
`.239556 .736011`
`end` ```

### Step 2: Make variance-covariance matrix

``varrets a bmat S = r(cov)` `

### Step 3: Make a weight matrix

Assuming that we assign equal weights, we define matrix W

``mat W = (0.5, 0.5)` `

### Step 4 : Multiply the weight and variance-covariance matrix

``mat VAR = W * S * W'` `

### Step 5: Show variance of the portfolio

``mat list VAR` `

## How Fama and French June to July Portfolios are Constructed?

Category:Asset Pricing Research,Blog

The description of portfolios’ construction given in various Fama and Fench papers is usually confusing for many researchers, especially those who are new to asset pricing models. The typical language used in Fama and French papers reads like this

The size breakpoint for year t is the median NYSE market equity at the end of June of year t. BE/ME for June of year t is the book equity for the last fiscal year end in t-1 divided by ME for December of t-1.

This blog post aims at explaining the above paragraph with some examples.

## Break-points for Portfolio Construction

### The size-breakpoints

As mentioned in the above paragraph, the size-breakpoints are based on the market capitalization of firms at the end of June of the current year. This means while making two groups of firms:

1. First, we need to reduce the data to keep the market capitalization of each firm at the end of June.
2. Also, we need to further reduce the data to keep only firms listed at the NYSE stock exchange.

### The BE/ME-breakpoints

The BE/ME variable uses lagged values of the book equity and market equity. However, the way the lagged values are obtained for both the variables differs from one another. The book equity is the last fiscal year’s available book equity. Since the assumption is that the financial year ends in June, therefore, the last June’s book equity is called `book equity for the last fiscal year end in t-1 `

Consider the following monthly data where we have observations for a single firm over three years period. The variable year represents the calendar year that starts in January and ends in December. The variable fyear represents the fiscal year, that starts in July and ends in June.

From these observations, we need ME in December of yeat t-1. In our dataset, the first December appears in the calendar year 2016. The ME on that date is 958. For the calendar year 2006, the corresponding BE value for the fiscal year is 467, that is the `book equity for the last fiscal year end in t-1 `
We are able to calculate the BE/ME ratio in June 2017 as = 467 / 958. This value will be used for finding the breakpoints and making the three BE/ME portfolios, which are then held from July of year t to June of year t+1, as shown in the following snapshot.

### The Yearly Portfolios

The portfolios for July of year t to June of t+1 include all NYSE, AMEX, and NASDAQ stocks for which we have market equity data for December of t-1 and June of t, and (positive) book equity data for t-1.

#### Interpretation

Once the breakpoints for size and

### How to Do it Programatically?

There are more than a dozen steps to fully implement the Fama and French model. Entry-level researchers might try to do all these steps in MS Excel. However, doing these steps in Excel is not only cumbersome but also prone to errors. Further, the process is manual, therefore, it cannot be easily replicated.

We have developed codes in Stata to construct the three factors of the Fama and French model as well as the 25 RHS (right-hand side) portfolios. Our codes generate factors that have over 97% correlation with the Fama and French factors.

### Why buy codes for Fama and French Model?

There are several reasons that you should consider using the codes of a professional. These reasons include but are not limited to the accuracy of the code, quick learning, replicability of the codes in the same project or other projects, and validation of your own code if you have written a code yourself.

## Pricing Options

### Bronze

\$99
• Source Code
• Email Support

### Gold

Most Popular

\$199
• Source Code
• Data Handling
• Email Support

### Silver

\$149
• Source Code
• Example Dataset
• Email Support

## fillmissing: Fill Missing Values in Stata

Category:Blog

This post presents a quick tutorial on how to fill missing values in variables in Stata. This tutorial uses fillmissing program which can be downloaded by typing the following command in Stata command window

`ssc install fillmissing, replace`

Important Note: This post does not imply that filling missing values is justified by theory. Users should make their own decisions and follow appropriate theory while filling missing values.

After the installation of the fillmissing program, we can use it to fill missing values in numeric as well as string variables. Also, this program allows the `bysort` prefix to fill missing values by groups. We shall see several examples of using `bysort` prefix to perform by-groups calculations. But let us first quickly go through the different options of the program.

## Program Options

The fillmissing program offers the following options to fill missing values

1. with(any)
2. with(previous)
3. with(next)
4. with(first)
5. with(last)
6. with(mean)
7. with(max)
8. with(min)
9. with(median)

Let us quickly go through these options. Please note that options starting from serial number 6 are applicable only in the case of numerical variables.

### 1. with(any)

Option `with()` is used to specify the source from where the missing values will be filled. Option `with(any)` is an optional option and hence if not specified, will automatically be invoked by the fillmissing program. This option is best to fill missing values of a constant variable, i.e. a variable that has all similar values, however, due to some reason, some of the values are missing. Option `with(any)` will try to fill the missing values from any available non-missing values of the given variable.

#### Example 1: Fill missing values with(any)

Let us first create a sample dataset of one variable having 10 observations. You can copy-paste the following code to Stata Do editor to generate the dataset

```clear all
set obs 10
gen symbol = "AABS"
replace symbol = "" in 5
replace symbol = "" in 8
```

The above dataset has missing values on row 5 and 8. To fill the missing values from any other available non-missing values, let us use the `with(any)` option.

`fillmissing symbol, with(any)`

Since `with(any)` is the default option of the program, we could also write the above code as

`fillmissing symbol`

### 2. with(previous)

Option `with(previous)` is used to fill the current missing value with the preceding or previous value of the same variable. Please note that if the previous value is also missing, the current value will remain missing. Further, this option does not sort the data, so whatever the current sort of the data is, fillmissing will use that sort and identify the current and previous observation.

#### Example 2: Fill missing values with(previous)

Let’s create a dummy dataset first.

```clear all
set obs 10
gen symbol = "AABS"
replace symbol = "AKBL" in 1
replace symbol = "" in 2 ```

The dataset looks like this

```
+--------+
| symbol |
+--------+
|   AKBL |
|        |
|   AABS |
|   AABS |
|   AABS |
|   AABS |
|   AABS |
|   AABS |
|   AABS |
|   AABS |
+--------+ ```

To fill the missing value in observation number 2 with AKBL, i.e. from previous observation, we would type:

`fillmissing symbol, with(previous)`

### What’s Next

In the next blog post, I shall talk about other options of the fillmissing program. Specifically, I shall discuss the use of `by` and `bys` with fillmissing program. Therefore, you may visit the blog section of this site or subscribe to updates from this site.

## Export output of Table command from Stata to Word using asdoc

Category:asdoc,Blog

Exporting tables from table command was the most challenging part in asdoc programming. Nevertheless, asdoc does a pretty good job in exporting table from table command. asdoc accepts almost all options with table command, except cellwidth(#), stubwidth(#), and csepwidth(#).

## 7.1 One-way table

### Example 54 : One-way table; frequencies shown by default

`sysuse auto, clearasdoc table rep78, title(Table of Freq. for Repairs) replace`

### Example 55 : One-way table; show count of non-missing observations for mpg}

`asdoc table rep78, contents(n mpg) replace`

### Example 56 : One-way table; multiple statistics on mpg requested

`asdoc table rep78, c(n mpg mean mpg sd mpg median mpg) replace`

### Example 57 : Add formatting – 2 decimals

`asdoc table rep78, c(n mpg mean mpg sd mpg median mpg) dec(2) replace`

## 7.2 Two-way table

### Example 58 : Two-way table; frequencies shown by default

`asdoc table rep78 foreign, replace`

### Example 59 : Two-way table; show means of mpg for each cell

`asdoc table rep78 foreign, c(mean mpg) replace`

### Example 60 : Add formatting

`asdoc table rep78 foreign, c(mean mpg) dec(2) center replace`

### Example 61 : Add row and column totals

`asdoc table rep78 foreign, c(mean mpg) dec(2) center row col replace`

## 7.3 Three-way table

### Example 62 : Three-way table

`webuse byssin, clearasdoc table workplace smokes race [fw=pop], c(mean prob) replace`

## 7.4 Four-way table

### Example 65 : Four-way table with by()

```webuse byssin1, clear
asdoc table workplace smokes race [fw=pop], by(sex) c(mean prob) replace```

### Example 66 : Four-way table with supercolumn, row, and column totals

`asdoc table workplace smokes race [fw=pop], by(sex) c(mean prob) sc col row replace`

## Customized tables using option row() of asdoc – Stata

Category:asdoc,Stata Programs

This is rather a quick example of how to use option row() of asdoc for creating highly customized tables. We are interested in a table that is given bellow.

```* Load example dataset
sysuse auto,clear

*Write the header row of the table with table title
asdoc, row(Dependent variable:domestic or foreign, Domestic mean/frequency, Domestic SD, Foreign mean/frequency, Foreign SD, t-test) title(Summary staticis) save(myfile) replace

asdoc, row( Model independent variables, \i, \i, \i, \i, \i) append

*Use a loop over each variable that include price, mpg, ...
foreach var of varlist price mpg rep78 headroom trunk weight length turn{

*First summarize each variable for a given sample, that is if foregin is   zero
qui sum `var' if foreign==0

*Obtain the mean divided by frequency
local mf=`r(mean)'/`r(N)'

*Store the mf and standard deviation variable in accum macro
asdoc, accum(`mf', `r(sd)')

*now repeat the same for the second sample, ie. when foreign is 1
qui sum `var' if foreign==1
local mf=`r(mean)'/`r(N)'
asdoc, accum(`mf', `r(sd)')

*Conduct a two-sample ttest using foreign as a grouping variable
ttest `var', by(foreign)

*Obtain the t-statistics
local t : di %9.3f = abs(`r(t)')

*Create significance stars
if `r(p)'<=0.01 {
local star "***"
}
else if `r(p)'<=0.05{
local star "**"
}
else if `r(p)'<=0.1{
local star "*"
}
else {
local star " "
}
local tstar `t'`star'

*Add the t-value and stars to the accum macro
asdoc, accum(`tstar')

* Finally write this complete row where we first write the variable name
*and then all accumulated variables that are present in \$accum macro.
asdoc, row(`var', \$accum)
}
```

## tabstat with asdoc in Stata

Category:asdoc,Uncategorized

asdoc makes some elegant tables when used with tabstat command. There are several custom-made routines in asdoc that creates clean tables from tabstat command. asdoc fully supports the command structure and options of tabstat. And, yes asdoc allows one additional statistics, that is, t-statistics alongside the allowed statistics in tabstat. For reporting purposes, asdoc categorizes tabstat commands in two groups:

(1) stats without a grouping variable

(2) stats over a grouping variable.

## Tabstat Without-by

If statistics are less than variables, the table is transposed, i.e. statistics are shown in columns, while variables are shown in rows

### Example 49 : One variable, many stats, including t-statistics

```sysuse auto, clear
asdoc tabstat price , stat(min max mean sd median p1 p99 tstat) replace ```

### Example 50 : Many variables, one statistic

`asdoc tabstat price mpg rep78 headroom trunk weight length foreign , stat( mean) replace`

### Example 51 : Many variables, many statistics

```asdoc tabstat price mpg rep78 headroom trunk weight length foreign , ///
stat( max mean sd median p1 p99 tstat) replace```

## Tabstat with-by

### Example 52 :

`bysort foreign: asdoc tabstat price mpg rep78 headroom trunk weight length, stat(mean) replace`

OR

```asdoc tabstat price mpg rep78 headroom trunk weight length, ///
stat(mean) by(foreign) replace```

### Example 53 : By with many variables and many statistics

```bysort foreign: asdoc tabstat price mpg rep78 headroom trunk weight length, ///
stat(mean sd p1 p99 tstat) replace```