This post offers a brief tutorial on filling missing values in Stata variables. The tutorial uses the ‘fillmissing’ program, which you can install by entering the following command in the Stata command window:

ssc install fillmissing, replace

Important Note: This post does not imply that filling missing values is theoretically justified. Users should exercise their judgment and adhere to appropriate theoretical principles when addressing missing values.

Once you have installed the fillmissing program, you can use it to fill missing values in both numeric and string variables. Additionally, the fillmissing program supports the bysort prefix for filling missing values within specific groups. In the following sections, we’ll explore several examples of using the bysort prefix for group-based calculations. However, before diving into these examples, let’s take a quick look at the various options available within the program.

fillmissing: Program Options

The fillmissing program provides a range of options for addressing missing values:

  1. with(any)
  2. with(previous)
  3. with(next)
  4. with(first)
  5. with(last)
  6. with(mean)
  7. with(max)
  8. with(min)
  9. with(median)

Let us quickly walk through these options, keeping in mind that options numbered from 6 onwards are designed for numerical variables.
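As a minimal sketch of the numeric-only options, the following assumes that with(mean) replaces each missing value with the mean of the non-missing values, as the option name suggests. The variable name price and its values are hypothetical and serve only as an illustration.

```stata
clear all
set obs 5

* Hypothetical numeric variable with one gap
gen price = _n * 10
replace price = . in 3

* The non-missing values are 10, 20, 40 and 50, so their mean is 30
summarize price

* Fill the gap using the numeric-only with(mean) option
fillmissing price, with(mean)
list price
```

The same pattern should apply to with(max), with(min), and with(median); none of these are meaningful for string variables.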

1. with(any)

The with() option serves to designate the source from which missing values will be populated. Specifically, with(any) is an optional choice, and if not explicitly specified, the fillmissing program will automatically default to it. This option is particularly useful when addressing missing values in a constant variable, where most values are identical, yet some are missing. with(any) aims to replace these missing values by drawing upon any available non-missing values within the same variable.

Example 1: Fill missing values with(any)

First, let’s create a sample dataset consisting of a single variable with ten observations. You can easily generate this dataset by copying and pasting the following code into the Stata Do editor:

clear all

set obs 10

gen symbol = "AABS"

replace symbol = "" in 5

replace symbol = "" in 8

The above dataset has missing values in rows 5 and 8. To fill the missing values from any other available non-missing values, let us use the with(any) option.

fillmissing symbol, with(any)

Since with(any) is the default option of the program, we could also write the above code as:

fillmissing symbol
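To confirm that the fill worked, one can count the missing values before and after running the command. The sketch below rebuilds the Example 1 dataset and uses Stata's built-in count and assert commands; it is only a suggested check, not part of the fillmissing program.

```stata
clear all
set obs 10
gen symbol = "AABS"
replace symbol = "" in 5
replace symbol = "" in 8

* Two observations are missing before filling
count if missing(symbol)

fillmissing symbol, with(any)

* assert stops with an error if any missing value remains
assert !missing(symbol)
```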

2. with(previous)

The with(previous) option replaces the current missing value with the preceding value of the same variable. Note that if the previous value is also missing, the current value will remain missing. This option does not sort the data, so fillmissing relies on the existing data order to identify the current and previous observations. To learn more about the sort order of data in Stata, see the article Stata Tip 28: Precise control of Data Sort Order.

Example 2: Fill missing values with(previous)

Let’s create a dummy dataset first.

clear all

set obs 10

gen symbol = "AABS"

replace symbol = "AKBL" in 1

replace symbol = "" in 2

The dataset looks like this:

+--------+
| symbol |
+--------+
| AKBL   |
|        |
| AABS   |
| AABS   |
| AABS   |
| AABS   |
| AABS   |
| AABS   |
| AABS   |
| AABS   |
+--------+

To fill the missing value in observation number 2 with ‘AKBL’ from the previous observation, simply type:

fillmissing symbol, with(previous)
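Because with(previous) relies on the existing data order, it can matter whether the data are sorted before filling. The sketch below uses a small hypothetical panel (the variables id and year and their values are made up for illustration) entered out of chronological order. Without sorting, the observation just before the gap is the 2003 row; after sorting by id and year, it is the 2001 row.

```stata
clear all

* Hypothetical data, deliberately entered out of order
input id year str4 symbol
1 2001 "AKBL"
1 2003 "AABS"
1 2002 ""
end

* Put the rows in chronological order so that "previous" means the prior year
sort id year

fillmissing symbol, with(previous)
list, sepby(id)
```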

 

Filling missing values within groups: by() or bysort options

To fill missing values within groups, let’s use the nlswork dataset from the web. This dataset has missing values in several variables. To create a report of missing values for a given group, let’s use the missings program by Nick Cox, a valuable tool for reporting missing data in variables. Users can employ ‘asdocx’ to export the report to Word, Excel, or LaTeX formats.

webuse nlswork, clear

* Create a report of missing values for all variables if c_city variable is 1

asdocx missings report if c_city == 1
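If asdocx is not installed, the same report can be shown in the Results window by running missings on its own. This sketch assumes missings is installed from SSC, which is one common way to obtain it.

```stata
* Install the missings program by Nick Cox (assumed here to come from SSC)
ssc install missings, replace

webuse nlswork, clear

* Report missing values for all variables where c_city is 1, without exporting
missings report if c_city == 1
```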

Table: Results

variable    missings
age         8
msp         10
nev_mar     10
grade       1
ind_code    121
occ_code    58
union       3634
wks_ue      1931
tenure      164
hours       24
wks_work    269

The above table shows that there are 8 missing values in the age variable where c_city is 1. In total, however, the age variable has 24 missing values. To confirm:

count if missing(age)

24

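To fill the missing values with available non-missing values within each group of the c_city variable, the code would be as follows. Since no with() option is specified, the default with(any) applies; add with(previous) to fill only from the preceding observation instead.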

bys c_city : fillmissing age

* (24 real changes made)
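To check the result, one can count the remaining missing values after the fill. The sketch below also shows, for comparison only, a plain-Stata idiom that is not part of fillmissing. It relies on the fact that numeric missing values sort last, so after sorting by age within c_city the first observation in each group holds a non-missing age whenever the group has one. That idiom fills with the smallest age in the group and changes the sort order of the data.

```stata
webuse nlswork, clear

* Group-wise fill with the default with(any)
bysort c_city: fillmissing age

* Count any missing values of age that remain
count if missing(age)

* Plain-Stata comparison: reload the data and fill from the first
* (smallest, non-missing) age within each c_city group
webuse nlswork, clear
bysort c_city (age): replace age = age[1] if missing(age)
count if missing(age)
```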
