fillmissing: Fill Missing Values in Stata

This post offers a brief tutorial on filling missing values in Stata variables. The tutorial uses the fillmissing program, which you can install by entering the following command in the Stata command window:

[code]ssc install fillmissing, replace[/code]
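
In a do-file that will be run more than once, the installation can be made conditional. The following is a minimal sketch using Stata's built-in which command, which sets a non-zero return code when a program cannot be found. It is a convenience pattern, not something the fillmissing package requires.

[code]* Install fillmissing only if Stata cannot already find it
capture which fillmissing
if _rc != 0 {
    ssc install fillmissing, replace
}[/code]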

Important Note: This post does not imply that filling missing values is theoretically justified. Users should exercise their judgment and adhere to appropriate theoretical principles when addressing missing values.

Once you have installed the fillmissing program, you can use it to fill missing values in both numeric and string variables. The program also supports the bysort prefix for filling missing values within specific groups. In the following sections, we’ll explore several examples of using the bysort prefix for group-based calculations. Before diving into these examples, let’s take a quick look at the options available within the program.

fillmissing: Program Options

The fillmissing program provides a range of options for addressing missing values:

1. with(any)
2. with(previous)
3. with(next)
4. with(first)
5. with(last)
6. with(mean)
7. with(max)
8. with(min)
9. with(median)

Let us quickly walk through these options, keeping in mind that the last four in the list, that is, options 6 to 9 (with(mean), with(max), with(min), and with(median)), are designed for numeric variables.
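
As a quick preview of the numeric options, here is a minimal sketch on a made-up variable. The variable names x, x_mean, and x_median are illustrative and not part of the program. The non-missing values 1, 2, 4, 5, and 6 have a mean of 3.6 and a median of 4, so those are the values one would expect to see in observation 3 after each call.

[code]clear all
set obs 6

* x takes the values 1 to 6, with observation 3 set to missing
gen x = _n
replace x = . in 3

* Work on copies so that the original variable stays untouched
gen x_mean = x
gen x_median = x

fillmissing x_mean, with(mean)
fillmissing x_median, with(median)

list x x_mean x_median[/code]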

1. with(any)

The with() option specifies the source from which missing values are filled. with(any) is the default, so fillmissing uses it whenever with() is not specified. This option is particularly useful for a variable that is meant to be constant, where the non-missing values are identical but some observations are missing. with(any) fills these missing values with any available non-missing value of the same variable.

Example 1: Fill missing values `with(any)`

First, let’s create a sample dataset consisting of a single variable with ten observations. You can easily generate this dataset by copying and pasting the following code into the Stata Do editor:

[code]clear all
set obs 10
gen symbol = "AABS"
replace symbol = "" in 5
replace symbol = "" in 8[/code]

The above dataset has missing values in rows 5 and 8. To fill them from any other available non-missing value, let us use the with(any) option.

[code]fillmissing symbol, with(any)[/code]

Since with(any) is the default option of the program, we could also write the above code as:

[code]fillmissing symbol[/code]
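
Because with(any) copies whatever non-missing value it finds, it makes sense to confirm first that the variable really is constant. The sketch below is one way to do that with official Stata commands: tabulate ignores missing values by default and stores the number of distinct values in r(r). The assert lines are hypothetical safeguards added for illustration; they are not part of fillmissing.

[code]clear all
set obs 10
gen symbol = "AABS"
replace symbol = "" in 5
replace symbol = "" in 8

* Stop with an error unless there is exactly one distinct non-missing value
quietly tabulate symbol
assert r(r) == 1

fillmissing symbol, with(any)

* After the fill, no observation should be missing
assert !missing(symbol)[/code]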

2. with(previous)

The with(previous) option replaces a missing value with the value from the preceding observation of the same variable. If the previous value is also missing, the current value remains missing. This option does not sort the data, so fillmissing uses the existing data order to identify the current and previous observations. To learn more about the sort order of data in Stata, read the article Stata Tip 28: Precise control of Data Sort Order.

Example 2: Fill missing values `with(previous)`

Let’s create a dummy dataset first.

[code]clear all
set obs 10
gen symbol = "AABS"
replace symbol = "AKBL" in 1
replace symbol = "" in 2[/code]

The first three observations of the dataset look like this:

[code]
  +--------+
  | symbol |
  |--------|
  | AKBL   |
  |        |
  | AABS   |
  +--------+
[/code]

To fill the missing value in observation number 2 with AKBL from the previous observation, simply type:

[code]fillmissing symbol, with(previous)[/code]
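
To make the "previous value only" rule concrete, the sketch below mimics the behaviour described above with official Stata commands. It is an illustration of the rule, not the program's actual implementation, and prev_symbol is a hypothetical helper variable. Copying the lagged value into a separate variable first ensures that a missing value is filled only from an observation that was non-missing in the original data.

[code]clear all
set obs 10
gen symbol = "AABS"
replace symbol = "AKBL" in 1
replace symbol = "" in 2

* Store the value of the preceding observation before changing anything
gen prev_symbol = symbol[_n-1]

* Fill only from that stored copy, so the fill cannot cascade
replace symbol = prev_symbol if missing(symbol)
drop prev_symbol

* Observations 1 to 3 should now read AKBL, AKBL, AABS
list symbol in 1/3[/code]

By contrast, running replace symbol = symbol[_n-1] if missing(symbol) on its own would carry a value forward through a run of consecutive missing values, because replace works through the observations in order and sees the values it has just filled in.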

Filling missing values within groups: by() or bysort options

To fill missing values within groups, let’s use the nlswork dataset from the web. This dataset has missing values in various variables. To create a report of missing values, let’s use the missings program by Nick Cox, a valuable tool for reporting missing data in variables. Users can employ asdocx to export the report in Word, Excel, or LaTeX formats.

[code]webuse nlswork, clear

* Create a report of missing values for all variables if c_city variable is 1
asdocx missings report if c_city == 1[/code]

Table: Missing values by variable where c_city == 1

Variable	Missing values
age	8
msp	10
nev_mar	10
grade	1
ind_code	121
occ_code	58
union	3634
wks_ue	1931
tenure	164
hours	24
wks_work	269

The above table shows that there are 8 missing values in the age variable where c_city is 1. In total, however, the age variable has 24 missing values. To confirm:

[code]count if missing(age)

24[/code]
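
To see how those 24 missing values are spread across the groups, one option is a one-way tabulation restricted to observations where age is missing. This is a sketch using the official tabulate command; the missing option adds a row for observations where c_city itself is missing.

[code]webuse nlswork, clear

* Number of missing age values in each c_city group
tabulate c_city if missing(age), missing[/code]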

To fill these missing values with any available non-missing value within each group of the c_city variable (no with() option is given, so the default with(any) applies), the code would be:

[code]bys c_city : fillmissing age

* (24 real changes made)[/code]
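
Because with(previous) relies on the existing data order, a grouped fill on panel data usually needs the sort order spelled out. The sketch below assumes that fillmissing accepts the standard bysort syntax with a sort variable in parentheses; idcode and year are the panel and time identifiers in nlswork. It only demonstrates the mechanics of ordering within groups.

[code]webuse nlswork, clear

* Within each person (idcode), order the rows by year so that
* "previous" means the previous survey year for the same person
bysort idcode (year): fillmissing age, with(previous)

* Values that remain missing had no non-missing value directly before them
count if missing(age)[/code]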

Attaullah Shah | 2023-09-06T18:49:36+05:00 | December 20th, 2019 | Blog | 14 Comments

14 Comments

Syed Ahmad Gillani December 29, 2019 at 1:50 pm

Dear Dr., please check your email. I have asked about corporate governance data; the details are in my email.
Attaullah Shah December 30, 2019 at 11:48 pm

I did not receive your email. Please send it to attahshah15@hotmail.com
Dr. Hassan Raza January 29, 2020 at 2:50 pm

Dear sir, this command is not installing in Stata, please help: "net install fillmissing, from(http://fintechprofessor.com) replace"
- Attaullah Shah January 29, 2020 at 7:56 pm
  Dear Dr. Hassan Raza,
  I have converted the site to the https protocol; therefore, you may try this method:
  
  net install fillmissing, from(https://fintechprofessor.com) replace
James Kirkbride March 27, 2020 at 2:19 pm

Very useful command, thanks. Would be helpful to have a help file installed along with the package itself for future reference. I can also confirm this works with the “bysort” command (in my Stata 15), which is exactly what I needed it to be able to do.
Aco March 28, 2020 at 10:32 pm
Dear Dr., I have a question about using the fillmissing command in Stata.
Example:
```
by Product pair_id, sort: fillmissing tarrifs, with (mean)
```
This command uses the average of the group, but I would like to replace each missing value with the average of the previous value and the next value, staying within each group.
Example of the dataset with missing values:
```
Year	Rates	Country
2000	5	USA
2001	.	USA
2002	.	USA
2003	4	USA
2004	.	USA
2005	6	USA
2000	.	BRA
2001	4	BRA
2002	.	BRA
2003	.	BRA
2004	8	BRA
2005	.	BRA
```
The result I would like to arrive at using fillmissing:
```
Year	Rates	Country
2000	5	USA
2001	4.5	USA
2002	4.25	USA
2003	4	USA
2004	5	USA
2005	6	USA
2000	4	BRA
2001	4	BRA
2002	6	BRA
2003	7	BRA
2004	8	BRA
2005	8	BRA
```
I hope you can help me.
My best regards.
Attaullah Shah March 30, 2020 at 11:28 pm

I could not understand the requirements. The data you have posted and the fillmissing command that you have used do not match. Can you please clarify it a bit further on what to use for filling the missing values?
Aco April 5, 2020 at 8:04 pm
Dear,
I am sorry for the lack of clarity in the explanation.

The original dataset is a panel with more than 100 importing and 100 exporting countries, organized in pairs. The dependent variable is import flow and the independent variable is tariff.

The following database is similar to the original
```
year	rates	country
2000	.	BRA USA
2001	4	BRA USA
2002	.	BRA USA
2003	.	BRA USA
2004	8	BRA USA
2005	.	BRA USA
2000	5	USA BRA
2001	.	USA BRA
2002	.	USA BRA
2003	4	USA BRA
2004	.	USA BRA
2005	6	USA BRA
```
Command:
```
by country, sort: fillmissing rates, with (mean)
```
Result with the above command
```
year     rates	 country
2000	6	BRA USA
2001	4	BRA USA
2002	6	BRA USA
2003	6	BRA USA
2004	8	BRA USA
2005	6	BRA USA
2000	5	USA BRA
2001	5	USA BRA
2002	5	USA BRA
2003	4	USA BRA
2004	5	USA BRA
2005	6	USA BRA
```
This command uses the average of the group, but I would like to replace each missing value with the average of the previous value and the next value, staying within each group (BRA USA; USA BRA; and so on).

My expected result would be a dataset similar to the one below:
```
year	rates	country
2000	4	BRA USA
2001	4	BRA USA
2002	6	BRA USA
2003	7	BRA USA
2004	8	BRA USA
2005	8	BRA USA
2000	5	USA BRA
2001	4.5	USA BRA
2002	4.25	USA BRA
2003	4	USA BRA
2004	5	USA BRA
2005	6	USA BRA
```
The asdoc and fillmissing commands are very useful and help a lot in the job.
Excuse me for the inconvenience.
My best regards.
Attaullah Shah April 6, 2020 at 1:23 pm

I think it is an interesting problem and will need recursive loops. I have added this option to fillmissing now.
Aristino Djeuf July 11, 2020 at 9:32 am

Hello Dr. Attaullah Shah,
I want to use the fillmissing program with the with(mean) option to solve missing value problems in panel data.
Tia August 27, 2020 at 10:30 pm

Thank you for this – it was really helpful!
Muhammad Azlaan August 26, 2022 at 4:37 pm

I am new to Stata and want to run inter-industry volatility spillover. Can you please guide me in this regard?
Syeda Tuba Bukhari October 17, 2022 at 8:45 pm

Hi. What is the best way to solve the missing value problem in categorical variables using survey data analysis in Stata?
- Attaullah Shah November 4, 2022 at 11:15 am
  
  Tuba, I do not have expertise in survey data.