This post offers a brief tutorial on filling missing values in Stata variables. The tutorial uses the ‘fillmissing’ program, which you can install by entering the following command in the Stata command window:
ssc install fillmissing, replace
Important Note: This post does not imply that filling missing values is theoretically justified. Users should exercise their judgment and adhere to appropriate theoretical principles when addressing missing values.
Once you have installed the fillmissing program, you can use it to fill missing values in both numeric and string variables. The fillmissing program also supports the bysort prefix for filling missing values within specific groups. In the following sections, we’ll explore several examples of using the bysort prefix for group-based calculations. Before diving into these examples, however, let’s take a quick look at the options available within the program.
fillmissing: Program Options
The fillmissing program provides a range of options for addressing missing values:
- with(any)
- with(previous)
- with(next)
- with(first)
- with(last)
- with(mean)
- with(max)
- with(min)
- with(median)
Let us quickly walk through these options, keeping in mind that the last four (mean, max, min, and median) are designed for numeric variables.
1. with(any)
The with() option designates the source from which missing values are filled. Specifying with(any) is optional: if no with() option is given, the fillmissing program defaults to it. This option is particularly useful for a constant variable, where all non-missing values are identical but some values are missing. with(any) replaces these missing values with any available non-missing value of the same variable.
Example 1: Fill missing values with(any)
First, let’s create a sample dataset consisting of a single variable with ten observations. You can easily generate this dataset by copying and pasting the following code into the Stata Do editor:
clear all
set obs 10
gen symbol = "AABS"
replace symbol = "" in 5
replace symbol = "" in 8
The above dataset has missing values in rows 5 and 8. To fill them from any other available non-missing value, let us use the with(any) option.
fillmissing symbol, with(any)
Since with(any) is the default option of the program, we could also write the above code as:
fillmissing symbol
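To make the idea behind with(any) concrete, here is a minimal sketch that reproduces the same result for this constant string variable using only built-in Stata commands. It is an illustration under stated assumptions, not the way fillmissing is implemented internally; the helper variable obsno is made up for this example.

```stata
* Recreate the sample dataset from Example 1
clear all
set obs 10
gen symbol = "AABS"
replace symbol = "" in 5
replace symbol = "" in 8

* Hypothetical helper: remember the original order so it can be restored
gen long obsno = _n

* Empty strings sort first in ascending order, so sort descending
* to bring a non-missing value to the top of the data
gsort -symbol obsno

* Copy the first (non-missing) value into every missing observation
replace symbol = symbol[1] if missing(symbol)

* Restore the original order and drop the helper
sort obsno
drop obsno
```

This shortcut only makes sense when the variable really is constant, which is exactly the situation with(any) is described as targeting.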
2. with(previous)
The with(previous) option replaces the current missing value with the preceding value of the same variable. Note that if the previous value is also missing, the current value remains missing. This option does not sort the data, so fillmissing uses the existing data order to identify the current and previous observations. To learn more about the sort order of data in Stata, see the article Stata Tip 28: Precise control of Data Sort Order.
Example 2: Fill missing values with(previous)
Let’s create a dummy dataset first.
clear all
set obs 10
gen symbol = "AABS"
replace symbol = "AKBL" in 1
replace symbol = "" in 2
The dataset looks like this:
+--------+
| symbol |
+--------+
| AKBL   |
|        |
| AABS   |
| AABS   |
| AABS   |
| AABS   |
| AABS   |
| AABS   |
| AABS   |
| AABS   |
+--------+
To fill the missing value in observation 2 with ‘AKBL’ from the previous observation, simply type:
fillmissing symbol, with(previous)
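The behaviour described for with(previous), where a missing value stays missing if its predecessor is also missing, can be sketched with built-in commands as follows. This is only an approximation of the described behaviour, not fillmissing's own code, and the helper variable prev_symbol is hypothetical.

```stata
* Recreate the sample dataset from Example 2
clear all
set obs 10
gen symbol = "AABS"
replace symbol = "AKBL" in 1
replace symbol = "" in 2

* Store each observation's predecessor in a helper variable first,
* so that a freshly filled value is never itself carried forward
gen prev_symbol = symbol[_n-1]
replace symbol = prev_symbol if missing(symbol)
drop prev_symbol

* Inspect the first three observations
list symbol in 1/3
```

The helper variable matters: the one-line form replace symbol = symbol[_n-1] if missing(symbol) works through the observations in order and sees values it has just filled, so it would carry a value forward across a run of consecutive missing observations rather than leaving the later ones missing.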
Filling missing values within groups: by() or bysort options
To fill missing values within groups, let’s use the nlswork dataset from the web. This dataset has missing values in various variables. To create a report of missing values by groups, let’s use the missings program by Nick Cox, a valuable tool for reporting missing data in variables. Users can employ ‘asdocx’ to export the report in Word, Excel, or LaTeX formats.
webuse nlswork, clear
* Create a report of missing values for all variables if c_city variable is 1
asdocx missings report if c_city == 1
| variable | missings |
|---|---|
| age | 8 |
| msp | 10 |
| nev_mar | 10 |
| grade | 1 |
| ind_code | 121 |
| occ_code | 58 |
| union | 3634 |
| wks_ue | 1931 |
| tenure | 164 |
| hours | 24 |
| wks_work | 269 |
The above table shows that the age variable has 8 missing values where the c_city variable is 1. Across the whole dataset, however, the age variable has a total of 24 missing values. To confirm:
count if missing(age)
24
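To fill the missing values of age within each group of the c_city variable, add the bysort prefix. Because no with() option is specified here, the program uses its default, with(any); add with(previous) to fill from the preceding observation within each group instead. The code would be: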
bys c_city : fillmissing age
* (24 real changes made)
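Group-wise filling with a statistic such as the group mean follows the same by-group logic. The sketch below uses only built-in commands on a made-up toy dataset (the variables group and age and all values are hypothetical), so it shows the idea rather than fillmissing's actual implementation. With fillmissing installed, the analogous call would presumably be bysort group: fillmissing age, with(mean).

```stata
* Hypothetical toy data: two groups, one missing age in each
clear all
input group age
1 20
1 .
1 30
2 40
2 .
2 50
end

* Compute the mean of the non-missing ages within each group
bysort group: egen mean_age = mean(age)

* Replace each missing age with its own group's mean
replace age = mean_age if missing(age)
drop mean_age

* Confirm that no missing values remain
count if missing(age)
```

Because the mean is computed separately within each group, a missing value in group 1 is filled from group 1 only (here 25) and never from group 2 (here 45).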
Dear sir, this code is not installing to stata, please help “net install fillmissing, from(http://fintechprofessor.com) replace”
Dear Dr. Hassan Raz
I have converted the site to https protocol, therefore, you may try this method.
Very useful command, thanks. Would be helpful to have a help file installed along with the package itself for future reference. I can also confirm this works with the “bysort” command (in my Stata 15), which is exactly what I needed it to be able to do.
Dear, I have a question when using this fillmissing code in stata.
Example:
This command uses the average of the group, but I would like to use the average of the previous variable and the posterior variable to replace the missing, keeping the limits within each group.
Example of the database with missing
Example that I would like to arrive using the fillmissing code
I hope you can help me.
My best regards.
I could not understand the requirements. The data you have posted and the fillmissing command that you have used do not match. Can you please clarify it a bit further on what to use for filling the missing values?
Dear,
I am sorry for the lack of clarity in the explanation.
The original database consists of a panel, with more than 100 importing and 100 exporting countries, organized in pairs. The dependent variable is import flow and the independent variable is tariff.
The following database is similar to the original
Command:
Result with the above command
This command uses the average of the group, but I would like to use the average of the previous variable and the posterior variable to replace the missing, keeping the limits within each group (BRA USA; USA BRA; and so on).
My expected result would be a dataset similar to the one below:
The asdoc and fillmissing commands are very useful and help a lot in the job.
Excuse me for the inconvenience.
My best regards.
I think it is an interesting problem and will need recursive loops. I have added this option to fillmissing now.
Hello Dr Attaullah Shah;
I want to use the fillmissing program with the with(mean) option to solve missing value problems in panel data.
Thank you for this – it was really helpful!
I am new to Stata and want to run an inter-industry volatility spillover analysis. Can you please guide me in this regard?
Hi. What is the best way to solve the missing value problem in categorical variables when analysing survey data in Stata?
Tuba, I do not have expertise in survey data.