In this post, I show how to convert string variables to numeric in Stata. String variables are shown in red . If a numeric variable is stored as a string variable in Stata, we have several ways to convert them to numeric variables. Let’s start with the destring command first.

1. The destring command

The destring command might be the first choice for converting string variables to numeric if we have a limited number of non-numeric characters. With this command, we can either generate a new variable or replace the existing one. Here is an example:

*First create a dummy data
input str15(price return)
"120.25" ".10"
"122.25" ".12"
end

*Now destring the two variables
destring price,  replace
destring return, replace

1. The ignore option of destring

If there are non-numeric characters in our dataset, destring command will show an error

non-numeric characters found

For example, our data might have comma separators, therefore, destring will generate the above error. In such cases, we can use the ignore(“,”) option, which tells Stata that the given character i.e. ” ,” should be ignored. See the following example:

*First create a dummy data
input str15(price return)
"12,000.25" ".10"
"12,200.25" ".12"
end
*Now destring the two variables
destring price, replace ignore(",")

1.2 List all non-numeric characters

We can list all non-numeric characters using the tabulate command and the real() function. Suppose that our variable strvar contains non-numeric values

tabulate strvar if missing(real(strvar ))

Suppose that the above code comes up with a list of the following non-numeric characters.

tabulate strvar if missing(real(strvar))

     strvar |      Freq.     Percent        Cum.
------------+-----------------------------------
     #error |          1        0.00        0.00
        #na |          1        0.00        0.00
          . |    171,106      100.00      100.00
        14* |          1        0.00      100.00
------------+-----------------------------------

The above tables shows that there are three non-numeric variables in our dataset. These are #error, #na, and *. We can specify these in the ignore option. The following code creates a new variable numvar

destring strvar, ignor("#error" "#na" "*") gen(numvar)

2. The real() function

The destring command is useful in a sense that it does not convert data to missing observations. Instead, it gives you an error message when there are non-numeric characters in the variable. If you are sure that observations with non-numeric characters are not needed , you can use the real() function with generate command.

generate newvar = real(strvar)

The above code uses a brute force. It converts string values to numeric values. And if there are non-numeric characters? Those observations are set to missing values.


Your support keeps these efforts alive