Merge datasets  Merge Compustat and CRSP Data in Stata


Merging Compustat and CRSP data is a common task in accounting and finance research. Compustat provides fundamental financial data on companies, while CRSP contains rich market and returns data. Combining these datasets enables powerful analyses of the link between financial performance and stock returns.

This comprehensive guide will walk you through the key steps involved in a merging Compustat and CRSP. We’ll cover the essentials of each database, merging mechanics and considerations, and pros and cons of using pre-merged data.

Introduction to Compustat and CRSP

Let’s start with an overview of Compustat and CRSP.

Compustat  Compustat


Compustat, maintained by S&P Global, contains fundamental data on public companies worldwide. It includes full income statements, balance sheets, statements of cash flows, ratios, pricing, earnings estimates and more. Here are some of the key variables found in the Compustat database:

  • GVKEY – Global company identifier unique to each company
  • DATADATE – Date for the fundamental data item, usually end of fiscal period
  • FYEAR – Fiscal year
  • FQTR – Fiscal quarter
  • SALE – Total revenue
  • COGS – Cost of goods sold
  • XRD – R&D expense
  • XINT – Interest expense
  • NI – Net income
  • EPSPX – Earnings per share
  • TXDB – Deferred taxes
  • PPENT – Property, plant and equipment (net)
  • AT – Assets total
  • LT – Liabilities total
  • CEQQ – Common/ordinary equity
  • CSHPRQ – Common shares outstanding
  • PRCC_F – Share price at fiscal year end
  • CUSIP – Committee on Uniform Security Identification Procedures number
  • NAICS – North American Industry Classification System code
  • SIC – Standard Industrial Classification code

Compustat data is organized by company with one row per company-date observation. The GVKEY code uniquely identifies each company across time.

Compustat variables needed for merge Compustat variables needed for merging


For merging Compustat data with CRSP, we need:

  • GVKEY: Unique company identifier assigned by Compustat
  • CUSIP: 9-digit numeric CUSIP code with check digit
  • DATADATE: Date of fundamentals – end of fiscal period

Compustat, maintained by S&P Global, contains fundamental data on public companies worldwide. It includes full income statements, balance sheets, statements of cash flows, ratios, pricing, earnings estimates and more. Key variables include:

The CRSP US Stock Database, from the University of Chicago’s Center for Research in Security Prices, provides historical market data on common stocks. CRSP contains security-level data, with multiple rows per company for each trading day, one for each security. The PERMNO field uniquely identifies securities, while PERMCO groups securities into companies. Here are some of the key variables in the CRSP US stock database:

  • PERMNO – Unique permanent security identifier
  • PERMCO – Unique permanent company identifier
  • DATE – Trade date
  • RET – Holding period return
  • VOL – Share volume traded
  • SHROUT – Shares outstanding
  • PRCC_C – Closing price
  • CFACPR – Cumulative factor to adjust price
  • CFACSHR – Cumulative factor to adjust shares
  • BIDLO – Bid price low
  • ASKHI – Ask price high
  • OPENPRC – Opening price
  • NUMTRD – Number of trades
  • SHRCD – Share code
  • EXCHCD – Exchange code
  • SICCD – Standard industrial classification code
  • NCUSIP – Historical CUSIP
  • CUSIP – Current CUSIP
  • TICKER – Ticker symbol
  • COMNAM – Company name

Some key uses of these variables include calculating returns, trading volume, liquidity measures, identifying missing returns, merging on historical CUSIPs, analyzing bid-ask spreads, and linking CRSP to Compustat data.

.

Compustat variables needed for merge  CRSP variables needed for merging

  • PERMNO: Unique permanent security identifier
  • PERMCO: Unique permanent company identifier
  • CUSIP: 8-character CUSIP code without check digit
  • DATE: Trading day of stock returns.

Steps in merging compustat with crsp Steps for Merging Compustat and CRSP


Merging CRSP with Compustat can be done in two ways: using the CCM database or using the CUSIP and DATADATE variables. The CCM-based merging requires access to the CCM database, which may not be available to your university or college. In the following section, we will discuss merging the two databases using the second approach.

1. Extract relevant variables

Pull the variables you will need for analysis from both datasets.

For Compustat, this typically includes GVKEY, CUSIP, DATADATE and fields for your research like assets, liabilities, earnings, cash flows, etc.

For CRSP, extract CUSIP, PERMNO, PERMCO, DATE, RET and any other return variables.

2. Standardize date formats

Ensure that the date fields from both sources use the same format. In Stata or SAS, it is recommended to convert the data to the software’s format and then extract date frequency at which both the dataset must be merged. For example, if you are merging annual data from Compustat with monthly data from CRSP, first extract the year variable from the date variables in the CRSP dataset. Match the variable names in both datasets. So, if the year variable is named fyear in the Compustat dataset, rename the year variable as fyear in the CRSP dataset as well. See data excerpts from the two datasets and note how we extract the year from the CRSP dataset.

 * Extract fyear from date variable in CRSP dataset

gen fyear = year(date)

Data excerpt from Compustat database
gvkey datadate fyear cusip at
001000 31dec1963 1963 000032102 .
001000 31dec1964 1964 000032102 1.416
001000 31dec1965 1965 000032102 2.310
001000 31dec1966 1966 000032102 2.430
001000 31dec1967 1967 000032102 2.456
001000 31dec1968 1968 000032102 5.922
001000 31dec1969 1969 000032102 28.712
001000 31dec1970 1970 000032102 33.450
001000 31dec1971 1971 000032102 29.330
001000 31dec1972 1972 000032102 19.907
001000 31dec1973 1973 000032102 21.771

3. Prepare CUSIP

Compustat’s CUSIP has 9 digits, having three parts.

1. Cusip issuer number – 6 digits (i.e. the company)
2. Cusip issue number – 2 digits (the particular stock or bond offering)
3. Cusip check digit – 1 digit

Therefore, for merging purpose, we just need 6 digits. The Stata function to extract six digits from the CUSIP is substr().

 * Extract 6 digits from CUSIP

gen cusip_6 = substr(cusip, 1, 6)

4. Merge datasets using CUSIP and date

With date and CUSIP formatted consistently, merge CRSP and Compustat using an inner join on both CUSIP and date. This attaches market data from CRSP to each company-date row from Compustat. The inner join combines records from two tables whenever there are matching values in a field common to both tables. In Stata’s parlance, inner join refers to :

keep if _merge==3

Since the CRSP data will have multiple records for the same CUSIP and date variable, more often than not, we shall use the m:1 merge.

* Open the Compustat data

use compustat, clear

.

* Now merge on CUSIP and TIME

merge 1:m CUSIP timevar using "CRSP"

.

*Keep only matching observations

keep if _merge == 3

5. Check for non-merges

Some Compustat observations may not find a match in CRSP on the same date due to non-trading days. Re-merge using date +/- 1 day to maximize merge rate.

6. Finalize dataset

Keep only the variables you need for analysis. With a single company per row but multiple securities, you may aggregate CRSP data like taking the mean return across all matched securities.

Proper data preparation is crucial prior to merging. Bad merges lead to inaccurate or misleading results. But carefully following these steps will yield a powerful merged dataset.

Merge Considerations and Choices

Merging Compustat and CRSP yourself offers flexibility compared to pre-merged data products. Here are some key considerations:

  • 1:M merges: A Compustat company may have multiple securities in CRSP. Analyze multi-security firms carefully.
  • Delisting returns: Merge on PERMCO not PERMNO to capture delisting returns spread across securities.
  • Time shifts: Use date +/- 1 merges to maximize observations. Choose approach based on research goals.
  • Pre- vs. post-merging aggregations: Decide whether to aggregate before or after merging based on needs.
  • Alternative identifiers: CUSIP standardization is common, but can also use PERMCO-GVKEY or WRDSEID maps.
  • CCM vs. self-merging: Pre-merged CCM data handles historical CUSIPs better, but self-merging allows more control.

Think through these dynamics and options to determine the optimal merging approach for your research.

Can I use Pre-Merged CCM Data

Merging Compustat and CRSP yourself provides the most flexibility. But researchers at subscribing institutions can also access pre-merged data from WRDS in the form of the CRSP/Compustat Merged (CCM) Database.

CCM has merged CRSP and Compustat historically, simplifying many tricky aspects. Benefits include:

  • Seamless handling of historical CUSIP changes
  • Pre-built PERMCO-GVKEY links
  • Treatment of non-trading days
  • Delisting return calculations

This can lower the barrier to entry for new merger analyses. However, limitations of CCM include:

  • Less control over how merges handled
  • Fixed variable list – no customizations
  • More difficult to merge additional data sources

If embarking on novel merger research, self-merging allows more creativity. But CCM offers convenience for common merger applications. Understand tradeoffs when choosing approach.