D-SOCIAL-KLD: Methodology

Dynamic IRT measurement of corporate social responsibility

Author

Robert J. Carroll

1 The measurement problem

Corporate social responsibility (CSR) is a latent construct. We do not observe a firm’s CSR directly; instead, we observe dozens of binary indicators—strengths and concerns—coded by ratings agencies like KLD (now MSCI ESG). The question is how to aggregate these indicators into a single score.

The standard approach in the management literature is the KLD Index: sum the strengths, subtract the concerns. This treats every indicator as equally informative, which is a strong and often implausible assumption. A strength that virtually every firm earns (e.g., a basic governance policy) tells us much less than one that only the most responsible firms achieve. An additive index cannot distinguish between these cases.

Item response theory (IRT) offers a principled alternative. Originally developed in psychometrics to score examinees on standardized tests, IRT models the probability of a binary response as a function of the respondent’s latent trait and the item’s statistical properties. The analogy is direct: firms are examinees, KLD indicators are test items, and the latent trait is CSR.

This document describes the dynamic IRT model used to produce D-SOCIAL-KLD scores, covering the model specification, identification strategy, estimation, and key implementation decisions. The original measure was introduced in Carroll, Primo, and Richter (2016); this implementation extends coverage through 2018 and re-estimates the model using Stan’s Hamiltonian Monte Carlo sampler.

2 Model specification

2.1 The two-parameter probit IRT model

For each observed triple of firm \(i\), KLD indicator \(j\), and year \(t\), the model specifies:

\[ y_{ijt} \sim \text{Bernoulli}\bigl(\Phi(\beta_{jt} \cdot \theta_{it} - \alpha_{jt})\bigr) \tag{1}\]

where:

  • \(\theta_{it}\) is the latent CSR score for firm \(i\) in year \(t\)—the quantity of interest.
  • \(\alpha_{jt}\) is the difficulty (intercept) of indicator \(j\) in year \(t\): under equation (1), a firm has a 50% probability of being coded 1 when \(\beta_{jt} \cdot \theta_{it} = \alpha_{jt}\). Holding \(\beta\) fixed, higher \(\alpha\) means fewer firms are coded 1 (a strength is “harder” to earn; a concern is rarer to trigger).
  • \(\beta_{jt}\) is the discrimination of indicator \(j\) in year \(t\): how steeply the probability of a 1 changes as \(\theta\) increases. Strength indicators tend to have \(\beta > 0\) (more-responsible firms are more likely to earn the strength). Concern indicators tend to have \(\beta < 0\) (more-responsible firms are less likely to have the concern).
  • \(\Phi(\cdot)\) is the standard normal CDF (probit link).

The discrimination parameter \(\beta\) is what makes IRT superior to an additive index. An indicator with \(|\beta| \approx 0\) contributes almost nothing to the score; one with large \(|\beta|\) is highly informative. The model estimates these weights from the data rather than assuming they are equal.
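The role of \(\beta\) is easy to see numerically. The following sketch (illustrative only, not part of the estimation pipeline) evaluates equation (1) for a highly discriminating item and a near-flat one; the hypothetical parameter values are chosen for illustration:

```python
from math import erf, sqrt

def phi(x):
    # Standard normal CDF (probit link), via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_coded_one(theta, beta, alpha):
    # Equation (1): P(y_ijt = 1) = Phi(beta * theta - alpha)
    return phi(beta * theta - alpha)

# A discriminating item (beta = 2) separates a low-CSR firm (theta = -1)
# from a high-CSR firm (theta = +1) sharply...
low, high = p_coded_one(-1.0, 2.0, 0.0), p_coded_one(1.0, 2.0, 0.0)

# ...while a near-flat item (beta = 0.1) barely separates them at all.
low_flat, high_flat = p_coded_one(-1.0, 0.1, 0.0), p_coded_one(1.0, 0.1, 0.0)
```

With these values the discriminating item spreads the two firms' probabilities by roughly 0.95, the flat item by under 0.08, which is why the model can downweight the latter.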

2.2 Dynamic prior on \(\theta\)

A static IRT model would treat each year as an independent cross-section, discarding the information that a firm rated highly in 2005 is likely to be rated highly in 2006. The dynamic extension imposes a random-walk prior on \(\theta\):

\[ \theta_{i, t_0} \sim \text{Normal}(0, \sigma_{\text{init}}) \tag{2}\]

\[ \theta_{i,t} \mid \theta_{i,t-1} \sim \text{Normal}(\theta_{i,t-1}, \sigma_{\text{firm},i}) \tag{3}\]

The firm-specific innovation standard deviation \(\sigma_{\text{firm},i}\) governs how quickly each firm’s CSR position can change year to year. A firm with small \(\sigma_{\text{firm}}\) has a CSR trajectory that is relatively stable; a firm with large \(\sigma_{\text{firm}}\) can shift rapidly.

Rather than estimating each \(\sigma_{\text{firm},i}\) in isolation, we use partial pooling via a hierarchical prior:

\[ \sigma_{\text{firm},i} \sim \text{Half-Normal}(0, \sigma_{\text{global}}) \]

\[ \sigma_{\text{global}} \sim \text{Half-Normal}(0, 1) \]

This structure lets the model learn from the full cross-section of firms what a “typical” rate of change looks like, while still allowing individual firms to deviate from that baseline. The result is a Bayesian shrinkage estimator: firms with limited data are pulled toward the group mean rate of change, while firms with abundant data are estimated more freely.
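A generative sketch of equations (2)–(3) plus the hierarchical prior makes the structure concrete. This is a prior simulation in Python, not the Stan code; the function name and arguments are invented for illustration:

```python
import random

random.seed(1)

def simulate_trajectory(n_years, sigma_init, sigma_global):
    # Firm-specific innovation SD from the hierarchical prior:
    # |Normal(0, sigma_global)| is a Half-Normal(0, sigma_global) draw.
    sigma_firm = abs(random.gauss(0.0, sigma_global))

    # Equation (2): first-year score drawn around zero.
    theta = [random.gauss(0.0, sigma_init)]

    # Equation (3): random walk with firm-specific step size.
    for _ in range(n_years - 1):
        theta.append(random.gauss(theta[-1], sigma_firm))
    return sigma_firm, theta
```

Simulating many firms this way shows the intended behavior: most trajectories drift slowly (small \(\sigma_{\text{firm}}\)), while a minority drawn from the tail of the half-normal move quickly.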

2.3 Item priors

The difficulty and discrimination parameters receive mildly regularizing priors:

\[ \alpha_{jt} \sim \text{Normal}(0, 5) \qquad \beta_{jt} \sim \text{Normal}(0, 2.5) \]

These are wide enough to let the data drive the estimates while preventing extreme values that could destabilize the sampler.

3 Identification

IRT models are identified only up to a reflection (multiplying all \(\theta\) by \(-1\) and all \(\beta\) by \(-1\) gives the same likelihood) and a scale (the units of \(\theta\) are arbitrary without a reference point).
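The reflection invariance can be verified in one line: flipping the signs of both \(\theta\) and \(\beta\) leaves the linear predictor \(\beta\theta - \alpha\), and hence the likelihood, unchanged. A minimal numerical check, with arbitrary illustrative values:

```python
from math import erf, sqrt

def phi(x):
    # Standard normal CDF (probit link)
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

theta, beta, alpha = 0.7, -1.3, 0.4

p_original  = phi(beta * theta - alpha)
p_reflected = phi((-beta) * (-theta) - alpha)  # flip theta and beta together
# The two likelihood contributions are identical, so the data cannot
# determine the sign convention without an external constraint.
```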

We fix both by pinning two anchor firms’ first-year scores to fixed values:

  • Halliburton (HAL): \(\theta_{\text{HAL},t_0} = -1\). Consistently associated with environmental and governance controversies; used as the negative anchor in Carroll, Primo, and Richter (2016).
  • Abbott Laboratories (ABT): \(\theta_{\text{ABT},t_0} = +1\). Consistently ranked among top corporate citizens in major CSR indices (DJSI, MSCI, Fortune).

where \(t_0\) denotes each anchor firm’s first appearance in the data. The fixed values apply only to the first-year score; subsequent years evolve freely under the dynamic prior. Fixing at \(\pm 1\) rather than merely constraining the sign eliminates a scale degeneracy: sign constraints allow the anchor scores to hover near zero, leaving the overall scale of \(\theta\) and \(\beta\) unidentified across chains. The fixed values are hard-coded in the transformed parameters block; the anchor scores are not sampled parameters.

Note

The original paper used Microsoft (MSFT) as the positive anchor. MSFT lacks KLD indicator data before 2010, making it unsuitable for the full panel. Johnson & Johnson (JNJ) was considered as a replacement but also lacks data before 2001. ABT has complete coverage and a strong, well-documented CSR record throughout the period.

4 Estimation

4.1 From Gibbs to HMC

The original D-SOCIAL-KLD scores were estimated using MCMCpack::MCMCdynamicIRT1d in R, which implements a Gibbs sampler with data augmentation following Martin and Quinn (2002). Gibbs sampling has several limitations for a model of this size:

  1. High autocorrelation: Gibbs draws are serially correlated because each parameter is updated conditional on all others. Effective sample sizes are often a small fraction of the nominal chain length.
  2. No built-in diagnostics: convergence must be assessed post hoc, typically by informal visual inspection of trace plots.
  3. Difficult to parallelize: Gibbs updates are inherently sequential; running multiple chains requires separate processes with no communication.

This implementation uses Stan (Carpenter et al. 2017), which employs the No-U-Turn Sampler (NUTS), an adaptive variant of Hamiltonian Monte Carlo (HMC) (Hoffman and Gelman 2014). HMC uses gradient information to make large, low-autocorrelation moves through the posterior, resulting in dramatically higher effective sample sizes per draw. Stan also provides:

  • Automatic diagnostics: divergent transitions, R-hat, bulk and tail effective sample size
  • Parallel chains: multiple chains run simultaneously, enabling both faster estimation and formal convergence assessment
  • Pointwise log-likelihood: can be computed in a generated quantities block for leave-one-out cross-validation via the loo package (Vehtari, Gelman, and Gabry 2017). This block is omitted from the primary model file because storing an \(N\)-length vector per posterior draw exhausts memory for the full KLD dataset (~3.3M observations). LOO diagnostics should be computed on a subsample or via moment matching.

4.2 Implementation details

The Stan model uses several techniques to handle the scale of the problem:

Sparse data representation. The KLD panel is highly sparse: firms enter and exit the database, and the set of indicators changes over time. Rather than passing the full firm \(\times\) indicator \(\times\) year array to Stan (which would be mostly NA), the R preprocessing scripts convert the data to a flat list of non-missing \((i, j, t, y)\) tuples. This reduces the data footprint from millions of potential cells to the 3,273,013 actually observed.
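The wide-to-long conversion can be sketched in a few lines. This is an illustrative Python analogue of what the R preprocessing scripts do; the toy firm names and indicator columns are invented:

```python
# Hypothetical toy panel: (firm, year) -> indicator values, None = not observed.
wide = {
    ("AAA", 1991): {"COM_str_A": 1, "ENV_con_B": None},
    ("AAA", 1992): {"COM_str_A": 0, "ENV_con_B": 1},
    ("BBB", 1992): {"COM_str_A": None, "ENV_con_B": 0},
}

def to_sparse_tuples(wide):
    # Keep only non-missing cells as flat (firm, item, year, y) tuples,
    # mirroring the representation passed to Stan instead of a mostly-NA
    # firm x indicator x year array.
    tuples = []
    for (firm, year), items in wide.items():
        for item, y in items.items():
            if y is not None:
                tuples.append((firm, item, year, y))
    return tuples
```

In Stan, the likelihood then loops over these tuples (index vectors for firm, item, and year plus a response vector), so missing cells cost nothing.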

Non-centered parameterization (NCP). The centered form of the dynamic prior creates a funnel geometry in the joint posterior: when \(\sigma_{\text{global}}\) is small, \(\sigma_{\text{firm}}\) must be small, and when \(\sigma_{\text{firm}}\) is small, consecutive \(\theta\) values must be nearly identical. HMC must take extremely small leapfrog steps to navigate this narrow funnel, resulting in nearly all transitions hitting the maximum treedepth and near-zero E-BFMI values.

The NCP decouples these hierarchical dependencies by introducing raw standard-normal innovation parameters and treating \(\theta\) as a deterministic function in transformed parameters:

\[ \sigma_{\text{firm},i} = \sigma_{\text{global}} \cdot \tilde{\sigma}_i, \qquad \tilde{\sigma}_i \sim \text{Half-Normal}(0,1) \]

\[ \theta_{i,t_0} = \sigma_{\text{init}} \cdot z^{\text{init}}_i, \qquad z^{\text{init}}_i \sim \text{Normal}(0,1) \]

\[ \theta_{i,t} = \theta_{i,t-1} + \sigma_{\text{firm},i} \cdot z^{\text{trans}}_{i,t}, \qquad z^{\text{trans}}_{i,t} \sim \text{Normal}(0,1) \]

(with the exception of the anchor firms’ first-year scores, which are fixed constants—not sampled—as described in Section 3). The \(\theta\) vector is assembled sequentially in the transformed parameters block by processing transitions in chronological order within each firm. Firm-years that follow gaps in the panel (i.e., years that are not consecutive transitions) receive the same \(\sigma_{\text{init}} \cdot z\) initialization as first appearances.
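The sequential assembly logic, for a single non-anchor firm, can be sketched as follows. This is a Python rendering of the transformed-parameters recursion, with an invented `consecutive` flag marking whether year \(t\) directly follows year \(t-1\) in that firm's panel:

```python
def assemble_theta(z_init, z_trans, sigma_init, sigma_firm, consecutive):
    # Rebuild one firm's theta trajectory from raw standard-normal draws.
    # consecutive[t] is True when year t is a direct transition from t-1.
    theta = []
    for t in range(len(z_init)):
        if t == 0 or not consecutive[t]:
            # First appearance, or re-entry after a gap: re-initialize
            # from the sigma_init scale (non-centered form of eq. 2).
            theta.append(sigma_init * z_init[t])
        else:
            # Non-centered random-walk step (eq. 3):
            # theta_t = theta_{t-1} + sigma_firm * z_t.
            theta.append(theta[-1] + sigma_firm * z_trans[t])
    return theta
```

Because \(\theta\) is a deterministic function of standard-normal draws, the sampler explores a geometry without the hierarchical funnel.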

Initialization. All chains are initialized at zero on the unconstrained scale (init = 0). Under NCP, this places all \(z\) innovations at zero, so every non-anchor \(\theta\) starts at zero; the anchor scores are fixed at \(\pm 1\) and are not sampled. This avoids the initialization failures that arise with random starts when the dynamic prior drives \(\log p\) to \(-\infty\).

Post-estimation memory management. The full model has approximately 120,000 parameters, producing chain CSVs of ~3.4 GB each (13.6 GB total for 4 chains). Neither cmdstanr::as_cmdstan_fit() nor data.table::fread() can load these files on a 16 GB machine—the former exhausts memory building its internal data structure, and the latter overflows its header-line parser on the ~1.3 MB, 120,000-field column header. The pipeline converts the raw chain CSVs to Apache Parquet via Arrow, dropping the NCP internal parameters (z_init, z_trans, sigma_firm_raw) that are never needed downstream. This cuts the column count from ~120,000 to ~62,000 and file size to ~1.5 GB per chain. Diagnostics and score extraction then use Arrow’s column-selective Parquet reads, loading only the parameters needed for each operation—typically a few hundred to a few thousand columns at a time, keeping peak memory well under 1 GB.

4.3 Estimation results

The model was estimated with 4 chains, each running 1,000 warmup iterations and 2,500 sampling iterations (10,000 posterior draws total), with adapt_delta = 0.9. Wall-clock time was approximately 3.2 days on an Apple M-series machine (single-threaded per chain, 4 chains in parallel).

4.3.1 Sampler diagnostics

  • Divergences: 0 across all 4 chains
  • Max treedepth hits: 0 (max depth = 10)
  • E-BFMI: 0.80, 0.78, 0.76, 0.81

All E-BFMI values are well above the 0.3 concern threshold, confirming that the non-centered parameterization successfully resolves the funnel geometry that plagues the centered form of this model.

4.3.2 Convergence

Convergence was assessed via R-hat and effective sample size (ESS) on a stratified sample of 352 parameters spanning all parameter types—hyperparameters, item difficulties, item discriminations, firm-year scores, and firm-level innovation standard deviations:

  • R-hat: max 1.004 across all 352 parameters (all \(\leq\) 1.01)
  • Bulk ESS: min 741 (sigma_init), generally 2,000–26,000+
  • Tail ESS: min 1,263, generally 2,800–10,000+

The lowest bulk ESS belongs to \(\sigma_{\text{init}}\), the global initialization scale—a single scalar governing all firms’ first-year draws. Even at 741, this represents adequate posterior exploration for a parameter with a simple unimodal marginal. All item parameters and firm-year scores have bulk ESS comfortably in the thousands.

4.3.3 Anchor trajectories

The identification strategy pins each anchor firm’s first-year score to a fixed value (HAL at \(-1\), ABT at \(+1\) in 1991); subsequent years evolve freely under the dynamic prior. The resulting trajectories are substantively revealing.

ABT (the positive anchor) remains consistently positive throughout the panel, with posterior means ranging from 0.65 to 0.87—stable high-CSR performance over nearly three decades.

HAL (the negative anchor) tells a more surprising story. Pinned at \(-1\) in 1991, Halliburton’s estimated CSR score turns positive by 1993 and remains in the 0.5–0.8 range for the remainder of the period. This is not a failure of identification—the first-year pin successfully sets the scale, and the model is free to let the trajectory evolve as the data dictate. What the data dictate is that even a firm selected as a low-CSR exemplar in the early 1990s improved substantially over the following decades.

This pattern is not unique to Halliburton. Among all firms in the panel with at least 10 years of coverage, not a single one maintains a consistently negative CSR trajectory across the full 1991–2018 period. The most persistently negative firms (e.g., Saul Centers, DXP Enterprises, Cass Financial) enter the panel in 2003 or later, after KLD expanded its coverage universe to include smaller firms. The implication is clear: corporate social responsibility, as measured by KLD indicators, improved broadly and substantially over this period. A dynamic model captures this secular trend through the \(\theta\) trajectories; a static cross-sectional analysis would miss it entirely.

5 Key findings

Three results from the estimation merit emphasis.

5.1 The model works at scale

Zero divergences, zero max-treedepth hits, healthy E-BFMI, and R-hat \(\leq\) 1.004 across all spot-checked parameters. The non-centered parameterization eliminates the pathological funnel geometry that would otherwise cripple HMC at this scale. This is a model with over 120,000 parameters estimated from 3.3 million observations; clean diagnostics validate both the parameterization strategy and the choice of Stan over Gibbs sampling.

5.2 CSR improved broadly over 1991–2018

The anchor trajectory analysis (see Section 4.3.3) reveals that no firm with full-panel coverage maintains a persistently negative CSR score. Halliburton—chosen as a low-CSR exemplar based on its environmental and governance record—turns positive within two years of the 1991 starting point. This reflects the observable pattern that KLD strength indicators became more widely adopted and concern indicators became less prevalent over time. The dynamic model captures this secular improvement through the firm-year \(\theta\) trajectories rather than absorbing it into the item parameters, making the trend visible and measurable.

5.3 Indicator informativeness varies substantially

The estimated discrimination parameters (\(\beta\)) range widely across indicators. Some KLD items sharply separate high- from low-CSR firms; others contribute almost no discriminating power. This variation is precisely what the additive KLD Index cannot capture. By weighting all indicators equally, the index treats a near-universal governance checkbox the same as a rare environmental commitment that only the most responsible firms achieve. The IRT scores correct for this by data-adaptively downweighting uninformative items and amplifying informative ones.

6 Data

6.1 Source

The raw data come from KLD STATS (now MSCI ESG), a panel of binary CSR indicators for publicly traded U.S. firms. The dataset used here covers 1991–2018 and is available through Wharton Research Data Services (WRDS). KLD indicators are organized into categories (community, diversity, employee relations, environment, human rights, product) and coded as either strengths or concerns.

6.2 Preprocessing

The cleaning pipeline handles several data-quality issues:

  1. Non-binary values: a small number of indicator values exceed 1 (e.g., HUM_str_num = 2). These are clamped to 1.
  2. Ticker disambiguation: KLD’s coverage expanded over time, and different firms occasionally share the same ticker symbol across coverage universes. The disambiguation uses issuerid where available (2017–2018) and deterministic numeric suffixes (.2, .3, …) for earlier years.
  3. Time-varying indicator sets: not all indicators appear in all years. Columns that are entirely NA in a given year are excluded from that year’s item set.
  4. Universe restriction: the estimation sample is restricted to firms present in the KLD database by 2012, ensuring comparability with the original published scores.
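Step 1 of the pipeline is trivial but worth stating precisely, since a naive recode could clobber missing values. A minimal Python sketch (the actual pipeline is in R; the function name is invented):

```python
def clamp_binary(value):
    # Clamp indicator counts above 1 (e.g., HUM_str_num = 2) down to 1,
    # while leaving missing values missing rather than coercing them to 0.
    if value is None:
        return None
    return min(int(value), 1)
```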

6.3 Scale and sparsity

The full panel contains 6,324 unique firms, 2,068 indicator-year items, and 3,273,013 non-missing observations across 51,586 firm-years. Coverage is uneven: the early KLD years (1991–2000) include ~650 firms per year; later years include 3,000+. Many prominent firms (e.g., IBM, Coca-Cola) have surprisingly sparse indicator data in certain periods, reflecting KLD’s evolving coverage universe rather than missing observations in the usual sense.

7 Interpreting the output

7.1 Scores (\(\theta\))

The primary output is a posterior distribution over \(\theta_{it}\) for each firm-year. The scores file reports the posterior mean, standard deviation, and percentiles.

Key properties:

  • Centered near zero: by construction, the distribution of scores in any given year has mean near zero. Positive scores indicate above-average CSR for that year; negative scores indicate below-average.
  • Interval-scaled: differences between scores are meaningful (a firm at +1.5 has measurably higher CSR than a firm at +0.5), but ratios are not (a score of +2 does not mean “twice as responsible” as +1).
  • Uncertainty quantified: the posterior SD for each firm-year is an explicit measure of estimation uncertainty. Firms with more indicators observed in a given year will have smaller SDs (more precise scores). Researchers should propagate this uncertainty into downstream analyses.
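One simple way to propagate score uncertainty into a downstream comparison is to compare posterior draws rather than point estimates. A minimal sketch, assuming approximately normal marginals summarized by the reported mean and SD (the firm values here are hypothetical):

```python
import random

random.seed(7)

# Hypothetical posterior summaries (mean, sd) for two firm-years.
firm_a = (0.9, 0.15)
firm_b = (0.6, 0.30)

def prob_a_exceeds_b(a, b, n_draws=20_000):
    # Monte Carlo probability that firm A's latent score exceeds firm B's,
    # integrating over both firms' estimation uncertainty.
    wins = sum(random.gauss(*a) > random.gauss(*b) for _ in range(n_draws))
    return wins / n_draws

p = prob_a_exceeds_b(firm_a, firm_b)
```

Even though A's posterior mean is 0.3 higher, the overlap in the two posteriors means the ranking is far from certain, which a point-estimate comparison would hide.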

7.2 Item parameters (\(\alpha\), \(\beta\))

Posterior summaries for each indicator’s difficulty and discrimination are also available. These are useful for understanding which KLD indicators are most informative about the latent CSR dimension and how the indicator set has evolved over time.

8 References

Carpenter, Bob, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76 (1): 1–32. https://doi.org/10.18637/jss.v076.i01.
Carroll, Robert J., David M. Primo, and Brian K. Richter. 2016. “Using Item Response Theory to Improve Measurement in Strategic Management Research: An Application to Corporate Social Responsibility.” Strategic Management Journal 37 (1): 66–85. https://doi.org/10.1002/smj.2463.
Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (47): 1593–623.
Martin, Andrew D., and Kevin M. Quinn. 2002. “Dynamic Ideal Point Estimation via Markov Chain Monte Carlo for the U.S. Supreme Court, 1953–1999.” Political Analysis 10 (2): 134–53. https://doi.org/10.1093/pan/10.2.134.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.