Structural equation modeling


Structural equation modeling is a diverse set of methods used by scientists for both observational and experimental research. SEM is used mostly in the social and behavioral science fields, but it is also used in epidemiology, business, and other fields. By a standard definition, SEM is "a class of methodologies that seeks to represent hypotheses about the means, variances, and covariances of observed data in terms of a smaller number of 'structural' parameters defined by a hypothesized underlying conceptual or theoretical model".
SEM involves a model representing how various aspects of some phenomenon are thought to causally connect to one another. Structural equation models often contain postulated causal connections among some latent variables. Additional causal connections link those latent variables to observed variables whose values appear in a data set. The causal connections are represented using equations, but the postulated structuring can also be presented using diagrams containing arrows as in Figures 1 and 2. The causal structures imply that specific patterns should appear among the values of the observed variables. This makes it possible to use the connections between the observed variables' values to estimate the magnitudes of the postulated effects, and to test whether or not the observed data are consistent with the requirements of the hypothesized causal structures.
The boundary between what is and is not a structural equation model is not always clear, but SE models often contain postulated causal connections among a set of latent variables and causal connections linking the postulated latent variables to variables that can be observed and whose values are available in some data set. Variations among the styles of latent causal connections, variations among the observed variables measuring the latent variables, and variations in the statistical estimation strategies result in the SEM toolkit including confirmatory factor analysis, confirmatory composite analysis, path analysis, multi-group modeling, longitudinal modeling, partial least squares path modeling, latent growth modeling and hierarchical or multilevel modeling.
SEM researchers use computer programs to estimate the strength and sign of the coefficients corresponding to the modeled structural connections, for example the numbers connected to the arrows in Figure 1. Because a postulated model such as Figure 1 may not correspond to the worldly forces controlling the observed data measurements, the programs also provide model tests and diagnostic clues suggesting which indicators, or which model components, might introduce inconsistency between the model and observed data. Criticisms of SEM methods include disregard of available model tests, problems in the model's specification, a tendency to accept models without considering external validity, and potential philosophical biases.
A great advantage of SEM is that all of these measurements and tests occur simultaneously in one statistical estimation procedure, where all the model coefficients are calculated using all information from the observed variables. This means the estimates are more accurate than if a researcher were to calculate each part of the model separately.

History

Structural equation modeling began differentiating itself from correlation and regression when Sewall Wright provided explicit causal interpretations for a set of regression-style equations based on a solid understanding of the physical and physiological mechanisms producing direct and indirect effects among his observed variables. The equations were estimated like ordinary regression equations but the substantive context for the measured variables permitted clear causal, not merely predictive, understandings. O. D. Duncan introduced SEM to the social sciences in his 1975 book, and SEM blossomed in the late 1970's and 1980's when increasing computing power permitted practical model estimation. In 1987 Hayduk provided the first book-length introduction to structural equation modeling with latent variables, and this was soon followed by Bollen's popular text.
Different yet mathematically related modeling approaches developed in psychology, sociology, and economics. Early Cowles Commission work on simultaneous equations estimation centered on Koopman and Hood's algorithms from transport economics and optimal routing, with maximum likelihood estimation, and closed form algebraic calculations, as iterative solution search techniques were limited in the days before computers. The convergence of two of these developmental streams produced the current core of SEM. One of several programs Karl Jöreskog developed at Educational Testing Services, LISREL embedded latent variables within path-analysis-style equations. The factor-structured portion of the model incorporated measurement errors which permitted measurement-error-adjustment, though not necessarily error-free estimation, of effects connecting different postulated latent variables.
Traces of the historical convergence of the factor analytic and path analytic traditions persist as the distinction between the measurement and structural portions of models; and as continuing disagreements over model testing, and whether measurement should precede or accompany structural estimates. Viewing factor analysis as a data-reduction technique deemphasizes testing, which contrasts with path analytic appreciation for testing postulated causal connections – where the test result might signal model misspecification. The friction between factor analytic and path analytic traditions continue to surface in the literature.
Wright's path analysis influenced Hermann Wold, Wold's student Karl Jöreskog, and Jöreskog's student Claes Fornell, but SEM never gained a large following among U.S. econometricians, possibly due to fundamental differences in modeling objectives and typical data structures. The prolonged separation of SEM's economic branch led to procedural and terminological differences, though deep mathematical and statistical connections remain. Disciplinary differences in approaches can be seen in SEMNET discussions of endogeneity, and in discussions on causality via directed acyclic graphs. Discussions comparing and contrasting various SEM approaches are available highlighting disciplinary differences in data structures and the concerns motivating economic models.
Judea Pearl extended SEM from linear to nonparametric models, and proposed causal and counterfactual interpretations of the equations. Nonparametric SEMs permit estimating total, direct and indirect effects without making any commitment to linearity of effects or assumptions about the distributions of the error terms.
SEM analyses are popular in the social sciences because these analytic techniques help us break down complex concepts and understand causal processes, but the complexity of the models can introduce substantial variability in the results depending on the presence or absence of conventional control variables, the sample size, and the variables of interest. The use of experimental designs may address some of these doubts.
Today, SEM forms part of a basis of machine learning and neural networks. Exploratory and confirmatory factor analyses in classical statistics mirror unsupervised and supervised machine learning.

General steps and considerations

The following considerations apply to the construction and assessment of many structural equation models.

Model specification

Building or specifying a model requires attending to:
  • the set of variables to be employed,
  • what is known about the variables,
  • what is theorized or hypothesized about the variables' causal connections and disconnections,
  • what the researcher seeks to learn from the modeling, and
  • the instances of missing values and/or the need for imputation.
Structural equation models attempt to mirror the worldly forces operative for causally homogeneous cases – namely cases enmeshed in the same worldly causal structures but whose values on the causes differ and who therefore possess different values on the outcome variables. Causal homogeneity can be facilitated by case selection, or by segregating cases in a multi-group model. A model's specification is not complete until the researcher specifies:
  • which effects and/or correlations/covariances are to be included and estimated,
  • which effects and other coefficients are forbidden or presumed unnecessary,
  • and which coefficients will be given fixed/unchanging values.
The latent level of a model is composed of endogenous and exogenous variables. The endogenous latent variables are the true-score variables postulated as receiving effects from at least one other modeled variable. Each endogenous variable is modeled as the dependent variable in a regression-style equation. The exogenous latent variables are background variables postulated as causing one or more of the endogenous variables and are modeled like the predictor variables in regression-style equations. Causal connections among the exogenous variables are not explicitly modeled but are usually acknowledged by modeling the exogenous variables as freely correlating with one another. The model may include intervening variables – variables receiving effects from some variables but also sending effects to other variables. As in regression, each endogenous variable is assigned a residual or error variable encapsulating the effects of unavailable and usually unknown causes. Each latent variable, whether exogenous or endogenous, is thought of as containing the cases' true-scores on that variable, and these true-scores causally contribute valid/genuine variations into one or more of the observed/reported indicator variables.
The LISREL program assigned Greek names to the elements in a set of matrices to keep track of the various model components. These names became relatively standard notation, though the notation has been extended and altered to accommodate a variety of statistical considerations. Texts and programs "simplifying" model specification via diagrams or by using equations permitting user-selected variable names, re-convert the user's model into some standard matrix-algebra form in the background. The "simplifications" are achieved by implicitly introducing default program "assumptions" about model features with which users supposedly need not concern themselves. Unfortunately, these default assumptions easily obscure model components that leave unrecognized issues lurking within the model's structure, and underlying matrices.
Two main components of models are distinguished in SEM: the structural model showing potential causal dependencies between endogenous and exogenous latent variables, and the measurement model showing the causal connections between the latent variables and the indicators. Exploratory and confirmatory factor analysis models, for example, focus on the causal measurement connections, while path models more closely correspond to SEMs latent structural connections.
Modelers specify each coefficient in a model as being free to be estimated, or fixed at some value. The free coefficients may be postulated effects the researcher wishes to test, background correlations among the exogenous variables, or the variances of the residual or error variables providing additional variations in the endogenous latent variables. The fixed coefficients may be values like the 1.0 values in Figure 2 that provide a scales for the latent variables, or values of 0.0 which assert causal disconnections such as the assertion of no-direct-effects pointing from Academic Achievement to any of the four scales in Figure 1. SEM programs provide estimates and tests of the free coefficients, while the fixed coefficients contribute importantly to testing the overall model structure. Various kinds of constraints between coefficients can also be used. The model specification depends on what is known from the literature, the researcher's experience with the modeled indicator variables, and the features being investigated by using the specific model structure.
There is a limit to how many coefficients can be estimated in a model. If there are fewer data points than the number of estimated coefficients, the resulting model is said to be "unidentified" and no coefficient estimates can be obtained. Reciprocal effect, and other causal loops, may also interfere with estimation.