Predictive analytics


Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modeling, and machine learning that analyze current and historical facts to make predictions about future or otherwise unknown events.
In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision-making for candidate transactions.
The defining functional effect of these technical approaches is that predictive analytics provides a predictive score for each individual in order to determine, inform, or influence organizational processes that pertain across large numbers of individuals, such as in marketing, credit risk assessment, fraud detection, manufacturing, healthcare, and government operations including law enforcement.
Since 2022, the field has evolved significantly with the integration of generative AI and large language models, moving from purely numerical forecasting to "Predictive GenAI," which combines forecasting with automated content generation and agentic workflows.

Definition

Predictive analytics is a set of business intelligence technologies that uncovers relationships and patterns within large volumes of data that can be used to predict behavior and events. Unlike other BI technologies, predictive analytics is forward-looking, using past events to anticipate the future.
The statistical techniques used in predictive analytics include data modeling, machine learning, artificial intelligence, deep learning algorithms, and data mining. Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown, whether in the past, present, or future. Examples include identifying suspects after a crime has been committed and detecting credit card fraud as it occurs.
The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting them to predict the unknown outcome. The accuracy and usability of the results, however, depend greatly on the level of data analysis and the quality of the assumptions.

Evolution and Generative AI Integration (2022–Present)

Traditionally, predictive analytics focused on discriminative models: algorithms that classify data or predict a value. Since 2022, the emergence of generative AI has expanded the field's capabilities.
  • Predictive GenAI: This hybrid approach uses predictive models to identify a future event and generative models to create an intervention. For instance, a predictive model may flag a high-risk customer, while a generative model drafts a personalized retention email.
  • Synthetic Data Generation: Generative adversarial networks and variational autoencoders are used to create synthetic datasets, allowing organizations to train predictive models on data that mimics real-world patterns without compromising user privacy.
  • Natural Language Querying: Business users can now utilize natural language processing to query data without needing knowledge of SQL or Python, lowering the barrier to entry for analytics.

Technology Stack

The modern technology stack for predictive analytics, often referred to as the "Modern Data Stack," has shifted from on-premise servers to cloud-native, real-time architectures.

Infrastructure

  • Data Lakehouses: Platforms such as Databricks and Snowflake combine the structure of data warehouses with the flexibility of data lakes. This allows predictive models to run directly on high-volume raw data.
  • Vector Databases: To support AI-driven analytics, vector databases store data as high-dimensional vectors. This enables semantic search and allows predictive models to incorporate unstructured data such as text, audio, and video.
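
The semantic search that vector databases enable can be illustrated with a minimal sketch. It assumes that each item has already been converted to a vector by some embedding model. The three-dimensional vectors, the item names, and the helper functions below are hypothetical and chosen only for illustration; production systems use far higher dimensions and approximate indexes rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, store, k=1):
    """Return the k stored item names most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query, store[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings; real ones would come from an embedding model.
store = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "return an item": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

print(nearest(query, store, k=2))  # ['refund policy', 'return an item']
```

Items whose vectors point in a similar direction to the query rank highest, which is how text with related meaning can be retrieved even when it shares no keywords with the query.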

Analytical techniques

The approaches and techniques used to conduct predictive analytics can broadly be grouped into regression techniques and machine learning techniques.

Machine learning

Machine learning is a branch of artificial intelligence in which algorithms learn patterns from data and improve their performance on a task without being explicitly programmed for it. In predictive analytics, a model is trained on historical examples and then applied to new cases to predict an outcome.

Autoregressive Integrated Moving Average (ARIMA)

ARIMA models are a common example of time series models. The autoregressive (AR) part regresses the current value of the series on its own past values, the moving average (MA) part models the current value in terms of past forecast errors, and the integrated (I) part refers to differencing the series to remove trends. After differencing, the series is expected to be stationary: it has no overall trend and instead varies around a constant average with a constant amplitude, resulting in statistically similar time patterns. The parameters are usually estimated with statistical software, and the fitted model is then used to filter the data and predict future values.
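
As a rough illustration of the autoregressive and integrated parts of an ARIMA model, the sketch below differences a series once and fits a single autoregressive coefficient to the differences by least squares, which corresponds to an ARIMA(1,1,0) model without a constant. The function names and the data are hypothetical, and statistical packages typically estimate the parameters by maximum likelihood rather than by this simplified method.

```python
def difference(series):
    """First difference d[t] = y[t] - y[t-1] (the 'integrated' step, d = 1)."""
    return [b - a for a, b in zip(series, series[1:])]

def fit_ar1(diffs):
    """Least-squares estimate of phi in d[t] = phi * d[t-1] + error."""
    numerator = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    return numerator / denominator  # fails if all earlier differences are 0

def forecast_next(series):
    """One-step forecast: last value plus the predicted next difference."""
    diffs = difference(series)
    phi = fit_ar1(diffs)
    return series[-1] + phi * diffs[-1], phi

# Hypothetical series whose period-to-period changes shrink over time.
observations = [100.0, 104.0, 106.0, 107.0, 107.5]
forecast, phi = forecast_next(observations)

print(phi)       # 0.5
print(forecast)  # 107.75
```

The differences here are 4, 2, 1, and 0.5, so each change is half of the previous one. The fitted coefficient of 0.5 carries that pattern one period forward.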
Exponential smoothing models are closely related to ARIMA methods; simple exponential smoothing, for example, produces the same forecasts as an ARIMA(0,1,1) model. Exponential smoothing takes into account the difference in importance between older and newer observations, as more recent data is generally more relevant for predicting future values. To accomplish this, the weight given to each observation decreases exponentially with its age, so that newer observations carry a larger weight in the calculations than older ones.
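
The weighting idea behind simple exponential smoothing can be sketched in a few lines. The smoothing factor `alpha`, the choice to start from the first observation, and the sample data are assumptions made for illustration. A larger `alpha` puts more weight on recent observations.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing.

    s[0] = values[0]
    s[t] = alpha * values[t] + (1 - alpha) * s[t - 1]
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    if not values:
        return []
    smoothed = [float(values[0])]
    for x in values[1:]:
        # The newest observation gets weight alpha; everything older
        # is discounted by a further factor of (1 - alpha) each step.
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [10, 12, 14, 16]  # hypothetical observations
smoothed = exponential_smoothing(sales, alpha=0.5)

print(smoothed)  # [10.0, 11.0, 12.5, 14.25]
```

The last smoothed value serves as the forecast for the next period. Because each step multiplies the older contribution by `1 - alpha`, an observation that is n periods old ends up with a weight proportional to `(1 - alpha) ** n`.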

Time series models

Time series models use the past values of a time series in order to understand and forecast it. A time series is the sequence of a variable's value over equally spaced periods, such as years or quarters in business applications. To reveal trends in the data, the series is often smoothed, meaning that its random variation is removed. There are multiple ways to accomplish this.
Single moving average
Single moving average methods average a fixed number of the most recent observations and move that window forward one period at a time. Because each average uses only recent data rather than the entire data set, it follows changes in the level of the series more closely than a single overall average would.
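
A single (trailing) moving average can be sketched as follows. The window length and the data are hypothetical, and the function name is chosen only for this illustration.

```python
def single_moving_average(values, window):
    """Average of the most recent `window` observations at each period.

    The first result corresponds to period index window - 1, the first
    point at which a full window of past data is available.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

demand = [2, 4, 6, 8, 10]  # hypothetical observations
averages = single_moving_average(demand, window=3)

print(averages)      # [4.0, 6.0, 8.0]
print(averages[-1])  # 8.0, usable as a forecast for the next period
```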
Centered moving average
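Centered moving average methods place each average at the midpoint of its window, so that the smoothed value for a period uses observations on both sides of it. This is straightforward when the window covers an odd number of periods, because the window then has a middle period. With an even number of periods there is no single middle period, so two adjacent moving averages must themselves be averaged to center the result, which makes the method simpler to apply with odd-length windows.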
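
The difference between odd and even windows can be seen in a short sketch. For an even window it uses the common convention of giving the two end observations half weight, which is equivalent to averaging two adjacent moving averages. The function name, the data, and the use of `None` for periods without a full window are assumptions for illustration.

```python
def centered_moving_average(values, window):
    """Centered moving average aligned with the input series.

    Periods too close to either end to have a full window are None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(values)
    half = window // 2
    result = [None] * n
    for t in range(half, n - half):
        if window % 2 == 1:
            # Odd window: period t is the exact middle of the window.
            total = sum(values[t - half : t + half + 1])
        else:
            # Even window: the two end points get half weight, which
            # equals the average of two adjacent window-length averages.
            inner = sum(values[t - half + 1 : t + half])
            total = inner + 0.5 * (values[t - half] + values[t + half])
        result[t] = total / window
    return result

demand = [2, 4, 6, 8, 10]  # hypothetical observations

print(centered_moving_average(demand, 3))  # [None, 4.0, 6.0, 8.0, None]
print(centered_moving_average(demand, 4))  # [None, None, 6.0, None, None]
```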

Predictive modeling

Predictive modeling is a statistical technique used to predict future behavior. It utilizes predictive models to analyze the relationship between a specific unit in a given sample and one or more features of that unit. The objective of these models is to assess the likelihood that a unit in another sample will display the same pattern. Predictive modeling solutions can be considered a type of data mining technology. They analyze both historical and current data in order to build a model that predicts potential future outcomes.
Regardless of the methodology used, in general, the process of creating predictive models involves the same steps. First, it is necessary to determine the project objectives and desired outcomes and translate these into predictive analytic objectives and tasks. Then, analyze the source data to determine the most appropriate data and model building approach. Select and transform the data in order to create models. Create and test models in order to evaluate if they are valid and will be able to meet project goals and metrics. Apply the model's results to appropriate business processes. Afterward, manage and maintain models in order to standardize and improve performance.
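
The "create and test" step can be illustrated with a minimal hold-out evaluation. The sketch below keeps the most recent observations aside as a test set and compares two deliberately simple candidate models by mean absolute error. The data, the split point, and both candidate models are hypothetical stand-ins for whatever a real project would use.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

history = [20, 22, 21, 23, 24, 26, 25, 27]  # hypothetical observations

# Keep the order of the data: earlier periods train, later periods test.
train, test = history[:6], history[6:]

# Candidate 1: always predict the mean of the training data.
mean_model = [sum(train) / len(train)] * len(test)
# Candidate 2: always predict the last training value.
last_value_model = [train[-1]] * len(test)

mae_mean = mean_absolute_error(test, mean_model)
mae_last = mean_absolute_error(test, last_value_model)

print(mae_mean)
print(mae_last)  # 1.0
```

On this data the last-value model has the lower error on the held-out periods, so it would be the one carried forward to deployment and monitoring.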

Regression analysis

Generally, regression analysis uses structural data along with the past values of independent variables and the relationship between them and the dependent variable to form predictions.

Linear regression

In linear regression, a plot is constructed with the previous values of the dependent variable plotted on the Y-axis and the independent variable that is being analyzed plotted on the X-axis. A statistical program then fits a regression line that represents the relationship between the independent and dependent variables, and this line can be used to predict values of the dependent variable based only on the independent variable. Along with the regression line, the program reports a slope-intercept equation for the line that includes an error term; the larger the errors, the less precise the regression model is. In order to reduce the error, other independent variables are introduced to the model, and similar analyses are performed on these independent variables.
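
What the statistical program computes for a single independent variable can be sketched with ordinary least squares. The function name and the data are hypothetical. The residuals are the observed counterpart of the error term described above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # fails if all x values are identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
prediction = intercept + slope * 5  # predicted y for a new x of 5

print(slope, intercept)  # 2.0 1.0
print(prediction)        # 11.0
```

With real data the residuals are not all zero, and their size indicates how precisely the line describes the relationship.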

Applications

Analytical Review and Conditional Expectations in Auditing

Analytical review is an important aspect of auditing in which the reasonableness of reported account balances is assessed. Auditors use predictive modeling to form predictions, called conditional expectations, of the balances being audited. These expectations are produced with autoregressive integrated moving average (ARIMA) methods and general regression analysis methods, specifically the Statistical Technique for Analytical Review (STAR) methods.
The ARIMA method for analytical review uses time-series analysis on past audited balances in order to create the conditional expectations. These conditional expectations are then compared to the actual balances reported on the audited account in order to determine how close the reported balances are to the expectations. If the reported balances are close to the expectations, the accounts are not audited further. If the reported balances are very different from the expectations, there is a higher possibility of a material accounting error and a further audit is conducted.
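
The comparison step can be sketched independently of how the conditional expectations were produced. The relative tolerance of 5 percent, the account names, and the balances below are hypothetical. In practice the threshold would follow from the auditor's materiality judgment rather than a fixed constant.

```python
def flag_for_further_audit(expected, reported, tolerance):
    """Return accounts whose reported balance deviates from its
    conditional expectation by more than `tolerance` (a fraction
    of the expected balance)."""
    flagged = []
    for account, expectation in expected.items():
        deviation = abs(reported[account] - expectation)
        if deviation > tolerance * abs(expectation):
            flagged.append(account)
    return flagged

# Hypothetical conditional expectations and reported balances.
expected = {"receivables": 1000.0, "inventory": 500.0}
reported = {"receivables": 1020.0, "inventory": 650.0}

flagged = flag_for_further_audit(expected, reported, tolerance=0.05)

print(flagged)  # ['inventory']
```

Receivables differ from the expectation by 2 percent and are not examined further, while inventory differs by 30 percent and is selected for a further audit.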
Regression analysis methods are deployed in a similar way, except that the regression model used assumes the availability of only one independent variable. The materiality of each independent variable contributing to the audited account balances is determined using past account balances along with present structural data. Materiality is the importance of an independent variable in its relationship to the dependent variable, which in this case is the account balance. The most important independent variable is then used to create the conditional expectation and, as with the ARIMA method, the conditional expectation is compared to the reported account balance and a decision is made based on the closeness of the two balances.
The STAR methods operate using regression analysis and come in two variants. The first is the STAR monthly balance approach, in which the conditional expectations and the regression analysis are both tied to the single month being audited. The second is the STAR annual balance approach, which works on a larger scale by basing the conditional expectations and regression analysis on the year being audited. Apart from the length of the period being audited, both approaches operate in the same way, comparing expected and reported balances to determine which accounts to investigate further.