Inductive probability
Inductive probability attempts to give the probability of future events based on past events. It is the basis for inductive reasoning and gives a mathematical foundation for learning and the perception of patterns. It is a source of knowledge about the world.
There are three sources of knowledge: inference, communication, and deduction. Communication relays information found using other methods. Deduction establishes new facts based on existing facts. Inference establishes new facts from data. Its basis is Bayes' theorem.
Information describing the world is written in a language. For example, a simple mathematical language of propositions may be chosen. Sentences may be written down in this language as strings of characters. But in the computer it is possible to encode these sentences as strings of bits. Then the language may be encoded so that the most commonly used sentences are the shortest. This internal language implicitly represents probabilities of statements.
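As a rough sketch of this idea, the following Python snippet (with invented sentences and code lengths) shows how assigning a sentence a code of k bits implicitly assigns it a probability of 2^-k, and how, conversely, an optimal code gives a sentence of probability p a length of about -log2(p) bits.

```python
import math

# A toy "internal language": each sentence is assigned a code length in bits.
# Under an optimal (Shannon) code a sentence of probability p gets a code of
# roughly -log2(p) bits, so code lengths and probabilities are two views of
# the same thing. The sentences and lengths below are invented for illustration.
code_length_bits = {
    "the sun rose today": 2,         # very common, so very short
    "it rained in the desert": 7,
    "a coin landed on its edge": 15,
}

def implied_probability(length_bits: float) -> float:
    """Probability implicitly assigned to a sentence by its code length."""
    return 2.0 ** -length_bits

for sentence, length in code_length_bits.items():
    print(f"{sentence!r}: {length} bits -> p = 2^-{length} = {implied_probability(length):.6f}")

# Going the other way, the optimal code length for probability p is -log2(p).
print(math.log2(1 / implied_probability(7)))   # recovers 7.0
```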
Occam's razor says the "simplest theory consistent with the data is most likely to be correct". The "simplest theory" is interpreted as the representation of the theory written in this internal language. The theory with the shortest encoding in this internal language is most likely to be correct.
History
Probability and statistics were focused on probability distributions and tests of significance. Probability was formal, well defined, but limited in scope. In particular, its application was limited to situations that could be defined as an experiment or trial, with a well defined population. Bayes' theorem is named after Rev. Thomas Bayes (1701–1761). Bayesian inference broadened the application of probability to many situations where a population was not well defined. But Bayes' theorem always depended on prior probabilities to generate new probabilities. It was unclear where these prior probabilities should come from.
Ray Solomonoff developed algorithmic probability circa 1964, giving an explanation of what randomness is and how patterns in the data may be represented by computer programs that give shorter representations of the data.
Chris Wallace and D. M. Boulton developed minimum message length circa 1968. Later Jorma Rissanen developed minimum description length circa 1978. These methods relate information theory to probability in a way that can be compared to the application of Bayes' theorem, but which gives a source and explanation for the role of prior probabilities.
Marcus Hutter combined decision theory with the work of Ray Solomonoff and Andrey Kolmogorov to give a theory for the Pareto optimal behavior of an intelligent agent, circa 1998.
Minimum description/message length
The program with the shortest length that matches the data is the most likely to predict future data. This is the thesis behind the minimum message length and minimum description length methods. At first sight Bayes' theorem appears different from the minimum message/description length principle. On closer inspection it turns out to be the same. Bayes' theorem is about conditional probabilities, and states the probability that event B happens if first event A happens:

$P(A \land B) = P(B \mid A)\,P(A)$
which becomes, in terms of message length $L$,

$L(A \land B) = L(B \mid A) + L(A).$
This means that if all the information is given describing an event then the length of the information may be used to give the raw probability of the event. So if the information describing the occurrence of A is given, along with the information describing B given A, then all the information describing A and B has been given.
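A minimal numerical check of this relationship, using illustrative probabilities (not taken from the article) and message lengths defined as $-\log_2$ of the probability:

```python
import math

def L(p: float) -> float:
    """Message length in bits implied by a probability p (Shannon code length)."""
    return -math.log2(p)

# Illustrative probabilities, chosen only for the example:
P_A = 0.25          # probability of event A
P_B_given_A = 0.5   # probability of B once A is known

# Product rule: P(A and B) = P(B|A) * P(A)
P_A_and_B = P_B_given_A * P_A

# The same rule in terms of message lengths: lengths add where probabilities multiply.
assert math.isclose(L(P_A_and_B), L(P_B_given_A) + L(P_A))
print(L(P_A), L(P_B_given_A), L(P_A_and_B))   # 2.0, 1.0, 3.0 bits
```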
Overfitting
Overfitting occurs when the model matches the random noise and not the pattern in the data. For example, take the situation where a curve is fitted to a set of points. If a polynomial with many terms is fitted then it can more closely represent the data. Then the fit will be better, and the information needed to describe the deviations from the fitted curve will be smaller. Smaller information length means higher probability. However, the information needed to describe the curve must also be considered. The total information for a curve with many terms may be greater than for a curve with fewer terms that has a worse fit but needs less information to describe the polynomial.
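The following Python sketch illustrates this trade-off for polynomial curve fitting. The cost per coefficient and the residual coding term are crude assumptions (a rough two-part message length, not a full MDL treatment), but the comparison between polynomial degrees shows the effect described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a quadratic curve (unknown to the model).
x = np.linspace(-1, 1, 40)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

BITS_PER_COEFF = 16  # assumed cost of describing one polynomial coefficient

def total_description_length(degree: int) -> float:
    """Crude two-part message length: model bits + residual bits.

    The residual term uses a Gaussian code length n/2 * log2(RSS/n),
    dropping constants; only the comparison between degrees matters here.
    """
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rss = float(np.sum(residuals**2))
    model_bits = (degree + 1) * BITS_PER_COEFF
    data_bits = 0.5 * x.size * np.log2(rss / x.size)
    return model_bits + data_bits

for degree in (1, 2, 5, 9):
    print(degree, round(total_description_length(degree), 1))
# Higher-degree fits shrink the residual bits, but the extra coefficients
# eventually cost more than they save, so a moderate degree wins.
```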
Inference based on program complexity
Inference based on program complexity is also a form of inductive inference. A bit string x is observed. Then consider all programs that generate strings starting with x. Cast in the form of inductive inference, the programs are theories that imply the observation of the bit string x. The method used here to give probabilities for inductive inference is based on Solomonoff's theory of inductive inference.
Detecting patterns in the data
If all the bits are 1, then people infer that there is a bias in the coin and that the next bit is also more likely to be 1. This is described as learning from, or detecting a pattern in, the data. Such a pattern may be represented by a computer program. A short computer program may be written that produces a series of bits which are all 1. If the length of the program $K$ is $k$ bits then its prior probability is

$P(K) = 2^{-k}.$
The length of the shortest program that represents the string of bits is called the Kolmogorov complexity.
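A toy illustration of the $2^{-k}$ prior follows. The "programs" here are just Python source strings, and the length in bits is simply 8 bits per character; real Kolmogorov complexity is defined relative to a fixed universal machine and is not computable, so this is only a sketch of the idea that patterned strings have short programs and hence much higher priors.

```python
import random

random.seed(0)

n = 1000
# A highly patterned string and a random-looking one of the same length.
patterned = "1" * n
incompressible = "".join(random.choice("01") for _ in range(n))

# Toy "programs" (Python source strings) that print each string.
program_patterned = f"print('1' * {n})"
program_incompressible = f"print({incompressible!r})"   # must quote the data in full

for name, prog in [("patterned", program_patterned),
                   ("incompressible", program_incompressible)]:
    k = len(prog) * 8          # assumed 8 bits per source character
    print(f"{name}: program length k = {k} bits, so prior = 2^-{k}")
# The patterned string has a far shorter program and therefore a far higher prior.
```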
Kolmogorov complexity is not computable. This is related to the halting problem. When searching for the shortest program some programs may go into an infinite loop.
Considering all theories
The Greek philosopher Epicurus is quoted as saying "If more than one theory is consistent with the observations, keep all theories". As in a crime novel, where all theories must be considered in determining the likely murderer, so with inductive probability all programs must be considered in determining the likely future bits arising from the stream of bits.
Suppose the observed bit string x has length n. Programs that are already longer than n have no predictive power. The raw probability that the pattern of bits is random is $2^{-n}$.
Each program that produces the sequence of bits, but is shorter than n, is a theory/pattern about the bits with a probability of $2^{-k}$, where k is the length of the program.
The probability of receiving a sequence of bits y after receiving a series of bits x is then the conditional probability of receiving y given x, which is the probability of x with y appended, divided by the probability of x.
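The following Python sketch shows this prediction rule on a tiny scale. The "programs" are a few hand-picked generators with assumed lengths in bits; a real treatment would enumerate programs for a universal machine (which is not computable), so this only illustrates how the $2^{-k}$ weights combine into a conditional probability for the next bits.

```python
from itertools import islice

def all_ones():
    while True:
        yield "1"

def alternating():
    while True:
        yield "1"
        yield "0"

def mostly_ones():  # period-3 pattern 110110...
    while True:
        yield "1"
        yield "1"
        yield "0"

# (generator, assumed program length k in bits) -- the lengths are invented.
PROGRAMS = [
    (all_ones, 10),
    (alternating, 14),
    (mostly_ones, 16),
]

def prefix(gen, n):
    """First n bits produced by a generator program."""
    return "".join(islice(gen(), n))

def weight(s):
    """Sum of 2^-k over programs whose output starts with the string s."""
    return sum(2.0 ** -k for gen, k in PROGRAMS if prefix(gen, len(s)) == s)

def predict(x, y):
    """P(y | x) = P(x followed by y) / P(x), summed over the toy programs."""
    return weight(x + y) / weight(x)

x = "11"                 # observed bits
print(predict(x, "1"))   # probability the next bit is 1 (close to 1)
print(predict(x, "0"))   # probability the next bit is 0 (small but nonzero)
```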
Universal priors
The programming language affects the predictions of the next bit in the string. The language acts as a prior probability. This is particularly a problem where the programming language codes for numbers and other data types. Intuitively we think that 0 and 1 are simple numbers, and that prime numbers are somehow more complex than numbers that may be composite. Using the Kolmogorov complexity gives an unbiased estimate of the prior probability of a number. As a thought experiment an intelligent agent may be fitted with a data input device giving a series of numbers, after applying some transformation function to the raw numbers. Another agent might have the same input device with a different transformation function. The agents do not see or know about these transformation functions. Then there appears no rational basis for preferring one function over another. A universal prior ensures that although two agents may have different initial probability distributions for the data input, the difference will be bounded by a constant.
So universal priors do not eliminate an initial bias, but they reduce and limit it. Whenever we describe an event in a language, whether a natural language or another, the language has encoded in it our prior expectations. So some reliance on prior probabilities is inevitable.
A problem arises where an intelligent agent's prior expectations interact with the environment to form a self-reinforcing feedback loop. This is the problem of bias or prejudice. Universal priors reduce but do not eliminate this problem.
Universal artificial intelligence
The theory of universal artificial intelligence applies decision theory to inductive probabilities. The theory shows how the best actions to optimize a reward function may be chosen. The result is a theoretical model of intelligence. It is a fundamental theory of intelligence, which optimizes the agent's behavior in:
- Exploring the environment; performing actions to get responses that broaden the agent's knowledge.
- Competing or co-operating with another agent; games.
- Balancing short and long term rewards.
At present the theory is limited by incomputability. Approximations may be used to avoid this. Processing speed and combinatorial explosion remain the primary limiting factors for artificial intelligence.
Probability
Probability is the representation of uncertain or partial knowledge about the truth of statements. Probabilities are subjective and personal estimates of likely outcomes based on past experience and inferences made from the data. This description of probability may seem strange at first. In natural language we refer to "the probability" that the sun will rise tomorrow. We do not refer to "your probability" that the sun will rise. But in order for inference to be correctly modeled, probability must be personal, and the act of inference generates new posterior probabilities from prior probabilities.
Probabilities are personal because they are conditional on the knowledge of the individual. Probabilities are subjective because they always depend, to some extent, on prior probabilities assigned by the individual. Subjective should not be taken here to mean vague or undefined.
The term intelligent agent is used to refer to the holder of the probabilities. The intelligent agent may be a human or a machine. If the intelligent agent does not interact with the environment then the probability will converge over time to the frequency of the event.
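A minimal sketch of this convergence, assuming a simple Bernoulli source and a Beta prior (the true frequency, prior parameters, and sample sizes below are illustrative assumptions, not from the article):

```python
import random

random.seed(1)

# An agent passively observes a biased coin and updates a Beta(a, b) belief
# about the probability of heads. With no interaction, the posterior mean
# approaches the long-run frequency of the event.
true_frequency = 0.7
a, b = 1.0, 1.0        # uniform prior: no initial opinion about the bias

for n in range(1, 10001):
    heads = random.random() < true_frequency
    if heads:
        a += 1
    else:
        b += 1
    if n in (10, 100, 1000, 10000):
        print(f"after {n:5d} observations, P(heads) ≈ {a / (a + b):.3f}")
# The estimate drifts toward the true frequency 0.7 as observations accumulate.
```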
If however the agent uses the probability to interact with the environment, there may be feedback, so that two agents in an identical environment, starting with only slightly different priors, end up with completely different probabilities. In this case optimal decision theory, as in Marcus Hutter's Universal Artificial Intelligence, will give Pareto optimal performance for the agent. This means that no other intelligent agent could do better in one environment without doing worse in another environment.