Computational Techniques in Theoretical Physics
Section 11
Importance Sampling

Why importance sampling?

In the Monte Carlo integration discussed in the previous lectures, it is assumed that a random number X distributed with probability P(X) can be generated.

In most cases, this is not easy to do directly.

Example:

Multi-dimensional distribution: P(x1,...,xk)

The inverse of the distribution function

F(x1,...,xk) = int_{-inf}^{x1} ... int_{-inf}^{xk} P(x1',...,xk') dx1'...dxk'

is difficult to determine!

Solution:
• Generate P(x1,...,xk) starting from a distribution P0(x1,...,xk) that is easy to sample.
• The transition from P0(x1,...,xk) to P(x1,...,xk) can be regarded as a multistep evolution, P0 => P1 => ... => Pn = P, and such an evolution can be described by the theory of Markov chains.
• The equilibrium distribution is reached for systems satisfying the ergodic condition.

Markov chain

Definition:

A Markov chain is a mathematical model of a random walk in the space of X.

Remember that X = (x1, x2, ..., xk) is a point in a k-dimensional space. We will sometimes refer to X as a state of the system.

Generation of a Markov chain:

We start from some initial point X0; the next point X1 is generated according to a certain transition probability. In this way a sequence of random points {X0, X1, ..., XN} is generated.

These points are required to appear with probability P(X). The rule for generating the next point Xi+1 from the current point Xi is the transition probability:

P(Xi+1 | Xi) = W(Xi to Xi+1)
where W(Xi to Xi+1) is called the transition probability from state Xi to state Xi+1.

P(Xi+1 | Xi) is the probability that the system is in state Xi+1 given that it was in state Xi; that is, it is a conditional probability.

W(Xi to Xi+1) satisfies:

W(Xi to Xi+1) >= 0
sum_{Xi+1} W(Xi to Xi+1) = 1

This is because W is a probability, and thus satisfies the usual constraints of a probability: each probability is nonnegative, and the total probability sums to one.
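For a finite state space these two constraints say that W, viewed as a matrix, has nonnegative entries and rows that sum to one (a stochastic matrix). A minimal check in Python, using a made-up 2x2 matrix purely for illustration:

```python
import numpy as np

# Hypothetical 2-state transition matrix (numbers invented for this sketch):
# entry W[i, j] is the probability of moving from state i to state j.
W = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# W must be a stochastic matrix: nonnegative entries, rows summing to 1.
assert np.all(W >= 0.0)
assert np.allclose(W.sum(axis=1), 1.0)
```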

Markov chain for generating probability P(X):

The method of importance sampling is to choose the transition probability W in such a way that the equilibrium distribution of the chain is the desired distribution P(X).

Let P0(X) be some probability distribution, the initial distribution of the states. The distribution after one step is then:

P1(X) = sum_{X'} P0(X') W(X' to X)

In general we have

Pk(X) = sum_{X'} Pk-1(X') W(X' to X)

The theory of stochastic processes tells us that, under certain conditions (discussed below), the limit for large k exists and is unique:

lim_{k to infinity} Pk(X) = P(X)

and P(X) satisfies the equation:

P(X) = sum_{X'} P(X') W(X' to X)

We say that P(X) is invariant under the transition of W, or P(X) is the equilibrium distribution for the given transition probability.
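The evolution Pk = P_{k-1} W can be carried out numerically as a repeated vector-matrix product. A small sketch with an invented 2-state chain (the numbers have no physical meaning):

```python
import numpy as np

# Illustrative 2-state chain (the matrix is made up for this sketch).
W = np.array([[0.9, 0.1],
              [0.5, 0.5]])
P = np.array([1.0, 0.0])     # initial distribution P0: all weight on state 0

# Pk(X) = sum_X' P_{k-1}(X') W(X' -> X) is a row-vector / matrix product.
for k in range(200):
    P = P @ W

# The limit distribution is invariant under W:  P = P W.
assert np.allclose(P, P @ W)
```

After enough steps P no longer changes: it has converged to the equilibrium distribution of this particular W.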
Intuitive understanding:
To get a better, intuitive understanding of the above equations, you may think of the probability as the population of a collection of particles.

Imagine that the particles can be in a number of (quantum-mechanical) states denoted by X. Initially, at time 0, the particles are distributed according to P0(X). Each particle has a tendency to go to other states: if a particle is at state X', the probability that it goes to state X is W(X' to X). Thus the number of particles (population times transition probability):

P0(X') W(X' to X)
going to state X.

After a unit time, the population at state X, P1(X), is the sum of the numbers of particles going from every state X' to state X. These transitions eventually make the population at every state steady---we have reached equilibrium.

Simulation framework
In a computer simulation, typically only one state X is stored in memory. Each transition changes this state. The distribution of the state after k steps is again given by Pk(X).

By this we mean that if we repeat the same simulation many times, starting with X0 distributed according to P0(X), the outcomes after k steps will be distributed according to Pk(X).
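This interpretation can be tested directly: repeat the walk many times and histogram the states after k steps. A sketch, again with an invented 2-state matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state chain (this matrix is made up for the sketch):
W = np.array([[0.9, 0.1],
              [0.5, 0.5]])
n_runs, k = 100_000, 10

# Repeat the simulation n_runs times, every run starting in state 0,
# i.e. P0 = (1, 0).  After k steps the states should follow Pk = P0 W^k.
states = np.zeros(n_runs, dtype=int)
for _ in range(k):
    # move every run one step: go to state 1 with probability W[state, 1]
    states = (rng.random(n_runs) < W[states, 1]).astype(int)

Pk = np.array([1.0, 0.0]) @ np.linalg.matrix_power(W, k)
empirical = np.bincount(states, minlength=2) / n_runs
```

With 100,000 independent runs the empirical frequencies agree with Pk to within the statistical noise of about 1/sqrt(n_runs).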

Accuracy of the method

The sequence of points Xi generated by the random walk is correlated, since the next point is generated based on the knowledge of the previous point.

This is in contrast with the simple sampling method, e.g., the inverse distribution function method, where the sequence of points is uncorrelated. Each point is generated afresh.

The penalty for this correlation is that the statistical error is larger than the estimate sigma/sqrt(N), which is valid only when the points are uncorrelated.
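One can see this effect in a toy experiment. The sketch below uses an AR(1) sequence x_t = rho*x_{t-1} + g_t as a stand-in for correlated Markov-chain output (rho, N, and the number of repetitions are arbitrary choices for this illustration); the observed scatter of the sample means comes out several times larger than the naive sigma/sqrt(N):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, N, n_rep = 0.9, 1000, 300

# Generate many independent correlated chains and record each sample mean.
means = np.empty(n_rep)
for r in range(n_rep):
    g = rng.normal(size=N)
    x = np.empty(N)
    x[0] = g[0] / np.sqrt(1 - rho**2)      # start in the stationary state
    for t in range(1, N):
        x[t] = rho * x[t - 1] + g[t]       # correlated sequence
    means[r] = x.mean()

sigma = 1.0 / np.sqrt(1 - rho**2)   # stationary standard deviation of x_t
naive = sigma / np.sqrt(N)          # error formula for uncorrelated points
actual = means.std()                # observed scatter of the sample means
```

For this choice of rho the true error exceeds the naive estimate by roughly a factor sqrt((1+rho)/(1-rho)).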

Ergodicity:
Condition for the existence and uniqueness of the equilibrium distribution
We mentioned the existence and uniqueness of the equilibrium distribution, but glossed over the condition for it. The condition is the following:

Starting from any state X, it is possible to get to any other state X' after a finite number of transitions. Such a Markov chain is called ergodic (or regular, irreducible).

We can view the transition probability W as a matrix. Let W^n(X, X') be the matrix element of the n-th power of W. The ergodic condition means that for some value of n every matrix element W^n(X, X') is nonzero.

Choice of transition probability

How do we choose W so as to obtain a prescribed P? It is sufficient that W satisfies the following detailed balance condition:

P(X') W(X' to X) = P(X) W(X to X')
where P is the known distribution. If W satisfies this equation and it is ergodic, then the Markov chain will generate X according to the probability distribution P(X).

It is possible that W does not satisfy the detailed balance condition but still generates the distribution P. That's why we say that the above condition is a sufficient condition.

Proof:
Let us prove that the distribution is invariant under a transition W satisfying the detailed balance condition. Summing the detailed balance condition over the variable X' and using the normalization condition for W, we have:

sum_{X'} P(X') W(X' to X)
  = sum_{X'} P(X) W(X to X')
  = P(X) sum_{X'} W(X to X') = P(X)
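The detailed balance condition and the invariance it implies are easy to verify numerically. As a sanity check, the sketch below uses the transition matrix of the two-pot example discussed next, together with the equilibrium distribution derived there:

```python
import numpy as np

# Equilibrium distribution and transition matrix of the two-pot example.
P = np.array([0.1, 0.6, 0.3])
W = np.array([[0.0, 1.0, 0.0],
              [1/6, 1/2, 1/3],
              [0.0, 2/3, 1/3]])

# Detailed balance: P(X') W(X' -> X) == P(X) W(X -> X') for every pair,
# i.e. the matrix of "probability flows" is symmetric.
balance = P[:, None] * W
assert np.allclose(balance, balance.T)

# It follows that P is invariant: sum_X' P(X') W(X' -> X) = P(X).
assert np.allclose(P @ W, P)
```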
Example: Consider two pots, A and B, with three red balls and two white balls distributed between them so that A always holds two balls and B always holds three. Balls of the same color are indistinguishable, so there are three different configurations, corresponding to three states X1, X2, X3: in X1 pot A holds two white balls, in X2 one red and one white ball, and in X3 two red balls.

We obtain transitions between these three configurations by picking a ball out of A at random and one out of B at random and interchanging them.

Inspection shows that the transition from state i to state j occurs with probability W(i to j):

W(1 to 1) = W(1 to 3) = 0, W(1 to 2) = 1

W(2 to 1) = 1/6, W(2 to 2) = 1/2,
W(2 to 3) = 1/3

W(3 to 1)=0, W(3 to 2)=2/3, W(3 to 3) = 1/3
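These probabilities can be reproduced by brute-force enumeration of the equally likely picks. In the sketch below a state is represented by the sorted contents of pot A (a representation chosen just for this illustration):

```python
from collections import Counter
from fractions import Fraction

# All five balls; pot B always holds whatever pot A does not.
ALL = ['R'] * 3 + ['W'] * 2

def pot_B(a):
    b = ALL.copy()
    for ball in a:
        b.remove(ball)
    return b

def transition_row(a):
    """Probabilities W(a -> new_a) for one pick-and-swap move."""
    b = pot_B(a)
    row = Counter()
    for i in range(len(a)):              # index of the ball drawn from pot A
        for y in b:                      # ball drawn from pot B
            # swap: remove ball i from A, put the ball from B in its place
            new_a = tuple(sorted(a[:i] + a[i + 1:] + [y]))
            row[new_a] += Fraction(1, len(a) * len(b))
    return row

# State 2: pot A holds one red and one white ball.
row = transition_row(['R', 'W'])
```

The row returned for state 2 reproduces W(2 to 1) = 1/6, W(2 to 2) = 1/2, W(2 to 3) = 1/3 exactly.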

The transition probabilities in matrix form are then:

        0     1     0
W  =    1/6   1/2   1/3
        0     2/3   1/3

Since

         1/6    1/2     1/3
W^2 =    1/12   23/36   10/36
         1/9    5/9     1/3

has all elements nonzero, the transition matrix is regular (hence ergodic), and we expect to find a unique stationary probability distribution Pi, independent of the initial state.
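The matrix W^2 and the positivity of its entries can be confirmed numerically:

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.0],
              [1/6, 1/2, 1/3],
              [0.0, 2/3, 1/3]])

W2 = np.linalg.matrix_power(W, 2)
# Every entry of W^2 is strictly positive: any configuration can be
# reached from any other in two steps, so the chain is regular.
assert np.all(W2 > 0)
```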

The probabilities Pi satisfy the equation (an eigenvalue problem with eigenvalue 1):

Pi = sum_{j=1 to 3} Pj W(j to i)
In matrix notation, we have

P = P W
or

W^T P^T = P^T
where P = (P1, P2, P3) and T stands for transpose.

Together with the normalization condition:

P1 +P2 +P3 = 1
the solution to this set of linear equations is:

P1 = 1/10, P2 = 6/10, P3=3/10
This is the probability distribution after many transitions. And we have lost all knowledge of the initial state.
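The same stationary distribution is obtained numerically by solving the eigenvalue problem W^T P^T = P^T, i.e. finding the left eigenvector of W with eigenvalue 1 and normalizing it:

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.0],
              [1/6, 1/2, 1/3],
              [0.0, 2/3, 1/3]])

# P solves W^T P^T = P^T: a left eigenvector of W with eigenvalue 1.
vals, vecs = np.linalg.eig(W.T)
i = np.argmin(np.abs(vals - 1.0))   # pick the eigenvalue closest to 1
P = np.real(vecs[:, i])
P = P / P.sum()                     # impose P1 + P2 + P3 = 1
```

The normalized eigenvector is (0.1, 0.6, 0.3), in agreement with the solution above.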

We can say this in another way. Let us assume that, initially, the system is in the first configuration. We are certain of this. The initial state (our knowledge) is

P01 = 1, P02 = P03 = 0

Now we perform k transitions (k large). What configuration is the system in? We do not know! The configurations are distributed according to Pi and will remain that way. There is a 10% chance that the system is in configuration 1, a 60% chance that it is in configuration 2, and a 30% chance that it is in configuration 3. If we could perform the experiment many times, that is, start from configuration 1, make k transitions, and then look at the configuration, we would find that 10% of the time it ends up in configuration 1, 60% of the time in configuration 2, and 30% of the time in configuration 3. But for any single experiment, we can never be certain of the outcome.

We can see how fast the system converges to the equilibrium probability distribution. Note that if P0 = (1, 0, 0),

P1 = P0 W = (0, 1, 0)

P2 = P1 W = P0 W^2 = (1/6, 1/2, 1/3)

P3 = P2 W = P0 W^3 = (1/12, 23/36, 5/18)
   ~ (0.083, 0.639, 0.278)

P4 = P3 W = P0 W^4 = (23/216, 127/216, 11/36)
   ~ (0.106, 0.588, 0.306)

P6 = P0 W^6 ~ (0.1007, 0.5986, 0.3007)

We see that after six transitions, the probabilities are already very close to their limiting values.
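Finally, a single long random walk through the three configurations should visit them with frequencies close to (0.1, 0.6, 0.3). A sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

W = np.array([[0.0, 1.0, 0.0],
              [1/6, 1/2, 1/3],
              [0.0, 2/3, 1/3]])

# One long random walk: the next state is drawn from the row of W
# belonging to the current state.
n_steps = 50_000
counts = np.zeros(3)
state = 0                       # start in configuration 1, P0 = (1, 0, 0)
for _ in range(n_steps):
    state = rng.choice(3, p=W[state])
    counts[state] += 1

freq = counts / n_steps
```

This is exactly the situation of an actual simulation: a single correlated chain whose visit frequencies converge to the equilibrium distribution.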

Homework