1 Hidden Markov Models
Alright, you are still with me!
1.0.1 The Weather Cave Example
Looking back at our weather example, let's pretend you live in a cave and cannot see the weather, but desperately want to know it. However, the intern who delivers your coffee every morning does know the weather. You know that if the weather is nice (sunny or cloudy), he wears sunglasses 75% of the time. If the weather is rainy or cloudy, he wears a hat 95% of the time. You could try to guess the probability of the weather outside the cave based on your intern's choice of headgear.
This is an example of a hidden Markov model: we try to observe something hidden through something visible. In this example, the hidden layer is the weather, and the observable (emitted) layer is the choice of sunglasses and/or hat.
Rabiner, 1989 summarizes the problems we want to answer for HMMs as "[1] evaluation of the probability (or likelihood) of a sequence of observations given a specific HMM; [2] the determination of a best sequence of model states; and [3] the adjustment of model parameters so as to best account for the observed signal."
We will attempt to tackle these questions, albeit in a slightly different order. Starting with the second listed problem (determining the best sequence of states that explains a series of observations) is, in my mind, the most practical.
Let's move on to a better example.
1.0.2 Occasionally Dishonest Casino
You are gambling at a casino, and you know that the croupier (dice guy) is a crooked, no-good scum who uses a pair of dice: one loaded, and one fair. The fair die has a 1/6 chance for each side, but the loaded die has a 1/2 chance of landing on a 6, and only a 1/10 chance for each other side.
In this game, the croupier throws one die at a time. You would like to know when he is using the fair die, and when he is using the loaded die.
Note: only the hidden layer has to be a Markov process. Often, you will see that the observable layer is definitely not Markovian.
1.0.3 Setup
1.0.3.1 Emission Probabilities
Let's encode the following data in R:
- fair die: 1/6 for each side
- loaded die: 1/2 for a 6, 1/10 otherwise
dice_probs <- matrix(c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6,
                       1/10, 1/10, 1/10, 1/10, 1/10, 1/2), nrow = 2, byrow = TRUE)
rownames(dice_probs) <- c("F", "L")
dice_probs
       [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
F 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
L 0.1000000 0.1000000 0.1000000 0.1000000 0.1000000 0.5000000
1.0.3.2 Transition Probabilities
Let's say that we have it from an insider at the casino that this croupier follows these rules/probabilities:
- .95 chance of staying with the fair die
- .9 chance of staying with the loaded die
- .05 chance of switching to the loaded die
- .1 chance of switching to the fair die
Apparently, it's pretty risky to change dice, so most of the time he stays with the die he is using. From this, we can make a transition matrix of the probabilities of going back and forth between the dice:
diceTPM <- matrix(c(.95, .05, .1, .9), nrow = 2, byrow = TRUE)
states <- c("F", "L")
rownames(diceTPM) <- states
colnames(diceTPM) <- states
diceTPM
     F    L
F 0.95 0.05
L 0.10 0.90
From this, we can also calculate π:
dicePi <- eigen(t(diceTPM))$vectors[,1] / sum(eigen(t(diceTPM))$vectors[,1])
names(dicePi) <- states
dicePi
        F         L 
0.6666667 0.3333333 
π = c(2/3,1/3)
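As a quick sanity check (this check is my own addition, not part of the original walkthrough), a stationary distribution should satisfy πA = π:
# pi times the TPM should give pi back
dicePi %*% diceTPM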
1.0.3.3 Problems to solve
So now we have an idea of how much time the croupier should be spending with each of the dice. What can we learn? The next bit will be divided up into the following sections:
- Finding the Probability of a Sequence given a Path and a Model: Product of Probabilities
- Finding the Total Probability of a Sequence given a Model: The Forward Algorithm
- Finding the Most Likely Path to Explain the Sequence
  - Maximum Likelihood Estimates
  - The Viterbi Algorithm
- Finding the Probability of Being in a Certain State at a Certain Observation: The Forward/Backward Algorithm
- Finding Model Parameters: The Baum-Welch Training Algorithm
1.1 Finding the Probability of a Sequence given a Path and a Model: product of probabilities
Going back to the original 3 questions posed [link], we will look at the first. Say we are interested in the probability that a particular sequence came from a certain path through the hidden layer. This is nothing more than the product of the likelihood of the path and the likelihood of the sequence of emissions.
Let's look at the following proposed sequence: 4, 5, 6, 2, 6, and 1 (which we will come back to later on), and the proposed path: F, F, L, L, L, and F. To calculate the probability of this sequence of dice rolls along the proposed path, we calculate the following product:
pattern_comb <- c(4, 5, 6, 2, 6, 1)
states_comb <- c("F", "F", "L", "L", "L", "F")
things_to_multiply <- 1
for (i in 1:length(pattern_comb)){
  # emission probability of this roll given the proposed state
  things_to_multiply <- things_to_multiply * dice_probs[states_comb[i], pattern_comb[i]]
  if (i != length(pattern_comb)){
    # transition probability from this state to the next state in the path
    things_to_multiply <- things_to_multiply * diceTPM[states_comb[i], states_comb[i + 1]]
  }
}
print(things_to_multiply)
[1] 4.453125e-07
So, the probability of this sequence along the proposed path is 4.453125e-07: the combination of the emission probabilities and the transitions between states that form the path.
To summarize, the probability of the sequence along a given series of hidden states (ρ) can be expressed by this formula:
\[ P(sequence, states) = \prod_{i=1}^{n} P(sequence_{i} \mid state_{i}) \prod_{i=1}^{n-1} TPM_{state_{i},\ state_{i+1}} \]
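The loop above can also be written more compactly (this compact version is my own restatement of the same calculation, not code from the original):
# one emission probability per roll, one transition probability per step
emis <- mapply(function(s, r) dice_probs[s, r], states_comb, pattern_comb)
trans <- mapply(function(a, b) diceTPM[a, b], head(states_comb, -1), tail(states_comb, -1))
prod(emis) * prod(trans)  # should again give 4.453125e-07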
1.2 Finding the Total Probability of a Sequence given a Model: The Forward Algorithm
We are now pretty comfortable with calculating the probability of a sequence given the hidden states. But what if we are not given the hidden states, only the observations? If that is the case, finding the probability for a single path through the data is not going to be enough. In this case, we will need to find the probabilities of every path through the hidden layer.
That's a lot of paths. Luckily, we have computers.
To get the probabilities of all possible paths, we will learn to implement what's called the Forward Algorithm. Let's look at our rolls as a table. We will add a "0th" (starting) column to make the calculations easier.
roll | 0 | 4 | 5 | 6 | 2 | 6 | 1 |
---|---|---|---|---|---|---|---|
f_F | .5 | | | | | | |
f_L | .5 | | | | | | |
What we will do here is, for each roll in the table, calculate the probability of getting the roll given the fair die (f_F) and given the loaded die (f_L), combined with the probability of reaching that state from the fair state and from the loaded state. In the table, you will notice that I have filled in the 0th probabilities as .5 and .5: we don't know which die he starts with, so we gave each an equal chance (though if you wanted, you could replace that with the stationary distribution π!).
1.2.1 roll 1
The way I think about it, each cell of the table will hold the sum, over each possible previous state, of the product of the following three numbers: the prior probability (from the previous cell in that state), the transition probability, and the emission probability. Let's work out an example.
The first roll is a 4. Let's work out f_F(1), starting with the contribution from a previous fair state. The prior probability in the fair state is .5. The probability of an F -> F transition, according to our TPM, is .95. The emission probability of rolling a 4 given a fair die is 1/6. Now we have our 3 numbers for the fair-to-fair case (which we will multiply together), but we still need the three under the assumption we were coming from the loaded state. The prior probability in the loaded state is .5. The probability of an L -> F transition, according to our TPM, is .1. The emission probability of rolling a 4 given a fair die is still 1/6.
Similarly, for f_L(1) (ending in the loaded state), coming from a loaded state the prior is .5, the L -> L transition is .9, and the emission probability is 1/10. Coming from a fair state, the prior is .5, the F -> L transition is .05, and the emission probability is still 1/10.
So, we can fill in our table with that:
(.5*.95*(1/6)) + (.5*.1*(1/6))
(.5*.9*(1/10)) + (.5*.05 * (1/10))
[1] 0.0875
[1] 0.0475
roll | 0 | 4 | 5 | 6 | 2 | 6 | 1 |
---|---|---|---|---|---|---|---|
f_F | .5 | (.5\*.95\*1/6) + (.5\*.1\*1/6) = .0875 | | | | | |
f_L | .5 | (.5\*.90\*1/10) + (.5\*.05\*1/10) = .0475 | | | | | |
1.2.2 Rest of the rolls
Cool! So let's go on and fill out the rest of the table with some code:
forward <- function(data = c(4,5,6,2,6,1), init_probs = c(.5, .5), TPM = diceTPM, emission = dice_probs){
  nstates <- ncol(TPM)
  resdf <- data.frame("dummy" = rep(NA, nstates), "roll_0" = init_probs)
  for (i in 1:length(data)){ # for each roll
    temp <- rep(0, nstates)
    for (j in 1:nstates){ # for each state f_j
      tempval = temp[j]
      for (k in 1:nstates){ # sum over all previous states leading to state j
        tempval <- sum(tempval, (resdf[k, ncol(resdf)] * TPM[k, j] * emission[j, data[i]]))
      }
      temp[j] <- tempval
    }
    resdf <- cbind(resdf, temp)
    colnames(resdf)[ncol(resdf)] <- paste0("roll_", i)
  }
  return(resdf[, 3:ncol(resdf)]) # don't return the dummy column or the roll 0 column
}
pattern <- c(4,5,6,2,6,1)
(forward_results <- forward())
  roll_1     roll_2      roll_3       roll_4       roll_5       roll_6
1 0.0875 0.01464583 0.002397465 0.0004210448 7.059538e-05 1.312151e-05
2 0.0475 0.00471250 0.002486771 0.0002357967 1.166346e-04 1.085009e-05
Great! That was a lot easier. So what is the answer: what is the probability of this data? Well, the very last step is to sum the final probabilities (f_F(6) and f_L(6)):
forward_results[, ncol(forward_results)]
sum(forward_results[, ncol(forward_results)])
[1] 1.312151e-05 1.085009e-05
[1] 2.397161e-05
The probability of this sequence given this model, summed across all paths, is 2.397161e-05.
"What does that actually mean?" you are probably asking. Well, as a comparison, let's get the probability of that sequence under a model where everything is equal, i.e. where the probability of every roll is just 1/6.
(1/6)^6
[1] 2.143347e-05
With that we get 2.143347e-05, which isn't too different from 2.397161e-05. If the data were definitely produced by the loaded die, we would probably see a bigger difference. Wait, let's check that:
(alt_forward_results <-forward(data=c(6,1,6,5,2,6,6,6,6,6,2,6)))
alt_forward_results[, ncol(alt_forward_results)]
sum(alt_forward_results[, ncol(alt_forward_results)])
   roll_1    roll_2      roll_3       roll_4       roll_5       roll_6
1  0.0875 0.0178125 0.003183854 0.0006751259 1.225517e-04 2.086938e-05
2  0.2375 0.0218125 0.010260938 0.0009394036 8.792196e-05 4.262867e-05
        roll_7       roll_8       roll_9      roll_10      roll_11      roll_12
1 4.014796e-06 9.640867e-07 3.021047e-07 1.154909e-07 4.885786e-08 1.049692e-08
2 1.970464e-05 8.967457e-06 4.059458e-06 1.834309e-06 1.656652e-07 7.577080e-08
[1] 1.049692e-08 7.577080e-08
[1] 8.626771e-08
In this case, you can see that the probability of the data under our model (8.626771e-08) is far higher than under a model where only the fair die could be used ((1/6)^12, roughly 4.6e-10).
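For reference, here is that fair-only comparison as code (my own addition):
# probability of these twelve specific rolls if only the fair die were ever used
(1/6)^12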
1.2.3 The Backward Algorithm
You may now be asking "well that's all fine, but what's the backward algorithm?".
To that I would say "get outside more! There is a whole world to explore that doesn't ever mention the backward algorithm, why would you even ask that?!?"
But here we are.
Well, it's almost the exact same thing but, you guessed it, backwards. You start from the end, and instead of accumulating the probability of everything leading up to a state, you accumulate the probability of everything that comes after it. Because there is nothing after the last observation, we initialize the cells for the last roll with a probability of 1. Other than that, go ahead and calculate the summed probabilities across all combinations, as you did for the forward algorithm.
Note that the total probability you get depends on the initial distribution you choose, not on the direction you run the algorithm: with a uniform start (.5/.5) we got 2.397161e-05 above, while starting from π we will get a slightly different number. As long as the forward run and the termination step of the backward run use the same initial distribution, the two algorithms should agree.
Here's a code example:
backward <- function(data = c(4,5,6,2,6,1), TPM = diceTPM, emission = dice_probs){
  nstates <- ncol(TPM)
  resdf <- data.frame("roll_last" = rep(1, nstates))
  for (i in (length(data)-1):1){ # for each roll, starting from the end
    temp <- rep(0, nstates)
    for (j in 1:nstates){ # for each state b_j
      tempval = temp[j]
      for (k in 1:nstates){ # for all states we could move to next
        # get the product of the previously computed cell, the transition, and the emission probs
        # this gets done for each state, and summed
        tempval <- sum(tempval, (resdf[k, 1] * TPM[j, k] * emission[k, data[i+1]]))
      }
      temp[j] <- tempval
    }
    resdf <- cbind(temp, resdf)
    colnames(resdf)[1] <- paste0("roll_", i)
  }
  return(resdf)
}
(backward_results <- backward(data=pattern))
print("probability of data according to forward")
(forward_results <-forward(init_probs = dicePi))
sum(forward_results[, ncol(forward_results)])
scaled_backwards_probability = 0
for(s in 1:length(states)){
scaled_backwards_probability = scaled_backwards_probability + (dicePi[s] * backward_results[s,1] * dice_probs[s,pattern[1]])
}
print("probability of data according to backward")
scaled_backwards_probability
      roll_1       roll_2      roll_3     roll_4    roll_5 roll_last
1 0.0001512844 0.0008813422 0.004770509 0.02852778 0.1633333         1
2 0.0002259836 0.0023477168 0.005040463 0.05072222 0.1066667         1
[1] "probability of data according to forward"
      roll_1      roll_2      roll_3       roll_4       roll_5       roll_6
F 0.11111111 0.018148148 0.002932716 0.0004985751 8.226601e-05 1.472942e-05
L 0.03333333 0.003555556 0.002053704 0.0001994969 1.022380e-04 9.612749e-06
[1] 2.434217e-05
[1] "probability of data according to backward"
           F 
2.434217e-05 
You'll notice that the last summing step is different. The Wikipedia article does a good job of explaining why, but the bottom line is that the backward values need to be scaled by the initial distribution and the first emission before they give the probability of the data.
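Written out (this formula is just my own summary of what the termination loop above does), with \(b_{1}(k)\) the backward value at the first roll:
\[ P(data) = \sum_{k \in \{F, L\}} \pi_{k} \, e_{k}(roll_{1}) \, b_{1}(k) \]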
1.3 Finding Probabilities at a Moment: The Forward/Backward Algorithm
Let's look at a different problem, focusing on just a slice of the scene. How we calculate that will depend on what we know: nothing at all, just the outcome, the outcome and the state, or the outcome plus the other outcomes leading up to it.
1.3.1 Not knowing anything
What is the most likely observation at roll 5? We know that all outcomes are equally likely if the fair die is rolled. But if the loaded die is rolled, the most likely roll is a 6. So we can say that for any point, without any other information, a roll of 6 is favored slightly.
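To put a number on that (this one-liner is my own addition, not in the original), the marginal distribution of a single roll under the stationary distribution π is:
# weight each die's emission probabilities by how often that die is in use
dicePi %*% dice_probs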
1.3.2 Just knowing the outcome
What is the probability that roll 6 is a 1? Well, it could either come from the fair die (1/6) or from the loaded die (1/10). That is just the sum of the two probabilities, each weighted by the probability of that die being used: 1/6(.5) + .1(.5), or .1333. To convince yourself, imagine two fair dice could be used; the probability would be 1/6(.5) + 1/6(.5), or .1666. This makes sense, because the probability of rolling a 1 with either of two fair dice shouldn't be any different from rolling it with one die (1/6, or .1666).
1.3.3 Knowing the outcome and the state
What is the probability that roll 6 is a 1 given a fair die? This is a simple conditional probability. If it is a fair die, then it is simply 1/6.
Another way of looking at this would be to treat it in a bayesian manner: \[ P( roll_6=1 | state_{6}= Fair ) = \frac{P(state_{6}=Fair| roll_{6}=1) * P(roll_{6}=1)}{ P(state_{6}=Fair)} \]
1.3.4 Knowing the Sequence, but not the states: The forward-backward
Lastly, what if we have a sequence, and want to know the probability of being in a particular state at a particular time? Later on, we will discuss how to get the most likely path, but what if we just want to know, for instance, the likelihood of the loaded die being used at the 4th roll given our series of observations? This is actually a pretty useful thing to do, so let's figure it out!
This is the algorithm whose implementation I am fuzziest about, but I will do my best. We need to find \( P(state_{4} = L \mid roll_{1} \dots roll_{n}) \). What the forward-backward algorithm allows us to do is to split this into two halves: everything happening up to and including the 4th roll, and everything happening afterward. The last thing that gets done is that we divide the product of those two bits by the probability of the entire sequence (according to either algorithm run on the whole dataset); that last division step is sometimes referred to as the smoothing step. In mathspeak, it is this:
\[ P(roll_{1} \dots roll_{n}, state_{4} = L) = P(roll_{1} \dots roll_{4}, state_{4} = L)\, P(roll_{5} \dots roll_{n} \mid roll_{1} \dots roll_{4}, state_{4} = L) \]
\[ = P(roll_{1} \dots roll_{4}, state_{4} = L)\, P(roll_{5} \dots roll_{n} \mid state_{4} = L) \]
which is really just
\[ P(state_{4} = L \mid roll_{1} \dots roll_{n}) = \frac{(forward_{4}(L))(backward_{4}(L))}{P(roll_{1} \dots roll_{n})} \]
If your spidey senses are tingling that this is going to take a lot of calculation, you were right! Luckily, we already wrote implementations of the forward and backward algorithms. The last thing we need to do is put it all together:
(res <- (forward_results[2,4] * backward_results[2, 4] )/scaled_backwards_probability)
F 0.4156954
So we have a 41% chance that the 4th roll was made with the loaded die. Cool stuff!
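The same smoothing step can be applied to every roll at once. This sketch is my own addition (not from the original post), and assumes the forward_results, backward_results, and scaled_backwards_probability objects from the chunks above are still in your workspace:
# posterior probability of each state at each roll; each column should sum to 1
posterior <- matrix(NA, nrow = length(states), ncol = length(pattern),
                    dimnames = list(states, paste0("roll_", seq_along(pattern))))
for (i in seq_along(pattern)){
  for (s in seq_along(states)){
    posterior[s, i] <- forward_results[s, i] * backward_results[s, i] / scaled_backwards_probability
  }
}
round(posterior, 3)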
Righto! But what if we wanted to find the best path?
1.4 Finding the Most Likely Path to Explain the Sequence
Let's say that the croupier throws the following pattern of dice rolls: \[ pattern = 6,6,6,6,5,6,6 \]
We want to know the most likely hidden pattern of states (fair or loaded) that he is using. In fact, we have a symbol, the Greek letter rho (ρ), which we use to refer to a path through the hidden states. ρ is the pattern of (hidden) states, and ρ_i would be the hidden state at the i^th position of the chain.
So how do we find ρ?
1.4.1 Maximum Likelihood Approach
Maximum likelihood is one of the rare things in math whose name explains just what it is: it is the scenario that maximizes the likelihood of a certain set of data. Loosely similar to Occam's Razor, if that's a clearer way of thinking about it.
A simple example would be the following: you have two friends: a kleptomaniac and a dentist. Say your wallet goes missing when they are over at your house. The scenario with the maximum likelihood of happening (ie, what probably happened) is that the kleptomaniac stole your wallet.
The mathy way of writing things is the following: \(\rho_{ML} = \operatorname{argmax}_{\rho} P(data \mid \rho)\) (argmax means that we pick whichever option maximizes the expression). The rho of maximum likelihood, or ρ_ML, is the path under which the data is most likely.
So what would ρ_ML be for this roll pattern? Well, we just need to look at each roll and choose the state whose emission probability is highest.
The first roll was a 6. The probability of rolling a 6 with a fair die (i.e. P(roll a 6 | fair)) is 1/6; it is no more likely than any other roll. But the probability of rolling a 6 with a loaded die (i.e. P(roll a 6 | loaded)) is 1/2. Clearly, the more likely "cause" of the roll of 6 is the loaded die.
The same is true for all the other 6's. But what about the 5? The probability of rolling a 5 with a fair die is still 1/6. The probability of rolling a 5 with a loaded die (i.e. P(roll a 5 | loaded)) is 1/10. In this case, the fair die most likely caused the 5. So what have we learned? For pattern = 6666566, ρ_ML is \(LLLLFLL\).
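Just to double-check that by code (this one-liner is my own addition):
# for each roll, pick whichever die gives that outcome the highest emission probability
ml_pattern <- c(6, 6, 6, 6, 5, 6, 6)
paste(apply(dice_probs[, ml_pattern], 2, function(p) states[which.max(p)]), collapse = "")  # "LLLLFLL"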
Done and dusted. However, we have ignored a key piece of information.
Remember how the croupier doesn't like switching dice? Here, in only 7 rolls, it would appear he has switched dice 2 times (loaded to fair, and then back to loaded). But if we look at the probabilities of that happening (.05 and .1, respectively), we see that this is really unlikely. Based on those probabilities, he should only switch roughly once every 10 to 20 rolls, not twice in 7!
The underlying problem with maximum likelihood estimates is that they don't consider whether the maximizing pattern of hidden states (ρ) is itself probable.
This would become even more obvious if we considered a series of rolls like 2, 5, 6, 2, 6, 1. The maximum likelihood approach would yield \(\rho_{ML} = F,F,L,F,L,F\). Again, remember how the croupier doesn't want to switch? This ρ doesn't seem to make sense.
So what can we do?
1.4.1.1 Bayes's Rule: A simple example
Bayesian statistics is all about dealing with joint probabilities. Let's say that in a bag of jelly beans, 10% are brown. Of those brown ones, 1/3 are root beer, 1/2 are chocolate, and 1/6 are licorice. Want to know the probability that a jelly bean is root beer flavored given that it is brown? You could calculate this the easy way: you are holding a brown jelly bean, and 1/3 of the brown ones are root beer, so you have a 33% chance of holding a root beer one.
Bayes's rule says that you can calculate it as \[P(H \mid data) = \frac{P(data \mid H)P(H)}{P(data)}\] or in our case:
\[P(rootbeer | brown) = \frac{P(brown | rootbeer)P(rootbeer)}{P(brown)} \]
This simplifies to:
\[P(rootbeer \mid brown) = \frac{1 \times (0.1 \times 0.33)}{0.1} \approx 0.33\]
This is a dumb example because all root beer jelly beans were brown, and you have a pretty clear idea of the distributions.
1.4.1.2 Bayes's Rule: a slightly more complex example
Let's add in a bit of uncertainty. You just remembered that after a tragic rhubarb accident, you became colorblind. You have only an 80% chance of identifying a brown jelly bean (True positive rate), and 15% of the time, you think that another color is brown (false positive).
This is a scenario where Bayes's rule starts to get useful. Without it, we can't just say "well, it was 33% before, and that times the true positive rate (80%) is about 25%." That doesn't take into account the false positive rate of 15%. So what do we do?
Let's look again at Bayes's rule:
\[P(rootbeer | brown) = \frac{P(brown | rootbeer)P(rootbeer)}{P(brown)} \]
In our case, the main thing that has changed is the bottom part: the probability of the data, i.e. the probability that you call the bean brown. Nothing else has changed. We now have to add up the chances of calling the bean brown across both scenarios: when it actually is brown and you correctly call it brown, and when it isn't brown but you mistakenly call it brown:
\[P(rootbeer \mid brown) = \frac{P(brown \mid rootbeer)P(rootbeer)}{P(brown,\ called\ correctly) + P(brown,\ called\ incorrectly)} \] or, with numbers: \[P(rootbeer \mid brown) = \frac{1 \times (0.33 \times 0.1)}{(0.1 \times 0.8) + (0.9 \times 0.15)} \]
(.33*.1) / ((.1*.8)+(.9*.15))
[1] 0.1534884
1.4.1.3 "Bayesian" Approach to ρ
"Cool!" you say, "but what about that casino guy?" Ah yes: how does Bayes's rule help us with the dishonest croupier? Well, we left off with maximum likelihood estimation, which maximizes \(P(data \mid \rho)\) and ignores how plausible each path \(\rho\) is on its own. What we really want to maximize is the conditional probability \(P(\rho \mid data)\), and that means we can apply Bayes's Rule to this formula! \[ \rho = \operatorname{argmax}_{\rho} P(\rho \mid data) = \operatorname{argmax}_{\rho} \frac{P(data \mid \rho)\,P(\rho)}{P(data)} \]
And here is where it gets interesting. What are \(P(data \mid \rho)\) and \(P(\rho)\) in this case? Their product is the joint probability of the rolls and one particular path through the possible states (fair or loaded), which is given by this formula: \[ P(data, \rho) = \pi_{\rho_{0}}\, e_{\rho_{0}}(data_{0}) \prod_{i=1}^{n} A_{\rho_{i-1},\,\rho_{i}} \, e_{\rho_{i}}(data_{i}) \] \(A\) is our TPM. \(e\) is our emission probabilities matrix.
What this is saying:
- \( \pi_{\rho_{0}}\, e_{\rho_{0}}(data_{0})\): start off with the first observation (\(data_{0}\)), and multiply the probability (from π) of starting in state \(\rho_{0}\) by the emission probability of that observation in that state.
- Then, \( \prod_{i=1}^{n} A_{\rho_{i-1},\,\rho_{i}}\, e_{\rho_{i}}(data_{i}) \) means that for each of the remaining observations \(data_{i} \dots data_{n}\), we do the same thing, except that instead of relying on π we use the transition matrix to get from the previous state to the current one, and multiply all the results together. (The first observation has no previous state, so we can't use a transition probability there; that is what π is for.)
It gets a bit messy, but the thing to know here is that you could go through, calculate the probability of every single possible path (ρ) for the data, and pick the most likely one. But this becomes very expensive computationally: with 2 states and n rolls there are 2^n paths.
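Just to make that concrete, here is a brute-force sketch of my own (not from the original post) that enumerates every path for our 6-roll sequence, using the objects defined earlier. It is fine for 2^6 = 64 paths, and hopeless for long sequences, which is exactly why we need something smarter:
# every possible path of Fs and Ls for the six rolls
all_paths <- expand.grid(rep(list(states), length(pattern)), stringsAsFactors = FALSE)
path_prob <- function(path){
  # joint probability of the rolls and this particular path
  p <- dicePi[path[1]] * dice_probs[path[1], pattern[1]]
  for (i in 2:length(pattern)){
    p <- p * diceTPM[path[i - 1], path[i]] * dice_probs[path[i], pattern[i]]
  }
  p
}
probs <- apply(all_paths, 1, path_prob)
all_paths[which.max(probs), ] # the single most probable path
max(probs)                    # and its joint probability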
1.4.2 Viterbi algorithm
wikipedia's actually useful example
We need a better way of looking through the possibilities to find the best path. We will start referring to this best path as ρ*.
Enter the Viterbi algorithm!
\[ V_{1,k} = P\big(e_{1} \mid k\big)\cdot \pi_{k} \] \[ V_{i,k} = \max_{x \in S}\left(P\big(e_{i} \mid k\big)\cdot a_{x,k}\cdot V_{i-1,x}\right) \]
Here, \(k\) is one of our states (fair or loaded), \(i\) is the position in the sequence, \(a\) is our TPM, and \(S\) is the set of states. It sorta makes sense, but there are too many letters up there for my liking; let's put some numbers on it and work out an example!
1.4.2.1 A Lengthy Viterbi Example
We observe the rolls 4, 5, 6, 2, 6, 1.
i | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
outcomes | 4 | 5 | 6 | 2 | 6 | 1 |
V_F | | | | | | |
V_L | | | | | | |
hidden | ? | ? | ? | ? | ? | ? |
- The first entries
How do we work out the first entries? We use the first formula above: for each state (fair or loaded), we multiply the π at that state by the emission probability of the outcome given that state. What was π again?
# pi
dicePi
# emission probabilities
dice_probs
        F         L 
0.6666667 0.3333333 
       [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
F 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
L 0.1000000 0.1000000 0.1000000 0.1000000 0.1000000 0.5000000
\[ V_F(1) = \pi_F \, e_F(4) = (2/3)(1/6) \] \[ V_L(1) = \pi_L \, e_L(4) = (1/3)(1/10) \]
(2/3)*(1/6)
(1/3)*(1/10)
[1] 0.1111111
[1] 0.03333333
Great! That was pretty straightforward. Now we can fill in those boxes.
i | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
roll | 4 | 5 | 6 | 2 | 6 | 1 |
V_F (from F) | .111 | | | | | |
V_F (from L) | | | | | | |
V_L (from F) | | | | | | |
V_L (from L) | .033 | | | | | |
hidden | ? | ? | ? | ? | ? | ? |

- The subsequent entries
Let's move on to the next \(i\). Now that we have actual values for the previous position, we will stop using π as a proxy and calculate using the transition matrix. Let's start with V_F. You will notice I added extra rows to the table: these record both candidate calculations, so we can take the max and strike out the other. For V_F, we have two candidates to calculate: the probability of going from the fair die to the fair die on the second roll, and the probability of going from loaded to fair. \[ V_F(2) = e_{F}(roll_2)\, \max\big(V_{F}(1)A_{FF},\ V_{L}(1)A_{LF}\big) \] Again, just to clarify, e_F(roll_2) is the probability of rolling the second outcome (a 5) with a fair die, V_F(1) and V_L(1) are the values we just calculated for the first roll, and A_FF and A_LF are the probabilities of going from fair to fair and from loaded to fair, respectively. Those are found in our TPM.
pattern
diceTPM
# if previous was fair
dice_probs["F", pattern[2]] * 0.111 * diceTPM["F","F"]
# or if loaded
dice_probs["F", pattern[2]] * 0.033 * diceTPM["L","F"]
[1] 4 5 6 2 6 1
     F    L
F 0.95 0.05
L 0.10 0.90
F 0.017575
F 0.00055
Alright, here the more likely option was coming from the fair die. So we will enter both values, but cross out the one that is less likely. How about if the die used for the second roll was the loaded one? Then we have to consider the opposite combinations: the probability of going from loaded to loaded, and from fair to loaded.
\[ V_L(2) = e_{L}(roll_2)\, \max\big(V_{F}(1)A_{FL},\ V_{L}(1)A_{LL}\big) \]
diceTPM
# if previous was fair
dice_probs["L", pattern[2]] * 0.111 * diceTPM["F","L"]
# or if previous was loaded
dice_probs["L", pattern[2]] * 0.033 * diceTPM["L","L"]
     F    L
F 0.95 0.05
L 0.10 0.90
L 0.000555
L 0.00297
Alright, so here the most likely transition was from loaded to loaded. This makes sense: if we look at the TPM, we can remember that there is a 90% chance of staying with the loaded die for the next roll (switching is risky). Let's update the table!
i | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
roll | 4 | 5 | 6 | 2 | 6 | 1 |
V_F (from F) | .111 | 0.017575 | | | | |
V_F (from L) | | ~~0.00055~~ | | | | |
V_L (from F) | | ~~0.000555~~ | | | | |
V_L (from L) | .033 | 0.00297 | | | | |
hidden | ? | ? | ? | ? | ? | ? |

Cool! Just for good measure, let's go through the probabilities for the rest of the rolls.
#V_F_3
dice_probs["F", pattern[3]] * 0.017575 * diceTPM["F","F"]
dice_probs["F", pattern[3]] * 0.00297 * diceTPM["L","F"]
#V_L_3
dice_probs["L", pattern[3]] * 0.017575 * diceTPM["F","L"]
dice_probs["L", pattern[3]] * 0.00297 * diceTPM["L","L"]
F 0.002782708
F 4.95e-05
L 0.000439375
L 0.0013365
And for the 4th
#V_F_4
dice_probs["F", pattern[4]] * 0.002782708 * diceTPM["F","F"]
dice_probs["F", pattern[4]] * 0.0013365 * diceTPM["L","F"]
#V_L_4
dice_probs["L", pattern[4]] * 0.002782708 * diceTPM["F","L"]
dice_probs["L", pattern[4]] * 0.0013365 * diceTPM["L","L"]
F 0.0004405954
F 2.2275e-05
L 1.391354e-05
L 0.000120285
And let's update it again:
i | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
roll | 4 | 5 | 6 | 2 | 6 | 1 |
V_F (from F) | .111 | 0.017575 | 0.002782708 | 0.0004405954 | | |
V_F (from L) | | ~~0.00055~~ | ~~4.95e-05~~ | ~~2.2275e-05~~ | | |
V_L (from F) | | ~~0.000555~~ | ~~0.000439375~~ | ~~1.391354e-05~~ | | |
V_L (from L) | .033 | 0.00297 | 0.0013365 | 0.000120285 | | |
hidden | ? | ? | ? | ? | ? | ? |

And for the 5th:
#V_F_5
dice_probs["F", pattern[5]] * 0.0004405954 * diceTPM["F","F"]
dice_probs["F", pattern[5]] * 0.000120285 * diceTPM["L","F"]
#V_L_5
dice_probs["L", pattern[5]] * 0.0004405954 * diceTPM["F","L"]
dice_probs["L", pattern[5]] * 0.000120285 * diceTPM["L","L"]
F 6.976094e-05
F 2.00475e-06
L 1.101489e-05
L 5.412825e-05
And for the 6th
#V_F_6
dice_probs["F", pattern[6]] * 6.976094e-05 * diceTPM["F","F"]
dice_probs["F", pattern[6]] * 5.412825e-05 * diceTPM["L","F"]
#V_L_6
dice_probs["L", pattern[6]] * 6.976094e-05 * diceTPM["F","L"]
dice_probs["L", pattern[6]] * 5.412825e-05 * diceTPM["L","L"]
F 1.104548e-05
F 9.021375e-07
L 3.488047e-07
L 4.871543e-06
And let's fill out the rest of the table!
i | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
roll | 4 | 5 | 6 | 2 | 6 | 1 |
V_F (from F) | .111 | 0.017575 | 0.002782708 | 0.0004405954 | 6.976094e-05 | 1.104548e-05 |
V_F (from L) | | ~~0.00055~~ | ~~4.95e-05~~ | ~~2.2275e-05~~ | ~~2.00475e-06~~ | ~~9.021375e-07~~ |
V_L (from F) | | ~~0.000555~~ | ~~0.000439375~~ | ~~1.391354e-05~~ | ~~1.101489e-05~~ | ~~3.488047e-07~~ |
V_L (from L) | .033 | 0.00297 | 0.0013365 | 0.000120285 | 5.412825e-05 | 4.871543e-06 |
hidden | ? | ? | ? | ? | ? | ? |

Wow, that was a lot of work! We are almost done! Looking at the last roll, we see that the larger of the two final values (V_F or V_L) is V_F, meaning that the fair die is most likely responsible for the final roll. Now, the last thing we need to do is to trace back through the most likely path that leads to this ending state. The reason that we save both candidate values at each step is so that we can tell which transition led to the maximum. Looking back, we can see that all of the winning transitions up to this point were F -> F transitions, so ρ* here is FFFFFF. What if the croupier threw the die three more times, and got a 6 each time?
#V_F_7
dice_probs["F", 6] * 1.104548e-05 * diceTPM["F","F"]
dice_probs["F", 6] * 4.871543e-06 * diceTPM["L","F"]
#V_L_7
dice_probs["L", 6] * 1.104548e-05 * diceTPM["F","L"]
dice_probs["L", 6] * 4.871543e-06 * diceTPM["L","L"]
F 1.748868e-06
F 8.119238e-08
L 2.76137e-07
L 2.192194e-06
#V_F_8
dice_probs["F", 6] * 1.748868e-06 * diceTPM["F","F"]
dice_probs["F", 6] * 2.192194e-06 * diceTPM["L","F"]
#V_L_8
dice_probs["L", 6] * 1.748868e-06 * diceTPM["F","L"]
dice_probs["L", 6] * 2.192194e-06 * diceTPM["L","L"]
F 2.769041e-07
F 3.653657e-08
L 4.37217e-08
L 9.864873e-07
#V_F_9
dice_probs["F", 6] * 2.769041e-07 * diceTPM["F","F"]
dice_probs["F", 6] * 9.864873e-07 * diceTPM["L","F"]
#V_L_9
dice_probs["L", 6] * 2.769041e-07 * diceTPM["F","L"]
dice_probs["L", 6] * 9.864873e-07 * diceTPM["L","L"]
F 4.384315e-08
F 1.644146e-08
L 6.922603e-09
L 4.439193e-07
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
roll | 4 | 5 | 6 | 2 | 6 | 1 | 6 | 6 | 6 |
V_F (from F) | .111 | 0.017575 | 0.002782708 | 0.0004405954 | 6.976094e-05 | 1.104548e-05 | 1.748868e-06 | 2.769041e-07 | 4.384315e-08 |
V_F (from L) | | ~~0.00055~~ | ~~4.95e-05~~ | ~~2.2275e-05~~ | ~~2.00475e-06~~ | ~~9.021375e-07~~ | ~~8.119238e-08~~ | ~~3.653657e-08~~ | ~~1.644146e-08~~ |
V_L (from F) | | ~~0.000555~~ | ~~0.000439375~~ | ~~1.391354e-05~~ | ~~1.101489e-05~~ | ~~3.488047e-07~~ | ~~2.76137e-07~~ | ~~4.37217e-08~~ | ~~6.922603e-09~~ |
V_L (from L) | .033 | 0.00297 | 0.0013365 | 0.000120285 | 5.412825e-05 | 4.871543e-06 | 2.192194e-06 | 9.864873e-07 | 4.439193e-07 |
hidden | ? | ? | ? | ? | ? | ? | ? | ? | ? |

So here, something interesting has happened. Now, the last V_L is the more likely possibility. We then trace back the values "leading" to this final state, and we see that in every case the maximized value came from the loaded state. This means that this new outcome, 456261666, most likely came from using just the loaded die (ρ* is LLLLLLLLL). This reinforces what we saw in the transition probability matrix: switches are very unlikely.
Hopefully, this has shown you how to work out the most likely hidden states underlying some sort of data.
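If you would rather let the computer do all of that bookkeeping, here is a minimal Viterbi sketch of my own (not code from the original post), assuming the diceTPM, dice_probs, dicePi, and states objects defined earlier:
viterbi <- function(data, TPM = diceTPM, emission = dice_probs, init = dicePi){
  n <- length(data)
  k <- nrow(TPM)
  V   <- matrix(NA, k, n, dimnames = list(rownames(TPM), paste0("roll_", 1:n)))
  ptr <- matrix(NA, k, n)                  # back-pointers: which previous state won
  V[, 1] <- init * emission[, data[1]]     # V_1,k = pi_k * e_k(roll_1)
  for (i in 2:n){
    for (j in 1:k){
      cand <- V[, i - 1] * TPM[, j]        # candidates coming from each previous state
      ptr[j, i] <- which.max(cand)
      V[j, i]   <- emission[j, data[i]] * max(cand)
    }
  }
  # trace back from the best final state
  path <- integer(n)
  path[n] <- which.max(V[, n])
  for (i in n:2) path[i - 1] <- ptr[path[i], i]
  list(path = rownames(TPM)[path], V = V)
}
viterbi(c(4, 5, 6, 2, 6, 1))$path          # should come out all "F", as worked out above
viterbi(c(4, 5, 6, 2, 6, 1, 6, 6, 6))$path # should come out all "L"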
1.4.2.2 Recap
- The Viterbi algorithm works out ρ*, the single most probable path of hidden states given the observations
- This maximizes the posterior probability of the path, P(ρ|data)
- Viterbi gives the best path, but even the best path may not be particularly probable on its own
- The forward algorithm gives the total probability of the data
- Together, they give us an idea of what's going on
1.4.2.3 Wait, what if we don't know where we start?
Suppose we want a bit more control over the starting state of a sequence. We can do something very simple: add another state to our TPM; let's call it β, or "begin". Now we can make a TPM that looks like this:
modTPM <- matrix(c(
0,.8,.2,
0,.95,.05,
0,.1,.9), 3, byrow=T)
rownames(modTPM)<-c("begin", "fair", "loaded")
colnames(modTPM)<-c("begin", "fair", "loaded")
modTPM
       begin fair loaded
begin      0 0.80   0.20
fair       0 0.95   0.05
loaded     0 0.10   0.90
This gives us a way to account for some prior information about starting a chain (say, we know that 80% of the time, this croupier starts with a fair die). We could use a similar additional state if we knew something about the end state.
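To see the begin state in action, here is a small simulation sketch of my own (not from the post): it generates rolls from the full model, starting in the "begin" state and then following modTPM, so roughly 80% of simulated games open with the fair die.
set.seed(1)
simulate_rolls <- function(n, TPM = modTPM, emission = dice_probs){
  state <- "begin"
  hidden <- character(n); rolls <- integer(n)
  for (i in 1:n){
    state <- sample(colnames(TPM), 1, prob = TPM[state, ])   # move to the next state
    die <- ifelse(state == "fair", "F", "L")                  # map to the emission matrix rows
    hidden[i] <- die
    rolls[i] <- sample(1:6, 1, prob = emission[die, ])        # roll that die
  }
  data.frame(roll = rolls, hidden = hidden)
}
head(simulate_rolls(20))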
1.5 Training a Model
The last thing to figure out, how to train models, deserves its own post.