1 Hidden Markov Models
Alright, you are still with me!
1.0.1 The Weather Cave Example
Looking back at our weather example, let's pretend you live in a cave and cannot see the weather, but desperately want to know it. However, the intern who delivers your coffee every morning does know the weather. You know that if the weather is nice (sunny or cloudy), he wears sunglasses 75% of the time. If the weather is rainy or cloudy, he wears a hat 95% of the time. You could try to guess the probability of the weather outside the cave based on your intern's choice of headgear.
This is an example of a hidden Markov model: we try to observe something hidden through something visible. In this example, the hidden layer is the weather, and the observable (emitted) layer is the choice of sunglasses and/or hat.
Rabiner, 1989 summarizes the problems we want to answer for HMMs as "[1] evaluation of the probability (or likelihood) of a sequence of observations given a specific HMM; [2] the determination of a best sequence of model states; and [3] the adjustment of model parameters so as to best account for the observed signal."
We will attempt to tackle these questions, albeit in a slightly different order. Starting with the second listed problem (determining the best sequence of states that explains a series of observations) is, in my mind, the most practical.
Let's move on to a better example.
1.0.2 Occasionally Dishonest Casino
You are gambling at a casino, and you know that the croupier (dice guy) is a crooked, no-good scum who uses a pair of dice: one loaded, and one fair. The fair die has a 1/6 chance for each side, but the loaded die has a 1/2 chance of landing on a 6, and only a 1/10 chance for each other side.
In this game, the croupier throws one die at a time. You would like to know when he is using the fair die, and when he is using the loaded die.
Note: only the hidden layer has to be a Markov process. Often, you will see that the observable layer is definitely not Markovian.
1.0.3 Setup
1.0.3.1 Emission Probabilities
Let's encode the following data in R:
- fair die: 1/6 for each side
- loaded die: 1/2 for a 6, 1/10 otherwise
dice_probs <- matrix(c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6,
                       1/10, 1/10, 1/10, 1/10, 1/10, 1/2), nrow = 2, byrow = TRUE)
rownames(dice_probs) <- c("F", "L")
dice_probs
       [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
F 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
L 0.1000000 0.1000000 0.1000000 0.1000000 0.1000000 0.5000000
1.0.3.2 Transition Probabilities
Let's say that we have it from an insider at the casino that this croupier follows these rules/probabilities:
- .95 chance of staying with the fair die
- .9 chance of staying with the loaded die
- .05 chance of switching to the loaded die
- .1 chance of switching to the fair die
Apparently, it's pretty risky to change dice, so most of the time he stays with the die he is using. From this, we can make a transition matrix of the probabilities of going back and forth between the dice:
diceTPM <- matrix(c(.95, .05, .1, .9), nrow = 2, byrow = TRUE)
states <- c("F", "L")
rownames(diceTPM) <- states
colnames(diceTPM) <- states
diceTPM
     F    L
F 0.95 0.05
L 0.10 0.90
From this, we can also calculate π:
dicePi <- eigen(t(diceTPM))$vectors[,1] / sum(eigen(t(diceTPM))$vectors[,1])
names(dicePi) <- states
dicePi
        F         L 
0.6666667 0.3333333 
π = c(2/3,1/3)
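As a quick sanity check (this check is my own addition, not part of the original walkthrough), a stationary distribution should satisfy πA = π:
# pi times the TPM should give pi back
dicePi %*% diceTPM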
1.0.3.3 Problems to solve
So now we have an idea of how much time the croupier should be spending with each of the dice. What can we learn? The next bit will be divided up into the following sections:
- Finding the Probability of a Sequence given a Path and a Model: Product of Probabilities
- Finding the Total Probability of a Sequence given a Model: The Forward Algorithm
- Finding the Most Likely Path to Explain the Sequence
  - Maximum Likelihood Estimates
  - The Viterbi Algorithm
- Finding the Probability of Being in a Certain State at a Certain Observation: The Forward/Backward Algorithm
- Finding Model Parameters: The Baum-Welch Training Algorithm
1.1 Finding the Probability of a Sequence given a Path and a Model: product of probabilities
Going back to the original 3 questions posed [link], we will look at the first. Say we are interested in the probability that a particular sequence came from a certain path through the hidden layer. This is nothing more than the product of the likelihood of the path and the likelihood of the sequence of emissions.
Let's look at the following proposed sequence: 4, 5, 6, 2, 6, and 1 (which we will come back to later on), and the proposed path: F, F, L, L, L, and F. To calculate the probability of this sequence of dice rolls along the proposed path, we calculate the following product:
pattern_comb <- c(4, 5, 6, 2, 6, 1)
states_comb <- c("F", "F", "L", "L", "L", "F")
things_to_multiply <- 1
for (i in 1:length(pattern_comb)){
  # emission probability of this roll given the proposed state
  things_to_multiply <- things_to_multiply * dice_probs[states_comb[i], pattern_comb[i]]
  if (i != length(pattern_comb)){
    # transition probability from this state to the next state in the path
    things_to_multiply <- things_to_multiply * diceTPM[states_comb[i], states_comb[i + 1]]
  }
}
print(things_to_multiply)
[1] 4.453125e-07
So, the probability of this sequence along the proposed path is 4.453125e-07: the combination of the emission probabilities and the transitions between states that form the path.
To summarize, the probability of the sequence along a given series of hidden states (ρ) can be expressed by this formula:
\[ P(sequence, states) = \prod_{i=1}^{n} P(sequence_{i} \mid state_{i}) \prod_{i=1}^{n-1} TPM_{state_{i},\ state_{i+1}} \]
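The loop above can also be written more compactly (this compact version is my own restatement of the same calculation, not code from the original):
# one emission probability per roll, one transition probability per step
emis <- mapply(function(s, r) dice_probs[s, r], states_comb, pattern_comb)
trans <- mapply(function(a, b) diceTPM[a, b], head(states_comb, -1), tail(states_comb, -1))
prod(emis) * prod(trans)  # should again give 4.453125e-07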
1.2 Finding the Total Probability of a Sequence given a Model: The Forward Algorithm
We are now pretty comfortable with calculating the probability of a sequence given the hidden states. But what if we are not given the hidden states, only the observations? If that is the case, finding the probability for a single path through the data is not going to be enough. In this case, we will need to find the probabilities of every path through the hidden layer.
That's a lot of paths. Luckily, we have computers.
To get the probabilities of all possible paths, we will learn to implement what's called the Forward Algorithm. Let's look at our rolls as a table. We will add a "0th" (starting) column to make the calculations easier.
roll | 0 | 4 | 5 | 6 | 2 | 6 | 1 |
---|---|---|---|---|---|---|---|
f_F | .5 | | | | | | |
f_L | .5 | | | | | | |
What we will do here is, for each roll in the table, calculate the probability of getting the roll given the fair die (f_F) and given the loaded die (f_L), combined with the probability of reaching that state from the fair state and from the loaded state. In the table, you will notice that I have filled in the 0th probabilities as .5 and .5: we don't know which die he starts with, so we gave each an equal chance (though if you wanted, you could replace that with the stationary distribution π!).
1.2.1 roll 1
The way I think about it, each cell of the table will hold the sum, over each possible previous state, of the product of the following three numbers: the prior probability (from the previous cell in that state), the transition probability, and the emission probability. Let's work out an example.
The first roll is a 4. Let's work out f_F(1), starting with the contribution from a previous fair state. The prior probability in the fair state is .5. The probability of an F -> F transition, according to our TPM, is .95. The emission probability of rolling a 4 given a fair die is 1/6. Now we have our 3 numbers for the fair-to-fair case (which we will multiply together), but we still need the three under the assumption we were coming from the loaded state. The prior probability in the loaded state is .5. The probability of an L -> F transition, according to our TPM, is .1. The emission probability of rolling a 4 given a fair die is still 1/6.
Similarly, for f_L(1) (ending in the loaded state), coming from a loaded state the prior is .5, the L -> L transition is .9, and the emission probability is 1/10. Coming from a fair state, the prior is .5, the F -> L transition is .05, and the emission probability is still 1/10.
So, we can fill in our table with that:
(.5*.95*(1/6)) + (.5*.1*(1/6))
(.5*.9*(1/10)) + (.5*.05 * (1/10))
[1] 0.0875
[1] 0.0475
roll | 0 | 4 | 5 | 6 | 2 | 6 | 1 |
---|---|---|---|---|---|---|---|
f_F | .5 | (.5\*.95\*1/6) + (.5\*.1\*1/6) = .0875 | | | | | |
f_L | .5 | (.5\*.90\*1/10) + (.5\*.05\*1/10) = .0475 | | | | | |
1.2.2 Rest of the rolls
Cool! So let's go on and fill out the rest of the table with some code:
forward <- function(data = c(4,5,6,2,6,1), init_probs = c(.5, .5), TPM = diceTPM, emission = dice_probs){
  nstates <- ncol(TPM)
  resdf <- data.frame("dummy" = rep(NA, nstates), "roll_0" = init_probs)
  for (i in 1:length(data)){ # for each roll
    temp <- rep(0, nstates)
    for (j in 1:nstates){ # for each state f_j
      tempval = temp[j]
      for (k in 1:nstates){ # sum over all previous states leading to state j
        tempval <- sum(tempval, (resdf[k, ncol(resdf)] * TPM[k, j] * emission[j, data[i]]))
      }
      temp[j] <- tempval
    }
    resdf <- cbind(resdf, temp)
    colnames(resdf)[ncol(resdf)] <- paste0("roll_", i)
  }
  return(resdf[, 3:ncol(resdf)]) # don't return the dummy column or the roll 0 column
}
pattern <- c(4,5,6,2,6,1)
(forward_results <- forward())
  roll_1     roll_2      roll_3       roll_4       roll_5       roll_6
1 0.0875 0.01464583 0.002397465 0.0004210448 7.059538e-05 1.312151e-05
2 0.0475 0.00471250 0.002486771 0.0002357967 1.166346e-04 1.085009e-05
Great! That was a lot easier. So what is the answer: what is the probability of this data? Well, the very last step is to sum the final probabilities (f_F(6) and f_L(6)):
forward_results[, ncol(forward_results)]
sum(forward_results[, ncol(forward_results)])
[1] 1.312151e-05 1.085009e-05
[1] 2.397161e-05
The probability of this sequence given this model, summed across all paths, is 2.397161e-05.
"What does that actually mean?" you are probably asking. Well, as a comparison, let's get the probability of that sequence under a model where everything is equal, i.e. where the probability of every roll is just 1/6.
(1/6)^6
[1] 2.143347e-05
With that we get 2.143347e-05, which isn't too different from 2.397161e-05. If the data were definitely produced by the loaded die, we would probably see a bigger difference. Wait, let's check that:
(alt_forward_results <-forward(data=c(6,1,6,5,2,6,6,6,6,6,2,6)))
alt_forward_results[, ncol(alt_forward_results)]
sum(alt_forward_results[, ncol(alt_forward_results)])
   roll_1    roll_2      roll_3       roll_4       roll_5       roll_6
1  0.0875 0.0178125 0.003183854 0.0006751259 1.225517e-04 2.086938e-05
2  0.2375 0.0218125 0.010260938 0.0009394036 8.792196e-05 4.262867e-05
        roll_7       roll_8       roll_9      roll_10      roll_11      roll_12
1 4.014796e-06 9.640867e-07 3.021047e-07 1.154909e-07 4.885786e-08 1.049692e-08
2 1.970464e-05 8.967457e-06 4.059458e-06 1.834309e-06 1.656652e-07 7.577080e-08
[1] 1.049692e-08 7.577080e-08
[1] 8.626771e-08
In this case, you can see that the probability of the data under our model (8.626771e-08) is far higher than under a model where only the fair die could be used ((1/6)^12, roughly 4.6e-10).
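For reference, here is that fair-only comparison as code (my own addition):
# probability of these twelve specific rolls if only the fair die were ever used
(1/6)^12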
1.2.3 The Backward Algorithm
You may now be asking "well that's all fine, but what's the backward algorithm?".
To that I would say "get outside more! There is a whole world to explore that doesn't ever mention the backward algorithm, why would you even ask that?!?"
But here we are.
Well, it's almost the exact same thing but, you guessed it, backwards. You start from the end, and instead of accumulating the probability of everything leading up to a state, you accumulate the probability of everything that comes after it. Because there is nothing after the last observation, we initialize the cells for the last roll with a probability of 1. Other than that, go ahead and calculate the summed probabilities across all combinations, as you did for the forward algorithm.
Note that the total probability you get depends on the initial distribution you choose, not on the direction you run the algorithm: with a uniform start (.5/.5) we got 2.397161e-05 above, while starting from π we will get a slightly different number. As long as the forward run and the termination step of the backward run use the same initial distribution, the two algorithms should agree.
Here's a code example:
backward <- function(data = c(4,5,6,2,6,1), TPM = diceTPM, emission = dice_probs){
  nstates <- ncol(TPM)
  resdf <- data.frame("roll_last" = rep(1, nstates))
  for (i in (length(data)-1):1){ # for each roll, starting from the end
    temp <- rep(0, nstates)
    for (j in 1:nstates){ # for each state b_j
      tempval = temp[j]
      for (k in 1:nstates){ # for all states we could move to next
        # get the product of the previously computed cell, the transition, and the emission probs
        # this gets done for each state, and summed
        tempval <- sum(tempval, (resdf[k, 1] * TPM[j, k] * emission[k, data[i+1]]))
      }
      temp[j] <- tempval
    }
    resdf <- cbind(temp, resdf)
    colnames(resdf)[1] <- paste0("roll_", i)
  }
  return(resdf)
}
(backward_results <- backward(data=pattern))
print("probability of data according to forward")
(forward_results <-forward(init_probs = dicePi))
sum(forward_results[, ncol(forward_results)])
scaled_backwards_probability = 0
for(s in 1:length(states)){
scaled_backwards_probability = scaled_backwards_probability + (dicePi[s] * backward_results[s,1] * dice_probs[s,pattern[1]])
}
print("probability of data according to backward")
scaled_backwards_probability
      roll_1       roll_2      roll_3     roll_4    roll_5 roll_last
1 0.0001512844 0.0008813422 0.004770509 0.02852778 0.1633333         1
2 0.0002259836 0.0023477168 0.005040463 0.05072222 0.1066667         1
[1] "probability of data according to forward"
      roll_1      roll_2      roll_3       roll_4       roll_5       roll_6
F 0.11111111 0.018148148 0.002932716 0.0004985751 8.226601e-05 1.472942e-05
L 0.03333333 0.003555556 0.002053704 0.0001994969 1.022380e-04 9.612749e-06
[1] 2.434217e-05
[1] "probability of data according to backward"
           F 
2.434217e-05 
You'll notice that the last summing step is different. The Wikipedia article does a good job of explaining why, but the bottom line is that the backward values need to be scaled by the initial distribution and the first emission before they give the probability of the data.
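Written out (this formula is just my own summary of what the termination loop above does), with \(b_{1}(k)\) the backward value at the first roll:
\[ P(data) = \sum_{k \in \{F, L\}} \pi_{k} \, e_{k}(roll_{1}) \, b_{1}(k) \]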
1.3 Finding Probabilities at a Moment: The Forward/Backward Algorithm
Let's look at a different problem, focusing on just a slice of the scene. How we calculate that will depend on what we know: nothing at all, just the outcome, the outcome and the state, or the outcome plus the other outcomes leading up to it.
1.3.1 Not knowing anything
What is the most likely observation at roll 5? We know that all outcomes are equally likely if the fair die is rolled. But if the loaded die is rolled, the most likely roll is a 6. So we can say that for any point, without any other information, a roll of 6 is favored slightly.
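To put a number on that (this one-liner is my own addition, not in the original), the marginal distribution of a single roll under the stationary distribution π is:
# weight each die's emission probabilities by how often that die is in use
dicePi %*% dice_probs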
1.3.2 Just knowing the outcome
What is the probability that roll 6 is a 1? Well, it could either come from the fair die (1/6) or from the loaded die (1/10). That is just the sum of the two probabilities, each weighted by the probability of that die being used: 1/6(.5) + .1(.5), or .1333. To convince yourself, imagine two fair dice could be used; the probability would be 1/6(.5) + 1/6(.5), or .1666. This makes sense, because the probability of rolling a 1 with either of two fair dice shouldn't be any different from rolling it with one die (1/6, or .1666).
1.3.3 Knowing the outcome and the state
What is the probability that roll 6 is a 1 given a fair die? This is a simple conditional probability. If it is a fair die, then it is simply 1/6.
Another way of looking at this would be to treat it in a bayesian manner: \[ P( roll_6=1 | state_{6}= Fair ) = \frac{P(state_{6}=Fair| roll_{6}=1) * P(roll_{6}=1)}{ P(state_{6}=Fair)} \]
1.3.4 Knowing the Sequence, but not the states: The forward-backward
Lastly, what if we have a sequence, and want to know the probability of being in a particular state at a particular time? Later on, we will discuss how to get the most likely path, but what if we just want to know, for instance, the likelihood of the loaded die being used at the 4th roll given our series of observations? This is actually a pretty useful thing to do, so let's figure it out!
This is the algorithm whose implementation I am fuzziest about, but I will do my best. We need to find \( P(state_{4} = L \mid roll_{1} \dots roll_{n}) \). What the forward-backward algorithm allows us to do is to split this into two halves: everything happening up to and including the 4th roll, and everything happening afterward. The last thing that gets done is that we divide the product of those two bits by the probability of the entire sequence (according to either algorithm run on the whole dataset); that last division step is sometimes referred to as the smoothing step. In mathspeak, it is this:
\[ P(roll_{1} \dots roll_{n}, state_{4} = L) = P(roll_{1} \dots roll_{4}, state_{4} = L)\, P(roll_{5} \dots roll_{n} \mid roll_{1} \dots roll_{4}, state_{4} = L) \]
\[ = P(roll_{1} \dots roll_{4}, state_{4} = L)\, P(roll_{5} \dots roll_{n} \mid state_{4} = L) \]
which is really just
\[ P(state_{4} = L \mid roll_{1} \dots roll_{n}) = \frac{(forward_{4}(L))(backward_{4}(L))}{P(roll_{1} \dots roll_{n})} \]
If your spidey senses are tingling that this is going to take a lot of calculation, you were right! Luckily, we already wrote implementations of the forward and backward algorithms. The last thing we need to do is put it all together:
(res <- (forward_results[2,4] * backward_results[2, 4] )/scaled_backwards_probability)
F 0.4156954
So we have a 41% chance that the 4th roll was made with the loaded die. Cool stuff!
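The same smoothing step can be applied to every roll at once. This sketch is my own addition (not from the original post), and assumes the forward_results, backward_results, and scaled_backwards_probability objects from the chunks above are still in your workspace:
# posterior probability of each state at each roll; each column should sum to 1
posterior <- matrix(NA, nrow = length(states), ncol = length(pattern),
                    dimnames = list(states, paste0("roll_", seq_along(pattern))))
for (i in seq_along(pattern)){
  for (s in seq_along(states)){
    posterior[s, i] <- forward_results[s, i] * backward_results[s, i] / scaled_backwards_probability
  }
}
round(posterior, 3)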
Righto! But what if we wanted to find the best path?
1.4 Finding the Most Likely Path to Explain the Sequence
Let's say that the croupier throws the following pattern of dice rolls: \[ pattern = 6,6,6,6,5,6,6 \]
We want to know the most likely hidden pattern of states (fair or loaded) that he is using. In fact, we have a symbol, the Greek letter rho (ρ), which we use to refer to a path through the hidden states. ρ is the pattern of (hidden) states, and ρ_i would be the hidden state at the i^th position of the chain.
So how do we find ρ?
1.4.1 Maximum Likelihood Approach
Maximum likelihood is one of the rare things in math whose name explains just what it is: it is the scenario that maximizes the likelihood of a certain set of data. Loosely similar to Occam's Razor, if that's a clearer way of thinking about it.
A simple example would be the following: you have two friends: a kleptomaniac and a dentist. Say your wallet goes missing when they are over at your house. The scenario with the maximum likelihood of happening (ie, what probably happened) is that the kleptomaniac stole your wallet.
The mathy way of writing things is the following: \(\rho_{ML} = \operatorname{argmax}_{\rho} P(data \mid \rho)\) (argmax means that we pick whichever option maximizes the expression). The rho of maximum likelihood, or ρ_ML, is the path under which the data is most likely.
So what would ρ_ML be for this roll pattern? Well, we just need to look at each roll and choose the state whose emission probability is highest.
The first roll was a 6. The probability of rolling a 6 with a fair die (i.e. P(roll a 6 | fair)) is 1/6; it is no more likely than any other roll. But the probability of rolling a 6 with a loaded die (i.e. P(roll a 6 | loaded)) is 1/2. Clearly, the more likely "cause" of the roll of 6 is the loaded die.
The same is true for all the other 6's. But what about the 5? The probability of rolling a 5 with a fair die is still 1/6. The probability of rolling a 5 with a loaded die (i.e. P(roll a 5 | loaded)) is 1/10. In this case, the fair die most likely caused the 5. So what have we learned? For pattern = 6666566, ρ_ML is \(LLLLFLL\).
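Just to double-check that by code (this one-liner is my own addition):
# for each roll, pick whichever die gives that outcome the highest emission probability
ml_pattern <- c(6, 6, 6, 6, 5, 6, 6)
paste(apply(dice_probs[, ml_pattern], 2, function(p) states[which.max(p)]), collapse = "")  # "LLLLFLL"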
Done and dusted. However, we have ignored a key piece of information.
Remember how the croupier doesn't like switching dice? Here, in only 7 rolls, it would appear he has switched dice 2 times (loaded to fair, and then back to loaded). But if we look at the probabilities of that happening (.05 and .1, respectively), we see that this is really unlikely. Based on those probabilities, he should only switch roughly once every 10 to 20 rolls, not twice in 7!
The underlying problem with maximum likelihood estimates is that they don't consider whether the maximizing pattern of hidden states (ρ) is itself probable.
This would become even more obvious if we considered a series of rolls like 2, 5, 6, 2, 6, 1. The maximum likelihood approach would yield \(\rho_{ML} = F,F,L,F,L,F\). Again, remember how the croupier doesn't want to switch? This ρ doesn't seem to make sense.
So what can we do?
1.4.1.1 Bayes's Rule: A simple example
Bayesian statistics is all about dealing with joint probabilities. Let's say that in a bag of jelly beans, 10% are brown. Of those brown ones, 1/3 are root beer, 1/2 are chocolate, and 1/6 are licorice. Want to know the probability that a jelly bean is root beer flavored given that it is brown? You could calculate this the easy way: you are holding a brown jelly bean, and 1/3 of the brown ones are root beer, so you have a 33% chance of holding a root beer one.
Bayes's rule says that you can calculate it as \[P(H \mid data) = \frac{P(data \mid H)P(H)}{P(data)}\] or in our case:
\[P(rootbeer | brown) = \frac{P(brown | rootbeer)P(rootbeer)}{P(brown)} \]
This simplifies to:
\[P(rootbeer \mid brown) = \frac{1 \times (0.1 \times 0.33)}{0.1} \approx 0.33\]
This is a dumb example because all root beer jelly beans were brown, and you have a pretty clear idea of the distributions.
1.4.1.2 Bayes's Rule: a slightly more complex example
Let's add in a bit of uncertainty. You just remembered that after a tragic rhubarb accident, you became colorblind. You have only an 80% chance of identifying a brown jelly bean (True positive rate), and 15% of the time, you think that another color is brown (false positive).
This is a scenario where Bayes's rule starts to get useful. Without it, we can't just say "well, it was 33% before, and that times the true positive rate (80%) is about 25%." That doesn't take into account the false positive rate of 15%. So what do we do?
Let's look again at Bayes's rule:
\[P(rootbeer | brown) = \frac{P(brown | rootbeer)P(rootbeer)}{P(brown)} \]
In our case, the main thing that has changed is the bottom part: the probability of the data, i.e. the probability that you call the bean brown. Nothing else has changed. We now have to add up the chances of calling the bean brown across both scenarios: when it actually is brown and you correctly call it brown, and when it isn't brown but you mistakenly call it brown:
\[P(rootbeer \mid brown) = \frac{P(brown \mid rootbeer)P(rootbeer)}{P(brown,\ called\ correctly) + P(brown,\ called\ incorrectly)} \] or, with numbers: \[P(rootbeer \mid brown) = \frac{1 \times (0.33 \times 0.1)}{(0.1 \times 0.8) + (0.9 \times 0.15)} \]
(.33*.1) / ((.1*.8)+(.9*.15))
[1] 0.1534884
1.4.1.3 "Bayesian" Approach to ρ
"Cool!" you say, "but what about that casino guy?" Ah yes: how does Bayes's rule help us with the dishonest croupier? Well, we left off with maximum likelihood estimation, which maximizes \(P(data \mid \rho)\) and ignores how plausible each path \(\rho\) is on its own. What we really want to maximize is the conditional probability \(P(\rho \mid data)\), and that means we can apply Bayes's Rule to this formula! \[ \rho = \operatorname{argmax}_{\rho} P(\rho \mid data) = \operatorname{argmax}_{\rho} \frac{P(data \mid \rho)\,P(\rho)}{P(data)} \]
And here is where it gets interesting. What are \(P(data \mid \rho)\) and \(P(\rho)\) in this case? Their product is the joint probability of the rolls and one particular path through the possible states (fair or loaded), which is given by this formula: \[ P(data, \rho) = \pi_{\rho_{0}}\, e_{\rho_{0}}(data_{0}) \prod_{i=1}^{n} A_{\rho_{i-1},\,\rho_{i}} \, e_{\rho_{i}}(data_{i}) \] \(A\) is our TPM. \(e\) is our emission probabilities matrix.
What this is saying:
- \( \pi_{\rho_{0}}\, e_{\rho_{0}}(data_{0})\): start off with the first observation (\(data_{0}\)), and multiply the probability (from π) of starting in state \(\rho_{0}\) by the emission probability of that observation in that state.
- Then, \( \prod_{i=1}^{n} A_{\rho_{i-1},\,\rho_{i}}\, e_{\rho_{i}}(data_{i}) \) means that for each of the remaining observations \(data_{i} \dots data_{n}\), we do the same thing, except that instead of relying on π we use the transition matrix to get from the previous state to the current one, and multiply all the results together. (The first observation has no previous state, so we can't use a transition probability there; that is what π is for.)
It gets a bit messy, but the thing to know here is that you could go through, calculate the probability of every single possible path (ρ) for the data, and pick the most likely one. But this becomes very expensive computationally: with 2 states and n rolls there are 2^n paths.
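Just to make that concrete, here is a brute-force sketch of my own (not from the original post) that enumerates every path for our 6-roll sequence, using the objects defined earlier. It is fine for 2^6 = 64 paths, and hopeless for long sequences, which is exactly why we need something smarter:
# every possible path of Fs and Ls for the six rolls
all_paths <- expand.grid(rep(list(states), length(pattern)), stringsAsFactors = FALSE)
path_prob <- function(path){
  # joint probability of the rolls and this particular path
  p <- dicePi[path[1]] * dice_probs[path[1], pattern[1]]
  for (i in 2:length(pattern)){
    p <- p * diceTPM[path[i - 1], path[i]] * dice_probs[path[i], pattern[i]]
  }
  p
}
probs <- apply(all_paths, 1, path_prob)
all_paths[which.max(probs), ] # the single most probable path
max(probs)                    # and its joint probability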
1.4.2 Viterbi algorithm
wikipedia's actually useful example
We need a better way of looking through the possibilities to find the best path. We will start referring to this best path as ρ*.
Enter the Viterbi algorithm!
\[ V_{1,k} = P\big(e_{1} \mid k\big)\cdot \pi_{k} \] \[ V_{i,k} = \max_{x \in S}\left(P\big(e_{i} \mid k\big)\cdot a_{x,k}\cdot V_{i-1,x}\right) \]
Here, \(k\) is one of our states (fair or loaded), \(i\) is the position in the sequence, \(a\) is our TPM, and \(S\) is the set of states. It sorta makes sense, but there are too many letters up there for my liking; let's put some numbers on it and work out an example!
1.4.2.1 A Lengthy Viterbi Example
We observe the rolls 4, 5, 6, 2, 6, 1.
i | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
outcomes | 4 | 5 | 6 | 2 | 6 | 1 |
V_F | | | | | | |
V_L | | | | | | |
hidden | ? | ? | ? | ? | ? | ? |
- The first entries
How do we work out the first entries? We use the first formula above: for each state (fair or loaded), we multiply the π at that state by the emission probability of the outcome given that state. What was π again?
# pi
dicePi
# emission probabilities
dice_probs
        F         L 
0.6666667 0.3333333 
       [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
F 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
L 0.1000000 0.1000000 0.1000000 0.1000000 0.1000000 0.5000000
\[ V_F(1) = \pi_F \, e_F(4) = (2/3)(1/6) \] \[ V_L(1) = \pi_L \, e_L(4) = (1/3)(1/10) \]
(2/3)*(1/6)
(1/3)*(1/10)
[1] 0.1111111
[1] 0.03333333
Great! That was pretty straightforward. Now we can fill in those boxes.
i | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
roll | 4 | 5 | 6 | 2 | 6 | 1 |
V_F (from F) | .111 | | | | | |
V_F (from L) | | | | | | |
V_L (from F) | | | | | | |
V_L (from L) | .033 | | | | | |
hidden | ? | ? | ? | ? | ? | ? |

- The subsequent entries
Let's move on to the next \(i\). Now that we have actual values for the previous position, we will stop using π as a proxy and calculate using the transition matrix. Let's start with V_F. You will notice I added extra rows to the table: these record both candidate calculations, so we can take the max and strike out the other. For V_F, we have two candidates to calculate: the probability of going from the fair die to the fair die on the second roll, and the probability of going from loaded to fair. \[ V_F(2) = e_{F}(roll_2)\, \max\big(V_{F}(1)A_{FF},\ V_{L}(1)A_{LF}\big) \] Again, just to clarify, e_F(roll_2) is the probability of rolling the second outcome (a 5) with a fair die, V_F(1) and V_L(1) are the values we just calculated for the first roll, and A_FF and A_LF are the probabilities of going from fair to fair and from loaded to fair, respectively. Those are found in our TPM.
pattern
diceTPM
# if previous was fair
dice_probs["F", pattern[2]] * 0.111 * diceTPM["F","F"]
# or if loaded
dice_probs["F", pattern[2]] * 0.033 * diceTPM["L","F"]
[1] 4 5 6 2 6 1
     F    L
F 0.95 0.05
L 0.10 0.90
F 0.017575
F 0.00055
Alright, here the more likely option was coming from the fair die. So we will enter both values, but cross out the one that is less likely. How about if the die used for the second roll was the loaded one? Then we have to consider the opposite combinations: the probability of going from loaded to loaded, and from fair to loaded.
\[ V_L(2) = e_{L}(roll_2)\, \max\big(V_{F}(1)A_{FL},\ V_{L}(1)A_{LL}\big) \]
diceTPM
# if previous was fair
dice_probs["L", pattern[2]] * 0.111 * diceTPM["F","L"]
# or if previous was loaded
dice_probs["L", pattern[2]] * 0.033 * diceTPM["L","L"]
     F    L
F 0.95 0.05
L 0.10 0.90
L 0.000555
L 0.00297
Alright, so here the most likely transition was from loaded to loaded. This makes sense: if we look at the TPM, we can remember that there is a 90% chance of staying with the loaded die for the next roll (switching is risky). Let's update the table!
i | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
roll | 4 | 5 | 6 | 2 | 6 | 1 |
V_F (from F) | .111 | 0.017575 | | | | |
V_F (from L) | | ~~0.00055~~ | | | | |
V_L (from F) | | ~~0.000555~~ | | | | |
V_L (from L) | .033 | 0.00297 | | | | |
hidden | ? | ? | ? | ? | ? | ? |

Cool! Just for good measure, let's go through the probabilities for the rest of the rolls.
#V_F_3
dice_probs["F", pattern[3]] * 0.017575 * diceTPM["F","F"]
dice_probs["F", pattern[3]] * 0.00297 * diceTPM["L","F"]
#V_L_3
dice_probs["L", pattern[3]] * 0.017575 * diceTPM["F","L"]
dice_probs["L", pattern[3]] * 0.00297 * diceTPM["L","L"]
F 0.002782708
F 4.95e-05
L 0.000439375
L 0.0013365
And for the 4th
#V_F_4
dice_probs["F", pattern[4]] * 0.002782708 * diceTPM["F","F"]
dice_probs["F", pattern[4]] * 0.0013365 * diceTPM["L","F"]
#V_L_4
dice_probs["L", pattern[4]] * 0.002782708 * diceTPM["F","L"]
dice_probs["L", pattern[4]] * 0.0013365 * diceTPM["L","L"]
F 0.0004405954
F 2.2275e-05
L 1.391354e-05
L 0.000120285
And let's update it again:
i | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
roll | 4 | 5 | 6 | 2 | 6 | 1 |
V_F (from F) | .111 | 0.017575 | 0.002782708 | 0.0004405954 | | |
V_F (from L) | | ~~0.00055~~ | ~~4.95e-05~~ | ~~2.2275e-05~~ | | |
V_L (from F) | | ~~0.000555~~ | ~~0.000439375~~ | ~~1.391354e-05~~ | | |
V_L (from L) | .033 | 0.00297 | 0.0013365 | 0.000120285 | | |
hidden | ? | ? | ? | ? | ? | ? |

And for the 5th:
#V_F_5
dice_probs["F", pattern[5]] * 0.0004405954 * diceTPM["F","F"]
dice_probs["F", pattern[5]] * 0.000120285 * diceTPM["L","F"]
#V_L_5
dice_probs["L", pattern[5]] * 0.0004405954 * diceTPM["F","L"]
dice_probs["L", pattern[5]] * 0.000120285 * diceTPM["L","L"]
F 6.976094e-05
F 2.00475e-06
L 1.101489e-05
L 5.412825e-05
And for the 6th
#V_F_6
dice_probs["F", pattern[6]] * 6.976094e-05 * diceTPM["F","F"]
dice_probs["F", pattern[6]] * 5.412825e-05 * diceTPM["L","F"]
#V_L_6
dice_probs["L", pattern[6]] * 6.976094e-05 * diceTPM["F","L"]
dice_probs["L", pattern[6]] * 5.412825e-05 * diceTPM["L","L"]
F 1.104548e-05
F 9.021375e-07
L 3.488047e-07
L 4.871543e-06
And let's fill out the rest of the table!
i | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
roll | 4 | 5 | 6 | 2 | 6 | 1 |
V_F (from F) | .111 | 0.017575 | 0.002782708 | 0.0004405954 | 6.976094e-05 | 1.104548e-05 |
V_F (from L) | | ~~0.00055~~ | ~~4.95e-05~~ | ~~2.2275e-05~~ | ~~2.00475e-06~~ | ~~9.021375e-07~~ |
V_L (from F) | | ~~0.000555~~ | ~~0.000439375~~ | ~~1.391354e-05~~ | ~~1.101489e-05~~ | ~~3.488047e-07~~ |
V_L (from L) | .033 | 0.00297 | 0.0013365 | 0.000120285 | 5.412825e-05 | 4.871543e-06 |
hidden | ? | ? | ? | ? | ? | ? |

Wow, that was a lot of work! We are almost done! Looking at the last roll, we see that the larger of the two final values (V_F or V_L) is V_F, meaning that the fair die is most likely responsible for the final roll. Now, the last thing we need to do is to trace back through the most likely path that leads to this ending state. The reason that we save both candidate values at each step is so that we can tell which transition led to the maximum. Looking back, we can see that all of the winning transitions up to this point were F -> F transitions, so ρ* here is FFFFFF. What if the croupier threw the die three more times, and got a 6 each time?
#V_F_7
dice_probs["F", 6] * 1.104548e-05 * diceTPM["F","F"]
dice_probs["F", 6] * 4.871543e-06 * diceTPM["L","F"]
#V_L_7
dice_probs["L", 6] * 1.104548e-05 * diceTPM["F","L"]
dice_probs["L", 6] * 4.871543e-06 * diceTPM["L","L"]
F 1.748868e-06
F 8.119238e-08
L 2.76137e-07
L 2.192194e-06
#V_F_8
dice_probs["F", 6] * 1.748868e-06 * diceTPM["F","F"]
dice_probs["F", 6] * 2.192194e-06 * diceTPM["L","F"]
#V_L_8
dice_probs["L", 6] * 1.748868e-06 * diceTPM["F","L"]
dice_probs["L", 6] * 2.192194e-06 * diceTPM["L","L"]
F 2.769041e-07
F 3.653657e-08
L 4.37217e-08
L 9.864873e-07
#V_F_9
dice_probs["F", 6] * 2.769041e-07 * diceTPM["F","F"]
dice_probs["F", 6] * 9.864873e-07 * diceTPM["L","F"]
#V_L_9
dice_probs["L", 6] * 2.769041e-07 * diceTPM["F","L"]
dice_probs["L", 6] * 9.864873e-07 * diceTPM["L","L"]
F 4.384315e-08
F 1.644146e-08
L 6.922603e-09
L 4.439193e-07
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
roll | 4 | 5 | 6 | 2 | 6 | 1 | 6 | 6 | 6 |
V_F (from F) | .111 | 0.017575 | 0.002782708 | 0.0004405954 | 6.976094e-05 | 1.104548e-05 | 1.748868e-06 | 2.769041e-07 | 4.384315e-08 |
V_F (from L) | | ~~0.00055~~ | ~~4.95e-05~~ | ~~2.2275e-05~~ | ~~2.00475e-06~~ | ~~9.021375e-07~~ | ~~8.119238e-08~~ | ~~3.653657e-08~~ | ~~1.644146e-08~~ |
V_L (from F) | | ~~0.000555~~ | ~~0.000439375~~ | ~~1.391354e-05~~ | ~~1.101489e-05~~ | ~~3.488047e-07~~ | ~~2.76137e-07~~ | ~~4.37217e-08~~ | ~~6.922603e-09~~ |
V_L (from L) | .033 | 0.00297 | 0.0013365 | 0.000120285 | 5.412825e-05 | 4.871543e-06 | 2.192194e-06 | 9.864873e-07 | 4.439193e-07 |
hidden | ? | ? | ? | ? | ? | ? | ? | ? | ? |

So here, something interesting has happened. Now, the last V_L is the more likely possibility. We then trace back the values "leading" to this final state, and we see that in every case the maximized value came from the loaded state. This means that this new outcome, 456261666, most likely came from using just the loaded die (ρ* is LLLLLLLLL). This reinforces what we saw in the transition probability matrix: switches are very unlikely.
Hopefully, this has shown you how to work out the most likely hidden states underlying some sort of data.
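If you would rather let the computer do all of that bookkeeping, here is a minimal Viterbi sketch of my own (not code from the original post), assuming the diceTPM, dice_probs, dicePi, and states objects defined earlier:
viterbi <- function(data, TPM = diceTPM, emission = dice_probs, init = dicePi){
  n <- length(data)
  k <- nrow(TPM)
  V   <- matrix(NA, k, n, dimnames = list(rownames(TPM), paste0("roll_", 1:n)))
  ptr <- matrix(NA, k, n)                  # back-pointers: which previous state won
  V[, 1] <- init * emission[, data[1]]     # V_1,k = pi_k * e_k(roll_1)
  for (i in 2:n){
    for (j in 1:k){
      cand <- V[, i - 1] * TPM[, j]        # candidates coming from each previous state
      ptr[j, i] <- which.max(cand)
      V[j, i]   <- emission[j, data[i]] * max(cand)
    }
  }
  # trace back from the best final state
  path <- integer(n)
  path[n] <- which.max(V[, n])
  for (i in n:2) path[i - 1] <- ptr[path[i], i]
  list(path = rownames(TPM)[path], V = V)
}
viterbi(c(4, 5, 6, 2, 6, 1))$path          # should come out all "F", as worked out above
viterbi(c(4, 5, 6, 2, 6, 1, 6, 6, 6))$path # should come out all "L"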
1.4.2.2 Recap
- The Viterbi algorithm works out ρ*, the single most probable path of hidden states given the observations
- This maximizes the posterior probability of the path, P(ρ|data)
- Viterbi gives the best path, but even the best path may not be particularly probable on its own
- The forward algorithm gives the total probability of the data
- Together, they give us an idea of what's going on
1.4.2.3 Wait, what if we don't know where we start?
Suppose we want a bit more control over the starting state of a sequence. We can do something very simple: add another state to our TPM; let's call it β, or "begin". Now we can make a TPM that looks like this:
modTPM <- matrix(c(
0,.8,.2,
0,.95,.05,
0,.1,.9), 3, byrow=T)
rownames(modTPM)<-c("begin", "fair", "loaded")
colnames(modTPM)<-c("begin", "fair", "loaded")
modTPM
       begin fair loaded
begin      0 0.80   0.20
fair       0 0.95   0.05
loaded     0 0.10   0.90
This gives us a way to account for some prior information about starting a chain (say, we know that 80% of the time, this croupier starts with a fair die). We could use a similar additional state if we knew something about the end state.
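To see the begin state in action, here is a small simulation sketch of my own (not from the post): it generates rolls from the full model, starting in the "begin" state and then following modTPM, so roughly 80% of simulated games open with the fair die.
set.seed(1)
simulate_rolls <- function(n, TPM = modTPM, emission = dice_probs){
  state <- "begin"
  hidden <- character(n); rolls <- integer(n)
  for (i in 1:n){
    state <- sample(colnames(TPM), 1, prob = TPM[state, ])   # move to the next state
    die <- ifelse(state == "fair", "F", "L")                  # map to the emission matrix rows
    hidden[i] <- die
    rolls[i] <- sample(1:6, 1, prob = emission[die, ])        # roll that die
  }
  data.frame(roll = rolls, hidden = hidden)
}
head(simulate_rolls(20))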
1.5 Training a Model
The last thing to figure out, how to train models, deserves its own post.