These notes are from the research paper co-authored by Peter Brown and Robert Mercer while at the IBM Watson Research Center in the early 90s.
It seems the initial 5 models are based on Bayes' theorem for probability analysis. They use a series of analyses to compare one language against another (e.g. French vs English) and find patterns in how the two connect with each other. This is for translation purposes.
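As a rough illustration of the Bayes' theorem framing, the sketch below picks the English candidate e that maximizes Pr(e) * Pr(f|e) for one fixed French string f. The denominator Pr(f) is dropped because it is the same for every candidate. The function name, the candidate list and both probability tables are made up for illustration and are not taken from the paper.

```python
# Toy noisy-channel decoder: choose e maximizing Pr(e) * Pr(f | e).
# All probabilities below are invented for illustration only.

def best_translation(language_model, translation_model):
    """Return the candidate e with the highest Pr(e) * Pr(f | e).

    language_model:    dict mapping e -> Pr(e)
    translation_model: dict mapping e -> Pr(f | e) for one fixed f
    """
    return max(
        language_model,
        key=lambda e: language_model[e] * translation_model.get(e, 0.0),
    )

# Hypothetical candidates for the French string "la maison".
language_model = {"the house": 0.6, "house the": 0.1, "the home": 0.3}
translation_model = {"the house": 0.5, "house the": 0.5, "the home": 0.2}

print(best_translation(language_model, translation_model))
```

The language model is what rules out "house the" here: both word orders explain the French equally well, but one is far less likely as English.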
Pg 7: We generally follow the common convention of using uppercase letters to denote random variables and the corresponding lowercase letters to denote specific values that the random variables may take. We have already used l and m to represent the lengths of the strings e and f, and so we use L and M to denote the corresponding random variables.
As stated here by James Baker, https://www.quora.com/What-are-the-investment-strategies-of-James-Simons-Renaissance-Technologies-I-understand-he-employs-complex-mathematical-models-along-with-statistical-analyses-to-predict-non-equilibrium-changes
We need to understand the corresponding random variables. There is mention of Lagrange multipliers and normalizing predictive data.
Auxiliary functions are used as well to generate desirable parameters using extrema/maxima analysis. Conditional probabilities are used as well. On pg 276 there are translation, distortion and fertility probabilities.
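To make the Lagrange multiplier and normalization remarks concrete, here is a short worked derivation for a single English word e. It is a generic sketch, not a copy of the paper's equations: c(f|e) stands for any non-negative expected count of f being paired with e, and that symbol is my own shorthand.

```latex
% Maximize the count-weighted log-probability subject to normalization.
\begin{align*}
h(t, \lambda_e) &= \sum_f c(f \mid e)\,\log t(f \mid e)
                  - \lambda_e \Big( \sum_f t(f \mid e) - 1 \Big) \\
\frac{\partial h}{\partial t(f \mid e)} &= \frac{c(f \mid e)}{t(f \mid e)} - \lambda_e = 0
  \quad\Longrightarrow\quad t(f \mid e) = \frac{c(f \mid e)}{\lambda_e} \\
\sum_f t(f \mid e) = 1 &\quad\Longrightarrow\quad \lambda_e = \sum_{f'} c(f' \mid e) \\
t(f \mid e) &= \frac{c(f \mid e)}{\sum_{f'} c(f' \mid e)}
\end{align*}
```

So the multiplier ends up being nothing more than the normalizing constant: each probability is its count divided by the total count for that English word.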
Page 282: Model 5 is a powerful but unwieldy ally in the battle to align translations. It must be led to the battlefield by its weaker but more agile brethren Models 2, 3, and 4. In fact, this is the raison d'être of these models. To keep them aware of the lay of the land, we adjust their parameters as we carry out iterations of the EM algorithm for Model 5. That is, we collect counts for Models 2, 3, and 4 by summing over alignments as determined by the abbreviated S described above, using Model 5 to compute Pr(a|e, f). Although this appears to increase the storage necessary for maintaining counts as we proceed through the training data, the extra burden is small because the overwhelming majority of the storage is devoted to counts for t(f|e), and these are the same for Models 2, 3, 4, and 5.
Page 283 shows that the number of translations considered goes into the millions in order to determine a small subset of useful words. The EM algorithm is used with maximum likelihood estimation.
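To get a feel for what EM with maximum likelihood means here, below is a minimal sketch of EM for the simplest model (Model 1), where only the translation probabilities t(f|e) are estimated. It is a toy under my own assumptions, not the paper's training program: the function name, the "<null>" token name, the three-sentence corpus and the iteration count are all hypothetical.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical name for the empty English word


def train_model1(pairs, iterations=10):
    """Estimate t(f|e) by EM from (english_words, french_words) pairs."""
    french_vocab = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(french_vocab)
    t = defaultdict(lambda: uniform)  # t[(f, e)], uniform starting point

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f paired with e
        total = defaultdict(float)  # expected count of e paired with anything
        for es, fs in pairs:
            es = [NULL] + list(es)
            for f in fs:
                # E-step: posterior probability that f connects to each e
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / norm
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalize counts so that sum over f of t(f|e) is 1
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


# Invented toy corpus, far smaller than anything in the paper.
pairs = [
    ("the house".split(), "la maison".split()),
    ("the flower".split(), "la fleur".split()),
    ("a house".split(), "une maison".split()),
]
t = train_model1(pairs)
print(round(t[("maison", "house")], 3), round(t[("la", "house")], 3))
```

Even on three sentences, "maison" pulls ahead of "la" as the translation of "house", because "la" is better explained by "the" in the other sentences.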
Pg 283 (Computational Linguistics, Volume 19, Number 2): Although the entire t array has 2,437,020,096 entries, and we need to store it twice, once as probabilities and once as counts, it is clear from the preceding remarks that we need never deal with more than about 25 million counts or about 12 million probabilities. We store these two arrays using standard sparse matrix techniques. We keep counts as pairs of bytes, but allow for overflow into 4 bytes if necessary. In this way, it is possible to run the training program in less than 100 megabytes of memory. While this number would have seemed extravag…
Page 293 speaks of Viterbi algo training:
We have already used this algorithm successfully as a part of a system to assign senses to English and French words on the basis of the context in which they appear (…
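As I understand it, the Viterbi idea is to work with the single most probable alignment instead of summing over all of them. For the simplest model (Model 1) that best alignment can be found one French word at a time, which the sketch below shows. The function name, the "<null>" token and the table of t(f|e) values are hypothetical, and the numbers are invented.

```python
def viterbi_alignment_model1(t, english, french, null="<null>"):
    """Most probable Model 1 alignment.

    Returns a list a where a[j] is the position of the English word
    chosen for french[j]; position 0 means the null word.
    """
    candidates = [null] + list(english)
    return [
        max(range(len(candidates)), key=lambda i: t.get((f, candidates[i]), 0.0))
        for f in french
    ]


# Hypothetical translation table t[(f, e)] = t(f | e); numbers are invented.
t = {
    ("la", "the"): 0.7, ("maison", "the"): 0.1,
    ("la", "house"): 0.1, ("maison", "house"): 0.8,
    ("la", "<null>"): 0.2, ("maison", "<null>"): 0.1,
}
print(viterbi_alignment_model1(t, ["the", "house"], ["la", "maison"]))
```

Here "la" connects to position 1 ("the") and "maison" to position 2 ("house"). The later models add distortion and fertility terms, so their best alignment cannot be picked word by word like this.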
Page 297 table of notation
Appendix B has summary of models. Note especially Log-Likelihood Objective Function.
Note page 300 iterative improvement.
Page 301: Parameter Reestimation Formulae: In order to apply these algorithms, we need to solve the maximization problems of Steps 2 and 4. For the models that we consider, we can do this explicitly.
Equation (73) is useful in computations since it involves only O(lm) arithmetic operations, whereas the original sum over alignments (72) involves O(l^m) operations.
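The saving comes from swapping a sum of products for a product of sums. Assuming (72) is the Model 1 sum over every alignment and (73) is its factored form, the check below confirms on a tiny invented table that both give the same number. Constant factors in front of the sum are left out, and the function names and table values are my own.

```python
import itertools
import math


def sum_over_alignments(table):
    """Brute force: one product per alignment, (l+1)^m alignments in total.

    table[j][i] stands for t(f_j | e_i), with i = 0 meaning the null word.
    """
    m = len(table)
    positions = range(len(table[0]))
    return sum(
        math.prod(table[j][a[j]] for j in range(m))
        for a in itertools.product(positions, repeat=m)
    )


def product_of_sums(table):
    """Factored form: one sum of l+1 terms per French word."""
    return math.prod(sum(row) for row in table)


# Invented values: m = 3 French words, l = 2 English words plus null.
table = [[0.2, 0.5, 0.1], [0.3, 0.1, 0.6], [0.05, 0.4, 0.2]]
print(math.isclose(sum_over_alignments(table), product_of_sums(table)))
```

With 3 French words and 3 positions the brute force already needs 27 products; at realistic sentence lengths only the factored form is practical.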