This is the blog page for my course of 16 lectures to third year Cambridge mathematics students in winter 2012.
I intend to make a few comments after each lecture: to emphasize an idea, give a sidebar, make a correction (!), or answer an interesting question (perhaps one that a student sends me by email).
I mentioned in the lecture that it is initially surprising to see that in Example 16.4 the cost is greatest when there is no noise. This is a little counterintuitive, so we should try to understand why.
I guess it is because C_0 and C_1 are almost the same, and L is small compared to Q, so we can let the noise take us to the endpoints, expending only minimal effort (which is costed at Qu^2). I guess that if C_0 were a lot less than C_1, or if L were bigger compared to Q, then noise would not be helpful and we would expect Figure 4 to look different.
I am glad to read in the feedback that most people have rated the course "5 Interesting". One person even said "The course overall was fascinating (not just interesting)."
Here are solutions to example sheet 3. If you have not yet done this examples sheet, try to resist the temptation to look until you have.
Since the lecture, I have made a small change to the start of section 15.3, to better explain the derivation of (15.3). It would be easy to derive (15.3) directly, simply by writing down an infinitesimal form of the dynamic programming equation (from time t to t+δ), and then taking the limit as δ → 0.
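To spell this out (in generic notation, with plant equation dx/dt = a(x,u) and instantaneous cost c(x,u); the particular functions are of course those of the problem in section 15.3): over a short interval of length δ the optimality principle gives

F(x, t) = inf_u { c(x,u) δ + F(x + a(x,u) δ, t + δ) } + o(δ).

Expanding the right-hand side to first order in δ, subtracting F(x,t) from both sides, dividing by δ and letting δ → 0, we obtain

0 = inf_u { c(x,u) + ∂F/∂t + a(x,u) ∂F/∂x },

which is the infinitesimal form of the dynamic programming equation, from which an equation such as (15.3) follows on substituting the particular c and a.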
The method of uniformization that I described in this lecture seems to have been introduced by Jensen (1953), but it was first used extensively in solving queueing problems in the late 1970s (by Grassman (1977) and Keilson (1979)). The Wikipedia article has more about it and summarises the method in these words:
The method involves the construction of an analogous discrete time Markov chain, where transitions occur according to an exponential distribution with the same parameter in every state.
I have used this idea many dozens of times in my own research. It is usually the best first step to take in tackling a continuous-time Markov decision problem.
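To sketch the construction in generic notation (mine, not necessarily that of the notes): suppose the continuous-time chain has transition rates q(x,y), y ≠ x, with total rate q(x) = Σ_{y≠x} q(x,y) bounded above by some constant B. Define a discrete-time transition matrix by P(x,y) = q(x,y)/B for y ≠ x, and P(x,x) = 1 − q(x)/B. The continuous-time chain is then equivalent to one that jumps at the event times of a Poisson process of rate B, choosing its next state according to P at each event (some of these being fictitious jumps from a state back to itself). A continuous-time Markov decision problem can then be attacked with the discrete-time dynamic programming machinery we already have.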
Today's lecture was about solving some problems using Pontryagin's Maximum Principle. My personal favorite example of a successful and novel use of PMP to solve an interesting problem is the paper
R. R. Weber. Optimal search for a randomly moving object. J. Appl. Prob. 23:708-717, 1986.
Interestingly, the problem that is solved in this paper in continuous time is still an open problem in discrete time. That shows the power of PMP: with it we can solve a problem in continuous time whose discrete-time analogue remains unsolved. There is no discrete-time analogue of the PMP.
I mentioned today in connection with section 14.5 (insects as optimizers) that zoologists have observed that some insect populations seem to do a good job of choosing t* in the real world, as if predicting T (the time of the serious frost that ends the lives of the workers).
Here is a digression harkening back to Lecture 2. In connection with some of my current research, I was looking yesterday for some other information about the Secretary Problem. I came across this nice paper and enjoyed reading it.
Thomas S. Ferguson. Who Solved the Secretary Problem? Statist. Sci., Volume 4, Number 3 (1989), 282--289.
In his paper Thomas Ferguson tells the following story, which I had not known before. It is funny and I thought you might enjoy it also. It makes fun of how we mathematicians can take life too seriously.
"When the celebrated German astronomer, Johannes Kepler (1571-1630), lost his first wife to cholera in 1611, he set about finding a new wife using the same methodical thoroughness and careful consideration of the data that he used in finding the orbit of Mars to be an ellipse. His first, not altogether happy, marriage had been arranged for him, and this time he was determined to make his own decision. In a long letter to a Baron Strahlendorf on October 23, 1613, written after he had made his selection, he describes in great detail the problems he faced and the reasons behind each of the decisions he made. He arranged to interview and to choose from among no fewer than eleven candidates for his hand. The process consumed much of his attention and energy for nearly 2 years, what with the investigations into the virtues and drawbacks of each candidate, her dowry, negotiations with her parents, natural hesitations, the advice of friends, etc. The book of Arthur Koestler (1960) contains an entertaining and insightful exposition of the process. The book of Carola Baumgardt (1951) contains much supplementary information.
Suffice it to say that of the eleven candidates interviewed, Kepler eventually decided on the fifth. It may be noted that when n = 11, the function phi_n(r) of (2.1) takes on its maximum value when r = 5. Perhaps, if Kepler had been aware of the theory of the secretary problem, he could have saved himself a lot of time and trouble."
Fortunately,
"His new wife, whose education, as he says in his letter, must take the place of a dowry, bore him seven children, ran his household efficiently, and seems to have provided the necessary tranquil homelife for his erratic genius to flourish."
Ferguson has many other interesting things in this paper. The paper created some controversy and amusement when published, because his answer to the question "Who Solved the Secretary Problem?" is "Nobody"! He then goes on to "solve" it for the first time (at least a particular version he likes). He also says:
"As historians, we should take as the secretary problem, the problem as it first appeared in print, in Martin Gardner's February 1960 column in Scientific American, where it was called the game of googol and described as follows.
Ask someone to take as many slips of paper as he pleases, and on each slip write a different positive number. The numbers may range from small fractions of 1 to a number the size of a googol (1 followed by a hundred 0's) or even larger. These slips are turned face down and shuffled over the top of a table. One at a time you turn the slips face up. The aim is to stop turning when you come to the number that you guess to be the largest of the series. You cannot go back and pick up a previously turned slip. If you turn over all slips, then of course you must pick the last one turned.
The astute reader may notice that this is not the simple form of the secretary problem described in Section 2. The actual values of the numbers are revealed to the decision maker in violation of condition 4. Also, there is this "someone" who chooses the numbers, presumably to make your problem of selecting the largest as difficult as possible. The game of googol is really a two-person game."
There is a nice interactive demo of the solution to the rocket car parking problem at http://www.egwald.ca/optimalcontrol/rocketcar.php
I updated the "proof" of PMP in the notes on page 50. It was too cryptic before! I am now glad to have it written down in a clear way.
I am posting today some solutions to example sheet 2.
In Section 11.5 we proved that the optimal control is a certainty equivalence control. In doing so, we ignored certain policy-independent terms, simply denoting them by "+ ···". If required, these terms could be computed explicitly. For example, in tripos question 2003 Paper 4, #14, one is asked to find a recurrence for the + ··· terms in a simple example.
The name "Kalman filter" refers to equation (11.5) and takes its name from Rudolf Kalman (1930 –), who developed it in the years 1958-64. He also coined the terms controllable and observable, and gave the criteria that we have seen in previous lectures. The fact that a system is controllable iff the matrix [B AB ··· A^{n-1}B] is of full rank is sometimes called Kalman's criteria,
In the IEEE biography of Kalman it is stated
The Kalman filter, and its later extensions to nonlinear problems, represents perhaps the most widely applied by-product of modern control theory. It has been used in space vehicle navigation and control (e.g. the Apollo vehicle), radar tracking algorithms for ABM applications, process control, and socioeconomic systems.
I apologise for a late start to the lecture this morning, which meant that I could not yet talk about Section 11.5 (certainty equivalence).
The theory in this lecture is admittedly quite tricky - partly because of the notation. You will understand the ideas better once you have worked through the details of a scalar example (in which n=m=p=1). You do this in Example Sheet 2 Question 11.
You will not be asked to reproduce the statement or proof of Theorem 11.2 in examinations. You should simply know that hat{x}_t is computed from hat{x}_{t-1} and y_t in the linear manner specified by (11.5), and that the covariance matrix V_t satisfies a Riccati equation. You are not expected to memorize (8.7), (8.8), (11.6) or (11.7).
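If it helps to see the scalar case in action, here is a minimal numerical sketch. The model and symbols below are my own choices (a standard textbook convention with observation y_t = c x_t + η_t), so the timing and signs may differ slightly from (11.5)-(11.7) in the notes; the point is only to see hat{x}_t computed linearly from hat{x}_{t-1} and y_t, with V_t following a (scalar) Riccati recursion.

import random

# A toy scalar model (my own convention, not necessarily that of the notes):
#   state:       x_t = a x_{t-1} + eps_t,   eps_t ~ N(0, N)
#   observation: y_t = c x_t + eta_t,       eta_t ~ N(0, M)
a, c = 0.9, 1.0
N, M = 0.5, 2.0          # process and observation noise variances

def kalman_step(x_hat, V, y):
    # predict
    x_pred = a * x_hat
    V_pred = a * V * a + N
    # update (the scalar analogue of the linear rule (11.5);
    # V satisfies a scalar Riccati recursion)
    H = V_pred * c / (c * V_pred * c + M)     # filter gain
    x_hat = x_pred + H * (y - c * x_pred)
    V = (1 - H * c) * V_pred
    return x_hat, V

# simulate the system and run the filter
random.seed(1)
x, x_hat, V = 0.0, 0.0, 1.0
for t in range(20):
    x = a * x + random.gauss(0, N ** 0.5)
    y = c * x + random.gauss(0, M ** 0.5)
    x_hat, V = kalman_step(x_hat, V, y)
    print(t, round(x, 3), round(x_hat, 3), round(V, 3))

Notice that V converges quickly to a fixed point of the Riccati recursion, and that it does so independently of the observed data y_t.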
I forgot to comment today (so let me do so here) that the Riccati equation for V_t, i.e. V_t = g V_{t-1}, runs in the opposite time direction to the one we had for Pi_t in Lecture 8, where Pi_{t-1} = f Pi_t. We are given V_0 and Pi_h.
I wish only to repeat the remark that I made during the lecture that controllability and observability stand in a dual relationship to one another. This is clear in the necessary and sufficient conditions: [B AB ··· A^{n-1}B] must be of rank n (for controllability), and [C' A'C' ··· (A')^{n-1}C'] must be of rank n (for observability). This duality is also nicely exhibited in Example Sheet 3 Question 1. You should now be able to do this question.
I have made a small change to the notes for this lecture. Let's say that the matrix C is p x n. To say it is r x n might be confusing, since on the same page we use the letter r to talk about r-observability. I have also rewritten the pendulum example so as to make it clearer what the matrix K is.
Notice that a system can be stabilizable without being controllable. E.g. the scalar system with x_{t+1}=(1/2)x_t is trivially stabilizable, but it is not controllable.
The idea of controllability is straightforward and we can understand it using ideas of linear algebra.
I commented that in the broom balancing problem it is not possible to stabilize two brooms at equilibrium if they are of the same length, but that it is possible if the lengths are different. This is because of the forms:
A = | 0 1 0 0 |    B = |  0 |    M4 = |  0  -a    0  -a^2 |
    | a 0 0 0 |        | -a |         | -a   0  -a^2   0  |
    | 0 0 0 1 |        |  0 |         |  0  -b    0  -b^2 |
    | 0 0 b 0 |        | -b |         | -b   0  -b^2   0  |
So M4 is of rank 4 iff a and b are different. However, this assumes that as you move your hand you must keep it at constant height from the ground. What do you think might happen if you can move your hand up and down also?
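If you would like to check the rank claim without expanding the determinant by hand, here is a small sympy sketch (the matrices are exactly those displayed above):

from sympy import Matrix, factor, symbols

a, b = symbols('a b', positive=True)
A = Matrix([[0, 1, 0, 0],
            [a, 0, 0, 0],
            [0, 0, 0, 1],
            [0, 0, b, 0]])
B = Matrix([0, -a, 0, -b])

# controllability matrix M4 = [B  AB  A^2B  A^3B]
M4 = Matrix.hstack(B, A*B, A*A*B, A*A*A*B)
print(factor(M4.det()))   # a**2 * b**2 * (a - b)**2, which vanishes iff a = b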
In 1996, paper 2, question 10 there was this cute tripos question:
A fantasy kingdom has populations of x vampires and y humans. At the start of each of the four yearly seasons the king takes a census and then intervenes to admit or expel some humans. Suppose that the population dynamics of the kingdom are governed by the plant equations:
x_{t+1} = x_t + (y_t – Y),
y_{t+1} = y_t – (x_t – X) + u_t,
where x_t and y_t are integers representing the respective populations of vampires and humans in the t-th season, t=0,1,2,3,...; X and Y represent equilibrium values in the population without regal intervention; ...
[ some more description]
Can the king regain equilibrium by only expelling humans?
Suppose the census is taken during the day, so the king can only count humans, and thus u_t is allowed only to depend on y_t. Show that ...
The last part of the question is about observability (see Lecture 10). The full question is in this file of past questions.
The linear-quadratic regulator (LQR) that we met today is one part of the solution to the linear-quadratic-Gaussian control problem (LQG). This problem is the subject of Lectures 8–11. The LQG problem is perhaps the most important problem in control theory. It has a full solution: given in terms of the Kalman filter (a linear-quadratic estimator) and the linear-quadratic regulator.
Following today's lecture you can do Questions 4–9 on Example Sheet 2. Here is an important hint. Do not try to solve these problems by plugging values into the general Riccati equation solution in equations (8.6)–(8.7). It is always better to figure out the solution from scratch, by the method of backwards induction. Conjecture for yourself that the optimal value function, F(x,t), is of a form that is quadratic in the state variable, plus some gamma_t (if it is a problem with white noise). Then find the explicit form of the recurrences, working backwards inductively from a terminal time h. Notice how this problem-solving strategy works in Question 5, even when the noise is rather different.
In general, solutions to a Riccati equation can be computed numerically, but not algebraically. However, one can find a full solution in the special case that x_t and u_t are both one-dimensional (Question 4). There is also a fully-solvable special case in which x_t is two-dimensional and u_t is one-dimensional (Question 9). A useful trick in some problems is to observe that Pi_t^{-1} satisfies a recurrence relation (Questions 4 and 9). Notice that Question 9 is a special case of Question 8, and so in solving Question 9 you should find a way to reformulate the problem so that the state variable becomes one-dimensional.
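As an illustration of what "computed numerically" means here, the following is a minimal sketch for a scalar problem of my own choosing (plant x_{t+1} = a x_t + b u_t, cost Σ (r x_t^2 + q u_t^2) plus terminal cost Pi_h x_h^2); the symbols are not those of (8.6)–(8.7), but the backwards-induction pattern is the one you should reproduce in Questions 4 and 9.

# Minimal sketch: scalar LQR by backwards induction (a toy problem of my own,
# not the notation of (8.6)-(8.7)).
#   plant:  x_{t+1} = a x_t + b u_t
#   cost:   sum_{t=0}^{h-1} (r x_t^2 + q u_t^2) + Pi_h x_h^2
a, b, r, q = 1.0, 1.0, 1.0, 1.0
h, Pi_h = 10, 0.0

# conjecture F(x, t) = Pi_t x^2 and work backwards:
#   Pi_t = r + a^2 q Pi_{t+1} / (q + b^2 Pi_{t+1}),
#   optimal control u_t = K_t x_t with K_t = -a b Pi_{t+1} / (q + b^2 Pi_{t+1})
Pi = Pi_h
for t in range(h - 1, -1, -1):
    K = -a * b * Pi / (q + b * b * Pi)       # gain at time t (uses Pi_{t+1})
    Pi = r + a * a * q * Pi / (q + b * b * Pi)
    print(t, round(Pi, 4), round(K, 4))
# as h - t grows, Pi_t converges to the solution of the algebraic Riccati equation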
I mentioned in the lecture that I would not expect you to memorize the Riccati equation. The important thing is to remember enough of what it is for, how it looks, and how it is derived, so that you could reconstruct its derivation in the context of a particular problem, as you do in Questions 4 and 9.
I think you might enjoy reading the quite-good Wikipedia articles on the Linear-quadratic-regulator, Linear-quadratic-Gaussian control, and algebraic Riccati equations. I recommend, as a general principle for learning, that you should always try to see how the same ideas are presented by more than one author, with different emphasis, and with different notation. That can help you to understand what are the essential ideas/concepts, versus what is merely presentational/notational. I try to make my explanations in this course efficient and "best" (in my personal view). But you will certainly gain a deeper understanding of the topics we are studying if you compare how others present them. Too often, Cambridge mathematics students see only the way their lecturer states it. You should try (at least briefly) to compare other explanations.
The policy improvement algorithm is useful in answering Questions 2 and 3 on Examples Sheet 2. In Question 2, you will find λ and φ for a certain policy ("On seeing a filling station, stop and fill the tank"), and then look for a condition under which the policy improvement algorithm will not improve it (or, equivalently, that this λ and φ satisfy the optimality equation (7.2)).
In Question 3 you will use the policy improvement idea in the context of a discounted-cost problem. You find F(x) for a simple policy (in which the engineer allocates his time randomly), and then improve it by a step of the policy improvement algorithm. This leads to a policy in which the engineer puts all his maintenance effort into the machine with the greatest value of c_i(x_i + q_i)/(1 − β b_i). This policy is better, but may not yet be optimal.
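For reference, here is a minimal sketch of policy evaluation and a policy improvement step in the discounted case, on a small made-up MDP (the numbers are arbitrary and have nothing to do with the questions on the sheet):

import numpy as np

# A tiny made-up discounted MDP: 3 states, 2 actions.
# P[a][x, y] = transition probability, c[x, a] = one-step cost, beta = discount.
beta = 0.9
P = [np.array([[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]),
     np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])]
c = np.array([[1.0, 2.0], [3.0, 1.0], [2.0, 0.5]])

def evaluate(policy):
    # solve F = c_pi + beta P_pi F  (the value of the given stationary policy)
    P_pi = np.array([P[policy[x]][x] for x in range(3)])
    c_pi = np.array([c[x, policy[x]] for x in range(3)])
    return np.linalg.solve(np.eye(3) - beta * P_pi, c_pi)

def improve(policy):
    F = evaluate(policy)
    Q = np.array([[c[x, a] + beta * P[a][x] @ F for a in range(2)] for x in range(3)])
    return Q.argmin(axis=1), F

policy = np.array([0, 0, 0])            # start with some simple policy
for _ in range(5):
    new_policy, F = improve(policy)
    print(policy, F)
    if (new_policy == policy).all():    # no improvement possible: optimal
        break
    policy = new_policy

The same pattern applies in Question 3: evaluate the simple policy to get F, then choose in each state the action that minimizes the right-hand side of the optimality equation.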
The value iteration bounds in Section 7.3 are relevant to the CATAM project The Value Iteration Algorithm for a Stochastic Dynamic Programming Problem.
There is another interesting way to motivate the optimality equation (7.2). This can be made rigorous and helps us understand the relative value function φ(x).
Let F(x) be the infinite-horizon value function when the discount factor is β. Then we know this satisfies the optimality equation
F(x) = min_u { c(x,u) + β E[ F(x1) | x0=x, u0=u ] } .
Pick some state, say state 0. By subtracting βF(0) from both sides of the above, we obtain
F(x) – F(0) + (1–β)F(0) = min_u { c(x,u) + β E[ F(x1) – F(0) | x0=x, u0=u ] }
One can show that as β → 1 we have F(x) – F(0) → φ(x) and (1–β)F(0) → λ (the average-cost). Thus we obtain
φ(x) + λ = min_u { c(x,u) + E[ φ(x1) | x0=x, u0=u ] }
and this is our average-cost optimality equation (7.2). If you would like to understand why (1–β)F(0) → λ, see this small note about the connection between the average-cost problem and the discounted-cost problem with β near 1.
It is also interesting to think about the following (which I mentioned briefly in lectures today). Suppose we have a deterministic stationary Markov policy, say π, with u=f(x). Suppose we have λ and φ such that
φ(x) + λ = c(x,f(x)) + Σy φ(y) P(y | x ) (7.6)
where P(y | x) is the transition matrix under π.
Suppose π induces an ergodic Markov chain (i.e. a Markov chain that is irreducible and positive recurrent) and this has invariant distribution μ. We know that
μy = Σx μx P( y | x) (7.7)
Multiplying (7.6) through by μx and then summing on x, we get
Σx μx φ(x) + λ Σx μx
= Σx μx c(x,f(x)) + Σy Σx μx φ(y) P( y | x)
which, using (7.7) gives
Σx μx φ(x) + λ Σx μx = Σx μx c(x,f(x)) + Σy μy φ(y).
Then using Σx μx = 1, and cancelling the terms that are equal on both sides, we have
λ = Σx μx c(x,f(x))
and so we see that λ is the average-cost of policy π.
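Here is a quick numerical sanity check of this identity on a made-up 3-state chain (arbitrary numbers, nothing to do with the course): solve (7.6) for λ and φ with φ(0) = 0, compute the invariant distribution μ, and confirm that λ = Σx μx c(x,f(x)).

import numpy as np

# A made-up 3-state chain induced by some fixed policy, with costs c(x, f(x)).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
c = np.array([1.0, 3.0, 2.0])

# Solve phi(x) + lambda = c(x) + sum_y P(y|x) phi(y), normalizing phi(0) = 0.
# Unknowns: (phi(1), phi(2), lambda).
A = np.zeros((3, 3))
for x in range(3):
    A[x, 2] = 1.0                        # coefficient of lambda
    for y in (1, 2):
        A[x, y - 1] = (x == y) - P[x, y]
phi1, phi2, lam = np.linalg.solve(A, c)

# Invariant distribution mu: mu P = mu, sum mu = 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(abs(evals - 1))])
mu = mu / mu.sum()

print(lam, mu @ c)     # the two numbers agree: lambda is the average cost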
Today's lecture was presented using these overhead slides. They are a modified version of this research talk. A slightly different presentation is in the notes. There are also some additional things in Sections 6.5 and 6.6 (non-examinable). As I commented in the lecture, the proof of the Gittins index theorem is non-examinable. However, I thought you would enjoy seeing this beautiful result and how it can be proved. Lectures 1-5 have covered everything you need to know in order to understand the Gittins index. Today's lecture has also been an opportunity to revise ideas that we already met in Sections 3.2 (job scheduling) and 4.5 (pharmaceutical trials).
The Gittins index theorem is perhaps the most beautiful result in the field of Markov decision processes in the later part of the 20th century. Its discovery and proof in 1974 are due to John Gittins. The paper from which the proof in today's lecture is taken is On the Gittins index for multiarmed bandits. Ann. Appl. Prob. 2, 1024-33, 1992.
I am posting today some solutions to example sheet 1. I trust you to use these wisely! As mathematicians we need nice problems to train our mathematical muscles. We improve as mathematicians by trying hard problems, getting stuck, letting a day pass, trying again, looking at books, talking to others, thinking about it as we fall off to sleep, trying again, etc. George Polya explains some nice problem-solving strategies in his famous little book How to Solve It. If you refer to my worked solutions too early you may lose out.
By the way, while we are thinking about the raison d'être for example sheets, let me recommend Tim Gowers's article The Two Cultures of Mathematics. He asks, with which of the following do you feel most sympathy?
I made some changes this morning on page 14 (at the start of Section 4.4) so that the calculations of the value iteration algorithm are set out for you more explicitly. We will say more about value iteration in Lecture 7. Also, I have newly updated the notes for Lecture 5 this morning (with an improved remark at the top of page 20).
Having completed these lectures you can now attempt all of Examples Sheet 1.
Questions 7 and 8 use the idea of a one-step look-ahead (OSLA) rule that we have met in today's lecture. Question 8 is subtle, because although it is easy to guess the answer by thinking about a OSLA rule, how do you prove this answer is correct? Hint: use an appropriate theorem about positive programming. Be careful when writing down the dynamic programming equation for Question 8 that you put the expected value (or integral and density function for an exponential random variable) in the right place relative to where you place your max{ , }.
Questions 9-11 are deeper, but that is mostly as regards their lengthy descriptions. The solutions are not actually hard. In Q9, you'll want to use the idea of substituting an equation into itself. One of the reasons I include these questions is to show you the sort of problem that classifies as a research problem in this field. I have recently written a research paper about the problem in Question 11. This paper is to be published in the journal Annals of Operations Research.
One of the CATAM projects this year is The Value Iteration Algorithm for a Stochastic Dynamic Programming Problem. Today's lecture and also Lecture 7 are highly relevant to answering the theoretical parts of this project. Doing this project will help you better understand this course. So I highly recommend this CATAM project to you.
In Section 4.3 (optimal gambling) we saw that timid play is optimal in the gambling problem when the game is favorable to the gambler (p>=0.5).
Perhaps you remember the question on the IB Markov Chains examples sheet 1 that begins "A gambler has £2 and needs to increase it to £10 in a hurry. The gambler decides to use a bold strategy in which he stakes all his money if he has £5 or less, and otherwise stakes just enough to increase his capital, if he wins, to £10."
In the case p<0.5 bold play maximizes the probability that the gambler reaches N pounds. However, bold play may not be uniquely optimal. An example that exhibits this non-uniqueness of optimal strategy is contained in question 13 of the IB Markov Chains example sheet 1. This is a new question that you will not have seen last year. You might like to look at it now.
When it was first discovered that bold play is not uniquely optimal, it was contrary to the accepted wisdom at that time (1955). It impressed Leonard Savage when it was presented to him by a Ph.D. student named Lester Dubins. They developed a collaboration culminating in the famous monograph How to Gamble if You Must (Inequalities for Stochastic Processes). See the fourth paragraph of this obituary of Lester Dubins.
If p=0.5 all strategies are optimal. How could we prove that? Easy. Simply show that given any policy pi the value function is F(pi,i)=i/N and that this satisfies the DP equation. Then apply Theorem 4.1.
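If you would like a numerical sanity check before writing out the proof, here is a tiny simulation (my own sketch) showing that with p = 0.5 both timid play (stake 1) and bold play reach N from i with probability about i/N:

import random

def play(i, N, p, bold):
    # return 1 if the gambler reaches N before going broke
    while 0 < i < N:
        stake = min(i, N - i) if bold else 1
        i += stake if random.random() < p else -stake
    return int(i == N)

random.seed(0)
i, N, p, trials = 2, 10, 0.5, 100000
for bold in (False, True):
    wins = sum(play(i, N, p, bold) for _ in range(trials))
    print("bold" if bold else "timid", wins / trials, "  (i/N =", i / N, ")")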
You can read more about these so-called red and black games at the Virtual Laboratories in Probability and Statistics.
In Section 4.5 (pharmaceutical trials) I introduced an important class of very practical problems. One obvious generalization is to a problem with k drugs, each of which has a different unknown probability of success, and about which we will learn as we make trials of the drugs. This is called a multi-armed bandit problem. The name comes from thinking about a gaming machine (or fruit machine) having k arms that you might pull, one at a time. The arms have different unknown probabilities of generating payouts. In today's lecture we considered a special case of the two-armed bandit problem in which one of the arms has a known success probability, p, and the second arm has an unknown success probability, theta. I will say more about these problems in Lecture 6. The table on page 16 was computed by value iteration. This table is from the book Multi-armed Bandit Allocation Indices (of which I am one of the authors). The amazon.co.uk page will let you look inside this book.
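Roughly how such a table can be produced (a sketch under assumptions of my own, not the exact computation behind the table in the book): say the known arm pays 1 with probability p, the unknown arm pays 1 with probability theta having a Beta(s,f) posterior after s successes and f failures, and rewards are discounted by beta. Once it is optimal to switch to the known arm it stays optimal (the posterior never changes while we use it), so its value is p/(1−beta), and we can value-iterate over the (s,f) states:

from functools import lru_cache

beta, p = 0.9, 0.5          # discount factor and the known arm's success probability
DEPTH = 200                  # crude truncation depth for the recursion

@lru_cache(maxsize=None)
def F(s, f):
    # value with posterior Beta(s, f) for the unknown arm;
    # truncate once many observations have accumulated
    if s + f >= DEPTH:
        return max(p, s / (s + f)) / (1 - beta)
    m = s / (s + f)          # posterior mean of theta
    known = p / (1 - beta)   # use the known arm for ever
    unknown = m * (1 + beta * F(s + 1, f)) + (1 - m) * beta * F(s, f + 1)
    return max(known, unknown)

# e.g. with a uniform prior Beta(1, 1): is it worth trying the unknown arm?
print(F(1, 1), p / (1 - beta))

Comparing the two terms in the max for varying p locates the indifference value of p at each (s, f), which is the kind of quantity such a table records.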
In this lecture we had an important theorem (Theorem 3.1). It had an easy but nontrivial proof. Theorem 3.1 states that F(x) satisfies a DP equation (3.7). It holds under the assumption of D (discounted), N (negative) or P (positive) programming. I explained what these three assumptions mean.
I gave an example, c(x,u) = x, a(x,u) = –x, beta = 1, for which F(x) does not exist. Note that a problem with this data fails to satisfy D, N or P. I have added this example to the bottom of page 10.
Notice that the example in Section 3.5 (of selling a tulip bulb collection) is very much like the secretary problem in Section 2.3. The difference is that now (i) we observe values (not just ranks), (ii) wish to maximize the expected value of the selected candidate (rather than probability of choosing the best), and (iii) the number of offers is infinite, but with a discount factor beta.
Read the final two lines on page 12 of the notes. One way in which discounting by a factor beta can naturally arise is when a catastrophe can occur, with probability 1–beta, bringing the problem to an end.
Subsequent to the lecture I have made some small changes to the notes for this lecture in equations (3.2) and (3.3).
A professor from Lancaster University was telling me at lunch today that he dares not talk of the "secretary problem", for fear he will get complaints that the language of the problem is sexist. He would call it the "optimal choice problem". Funnily enough, I used to do that also. But in recent years I have reverted to the traditional name of "secretary problem". We know that this problem dates from the 1950s. It is probably best to use the name by which the problem is best known to most experts (and how it is named on Wikipedia).
The highlight of this lecture was the Secretary Problem. This is the most famous problem in the field of optimal stopping. It is credited to Merrill M. Flood in 1949, who called it the fiancée problem. It gained wider publicity when it appeared in Martin Gardner's column of Scientific American in 1960. There is an interesting Wikipedia article about it. One of the interesting things said in this article is that in behavioural studies people tend to stop too soon (i.e. marry too soon, make a purchase too soon).
The optimal solution to the secretary problem can also be found as a special case application of Bruss's odds-algorithm, due to Thomas Bruss. There is a nice tripos question based on Bruss's odds algorithm in 2011, Paper 2, 29K. The solution to the secretary problem is contained in the last line of this question. You should be able to do it once we have completed Lecture 5.
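In case you would like to see the odds algorithm in action, here is a small sketch (my own code). For independent indicator events I_1, ..., I_n with P(I_k = 1) = p_k, one computes the odds r_k = p_k/(1 − p_k) and sums them from the last event backwards until the sum first reaches 1, at event s say; the optimal rule is then to stop on the first success at or after s. Taking p_k = 1/k (candidate k is the best of the first k) recovers the classical secretary rule.

def odds_algorithm(p):
    # p[k] = P(I_{k+1} = 1); returns (s, win probability), with s 1-indexed
    n = len(p)
    r = [q / (1 - q) if q < 1 else float('inf') for q in p]
    total, s = 0.0, 1
    for k in range(n - 1, -1, -1):       # sum the odds backwards
        total += r[k]
        if total >= 1:
            s = k + 1
            break
    prob_no_success, odds_sum = 1.0, 0.0
    for k in range(s - 1, n):            # events s, ..., n
        prob_no_success *= (1 - p[k])
        odds_sum += r[k]
    return s, prob_no_success * odds_sum

# the secretary problem with h candidates: p_k = 1/k
h = 100
s, w = odds_algorithm([1 / k for k in range(1, h + 1)])
print(s, w)    # s is about h/e and w is about 1/e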
RW with J. Michael Steele and Thomas Bruss at the 9th German Open Conference on Probability and Statistics, 2010, in Leipzig
Following the lecture two students asked me an intelligent question: How is it that x_t (=0 or 1) and W_{t-1} are independent? They argue that if we have seen X_1,..., X_{t-1} and observed all of these to be near 0, and then see X_t=1000, won't this incline us to think that candidate t is very likely to be the best overall? Yes, that is correct (if we do indeed observe values like X_1,..., X_t). However, in the secretary problem we suppose that numerical scores, like X_1,..., X_t, are not available. All we can learn is how each candidate ranks against the others. Detailed information about how the first t-1 candidates rank against one another does not change the probability that the next candidate will outrank them all. That probability is always 1/t.
Variations of the secretary problem can be created by changing the assumptions about whether or not, in addition to relative ranks, values X_1, ..., X_h can be observed, and whether the objective is to maximize the probability of choosing the best candidate, to minimize the expected rank, or to maximize the expected value of the candidate selected.
A variation that has never been completely solved is the so-called Robbins' Problem. In this problem we do observe values of candidates, say X_1, ..., X_h, and these are assumed to be independent, identically distributed uniform[0,1] random variables. The objective is to minimize the expected rank of the candidate that is selected (best = rank 1, second-best = rank 2, etc). It is known only that, as h goes to infinity, the expected rank that can be achieved under an optimal policy lies between 1.908 and 2.329. This problem is much more difficult than the usual secretary problem because the decision as to whether or not to hire candidate t must depend upon all the values of X_1, ..., X_t, not just upon how X_t ranks amongst them.
Following this lecture you can do questions 1–4 and 10 on Example Sheet 1. Question 2 is quite like the secretary problem (and also has a surprising answer). The tricks that have been explained in today's lecture are useful in solving these questions (working in terms of time to go, backwards induction, that a bang-bang control arises when the objective is linear in u_t, looking at the cross-over between increasing and decreasing terms within a max{ , }, as we did in the secretary problem with max{t/h, F(t)}).
Notice that throughout this blog I use pseudo-LaTeX notation for mathematics. So x_i is "x sub i", alpha is the Greek character alpha, and so forth. (I say pseudo-LaTeX, because in actual-LaTeX one has to put $ signs around things, e.g. $x_i$, and put a \ in front of a Greek character name, e.g. \alpha). LaTeX is the language in which my lecture notes (and almost all modern mathematical papers) are written. When I want to do some algebra or work with mathematical objects I sometimes use Mathematica's notation. So DiagonalMatrix[{1,1−alpha−beta}] is the 2x2 diagonal matrix which has 1 and 1−alpha−beta on the diagonal. Also, {{1,alpha},{1,−beta}} is the 2x2 matrix whose rows are {1,alpha} and {1,−beta}.
Note for next year: I spent 15, 20 and 15 minutes on the three problems in this lecture. It would be better to allow 15, 15, 20.
The first lecture was at 9am on January 19, 2012. It was attended by more than 60 students. I spoke about how I intend to deliver the course – including the fact that I will write this blog as a supplement.
One person afterwards made a valid criticism of my hand-writing. (Yes, I am lazy about crossing my t's and dotting my i's). This is for speed. Remember, I am giving you a near-perfect set of notes, of four A4 pages, for each lecture. The main thing that I want you to do during the lecture is to concentrate on understanding the broad conceptual ideas, and the various tricks I will share with you about how to think in this subject area. You can go over the itty-bitty details at your leisure in the printed notes.
In the first lecture I set up notation for state, control, history, value function, etc, and developed dynamic programming equations for a very general case, a state-structured case, and the Markov decision process case. Please be patient with the notation. It is not as complex as it may first appear. Things like a(x,u,t), F(x_t,t), u_t, and U_t all begin to seem like old friends once you have used them a few times.
From this first lecture you should be taking away the key idea of dynamic programming, and the fact that problems in stages (with separable cost and a finite time horizon) can often be solved by working backwards from the final stage. The minimum length path (stage coach) problem is trivial, but should have made these ideas very intuitive.
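For anyone who would like to see the backwards recursion written out as code, here is a minimal sketch on a made-up stagecoach network (arbitrary costs): the value of a node is F(x) = min over arcs (x,y) of { cost(x,y) + F(y) }, computed from the destination backwards.

# made-up stagecoach network: nodes grouped in stages, arcs with costs
cost = {
    ('A', 'B1'): 2, ('A', 'B2'): 4,
    ('B1', 'C1'): 7, ('B1', 'C2'): 3,
    ('B2', 'C1'): 1, ('B2', 'C2'): 5,
    ('C1', 'D'): 4, ('C2', 'D'): 6,
}

F = {'D': 0}                                         # value at the destination
for stage in [['C1', 'C2'], ['B1', 'B2'], ['A']]:    # work backwards, stage by stage
    for x in stage:
        F[x] = min(c + F[y] for (u, y), c in cost.items() if u == x)

print(F['A'])   # minimum total cost from A to D
# and the optimal first move from A:
print(min(((c + F[y], y) for (u, y), c in cost.items() if u == 'A'))[1])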
As an example of a problem in which the optimal policy should be determined by working backwards from the end, I mentioned that I had once appeared on ITV's Who Wants to be a Millionaire (October 2003). There is a nice Part II exam question on this, including the model solution. You might like to look at this now – simply to see the sort of mathematical problem that this course will enable you to solve.
You can also view the overhead slides for a little presentation called A Mathematician Plays "Who Wants to Be a Millionaire?" which I once gave to a Royal Institution workshop for school children. You might like to see how I made best use of my "Ask the audience" lifeline by employing an idea from statistics. Basically, I asked members of the audience not to vote if they felt at all unsure of the right answer.
Please send me email with comments or corrections on my lectures, the notes or the example sheets. I am very glad to receive comments and suggestions from students studying the course.
This material is provided for students, supervisors (and others) to freely use in connection with this course. Copyright remains with the author.