Lec 21 | MIT 6.00 Introduction to Computer Science and Programming, Fall 2008


OPERATOR: The following content
is provided under a Creative Commons license. Your support will help MIT
OpenCourseWare continue to offer high quality educational
resources for free. To make a donation or view
additional materials from hundreds of MIT courses, visit
MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: So let’s start. I have written a number
on the board here. Anyone want to speculate what
that number represents? Well, you may recall at the end
of the last lecture, we were simulating pi, and I
started up running it with a billion darts. And when it finally terminated,
this was the estimate of pi it gave
me with a billion. Not bad, not quite perfect,
but still pretty good. In fact when I later ran it with
10 billion darts, which took a rather long time to run,
didn’t do much better. So it’s converging very slowly
now near the end. When we use an algorithm like
that one to perform a Monte Carlo simulation, we’re
trusting, as I said, that fate will give us an unbiased sample,
a sample that would be representative of true
random throws. And, indeed in this case,
that’s a pretty good assumption. The random number generator is
not truly random, it’s what’s called pseudo-random, in that if
you start it with the same initial conditions, it will
give you the same results. But it’s close enough, at least for government work, and other useful projects. We do have to think about the
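reliability of that estimate.

As a point of reference, the dart-throwing estimator from last lecture can be sketched like this. This is a reconstruction for illustration, not the course’s actual code, and the function name is mine:

```python
import random

def estimate_pi(num_darts):
    # Throw darts uniformly at the unit square; the fraction that
    # lands inside the quarter circle of radius 1 approximates pi/4.
    in_circle = 0
    for _ in range(num_darts):
        x = random.random()
        y = random.random()
        if x*x + y*y <= 1.0:
            in_circle += 1
    # The quarter circle has area pi/4, so scale up by 4.
    return 4.0 * in_circle / num_darts
```

With a billion darts, a pure Python loop like this runs for a very long time, which is consistent with the run described above.

And we still have to think about the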
question, how many samples should we run? Was a billion darts enough? Now since we sort of all started
knowing what pi was, we could look at it and say,
yeah, pretty good. But suppose we had no clue about
the actual value of pi. We still have to
think about the question of how many samples? And also, how accurate do we
believe our result is, given the number of samples? As you might guess, these two
questions are closely related. That, if we know in advance how
much accuracy we want, we can sometimes use that
to calculate how many samples we need. But there’s still always
the issue. It’s never possible
to achieve perfect accuracy through sampling. Unless you sample the
entire population. No matter how many samples you
take, you can never be sure that the sample set is typical
until you’ve checked every last element. So if I went around MIT and
sampled 100 students to try and, for example, guess the
fraction of students at MIT who are of Chinese descent. Maybe 100 students would be
enough, but maybe I would get unlucky and draw
the wrong 100. In the sense of, by accident, drawing 100 students of Chinese descent, or 100 students not of Chinese descent, which would give me the wrong answer. And there would be no way I
could be sure that I had not drawn a biased sample, unless
I really did have the whole population to look at. So we can never know that
our estimate is correct. Now maybe I took a billion
darts, and for some reason got really unlucky and they
all ended up inside or outside the circle. But what we can know, is how
likely it is that our answer is correct, given
the assumptions. And that’s the topic we’ll spend
the next few lectures on, at least one
of the topics. It’s saying, how can we know
how likely it is that our answer is good. But it’s always given some set
of assumptions, and we have to worry a lot about those
assumptions. Now in the case of our pi
example, our assumption was that the random number generator
was indeed giving us random numbers in the
interval 0 to 1. So that was our underlying
assumption. Then using that, we looked at a
plot, and we saw that after time the answer wasn’t
changing very much. And we use that to say, OK, it
looks like we’re actually converging on an answer. And then I ran it again, with
another trial, and it converged again at
the same place. And the fact that that happened
several times led me to at least have some reason to
believe that I was actually finding a good approximation
of pi. That’s a good thing to do. It’s a necessary thing to do. But it is not sufficient. Because errors can creep
into many places. So that kind of technique,
and in fact, almost all statistical techniques, are good
at establishing, in some sense, the reproducibility of
the result, and that it is statistically valid, and that
there’s no error, for example, in the way I’m generating
the numbers. Or I didn’t get very unlucky. However, there are other places
other than bad luck where errors can creep in. So let’s look at an
example here. I’ve taken the algorithm we
looked at last time for finding pi, and I’ve
made a change. You’ll remember that we were
before using 4 as our multiplier, and here what I’ve
done is, just gone in and replaced 4 by 2. Assume that I made a programming error. Now let’s see what happens
when we run it. Well, a bad thing
has happened. Sure enough, we ran it and
it converged, started to converge, and if I ran 100
trials each one would converge at roughly the same place. Any statistical test I would
do, would say that my statistics are sound, I’ve
chosen enough samples, and for some accuracy, it’s
converging. Everything is perfect,
except for what? It’s the wrong answer. The moral here, is that just
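rerunning the simulation can never expose a bug like this.

To see how persuasive the buggy version is, here is a sketch of it; my reconstruction, with 2 substituted for 4 as in the lecture’s deliberate error:

```python
import random

def estimate_pi_buggy(num_darts):
    # Identical to the correct estimator, except the multiplier
    # is 2 instead of 4, so it converges smoothly to pi/2.
    in_circle = 0
    for _ in range(num_darts):
        x, y = random.random(), random.random()
        if x*x + y*y <= 1.0:
            in_circle += 1
    return 2.0 * in_circle / num_darts
```

Repeated trials cluster tightly around 1.57; the statistics look every bit as sound as they do with the correct multiplier.

The moral here, is that just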
because an answer is statistically valid, does not
mean it’s the right answer. And that’s really important to
understand, because you see this, and we’ll see more
examples later, not today, but after Thanksgiving, comes
up all the time in the newspapers, in scientific
articles, where people do a million tests, do all the
statistics right, say here’s the answer, and it turns out
to be completely wrong. And that’s because it was some
underlying assumption that went into the decision,
that was not true. So here, the assumption is,
that I’ve done my algebra right for computing pi based
upon where the darts land. And it turns out, if I put 2
here, my algebra is wrong. Now how could I discover this? Since I’ve already told you
no statistical test is going to help me. What’s the obvious thing I
should be doing when I get this answer? Somebody? Yeah? STUDENT: [INAUDIBLE] PROFESSOR: Exactly. Checking against reality. I started with the notion that
pi had some relation to the area of a circle. So I could use this value
of pi, draw a circle with a radius. Do my best to measure
the area. I wouldn’t need to get a very
good, accurate measurement, and I would say, whoa, this
isn’t even close. And that would tell me
I have a problem. So the moral here is,
to check results against physical reality. So for example, the current
problem set, you’re doing a simulation about what
happens to viruses when drugs are applied. If you were doing this for a
pharmaceutical company, in addition to the simulation,
you’d want to run some real experiments. And make sure that
things matched. OK, what this suggests, is that
we often use simulation, and other computational
techniques, to try and model the real world, or the
physical world, in which we all live. And we can use data
to do that. I now want to go through another
set of examples, and we’re going to look at the
interplay of three things: what happens when you have data,
say from measurements, and models that at least claim
to explain the data. And then, consequences that
follow from the models. This is often the way science
works, its the way engineering works, we have some
measurements, we have a theory that explains the measurements,
and then we write software to explore the
consequences of that theory. Including, is it plausible
that it’s really true? So I want to start, as
an example, with a classic chosen from 8.01. So I presume, everyone
here has taken 8.01? Or is in 8.01? Anyone here who’s not had an experience with 8.01? All right, well. I hope you know about springs,
because we’re going to talk about springs. So if you think about it, I’m
now just talking not about springs that have water in them,
but springs that you compress, you know, and expand,
and things like that. And there’s typically something
called the spring constant that tells us how stiff
the spring is, how much energy it takes to compress
this spring. Or equivalently, how much pop
the spring has when you’re no longer holding it down. Some springs are easy to
stretch, they have a small spring constant. Some springs, for example,
the ones that hold up an automobile, suspension,
are much harder to stretch and compress. There’s a theory about them
called Hooke’s Law. And it’s quite simple. Force, the amount of force
exerted by a spring, is equal to minus some constant times
the distance you have compressed the spring. It’s minus, because the force
is exerted in an opposite direction, trying
to spring up. So for example, we could
look at it this way. We’ve got a spring, excuse
my art here. And we put some weight on the
spring, which has therefore compressed it a little bit. And the spring is exerting
some upward force. And the amount of force it’s
exerting is proportional to the distance x. So, if we believe Hooke’s Law,
and I give you a spring, how can we find out what
this constant is? Well, we can do it by putting a
weight on top of the spring. It will compress the spring a
certain amount, and then the spring will stop moving. Now gravity would normally have
had this weight go all the way down to the bottom,
if there was no spring. So clearly the spring is
exerting some force in the upward direction, to keep that
mass from going down to the table, right? So we know what that
force is there. If we compress the spring to a
bunch of different distances, by putting, say, different size
weights on it, we can then solve for the spring
constant, just the way, before, we solved for pi. So it just so happens, not quite
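coincidentally, that even a single measurement already illustrates the idea.

At equilibrium the spring force balances gravity, so k times x equals m times g, and one (mass, compression) pair gives an estimate of k. A minimal sketch with made-up numbers:

```python
G = 9.81  # gravitational acceleration, m/s^2

def spring_constant(mass_kg, compression_m):
    # At rest, k * x = m * g, so k = m * g / x.
    return mass_kg * G / compression_m

# A hypothetical 0.5 kg weight compressing the spring 0.15 m
# gives k of about 32.7 N/m.
k = spring_constant(0.5, 0.15)
```

And it just so happens, not quite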
by accident, that I’ve got some data from a spring. So let’s look at it. So here’s some data taken
from measuring a spring. This is distance and force,
force computed from the mass, basically, right? Because we know that these
have to be in balance. And I’m not going to ask you to
in your head estimate the constant from these, but what
you’ll see is, the format is, there’s a distance, and then a
colon, and then the force. Yeah? STUDENT: [INAUDIBLE] PROFESSOR: Ok, right,
yes, thank you. All right, want to repeat that
more loudly for everyone? STUDENT: [INAUDIBLE] PROFESSOR: Right, right, because
the x in the equation — right, here we’re getting
an equilibrium. OK, so let’s look at what
happens when we try and examine this. We’ll look at spring dot py. So it’s pretty simple. First thing is, I’ve got a
function that reads in the data and parses it. You’ve all done more complicated
parsing of data files than this. So I won’t belabor
the details. I called it get data rather than
get spring data, because I’m going to use the same
thing for a lot of other kinds of data. And the only thing I want you
to notice, is that it’s returning a pair of arrays. OK, not lists. The usual thing is, I’m building
them up using lists, because lists have append and
arrays don’t, and then I’m converting them to arrays so
I can do matrix kinds of operations on them. So I’ll get the distances
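and the forces back as two arrays.

The parsing function might look roughly like this. It is a reconstruction, which assumes a "distance : force" format on each line, and uses numpy.array, the same array type pylab provides:

```python
import numpy as np

def get_data(file_name):
    # Build up lists (lists have append, arrays don't), then
    # convert to arrays so element-wise arithmetic works.
    distances, forces = [], []
    with open(file_name) as in_file:
        for line in in_file:
            line = line.strip()
            if not line:
                continue
            d, f = line.split(':')
            distances.append(float(d))
            forces.append(float(f))
    return np.array(distances), np.array(forces)
```

So I’ll get the distances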
and the forces. And then I’m just going to plot
them, and we’ll see what they look like. So let’s do that. There they are. Now, if you believe Hooke’s Law,
you could look at this data, and maybe you
wouldn’t like it. Because Hooke’s Law implies
that, in fact, these points should lie in a straight
line, right? If I just plug in values here,
what am I going to get? A straight line, right? I’m just multiplying
k times x. But I don’t have a straight
line, I have a little scatter of points, kind of it looks
like a straight line, but it’s not. And why do you think
that’s true? What’s going on here? What could cause this line
not to be straight? Have any you ever done
a physics experiment? And when you did it, did your
results actually match the theory that your high
school teacher, say, explained to you? No, and why not? Yeah, you have various kinds of
experimental or measurement error, right? Because, when you’re doing these
experiments, at least I’m not perfect, and I suspect
at least most of you are not perfect, you get mistakes. A little bit of error creeps
in inevitably. And so, when we acquired this
data, sure enough there was measurement error. And so the points are
scattered around. This is something
to be expected. Real data almost never matches
the theory precisely. Because there usually is some
sort of experimental error that creeps into things. So what should we
do about that? Well, what usually people do,
when they think about this, is they would look at this data
and say, well, let me fit a line to this. Somehow, say, what would
be the line that best approximates these points? And then the slope of
that line would give me the spring constant. So that raises the next
question, what do I mean by finding a line that best
fits these points? How do we, fit, in this case,
a line, to the data? First of all, I should ask the
question, why did I say let’s fit a line? Maybe I should have said, let’s
fit a parabola, or let’s fit a circle? Why should I have said
let’s fit a line. Yeah? STUDENT: [INAUDIBLE] PROFESSOR: Well, how
do I know that the plot is a linear function? Pardon? Well, so, two things. One is, I had a theory. You know, I had up there a
model, and my model suggested that I expected it
to be linear. And so if I’m testing my model,
I should fit a line, my theory, if you will. But also when I look at it, it
looks kind of like a line. So you know, if I looked at it,
and it didn’t look like a line, I might have said, well,
my model must be badly broken. So let’s try and see
if we can fit it. Whenever we try and fit
something, we need some sort of an objective function
that captures the goodness of a fit. I’m trying to find, this is an
optimization problem of the sort that we’ve looked
at before. I’m trying to find a
line that optimizes some objective function. So a very simple objective
function here, is called the least squares fit. I want to find the line that
minimizes the sum of observation sub i, the i’th data
point I have, minus what the line, the model, predicts
that point should have been, and then I’ll square it. So I want to minimize
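the sum of those squared differences.

Written out as code, the least squares objective is just a few lines; observed and predicted here are parallel sequences of y-values, and the name is mine:

```python
def sum_squared_errors(observed, predicted):
    # Least squares objective: sum over all data points of
    # (observation_i - prediction_i)**2.
    return sum((obs - pred)**2 for obs, pred in zip(observed, predicted))
```

Again, I want to minimize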
this value. I want to find the line
that gives me the smallest value for this. Why do you think I’m squaring
the difference? What would happen if I didn’t
square the difference? Yeah? Positive and negative errors
might cancel each other out. And in judging the quality of
the fit, I don’t really care deeply — you’re going to get
very fat the way you’re collecting candy here — I don’t care deeply which side the error is on, just that it’s wrong. And so by squaring it, it’s
kind of like taking the absolute value of the error,
among other things. All right, so if we look
at our example here, what would this be? I want to minimize, want to find
a line that minimizes it. So how do I do that? I could easily do it
using successive approximation, right? I could choose a line, basically
what I’m doing is choosing a slope, right? And I could, just like Newton-Raphson, do successive approximation for a while,
and get the best fit. That’s one way to do
the optimization. It turns out that for this
particular optimization there’s something
more efficient. You can actually, there is a
closed form way of attacking this, and I could explain
that, but in fact, I’ll explain something even better. It’s built into Pylab. So Pylab has a function built-in
called polyfit. Which, given a set of points,
finds the polynomial that gives you the best
least squares approximation to those points. It’s called polynomial because
it isn’t necessarily going to be first order, that
is to say, a line. It can find polynomials
of arbitrary degree. So let’s look at the example
here, we’ll see how it works. So let me uncomment it. So I’m going to get k and b
equals Pylab dot polyfit here. What it’s going to do is, think
about a polynomial. I give you a polynomial of
degree one, you have all learned that it’s a x plus b,
b is the constant, and x is the single variable. And so I multiply a by x and I
add b to it, and as I vary x I get new values. And so polyfit, in this case,
will take the set of points defined by these two arrays and
return me a value for a and a value for b. Now here I’ve assigned a to k,
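because for a spring that slope is the spring constant.

The call itself is one line. A sketch with hypothetical measurements; pylab’s polyfit is the same function numpy provides:

```python
import numpy as np

distances = np.array([0.05, 0.10, 0.15, 0.20])  # hypothetical data
forces = np.array([1.6, 3.2, 4.6, 6.5])

# Degree-1 fit: find a and b minimizing the squared error of a*x + b.
k, b = np.polyfit(distances, forces, 1)
predictions = k * distances + b
```

polyfit returns the coefficients from the highest degree down, so for degree one that’s the slope and then the intercept.

So, here I’ve assigned a to k,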
but don’t worry about that. And then, I’m gonna now generate
the predictions that I would get from this k
and b, and plot those. So let’s look at it. So here it said the k is 31.475,
etc., and it’s plotted the line that it’s found. Or I’ve plotted the line. You’ll note, a lot of the points
don’t lie on the line, in fact, most of the points
don’t lie on the line. But it’s asserting that
this is the best it can do with the line. And there’s some points, for
example, up here, that are kind of outliers, that are
pretty far from the line. But it has minimized the error,
if you will, for all of the points it has. That’s quite different from,
say, finding the line that touches the most
points, right? It’s minimizing the
sum of the errors. Now, given that I was just
looking for a constant to start with, why did I bother
even plotting the data? I happen to have known before
I did this that polyfit existed, and what I was really
looking for was this line. So maybe I should have just
done the polyfit and said here’s k and I’m done. Would that have been
a good idea? Yeah? STUDENT: You can’t know without
seeing the actual data how well it’s actually
fitting it. PROFESSOR: Right. Exactly right. That says, well how would I know
that it was fitting it badly or well, and in fact,
how would I know that my notion of the model is sound,
or that my experiment isn’t completely broken? So always, I think, always
look at the real data. Don’t just, I’ve seen too many
papers where people show me the curve that fits the data,
and don’t show me the data, and it always makes
me very nervous. So always look at the data,
as well as however you’re choosing to fit it. As an example of that, let’s
look at another set of inputs. This is not a spring. It’s the same get data function
as before, ignore that thing at the top. I’m going to analyze it
and we’ll look at it. So here I’m plotting the speed
of something over time. So I plotted it, and I’ve done
a least squares fit using polyfit just as before to get a
line, and I put the line vs. the data, and here I’m
a little suspicious. Right, I fit a line, but when
I look at it, I don’t think it’s a real good fit
for the data. Somehow modeling this data as a
line is probably not right. A linear model is not
good for this data. This data is derived
from something, a more complex process. So take a look at it, and tell
me what order of polynomial do you think
might fit this data? What shape does this
look like to you? Pardon? STUDENT: Quadratic. PROFESSOR: Quadratic, because
the shape is a what? It’s a parabola. Well, I don’t know if I
dare try this one all the way to the back. Ooh, at least I didn’t
hurt anybody. All right, fortunately it’s just
as easy to fit a parabola as a line. So let’s look down here. I’ve done the same thing, but
instead of passing it one, as I did up here as the argument,
I’m passing it two. Saying, instead of fitting a
polynomial of degree one, fit a polynomial of degree two. And now let’s see what
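a degree two call looks like.

A sketch on hypothetical speed-versus-time numbers that grow roughly quadratically; only the degree argument differs from the linear fit:

```python
import numpy as np

times = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
speeds = np.array([0.3, 1.2, 4.1, 9.4, 15.8, 25.1])  # made-up data

# Degree 2: find a, b, c minimizing the error of a*t**2 + b*t + c.
a, b, c = np.polyfit(times, speeds, 2)
quadratic_fit = a * times**2 + b * times + c

# Degree 1, for comparison.
linear = np.polyfit(times, speeds, 1)
linear_fit = np.polyval(linear, times)
```

And now let’s see what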
it looks like. Well, my eyes tell me this
is a much better fit than the line. So again, that’s why I wanted
to see the scatter plot, so that I could at least look at
it with my eyes, and say, yeah, this looks like
a better fit. All right, any question about
what’s going on here? What we’ve been looking at is
something called linear regression. It’s called linear because the
relationship of the dependent variable y to the independent
variables is assumed to be a linear function of
the parameters. It’s not because it has to
be a linear function of the value of x, OK? Because as you can see, we’re
not getting a line, we’re getting a parabola. Don’t worry about the details,
the point I want to make is, people sometimes see the word
linear regression and think it can only be used
to find lines. It’s not so. So when, for example, we did the
quadratic, what we had is y equals a x squared
plus b x plus c. The graph vs. x will not be a
straight line, right, because I’m squaring x. But it is, just about, in this
case, the single variable x. Now, when I looked at this, I
said, all right, it’s clear that the yellow curve is a
better fit than the red. It’s a red line. But that was a pretty
informal statement. I can actually look at this
much more formally. And we’re going to look at
something that statisticians call r squared. Which in the case of a linear
regression is the coefficient of determination. Now, this is a big fancy word
for something that’s actually pretty simple. So what r squared is going to be, and this is on your handout, is 1 minus EE over DV. So EE is going to be the errors in the estimation. So I’ve got some estimated
values, some predicted values, if you will, given to me by the
model, either the line or the parabola in this case. And I’ve got some real values,
corresponding to each of those points, and I can look at the
difference between the two. And that will tell me how much
difference there is between the estimated data and the,
well, between the predicted data and the measured
data, in this case. And then I want to divide that
by the variance in the measured data. The data variance. How broadly scattered the
measured points are. And I’ll do that by comparing
the mean of the measured data, to the measured data. So I get the average value of
the measured data, and I look at how different the points
I measure are. So I just want to give to you,
informally, because I really don’t care if you understand
all the math. What I do want you to
understand, when someone tells you, here’s the r squared value,
is, informally what it really is saying. It’s attempting to capture the
proportion of the response variation explained by the
variables in the model. In this case, x. So you’ll have some amount of
variation that is explained by changing the values
of the variables. So if, actually, I’m going to
give an example and then come back to it more informally. So if, for example, r squared
were to equal 0.9, that would mean that approximately 90
percent of the variation in the variables can be explained
by the model. OK, so we have some amount of
variation in the measured data, and if r squared is 0.9,
it says that 90 percent can be explained by the models, and the
other 10 percent cannot. Now, that other 10 percent could
be experimental error, or it could be that, in
fact, you need more variables in the model. That there are what are called
lurking variables. I love this term. A lurking variable is something
that actually affects the result, but is not
reflected in the model. As we’ll see a little bit
later, this is a very important thing to worry about,
when you’re looking at experimental data and you’re
building models. So we see this, for example,
in the medical literature, that they will do some
experiment, and they’ll say that this drug explains
x, or has this effect. And the variables they are
looking at are, say, the disease the patient has, and
the age of the patient. Well, maybe the gender of the
patient is also important, but it doesn’t happen to
be in the model. Now, if when they did a fit,
it came out with 0.9, that says at worst case, the
variables we didn’t consider could cause a 10
percent error. But, that could be big, that
could matter a lot. And so as you get farther from
1, you ought to get very worried about whether
you actually have all the right variables. Now you might have the right
variables, and just the experiment was not conducted well. But it’s
usually the case that the problem is not that, but that
there are lurking variables. And we’ll see examples
of that. So, easier to read than the math, at least by me, is the implementation of r squared. So it’s measured and estimated
values, I get the diffs, the differences, between the
estimated and the measured. These are both arrays, so I
subtract one array from the other, and then I square it. Remember, this’ll do an
element-wise subtraction, and then square each element. Then I can get the mean, by
dividing the sum of the array measured by the length of it. I can get the variance, which is
the measured mean minus the measured value, again squared. And then I’ll return
1 minus this. All right? So, just to make sure we sort
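of have it in one place, here is that function reassembled.

This is a reconstruction of the rSquared code just described; measured and estimated are arrays, so the arithmetic is element-wise:

```python
import numpy as np

def r_squared(measured, estimated):
    # Element-wise subtraction, then square each element.
    diffs = (estimated - measured)**2
    # Mean of the measured data: sum of the array over its length.
    measured_mean = measured.sum() / float(len(measured))
    # Variance: squared differences from the mean, summed.
    variance = ((measured_mean - measured)**2).sum()
    return 1.0 - diffs.sum() / variance
```

So, just to make sure we sort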
of understand the code, and the theory here as well, what
would we get if we had absolutely perfect prediction? So if every measured point
actually fit on the curve predicted by our model, what
would r square return? So in this case, measured and
estimated would be identical. What gets return by this? Yeah, 1. Exactly right. Because when I compute it, it
will turn out that the numerator will be 0, so I’ll get 0 for the ratio, and 1 minus 0 is 1, right? Because the differences
will be zero. OK? So I can use this, now, to
actually get a notion of how good my fit is. So let’s look at speed dot py
again here, and now I’m going to uncomment these two things,
where I’m going to, after I compute the fit, I’m going
to then measure it. And you’ll see here that the r
squared value for the linear fit is 0.896, and for the
quadratic fit is 0.973. So indeed, we get a much
better fit here. So not only does our eye tell
us we have a better fit, our more formal statistical measure
tells us we have a better fit, and it tells
us how good it is. It’s not a perfect fit,
but it’s a pretty good fit, for sure. Now, interestingly enough, it
isn’t surprising that the quadratic fit is better
than the linear fit. In fact, the mathematics of
this should tell us it can never be worse. How do I know it can
never be worse? That’s just, never is a
really strong word. How do I know that? Because, when I do the quadratic
fit, if I had perfectly linear data, then the coefficient of the squared term could be 0. So if I ask it to do a quadratic fit to linear data, and the data is truly perfectly
linear, this coefficient will be 0, and my model will turn
out to be the same as the linear model. So I will always get at
least as good a fit. Now, does this mean that it’s
always better to use a higher order polynomial? The answer is no, and
let’s look at why. So here what I’ve done is, I’ve
taken seven points, and I’ve generated, if you look at
this line here, the y-values, for x in x vals, points
dot append x plus some random number. So basically I’ve got something
linear in x, but I’m perturbing, if you will, my
data by some random value. Something between 0 and 1 is
getting added to things. And I’m doing this so my
points won’t lie on a perfectly straight line. And then we’ll try and
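see what fits of very different degrees do with it.

That experiment, sketched end to end; this is a reconstruction, and the names and the fixed seed are mine:

```python
import random
import numpy as np

random.seed(0)  # any seed works; fixed here so the run is repeatable

x_vals = np.arange(7.0)  # seven points: 0.0 through 6.0
# Linear in x, perturbed by a random value between 0 and 1.
y_vals = np.array([x + random.random() for x in x_vals])

line_coeffs = np.polyfit(x_vals, y_vals, 1)     # degree-1 fit
quintic_coeffs = np.polyfit(x_vals, y_vals, 5)  # degree-5 fit

line_sse = ((np.polyval(line_coeffs, x_vals) - y_vals)**2).sum()
quintic_sse = ((np.polyval(quintic_coeffs, x_vals) - y_vals)**2).sum()
# The degree-5 fit hugs these seven points more tightly; that very
# tightness is what goes wrong when the fit is extrapolated to new
# points, as discussed below.
```

And then we’ll try and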
fit a line to it. And also, just for fun, we’ll
try and fit a fifth order polynomial to it. And let’s see what we get. Well, there’s my line, and
there’s my fifth order polynomial. Neither is quite perfect, but
which do you think looks like a closer fit? With your eye. Well, I would say the red line,
the red curve, if you will, is a better fit, and sure
enough if we look at the statistics, we’ll see it’s 0.99,
as opposed to 0.978. So it’s clearly a closer fit. But that raises the very
important question: does closer equal better, or tighter,
which is another word for closer? And the answer is no. It’s a tighter fit, but it’s not
necessarily better, in the sense of more useful. Because one of the things I
want to do when I build a model like this, is
have something with predictive power. I don’t really necessarily need
a model to tell me where the points I’ve measured lie,
because I have them. The whole purpose of the model
is to give me some way to predict where unmeasured points
would lie, where future points would lie. OK, I understand how the spring
works, and I can guess where it would be if things
I haven’t had the time to measure, or the ability
to measure. So let’s look at that. Let’s see, where’d
that figure go. It’s lurking somewhere. All right, we’ll just
kill this for now. So let’s generate some more
points, and I’m going to use exactly the same algorithm. But I’m going to generate
twice as many points. But I’m only fitting it
to the first half. So if I run this one,
figure one is what we looked at before. The red line is fitting
them a little better. But here’s figure two. What happens when I extrapolate
the curve to the new points? Well, you can see, it’s
a terrible fit. And you would expect that,
because my data was basically linear, and I fit a non-linear
curve to it. And if you look at it you can
see that, OK, look at this, to get from here to here, it
thought I had to take off pretty sharply. And so sure enough, as I get
new points, the prediction will postulate that it’s still
going up, much more steeply than it really does. So you can see it’s a
terrible prediction. And that’s because what I’ve
done is, I over-fit the data. I’ve taken a very high degree
polynomial, which has given me a good close fit, and I can
always get a fit, by the way. If I choose a high enough degree
polynomial, I can fit lots and lots of data sets. But I have reason to
be very suspicious. The fact that I took a fifth
order polynomial to fit seven points should make
me very nervous. And it’s a very important
moral. Beware of over-fitting. If you have a very complex
model, there’s a good chance that it’s over-fit. The larger moral is, beware of
statistics without any theory. You’re just cranking away, you
get a great r squared, you say it’s a beautiful fit. But there was no real
theory there. You can always find a fit. As Disraeli is alleged to have
said, there are three kinds of lies: lies, damned lies,
and statistics. And we’ll spend some more time
when we get back from Thanksgiving looking at how
to lie with statistics. Have a great holiday,
everybody.
