Before I start my post, I want to inform you that there might be an unclear succession between the posts I publish. I am doing my best to keep them in order, but as a general rule in life, nothing is guaranteed! I mean there are so many things that I want to write about, and every time I want to write, something comes to my mind which is not necessarily directly related to my previous posts. So please bear with me on this.
With no further delay, let’s look at the problem we are going to discuss today. The initial problem is very simple. Assume we have two points (a and b) in 2d space, and our goal is to find the equation of the line that passes through both points. It is not hard to show that if the two points we are considering have the coordinates and , the equation of the line would be:
for our convenience we write it as , where is usually referred to as the slope of the line and is the y-intercept of the line. We can calculate each one of these two parameters from the above formula.
Looks simple, so let’s make it more challenging. Now assume that instead of two points we have three points. If we are lucky, the third point would also lie on the line passing the first and second points, but unfortunately that is not the case most of the times. So we end up with three points (and even more most of the times), and we would like to find a straight line that passes through these points “as much as possible”.
What can we do? One solution which comes to mind is to pass a line between every two points:
and then calculate the final line to be the average of these lines. In other words we have:
where:
and the corresponding line would look like:
which seems like as a fair estimation. Unfortunately (as always), there is a problem in this method. For the reason consider the figure below:
Here, the line passing through b and c is parallel to the Y-axis, so we have , and therefore the final line equation would be (mathematically this is not a correct formula, but I wrote it to give you an idea). Obviously this is not a good estimation for our data points. Meaning that an estimation which looks like is not a good estimation for the three points we have (Intuitively, we expect that the line passes over point a and then passes somewhere in the middle way between b and c.)
As mentioned before, the problem comes from the points b and c. but what is special about these points that make them troublesome? They both have equal X values. So one solution is that we change our initial rule and take the average between slopes and intercepts of the calculated lines unless they have equal X values.
It is clear that we haven’t solved the problem. Two points might have unequal X values, but the difference between the X values be very small, so that the slope of the connecting line would be very large and it will again dominate our final slope (the same argument is valid for the intercept).
So a better approach is to use a weighted average (instead of regular average) to calculate the slope and intercept of the estimated line. In this approach the weight for each line is the square of the distance between the X values of the two points, and intuitively it makes sense. The closer two points are on the X axis, the less they will contribute to the final slope. To the extreme case that two points have exactly equal X values and therefore their contribution in the final result will be zero. More precisely for the slope of the estimated line we have:
and a similar formula would be correct for the intercept. The final line using this equation looks like:
and if we compare it with our initial average estimation we can observe that the new algorithm provides a better estimate of the three mentioned points. Actually, there are some theoretical proofs that under certain assumptions the mentioned approach produces the optimal result.
What a cool way of looking at line estimation! I appreciate your post; it’s helping me with statistics.