What Is the Sum of Squares?

Sum of squares is a statistical technique used in regression analysis to determine the dispersion of data points. In a regression analysis, the goal is to determine how well a data series can be fitted to a function that might help to explain how the data series was generated. Sum of squares is used as a mathematical way to find the function that best fits the data, that is, the one that varies least from it.

The Formula for Sum of Squares Is

For a set $X$ of $n$ items:

$$
\begin{aligned}
&\text{Sum of squares} = \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2 \\
&\textbf{where:} \\
&X_i = \text{the } i^{\text{th}} \text{ item in the set} \\
&\overline{X} = \text{the mean of all items in the set} \\
&\left(X_i - \overline{X}\right) = \text{the deviation of each item from the mean}
\end{aligned}
$$

Sum of squares is also known as variation.
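To make the formula concrete, here is a minimal sketch in Python; the function name `sum_of_squares` and the sample values are chosen only for this illustration.

```python
# A minimal sketch of the formula above; `sum_of_squares` is a name
# chosen for this illustration, not a standard library function.

def sum_of_squares(values):
    """Return the sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

print(sum_of_squares([2.0, 4.0, 6.0]))  # deviations -2, 0, 2 -> 4 + 0 + 4 = 8.0
```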

What Does the Sum of Squares Tell You?

The sum of squares is a measure of deviation from the mean. In statistics, the mean is the average of a set of numbers and is the most commonly used measure of central tendency. The arithmetic mean is simply calculated by summing up the values in the data set and dividing by the number of values.

Let’s say the closing prices of Microsoft (MSFT) in the last five days were 74.01, 74.77, 73.94, 73.61, and 73.40 in US dollars. The sum of the prices is $369.73, and the mean or average price would thus be $369.73 / 5 = $73.95.
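That arithmetic can be checked with a couple of lines of Python, using the five closing prices quoted above:

```python
# Checking the arithmetic with the five MSFT closes quoted above.
prices = [74.01, 74.77, 73.94, 73.61, 73.40]

total = sum(prices)          # 369.73
mean = total / len(prices)   # 73.946, i.e. about $73.95
print(round(total, 2), round(mean, 2))
```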

But knowing the mean of a measurement set is not always enough. Sometimes, it is helpful to know how much variation there is in a set of measurements. How far apart the individual values are from the mean may give some insight into how well the observations fit the regression model that is created from them.

For example, if an analyst wanted to know whether the share price of MSFT moves in tandem with the price of Apple (AAPL), they can list out the set of observations for the prices of both stocks over a certain period, say one, two, or 10 years, and create a linear model with each of the observations or measurements recorded. If the relationship between both variables (i.e., the price of AAPL and the price of MSFT) is not a straight line, then there are variations in the data set that need to be scrutinized.

In statistics speak, if the line in the linear model created does not pass through all the measurements of value, then some of the variability that has been observed in the share prices is unexplained. The sum of squares is used to measure how well the linear model fits the data, and any variability the model does not explain is referred to as the residual sum of squares.

The sum of squares is the sum of squared deviations, where a deviation is the spread between an individual value and the mean. To determine the sum of squares relative to a regression line, the distance between each data point and the line of best fit is squared and then summed up; the line of best fit is the one that minimizes this value.
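As an illustration, here is a minimal sketch of how the residual sum of squares might be computed for one candidate line; the prices, their pairing, and the slope and intercept are hypothetical and chosen only for this example.

```python
# A sketch of the residual sum of squares for one candidate line y = a*x + b.
# All numbers here (prices, slope, intercept) are hypothetical.

x = [150.0, 151.5, 149.8, 152.3, 153.1]   # hypothetical AAPL closes
y = [73.40, 73.94, 73.61, 74.01, 74.77]   # hypothetical pairing of MSFT closes
a, b = 0.35, 20.5                          # a candidate slope and intercept

# Square the distance between each point and the line, then sum.
residual_ss = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
print(round(residual_ss, 4))               # the variability the line leaves unexplained
```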

How to Calculate the Sum of Squares

Now you can see why the measurement is called the sum of squared deviations, or the sum of squares for short. Using our MSFT example above, the sum of squares can be calculated as:

  • SS = (74.01 - 73.95)² + (74.77 - 73.95)² + (73.94 - 73.95)² + (73.61 - 73.95)² + (73.40 - 73.95)²
  • SS = (0.06)² + (0.82)² + (-0.01)² + (-0.34)² + (-0.55)²
  • SS = 1.0942

Adding up the deviations alone, without squaring, results in a number equal to or close to zero, since the negative deviations almost perfectly offset the positive ones. To get a more meaningful measure, each deviation must be squared before summing. The sum of squares will always be greater than or equal to zero because the square of any number, whether positive or negative, is never negative.
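The whole calculation, including the near-cancellation of the raw deviations, can be reproduced with a short Python sketch, using the mean rounded to $73.95 as in the worked example above:

```python
# Reproducing the MSFT example, with the mean rounded to 73.95 as in the text.
prices = [74.01, 74.77, 73.94, 73.61, 73.40]
mean = 73.95

deviations = [p - mean for p in prices]
print(round(sum(deviations), 2))                  # -0.02: the deviations nearly cancel out
print(round(sum(d ** 2 for d in deviations), 4))  # 1.0942: squaring keeps every term positive
```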

Example of How to Use the Sum of Squares

Based on the results of the MSFT calculation, a high sum of squares indicates that most of the values are farther away from the mean, and hence, there is large variability in the data. A low sum of squares refers to low variability in the set of observations.

In the example above, a sum of squares of 1.0942 shows that the variability in the stock price of MSFT over the last five days is very low, so investors looking to invest in stocks characterized by price stability and low volatility may opt for MSFT.

Key Takeaways

  • The sum of squares measures the deviation of data points away from the mean value.
  • A higher sum-of-squares result indicates a large degree of variability within the data set, while a lower result indicates that the data does not vary considerably from the mean value.

Limitations of Using the Sum of Squares

Making an investment decision on what stock to purchase requires many more observations than the ones listed here. An analyst may have to work with years of data to know with higher certainty how high or low the variability of an asset is. Also note that as more data points are added to the set, the sum of squares grows larger simply because more squared deviations are being summed, even if the underlying variability stays the same.

The most widely used measurements of variation are the standard deviation and variance. However, to calculate either of the two metrics, the sum of squares must first be calculated. The variance is the average of the sum of squares (i.e., the sum of squares divided by the number of observations, or by one less than that number when working with a sample). The standard deviation is the square root of the variance.
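Here is a minimal sketch of how the variance and standard deviation build on the sum of squares, using the population convention of dividing by the number of observations:

```python
# Variance and standard deviation built on the sum of squares
# (population convention: divide by n; a sample variance would use n - 1).
prices = [74.01, 74.77, 73.94, 73.61, 73.40]
n = len(prices)
mean = sum(prices) / n

ss = sum((p - mean) ** 2 for p in prices)   # sum of squares
variance = ss / n                           # average squared deviation
std_dev = variance ** 0.5                   # square root of the variance
print(round(ss, 4), round(variance, 4), round(std_dev, 4))
```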

There are two methods of regression analysis that use the sum of squares: the linear least squares method and the non-linear least squares method. The least squares method refers to the fact that the regression function minimizes the sum of the squared differences between its fitted values and the actual data points. In this way, it is possible to draw a function which statistically provides the best fit for the data. Note that a regression function can either be linear (a straight line) or non-linear (a curving line).
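To illustrate the linear case, here is a minimal sketch using the closed-form slope and intercept that minimize the sum of squared residuals; the AAPL prices and their pairing with the MSFT closes are hypothetical, purely for demonstration.

```python
# A sketch of the linear least squares method: the closed-form slope and
# intercept below minimize the sum of squared residuals for the paired data.
# The AAPL prices and their pairing with the MSFT closes are hypothetical.

aapl = [150.0, 151.5, 149.8, 152.3, 153.1]   # x: hypothetical AAPL closes
msft = [73.40, 73.94, 73.61, 74.01, 74.77]   # y: MSFT closes, hypothetical pairing

n = len(aapl)
mean_x = sum(aapl) / n
mean_y = sum(msft) / n

slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(aapl, msft))
         / sum((x - mean_x) ** 2 for x in aapl))
intercept = mean_y - slope * mean_x

# Residual sum of squares for the fitted line: the variability left unexplained.
residual_ss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(aapl, msft))
print(round(slope, 4), round(intercept, 4), round(residual_ss, 4))
```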