Information Theoretic Metric for Forecasting

Seasonal forecasts, such as those that try to predict the weather three months in the future, carry a great deal of uncertainty; it is therefore often better to use probabilistic models that predict the distribution of possible measurements, rather than a single value, for any given location and time. To evaluate probabilistic forecasts, this work proposes that an information theoretic metric based on information gain is a more natural evaluation tool than conventional skill scores. Information gain offers insight into forecast performance by quantifying the likelihood of the actual temperature occurring in the predicted distribution. Using this metric, we discovered that a physics-based forecast model can be improved by combining it with a simple statistical model based on the past climatological record.

The Climate Prediction Center (CPC) issues physics-based three-month seasonal forecasts within the U.S. Each forecast produces three probabilities: of the temperature (or precipitation) being below, at, or above the seasonal average. In evaluating this forecast, we compare it to a baseline “no information” forecast such as a climatology, which gives the probability of a measurement occurring in a given month based on the measurements in the corresponding month in previous years. The climatology is typically an excellent predictor for a long-term forecast, so “skill” is measured against this baseline.
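A climatology of this kind can be sketched as follows. This is an illustrative construction with synthetic data, not the CPC's actual procedure: category edges are taken as terciles of the historical record, so by design each of the three categories has probability near one third.

```python
import numpy as np

# Hypothetical sketch: build a tercile "no information" climatology for one
# station and one calendar month from past observations (synthetic here).
rng = np.random.default_rng(0)
past_temps = rng.normal(loc=15.0, scale=2.0, size=30)  # 30 past years

lo, hi = np.quantile(past_temps, [1 / 3, 2 / 3])       # tercile boundaries

def climatology_probs(history, lo, hi):
    """Empirical probability of below / at / above the seasonal average."""
    below = np.mean(history < lo)
    above = np.mean(history > hi)
    near = 1.0 - below - above
    return np.array([below, near, above])

p_clim = climatology_probs(past_temps, lo, hi)
print(p_clim)  # roughly [1/3, 1/3, 1/3] by construction
```

Because the boundaries are the terciles of the same record, the baseline is close to uniform; any real skill must therefore beat an essentially flat distribution.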

If o is an observation and P_c is the discrete climatology distribution, then –log2(P_c(o)) is the surprisal. It measures how surprising an event is: it is nearly the number of bits needed to send a message reporting that this event occurred under an optimal (variable-bit) compression scheme. If the surprisal is large, we were not expecting that event; winning the lottery, for example, has a high surprisal. The mean surprisal is the entropy; in a high-entropy distribution, all events are nearly equally surprising. If we have a good probabilistic forecast P_f, then P_f(o) should be relatively large (close to 1) for the outcome we observe, and the surprisal –log2(P_f(o)) will be nearly zero. The difference between the baseline surprisal and the forecast surprisal is the information gain, or IG. We can interpret it as how much information (in bits) the forecast tells us about the event; it should be positive for a good forecast. A perfect forecast yields, on average, an IG equal to the relative entropy of the observed distribution with respect to the climatology. The Information Skill Score (ISS) is the IG normalized by this maximum, so the maximum ISS is 1 and 0 represents no skill over the baseline. IG and ISS can be negative; a negative score indicates that the forecast is worse than the baseline model (climatology).
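These quantities can be sketched for a single three-category event. The numbers below are invented for illustration, and the per-event normalization by the baseline surprisal is a simplification of the averaged normalization described above:

```python
import numpy as np

def surprisal(p, obs):
    """-log2 of the probability that distribution p assigned to category obs."""
    return -np.log2(p[obs])

p_clim = np.array([1 / 3, 1 / 3, 1 / 3])  # baseline climatology
p_fcst = np.array([0.1, 0.2, 0.7])        # probabilistic forecast
obs = 2                                    # "above average" was observed

# Information gain: bits saved relative to the baseline for this event.
ig = surprisal(p_clim, obs) - surprisal(p_fcst, obs)
print(ig)  # log2(0.7 / (1/3)) ~ 1.07 bits

# Simplified per-event skill score: 1 for a perfect forecast (surprisal 0),
# 0 for no skill over the baseline, negative when worse than the baseline.
iss = ig / surprisal(p_clim, obs)
print(iss)
```

If the forecast had instead put only 0.1 on the observed category, the IG would be negative, flagging a forecast worse than climatology for that event.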

Using this framework, IG decomposes naturally into the sum of three components: a) a confidence term of the forecast relative to the baseline, which does not depend on the actual observations and should be high for skilled forecasts; b) a forecast miscalibration term, which is positive for overconfident forecasts and negative for underconfident ones; c) a baseline miscalibration term, which is zero for a well-calibrated baseline and can be non-zero when the baseline assumes a stationary distribution but there is a trend in the data. Several conventional metrics for probabilistic forecasts are formulated within this framework, and their relationship to the ISS is shown.
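The flavor of such a decomposition can be checked numerically. The identity below is my own algebra, not necessarily the paper's exact three-term split: the mean IG separates into a confidence term, the KL divergence D(P_f || P_c), which needs no observations, plus a calibration remainder that vanishes when the observed frequencies match the forecast.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits."""
    return float(np.sum(p * np.log2(p / q)))

p_clim = np.array([1 / 3, 1 / 3, 1 / 3])  # baseline
p_fcst = np.array([0.2, 0.3, 0.5])        # forecast (constant, for clarity)
p_obs  = np.array([0.25, 0.25, 0.5])      # long-run observed frequencies

mean_ig = float(np.sum(p_obs * np.log2(p_fcst / p_clim)))  # expected IG

confidence = kl(p_fcst, p_clim)  # depends only on forecast vs. baseline
remainder = float(np.sum((p_obs - p_fcst) * np.log2(p_fcst / p_clim)))

print(np.isclose(mean_ig, confidence + remainder))  # identity holds
```

The remainder is zero whenever p_obs equals p_fcst, i.e. for a perfectly calibrated forecast, which is what makes it a miscalibration term; the paper further splits it between the forecast and the baseline.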

The CPC forecast model is compared to climatology, as well as to a simple statistical trend model, using the ISS and the confidence score. A hybrid model that combines the CPC model and the trend model is also considered. Using the ISS and the confidence score, the three models are compared and their strengths and weaknesses are interpreted both temporally and spatially. The ISS is shown to be a stricter metric than the other conventional metrics, and it provides insight into the confidence and performance of the models under consideration. The combination of the trend and forecast models is also shown to perform better than either model individually.
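This section does not specify how the hybrid combines the two models; a simple weighted mixture of the category probability vectors is one plausible sketch, with the weight w a free parameter (e.g. tuned on held-out years to maximize mean information gain).

```python
import numpy as np

def hybrid(p_cpc, p_trend, w=0.5):
    """Hypothetical hybrid: a convex mixture of two category distributions.

    w = 1 recovers the CPC forecast, w = 0 the trend model.
    """
    p = w * p_cpc + (1 - w) * p_trend
    return p / p.sum()  # renormalize to guard against rounding

p_cpc = np.array([0.2, 0.3, 0.5])    # illustrative CPC probabilities
p_trend = np.array([0.1, 0.3, 0.6])  # illustrative trend-model probabilities
print(hybrid(p_cpc, p_trend))        # [0.15 0.3  0.55]
```

A convex mixture of valid distributions is itself a valid distribution, so the hybrid can be scored with surprisal, IG, and ISS exactly like its parents.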