Analog forecasting, statistical forecasting and fuzzy logic

Here is an open discussion about analog forecasting, statistical forecasting and fuzzy logic. The main problem discussed is short-term forecasting of cloud ceiling and visibility. All solutions depend on large archives of airport weather observations. Most of this discussion took place in early 1999. You are welcome to join in. Just write to me at bjarne.hansen@canada.com.

Thanks to everyone for participating.

William Burrows
William Hsieh
Robert Vislocky
Lawrence Wilson.

Bjarne Hansen, March 1999



Hansen

[To Vislocky] I read with great interest your article about the OBS-based forecasting technique in Weather and Forecasting. Based on my intuition from aviation forecasting, I believe that upstream airport weather observations are much more predictive for short-term ceiling and visibility than NWP.

Footnote: Numerical Weather Prediction (NWP) is useful for forecasting cloud ceiling and visibility to the extent that it provides useful predictors of large-scale, continuous weather elements (e.g., wind and humidity). NWP coupled with statistical forecasting is called Model Output Statistics (MOS). It is reasonable to suppose that NWP could be coupled with analog forecasting in a similar manner, to achieve similar benefits.

So I'm not surprised to see your results are better than MOS. I believe you have found a vital piece of the puzzle, using surrounding obs. I'm working on another piece of the puzzle, and would be interested in your thoughts. I am developing a ceiling and visibility prediction system for a master of computer science thesis. It applies analog forecasting to the problem.

Given attributes of the present case, a series of obs, it traverses the weather archive (300,000 METARs) and finds the k nearest neighbors (k-nn). Predictions are based on the median ceiling and visibility values of the k-nn.
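
To make the mechanics concrete, here is a minimal sketch of that k-nn step in Python. The array names, the plain Euclidean distance on pre-scaled attributes, and k = 100 are illustrative assumptions; the actual thesis system uses the fuzzy similarity function described later in this discussion.

    import numpy as np

    def knn_analog_forecast(archive_attrs, archive_outcomes, present_attrs, k=100):
        """Find the k archived cases most similar to the present case and
        forecast the median of their outcomes (e.g., ceiling and visibility
        at some later hour).  Similarity here is plain Euclidean distance
        on pre-scaled attributes -- a stand-in for the fuzzy measure."""
        # distance from the present case to every archived case
        dists = np.linalg.norm(archive_attrs - present_attrs, axis=1)
        nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbours
        return np.median(archive_outcomes[nearest], axis=0), nearest

    # toy example: 300,000 archived cases, 6 scaled attributes,
    # outcomes = [ceiling (hundreds of ft), visibility (miles)] 3 hours later
    rng = np.random.default_rng(0)
    attrs = rng.normal(size=(300_000, 6))
    outcomes = rng.lognormal(mean=1.5, sigma=1.0, size=(300_000, 2))
    forecast, members = knn_analog_forecast(attrs, outcomes, rng.normal(size=6))
    print("median ceiling/visibility of the 100 analogs:", forecast)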

I'd like to test the system against the strongest competitor. You said "persistence climatology is widely recognized as a formidable benchmark for very short range prediction of ceiling and visibility." Do you know of a specific reference that compares conditional persistence and simple persistence for very-short-range cig & vis prediction?

As a meteorologist, I believe your claim is correct. But I'd like to run some experiments quickly. Simple persistence would be easier to rig together. There are papers to support the claim that simple persistence is a strong competitor. It is easy to understand why conditional persistence should be a bit better. (Imagine if we used persistence to forecast tornadoes. Conditional persistence would tell us, reassuringly, that tornadoes do not tend to stick around in one place for one hour.) But has anyone bothered to test our assumption that conditional persistence is better than simple persistence for very-short-range cig & vis forecasts? Significantly better? After all, we're talking stratus and fog, not very volatile from hour to hour.

Furthermore, I think "conditional persistence" is applicable to forecasting only if TAFs are regarded as fundamentally probabilistic forecasts of categories. Would you agree?

The k-nn approach to forecasting finds and uses the most analogous series of METARs and forecasts a median of that "ensemble." Categorization takes place only after the forecasts are made, when they must be compared against the actuals using standard statistics.

Vislocky

In regard to your question on persistence climo (PC) versus persistence (PER), I don't recall any references per se that assessed this issue (just my own internal pilot study results). However, I'll offer the following points:

  1. If the verification is probabilistic and uses the squared error as the verification measure, PC will GREATLY outperform PER. This is because PC will use the whole range of probabilities whereas PER will only forecast 100% and 0% probabilities, leading it to a harsh death. [a small scoring sketch follows this list]
  2. At locations where climatological ceiling heights vary considerably with the time of day, such as along the west coast of the US & Canada, PC will substantially outperform PER, even if the verification is non-probabilistic. For example, consider the case of a 9am forecast valid for 1pm at SFO in the summertime. PER will forecast mostly low ceilings (since it's usually cloudy at 9am); however PC will know that the clouds will often break by 1pm and will forecast higher (or no) ceilings.
  3. If the verification is categorical, and the lead time is very short (< 3-4 hours), and there is little time of day dependence on the ceiling climatology, then PER will verify quite closely to that of PC. That is, PC will only have a razor's edge advantage. After 4 hours though, PC will begin to pull away from PER simply because PC knows that the clouds will eventually dissipate with time. By definition, PC can never perform worse than PER over a large enough sample. I don't agree with the statement that conditional persistence is only applicable when forecasting probabilities since the probabilities can always be converted over to a single best category and then verified as such.
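
As a small illustration of point 1 above, the following Python sketch scores persistence (PER) against persistence climatology (PC) with the squared-error (half-Brier) score on a synthetic two-state record; the transition probabilities are invented, not taken from any real station.

    import numpy as np

    rng = np.random.default_rng(1)

    # synthetic hourly record of a binary event: "ceiling below alternate limits",
    # generated as a persistent two-state Markov chain (made-up probabilities)
    p_stay = {0: 0.90, 1: 0.80}           # P(same state one hour later)
    obs = [0]
    for _ in range(50_000):
        s = obs[-1]
        obs.append(s if rng.random() < p_stay[s] else 1 - s)
    obs = np.array(obs)

    now, later = obs[:-1], obs[1:]        # predictor = state now, verify 1 h later

    # PER: forecast probability is 1 or 0 (the event persists exactly)
    per_prob = now.astype(float)

    # PC: forecast probability is the climatological frequency of the event
    #     conditioned on the current state (tabulated from the same record
    #     here, purely for illustration)
    pc_prob = np.where(now == 1, later[now == 1].mean(), later[now == 0].mean())

    brier = lambda p, o: np.mean((p - o) ** 2)
    print("Brier score, PER:", round(brier(per_prob, later), 4))
    print("Brier score, PC :", round(brier(pc_prob, later), 4))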

In regard to your proposed work, I also offer the following suggestions:

  1. You may wish to try the CART [Classification and Regression Trees] methodology instead of a k-nn/analog approach. CART tends to be more robust in that it determines which variables & stations are most useful. With k-nn, you may select an analog by finding a "similar" case where the similarity occurred at stations or variables that don't matter.
  2. Make sure you leave aside a large, multi-year database for independent verification of your forecast model. As a reviewer for many articles submitted to W&F, too often I see someone spending lots of time to develop a forecast model but only verifying it on a tiny independent data set, or worse yet, on the same data set used to develop the system.
  3. If the forecast uses the median of the k-nn as the official model forecast, then don't verify the forecasts using the threat score or some other biased-weighted scoring system since they are inconsistent with the philosophy of using the median of the k-nn's as the forecast. Instead you'll need to use a mean absolute error or a percent correct type of scoring system to be consistent.
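
And to illustrate point 3, a short Python check that the median of an analog ensemble is the point forecast consistent with mean absolute error, while the mean is the one consistent with squared error; the ensemble values are synthetic.

    import numpy as np

    rng = np.random.default_rng(2)
    # a skewed "ensemble" of 100 analog ceiling values (hundreds of feet)
    ensemble = rng.lognormal(mean=1.0, sigma=0.8, size=100)

    candidates = np.linspace(ensemble.min(), ensemble.max(), 2001)
    mae = [np.mean(np.abs(ensemble - c)) for c in candidates]
    mse = [np.mean((ensemble - c) ** 2) for c in candidates]

    print("median of ensemble        :", round(np.median(ensemble), 2))
    print("value minimizing MAE      :", round(candidates[int(np.argmin(mae))], 2))
    print("mean of ensemble          :", round(ensemble.mean(), 2))
    print("value minimizing sq. error:", round(candidates[int(np.argmin(mse))], 2))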

Hansen

I'll use simple persistence for benchmarking in my MCS thesis, because it's simple and quite accurate for the 0-2 hour range.

I would argue about only one minor point. I claim, "Persistence climatology is applicable to forecasting only if TAFs are regarded as fundamentally probabilistic forecasts of categories."

Vislocky

Not necessarily true, since one can compute persistence climo as a continuous probability density function of outcomes, at which point it can be converted to a categorical forecast, probability forecast, or whatever form desired.

Hansen

When I say, "Persistence climatology is applicable to forecasting only if TAFs are regarded as fundamentally probabilistic forecasts of categories" I mean that, in using the PC technique, one is implicitly assuming that weather is categorical, as it is perceived, represented, modeled, and predicted. But, cig and vis are continuous. There are gradual differences between the predictands (e.g., OVC001, OVC002, OVC003) and, supposedly, corresponding differences between the causative predictors (e.g., wind direction, humidity). Using a fuzzy k-nn analog technique, the k most relevant predictors (series of obs) are used. For example, k-nn does not use 1000 cases with wind directions from a discrete "east" sector, rather it uses 100 cases with wind directions from 90 +/- 10 degrees, each weighted according to nearness to 90 degrees. Subtle local effects that dominate weather are expressed this way in the database.
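
A minimal sketch of the kind of fuzzy weighting described here, assuming a triangular membership function with a half-width of 10 degrees centred on the present wind direction, and a circular difference so that 350 and 010 degrees count as 20 degrees apart; the half-width is illustrative, not the value used in the thesis system.

    import numpy as np

    def circular_diff(a, b):
        """Smallest absolute difference between two wind directions in degrees."""
        d = np.abs(a - b) % 360.0
        return np.minimum(d, 360.0 - d)

    def triangle_membership(archived, present, half_width):
        """1.0 at an exact match, falling linearly to 0.0 at +/- half_width."""
        return np.clip(1.0 - circular_diff(archived, present) / half_width, 0.0, 1.0)

    archived_dirs = np.array([80.0, 88.0, 90.0, 95.0, 100.0, 120.0, 270.0])
    weights = triangle_membership(archived_dirs, present=90.0, half_width=10.0)
    print(dict(zip(archived_dirs.tolist(), np.round(weights, 2).tolist())))
    # cases within 90 +/- 10 degrees get weights between 0 and 1; the rest get 0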

Maybe I'm making too fine a point of this. (?) I'm approaching the problem as someone who's worked operationally for 10 years, and studied AI for 3. Forecasters don't think in terms of categories. They understand the importance of categories, especially when the miserable end-of-month statistics are waved before them, but they don't look at a weather map and discretize weather. Weather is continuous, and forecasters think in continuous terms. Like fuzzy logic.

The cognitive-science-based notion that spawned case-based reasoning (CBR) is that we don't think at all, we remember similar cases. But I digress.

I agree, you can translate categorical forecasts into pseudo-values of cig and vis, middle of the range. And thereby achieve the appearance of precise values, rather than categories like LOW IFR, HIGH IFR, etc. When you do this, you are designing the system with the primary motive of optimizing the statistical results. That, by itself, is a worthwhile aim. It's one of my aims too.

A possible advantage of analogs is user acceptability. Case-based reasoning reassures users that solutions are composed from actual cases. Aviation forecasters (I've been one) are wary of black-box guidance, e.g., take-it-or-leave-it regression forecasts such as "60% VFR, 40% alternate." I've confirmed this with others. The analog approach would present a different form of guidance: an ensemble of 100 analogous series of exact values of cig and vis drawn from the archive. I suppose it could be plotted like NWP-element spaghetti plots. Or, to tidy it up, simply plot percentiles, 10-, 30-, 50-, 70-, and 90-%iles. Forecasters could "get more out of this," such as inherent confidence in certain ranges of cig and vis, than out of 60/40 categorical forecasts. The essential bimodality of cig and vis probabilities would appear as a well-defined fork (the more bimodal, the better defined). Seem reasonable?
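
A sketch of how that percentile guidance could be distilled from the analog ensemble, assuming the 100 analog visibility traces are already in hand; the traces below are synthetic.

    import numpy as np

    rng = np.random.default_rng(3)
    # 100 analog cases, each a 6-hour trace of visibility (statute miles), synthetic
    traces = rng.lognormal(mean=0.5, sigma=0.9, size=(100, 6)).clip(max=15.0)

    percentiles = [10, 30, 50, 70, 90]
    bands = np.percentile(traces, percentiles, axis=0)   # shape: (5, 6)

    print("pct  " + " ".join(f" +{h}h" for h in range(1, 7)))
    for p, row in zip(percentiles, bands):
        print(f"{p:>3}%  " + " ".join(f"{v:4.1f}" for v in row))
    # a bimodal ceiling/visibility climatology would show up as a wide gap
    # between adjacent percentile bands -- the "fork" mentioned above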

Vislocky

I see what you're saying, but the display method (i.e., whether you output spaghetti plots, quantile values, or probabilities of VFR/IFR) is entirely independent of the statistical method that is used (i.e., regression or analog). I've developed regression-based forecast systems that DO predict quantile values rather than probabilities for different categories. In fact, there are quantile regression methods in existence (see Stata's homepage: www.stata.com).

Burrows

Quantile forecasts: Watch out here. These may add to the user's confusion about the forecaster's TAF, because the forecaster tries to cover all the bases with probability. An aviation user generally needs hard info (e.g., a "sharp" forecast) because he has a plane to land. You force him into a cost-benefit analysis. This approach is now tried in public forecasts, but in my humble opinion it "vagueifies" the forecasts to the user and allows the forecaster to get away with less thought about a weather situation. What does 20% chance of rain really mean?

Wilson

[Regarding SHORT, see citation above] Until the start of TAF Tools [a current Canadian (AE) project to develop TAF production software] SHORT was the only objective forecasting technique we ever developed specifically for aviation variables as far as I know.

Like an NN, all predictors and predictands are binary, but like common statistical methods, it uses regression (REEP) [Regression Estimation of Event Probabilities]. It incorporates the climatology of the full observation history of a station, more than 30 years. Through the use of "interactive predictors" (products), it is also non-linear. SHORT beats persistence at all projection times tested, including 2 hours, in terms of the Brier score. It is a probabilistic forecast, where the probabilities are converted optimally into "best" category forecasts. For VRBL and COND, we retain some of the probability information.
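
For readers unfamiliar with REEP, here is a rough Python sketch of the idea: ordinary least squares applied to binary predictors and a binary predictand, with a product term playing the role of an "interactive predictor." It is a schematic of the general method, not the SHORT code itself, and the predictors and relationships are invented.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 20_000

    # invented binary predictors: "ceiling < 1000 ft now", "fog reported upstream"
    low_cig_now = rng.random(n) < 0.3
    fog_upstream = rng.random(n) < 0.2
    # binary predictand: "ceiling < 1000 ft in 2 hours" (synthetic relationship)
    p_event = 0.05 + 0.5 * low_cig_now + 0.3 * (low_cig_now & fog_upstream)
    event = rng.random(n) < p_event

    # REEP design matrix: constant, the binary predictors, and their product
    # (the "interactive predictor" that makes the model non-linear in the inputs)
    X = np.column_stack([np.ones(n), low_cig_now, fog_upstream,
                         low_cig_now & fog_upstream]).astype(float)
    y = event.astype(float)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

    prob = np.clip(X @ coeffs, 0.0, 1.0)     # event probabilities, clipped to [0, 1]
    print("regression coefficients:", np.round(coeffs, 3))
    print("Brier score of REEP probabilities:", round(np.mean((prob - y) ** 2), 4))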

For analogues, Nathan Yacowar ran an analogue forecasting technique for many years at CMC, used mainly for the medium range forecasts and for the 15 day forecasts. I don't think it still runs, but you might be interested in looking at that. It is very old, and I don't think Nathan ever published it in the open literature. If you can get your hands on the Preprints of the 4th conference on Probability and Statistics in Atmospheric Sciences, Tallahassee, Fla, 1975, there is a paper in that volume.

Hansen

Apart from user acceptability, I can't think of any other major advantages or disadvantages of analog vs. statistics. Analog can also absorb the latest data and produce forecasts accordingly in seconds.

Vislocky

As can regression & CART.

The main advantage in analog methods (over traditional statistical methods) is the direct incorporation of nonlinear relationships (both interactive and curvative) between predictors and predictand (assuming your database of past cases is very large [if your database is small, the curse of dimensionality will get you]). In contrast, most statistical methods assume some form of additivity between predictors & predictand and fail to account for interactions [note: for some predictands interactive effects are small, for some they are quite large, it just depends]. Moreover, unless you apply a non-parametric regression system, accounting for curvative relationships between individual predictors & predictand is also difficult. CART accounts for both of the above effects; however, the primary drawback of CART is that it does not handle continuous predictands well. GAM handles continuous & categorical predictands and is also nonparametric, but the user must pre-specify the interactive effects between predictors & predictand. Analog can do it all if the number of past cases is very large.

Burrows

NFIS and CART combined are called CANFIS. I am looking at new "Probabilistic Neural Networks" which will classify categorical predictands. Actually, CART handles continuous predictands rather well, if they have a highly linear relationship to some predictors. Even though the answer is non-continuous, it will generate many nodes in these cases, so that the discontinuities are not great. In fact it will handily beat linear regression unless the relationship is so linear that there's no point doing CART. Manon Faucher and I have a paper accepted in J. Clim. Res. on this kind of a problem. As far as a modelling technique goes, whether you do GAM or NFIS or MARS (Multiple Adaptive Regression Splines) may be a moot point. The key is predictor selection to put into these modelling methods, and I have seen nothing to equal CART for this.

Hansen

On the other hand [regarding user acceptability], users are finicky. They already complain about having "too much information." Analog plots might overwhelm them, compared to black-box solutions (?).

I'm sure using a field of OBS is a big advance over MOS. In concept, the analog technique could be extended to do the same. Rookie forecasters soon learn that certain sequences of events amongst neighboring stations are predictive. Analog could be programmed to detect "certain sequences."

Vislocky

I think the fact that analog can handle all nonlinearities (both interactive effects and curvative effects) and can deal with continuous data are the main advantages of analog over stat methods (assuming your historical dataset is large). CART can deal with the interactions, but works poorly with continuous data (i.e., forecasting continuous temperature values using continuous thickness values). Linear regression works well with continuous data, but the user must pre-specify the nonlinearities. GAM can handle the curvative relationships between individual predictors & predictand, but interactive effects must be pre-specified. Neural networks are supposed to be the king of all comp/stat methods, being able to handle all nonlinearities, but a host of other problems crop up when using NNets, such as overfitting, selecting appropriate predictors, etc. Moreover, NNets can incorporate all nonlinearities ONLY when there is an unlimited supply of hidden layer nodes...a fact that's often overlooked by NNet supporters. Try modeling a sine wave with a neural net using traditional sigmoid limiters...can't be done accurately.

Burrows

I don't have any specific opinions about which is better [analog forecasting or statistical forecasting]. Rather, each has a place, and in some cases one is better than the other, but the roles may reverse. Analogues are typically used as part of the process of making long-range forecasts, since statistical models will often not be as accurate (it's a choice between 2 not-very-accurate methods for long-range). A conditional persistence model will work very well in stable weather situations, but the skill score vs climatology will be ~0. Statistical models will work well here too, but no better. I think non-parametric statistical models should outperform analogues in general when there are many cases in the learning data base, because they will bridge the gap between several predictor values in a cluster of cases, whereas a nearest neighbor cluster of analogues will force you to a decision about what weight to give each predictor of each member of a cluster. The analogue forecast approach may only work well if you get clever at this. You also may not pull out an analogue that exactly will repeat itself when you forecast its subsequent evolution. However, when there is not a large learning data base, an analogue forecast based on 1 or 2 cases may be all you can do; a statistical model could not be built here (e.g., freezing rain in western Canada may get 1 or 2 occurrences in 10 years at a site).

Hansen

Analog forecasting is model-free. This is, potentially, an advantage. Models can be the weak link in any chain of logic. If we overlook some important factor, models mysteriously fail.

Vislocky

Being NWP-model free is a great advantage for short-range forecasting. NWP models take a while to run and suffer from certain problems in the first 6 hours (i.e., spin-up, etc.). By the time the NWP model is completed & disseminated, several hours may have passed, especially with mesoscale NWP systems. In the short time range, it's best to look at the newer obs & radar than a stale NWP forecast. Plus, NWP models don't necessarily forecast ceilings & visibilities, unless there's a MOS attached to the NWP system.

However, being NWP-model free is not an advantage of analog over other statistical methods such as regression. They too, of course, can be model free if they only use obs as predictors.

Hansen

Analog forecasting is model-free. The OED defines model as "a simplified (often mathematical) description of a system etc., to assist calculations and predictions." Based on this definition, statistical forecasting is more model-like than analog forecasting.

Vislocky

Don't agree with this. Just because an object is red doesn't mean it's an apple. All forecasting methods are inherently "models". The difference is how they're constructed.

Hansen

If an item in the store, stocked with 300,000 items, most matches the description of an apple, chances are it's an apple.

Statistics, by its very nature, simplifies. It reduces megabytes of detailed data into a kilobyte [I guess] of model: linear regression equations. Continuous vectors are condensed into categories. Nonlinear relationships that may exist within the data are modeled with linear relationships.

Whereas analog forecasting retains the megabytes of data, continuous vectors along with any inherent nonlinear relationships, and applies the model-free CBR/k-nn approach to forecasting. Rather than relying on a model, k-nn relies on an expertly-tuned, fuzzy-logic-based similarity measuring function [This is the critical component of the system, which, depending on how it's designed, makes it succeed or fail.]

Vislocky

The problem with this argument is that you're representing "Statistics" by the linear regression technique, which is not fair. There are plenty of highly complex (not simplified) nonlinear techniques that can be classified as "Statistics". In fact, the analog method is actually a "Statistics" method (i.e., taking the median of an optimal subset of past cases).

Hansen

Taking the median of k-nn is just a reserved option for now, to safeguard against the outlier effect. k = 1 is an option.

One measure that can describe the k-nn set is "overall nearness." A high value would indicate typicality of the present case, and confidence in the prediction. Low value -> k-nn far away -> low confidence.

Selection of k could be made a run-time option to optimize the result somehow, e.g., using at least 10 neighbors and at most 100, and cutting off admission into the set at an arbitrary similarity threshold.
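
A sketch of that run-time rule, assuming each archived case already carries an overall-nearness score between 0 and 1; the threshold and the 10-to-100 limits below are purely illustrative.

    import numpy as np

    def select_analog_set(nearness, threshold=0.6, k_min=10, k_max=100):
        """Keep every case whose overall nearness clears the threshold,
        but never fewer than k_min nor more than k_max cases."""
        order = np.argsort(nearness)[::-1]            # best analogs first
        above = int(np.sum(nearness >= threshold))
        k = int(np.clip(above, k_min, k_max))
        chosen = order[:k]
        confidence = float(nearness[chosen].mean())   # crude confidence measure
        return chosen, confidence

    rng = np.random.default_rng(5)
    nearness = rng.beta(2, 5, size=300_000)           # synthetic nearness scores
    chosen, confidence = select_analog_set(nearness)
    print(len(chosen), "analogs selected, mean nearness =", round(confidence, 2))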

"Training" the fuzzy analog system was straight-forward. I specified 12 integers, each of which specified breadth of 12 fuzzy triangle functions to be used for similarity testing of 12 weather elements. Lowest values for wind direction, higher for spread, higher yet for temperature. That's all. It worked well even without going back to try and fine tune the values or tossing out predictors. It did not need training in a neural network sense, it was trained by a forecaster who discriminated weather. What's significant at this point is that it worked as it did with so little effort, I'm sure it could be made even better with OBS, CART, and NWP.

Neural networks have their weaknesses, the worst of which, in my opinion, is local minima. I took a grad course on neural networks in which I built a backpropagation feedforward network from scratch, along with some other types of networks. It helped me to understand what the term non-linear regression means. Linear regression says y = AX, whereas a feedforward neural network says y = W f(VX): instead of a single matrix operation on X, there are two matrix multiplications with a sigmoid squashing function f applied in between, and that is what makes it non-linear. It enables the network to model anything with arbitrary precision. Makes it a "universal approximator" [Zurada]. You could model your fingerprint if you wanted. But then you would have overtrained the network, and there are safeguard procedures to avoid that, like you described, the reserved data set.
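
For concreteness, a few lines of Python showing the structural difference just described: a linear model applies a single matrix to the input, while a one-hidden-layer network squashes the intermediate result with a sigmoid before the second matrix is applied. The weights are random, since the point here is the structure, not training.

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.normal(size=5)                     # an input vector of 5 predictors

    # linear regression: y = A x
    A = rng.normal(size=(1, 5))
    y_linear = A @ x

    # one-hidden-layer feedforward network: y = W f(V x), f = logistic sigmoid
    V = rng.normal(size=(8, 5))                # input-to-hidden weights
    W = rng.normal(size=(1, 8))                # hidden-to-output weights
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    y_net = W @ sigmoid(V @ x)                 # the squashing makes it non-linear

    print("linear output :", y_linear)
    print("network output:", y_net)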

The impression the course left me with was, a neural network handled carelessly can be very misleading. It has the potential to produce very accurate results or very inaccurate ones. I heard a story, from William Hsieh, that there was a shoot-out recently and neural networks took both first and last place. Moral of the story: garbage in garbage out. Careful design and expertise are vital.

Troublesome neural network problems are not knowing how to:
- determine optimal structure
- choose the best predictors
- avoid local minima... or even interpret local minima
- model rare events
- establish objective confidence values in output
- provide scrutable explanations along with output

Neural networks still look like black boxes to me.

Hsieh

A useful book is "Time Series Prediction" edited by Andreas S. Weigend and N.A. Gershenfeld (1994), which arose from the time series prediction contest organized by the Santa Fe Institute. This is the competition where the best models (and the worst models) were neural net models. In one Lorenz-like time series, a delayed-coordinate state-space model (a type of analogue model) tied the neural network for first place. Actually the neural network had the upper hand for a while, but later on the analogue was more accurate (pp. 8-9 of this book). The reason is that the analogue method captured the underlying attractor accurately, so it did not suffer the eventual 'drift-off' as the forecasts extended into the distant future.

Hansen

Peripherally, I've thought about "chaos theory," about how it may apply to the problem of searching for temporal analogs in the weather database space. But I've dropped that line of thought for now, at least until I finish my thesis. "Chaos" is a huge topic by itself, and would make the thesis unnecessarily complex.

Dawson

[A professor of mathematics and statistics cautiously offers his opinions.]

"Disclaimer: I'm not a specialist in this area; this represents informal suggestions about fairly well-known mathematical ideas that might be helpful. Please don't treat it as authoritative!"

As far as trying to avoid bringing in chaos theory: there are good arguments for and against, which probably means one should look for some sort of synthesis.

First off, the "popular" image of chaos theory as misapplied in Jurassic Park is baggage that you are Better Off Without, especially in scientific circles. Chaos theory, like catastrophe theory, has developed such a layer of crackpottery around genuine science that one gets negative publicity from the association. Too many people (not specialists, but many who ought to know better) think chaos in meteorology means you're going off to look for the one particular butterfly in China that causes storms...

Moreover, a lot of the genuine applied "chaos theory" is just somebody saying "Look, chaos!" Not much help to anybody else.

However, there *are* some important aspects to chaos theory that *can* be used if you need them. You don't have to buy in wholesale to chaos theory, and you don't even have to mention chaos if you don't want to, in some cases.

  1. The period doubling/bifurcation yoga. If a system or part thereof goes chaotic due to an increase in energy, it more or less always follows the same pattern - past a certain critical energy it alternates big and small peaks, then a pattern of four, then eight... faster & faster till chaos ensues at a certain point. The ratio between adjacent doubling intervals approaches "Feigenbaum's constant."

    This is very, very robust and occurs in many different systems. (e.g., the Lorenz system, if you tweak one of the constants!) The reason is essentially that "almost all" critical points of differentiable systems are like y = x^2 at (0,0).

    If this *is* happening in systems you study, it's a complex but predictable feature that your program should be specifically educated about, as it's unlikely to discover it for itself. I don't know if it does happen, though. (What happens around the end of a period of repetitive weather when the day's weather is locked to the day-night cycle? Or are these a myth in our climate?)

  2. Lyapunov multipliers. Basically, the practical import is that if the effect of a transformation on a cross section of paths in phase space is:
       [1x1] -> [0.9x0.9]  : result - stability
       [1x1] -> [1.1x1.1]  : result - blowup
       [1x1] -> [0.7x1.1]  : result - chaos 
    

    Practically, you can't find those numbers; but you *can* perhaps estimate them by looking at the extent to which a given set of starting conditions is sensitive to minor variations. In particular, chaotic systems will tend to avoid *some* apparently plausible developments without being predictable.

    Now, what you're doing will handle that nicely without being told about it, if I understand correctly. [If you want to make it look not too much of a one-trick pony, you might try predicting "toy" chaotic systems like the Lorenz system (or Lorenz perturbed by noise) & see whether your method outperforms analogs of others on *that*, as well. You'd also get pretty pics for the thesis...] That's a point that could be mentioned in passing, if it is indeed true. However, you might be able to get a specific identification of the three stability regimes that cuts across other predictions... as you have fewer outcomes, it is possible that you can predict chaos/stability/blowup better than you can predict weather itself as a first pass, in domains with few precedents, and then use that to say either "we can't tell easily today" or on the other hand "even though we've only seen this once it should be a very stable situation and will probably do the same thing this time."
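
As a toy illustration of the stability/chaos distinction in point 2, here is a short Python sketch estimating the Lyapunov exponent of the logistic map x -> r x (1 - x) by averaging log |f'(x)| along an orbit; a negative value indicates the contracting (stable or periodic) regime and a positive value the chaotic one. The logistic map simply stands in for the "toy" systems suggested above.

    import math

    def lyapunov_logistic(r, x0=0.2, n_transient=1000, n_iter=100_000):
        """Estimate the Lyapunov exponent of x -> r*x*(1-x) as the average
        of log|f'(x)| = log|r*(1 - 2x)| along the orbit."""
        x = x0
        for _ in range(n_transient):               # discard transient behaviour
            x = r * x * (1.0 - x)
        total = 0.0
        for _ in range(n_iter):
            x = r * x * (1.0 - x)
            deriv = abs(r * (1.0 - 2.0 * x))
            total += math.log(max(deriv, 1e-300))  # guard against log(0)
        return total / n_iter

    for r in (2.8, 3.2, 3.5, 4.0):
        print(f"r = {r}: Lyapunov exponent ~ {lyapunov_logistic(r):+.3f}")
    # negative -> stable fixed point or periodic cycle; positive -> chaos
    # (for r = 4 the exact value is ln 2 ~ +0.693)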

[a week later]

Another thought, again on nonlinear systems: Obviously in forecasting the true phase space is enormously high-dimensional. However, the "inhabited part" of it (near the attractor - and chaos/nonlinear/ergodic theory tells us there must *be* an attractor in most cases) may be much lower-dimensional. More realistically, it may be shaped like a pancake, small in some directions and bigger in others. The flow lines would be nearly, but not quite, in the plane of the pancake.

An analytic model would essentially be trying to determine that plane, locally extracting the few components that made up most of the variation. [Like principal components analysis in statistics, or finding the largest eigenvalues in linear algebra, or...] CBR does not worry explicitly about doing this, if I understand correctly, but will get to the same place in the end with enough data. However, the nonlinear paradigm may help guide the CBR approach, by predicting how it may be helped along.

One obvious problem would be if the data were not intrinsically high-dimensional enough to represent the pancake in a 1-1 fashion, in some domains. Then two apparently similar cases would represent different points on the attractor of "weatherspace." Such a situation might be recognized in principle as one in which the local data from the airport failed to give a clear prediction in certain circumstances but (say) adding in a report from a neighboring town would yield a sudden increase in accuracy. I have no idea whether this would happen in practice; but it might be worth a try in cases recognized as hard-to-predict.

It would be *very* interesting to find out whether "local weather space" has a strong low-dimensional component. [Persistence is the 0-dimensional version; the fact that it works as well as it does is interesting.]

The other neat thing about nonlinear systems is that higher derivatives and higher dimensions are essentially equivalent. One gets "equivalent" solutions for one second-order equation with one variable:

   y" = -ay

or from a system of first-order equations with two variables

   y'= -x, x' = ay

Thus when you look at data over time you essentially boost the dimensionality of it. By how much depends on how autocorrelated it is, of course.
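
A small sketch of that dimensionality-boosting idea, usually called delay-coordinate embedding: a scalar series is turned into vectors of m consecutive lagged values, which is essentially what matching a short series of obs, rather than a single ob, does. The series and parameters below are synthetic.

    import numpy as np

    def delay_embed(series, m, lag=1):
        """Turn a scalar time series into m-dimensional delay vectors
        [x(t), x(t+lag), ..., x(t+(m-1)*lag)]."""
        n = len(series) - (m - 1) * lag
        return np.column_stack([series[i * lag : i * lag + n] for i in range(m)])

    # synthetic "hourly visibility" series with some autocorrelation
    rng = np.random.default_rng(7)
    noise = rng.normal(size=2_000)
    series = np.convolve(noise, np.ones(6) / 6.0, mode="valid")

    vectors = delay_embed(series, m=4, lag=1)
    print("scalar series length:", len(series))
    print("embedded into", vectors.shape[0], "vectors of dimension", vectors.shape[1])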

.../...

This stuff is just stuff I picked up while teaching differential equations & reading "popular" (within the math community, anyway) accounts.

While I don't think any of what I put in is wrong, a third party might conclude that my opinion had been sought because I did know something out of the ordinary about this stuff... which would be embarrassing. [For both of us.] There are undoubtedly others who could add far more useful material. [Let's hope so...]

Hansen

Dear reader,

If you can add any useful material about "chaotic prediction" as it may apply to weather prediction using huge weather archives, please send me an E-mail at bjarne.hansen@canada.com

In any case, Dr. Dawson, the main point you make in your comments about "chaos" is to be skeptical. And I've put "chaos" on my backburner for now. I'm only putting the "chaos" thread into the discussion because it keeps getting raised.

The advantage of the case-based reasoning (CBR) approach to weather forecasting with huge databases is that if the data is characterized by chaotic patterns, CBR faithfully picks up the patterns, whereas less direct analysis risks losing the patterns. And if the data has no chaotic patterns, the CBR method is not dependent on chaos at all.

Hsieh

One thing you could try is to use an ensemble average of 20-30 neural network models (trained with different random initial guesses for the weights) to do the forecasting. Our experience is that the ensemble greatly helps in dealing with local minima, with the forecast skill of the ensemble exceeding the average skill of the individual ensemble members. The ensemble forecasts also look less noisy than the forecasts of an individual neural network.
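
A rough Python sketch of the ensemble idea, with one large simplification: instead of full backpropagation, each "network" uses fixed random hidden weights and solves only its output weights by least squares, so the different random initializations play the role Hsieh describes. It is a stand-in for illustration, not the method used in his work.

    import numpy as np

    def fit_random_net(x, y, n_hidden=20, seed=0):
        """One 'network': random tanh hidden layer + least-squares output layer."""
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(n_hidden, 1))         # input-to-hidden weights
        b = rng.normal(size=n_hidden)              # hidden biases
        H = np.tanh(x[:, None] * W.T + b)          # hidden activations, (n, n_hidden)
        beta, *_ = np.linalg.lstsq(H, y, rcond=None)
        return lambda xq: np.tanh(xq[:, None] * W.T + b) @ beta

    rng = np.random.default_rng(8)
    x = np.linspace(-3, 3, 200)
    y = np.sin(2 * x) + 0.1 * rng.normal(size=x.size)   # noisy target curve

    members = [fit_random_net(x, y, seed=s) for s in range(25)]   # 25 random nets
    x_test = np.linspace(-3, 3, 400)
    preds = np.stack([m(x_test) for m in members])                # (25, 400)
    ensemble_mean = preds.mean(axis=0)

    truth = np.sin(2 * x_test)
    rmse = lambda p: float(np.sqrt(np.mean((p - truth) ** 2)))
    print("average member RMSE :", round(np.mean([rmse(p) for p in preds]), 3))
    print("ensemble-mean RMSE  :", round(rmse(ensemble_mean), 3))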

Hansen

CART, as I understand it, pares down the predictor set by selecting the most predictive variables and building a binary tree in which high and low values of a predictor send a case down one branch or the other at each node. Finding optimal classification boundaries is the objective.

CART looks like binary binning to me, with a tree structure.

Black boxes and bins worry me, and they still look like models to me.

I'm not saying it's impossible to accurately duplicate all the complex, subtle relationships of weather with such a simple model... but it would be quite a feat.

Burrows

Neural networks are not really black boxes. See Hsieh's recent article in BAMS. A significant advantage is that the node weights can be updated, thus you don't have to re-model the entire data set when you get new data. However, there is methodology needed for picking a good architecture. NFIS is a type of Radial Basis Function NN .. in the CANFIS methodology I outline how to generate an "honest" model, one which does not overfit the data, but it requires a test data set and cycling through the model parameters to find the best combination. Incidentally, there is a class of probabilistic NN's suitable for classification problems, which use an RBF layer and a linear layer.

CART can handle continuous predictands rather well, if they have a highly linear relationship to some predictors. Even though the answer is non-continuous, it will generate many nodes in these cases, so that the discontinuities are not great. In fact it will handily beat linear regression unless the relationship is so linear that there's no point doing CART, because linear regression will do. Manon Faucher and I have a paper to appear in J. Clim. Res. on this kind of a problem (coastal ocean surface winds off BC). As far as a modelling technique goes, whether you do GAM or NFIS or MARS (Multiple Adaptive Regression Splines) or NN may be a moot point. The key is the predictor selection method to provide predictors for these modelling methods, and I have seen nothing to better CART for this. After that you must ensure a continuous model is not over-trained.

Hansen

I have not used CART to select and weight predictors - I suppose it would be helpful. But even without CART, I've had encouraging results simply from using available expertise about what few factors are most important, and how they are important, in assessing similarity between elements, obs, and short series of obs.

For example, we know wind direction is important. So we incorporate it into a model. In a statistical approach, it is imperative that we weight its contribution to the categorical output optimally to achieve optimal results. And wind direction's contribution to the output must be balanced along with many other possible predictors. CART cuts down the size of the predictor set, but there are still many to balance. Many permutations.

Analog approaches it differently. There are a few discernible features of wx in a METAR. We would like all these features, in the k-nn set, to be "near" the features of the present case. "Near" means one thing for wind direction, another for spread, etc. I specify reasonable functions for each. Admittedly, reasonable, not "optimal" (but I could fiddle with it). Simple fuzzy triangle functions of varying breadth enable dimensionless comparisons; quantitative differences are translated into qualitative differences. Figure out the degree of nearness of all the critical attributes, and select the minimum (e.g., wind directions may be exactly the same, but spread differs by 10 degrees, so the archived case is discarded), or use another fuzzy aggregating function.
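
A sketch of that aggregation step, assuming each weather element already has its own triangular nearness function; the breadths below are invented, and the minimum is used as the aggregating function, as described.

    import numpy as np

    # half-widths ("breadths") of the triangular nearness functions, invented here:
    BREADTHS = {"wind_dir": 10.0, "spread": 2.0, "temperature": 5.0, "wind_speed": 5.0}

    def nearness(element, archived, present):
        """Triangular nearness in [0, 1] for one weather element."""
        diff = abs(archived - present)
        if element == "wind_dir":                       # directions are circular
            diff = min(diff % 360.0, 360.0 - diff % 360.0)
        return max(0.0, 1.0 - diff / BREADTHS[element])

    def overall_nearness(archived_case, present_case):
        """Aggregate per-element nearness with the minimum, so one badly
        mismatched element is enough to discard an archived case."""
        return min(nearness(e, archived_case[e], present_case[e]) for e in BREADTHS)

    present = {"wind_dir": 90.0, "spread": 1.0, "temperature": 12.0, "wind_speed": 10.0}
    archived = {"wind_dir": 90.0, "spread": 11.0, "temperature": 12.0, "wind_speed": 10.0}
    print(overall_nearness(archived, present))   # 0.0: spread differs by 10 degrees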

Vislocky

Be very careful when doing this. I'm sure you are an excellent forecaster and have lots of experience; however, no matter how good a forecaster is, when it comes time to identifying important predictors & thresholds, objective stepwise selection methods will always select better variables than the forecaster, especially if there are a large number of lead times, locations & variables that require development of separate systems. Case in point...how many times were you told that positive vorticity advection was important...how many times do forecasters point to short-waves causing certain weather conditions? Well, it just so happens that PVA is rarely, if ever, selected as an important predictor when developing MOS equations. At least that's the experience of myself and NCEP's Techniques Development Lab. In fact, I did a study examining the usefulness of PVA and other quasi-geostrophic techniques (i.e., q-vectors, etc.) using a 15-year dataset. Bottom line, PVA & q-vector divergence are horrible predictors. Much better off using the model's vertical velocity & humidity forecasts. There are many such examples like this I can cite where the human gets smoked by the objective selection methods. Bottom line: use some sort of objective screening method to select your important variables. Once you have the top variables, then apply analog techniques to identify the most similar past patterns & determine the forecast.

Burrows

Concerning the usefulness of PVA, we have had pretty good luck with using PVA for cloud and precip (including thunderstorms) for years in Southern Ontario. For my CART snowsquall model (Burrows, 1991) (which is still the main forecast model used by OWC), low level mass convergence centered over small areas and along lines is one of the key predictors for forecasting snowsquall occurrence and amount. This is also verified by US forecasters in Michigan and Buffalo.

Hansen

Modeling is tricky. If we overlook some important factor, we unwittingly handicap models.

Proponents of "data-mining" contrast their approach with statistics. I'm reminded of an argument at an AIRIES conference in October '98. A pro-data-mining guy gave a presentation, and afterwards a pro-statistics guy said, "Data-mining does not accomplish anything that cannot already be done with statistics." To which the data-mining guy responded, "Yes, but with data-mining techniques, you do not need a Ph.D. in statistics to do these things." That exchange left me with mixed feelings, and skeptical that such a simple approach to complicated problems could obviate statistics.

And I still firmly believe domain knowledge is vital to solving a problem, regardless of the computer-based methodology applied to the problem.

I spoke with a professor of statistics about the merits of statistics for processing huge amounts of data. He said, "The thing you have to remember about statistics is that it was all built around the assumption that your sample size is 32." Hence, all the methodology for assigning confidence to predictions. It left me wondering what "confidence intervals" mean when the "sample" is the entire population. I realize that weather at an airport is not entirely represented by 300,000 METARs, but that is a large population upon which to base confidence in predictions.

Wilson

As for statistics, remember that, while our samples are typically very large, there is a great deal of correlation within the sample; we may have thousands of data values, but only relatively few of them can be considered "independent" samples extracted from a distribution. I would say that we are far from having the whole population available; we are still sampling from a distribution of, say, all the ceilings that could possibly happen at a particular station. If we had sampled the whole population, we would never have any more "record" extreme values. I would agree with you to the extent that 30 years should be long enough to get a pretty good estimate of the population distribution from the sample, that is, if we can assume stationarity. But that gets into climate change..., a whole other topic.
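
One way to put a number on this point: for a series with lag-1 autocorrelation r, a common rule of thumb (exact for an AR(1) process) is that n correlated observations carry roughly n(1 - r)/(1 + r) independent ones. A small Python sketch with synthetic data:

    import numpy as np

    rng = np.random.default_rng(9)

    # synthetic hourly "ceiling" series with strong hour-to-hour persistence (AR(1))
    phi, n = 0.95, 300_000
    x = np.empty(n)
    x[0] = 0.0
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()

    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]          # lag-1 autocorrelation
    n_eff = n * (1.0 - r1) / (1.0 + r1)            # AR(1) effective sample size
    print(f"{n} hourly obs, lag-1 autocorrelation {r1:.3f}")
    print(f"-> roughly {n_eff:,.0f} effectively independent samples")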

Hansen

I agree. My prototype system (cited above) effectively mined the data by applying successive tests to short series of obs. It took steps to prevent selected series of obs from overlapping, imposing a same-time-of-day condition, cutting the candidate population down to about 10,000 cases or 4%. Then it imposed a same-time-of-year condition, cutting the candidates down to 2500 cases or 1%. Then it imposed more weather-similarity conditions, finally cutting the population down to 100 cases or 0.04%.

Programming the system to pick 100 analogs was ad hoc for the proof-of-concept prototype. In a better system, the system would make run-time decisions about how many cases to select, based on the typicality of the present case. If the present case is an outlier, one could cut the analog set down to 10 and alert the user about the atypicality of the present case.
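
A sketch of that successive filtering, with invented fields and thresholds standing in for the real similarity tests; the point is only the shape of the data-mining pipeline (hourly archive cut down by time of day, time of year, and weather similarity), not the exact percentages quoted above.

    import numpy as np

    rng = np.random.default_rng(10)
    n = 300_000                                        # ~34 years of hourly METARs

    # synthetic archive: valid hour, day of year, and an overall weather nearness
    hour = rng.integers(0, 24, size=n)
    doy = rng.integers(1, 366, size=n)
    nearness = rng.random(size=n)                      # stand-in for the fuzzy score

    present_hour, present_doy = 6, 170                 # e.g. 6 am in mid-June

    step1 = np.abs(hour - present_hour) <= 1           # same time of day (+/- 1 h)
    step2 = step1 & (np.abs(doy - present_doy) <= 30)  # same time of year (+/- 30 d)
    order = np.argsort(np.where(step2, nearness, -1.0))[::-1]
    analogs = order[:100]                              # 100 most similar survivors

    for label, mask in [("time-of-day filter", step1), ("time-of-year filter", step2)]:
        print(f"after {label}: {mask.sum():,} cases ({100 * mask.mean():.1f}%)")
    print("final analog set:", len(analogs), "cases")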

Important point you raise about "climate change." If it did change, I think an analog approach would be more robust than current statistical approaches. To make a simple example, suppose the "storm track" shifts north 5 degrees latitude, the average temperature rises 2 degrees Celsius, and the average humidity rises 5%. All of these changes would be accompanied by, caused by, changes in prevailing winds. Weather at an airport in this "new climate regime," compared to weather at the airport during the past 34 years, would be different on average, altering the statistical relationships in unknowable ways. However, weather at an airport in this new climate regime would have many analogs to weather in the old climate regime, admittedly fewer analogs than if the climate were unchanged. There are many short series of obs in the past, and in the database, where the storm track briefly shifted north and temperature and humidity rose. If the topography and proximity of the ocean are unchanged, then for similar sets of weather elements, cloud and visibility at the airport should behave the same way as always.

Wilson

Analogue techniques are also statistical; one always has to use a statistic based on data to measure the degree of similarity with past weather situations. That is just what correlation and regression seek to do.

Hansen

I have to agree. Others have also recently made this point to me, that analogue techniques are statistical. One went so far as to suggest that analog forecasting is a poor form of statistical forecasting.

Wilson

The analogue technique is also one of the oldest statistical methods applied in meteorology.

Hansen

The analogue technique is the oldest method for weather prediction: "Red sky at night..."

The purpose of this discussion was two-fold:

  1. compare analog forecasting with statistical forecasting
  2. find a basis for the statement: persistence climatology is widely recognized as a formidable benchmark for very short range prediction of ceiling and visibility.

We seem to agree that analog forecasting is a simple form of statistical forecasting. (?)

It seems to be most beneficial to contrast analog forecasting with persistence climatology forecasting. So, I will emphasize the following:

Analog forecasting is persistence climatology forecasting without built-in limitations.

From a data-mining perspective, persistence climatology (PC) underutilizes data. To illustrate, I will give a simple description of how PC "mines" the data (34 years of archived hourly airport observations). Then compare that with how analog forecasting would mine the data.

The basic objective of PC is to answer the question: In similar past situations, what were the outcomes 1, 2, 3,... hours later? PC is a meteorological application of joint probability. For example, suppose that it is 6 am in June and the airport is "socked in" in fog. The flying category is the lowest possible, Category 1. Using PC, one tabulates, before the fact, the probabilities to forecast for such a situation. The database is searched for all instances of {June, 6 am, flying category 1}, the flying categories during the subsequent hours are tabulated, and probabilities are prepared accordingly. The PC approach to analog forecasting handicaps analog forecasting in two ways:

  1. PC treats weather as if it was categorical
  2. PC uses a very limited set of predictors
The pros and cons of categorization are discussed above. A problem with using crisp categories is that, as the number of stratifying conditions increases and as specified events become rarer, instances for statistical tabulation may not exist [Martin, 1972]. Because of this effect, PC uses a limited number of predictors.
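
To make the PC "mining" concrete, here is a minimal Python sketch of the before-the-fact tabulation described above: count, over the archive, which flying category followed N hours after every past case of {June, 6 am, category 1}. The archive here is synthetic and unstructured, so the tabulated probabilities are flat; the point is the mechanics of the table, and the empty-bin problem that appears as soon as more stratifying conditions are added.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(11)
    n = 300_000                                      # stand-in for 34 years of hourlies
    month = rng.integers(1, 13, size=n)
    hour = rng.integers(0, 24, size=n)
    category = rng.integers(1, 5, size=n)            # flying category 1 (worst) .. 4 (best)

    def pc_table(lead_hours, key_month=6, key_hour=6, key_cat=1):
        """Tabulate P(category at +lead_hours | {month, hour, category} now)."""
        idx = np.where((month == key_month) & (hour == key_hour) &
                       (category == key_cat))[0]
        idx = idx[idx + lead_hours < n]              # stay inside the archive
        counts = Counter(category[idx + lead_hours].tolist())
        total = sum(counts.values())
        return len(idx), {c: round(counts[c] / total, 2) for c in sorted(counts)}

    for lead in (1, 3, 6):
        n_cases, probs = pc_table(lead)
        print(f"+{lead} h (from {n_cases} past cases):", probs)
    # every extra stratifying condition shrinks n_cases -- the "empty bin" problem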

PC cannot use "real-time" predictors, variable predictors that are relevant to current weather. Operationally, meteorologists know more than what month it is and what time of day it is. They have additional knowledge about what the "problem of the day" is. For example, if a cold front is due to pass through the region during a forecaster's shift, then all weather timings hinge on the time of passage of a cold front. "Timing the cold front" is the most critical task for an aviation forecaster on such a day.

Presently, with PC, there is not an easy way to specify such a peculiar set of conditions as {June, 6 am, 1/4SM FG, OVC001, wind shift from 160 to 280 three hours hence}, because PC must be prepared before-the-fact, using only a limited set of "predictable" predictors. Information about such peculiar cases, contained in the database, is not presently made available to forecasters. It is impractical to prepare PC statistics for the full range of possible predictors.

In contrast, the analog approach (fuzzy k-nn) to cig and vis prediction selects the k-nearest neighbors, nearest in terms of a set of critical attributes knowable only at forecast time. Fuzzy logic overcomes the "empty bin problem" that crisp, categorical tests are subject to.

From a user's perspective, analog forecasting is flexible. A met colleague said it would be like "custom climatology on-the-fly." Defer important decisions until run-time. A forecaster invests considerable effort in timing a cold front. The forecaster could use "real-time persistence climatology" simply by entering expected wind-shift into a system. Let the system automatically supply the other predictors (e.g., time of day, month, OBS, NWP). Have the system output the likeliest trend of cig and vis.

The prototype system is ad hoc, not "optimally" put together in the sense of modelling. I cut corners. I have probably even overlooked some predictive elements, like 700 mb rh. But what is selected are actual cases, series of obs, which are, as an expert would judge, the nearest, most analogous cases in the huge archive to the present case, weighted according to overall nearness, as measured in a fuzzy-logic-based, quasi-subjective manner. Because these are actual cases of weather at the airport, it seems unlikely that they would lead us wildly astray. (?) And, even supposing a few non-analogous cases did get into the k-nn set, although they shouldn't, using the median cig/vis values of the set should lessen the effect of such outliers.

Edward Lorenz [1963, 1969, 1977] explained how the practicality of analog forecasting was limited by the rarity of good analogs. Slow computers also limited practicality. Hardware has improved greatly over the past 30 years. Databases are 1000's of times larger. Computers are 1000's of times faster. It has become increasingly practical to search huge databases for analogs to make predictions.

[That's where it left off in March 1999.]

References

  • William R. Burrows, and J. Montpetit, 2000: A procedure for neuro-fuzzy dynamic-statistical data modeling with predictor selection, Preprints of the 2nd Conference on Artificial Intelligence, American Meteorological Society.
  • William R. Burrows, 1998: CART-neuro-fuzzy statistical data modeling, Part 1: Method, Preprints of the 1st Conference on Artificial Intelligence (and) Part 2: Results, Preprints of the 14th Conference on Probability and Statistics, American Meteorological Society.
  • Bjarne K. Hansen, 2000: Analog forecasting of ceiling and visibility using fuzzy sets, Preprints of the 2nd Conference on Artificial Intelligence, American Meteorological Society.
  • Bjarne K. Hansen, 1998: Fuzzy case-based prediction of ceiling and visibility, Preprints of the 1st Conference on Artificial Intelligence, American Meteorological Society.
  • William W. Hsieh, and Tang, B., 1998: Applying Neural Network Models to Prediction and Data Analysis in Meteorology and Oceanography. Bulletin of the American Meteorological Society: Vol. 79, No. 9, pp. 1855-1870.
  • Robert L. Vislocky, and Fritsch, J. M., 1997: An automated, observations-based system for short-term prediction of ceiling and visibility, Weather and Forecasting, 12, 31-43.
  • Lawrence J. Wilson, and Sarrazin, R., 1989: A classical-REEP short-range forecast procedure, Weather and Forecasting, 4, 502-516.
  • References on analog forecasting

    If you can suggest a paper to add to this list, please write to me at bjarne.hansen@canada.com

    1. Barnett, T. P., and Preisendorfer, R. W. (1978) Multifield analog prediction of short-term climate fluctuations using a climate state vector, Journal of the Atmospheric Sciences, 35, 1771–1787.
    2. Barnston, A. G., and Livezey, R. E. (1989) An operational multifield analog / anti-analog prediction system for United States seasonal temperatures. Part II: Spring, summer, fall and intermediate 3-month period experiments, Journal of Climate, 2, 513–541.
    3. Bergen, R.E., and Harnack, R. P. (1982) Long-range temperature prediction using a simple analog approach, Monthly Weather Review, 110, 1083–1099.
    4. Guilbaud, S. (1997) Prévision quantitative des précipitations journalières par une méthode statistico-dynamique de recherche d'analogues (Prediction of quantitative precipitation by a statistical-dynamical method for finding analogs), Ph.D. Thesis, Institut National Polytechnique de Grenoble, France.
    5. Huschke, R. E. (editor) (1959) Glossary of Meteorology, American Meteorological Society, Boston, Massachusetts, USA, pgs. 106 and 419.
    6. Kruizinga, S., and Murphy, A. H. (1983) Use of an analogue procedure to formulate objectively probabilistic temperature forecasts in the Netherlands, Monthly Weather Review, 111, 2244–2254.
    7. Livezey, R. E., and Barnston, A. G. (1988) An operational multifield analog/antianalog system for United States seasonal temperatures. Part 1: System design and winter experiments, Journal of Geophysical Research, Vol. 93, No. D9, 10953–10974.
    8. Lorenz, E. N. (1963) Deterministic nonperiodic flow, Journal of the Atmospheric Sciences, 20, 130–141.
    9. Lorenz, E. N. (1969a) Atmospheric predictability as revealed by naturally occurring analogues, Journal of the Atmospheric Sciences, 26, 636–646.
    10. Lorenz, E. N. (1969b) Three approaches to atmospheric predictability, Bulletin of the American Meteorological Society, 50, 345–349.
    11. Lorenz, E. N. (1977) An experiment in nonlinear statistical weather forecasting, Monthly Weather Review, 105, 590–602.
    12. Lorenz, E.N. (1993) The essence of chaos, University of Washington Press, Seattle, WA, USA.
    13. Martin, D. E. (1972) Climatic presentations for short-range forecasting based on event occurrence and reoccurrence profiles, Journal of Applied Meteorology, 11, 1212–1223.
    14. Nicolis, C. (1998) Atmospheric Analogs and Recurrence Time Statistics: Toward a Dynamical Formulation. Journal of the Atmospheric Sciences, Vol. 55, No. 3, 465–475.
    15. Radinivic, D. (1975) An analogue method for weather forecasting using the 500/1000 mb relative topography, Monthly Weather Review, 103, 639–649.
    16. Roebber, P. J., Bosart, L. F. (1998) The Sensitivity of Precipitation to Circulation Details. Part I: An Analysis of Regional Analogs, Monthly Weather Review, Vol. 126, No. 2, 437–455.
    17. Ruosteenoja, K. (1988) Factors affecting the occurrence and lifetime of 500 mb height analogues: A study based on a large amount of data, Monthly Weather Review, 116, 368–376.
    18. Soucy, D. (1991) Revised users guide to days 3-4-5 automated forecast composition program, CMC Technical Document No. 37, Canadian Meteorological Centre, Environment Canada.
    19. Toth, Z. (1989) Long-range weather forecasting using an analog approach, Journal of Climate, 2, 594–607.
    20. Toth, Z. (1991a) Intercomparison of circulation similarity measures, Monthly Weather Review, 119, 55–64.
    21. Toth, Z. (1991b) Estimation of the atmospheric predictability by circulation analogs, Monthly Weather Review, 119, 65–72.
    22. Toth, Z. (1991c) Circulation patterns in phase space: A multinormal distribution?, Monthly Weather Review, 119, 1501–1511.
    23. Van den Dool, H. M. (1994) Searching for analogues, how long must we wait?, Tellus, 45A, 314–324.
    24. Van den Dool, H. M. (1989) A new look at weather forecasting through analogues, Monthly Weather Review, 117, 2230–2247.
    25. Van den Dool, H. M. (1987) A bias skill in forecasts based on analogues and antilogues, Journal of Applied Meteorology, 26, 1278–1281.
    26. Woodcock, F. (1980) On the use of analogues to improve regression forecasts, Monthly Weather Review, 108, 292–297.


P.S.

Vislocky

You know what would be really neat (if free time were available) would be to undertake a joint project in which we have a bakeoff of statistical/computational forecast techniques (i.e., CART, GAM, NNets, Analog, Logistic Regression, etc.). I did some of this several years ago (and found GAM to be the best), but that was for that specific dataset to forecast ceiling and visibility. Plus, I may not have used the best NNet or CART algorithms available. What would be neat is for each party to take a few dependent datasets and corresponding independent datasets (where we don't know what the results are) for different types of predictands/predictors and do the bakeoff. I could do the GAM, maybe Bill could do the CART, I know someone who could do the NNets and the logistic regression.

Hansen

Sounds like a great idea. A shoot-out. It'd be most interesting research. And our sponsors, whoever they may be, should be willing to fund such a project, a project that aims to make airport forecasts more accurate, and easier and less expensive to produce.

Last updated 30 March 1999.