Time Series Forecasting on Weather Data
Forecasting Global Surface Temperature Change
The dataset, provided by NASA, contains monthly measurements of the average temperature deviation, recorded at meteorological stations and over the ocean, from the 1951-1980 averages. The deviations were recorded from 1880 to 2020.
Preliminary ARIMA Analysis
In the time plot of the average deviation at the meteorological stations, the mean increases roughly linearly, which hints that the series is non-stationary. Because the growth does not look exponential or logarithmic, no initial transformation of the data is necessary other than differencing. In addition, the variability appears to be roughly constant over time.
Because the series appears to increase linearly, we difference once to remove the linear trend. Technically, this amounts to treating the series as an ARIMA(p, 1, q) process. The plot of the differenced series looks stationary, with a consistent mean and variance throughout.
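As a minimal sketch of what differencing once does, the snippet below uses plain Python on a hypothetical toy series (not the NASA data). The helper name `difference` is an assumption made for illustration. A series with a straight-line trend turns into a series that fluctuates around a constant, which is the behaviour described above.

```python
def difference(series, lag=1):
    """Return the lag-`lag` differences x[t] - x[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# A pure linear trend becomes a constant after one difference.
linear = [2 * t + 1 for t in range(5)]      # 1, 3, 5, 7, 9
linear_diff = difference(linear)            # every element equals the slope

# Hypothetical toy series: linear trend plus a small repeating wiggle.
wiggle = [0.05, -0.02, 0.01, -0.04]
toy = [0.01 * t + wiggle[t % 4] for t in range(12)]
toy_diff = difference(toy)

# The differenced series is one observation shorter than the original,
# and its mean is the average step between first and last observation.
mean_step = sum(toy_diff) / len(toy_diff)
print(linear_diff, round(mean_step, 6))
```

Note that the differenced series keeps a non-zero mean equal to the average slope of the original series, which is why a constant term still matters when modelling it.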
To help choose the values of the parameters p and q, we examine the plots of the ACF and PACF.
For a pure autoregressive (AR) model, the significant lags of the PACF cut off after lag p, while for a pure moving average (MA) model, the significant lags of the ACF cut off after lag q. The sample ACF and PACF of the differenced series look consistent with either an MA(1) process or an AR(8) process. For the original, undifferenced series, these correspond to ARIMA(0,1,1) and ARIMA(8,1,0).
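To make the cut-off rule concrete, here is a rough sketch in plain Python that computes the sample ACF and the sample PACF (via the Durbin-Levinson recursion) and compares them with the usual approximate significance bound of ±1.96/√n. The series is a simulated MA(1) process with a hypothetical coefficient of -0.5, not the temperature data, and the function names are assumptions for illustration.

```python
import math
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def sample_pacf(x, max_lag):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    r = sample_acf(x, max_lag)
    pacf = [1.0]
    prev = []  # prev[j] holds phi_{k-1, j+1}
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Simulated MA(1): x_t = w_t + theta * w_{t-1}, with a hypothetical theta.
random.seed(0)
theta = -0.5
noise = [random.gauss(0.0, 1.0) for _ in range(2001)]
ma1 = [noise[t] + theta * noise[t - 1] for t in range(1, len(noise))]

acf = sample_acf(ma1, 10)
pacf = sample_pacf(ma1, 10)
bound = 1.96 / math.sqrt(len(ma1))  # approximate 95% bound for white noise

for k in range(1, 11):
    flag = "*" if abs(acf[k]) > bound else " "
    print(f"lag {k:2d}  acf {acf[k]: .3f}{flag}  pacf {pacf[k]: .3f}")
```

For an MA(1) process the ACF should be clearly non-zero at lag 1 only, while the PACF tails off more gradually; an AR(p) process shows the mirror-image pattern.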
Since we are fitting the already differenced data, the models used will be ARIMA(0,0,1) and ARIMA(8,0,0), respectively. The include.DRIFT=TRUE option can also be left out.
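The correspondence between the two ways of writing each model can be spelled out as follows. The notation is an assumption introduced here: y_t is the original series, x_t the differenced series, B the backshift operator, w_t white noise, and μ the mean of the differenced series (the drift of the original one).

```latex
\begin{aligned}
x_t &= y_t - y_{t-1} = (1 - B)\,y_t \\[4pt]
\text{ARIMA}(0,1,1)\text{ on } y_t:\quad
  & (1 - B)\,y_t = \mu + w_t + \theta\,w_{t-1}
  \;\Longleftrightarrow\;
  x_t = \mu + w_t + \theta\,w_{t-1} \\[4pt]
\text{ARIMA}(8,1,0)\text{ on } y_t:\quad
  & x_t - \mu = \sum_{i=1}^{8} \phi_i\,(x_{t-i} - \mu) + w_t
\end{aligned}
```

Written this way, fitting ARIMA(0,0,1) or ARIMA(8,0,0) with a mean term to x_t describes the same process as fitting the d = 1 model with drift to y_t.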
The fitted MA(1) model has σ = 0.03, and the fitted AR(8) model has σ = 0.029. At this point, both models look similarly promising. To rule one of them out, we examine the diagnostic plots for signs of a bad fit.
The diagnostic plots suggest that both models are promising. For the MA(1) model, the ACF of the residuals shows six out of 25 lags as significant, and the points of the normal QQ plot generally follow the line, although the tails stray away from it. Similarly, the AR(8) model has only about four significant lags out of 25 and a good fit in the normal QQ plot. Although the tails of the QQ plot again stray from the normal line, the model still seems acceptable. The standardized residuals plots for both models show residuals that closely resemble white noise and look stationary, with constant mean and variance throughout.
Since the diagnostic plots do not let me rule out either model, I will choose the model with the lowest information criteria (AIC, AICc, BIC), which reward goodness of fit while penalizing model complexity.
All three information criteria (AIC, AICc, BIC) are lowest for AR(8), making it a better choice than MA(1).
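For reference, the three criteria can be sketched with their common textbook definitions, where k is the number of estimated parameters and n the number of observations. The log-likelihoods and parameter counts below are hypothetical placeholders, not the values from the fitted models, and software packages may scale these quantities or count parameters differently.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Textbook AIC, AICc and BIC; lower values indicate a preferred model."""
    aic = -2.0 * log_likelihood + 2.0 * n_params
    aicc = aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    bic = -2.0 * log_likelihood + n_params * math.log(n_obs)
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


# Hypothetical inputs for illustration only (NOT the fitted models' values).
simple = information_criteria(log_likelihood=100.0, n_params=3, n_obs=200)
richer = information_criteria(log_likelihood=125.0, n_params=10, n_obs=200)

for name in ("AIC", "AICc", "BIC"):
    better = "richer" if richer[name] < simple[name] else "simple"
    print(f"{name:4s} simple={simple[name]:9.3f} richer={richer[name]:9.3f} -> {better}")
```

A model with more parameters only wins when its gain in likelihood outweighs the penalty, and BIC applies the heaviest penalty of the three for series of this length.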
Having narrowed my choice to the AR(8) model, I will forecast the next 24 months, that is, the two years from 2020 to 2022.
The blue line shows my predicted temperature deviation over the next two years. The shaded grey and blue regions show the 95% prediction interval of the forecast.
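Because the model is fitted to the differenced series, its forecasts are forecasts of month-to-month changes and have to be summed back up to the original scale. The sketch below illustrates that step in plain Python with a hypothetical AR(2) model and made-up numbers; the coefficients, mean, and history are assumptions, not the fitted AR(8) values, and only point forecasts are produced, not the prediction interval.

```python
def ar_forecast(history, coeffs, mean, steps):
    """Iterate x_hat[t] = mean + sum_i coeffs[i] * (x[t-1-i] - mean).

    `coeffs[0]` multiplies the most recent value; `history` must contain
    at least len(coeffs) observations.
    """
    values = list(history)
    forecasts = []
    for _ in range(steps):
        recent = values[-len(coeffs):][::-1]  # most recent value first
        pred = mean + sum(c * (v - mean) for c, v in zip(coeffs, recent))
        forecasts.append(pred)
        values.append(pred)  # feed the forecast back in for the next step
    return forecasts


def integrate_forecast(last_value, diff_forecasts):
    """Cumulatively add forecast differences to the last observed level."""
    level = last_value
    path = []
    for d in diff_forecasts:
        level += d
        path.append(level)
    return path


# Hypothetical inputs for illustration only.
recent_diffs = [0.02, -0.05, 0.03, 0.01]   # last observed differences
phi = [-0.4, -0.2]                         # made-up AR(2) coefficients
drift = 0.001                              # made-up mean of the differences
last_level = 1.0                           # made-up last observed deviation

diff_path = ar_forecast(recent_diffs, phi, drift, steps=24)
level_path = integrate_forecast(last_level, diff_path)
print(len(level_path), round(level_path[-1], 4))
```

As the horizon grows, the forecast differences settle toward the mean of the differenced series, so the forecast on the original scale approaches a straight line with that slope.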