Drought risk assessment ideally requires long-term rainfall records, especially where interannual droughts are of potential concern, and spatially consistent estimates of rainfall to support regional and interregional scale assessments. This paper addresses these challenges by developing a spatially consistent stochastic model of monthly rainfall for southeast UK. Conditioned on 50 gauged sites, the model infills the historic record from 1855 to 2011 in both space and time, and extends the record by synthesising droughts which are consistent with the observed rainfall statistics. The long record length allows more insight into the variability of rainfall and potentially a stronger basis for risk assessment than is generally possible. It is shown that, although localised biases exist in both space and time, the model results are generally consistent with the observed record, including for a range of interannual droughts and spatial statistics. Simulations show that some of the most severe interannual droughts on the record may recur, despite a trend towards generally wetter winters.

c  correlation coefficient 
C_{ii}  variance of the unconditional error at site i 
C_{si}  vector of covariance values between s sites and another site i 
C_{ss}  covariance matrix describing the dependencies between the errors at s sites 
D  distance between two sites 
e  n × 1 vector of errors 
ē_{i}  expected value of the error at site i 
e_{s}  vector of errors observed at s sites over all the years for any month 
N(ē_{i},σ^{2}_{i})  a random sample drawn from a normal distribution with mean ē_{i} and variance σ^{2}_{i} 
r  untransformed rainfall 
X  n × m matrix containing n values of m observed input variables 
Y  n × 1 vector of observations 
y  transformed rainfall 
ȳ_{i}  expected value of transformed rainfall 
α  parameter of the correlogram 
β  parameter of the correlogram 
θ  m × 1 vector of regression coefficients 
λ  Box–Cox transform parameter 
σ^{2}_{i}  variance of the conditional error at site i 
The sustainability of water supply in much of Europe is a major concern for economic and environmental planning (Mechler and Kundzewicz, 2010). One region of particular concern is southeast UK, which has a high and increasing population, relatively low rainfall and high evaporation (Arnell and Delaney, 2006; Marsh et al., 2007). A large proportion of the supply in this region is from the chalk aquifer, which is under stress in places from over-abstraction and agricultural contamination (Smith et al., 2010). To relieve the stress on water resources, options for desalination, bulk imports and interbasin transfers from the Thames and Severn basins have been considered (Arnell and Delaney, 2006). Moreover, to share water resources more effectively during drought periods, efforts are under way to optimise water transfer schemes within the southeast (von Lany et al., 2008).
In southeast UK it is generally perceived that three dry winters in succession would present severe regional water supply deficits, because the winter season, with its higher rainfall and lower potential evaporation, is the primary source of effective rainfall and recharge to the aquifers (von Lany et al., 2008). Should the most extreme historic droughts recur (see Marsh et al., 2007), it seems unlikely that an acceptable level of service could be maintained (McIntyre et al., 2003). Of particular concern to water managers is the possible recurrence of the long-term droughts of 1887–1910, which included a series of five unusually dry winters, and shorter interannual droughts of 1920–1922, 1933–1934, 1975–1976, 1990–1992 and 1995–1997 (Marsh, 1996; Marsh et al., 2007; Subak, 2000).
As well as drought duration, the spatial aspect of drought is also of interest (Burke and Brown, 2010; Zaidman et al., 2002). The spatial properties of drought have particular practical relevance in southeast UK, where extending the water network within and beyond the region is potentially viable (von Lany et al., 2008). Assessing the scope for such transfers requires a good understanding of the spatial characteristics of water availability over the relevant scales. Hence there is considerable motivation for developing data sets and tools which deliver a capability for characterising both the temporal and spatial characteristics of extreme droughts.
Gridded climate reanalysis data sets (e.g. the ERA40 data, Uppala et al. (2005)) produce historical rainfall going back to 1957 on a grid scale of about 1·125°. However this time coverage and grid scale are rather restrictive for analysis of extremes within one region. Downscaling tools provide a simulation capability at more applicable scales (e.g. Kenabatho et al., 2011; Kigobe et al., 2011). For example, the weather generator associated with UKCP09 (Jones et al., 2009) can be used to generate scenarios of extreme rainfall on a 5 km grid scale. However, that weather generator was not designed to produce spatially consistent rainfall in the sense that observed intergrid dependence of rainfall is not preserved (Jones et al., 2009), and hence has limitations for spatial assessment of drought risk. Also, many stochastic rainfall models (e.g. Jones et al., 2009; Yang et al., 2005) are trained on only around 30 years of data, and hence their suitability for generating a range of extreme droughts is unclear. Therefore, despite their attractions, existing reanalysis data sets and stochastic rainfall models are not by themselves adequate to support regional drought risk analysis.
The concern about water stress in England, Europe and beyond calls for suitable data sets and tools to support regional water resources management (Thyer et al., 2002). This includes generating long sequences of rainfall distributed over space and time. This paper aims to address this challenge by developing new statistical rainfall models using a case study of southeast UK, including the following activities.
(a) Compilation of available long-term rainfall records covering southeast UK (Kent, Sussex, Hampshire, Surrey, Isle of Wight, east Wiltshire, south Berkshire and south London).  
(b) Identification of large-scale climatic drivers of rainfall and regional variability to give a deterministic model to predict expected rainfall over the region; and identification of a stochastic model to describe variability around the expected values.  
(c) Infilling of missing data to provide continuous monthly sequences of rainfall dating back to 1855, gridded over the southeast region, including uncertainty estimates.  
(d) Assessment of the ability of the model to replicate the historical extreme droughts, in particular the severe droughts of 1887–1910, 1920–1922 and 1975–1976. 
The general linear regression model is

$$Y = X\theta + e \tag{1}$$

where Y is an n × 1 vector of observations, X is an n × m matrix containing n values of m observed input variables, θ is an m × 1 vector of constant regression coefficients, and e is an n × 1 vector of errors. θ is generally estimated by minimising the sum of the squared errors given the set of observations, Y and X. With the assumption that the errors in vector e are independent of each other, and are identically and normally distributed, the least squares estimate of θ is equivalent to the maximum likelihood estimate. This assumption also allows the covariance matrix of the regression coefficients to be estimated using standard linear methods (Kottegoda and Rosso, 2008). The input variables to include in X are generally identified by trial and error, aiming to produce a model which explains much of the variability in Y (generally measured using the R^{2} statistic), and also, ideally, to produce a θ with low covariance. Stepwise regression (Draper and Smith, 1998) is a set of procedures which assists with the identification of the optimal X variables (from a set of prespecified candidates).
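For reference, the least squares estimate and its covariance described above can be written out explicitly. The following is the standard textbook result for a linear model in which Y equals Xθ plus an error vector e, under the stated assumption of independent, identically and normally distributed errors; the hat notation and the residual variance symbol s^{2} are introduced here for illustration and are not part of the paper's notation.

```latex
\hat{\theta} = (X^{T}X)^{-1}X^{T}Y,
\qquad
\widehat{\operatorname{Cov}}(\hat{\theta}) = s^{2}\,(X^{T}X)^{-1},
\qquad
s^{2} = \frac{(Y - X\hat{\theta})^{T}(Y - X\hat{\theta})}{n - m}
```

When the errors are correlated between sites, as in the case study, these expressions for the coefficient covariance no longer hold exactly.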
The identification of a suitable probability density function to describe e means that Equation 1 may be employed as a stochastic model, from which random realisations of Y can be simulated. This potentially provides a model for stochastic simulation of rainfall variability and extremes. Consistent with the general statistical assumptions behind least squares regression, it is common to assume a normal distribution of errors. Towards achieving such a normal distribution, the skewness generally observed in rainfall data can be managed by transforming the rainfall prior to the regression, for example using a logarithmic or Box–Cox transform (Kottegoda and Rosso, 2008). When the errors are not independent of each other (as in the case study below), a multivariate normal distribution is required. Where the rainfall sample contains a significant number of zeros, as would be the case using daily or sub-daily data in the UK, the random variability cannot conveniently be described by a single continuous distribution function. Furthermore, at these time scales there is significant serial dependence. These challenges have led to the use of statistical methods for rainfall modelling which are more flexible than simple regression (Chandler and Wheater, 2002; Segond et al., 2006). However, in this paper, the use of monthly rainfall data sufficiently simplifies the problem so that a stochastic model of the form of Equation 1 (including suitable Box–Cox transforms of the data and suitable models of intersite dependence) is proposed as sufficient.
The ‘southeast UK’ is defined here as the region illustrated in Figure 1, bounded to the south and east by the coast, to the west by (using the UK national grid coordinate system) easting 410000 m and to the north by northing 180000 m. This spatial coverage was governed by: (1) the wish to cover a large part of south and southeast UK; (2) the increased difficulty of achieving a satisfactory spatial model if extending the region further north and/or west; and (3) the computational demands of stochastic modelling, which inhibit the inclusion of many more sites. Therefore, no particular climatic, geographical, political or water company boundaries were used to define the coverage, and the coverage would need to be reviewed prior to a practical application of the model at the regional scale.
Southwest frontal systems dominate the rainfall of southeast UK, hence rainfall generally reduces towards the east and north. As over the UK in general, significant correlations between rainfall and the North Atlantic Oscillation, and other variables and indices related to the Atlantic low pressure systems, are observed particularly in the winter months (Lavers et al., 2010; Murphy and Washington, 2001; Wilby et al., 2004; Yang et al., 2005). Other climate indices reported to have some influence on rainfall in this region include the East Atlantic pattern (Barnston and Livezey, 1987) and storm track blocking indices (Pelly and Hoskins, 2003). The southeast is hotter and more humid in the summer than the rest of the UK, and convective-type rainfall is significant. The average annual rainfall over the case study region is 730 mm, ranging from 524 mm in the dry north of Kent (site 6762 in Figure 1) to 982 mm in the relatively high-altitude coastal South Downs (site 7504 in Figure 1). Considering regional-average annual rainfall (based on the infilled data set presented later in the paper), the standard deviation during the period 1855–2011 was 105 mm, the minimum was 430 mm (1921) and the maximum was 1017 mm (1960). The UKCP09 analysis (Jenkins et al., 2008) did not find a significant trend (95% level) in either summer or winter rainfall in the southeast over the period 1914–2006 (their analysis included the whole of the Thames basin). Significant droughts in the southeast have included the long droughts from 1887 to 1910, and the shorter but more severe 1920–1922, 1975–1976 and 2004–2006 droughts (Marsh et al., 2007).
The rainfall data used in this study originate from the UK Meteorological Office MIDAS database. Details of the rain gauge network and recording practices can be found on the Hadley Centre website (see Table 1). Twenty-eight of the rain gauges provide long-term data (defined here as more than 80 years), and almost all gauges have considerable periods of missing data. In this study, the 28 long-term gauges were used to fit the rainfall model, supplemented by 22 shorter-term gauges to provide a spatially representative set. The gauge numbers and locations are shown in Figure 1, and the extent and continuity of data are shown in Figure 2. The data period used was from March 1855 (the earliest record available, at the Southampton East Park gauge) to December 2011. The daily data were aggregated to monthly; any month which contained one or more missing days was considered to be a missing month (to be infilled by the model). The monthly time series were checked for inconsistencies and, for each gauge, any months with clearly perceived quality problems were removed (44 values of monthly rainfall in total).
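The daily-to-monthly aggregation rule described above (a month is treated as missing if any of its days is missing) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `monthly_totals`, the dictionary-based data layout and the rainfall values are hypothetical and are not taken from the MIDAS database or from the authors' Matlab code.

```python
from calendar import monthrange
from collections import defaultdict
from datetime import date


def monthly_totals(daily):
    """Aggregate {date: rainfall in mm, or None} to {(year, month): total or None}.

    A month is returned as None (missing) if any of its days is absent
    from the input or is recorded as None.
    """
    by_month = defaultdict(list)
    for day, value in daily.items():
        by_month[(day.year, day.month)].append(value)

    totals = {}
    for (year, month), values in by_month.items():
        days_in_month = monthrange(year, month)[1]
        complete = (len(values) == days_in_month
                    and all(v is not None for v in values))
        totals[(year, month)] = sum(values) if complete else None
    return totals


# Hypothetical record: January 1900 complete, February 1900 with one missing day.
daily = {date(1900, 1, d): 2.0 for d in range(1, 32)}
daily.update({date(1900, 2, d): 1.0 for d in range(1, 29)})
daily[date(1900, 2, 10)] = None
result = monthly_totals(daily)
```

Here January is returned as a total and February as missing, so February would be left for the model to infill.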

Data  Definition  Units  Period available  Data source  Website 

Mean sea level pressure (MSLP)  The Met Office Hadley Centre's mean sea level pressure data set, HadSLP2, on a 5° latitude–longitude grid  mbar  1850–2004  Met Office Hadley Centre observations datasets  http://www.hadobs.org/ 
Central England temperature (CET)  Representative of a roughly triangular area of the UK enclosed by Bristol, Lancashire and London  °C  1659–2010  Met Office Hadley Centre observations datasets  http://www.hadobs.org/ 
North Atlantic oscillation (NAO)  Normalised pressure difference between Gibraltar and Reykjavik, Iceland  —  1821–2010  University of East Anglia Climatic Research Unit  http://www.cru.uea.ac.uk/cru/data/nao/ 
Atmospheric carbon dioxide  From 1958–2008, the Mauna Loa air intakes; from 1855–1957, a spline of the Law Dome DE08 and DE08-2 ice cores  ppm  1832–1978 / 1958–2010  The Carbon Dioxide Information Analysis Center  http://cdiac.ornl.gov/ 
Trend  A linear trend  years  —  —  — 
Elevation  Above UK Ordnance Datum (Newlyn)  m  —  British Atmospheric Data Centre  http://badc.nerc.ac.uk/ 
Northing  UK National Grid reference  m  —  British Atmospheric Data Centre  http://badc.nerc.ac.uk/ 
Easting  UK National Grid reference  m  —  British Atmospheric Data Centre  http://badc.nerc.ac.uk/ 
Monthly climate data used as inputs to the model were selected according to the availability of long-term records and according to indications from the literature about their possible importance (Barry and Chorley, 2003; Hulme and Barrow, 1997). These climate variables are: the North Atlantic Oscillation index, Central England temperature, Mean Sea Level Pressure, and the East Atlantic index. Central England air temperature (as opposed to more local air temperature) is used because it spans the rainfall time period of 1855–2011 and at a monthly scale it is almost perfectly correlated with the southeast regional average temperature during the period 1914–2006 (correlation coefficient = 0·99). The spatial inputs are: northing and easting on the UK national grid coordinate system in units of metres, and altitude in units of metres above sea level. The definitions, origins and time periods covered by the data sets are listed in Table 1.
The aim of the regression is to identify a model which characterises the space and time variability of rainfall, and allows simulation. This includes a deterministic component (which estimates the expected rainfall given the input variable values for any month for any location) and a stochastic component (to estimate variability around the expected value including intersite dependence). The analysis methods are essentially empirical, although any models found to be inconsistent with known physical relationships would be rejected. All modelling was done using Matlab version R2010b.
In the study presented here the general regression model in Equation 1 is applied, where Y is the vector of rainfall observations including all 50 sites, and X contains the corresponding values of the predictor variables in Table 1. In pursuit of a normal distribution of regression errors, a one-parameter Box–Cox transform (Kottegoda and Rosso, 2008, p. 366) is applied to the monthly rainfall data before the regression model is fitted

$$y = \frac{r^{\lambda} - 1}{\lambda} \tag{2}$$

where y is the transformed rainfall sample (i.e. a sample from the Y vector), r is the corresponding untransformed value (in mm/month) and λ is the Box–Cox parameter which is optimised to minimise skewness of the error distribution. After the model is applied, y is transformed back into r using the inverse of Equation 2. Each of the input variables in X was normalised so that its sample had zero mean and unit variance. This transformation allows the magnitudes of the optimised regression coefficients to be interpreted as relative sensitivity measures (Draper and Smith, 1998; Tabachnik and Fidell, 1996).
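The transform, its inverse and the choice of λ can be illustrated with a short sketch. It assumes the usual one-parameter Box–Cox form, y = (r^λ − 1)/λ for non-zero λ, and, as a simplification, it selects λ by minimising the skewness of the transformed rainfall sample over a coarse grid, whereas the paper minimises the skewness of the regression errors. The function names and the rainfall values are hypothetical.

```python
import math


def box_cox(r, lam):
    """One-parameter Box-Cox transform of a positive rainfall value r."""
    if lam == 0:
        return math.log(r)
    return (r ** lam - 1.0) / lam


def inverse_box_cox(y, lam):
    """Back-transform from y to rainfall r."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


def skewness(values):
    """Moment-based sample skewness coefficient."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical, positively skewed monthly rainfall sample (mm/month).
rain = [12.0, 25.0, 31.0, 40.0, 44.0, 52.0, 60.0, 75.0, 98.0, 150.0]

# Coarse grid search over lambda in (0, 1]; lambda = 1 leaves the shape unchanged.
grid = [i / 100 for i in range(1, 101)]
best_lambda = min(grid, key=lambda lam: abs(skewness([box_cox(r, lam) for r in rain])))
transformed = [box_cox(r, best_lambda) for r in rain]
```

Because λ = 1 is included in the grid, the selected transform can never leave the sample more skewed than the raw data.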
An independent regression model is developed for each of the 12 months. While this divides the data set into 12 and hence restricts the number of data points available per model, this month-by-month approach has the advantage that it allows the seasonal variability of the rainfall to be characterised by the model coefficients rather than imposing an approximate seasonal structure. Despite the splitting of the data set into 12, the long-term data and multiple sites ensure that there are sufficient data to identify statistically significant models.
Within this regression framework, the model may be fitted either to a single site, where the vector Y contains transformed rainfall data from only one rain gauge and matrix X contains no spatial information, or to multiple sites, where Y contains data from multiple gauges and X contains spatial input variables which aim to explain the variation in expected rainfall between gauges. Only the multisite analysis results are presented herein.
The deterministic regression of transformed rainfall allows identification and analysis of significant input variables, and infilling expected values of monthly rainfall at gauged and ungauged sites. However, to represent variability around the expected value a stochastic error model is also required. This allows the uncertainty in reconstructing partially observed events such as those in 1897–1910 to be modelled, and is essential for the simulation of possible but yet unobserved extreme droughts. The Box–Cox transform allows the errors to be approximately normally distributed with zero mean, hence the error model for any one month for a single site is straightforward. However, two types of error-to-error dependency potentially exist: dependency between errors from one month to another, and dependency between errors from one site to another. The former turns out to be insignificant (as confirmed in the results reported below); the intersite dependency, as should be expected using monthly data from sites within one region, is crucial.
For the set of 50 gauged sites, all of which have periods of overlapping data (Figure 2), the covariances can be estimated, so that C_{si}, C_{ss} and C_{ii} are known for any set of sites s and any site i. Given the errors e_{s} observed at the s sites, the conditional mean and variance of the error at site i follow from the multivariate normal distribution

$$\bar{e}_{i} = C_{si}^{T} C_{ss}^{-1} e_{s} \tag{3}$$

$$\sigma^{2}_{i} = C_{ii} - C_{si}^{T} C_{ss}^{-1} C_{si} \tag{4}$$

A missing month of data at site i is then simulated as

$$y_{i} = \bar{y}_{i} + N(\bar{e}_{i}, \sigma^{2}_{i}) \tag{5}$$

where ȳ_{i} is the expected value from the deterministic component of the model and N(ē_{i},σ^{2}_{i}) signifies a random sample drawn from a normal distribution with mean ē_{i} and variance σ^{2}_{i}. In principle, this method can be used to synthesise data for missing periods in the data record while approximating the observed spatial dependence structure. This would result (as far as the underlying model assumptions allow) in a spatially and temporally consistent historical time series. Furthermore, the stochastic nature of Equation 5 means that multiple realisations can be generated to represent the uncertainty associated with the infilling. For example, periods with few operating gauges will have relatively high uncertainty in regional rainfall, and sites at large distances from the nearest gauged sites will have relatively high uncertainty.
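The conditional infilling step can be sketched numerically. The sketch assumes the standard conditional multivariate normal result, in which the conditional mean of the error at site i is C_si' C_ss^-1 e_s and the conditional variance is C_ii − C_si' C_ss^-1 C_si; this matches the notation defined in the paper, but the code itself, the helper names and the covariance values are hypothetical and use only the Python standard library.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def conditional_error(c_ss, c_si, c_ii, e_s):
    """Mean and variance of the error at site i, given errors e_s at the s sites."""
    w = solve(c_ss, c_si)  # weights w = C_ss^-1 C_si (C_ss is symmetric)
    mean = sum(wj * ej for wj, ej in zip(w, e_s))
    variance = c_ii - sum(wj * cj for wj, cj in zip(w, c_si))
    return mean, variance


def infill(y_bar_i, c_ss, c_si, c_ii, e_s, rng):
    """One stochastic realisation of transformed rainfall at site i."""
    mean, variance = conditional_error(c_ss, c_si, c_ii, e_s)
    return y_bar_i + rng.gauss(mean, variance ** 0.5)


# Hypothetical example: two gauged sites conditioning one missing site.
c_ss = [[1.0, 0.5], [0.5, 1.0]]
c_si = [0.8, 0.6]
c_ii = 1.0
e_s = [1.0, -0.5]
mean, variance = conditional_error(c_ss, c_si, c_ii, e_s)
sample = infill(3.0, c_ss, c_si, c_ii, e_s, random.Random(0))
```

Repeated calls to `infill` with the same conditioning data give the multiple realisations that represent infilling uncertainty; the conditional variance shrinks as the missing site becomes more strongly correlated with the gauged sites.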
In practice, the direct use of the observed covariances in Equations 3 and 4 was problematic using the case study data, because C_{ss} was not positive-definite (Horn and Johnson, 1985), an indication that the sampled covariance is not consistent with a multivariate normal distribution. This is assumed to arise because the overlapping periods used to estimate C_{ss} were not the same for all pairs of sites, and so the sample used to calculate C_{ss} is not necessarily from a unique multivariate distribution. A potential solution is to form C_{ss} using only the sites nearest to site i. However, when tested, this only consistently resolved the problem when data from fewer than five sites were included, which is unlikely to produce an acceptable level of spatial consistency over the region. Instead, the problem of obtaining a real solution to Equations 3 and 4 was resolved by smoothing out the unwanted variability within C_{ss} by fitting a model of intersite covariance, specified below, rather than directly using the sampled observations.
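Whether a covariance or correlation matrix assembled from pairwise estimates is positive-definite can be checked by attempting a Cholesky factorisation, which succeeds only for symmetric positive-definite matrices. The sketch below is illustrative only: the function name and both example matrices are hypothetical, and the check is a standard numerical device rather than the procedure reported by the authors.

```python
import math


def is_positive_definite(c):
    """Return True if the symmetric matrix c admits a Cholesky factorisation."""
    n = len(c)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = c[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # a non-positive pivot means c is not positive-definite
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return True


# Correlations that could all come from one multivariate distribution.
consistent = [[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.8],
              [0.6, 0.8, 1.0]]

# Pairwise values that are mutually incompatible, as can happen when each
# pair of sites is estimated from a different overlapping period.
inconsistent = [[1.0, 0.9, 0.2],
                [0.9, 1.0, 0.9],
                [0.2, 0.9, 1.0]]

ok_first = is_positive_definite(consistent)
ok_second = is_positive_definite(inconsistent)
```

In the second matrix, sites 1 and 3 are each strongly correlated with site 2 but only weakly with each other, which no single multivariate normal distribution can reproduce; a conditional variance computed from such a matrix can become negative.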
Additionally considering the difference in elevation between pairs of sites did not significantly improve upon this model. Parameters α and β were optimised by nonlinear least squares against the observed intersite correlations. Only pairs of sites with more than 50 years of overlapping data were used in this optimisation, to reduce the influence of less precise estimates of correlation. For any two gauged sites, multiplying the correlation by the observed standard deviation of errors at both sites gives an estimate of the covariance. Hence a smoothed version of the observed C_{ss} is obtained, which leads to a consistently real solution to Equations 3 and 4. The significance of using a modelled instead of observed intersite error covariance will be tested as part of model verification.
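The smoothing procedure can be sketched as follows. The functional form of the correlogram used here, c = exp(−(D/α)^β), is an assumption made purely for illustration and should not be read as the paper's Equation 6; likewise, a coarse grid search stands in for the nonlinear least squares optimisation, and the site coordinates, standard deviations and parameter ranges are hypothetical.

```python
import math


def correlogram(distance_km, alpha, beta):
    """ASSUMED powered-exponential correlogram; the paper's own form may differ."""
    return math.exp(-((distance_km / alpha) ** beta))


def fit_correlogram(distances, correlations):
    """Least squares fit of (alpha, beta) by coarse grid search (a stand-in
    for a proper nonlinear least squares optimiser)."""
    best = None
    for alpha in range(50, 1001, 10):       # candidate alpha values in km
        for b in range(2, 21):              # candidate beta values 0.2 to 2.0
            beta = b / 10
            sse = sum((correlogram(d, alpha, beta) - c) ** 2
                      for d, c in zip(distances, correlations))
            if best is None or sse < best[0]:
                best = (sse, alpha, beta)
    return best[1], best[2]


def smoothed_covariance(coords_km, sds, alpha, beta):
    """Covariance matrix: modelled correlation times the two site standard deviations."""
    n = len(coords_km)
    cov = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            d = math.hypot(coords_km[j][0] - coords_km[k][0],
                           coords_km[j][1] - coords_km[k][1])
            cov[j][k] = correlogram(d, alpha, beta) * sds[j] * sds[k]
    return cov


# Synthetic "observed" correlations generated from known parameters.
distances = [10.0, 50.0, 100.0, 150.0, 200.0]
observed = [correlogram(d, 300, 1.0) for d in distances]
alpha, beta = fit_correlogram(distances, observed)

coords = [(0.0, 0.0), (30.0, 40.0), (60.0, 0.0)]   # hypothetical sites, km
cov = smoothed_covariance(coords, [1.2, 1.0, 0.8], alpha, beta)
```

Because every entry of the matrix comes from one smooth function of distance, the pairwise inconsistencies of the sampled covariances are removed.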
The error model described above can be modified to allow extension of the historical record in space and time. Extension only in time requires generation of sets of errors over the 50 sites for months when no rainfall observations exist. Extension only in space means generating rainfall within the record period for hypothetical sites, for example to produce gridded rainfall. In this case (because i represents an ungauged site) rather than using an observed error variance in C_{ii} and C_{si}, a model is needed. This is approached by assessing whether and how error variance changes across the 50 gauged sites, and interpolating to the synthesised sites. Extending the record in both space and time combines these two modifications.
The aim of model verification is, first, to test to what degree the statistical properties of the errors conform to the assumptions which have been made in model estimation. The specific tests carried out are listed here.
(a) Bias in errors over space and time.  
(b) Deviation of errors from a normal distribution.  
(c) Dependence of errors on input variables.  
(d) Stationarity of variance in errors over space and time.  
(e) Autocorrelation of errors between months. 
Recognising that the properties of the errors will not exactly conform to the assumptions (no model is perfect), the second stage of verification is to test if this nonconformity significantly affects the model's ability to simulate relevant observed rainfall statistics. Multiple realisations of rainfall are simulated for the gauged time period and sites, not conditional on the observed historical rainfall, while still being conditional on the historical input variables X. This simulation represents the range of possible rainfall time series which could have occurred (according to the model) given the historic climate variability. If the model is adequate, the observed rainfall data will appear to be one realisation from the simulated distribution of rainfall (Chandler and Wheater, 2002; Yang et al., 2005). Because the observed rainfall statistics have some uncertainty themselves due to the missing data, this stage of verification is preceded by using the model for infilling the historical record, in our case from 1855 to 2011. While the infilled data are dependent on the model itself, and thus not a perfect testbed, the infilling uncertainty proves to be low in the case study; moreover, explicitly estimating the uncertainty in the historical rainfall in this way is considered an improvement upon the typical practice of neglecting observation uncertainty. The following specific comparisons of simulated and infilled rainfall were used.
(a) Time series of annual site-averaged rainfall, winter (October to March) site-averaged rainfall, and summer (May to September) site-averaged rainfall. These averages are not weighted by the area represented by each site.  
(b) Statistics of interannual variability of site-averaged rainfall for each of the 12 months: average, standard deviation, skewness and selected percentiles.  
(c) Annual average rainfall at each site.  
(d) Variance, skewness and correlation of annual average rainfall over sites.  
(e) Two-year, five-year and ten-year running averages of annual and winter rainfall, to assess the ability of the model to represent persistence. 
It was decided not to apply split-sample validation, in which some data are omitted from model fitting and used solely for verification, in order to maximise data available for model fitting. However, the analysis of model residuals provides information about model bias over time and space that is similar to split-sample testing, and the testing of the model on various statistics not used in the model fitting is the typical approach to verification of stochastic rainfall models (Chandler and Wheater, 2002; Yang et al., 2005). Although there is not enough space to show them all, a sample of results is shown below. Some supplementary results are available on the article's webpage.
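The running-average comparison in item (e) can be sketched as below. The idea is to compute multi-year running means of the observed (infilled) series and of each simulated realisation, and then to see where the driest observed spell sits within the simulated distribution. The function names and the rainfall values are hypothetical, and the summary statistic chosen here (the share of realisations with an equally dry or drier spell) is one possible comparison, not necessarily the one used by the authors.

```python
def running_mean(values, window):
    """Means of every run of `window` consecutive values (no padding at the ends)."""
    if window > len(values):
        return []
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def fraction_drier(simulated_series, observed_series, window):
    """Share of simulated realisations whose driest `window`-year mean is
    at or below the driest observed `window`-year mean."""
    observed_min = min(running_mean(observed_series, window))
    hits = sum(1 for series in simulated_series
               if min(running_mean(series, window)) <= observed_min)
    return hits / len(simulated_series)


# Hypothetical annual rainfall totals (mm).
observed = [700.0, 430.0, 600.0, 800.0, 750.0]
simulated = [
    [720.0, 500.0, 520.0, 810.0, 700.0],
    [650.0, 640.0, 700.0, 760.0, 690.0],
]
two_year = running_mean(observed, 2)
share = fraction_drier(simulated, observed, 2)
```

A share close to zero or one over many realisations would suggest that the model under- or over-represents persistence at that time scale.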
The input climate variables found to significantly affect the time variation of rainfall in at least some months are: Central England temperature, Mean Sea Level Pressure and the North Atlantic Oscillation index. For most months, a linear trend (increasing rainfall) was also present. This trend term by itself does not necessarily mean increasing rainfall because it is the combined effect of all the input variables that matters; however, repeating the regression using only a trend term also illustrates a general increase in rainfall. Easting and northing coordinates and altitude were significant in explaining the regional variability. The coefficient estimates over the 12 months are shown in Figures 3(a) to (i), together with intervals which are not significantly different from zero at the 95% significance level. There is no easy analytical solution for these 95% significance intervals because of the intersite dependencies, and so they have been estimated using simulation (i.e. with the regression coefficients set to zero, the data were simulated 200 times from the estimated error model, and 200 sample regression coefficients were identified: the top and bottom 2·5% were removed to give the simulated intervals). There was interaction between the effects of coefficients due to the collinearity between input variables. This was most notable for the coefficients for Mean Sea Level Pressure and Central England temperature (e.g. in January, their correlation was −0·77), and for the coefficients for Central England temperature and trend (e.g. in January, their correlation was −0·25). This leads to relatively high variance in these coefficient estimates and hence wide significance intervals in Figure 3. Nevertheless, Figure 3 illustrates that all the input variables have significant independent effects in at least some months. The second-order effect of variables (e.g. whether the North Atlantic Oscillation has greater influence for the more southerly gauges) was tested by using combinations of variables as inputs to the regression. The only significant second-order effect was the combined effect of Mean Sea Level Pressure and the North Atlantic Oscillation: in February, May and November, pressure had a greater influence when the oscillation was strong (Figure 3(i)). The magnitudes of the coefficient values are measures of relative sensitivity of the rainfall to the inputs, showing the dominant roles of Mean Sea Level Pressure, northing and altitude (Figures 3(c), (g) and (h)).
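The simulation-based significance intervals can be illustrated with a deliberately simplified sketch: a single standardised predictor, errors drawn independently from a normal distribution, and 200 simulations under the null hypothesis of a zero coefficient. In the paper the errors are drawn from the fitted intersite error model, which is the reason simulation is needed in the first place; the independent errors, the function names and the predictor values below are assumptions made only to keep the example short.

```python
import random


def slope(x, y):
    """Ordinary least squares slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def null_interval(x, error_sd, n_sims=200, seed=1):
    """Interval of slopes expected when the true coefficient is zero.

    Data are simulated n_sims times from the error model alone; the top and
    bottom 2.5% of the fitted slopes are discarded.
    """
    rng = random.Random(seed)
    slopes = sorted(
        slope(x, [rng.gauss(0.0, error_sd) for _ in x]) for _ in range(n_sims)
    )
    cut = int(round(0.025 * n_sims))  # 5 values from each tail when n_sims = 200
    return slopes[cut], slopes[-cut - 1]


# Hypothetical standardised predictor with 50 values.
x = [(i - 24.5) / 14.6 for i in range(50)]
low, high = null_interval(x, error_sd=1.0)
```

A fitted coefficient lying outside the interval from `low` to `high` would be judged significantly different from zero at the 95% level.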
The linear trend term is significant at the 95% level in only three months – January, March and December. However, it is above zero for all months except July, and for this to occur due to random variability is extremely improbable. Hence it was concluded that the trend over the period 1855–2011 was significant in all seasons except summer. For the purpose of explaining the rainfall variability and providing the potential for extrapolating the model, the trend would ideally be explained by physical phenomena. Various attempts were made to introduce explanatory variables to explain this trend, including nonlinear transforms of Central England temperature, Mean Sea Level Pressure and the North Atlantic Oscillation index and their interactions, but these were not helpful. If time series of atmospheric carbon dioxide concentrations (constructed from the Hawaii measurements of Keeling et al. (1995) and the Antarctic ice cores of Etheridge et al. (1996)) are used as inputs then the R^{2} values are slightly increased and the linear trend term becomes much less significant. While the statistical explanation for this is simple – the carbon dioxide data increase over time hence replacing the trend term – there is no clear physical explanation of why carbon dioxide should explain rainfall variability when the climate variables do not and hence the carbon dioxide input was not adopted. The attraction of this model, however, is noted again below when considering its effect on the structure of errors.
The regression model summarised in Figure 3 used the same Box–Cox transform for all 12 months and all 50 sites, with an optimised λ of 0·41. The use of a constant λ for all 12 regression models was necessary to make meaningful comparisons of coefficients between months (the Box–Cox transform rescales the data, so the use of 12 different values of λ would result in regression coefficients which were not comparable over months as they are in Figure 3). The use of constant λ, however, causes undesirable skewness in the errors for several months (in July, for example, the skewness coefficient was 0·39) and reduces the applicability of the error model specified in Equations 3, 4 and 5. For further analysis, therefore, λ was optimised for each month individually, which produced near-normal distributions of errors for each of the 12 models. Optimising λ individually for all 50 sites for each month is possible, but likely to lead to non-unique solutions and, in any case, using a spatially uniform value produced satisfactory error distributions.
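The statement that a positive trend coefficient in 11 of the 12 months is unlikely to arise by chance can be illustrated with a simple sign test. The calculation below assumes, for illustration only, that under the null hypothesis of no trend each monthly coefficient is equally likely to be positive or negative and that the 12 months are independent; the symbol K for the number of positive coefficients is introduced here.

```latex
P(K \ge 11) = \frac{\binom{12}{11} + \binom{12}{12}}{2^{12}}
            = \frac{12 + 1}{4096}
            = \frac{13}{4096} \approx 0.0032
```

Under these assumptions, a result at least this one-sided would be expected in roughly 3 of every 1000 cases.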
When averaged over sites, the errors showed little apparent structure. This included no discernible relationships between errors and input variables, the error histograms had no visible deviation from a zeromean normal distribution, and there were no significant autocorrelations of errors from one month to the next. The latter result is illustrated in Figure 4, in which the error autocorrelations for the gauges in Kent are plotted. Although there is significant monthtomonth correlation in the actual rainfall time series, this is represented by the deterministic part of the model, leaving the monthtomonth dependency between errors insignificant. This supports the view that a continuous time series can be simulated using an independent model for each month. There was a tendency for the model to underestimate rainfall in the early years of the record, introducing a visible bias in the errors in the period 1855–1875 (although not illustrated here, this will be seen in the verification results described later and shown in Figures 8 and 9). This apparent bias occurred because the linear trend term describing the general increase in winter rainfall was applied over the whole series, whereas closer inspection reveals that there was a much weaker trend between 1855 and 1900. Again, it is tempting to use the atmospheric carbon dioxide concentrations instead of the linear trend: this substantially reduces the bias because carbon dioxide concentrations rose more slowly in the pre1900 period. However, as previously noted, there is a reluctance to do so without a physical explanation. Also, the small number of gauges operational during these problematic early years (Figure 2) means that there would be relatively low confidence in such a model.
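The check for month-to-month dependence in the errors can be sketched as a lag-one autocorrelation calculation. The comparison against ±1·96/√n is the usual large-sample approximation for a white-noise series and is an assumption of this sketch rather than a description of how the intervals in Figure 4 were derived; the function names and the error series are hypothetical.

```python
import math


def lag1_autocorrelation(errors):
    """Sample autocorrelation of a series at a lag of one time step."""
    n = len(errors)
    mean = sum(errors) / n
    numerator = sum((errors[t] - mean) * (errors[t + 1] - mean)
                    for t in range(n - 1))
    denominator = sum((e - mean) ** 2 for e in errors)
    return numerator / denominator


def white_noise_bound(n):
    """Approximate 95% bound on the autocorrelation of an independent series."""
    return 1.96 / math.sqrt(n)


# Hypothetical, strongly alternating error series used only to exercise the code.
errors = [1.0, -1.0] * 5
r1 = lag1_autocorrelation(errors)
significant = abs(r1) > white_noise_bound(len(errors))
```

For errors that are effectively independent from one month to the next, as reported for the case study, `r1` would be expected to fall inside the bound for most gauges.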
The spatial error analysis also illustrated potential minor flaws in the model. This is seen in Figure 5, which, as an example, plots the mean monthly errors for the 15 sites in Kent. While the statistical significance of many of these errors is indicated by their lying outside the estimated 95% significance intervals, their physical significance is questionable, because the bias may be explained by measurement error. For example, Rodda and Smith (1986) present 5% as the typical undercatch associated with gauges not installed at ground level, and they found that in some cases the measurement error was much larger than that. From our model, the maximum observed relative error, out of all sites, was 5% (at the driest site in the region, Figure 5(m)). Nevertheless, an improved spatial model should be considered in future model development.
For each month, there was no evident spatial structure in the error variance estimates. This is illustrated in Figure 6, which shows the sample standard deviation of errors for the sites in Kent with their 95% confidence intervals (the intervals are calculated using the approximate solution described by Kottegoda and Rosso, 2008; p. 244). For comparison, Figure 6 also shows, superimposed as a horizontal line, the sample standard deviation of errors when all 50 sites are considered together, illustrating that, with very few exceptions, this regionally lumped value is a fair estimate for each individual site. Hence the assumption is made that the variance of errors is uniform over the whole region within any one month. The correlation of errors between sites, on the other hand, displays a strong spatial structure, with correlation decreasing with distance as described by Equation 6. The fitted correlogram models are illustrated in Figure 7. The models are relatively consistent over the months, with a faster decline in correlation with distance from April to September in comparison with October to March, reflecting the increasing role of more localised, convective events in summer.
First, the historical data from 1855 to 2011 were infilled using the model. For each month/site with missing data, 200 samples of the time series of errors were used to represent the stochastic variability. The 200 time series of infilled annual, summer and winter site-average rainfall are shown in Figure 8. Notably, uncertainty in the infilled data is highest during the earlier years when there were fewer active gauges. Nevertheless, the uncertainty is not overriding in terms of the regional rainfall estimate, because: (1) much of the rainfall variability is predictable by the regression equation; (2) the relatively high inter-site correlations evident in Figure 7 mean that the long-term sites provide much of the necessary information about residual variability; and (3) averaging over sites, and over years or seasons (as in this plot), reduces the variance. If considering sub-regions, the uncertainty in the earlier years would become higher, especially when moving further from the long-term gauges; and if considering rainfall in individual months then the uncertainty is also higher. The infilling was also applied to synthesised sites on a 5 km grid, producing a spatially quasi-continuous data set covering the period 1855–2011 (results not shown here).
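The conditioning of an ungauged (or missing) site on the gauged sites can be sketched using the standard result for a conditional multivariate normal distribution, written in the notation of this paper as ē_i = C_si^T C_ss^{-1} e_s and σ^2_i = C_ii − C_si^T C_ss^{-1} C_si. It is an assumption that this matches the error model of Equations 3, 4 and 5, which are not reproduced in this section. The function names are hypothetical, and the small linear solver is included only to keep the sketch self-contained.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def conditional_error(c_ss, c_si, c_ii, e_s):
    """Conditional mean and variance of the error at site i given errors e_s."""
    weights = solve(c_ss, c_si)  # C_ss^{-1} C_si, using the symmetry of C_ss
    mean = sum(w * e for w, e in zip(weights, e_s))
    var = c_ii - sum(w * c for w, c in zip(weights, c_si))
    return mean, var


def sample_errors(mean, var, n_samples=200, seed=0):
    """Draw n_samples values from N(mean, var); the seed is for repeatability."""
    rng = random.Random(seed)
    sd = max(var, 0.0) ** 0.5  # guard against tiny negative rounding error
    return [rng.gauss(mean, sd) for _ in range(n_samples)]
```

Each sampled error would then be added to the regression estimate of transformed rainfall and back-transformed to give one infilled rainfall value. The conditional variance shrinks as the target site becomes more strongly correlated with the gauged sites, which is consistent with the lower infilling uncertainty in periods with more active gauges.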
Second, 200 time series of rainfall (not conditioned upon the observations) were simulated with the model to represent statistically plausible ranges of rainfall. The 95% confidence intervals derived from the ensemble of site-average rainfall are shown in Figure 8, as well as the maximum and minimum values from the ensemble. Comparing the infilled and simulated distributions in Figure 8, it appears that the infilled data are a sample from the simulated rainfall, supporting the view that the model usefully represents the historic variability. The rainfall during the extreme winter drought of 1975–1976 and the extreme summer drought of 1921 is only just encompassed by the simulation bounds, implying that these drought events were extreme given the large-scale climatic conditions at the time. The long drought of 1887–1910 also appears from Figure 8 to be captured by the simulations, as are the dry winters in 1879–1880 and 1897–1898, and the pairs of dry winters in 1995–1997 and 2004–2006.
Figure 8, however, does not allow interannual drought persistency to be properly evaluated. To do so, two-year, five-year and ten-year running averages are presented in Figure 9. This illustrates that the series of droughts from 1887 to 1910 is captured, but only by the driest of the simulations, illustrating the extremeness of this drought period given the forcing climate. It is pertinent to note that, according to Figure 9, the most severe two-year drought on record (1920–1922) could recur; indeed it appears that the southeast region was fortunate in 2004–2006 not to have suffered a similar episode given the general climatic conditions at that time. As previously discussed, a feature of Figure 9 is the model's tendency to underestimate multi-year rainfall in the period 1855–1875.
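The running averages and the simulation envelope against which they are compared can be computed as in the following minimal sketch. It assumes annual rainfall totals held as plain lists, one list per ensemble member; the function names are hypothetical and the exact averaging convention used for Figure 9 is not stated here.

```python
def running_mean(series, window):
    """Running average of a series over `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    total = sum(series[:window])
    out = [total / window]
    for k in range(window, len(series)):
        total += series[k] - series[k - window]  # slide the window by one
        out.append(total / window)
    return out


def ensemble_envelope(ensemble, window):
    """Lowest and highest running average across ensemble members per step."""
    smoothed = [running_mean(member, window) for member in ensemble]
    lows = [min(values) for values in zip(*smoothed)]
    highs = [max(values) for values in zip(*smoothed)]
    return lows, highs


def within_envelope(observed, ensemble, window):
    """True where the observed running average lies inside the envelope."""
    lows, highs = ensemble_envelope(ensemble, window)
    obs = running_mean(observed, window)
    return [lo <= o <= hi for lo, o, hi in zip(lows, obs, highs)]
```

Calling `within_envelope` with windows of 2, 5 and 10 years on the infilled series and the 200 simulations gives the kind of comparison described above: a drought period is captured if its running average stays inside the envelope, even if only just.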
Figure 10 shows a number of temporal and spatial statistics of the infilled and simulated data. Generally, this further supports the view that the model is approximating the properties of the observed rainfall. Some statistics (the minimum, maximum, standard deviation and skewness over time) are persistently towards the lower bound of the simulated distribution, which is expected due to the skewed nature of the rainfall distribution. Figure 10(c) shows, however, a clear tendency to overestimate the maximum July and October rainfalls: this is associated with errors in representing the distribution of transformed rainfall in these months using a normal distribution. Another interesting result in Figure 10 is the model's tendency to overestimate the spatial skewness of average monthly rainfall in October, November and December. While the model predicts insignificant spatial skewness in these months, the infilled data imply that there are a few sites with much lower monthly averages than the norm, producing significant negative skewness. This is due to the overestimation of rainfall at some of the driest sites in the region, in northern Kent, as seen in the negative residuals at sites 6762 and 6898 in Figure 5. As previously discussed, this may be resolved by using a more sophisticated spatial model (e.g. quadratic terms for east and north coordinates); arguably, however, this would be overfitting, as the biases at these sites are within possible measurement errors.
This paper was motivated by the need for spatially and temporally complete, long-term rainfall records to support regional drought management. The southeast UK is an example of a region which is vulnerable to extreme droughts, and repetition of historical interannual droughts is a worrying prospect under current and future demand for water. The southeast UK is, however, fortunate in having gauged sites going back to the mid-nineteenth century, allowing more insight into rainfall variability and more reliable estimation of rainfall extremes than is generally possible. Nevertheless, there are long periods of missing records, and many parts of the region with no long-term records. Significant effort has previously been made under the UKCP09 programme to produce a nationally applicable statistical downscaling tool for UK daily rainfall simulation. However, that downscaling tool is not applicable to infilling historic rainfall in a spatially consistent manner, and may have limitations in replicating extreme historic droughts because it is not linked to the physical drivers of rainfall and has been fitted using a limited range of droughts (Chun, 2011; Jones et al., 2009). The model presented in this paper aims to address these limitations and hence provide complementary data sets.
This paper described a set of regression models for characterising rainfall variability, and infilling and simulating monthly rainfall. The models include a deterministic component that models expected monthly rainfall under specified large-scale climatic conditions, and also a stochastic component that simulates the random variability around the expected value. Gridded rainfall can be produced for a range of observed or synthetic droughts. Using the case study of southeast UK, 50 long-term rain gauges with records spanning 1855 to 2011 were used to identify and assess the models. The large-scale variables found to affect rainfall were generally consistent with the findings of previous research on UK rainfall: air pressure, air temperature and North Atlantic Oscillation. A positive linear trend term was identified throughout the twentieth century in all seasons except summer. However, the trend was weak in comparison with the other effects and the random component, and did not preclude recurrence of the severe interannual droughts observed in the record.
The model assessment illustrates the potential value of relatively simple rainfall models for generating realistic monthly rainfall patterns. Performance in terms of error diagnosis and comparison of infilled and simulated statistics was considered to be good, although there were two main issues which might benefit from further investigation. First, spatial biases arose from the use of a simple spatial model, causing apparent overestimation of rainfall at some of the driest sites in Kent. These biases might be explained by rainfall measurement errors, although their particular prevalence in north Kent makes this seem unlikely. Second, temporal biases arose in the period 1855–1875 because the linear trend was weaker in this early period. Using atmospheric carbon dioxide as an input helped to explain the non-stationarity in the trend. It may be speculated that carbon dioxide has influenced global climate patterns, and hence southeast UK rainfall, in a manner that cannot be represented by the combinations of pressure, temperature and NAO and their interactions investigated in this paper. For example, although east Atlantic ‘blocking’ patterns are known to be influenced by global climate and to affect rainfall (Pelly and Hoskins, 2003), they were omitted in this investigation because reconstructions of blocking only date back to 1958. This deserves some further investigation. In terms of the model's ability to simulate interannual drought, indices of the long droughts within 1887–1910 were within the range of simulations, as were indices of the extreme two-year droughts of 1920–1922, 1933–1934 and 1975–1976. According to the model, the recent droughts of 2004–2006 could have been much more severe given the climatic conditions at the time, potentially more severe than the 1920–1922 event.
The ability of the model to simulate rainfall as a function of large-scale climate variables and indices makes it tempting to employ the model for downscaling global climate model outputs for climate change impacts assessment. However, extrapolating the historic signals to future climate in this manner, although common practice (e.g. Chun et al., 2009; Haylock et al., 2006; Maraun et al., 2010), is not recommended unless it can be shown that the signals are expected to be stationary under a changed climate. Further research is required towards characterising non-stationarity and how it might be resolved in the model. Perhaps the primary limitation of the model described here is that for some applications daily rainfall would be preferred. Development to simulate daily rainfall would require the wet–dry day distribution to be modelled independently of the rainfall depth distribution (Mehrotra and Sharma, 2010). This would naturally lead to the more generalised linear modelling techniques used, for example, by Yang et al. (2005). However, for regional analysis of interannual droughts in systems with large storage capacity such as southeast UK, monthly scale analysis is likely to be sufficient. Another potential extension to the analysis would be extending records even further back in time by including palaeo data as predictors (Henley et al., 2011).
A major challenge in water resource planning is the hindcasting of hydrological data to ensure that possible extreme droughts, including interannual sequences of droughts, are adequately considered. A second challenge, important when considering options for intra- and interregional water transfers, is spatially consistent characterisation of droughts. These challenges are especially relevant in the water-stressed southeast UK. Currently available climate modelling tools and data sets, such as UKCP09, are not by themselves designed to meet these challenges. This paper describes and tests a statistical model that infills and extends historical rainfall observations to allow improved consideration of extreme and interannual droughts in the southeast UK, with potential applicability to other regions where similar problems exist.
Acknowledgements
This research was supported by the Grantham Institute for Climate Change at Imperial College London. Thanks also to NERC and the Meteorological Office, and other data providers listed in Table 1.