Page Banner

United States Department of Agriculture

Agricultural Research Service

Related Topics

headline bar
1 -
2 - Figure 1
3 - Figure 2
4 - Figure 3
5 - Figures 5-9
6 - Table 1
7 - Table 2
8 - Table 3
9 - Table 4
10 - Table 5
11 - Table 6
Hydrology and Remote Sensing Laboratory Research Papers - Crop Condition and Yield Research - Paul C. Doraiswamy and Alan J. Stern
Ongoing Projects
Research Papers
Submitted to P.E. and R.S.

Selection of the Optimum Linear Regression Relationship for Determining Spring Wheat Yields Using NOAA AVHRR Data

Paul W. Cook
National Agricultural Statistical Service, USDA
Fairfax, VA 22030.

Paul C. Doraiswamy
Agriculture Research Service, USDA
Beltsville, MD. 20705.


The objective of this study was to examine the linear regression relationship of North Dakota (see Figure 1) county spring wheat yields to the county averages of Normalized Difference Vegetation Index (NDVI) data derived from NOAA AVHRR data for ten biweekly periods for the 1989-1992 crop growing seasons. An analysis of the available data using S-plus showed that Period 22 (late June to early July) was the best individual period for each individual year. However, for the combined four years of 1989-1992, Period 22 NDVI county averages gve a much lower R-square than had any individual year. Summing the NDVI county averages for Periods 20,22,24, and 26 (June 22 - August 16,1990) for the four years gave the highest R-squares for any sums of groups of four or more periods. These analyses show that the previously chosen sums of four biweekly periods of NDVI data (Doraiswamy and Cook, 1995) give the best linear relationship to spring wheat county yields for the four years of the study.


The Agricultural Statistics Board (ASB) of the USDA's National Agricultural Statistics Service (NASS) is responsible for publishing official estimates of crop acreage. yield, and production for the United States. The ASB considers indications from multiple surveys and administrative sources throughout the growing season to calculate these estimates at National and State levels. Survey information includes farmer interviews conducted by enumerators, telephone contacts, in-field objective yield measurements, and mail questionnaires. NASS field offices calculate and publish county estimates of crop acreage, yield, and production for major crops. County level estimates must add to equal the official state estimates and are generally estimated from administrative data and a large mail survey (Iwig, 1993). State crop yields are obtained by dividing the total State crop production by the total harvested acres for that crop. County crop yields can be decided in this manner as well.

Remote sensing data have had a continuing potential use for monitoring extreme weather conditions that can adversely affect the crop's development (Doraiswamy and others, 1994). Satellite-based sensors usually have visible, near-infrared, shortwave-infrared, and even thermal spectral bands. The LANDSAT Thematic Mapper sensor collects data in all these spectral bands with a 30-meter (I 20-meter thermal resolution), but its repeat cycle of 16 days and high processing costs have limited its effectiveness to small geographic areas. Other sensors such as SPOT's MultiSpectral Sensor have less thorough spectral coverage (visible and infrared only), but do provide greater spatial resolution and repeat cycles from their twosatellite system. However, higher resolution sensors, such as TM and S130T are both expensive to purchase and process. Even if their costs were less, their extended repeat cycle times would limit their capabilities to assess crop condition during the growing season (Homer, C. HR. and others, 1997; Penner, 1995).

Large scale crop yield studies require frequent remote sensing data collection over wide areas. One sensor that provides daily data collection at low cost over wide areas is that of the Advanced Very High Resolution Radiometer (AVHRR) sensor on board a National Oceanic and Atmospheric Administration (NOAA) weather satellite. The AVHRR sensor provides only a 1. 1 km pixel resolution (at nadir). However, by providing daily overpass data from five spectral bands, the AVHRR sensor is more useful for monitoring crop condition over large areas and over time than the more spatially accurate data from LANDSAT or SPOT. What is most important, the visible (red) and near-infrared reflectance bands from AVHRR can be used to calculate the Normalized Difference Vegetation Index (NDVI = (Near Infrared Band - Red Band)/(Near Infrared Band + Red Band)) that has a definite relationship to crop condition (Yin and Williams, 1997). Indeed, Statistics Canada has used NDVI data in a regression procedure to make preliminary spring wheat yield forecasts for Canada since 1989 (Reichert, and others, 1998). This study evaluates NDVI values from ten biweekly AVHRR composite images from each of four years to assess the relationship between spring wheat yields and AVHRR data.

Figure One


Tile USGS's EROS Data Center (EDC) composites individual date AVHRR images over a 14-day period to create a biweekly geo-registered digital product for the Conterminous Unites States. Each image is essentially cloud free and provides the maximum NDVI (over the 14-day period) at each pixel location. EROS Data Center provided the AVHRR biweekly composite data for the conterminous U. S. resampled to one kilometer resolution (Eidenshink. 1992). When clouds do appear in the composite imagery, the pixel values for areas surrounding the clouded regions may also be in question since haze or the effects of cloud shadow can be averaged into these NDVI values. Therefore, care must be taken in using NDVI values taken from scenes with clouded regions.

The EDC also provided a categorized conterminous U. S. land cover product with 167 categories (Brown, and others, 1993) developed using 1992 NDVI data, also at one kilometer resolution. The North Dakota biweekly AVHRR composite data for the four crop seasons from 1989 through 1992 were chosen Im this analysis, both North Dakota and South Dakota were examined in an earlier paper (Doraiswamy and Cook, 1995).

Subsetting the categories of the EDC land cover product provided a mask for likely locations of spring wheat production in North and South Dakota. For many counties this mask was an Lipper bound on the spring wheat acreage ('or these flour years since 1992 had the largest State spring wheat acreage for the four years. Specifically, North Dakota had 7.6 million acres of spring wheat in 1989, 8 million for 1990, 7 million for 1991, and 9.2 million for 1992 (North Dakota Agricultural Statistics, 1995). Actual harvested acres were 7.25 million (1989), 7.7 million (1990),6.85 million (199 1) and 9.1 million (1992). The county averages for those pixels most likely to have spring wheat acreage were calculated from the pixels chosen by the mask. Those counties with less than 100 pixels likely to contain spring wheat were masked in calculating the value of the coefficients for predicting yields.

Although the relationship between the masked AVHRR pixels and the spring wheat acreage within each county was improved for many counties, many counties among the 53 counties still had too many remaining nonspring wheat AVHRR pixels (See Figure 2.). Improving the accuracy of the mask should aid in increasing the relationship between the county NDVI averages and the county spring wheat yields. More recent unpublished work (Stem and others, 1998) holds some promise in greatly improving on which AVHRR pixels to denote as spring wheat for creating county AVHRR averages through use of Landsat TM categorized imagery. Finally, an improved mask would help to ensure that the county NDVI averages represent spring wheat phenological development rather than mixing in the development patterns of other crops and land covers.

The NDVI values relate to tile "greenness" of the crop and so rise and fall during the crop growing season. Growth phases included in the original analysis extended froill flowering to wax ripeness of spring wheat. Weather and planting data for North Dakota (and South Dakota) suggested that the dates of June 13 through August 13 on average would match these growth stages. The four U. S. AVHRR biweekly period composites covering the time period were from June 22 through August 16 in 1990 (see Table 1). The dates from year to year do vary, however, changes in crop development cycles can vary much more due to planting dates and weather conditions (North Dakota Agricultural Statistics, 1995). So the crop development changes would have much greater impact on the final relationship between NDVI values and county spring wheat yields than would the changes in dates for the periods of AVHRR data between years.

Table One

Figure Two

A linear regression model in the earlier paper (Doraiswamy and Cook, 1995) related official NASS county spring wheat yield estimates to sums of four selected biweekly NDVI average county values of AVHRR data for the combined four years of data from 1989 through 1992. NDVI averages for

S-Plus Analysis of the Data Set: Evaluation of the Ten Individual AVHRR Periods

Figure 3 shows how the county-based spring wheat yields, were highly variable during the four years at the crop reporting district level. Clearly, as we go from 1999 in the lower left to 1990 in the lower right and finally to 1992 in the Lipper right, tile highest yielding districts vary by year. Howevcr. the South Central (SC) district does remain low in all four years with consistency. 'File variation in spring wheat yields by crop reporting district and the inaccuracy of the mask that removed nonspring wheat related AVHRR pixels increased the difficulty in creating the final relationship between the AVHRR county averages and that of' tile County spring wheat yields.

The S-PLUS Leaps procedure was used to examine all regressions using the AVHRR period data either individually or as groups of periods without running each possible least squares regressions individually. The criterion used by the S-Plus Leaps procedure to evaluate tile regressions was the Mallows Cp statistics. Specifically, the Mallows C. Statistic (Kotz and Johnson, 1982) is defined as the following:

Cp = RSSp/s 2 - n+2P

where RSSp is the Residual Sum of Squares over the k sets of variables of the residual variables and S2 is the estimated residual variance using the full rnodel. Also, here, n equals 53 for the number of counties, k equals ten for the number of period variables, and p is the number of parameters in the regression equation and so equals two since tile intercept is included. The evaluation of the statistic by the S-PLUS Leaps procedure required choosing the value of Cp that is closest to the number of' regression parameters (p). However, since no one variable model is sufficient, only the smallest Cp helped to choose the best period. Of course. each analysis included an intercept term within the model definition. Each year had a varying selection from among the ten periods and so no definite pattern prevailed First, the Leaps' analysis of the individual PERIODS of AVHRR data provided Insight into tile relative importance of each PERIOD considered. However, three years (1989 through 199 1) gave PERIOD 22 the best Cp ranking for two factors (the other factor was the intercept). The 1992 data gave the best ranking to Period 26 whereas Period 22 was number four. Table 2 summarizes the results of the Leaps' analysis by presenting the Cp values for individual periods and showing tile ranking of the variables for each year. Table 3 presents the results of the regressions using Period 22 alone.

Table Two

Figure Three

Table Three

Cp values for additional periods in the model were less, however, there was no consistent grouping of periods across years to choose. The intercept and period 22 terms had highly significant t-values of 0.0000 for all coefficients except the intercept for 1989 that was significant at 0.0011 and the intercept for 1992 that was not significant at 0.6780. The large ranges of values for the intercept values and regression coefficients for the Period 22 terms plus the large drop in Rsquare values when the years are combined show that the years are not consistent.

1 Values in this table are rounded to simplify the presentation.

2 Values in brackets [ ] are the rank of the specific period for that year.

Using all ten periods in the regression equation did provide higher Rsquare values for each individual year. Table 4 provides some results of the regressions for the four years. Although the Rsquare values might be quite good for each regression equation. the accuracy measures evaluating tile regression fits tell another story. The coefficients for the periods have very large standard errors that are from two to five times as large as the coefficient itself'. Thus, very few of the regression coefficients are significant. Clearly. using all the periods individually will not provide a good predictive equation, so the combined years are not considered.

Table Four

Evaluation of SUMS of Four or More PERIODS

Summing average county NDVI values over periods should be more related to yields than Would individual composite periods (Doraiswamy and Cook, 1995). The next analysis used the S-Plus Leaps' function applied to all possible combinations of sums of the period county averages using four through all ten periods. This procedure evaluated whether or not sums of another group of four or more periods would relate better to the spring wheat yields.

Although few sums produced linearly independent variables, a subset of all possible sums remained in the Leaps' analysis. Period 12 became the first period (denoted as 1) while Period 30 became tile tenth (denoted as A) period during the growing season. Sums of the period county NDVI averages were expressed as, for example, SUM. I A, that is, the sum of periods one through ten (A). Ten sums remained in the analysis as shown in Table 5, below. SUM.58 has the smallest Cp for each of the four years. However, no year has a Cp value sufficiently near the optimum value of two to conclude that tile regression will fully explain tile spring wheat yields.

Table Five

The plot in Figure 4 helps to explain more clearly the meaning of tile coefficients from Table 5. The three years of 1989, 1991, and 1992 have comparable slopes. However, the slope for 1990 is clearly steeper than any of the other three years. Indeed, even the years with similar slopes have different intercept values. These intercept differences are due to the four years having spring wheat yields that are not comparable.

Table 6 continues the analysis for years and presents the intercept term standard error of tile intercept, and the coefficient of the county averages sum for four periods with standard error (std. error). These values vary significantly from year to year and as compared to the combined four-year analysis.

    3Only the original sums of periods for 1989 are in the table, since the remaining sums of periods did not have CP values that were better than those listed for 1989.

Figure Four

Table Six

    Combining the four years of spring wheat yields with tile corresponding AVHRR NDVI county averages as in Figure 5 shows differences in the four years of data. Those Counties having the lowest recorded yields (from 25 - 40 bushels per acreage) during 1992 relate to similar values of AVHRR county NDVI averages as do much lower yields (from 10 to 20 bushels per acreage) for the years of 1989 and 1990. These larger minimum spring wheat yields for 1992 are substantial outliers in the combined regression for the four years.
    One way to explain the lack of agreement among the four years of spring wheat yields with the AVHRR NDVI county averages is that the mask derived from the EDC land cover analysis overestimates spring wheat acreage. This overestimation is particularly true for the earlier years. 1989-1991. Another difficulty in relating the two data sets is that the spring wheat yields within the crop reporting districts of the State (Figure 2) are not consistent for the four years. Figures 6-9 should help in seeing just some of those inconsistences in between spring wheat yields and the SUM.58 NDVI data for the years 1989 and 1992. These graphs depict 3-D perspective views of the county yield estimates and the corresponding county SUM.58 NDVI data.

Figures Five through Nine

There are some definite similarities between the surfaces for the county spring wheat yields:

  1. Figures 6 and 7 for 1989 and figures 8 and 9 for 1992 show a steady increase in values both for yields and for SUM.58 NDVI averages as we go to the east from the center of the State,
  2. the western part of the State clearly has Much lower values than does the eastern part of the State due to tile Much poorer oils in the vest. and
  3. An improved "ridge" of' yields and Sum.58 values separates two lower, yielding regions within this western part of' the State. The perspective plots for both years also exhibit some definite differences as follows:
  4. Figures 6 shows much more extreme spring wheat yield increases in tile eastern part of' tile State than does the SUM.58 NDVI data in figure 7,
  5. The spring wheat yields in Figure 9 show many more ridges than does the relatively smooth surface of the Sum.58 NDVI data in figure 9, and
  6. Tile relationships between the lowest Sum.58 NDVI values and yields for 1999 show greater spring wheat yields in 1999 being associated with even lower Sum.58 NDVI averages than in 1992.

Even with the above mentioned consistencies, differences that occur during the four years of data might increase the difficulty in developing a stable regression relationship between the spring wheat yields and Sum.58 NDVI values.


These analyses of the North Dakota NDVI data set using all individual periods and all sums of flour or more periods (including all ten available periods) have not provided a final answer to developing a model using AVHRR NDVI data to predict spring wheat yields. However, the possibilities of using either individual periods of NDVI data for spring wheat in North Dakota during the growing season or using periods other than those corresponding to periods 20, 22, 24, and 26 for these four years are excluded. This observation implies that the relationships between NDVI values and yield require a relationship integrated over an important eight-week period of crop growth for the spring wheat crop. This integration over time is critical to developing the final relationship. The linear model might be properly relating observed county spring wheat yields with NDVI data for the four period sum, but not with a sufficiently high degree of accuracy to account for the large differences in spring wheat yields among the four years.

Additional analyses of the relationship among spring wheat yields, other NASS yield survey and objective yield data, and weather data with AVHRR data are needed. Although the use of Landsat TM data with a resolution of 30 meters will add additional costs, the improved accuracy of spring wheat acreage is necessary to establish the characteristics of the yield data in relationship to remote sensing observations. Categorized Landsat TM scenes will be available from USDA/NASS classification work starting with the 1997 growing year. This categorized Landsat TM data set will aid in creating an improved mask to place county spring wheat acreages within the NDVI data.


The authors wish to express their gratitude to George Hanuschak,. Associate Research Director. for development of the USDA/NASS capability to work with AVHRR data and for frequent encouragement to improve in our understanding of the relationship between crop yields and AVHRR data. Also, the authors thank Dr. Dan Carr, George Mason University, for encouragement to learn S-plus and examine the data more fully during his Exploratory Data Analysis Course. Our thanks to Rick Mueller. USDA/NASS, for assistance with the graphics manipulation In Framemaker for final printing and to Mike Craig for his editorial recommendations that improved the paper.


Brown. J. F., T. R. Loveland, J. W. Merchant, B. C. Reed, and D. O. Ohlen. 1993. "Using MultiSpectral Data in Global Landcover Characterization: Concepts, Requirements, and Methods." Photogrammetric Engineering and Remote and Remote Sensing. Vol. 59, No. 6, pp. 977-987.

Doraiswamy, P., Hart, G., Craig, M., Cook, P., 1994. "The Anomalous '93 Growing Season -- How USDA Used AVHRR Data," Proceedings of the 60th ASPRS, 1: 144-15 1, Reno, Nevada.

Doraiswamy, P. C. and P. Cook 1995. "Spring Wheat Yield Assessment Using NOAA AVHRR Data." Canadian Journal of Remote Sensing, Vol. 2 1, No. 1, pp. 43-5 1.

Eidenshink J. C. 1992. "The 1990 Conterminous U.S. AVHRR Data Set." Photogrammetric Engineering and Remote Sensing, Vol. 58. pp. 809-813.

Horner, C. H., Ramsey, R.D.,Edwards,T. C., Falconer, A. 1997. "Landscape Cover-Type Modeling Using a Multi-Scene Thematic Mapper Mosaic," Photogrammetric Engineering and Remote Sensing, Vol. 63, pp. 59-67.

Kotz, S. and Johnson, N. 1992. Encyclopedia of Statistical Sciences, Vol. 2, pp 218-219.

North Dakota Agricultural Statistics, USDA/NASS, Ag Statistics No. 64, June, 1995,

Penner. M. "Comparing AVHRR and CANF91 Land Classes," Canadian Journal of' Remote Sensing, Vol. 2 1, No. 1. pp. 10-15.

Reichert, G. C., Nixon, 11. R. , Dobbins, R. N. "Statistics Canada's Near Real-Time Crop Condition Assessment Program Utilizing NOAA AVHRR Data - Remote Sensing. GIS and the Internet," Agricultural Statistics 2000, Washington, D. C., April, 1998.

Statistical Sciences. Inc. 1995. S-PLUS User's Manual, Version 3.3 for Windows, Seattle.

Stern, A., Zora, and Doraiswamy, P. "Spring Wheat Yields from Landsat TM Data in North Dakota," in progress.

Yin, Zhangshi and Williams, T. H. Lee 1997. "Obtaining Spatial and Temporal Vegetation Data From Landsat MSS and AVHRR/NOAA Satellite Images for a Hydrologic Model."Photogrammetric Engineering and Remote Sensing, Vol. 63, pp. 69-77.

Horizontal Rule [Home] [Ongoing Projects] [Research Papers] [Links]
[1]     2     3     4     5     6     7     8     9     10     11     Next >>

Last Modified: 9/30/2005
Footer Content Back to Top of Page