We first identify the location of organic crop fields in Kern County and then estimate whether status as organic versus conventional fields determines pesticide use (Fig. 5).

Fig. 5: Methodology overview.

Figure outlines the main method steps from identifying organic fields to creating the analysis data to performing the statistical analyses. All images shown are simplified, visual representations of the datasets. CDFA refers to the California Department of Food and Agriculture, while APN is the Assessor’s Parcel Number and TRS is the Township-Range-Section. Identifying organic fields combines the created CDFA organic APN, CDFA organic TRS, and organic pesticides data layers together to create the final organic versus conventional fields layer used in the analysis data section. All analysis data layers are then inputted into the various statistical analyses.

Identifying organic fields

We identified organic fields using a combination of California Department of Food and Agriculture (CDFA) records and Kern County Agricultural Commissioner’s Office spatial data (“fields shapefiles”) and pesticide use records. No single source was complete, and as such, we evaluated several different approaches to identifying organic fields.

California Department of Food and Agriculture (CDFA) records

Data on the location of organic fields, per the California State Organic Program, for 2013–2019 was obtained by request from the California Department of Food and Agriculture (CDFA). The CDFA, through the State Organic Program, requires annual registration of certified organic producers who have an expected gross sale of over $5000. We were specifically interested in the pesticide aspects of organic production, which is governed in our study region by the USDA’s National List of Allowed and Prohibited Substances. The National List of Allowed and Prohibited Substances delineates which synthetic substances can be used and which natural substances cannot be used for pest control in US organic production. Besides substances specifically (dis)allowed on the National List, allowed substances include non-synthetic biological, botanical, and mineral inputs. Field location data were in the form of either Assessor’s Parcel Number (APN) or PLS System Township-Range-Section (TRS) values, though data were reported without systematic formatting. We harmonized the CDFA APN values to merge with the Kern County Assessor’s parcel shapefile (2017), which we then spatially joined with the Kern fields shapefiles. We followed a similar process with PLSS TRS values, which were then merged with the Kern County PLS Section shapefile, and joined to Kern field shapefiles. We refer to our final organic designation as “CDFA Organic”. Details of the data cleaning process are described in the Ancillary Data Processing Methods section below.

Using pesticide use reports to refine organic field identification

After spot-checking pesticide use on CDFA Organic fields, it became clear we had not entirely eliminated conventional fields. This was likely due to variation in polygon geometries between PLSS Sections, Kern County Assessor parcels, and Kern agricultural fields data. To further refine our classification, we used field-level pesticide use, again from the Kern County Agricultural Commissioner’s Office. As thousands of pesticide products (active ingredients + adjuvants) are in use in Kern County, we took an iterative approach to eliminate fields using conventional pesticides. We first limited the universe of pesticides to those applied to fields that were CDFA Organic. We then identified the 50 most commonly used pesticide products by a number of applications, and manually classified each as organic or conventional. Having identified these products as described below, we matched them back in, eliminating fields that used conventional products and identifying as “PUR Organic” those that used only organic products. We repeated this process, hand identifying the most commonly used products and eliminating fields using conventional products until we had isolated fields using only organic products.

To classify a product as organic or conventional, we first searched for each product’s U.S. EPA-registered product label, using the exact product name and EPA product registration number. If there was any indication on the label that the product was certified as organic by the Organic Materials Review Institute (OMRI), or said “for use in organic production” or “organic”, then the pesticide was identified as organic (n = 132). If there was no organic indication on the product label, we searched the OMRI certification database for products with identical names and manufacturers, and identified products as organic if such certifications existed (n = 39). If all ingredients were defined (i.e., no inert or undefined ingredients) and were known organic active ingredients, then the pesticide was identified as organic (n = 1) (Supplementary Data 1). We failed to find EPA-registered labels for three products and confirmed on the California Department of Pesticide Regulation website that they are either inactive or out of production (EPA registration numbers: 52467-50008-AA-5905, 36208-50020-AA, 2935-48-AA-120). Each of the three was rarely used (n < 4) on CDFA organic fields across the 2013 to 2019 data. To be conservative, we defined them as conventional. If a field used no pesticides and was CDFA organic, we labeled it as organic. If it used no pesticides and was not CDFA organic, we maintained it as conventional.

Kern County agricultural commissioner’s office spatial data

For 2017–2019, the commodity attribute field in the Kern County agricultural fields shapefiles indicated self-reported organic fields with “-ORGANIC” or “-ORG” following the commodity name. Self-reported organic crop fields may be less accurate than those identified using CDFA data as self-reporting is not validated nor required. After researching multiple large crop producers in Kern County, we suspect that many crops being grown organically with CDFA certifications are not being reported as such to the county. Still, we run a robustness test including these fields in addition to the PUR Organic fields for 2017 to 2019 to capture smaller farms that may be exempt from certifications. Doing so did not qualitatively change our results (Supplementary Figs. 2 and 3).

Analysis data

Beyond identifying organic fields, the goal of this project was to understand the impact of organic versus conventional production on different metrics of pesticide use. As such, we calculate various pesticide metrics, and farm, field, and crop characteristics.

Kern County fields shapefiles

Annual Kern County fields shapefiles are publicly available and include data on growing permit ID (“farm”), site ID, field size, crop type, date active and inactive, among other information. A crop field is defined as a permit-site-year combination. Permit numbers track individual growers through time, while site IDs are a record of each permitted site maintained by a grower each year. Site IDs occasionally persist from one year to the next (particularly for perennial crops), and they are not duplicated within the same year such that rotated crops are given unique site IDs. In some cases, multiple crops are grown on the same field simultaneously. In such cases, each crop has a unique site ID. However, when multiple crops are being grown on the same field simultaneously, the total area occupied by each crop is not specified, but rather is recorded as the total field size. To reduce bias in pesticide use rates, the field size was divided by the number of crops being grown on a field, under the assumption that all crops on multi-crop fields occupy an equal area. From crop type, we determined the crop’s taxonomic family to be used in later analyses.

Pesticide use reports

Field-by-day-by-product level pesticide use data are available to the public on the Kern County Department of Agriculture and Measurement Standards website. Pounds of pesticide products were converted to kg of active ingredients (AI) and identified by the type of product (e.g., insecticides) and “potential hazard” to non-target organisms (e.g., bees) using the California Department of Pesticide Regulation (DPR) Product Database (see Ancillary Data Processing Methods below). We define insecticides as insecticides, insect growth regulators, miticides, and repellents, excluding products that have dual action (insecticide and fungicide), but not excluding insecticide products with adjuvant or fertilizer additives. The Product Database includes binary indicators for whether a given product is a potential hazard to different environmental outcomes. Since pesticide adjuvants (“inert” ingredients) are sometimes used in the absence of active ingredients and can nevertheless have environmental or wildlife impacts per the label, we used kg of the product rather than a kg of active ingredients as measures of use for products of potential hazard to environment and non-target organisms. Using active ingredients instead did not qualitatively change our results (Supplementary Fig. 4). Potential hazards to non-target organisms are based on the Environmental Hazards statement on the pesticide label, which is regulated by the EPA, and based on toxicity to birds, fish, invertebrates, bees, and mammals40. We chose to use binary metrics rather than an aggregate index because thousands of products are used in Kern County, and toxicity metrics for the many different chemicals and numerous environmental endpoints of interest are not readily available. Thus, we proceed with measures of chemical product use (kg ha−1) of potential hazard to a series of ecological and environmental endpoints (fish, aquatic species, drift, etc.). We also include products of high acute toxicity (EPA signal words 1 and 2) and lower acute toxicity (EPA signal words 3 and 4 or not required) to people. For completeness, we also aggregated ecotoxicity data for 29 products that represent about 50% of each organic and conventional pesticide use (Supplementary Table 10) from the Pesticide Products Database61, primarily.

Using the fields shapefiles, pesticide use rates were summarized for each permit-site-year grouping for various pesticide outcomes, including kg ha−1 of a pesticide product, active ingredients, and products of potential hazard to fish, bees, and/or aquatic species, products prone to drift, those categorized as insecticides, and high and low acute toxicity products, as described above. Not all environmental outcomes were equally likely, and in particular several pesticide use outcomes of potential concern lacked a sufficient number of observations on organic fields for robust comparison to their conventional counterparts (Supplementary Table 11). We statistically analyze the subset of environmental outcomes that were reasonably common in both organic and conventional fields, and thus more likely to produce reliable estimates.

Soil spatial data

To address the potential that organic and conventional fields have systematically different soil quality, we used the California Revised Storie Index. The Storie Index land classification system is widely used across California to assess soil quality and agricultural productivity39 and is provided in Soil Survey Geographic Database (SSURGO) tabular data for most of the state. Ratings are systematically determined by a model in the Natural Resources Conservation Service (NRCS) National Soil Information System (NASIS) software, based on tabular data in the SSURGO database. Soil characteristics used in the model include soil profile, surface texture (e.g., loamy to clay-rich, excluding organic horizons), topographical features, growing season length, and dynamic properties (e.g., drainage, alkalinity, acidity, erosion)39. Fertility and other readily modified characteristics are excluded. The system uses six rating levels—” Grade One” being the highest quality soil suitable for most crops and “Grade Six” being unproductive land. We include the Storie Index as a measure of soil quality in all analyses. Further, we evaluate whether organic and conventional fields differ systematically in soil quality, using both the overall Storie Index as well as widely measured components of dynamic properties, which theoretically could be influenced by on-farm management (Supplementary Table 1).

Storie Index values were converted from a shapefile to a 60-m raster using R’s fasterize package. The soil raster was used to extract Storie Index values for each Kern field polygon, using an area-weighted mean function. 319 fields had no Storie Index ratings. To assess the importance of these missing fields, we interpolated the Storie Index values using an inverse distance weighting function in R’s gstat package62,63. The accuracy of interpolated values was checked by applying a Leave-One-Out Cross-Validation function to 500 randomly selected points. Including these fields did not qualitatively change our results.

Statistical analysis

Our statistical analysis proceeded in two steps. First, we evaluated whether conventional versus organic fields differed in pesticide use, modeled as a continuous variable, using pooled ordinary least squares and panel data models to determine the influence of different model specification decisions (see Ancillary Statistical Methods below, Supplementary Notes, Supplementary Tables 2 and 3). However, pesticide use can be conceived as a two-part decision. First, there is the decision to use pesticides at all, and second is the decision of how much to spray when using pesticides. Tobit models are traditionally used to estimate models with censoring. However, Tobit models force the mechanisms determining whether to spray (i.e., moving from pesticide = 0 to pesticides > 0) to be the same as the mechanisms determining the amount sprayed when some pesticides are used (pesticides when pesticides > 0). Double-hurdle models64 are an alternative to the Tobit model that allows for the separation of these two decisions.

The mechanisms underlying the two decisions (to spray, how much to spray if spraying) can differ such that different covariates can describe each process, and the same covariates are allowed to influence the two processes in different ways (i.e., sign and magnitude can differ). The first, binary decision is usually modeled with a probit model.

$${{{{{rm{P}}}}}}left(y=0|{{{{{bf{x}}}}}}right)=1-Phi left({{{{{bf{x}}}}}}gammaright)$$


Then, the second decision is modeled as a linear model with pesticide use following a lognormal distribution, conditional on positive pesticide use64

$$log (y)|{{{{{bf{x}}}}}},y , > , 0 sim {{{{{rm{Normal}}}}}}({{{{{bf{x}}}}}}{{{{{mathbf{upbeta }}}}}},{sigma }^{2})$$


where Φ is the standard normal cdf, x is a vector of explanatory variables including organic status, y is pesticide use, and ({{{{{mathbf{upbeta }}}}}}) is a vector of coefficients. We use a lognormal hurdle model rather than a truncated normal hurdle model since pesticide use is highly non-normal, and Q-Q plots suggested substantial model improvement using a lognormal rather than normal distribution. In contrast to the panel data models described in the Ancillary Statistical Methods below, our estimation equation used natural log-transformed variables for pesticides (and field and farm size) rather than inverse hyperbolic sine (IHS) transformation since only positive observations are included in the second hurdle model. Following insights derived from our panel data models (Supplementary Notes), we build on the basic hurdle model concept using the farm-by-crop family interaction as a random intercept in both the first and second hurdle. We chose the farm-by-crop family interaction rather than a crossed random effect due to computational feasibility with thousands of permits and hundreds of crops, due to similarity of results to the within estimator model (i.e., fixed effects in causal inference terminology; Supplementary Notes, Supplementary Table 2), and due to AIC/BIC (Supplementary Table 3). Further, we find evidence of heteroskedasticity from both visual inspection and Levine’s test, which adds additional complications to computing crossed random effects. Thus, we proceed with the farm-by-crop family interaction in a random intercept model with cluster robust standard errors clustered at the same grouping. In doing so, observations, where the taxonomic family of the crop was unclear, were dropped. Of the 7367 fields that were dropped due to missing crop families, 6684 were uncultivated agriculture.

Our data are effectively repeated cross-sections rather than a true panel since fields are defined by the farm-site-year combination and thus generally change year-to-year or when crops rotate. We model it as such. This implies we do not require observations to have no spray in all time periods, as would be the case in a double hurdle panel model. Linking field IDs over time through spatial processing is complicated by crop rotations of different size areas. Since farmers may farm multiple fields under different management systems, as we illustrate here, and have different contractual obligations at a sub-farm level, requiring farms to never use pesticides on all fields is not reflective of on-the-ground decisions.

We repeated the lognormal hurdle models individually for carrots, grapes, oranges, potatoes, and onions, which were the only widely-grown crops with more than 100 organic fields. This allowed for a different slope and intercept by crop type.

We conduct several robustness checks. First, we do not have data on crop yields. However, to assess the potential implications of a yield gap on our results, we modify our per hectare rates following Ponisio et al.15 as a robustness check. We group commodities into cereals, roots and tubers, oilseeds, legumes/pulses, fruits, and vegetables and assign them the Ponisio et al.15 yield gap estimates for that group. Crops that did not fall into any of the above groups (e.g., cannabis) were provided the all-crop average from Ponisio et al.15. Second, we analyze how conventional and organic differ with respect to soil quality using a within estimator approach to account for crop-specific differences in soil quality. Third, binary toxicity metrics, while valuable given the number of chemicals and endpoints of interest here, nevertheless fail to distinguish gradations of toxicity for chemicals above (or below) the regulatory threshold. As mentioned above, the data needed to calculate many aggregate indices (e.g., Pesticide Load57 and Environmental Impact Quotient58) are not readily available for all of the chemicals in our study. For completeness, we attempted to calculate the Pesticide Toxicity Index for one well-studied endpoint, fish. We supplemented data provided in Nowell et al.41 with data from Standartox42. However, only about 70% of the chemicals used in our study matched, and pesticide products used on organic fields were more likely to lack toxicity information for one or more chemicals. We briefly discuss the highly preliminary investigation, given the non-random missing toxicity data.

All spatial analyses were performed in R Statistical Software v 3.5.3 and all statistical analyses were performed Stata 16 MP. For all tests, statistical significance was based on two-tailed tests with (alpha =0.05.)

Ancillary data processing methods

Cleaning parcel data

To spatially locate organic fields, we needed to match the Assessor’s parcel numbers (APNs) provided in the CDFA tabular data to APNs in the Kern County Parcel shapefile (from 2017). Over 90% of the APN entries in the CDFA data were in the format [xxx-xxx-xx], though multiple APNs were often provided in the same cell separated by line breaks, semi-colons, commas, and/or spaces. We made initial edits separating values into individual cells in Microsoft Excel since formatting was highly inconsistent. Observations whose APNs were not in the [xxx-xxx-xx] were modified so that their format matched. In the R environment, dashes were inserted after the third, sixth, and eighth characters (1234567895 became 123-456-78-95) for APNs that did not already contain them. Occasionally, APN numbers were provided with dashes, but with segments of incorrect length (e.g., 12-34-567). In these instances, APN segments were either trimmed from the right or padded with a zero on the left so they matched the [xxx-xxx-xx] format. This approach yielded the greatest number of matches and was checked for accuracy as described below. Additional segments (from APNs with more than two dashes and eight numeric characters) were dropped. A handful of APNs with fewer than eight numeric characters and no dashes were dropped entirely.

The edited CDFA APNs were then joined with the Kern County Assessor’s parcel shapefile, creating the “CDFA organic shapefile”. In total, 1637 of 1829 individual CDFA records joined successfully. To evaluate the accuracy of joins between CDFA tabular data, Kern County parcel, and Kern County agricultural spatial data, we spot-checked ownership information using “Company” (CDFA) and “PERMITTEE” (Kern County agricultural data) values.

To then identify the crop fields within the organic parcels, we performed a spatial join between the CDFA organic shapefile and the Kern County fields shapefiles. Prior to performing the join, the CDFA parcels’ dimensions were reduced with a 50-m buffer to eliminate spatial joins between CDFA parcels and crop fields that were only touching the parcel margins. Of five different buffer widths evaluated, 50 m reduced the number of false positives and negatives, as determined by comparing the “Company” and “PERMITTEE” values. We refer to the fields that match as “APN Organic”.

Cleaning PLSS Township-Range-Section values

Each year several producers reported Township, Section, and Range (TRS) values, consistent with the PLS System (PLSS), rather than APN values. We used these TRS values to identify PLSS Sections that contained organic fields.

We separated any cell containing multiple TRS values and removed any prefixes such as “S”, “Section”, “Sec.”, “T”, and “R” that would prevent joining to Kern County PLSS spatial data in Excel. In the R environment, we padded the left side of the “S” value with a 0 if it was a single digit, then concatenated the three columns into a “TRS” column. We joined TRS from the CDFA tabular data to PLSS spatial data, which identified 563 Sections as containing organic fields, from 2013 to 2019, out of a total of 664 unique TRS codes in the CDFA dataset. We then performed a spatial join between PLSS Sections that contain organic fields and Kern County fields shapefiles, to identify all agriculture fields that overlap with those Sections. Additional processing using the Pesticide Use Reports is described above.

Ancillary statistical methods

We began with a pooled ordinary least squares (OLS) model that, as the name suggests, pools observations over farms, years, and crop types. However, there may be attributes of crops or farms that may be systematically different between organic and conventional, and this systematic difference could bias our pooled OLS results. To address this, we first considered propensity score approaches but were unable to find a sufficient balance of our covariate distribution between organic and conventional fields. As an alternative, we limited our sample to fields with overlapping farmers and crop types. In other words, we focused on the subset of fields that are grown by farmers producing both organic and conventional fields and to crops that are produced both conventionally and organically. However, this shrunk our dataset by two-thirds.

To leverage more of our data, we next considered panel data models as a means to address unobserved variables. We consider both within-estimator models (also known as “fixed effects” in causal inference terminology, but different from the biostatistical use of the term) and random effects models (with random intercepts), seeking to capture characteristics of the crop, grower, and year. The advantage of a within-estimator approach is that the omitted variables are removed (through differencing) and thus, they can be correlated with covariates without biasing the estimation. In other words, pesticide use and all covariates are differenced from their crop-specific mean (or crop family, farmer, etc. specific mean, depending on the model). In doing so, the propensity for certain crops (crop family, farmer) to be grown organic or to be fast or slow adopters of new technologies is removed. The disadvantage is that characteristics shared by all fields of a crop (e.g., value) are lost in the differencing, and more importantly, that the differencing is not easily translated to nonlinear models that we employ later in the analysis. Random effects are more easily translated to nonlinear models. The disadvantage of random effects is the strong assumption that the unobserved variables are uncorrelated with the covariates18,65, which is required for random effects coefficient estimates to be unbiased. Here, we see the difference in coefficient estimates between the within-estimator and random effects models are quite small (Supplementary Table 2).

Random effects particularly crossed random effects with thousands of permits and hundreds of crops, introduce computational challenges due to large, sparse matrices. Further, we find evidence of heteroskedasticity from both visual inspection and Levine’s test, which adds additional complications to computing crossed random effects. We proceed using the farm-by-crop family interaction in a random intercept model with cluster robust standard errors clustered at the same grouping based on AIC/BIC (Supplementary Table 3), computational feasibility, and similarity to the within-estimator results (Supplementary Table 2). Observations, where the taxonomic family of the crop was unclear, were dropped in any models including family in either the random effects or the cluster robust standard errors. Of the 7367 fields that were dropped due to missing crop families, 6684 were uncultivated agriculture.

In the panel data models, we used IHS transformations to accommodate highly non-normal pesticide (and field and farm size) data. IHS is very similar to natural log transformation66 but is defined at zero, which is important given a sizable fraction of our observations have zero pesticide use. As with log–log transformations, IHS–IHS transformation can be interpreted as elasticities. We pre-multiply pesticide use by 100 to improve estimation66, though this does not affect interpretation. As described above, we leverage insights on model specification from the panel data models, but rely on the double hurdle models to parse apart the decision to spray from the decision of how much to spray.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.