1 Introduction

The MRG package has been created to generate gridded output of data from censuses and surveys, respecting different requirements, such as confidentiality rules for protecting the privacy of individuals. The package has been particularly developed for handling Farm Structure Survey data, but can be used for many other purposes.

The package provides two main methods for gridding of data:

In addition, there is support functionality for the main methods above.

The package is highly flexible, in the sense that it includes three different methods for confidentiality and reliability, where parameters can be varied. Additionally, it is possible to add a user defined function for applications where other restrictions are necessary.

Together with the gridding approach, there is also a possibility to apply a
contextual suppression of grid cells. This means that grid cells will not be aggregated with neighbours if their value is small compared with the non-confidential neighbours, they will instead be suppressed in a post-processing step at the end.

2 The Farm Structure Survey data

The Farm Structure Survey data comes from a European wide survey of agricultural holdings, conducted by the national statistical bureaus. A range of variables have been collected from each of the holdings in the survey, including agricultural area, organic agricultural area, number of employees, gender and age of owners, number of animals and their type, etc. A census (attempt to sample data from all holdings) usually takes place every 10 years (2000, 2010, 2020, …), whereas surveys (samples from about 20% of the holdings) take place every 3-4 years between the census years. Some holdings, particularly in survey years, are representative for a larger group, and are attributed a weight to be applied if data are aggregated in any way.

A large number of the agricultural variables would be of interest to stakeholders and researchers. However, there are confidentiality rules in place, which limits the possibilities for sharing the data, they can only be distributed as aggregated variables. To distribute the aggregated values (mean or sum) for a grid cell or administrative unit, the data must respect the following two confidentiality rules: - Frequency rule: Data for a grid cell can not be distributed if it is based on less than 10 (weighted) holdings - Dominance rule: If the two largest holdings (or the largest holding if it has a weight of two) represent more than 85% of the value in a grid cell, the data of the grid cell can not be distributed.

Additionally, it also includes a reliability rule for sample data with weights. This is to assure that the expected uncertainty of the result is below a threshold. The default is that the expected coefficient of variation (CV) should be below 0.35.

The original data set is restricted. However, the package includes a “fake” data set, which has the same structure as a real data set, but manipulated to be the same as any real farms. This makes it possible to test the methodology also for those who don’t have access to the original data.

3 Methods

3.1 multiResGrid

This function attempts to keep the resolution as high as possible, while respecting the confidentiality rules. Starting with a high resolution grid (for example 1*1 km), it will first check groups of four grid cells which are within the lower resolution grid cells. If the confidentiality rule fails for any of the high resolution cells, the function will only keep the aggregated value from the lower resolution grid.

The figure below gives an indication of how the function works. In this simplified example, the numbers reflect the number of holdings in each grid cell, and we are only considering the frequency rule, i.e., that the number of holdings in a grid cell has to be at least 10. In the first grid in the figure below, non of the grid cells have more than 10 holdings, they are all yellow (except for the white ones without holdings at all). In the second grid, the four grid cells in the upperright corner have reached the limit and are green, in addition to two grid cells in the central left side.
However, there are still grid cells with less than 10 holdings in the lower left corner, so also these are merged in the transition to the grid at the right in the figure. The four grid cells in the top right corner are kept though.

Idea of multiResGrid
Idea of multiResGrid

However, we could imagine cases where we would rather like to suppress the grid cell with 1 holding in the central image, to be able to keep the higher resolution grid cell with 11 holdings. The gridding function therefore includes a parameter (suppresslim) that sets a minimum limit for the values inside a confidential grid cell to be merged with neigbouring grid cells. This limit could for example be 0.1, in that case the grid cell with one holding should be suppressed as in the figure below.

multiResGrid with suppression
multiResGrid with suppression

The package includes a synthetic test data set, which includes data of a similar type and distribution as the true FSS data. This is used in the examples of the methodology below.

4 Examples

4.1 Reading data and producing basic grids

library(sf)
library(dplyr)
library(viridis)
library(ggplot2)
library(giscoR)
library(ggforce)
library(patchwork)
library(kableExtra)
library(MRG)

data(ifs_dk) # Census data (weights in EXT_CORE, all equal to 1)
ifs_weight = ifs_dk[ifs_dk$Sample == 1, ] # The sample data, weights in EXT_MODULE, varying
# The sum of the weights (EXT_MODULE) are equal to the population size
sum(ifs_weight$EXT_MODULE) - dim(ifs_dk)[1] 
#> [1] 1.999433e-08

# Create spatial data
# Move coordinates away from grid cell boundaries with locAdj = "LL",
# as holdings are registered with coordinates in the lower left corner
# of the INSPIRE 1*1 km grid.
# Coordinates exactly on the border can cause issues in some functions
# This function is particular for the FSS data set, with a particular format
# for the geo-location. For other data sets, the coordinate shift can be done
# on an sf-object either with the function locAdjFun, or in a call to 
# createMRGobject.
ifg = fssgeo(ifs_dk, locAdj = "LL")
ffg = fssgeo(ifs_weight, locAdj = "LL")

# Read country borders for Denmark, only used for plotting
borders = gisco_get_nuts(nuts_level = 0)
dkb = borders[borders$CNTR_CODE == "DK",] %>% st_transform(crs = 3035)
# Necessary to avoid some warnings for intersection further down
st_agr(dkb) = "constant"

# Set the base resolutions, and create a hierarchical list with gridded data
# No variable name is needed if we're only looking at the number of holdings
# but we include the utilized agricultural area, as it will also be used further below.
ress = c(1,5,10,20,40, 80, 160)*1000

ifl = list()
ifl = gridData(ifg, vars = "UAA",  res = ress)

# List of grids for the organic utilized agricultural area
ifl2 = gridData(ifg, vars = "UAAXK0000_ORG", res = ress)

# List of grids with two variable (weights could be ignored, as they are all equal to one)
ifl3 = gridData(ifg, vars = c("UAA", "UAAXK0000_ORG"), weights = "EXT_CORE", res = ress)

# List of grids for for the sample data (weights necessary)
ffl = gridData(ffg,  vars = c("UAA"), weights = "EXT_MODULE",  res = ress)

Just to understand the distribution of grid cells which should be confidential or not, We can create a table, giving an overview of grid cells with the number of holdings greater or equal to 10, and grid cells with less than 10 holdings. Below the different grids are plotted together.

ifltab = data.frame("N GE 10" = unlist(lapply(ifl, FUN = function(x) sum(x$count >= 10))),
                    "N LT 10" = unlist(lapply(ifl, FUN = function(x) sum(x$count < 10))))
ifltab
#>   N.GE.10 N.LT.10
#> 1       0   23776
#> 2    1794     385
#> 3     582      52
#> 4     178      11
#> 5      57       1
#> 6      22       0
#> 7       8       0


# Create a single data.frame for the hierarchical grids for plotting
ifall = do.call("rbind", ifl[1:6])
ifall$res1 = paste(ifall$res/1000, "km")
ifall$res2 = factor(ifall$res1, levels = unique(ifall$res1))

# Plot the hierarchical grids of Denmark
ggplot() + geom_sf(data = ifall, aes(fill = count), lwd = 0) +
  scale_fill_viridis( name = "number of \n holdings", trans = "log10") +
  scale_color_viridis( name = "number of \n holdings", trans = "log10") +
  geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) +
  #  coord_sf(crs = 3035) +
  theme_bw() +
  theme(panel.grid.major = element_line(colour = 'transparent')) +
  ggtitle("Number of holdings for different resolutions") +
  facet_wrap(vars(res2))
plot of chunk unnamed-chunk-3
plot of chunk unnamed-chunk-3

4.2 Only number of holdings

First we create a multi-resolution grid function only with farm number as confidentiality rule (The dominance rule does not apply if we only analyse the number of holdings)

himg0 = multiResGrid(ifl, checkValidity = FALSE)
#> [1] "ires 2 5000 #himg-cells: 2179 ; removed: 23704 ; added: 2107 ; confidential: 72"
#> [1] "ires 3 10000 #himg-cells: 1642 ; removed: 768 ; added: 231 ; confidential: 41"
#> [1] "ires 4 20000 #himg-cells: 1539 ; removed: 137 ; added: 34 ; confidential: 9"
#> [1] "ires 5 40000 #himg-cells: 1451 ; removed: 96 ; added: 8 ; confidential: 1"
#> [1] "ires 6 80000 #himg-cells: 1366 ; removed: 86 ; added: 1 ; confidential: 0"
#> [1] "ires 7 160000 #himg-cells: 1366 ; removed: 0 ; added: 0 ; confidential: 0"

# Some points that will help visualizing some differences between plots
pta = data.frame(matrix(ncol = 2, byrow = TRUE, 
                        data = c(4245000, 3685000,
                                 4345000, 3815000)))
pta$rad = c(20000)
pta$color = c("red")

xlim = c(4200000, 4400000)
ylim = c(3650000, 3840000)

# Clip the result to the coastline of Denmark and create ggplot
himg00 = st_intersection(dkb, himg0)
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries

g0 = ggplot() + geom_sf(data = himg00, aes(fill = count), lwd = 0) +
  scale_fill_viridis( name = "number of farms", trans = "log10") +
  geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) +
  scale_color_identity(guide = "none") +
  geom_circle(data = as.data.frame(pta), aes(x0 = X1, y0 = X2, r = rad, color = color)) +
  xlab("") + ylab("") +
  ggtitle("Only frequency rule") +
  coord_sf(xlim = xlim, ylim = ylim) +
  theme(text = element_text(size = 10)) +
  theme_bw()


# Create multi-resolution grid for utilized agricultural area (UAA), 
# clip it to the coastline and create ggplot for the number of farms
himg1 = multiResGrid(ifl, vars = "UAA", ifg = ifg)
#> [1] "ires 2 5000 #himg-cells: 2179 ; removed: 23704 ; added: 2107 ; confidential: 72"
#> [1] "ires 3 10000 #himg-cells: 1601 ; removed: 824 ; added: 246 ; confidential: 42"
#> [1] "ires 4 20000 #himg-cells: 1493 ; removed: 145 ; added: 37 ; confidential: 9"
#> [1] "ires 5 40000 #himg-cells: 1405 ; removed: 96 ; added: 8 ; confidential: 1"
#> [1] "ires 6 80000 #himg-cells: 1333 ; removed: 73 ; added: 1 ; confidential: 0"
#> [1] "ires 7 160000 #himg-cells: 1333 ; removed: 0 ; added: 0 ; confidential: 0"
himg01 = st_intersection(dkb, himg1)
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
g1 = ggplot() + geom_sf(data = himg01, aes(fill = count), lwd = 0) +
  scale_fill_viridis( name = "number of farms", trans = "log10") +
#  scale_color_viridis( name = "number of farms", trans = "log10") +
  geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) +
  scale_color_identity(guide = "none") +
  geom_circle(data = as.data.frame(pta), aes(x0 = X1, y0 = X2, r = rad, color = color)) +
  coord_sf(xlim = xlim, ylim = ylim) +
  xlab("") + ylab("") +
  ggtitle("Frequency and dominance rule") +
  theme(text = element_text(size = 15)) +
  theme_bw()

print(g0+g1 + plot_layout(guides = "collect"))
plot of chunk unnamed-chunk-4
plot of chunk unnamed-chunk-4

In a second step, we can run the same procedure, but for the utilized agricultural area (UAA). Then the function will also check that the dominance rule is respected for all pixels. Two plots are necessary to show the results of this procedure. The first one shows the number of farms, as above, but also respecting the dominance rule. The second plot shows the gridded UAA from the same holdings. It should be noted that there is only a small effect from applying the dominance rule in this example.



pta = data.frame(matrix(ncol = 2, byrow = TRUE, 
                        data = c(4245000, 3685000,
                                 4345000, 3815000)))
pta$rad = c(20000)
pta$color = c("red")

xlim = c(4200000, 4400000)
ylim = c(3650000, 3840000)
# Clip the result to the coastline of Denmark and create ggplot
himg00 = st_intersection(dkb, himg0)
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
g0 = ggplot() + geom_sf(data = himg00, aes(fill = count), lwd = 0) +
  scale_fill_viridis( name = "number of farms", trans = "log10") +
#  scale_color_viridis( name = "number of farms", trans = "log10") +
  geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) +
  #  coord_sf(crs = 3035) +
  scale_color_identity(guide = "none") +
  geom_circle(data = as.data.frame(pta), aes(x0 = X1, y0 = X2, r = rad, color = color)) +
  xlab("") + ylab("") +
  ggtitle("Only frequency rule") +
  coord_sf(xlim = xlim, ylim = ylim) +
  theme(text = element_text(size = 10)) +
  theme_bw()



# Create multi-resolution grid for utilized agricultural area (UAA), 
# clip it to the coastline and create ggplot for the number of farms
himg1 = multiResGrid(ifl, vars = "UAA", ifg = ifg)
#> [1] "ires 2 5000 #himg-cells: 2179 ; removed: 23704 ; added: 2107 ; confidential: 72"
#> [1] "ires 3 10000 #himg-cells: 1601 ; removed: 824 ; added: 246 ; confidential: 42"
#> [1] "ires 4 20000 #himg-cells: 1493 ; removed: 145 ; added: 37 ; confidential: 9"
#> [1] "ires 5 40000 #himg-cells: 1405 ; removed: 96 ; added: 8 ; confidential: 1"
#> [1] "ires 6 80000 #himg-cells: 1333 ; removed: 73 ; added: 1 ; confidential: 0"
#> [1] "ires 7 160000 #himg-cells: 1333 ; removed: 0 ; added: 0 ; confidential: 0"
himg01 = st_intersection(dkb, himg1)
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
g1 = ggplot() + geom_sf(data = himg01, aes(fill = count), lwd = 0) +
  scale_fill_viridis( name = "number of farms", trans = "log10") +
#  scale_color_viridis( name = "number of farms", trans = "log10") +
  geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) +
  scale_color_identity(guide = "none") +
  geom_circle(data = as.data.frame(pta), aes(x0 = X1, y0 = X2, r = rad, color = color)) +
  coord_sf(xlim = xlim, ylim = ylim) +
  xlab("") + ylab("") +
  ggtitle("Frequency and dominance rule") +
  theme(text = element_text(size = 15)) +
  theme_bw()

print(g0+g1 + plot_layout(guides = "collect"))
plot of chunk unnamed-chunk-5
plot of chunk unnamed-chunk-5

Using the results from above, we can also plot the UAA for each grid cell. First, one can notice that the UAA is somewhat dependent on the grid cell size, which is typical for variables that are summed. An alternative would be to present the UAA as UAA/km\(^2\) for each grid cell. Second, one can observe that the final grid cells are the same size as above because this is simply another variable produced from the same underlying input data.

g11 = ggplot() + geom_sf(data = himg01, aes(fill = UAA, color = UAA)) +
  scale_fill_viridis( name = "UAA (ha)", trans = "log10") +
  scale_color_viridis( name = "UAA (ha)", trans = "log10") +
  geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) +
  xlab("") + ylab("") +
  ggtitle("") +
  theme_bw()

g11 
plot of chunk unnamed-chunk-6
plot of chunk unnamed-chunk-6

We can run the same procedure for the organic UAA. Here the grid cells are generally much larger. This is because there are considerably fewer holdings with organic farming in this synthetic data set. Only one grid cell can be disseminated at 5 km, whereas the majority are 10 km (91) or 20 km (67). Then there are 18, 2 and 1 grid cells of 40 km, 80 km and 160 km, respectively. In total there are 180 grid cells in this map with organic farms.

himg2 = multiResGrid(ifl2, vars = "UAAXK0000_ORG", ifg = ifg)
#> [1] "ires 2 5000 #himg-cells: 5513 ; removed: 19922 ; added: 1659 ; confidential: 7"
#> [1] "ires 3 10000 #himg-cells: 894 ; removed: 5180 ; added: 561 ; confidential: 23"
#> [1] "ires 4 20000 #himg-cells: 229 ; removed: 832 ; added: 167 ; confidential: 13"
#> [1] "ires 5 40000 #himg-cells: 148 ; removed: 110 ; added: 29 ; confidential: 4"
#> [1] "ires 6 80000 #himg-cells: 125 ; removed: 29 ; added: 6 ; confidential: 3"
#> [1] "ires 7 160000 #himg-cells: 114 ; removed: 13 ; added: 2 ; confidential: 2"
himg02 = st_intersection(dkb, himg2)
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
g12 = ggplot() + geom_sf(data = himg02, aes(fill = UAAXK0000_ORG, color = UAAXK0000_ORG)) +
  scale_fill_viridis( name = "UAA Organic (ha)", trans = "log10") +
  scale_color_viridis( name = "UAA Organic (ha)", trans = "log10") +
  geom_sf(data = dkb, fill = NA, colour='black', lwd = 1) +
  ggtitle("") +
  theme_bw()

g12 
#> Warning in scale_fill_gradientn(colours = viridisLite::viridis(256, alpha, : log-10 transformation introduced infinite values.
#> Warning in scale_color_gradientn(colours = viridisLite::viridis(256, alpha, : log-10 transformation introduced infinite values.