QA-weighted Statistics
Goal
Compute QA-weighted means and standard deviations and propagate them during the aggregation process. Ideally the end result should look something like the sample ncdump below:
group: Aerosol_Variable {
variables:
double Mean(longitude, latitude) ;
double Standard_Deviation(longitude, latitude) ;
double Sum(longitude, latitude) ;
int Pixel_Counts(longitude, latitude) ;
double Sum_Squares(longitude, latitude) ;
double QA_Mean(longitude, latitude) ; # new variable
double QA_Standard_Deviation(longitude, latitude) ; # new variable
double Weighted_Sum(longitude, latitude) ; # new variable, required for aggregation
int Sum_of_Weights(longitude, latitude) ; # new variable, required for aggregation
double Weighted_Sum_Squares(longitude, latitude) ; # new variable, required for aggregation
...
} // group Aerosol_Variable
Changes to Configuration File
- The configuration file will have an additional option to turn on the computation of the QA-weighted quantities (default off)
variable_settings:
- name_in: aerosol_variable
name_out: aerosol_var_out_with_qa
compute_QA: QA_weights
attributes:
...
If compute_QA
is not defined the code runs as usual, if it is defined the name of the variable that contains the weights need to be specified (e.g. QA_weights
). Names are not final.
Changes to the Code:
In yori-grid
the QA-weighted quantities are computed like the normal statistics. Weighted sum, weighted sum of squares and sum of weights have to be saved for the aggregation. An additional function to compute these quantities will be added in gridtools.py
and called when requested by the configuration file.
In yori-aggr
the QA-weighted quantities are aggregated if the weighted sum, weighted sum of squares and sum of weights are found. A function to propagate the QA-weighted quantities from the gridded files will be added to yori_aggregate.py
. A conditional instruction will be added as well, to run the new function when the "QA quantities" are found in the gridded files. Depending on how the code will end up looking at the end of this process I might restructure the yori_aggregate
module to be more readable (e.g. move utility functions in a separate module and leave only the aggregation).
On the Computation of the QA-weighted quantities
Recap of the Math Behind Yori
Let's define the sum of a given variable, that is, the sum of its values v_j
within a grid cell c
, as:
S_c = \sum_j v_j
and similarly the sum of squares:
\textit{SS}_c = \sum_j v_j ^2
where j
is the j
-th pixel and v_j
is the quantity under consideration for that pixel.
With these quantities we can compute the mean M
as:
M_c = \frac{\sum_i S_{c,i}}{\sum_i N_{c,i}}
with c
and i
the c
-th cell and the i
-th file respectively and N_{c,i}
is the number of points. With the same indexing logic, the standard deviation is derived from:
\textit{STD}_c = \left[\frac{\sum_i \textit{SS}_{c,i}}{\sum_i N_{c,i}} - \left(\frac{\sum_i S_{c,i}}{\sum_i N_{c,i}}\right)^2\right]^{1/2}
Computation of the QA-weighted Quantities
For the QA-weighted quantities we can use the same approach done before, with some minor changes to account for the weights. First let's define the general formulas for the weighted mean and the standard deviation:
Weighted Mean:
M_w = \frac{\sum_j w_j v_j}{\sum_j w_j}
Weighted Standard Deviation:
\textit{STD}_w = \left( \frac{\sum_j w_j (v_j - \bar{v})^2}{\sum_i w_j} \right)^{1/2}
where w_j
is the weight of the j
-th pixel and \bar{v}
is mean value of all v_j
.
Now, in order to propagate the weighted mean and standard deviation we need three additional quantities in the same way we did for the unweighted mean and standard deviation. Let's define these three additional quantities as follows:
- sum of weights:
W = \sum_j w_j
- weighted sum:
S_w = \sum_j w_j v_j
- weighted sum of squares:
\textit{SS}_w = \sum_j w_j (v_j - \bar{v})^2
Now we can use W
, S_w
and \textit{SS}_w
to compute the weighted mean M_w
and standard deviation \textit{STD}_w
:
M_w = \frac{S_w}{W}
\textit{STD}_w = \left[\frac{\sum_i \textit{SS}_{w,i}}{\sum_i W_i} - \left(\frac{\sum_i S_{w,i}}{\sum_i W_i}\right)^2\right]^{1/2}
Note that in the previous two equations the subscript c
has been dropped for readability but these are still per-cell quantities.
Test
Here a code snippet is provided to verify the above formulas
import numpy as np
def main():
# create the fake data
scale = np.random.randint(100)
data1 = np.random.random(500)*scale
data2 = np.random.random(300)*scale
# create the weight arrays
w1 = np.random.randint(4, size=len(data1))
w2 = np.random.randint(4, size=len(data2))
# merge the two arrays
alldata = np.append(data1, data2)
allW = np.append(w1, w2)
# compute sum of weights, weighted sum and weighted sum of squares
W1 = np.sum(w1)
W2 = np.sum(w2)
ws1 = np.sum(w1*data1)
ws2 = np.sum(w2*data2)
wss1 = np.sum(w1*data1**2)
wss2 = np.sum(w2*data2**2)
# reference values
reference_mean = np.sum(allW*alldata)/np.sum(allW)
reference_std = np.sqrt(np.sum(allW*(alldata - reference_mean)**2)/np.sum(allW))
# Weighted mean and standard deviation with the formulas defined above
qa_mean = (ws1 + ws2)/(W1 + W2)
qa_std = np.sqrt((wss1+wss2)/(W1+W2) - ((ws1+ws2)/(W1+w2))**2)
# compare results
print('Means are the same: {}'.format(np.isclose(reference_mean, qa_mean)))
print('Stds are the same: {}'.format(np.isclose(reference_std, qa_std)))
if __name__ == '__main__':
main()