After setting `flo_user` so we can access the `flo` account:

```
flo_user="-d postgresql://flo3@ratchet.sips/flo3"
```

we look at how many jobs succeeded or failed:

```
satellite='snpp'; psql $flo_user -c "select count(*) from stored_products where computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''';"

satellite='snpp'; psql $flo_user -c "select count(*) from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''';"
```

```
 count
--------
 109572
(1 row)

 count
-------
 20996
(1 row)
```
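
As a quick sanity check on the two counts above (values copied from the query output; plain `awk` assumed available in the query shell), the ratio works out to roughly 19%:

```shell
# Counts taken from the stored_products / failed_jobs queries above.
succeeded=109572
failed=20996

# failed-to-succeeded ratio, as a percentage rounded to one decimal place
awk -v s="$succeeded" -v f="$failed" 'BEGIN { printf "%.1f%%\n", 100 * f / s }'
```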

So `20996/109572 ≈ 19.2%` failure rate. We can use the `failed_jobs` table to group them by `exit_code`, which can sometimes be useful:

```
satellite='snpp'; psql $flo_user -c "select exit_code,count(*) from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' group by exit_code;"
```

```
 exit_code | count
-----------+-------
       -11 |    19
        99 |    29
      6001 |     7
      6002 |   456
      6003 |     5
           | 20480
(6 rows)
```

We already know that the `99` code is due to a collocation bug. Next let's check out the `-11` exit codes:

```
satellite='snpp'; psql $flo_user -c "select job, pydt(context->'granule') from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' and exit_code=-11;"
```

```
    job    |        pydt
-----------+---------------------
 138444177 | 2018-01-27 20:18:00
 138444176 | 2018-01-27 20:12:00
 138444175 | 2018-01-27 20:06:00
 138473534 | 2018-04-25 19:30:00
 138473535 | 2018-04-25 19:36:00
 138473533 | 2018-04-25 19:24:00
 138481767 | 2018-05-25 05:06:00
 138481766 | 2018-05-25 05:00:00
 138481765 | 2018-05-25 04:54:00
 138522996 | 2018-11-19 01:06:00
 138522997 | 2018-11-19 01:12:00
 138522995 | 2018-11-19 01:00:00
 138522994 | 2018-11-19 00:54:00
 138532886 | 2018-12-18 18:54:00
 138532885 | 2018-12-18 18:48:00
 138532884 | 2018-12-18 18:42:00
 138544240 | 2019-01-17 09:12:00
 138544239 | 2019-01-17 09:06:00
 138544238 | 2019-01-17 09:00:00
(19 rows)
```

Taking a look at one of these failures (in the `flo` account):

```
[flo@vultur ~]$ jobout 138522996 | xargs tail -n 2
```

```
CalledProcessError: Command '/mnt/software/support/viirsmend/1.2.12/bin/python /mnt/software/support/viirsmend/1.2.12/bin/viirsl1mend VNP02MOD.A2018323.0100.001.2018323062016.uwssec.bowtie_restored.nc /dev/shm/dir_1514/tmppO3glQ/8/VNP03MOD.A2018323.0100.001.2018323062028.uwssec.nc' returned non-zero exit status -11
INFO 2019-08-22 01:52:27,568 driver -- job exiting with failure
```

Those 19 failures (`exit_code` `-11`) appear to be `viirsmend` segfaults. That will happen occasionally, so we're not going to worry about those.
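
As an aside, a negative exit status from a Python subprocess (as in the `CalledProcessError` above) means the child was killed by that signal number, so `-11` is signal 11. A quick lookup in plain Python (nothing from the flo stack) confirms which signal that is:

```python
import signal

# Subprocess return code -11 means the child died on signal 11;
# look the number up in the standard signal table.
print(signal.Signals(11).name)  # SIGSEGV, i.e. a segmentation fault
```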

That leaves us with the big elephant in the room: the null `exit_code`s. We can group those failures by day, and interestingly it looks like there are whole days that are failing.

```
satellite='snpp'; psql $flo_user -c "select date_trunc('days',pydt(context->'granule')) as d,count(*) from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' and exit_code is null group by d order by d;"

satellite='snpp'; psql $flo_user -c "select date_trunc('days',pydt(context->'granule')) as d,count(*) from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' and timestamp > '2019-08-22' and exit_code is null group by d order by d;"
```

So I list out a few of the job numbers for the 13th (2018-03-13) so that we can look at the output:

```
satellite='snpp'; psql $flo_user -c "select job from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' and exit_code is null and context->'granule' like '%2018, 3, 13,%' limit 5;"
```

```
    job
-----------
 138461881
 138461912
 138461896
 138461883
 138461932
(5 rows)
```

```
[flo@vultur ~]$ jobout 138461896 | xargs tail -n 20
```

```
Bad VIIRS radiance data found (0.1f% according to band 0 quality flags) - exiting
Bad VIIRS radiance data found (0.1f% according to band 15 quality flags) - exiting
ERROR 2019-08-21 03:31:21,105 __init__ -- There are no Matlab files "fusion_output.mat" to convert, aborting
INFO 2019-08-21 03:31:21,131 __init__ -- run_fusion_matlab() generated None
ERROR 2019-08-21 03:31:21,189 runner -- Failure on computation flo.sw.fusion_matlab:FUSION_MATLAB context {'satellite': 'snpp', 'version': '1.0.0dev3', 'granule': datetime.datetime(2018, 3, 13, 1, 48)}
```

So it is saying there is bad VIIRS data. Looking at a more recent granule with a null `exit_code`:

```
satellite='snpp'; psql $flo_user -c "select job from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' and exit_code is null and context->'granule' like '%2019, 6, 30,%' limit 5;"
```

```
    job
-----------
 138589438
 138589439
 138589440
 138589441
 138589442
(5 rows)
```

```
[flo@vultur ~]$ jobout 138589440 | xargs tail -n 20
tail: cannot open ‘/scratch/flo/jobs/d138/d589/138589440-stdout’ for reading: No such file or directory
```

...which indicates that the job never even started the delivered package. To see why, we take a look at the job's log file:

```
[flo@vultur ~]$ jobout 138589440 | sed 's/-stdout/-log/' | xargs tail -n 20
```

```
000 (825961.440.000) 08/22 13:17:54 Job submitted from host: <128.104.109.26:9618?PrivAddr=%3c10.1.4.30:9618%3fsock%3d1882238_49e3_11%3e&PrivNet=FLOOD&addrs=128.104.109.26-9618+[2607-f388-1090-0-266e-96ff-fe7b-4238]-9618&noUDP&sock=1882238_49e3_11>
...
009 (825961.440.000) 08/22 16:17:00 Job was aborted by the user.
    via condor_rm (by user flo)
```