After setting `flo_user` so we can access the `flo` account:

```
flo_user="-d postgresql://flo3@ratchet.sips/flo3"
```

we look at how many jobs succeeded or failed:

```
satellite='snpp'; psql $flo_user -c "select count(*) from stored_products where computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''';"

satellite='snpp'; psql $flo_user -c "select count(*) from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''';"
```

```
 count
--------
 109572
(1 row)

 count
-------
 20996
(1 row)
```
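
As a quick sanity check on the two counts above (values copied from the query output; plain `awk` assumed available in the query shell), the ratio works out to roughly 19%:

```shell
# Counts taken from the stored_products / failed_jobs queries above.
succeeded=109572
failed=20996

# failed-to-succeeded ratio, as a percentage rounded to one decimal place
awk -v s="$succeeded" -v f="$failed" 'BEGIN { printf "%.1f%%\n", 100 * f / s }'
```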

So `20996/109572 ≈ 19.2%` failure rate. We can use the `failed_jobs` table to group them by `exit_code`, which can sometimes be useful:

```
satellite='snpp'; psql $flo_user -c "select exit_code,count(*) from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' group by exit_code;"
```

```
 exit_code | count
-----------+-------
       -11 |    19
        99 |    29
      6001 |     7
      6002 |   456
      6003 |     5
           | 20480
(6 rows)
```

We already know that the `99` code is due to a collocation bug. Next let's check out the `-11` exit codes:

```
satellite='snpp'; psql $flo_user -c "select job, pydt(context->'granule') from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' and exit_code=-11;"
```

```
    job    |        pydt
-----------+---------------------
 138444177 | 2018-01-27 20:18:00
 138444176 | 2018-01-27 20:12:00
 138444175 | 2018-01-27 20:06:00
 138473534 | 2018-04-25 19:30:00
 138473535 | 2018-04-25 19:36:00
 138473533 | 2018-04-25 19:24:00
 138481767 | 2018-05-25 05:06:00
 138481766 | 2018-05-25 05:00:00
 138481765 | 2018-05-25 04:54:00
 138522996 | 2018-11-19 01:06:00
 138522997 | 2018-11-19 01:12:00
 138522995 | 2018-11-19 01:00:00
 138522994 | 2018-11-19 00:54:00
 138532886 | 2018-12-18 18:54:00
 138532885 | 2018-12-18 18:48:00
 138532884 | 2018-12-18 18:42:00
 138544240 | 2019-01-17 09:12:00
 138544239 | 2019-01-17 09:06:00
 138544238 | 2019-01-17 09:00:00
(19 rows)
```

Taking a look at one of these failures (in the `flo` account):

```
[flo@vultur ~]$ jobout 138522996 | xargs tail -n 2
```

```
CalledProcessError: Command '/mnt/software/support/viirsmend/1.2.12/bin/python /mnt/software/support/viirsmend/1.2.12/bin/viirsl1mend VNP02MOD.A2018323.0100.001.2018323062016.uwssec.bowtie_restored.nc /dev/shm/dir_1514/tmppO3glQ/8/VNP03MOD.A2018323.0100.001.2018323062028.uwssec.nc' returned non-zero exit status -11
INFO 2019-08-22 01:52:27,568 driver -- job exiting with failure
```

Those 19 failures (`exit_code` `-11`) appear to be `viirsmend` segfaults. That will happen occasionally, so we're not going to worry about those.
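
As an aside, a negative exit status from a Python subprocess (as in the `CalledProcessError` above) means the child was killed by that signal number, so `-11` is signal 11. A quick lookup in plain Python (nothing from the flo stack) confirms which signal that is:

```python
import signal

# Subprocess return code -11 means the child died on signal 11;
# look the number up in the standard signal table.
print(signal.Signals(11).name)  # SIGSEGV, i.e. a segmentation fault
```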

That leaves us with the big elephant in the room: the null `exit_code`s. We can group those failures by day, and interestingly it looks like there are whole days that are failing.

```
satellite='snpp'; psql $flo_user -c "select date_trunc('days',pydt(context->'granule')) as d,count(*) from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' and exit_code is null group by d order by d;"

satellite='snpp'; psql $flo_user -c "select date_trunc('days',pydt(context->'granule')) as d,count(*) from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' and timestamp > '2019-08-22' and exit_code is null group by d order by d;"
```

So I list out a few of the job numbers for the 13th (2018-03-13) so that we can look at the output:

```
satellite='snpp'; psql $flo_user -c "select job from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' and exit_code is null and context->'granule' like '%2018, 3, 13,%' limit 5;"
```

```
    job
-----------
 138461881
 138461912
 138461896
 138461883
 138461932
(5 rows)
```

```
[flo@vultur ~]$ jobout 138461896 | xargs tail -n 20
```

```
Bad VIIRS radiance data found (0.1f% according to band 0 quality flags) - exiting
Bad VIIRS radiance data found (0.1f% according to band 15 quality flags) - exiting
ERROR 2019-08-21 03:31:21,105 __init__ -- There are no Matlab files "fusion_output.mat" to convert, aborting
INFO 2019-08-21 03:31:21,131 __init__ -- run_fusion_matlab() generated None
ERROR 2019-08-21 03:31:21,189 runner -- Failure on computation flo.sw.fusion_matlab:FUSION_MATLAB context {'satellite': 'snpp', 'version': '1.0.0dev3', 'granule': datetime.datetime(2018, 3, 13, 1, 48)}
```

So it is saying there is bad VIIRS data. Looking at a more recent granule with a null `exit_code`:

```
satellite='snpp'; psql $flo_user -c "select job from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->'satellite'='''$satellite''' and context->'version'='''1.0.0dev3''' and exit_code is null and context->'granule' like '%2019, 6, 30,%' limit 5;"
```

```
    job
-----------
 138589438
 138589439
 138589440
 138589441
 138589442
(5 rows)
```

```
[flo@vultur ~]$ jobout 138589440 | xargs tail -n 20
tail: cannot open ‘/scratch/flo/jobs/d138/d589/138589440-stdout’ for reading: No such file or directory
```

...which indicates that the job never even started the delivered package. To see why, we take a look at the job's log file:

```
[flo@vultur ~]$ jobout 138589440 | sed 's/-stdout/-log/' | xargs tail -n 20
```

```
000 (825961.440.000) 08/22 13:17:54 Job submitted from host: <128.104.109.26:9618?PrivAddr=%3c10.1.4.30:9618%3fsock%3d1882238_49e3_11%3e&PrivNet=FLOOD&addrs=128.104.109.26-9618+[2607-f388-1090-0-266e-96ff-fe7b-4238]-9618&noUDP&sock=1882238_49e3_11>
...
009 (825961.440.000) 08/22 16:17:00 Job was aborted by the user.
    via condor_rm (by user flo)
```