Deploy the glue code to the user account cluster
We import the flo3 interface Python code for fusion_matlab into the software tree /mnt/software/geoffc by running rsync:
sudo su - flo
cd /mnt/software/geoffc
rsync -urLv /home/geoffc/py3flo/git/fusion_matlab/source/flo/ fusion_matlab/flo/ --progress --exclude=.*.sw*
Deploy the glue code to the development (flo) account cluster
We import the flo3 interface Python code for fusion_matlab into the software tree /mnt/software/flo by changing to the flo account and running rsync:
sudo su - flo
cd /mnt/software/flo/
rsync -urLv /home/geoffc/py3flo/git/fusion_matlab/source/flo/ fusion_matlab/flo/ --progress --exclude=.*.sw*
Commit glue code to PeateScience repo
The actual glue code was copied to /mnt/software in the last step, but pushing the Python code to the fusion_matlab repo also provides the submission scripts example_local_prepare.py and submit_fusion_matlab.py for use on condor.
cd /home/geoffc/py3flo/git/fusion_matlab
git pull
git commit -a -m "Some change in the glue code."
git push
Running the Fusion code on the cluster
We can now submit fusion_matlab to the cluster from condor, on the development (flo) account:
sudo su - flo
cd /data/geoffc/fusion_matlab/cluster_processing/logs
$ python /home/geoffc/code/PeateScience/packages/fusion_matlab/submit_fusion_matlab.py
Verbosity is 2
(INFO): Submitting intervals...
(INFO): Submitting interval 2018-04-10 00:00:00 -> 2018-04-10 23:59:59
(INFO): Opening log file fusion_matlab_snpp_s201804100000_e201804102359_c20201124170636.log
(INFO): There are 240 contexts in this interval
(INFO): First context: {'satellite': 'snpp', 'version': '1.2.0.1dev0', 'granule': datetime.datetime(2018, 4, 10, 0, 0)}
(INFO): Last context: {'satellite': 'snpp', 'version': '1.2.0.1dev0', 'granule': datetime.datetime(2018, 4, 10, 23, 54)}
(INFO): Attempted 240 submits; results: 239/1/0 (submitted/not needed/not ready) [jobs 187647274 to 187647512]
(INFO): Job nums: range(187647274, 187647513)
(INFO): contexts: [{'satellite': 'snpp', 'version': '1.2.0.1dev0', 'granule': datetime.datetime(2018, 4, 10, 0, 0)}, {'satellite': 'snpp', 'version': '1.2.0.1dev0', 'granule': datetime.datetime(2018, 4, 10, 23, 54)}]; job numbers: [187647274,187647512]
(INFO): job numbers: [187647274..187647512]
(INFO): Closing log file fusion_matlab_snpp_s201804100000_e201804102359_c20201124170636.log
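The summary line in the log above can be sanity-checked programmatically. The following is a minimal sketch, not part of submit_fusion_matlab.py: the function names are hypothetical, the regular expression is written against the one log line shown here, and the 6-minute granule spacing is an assumption inferred from the first and last contexts (00:00 and 23:54) and the count of 240.

```python
import re
from datetime import datetime, timedelta

# Pattern written against the single summary line shown in the log above.
SUMMARY_RE = re.compile(
    r"Attempted (\d+) submits; results: (\d+)/(\d+)/(\d+) "
    r"\(submitted/not needed/not ready\) \[jobs (\d+) to (\d+)\]"
)


def parse_summary(line):
    """Return the counts and job range from a submit summary line, or None."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    keys = ("attempted", "submitted", "not_needed", "not_ready",
            "first_job", "last_job")
    return dict(zip(keys, map(int, match.groups())))


def granule_times(day, step_minutes=6):
    """List the granule start times for one day (assumed 6-minute spacing)."""
    start = datetime(day.year, day.month, day.day)
    count = 24 * 60 // step_minutes
    return [start + timedelta(minutes=step_minutes * i) for i in range(count)]


line = ("(INFO): Attempted 240 submits; results: 239/1/0 "
        "(submitted/not needed/not ready) [jobs 187647274 to 187647512]")
summary = parse_summary(line)
granules = granule_times(datetime(2018, 4, 10))

# The three result counts should add up to the number of attempts, and the
# job range should contain exactly one job per submitted context.
print(summary)
print(len(granules), granules[0], granules[-1])
```

If the three result counts do not add up to the number of attempts, or the job range does not span the submitted count, the log is worth a closer look.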
We can keep track of running jobs with the following condor_q invocations:
sudo su - flo
condor_q -autoformat FloClusterComputations | sort | uniq -c
condor_q -constraint 'FloClusterComputations=="flo.sw.fusion_matlab:FUSION_MATLAB"' -constraint 'Owner=="flo"'
condor_q -autoformat FloClusterComputations Owner ClusterID ProcID
condor_q -format '%d' ClusterId -format '.%d\n' ProcId
condor_q -constraint 'FloClusterComputations=="flo.sw.fusion_matlab:FUSION_MATLAB"' -format '%d' ClusterId -format '.%d\n' ProcId
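The `sort | uniq -c` tally in the first condor_q command can also be done in Python when the output needs further processing. This is a minimal sketch: `count_computations` is a hypothetical helper, and the sample text is illustrative rather than real queue output.

```python
from collections import Counter


def count_computations(autoformat_output):
    """Count jobs per FloClusterComputations value, like `sort | uniq -c`."""
    lines = (line.strip() for line in autoformat_output.splitlines())
    return Counter(line for line in lines if line)


# Illustrative stand-in for the output of
#   condor_q -autoformat FloClusterComputations
sample = """\
flo.sw.fusion_matlab:FUSION_MATLAB
flo.sw.fusion_matlab:FUSION_MATLAB
flo.sw.fusion_matlab:FUSION_MATLAB_QL
"""

counts = count_computations(sample)
for name, number in sorted(counts.items()):
    print(f"{number:7d} {name}")
```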
We can change the job mods of running jobs with condor_qedit:
condor_qedit -const 'FloClusterComputations == "flo.sw.fusion_matlab:FUSION_MATLAB"' RequestMemory 6000
To look at the log files of one or more jobs (in IPython):
# snippets.py is assumed to provide job_number_to_dir, np and glob
run -e /home/geoffc/git/sips_utils/snippets.py
# Half-open range: this selects only job 86694864
job_range = (86694864, 86694865)
job_file_branches = [job_number_to_dir('/scratch/flo/jobs',job) for job in range(*job_range)]
if len(job_file_branches)>1:
    job_stdout_files = list(np.squeeze([glob(dir+'-stdout') for dir in job_file_branches]))
    job_stderr_files = list(np.squeeze([glob(dir+'-stderr') for dir in job_file_branches]))
else:
    job_stdout_files = list([glob(dir+'-stdout') for dir in job_file_branches][0])
    job_stderr_files = list([glob(dir+'-stderr') for dir in job_file_branches][0])
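The single-job and multi-job branches above can be collapsed by flattening the glob results directly. The following is a sketch under assumptions: `collect_job_logs` is a hypothetical helper, and the job-to-path mapping is passed in as a callable because the behaviour of `job_number_to_dir` from snippets.py is not shown here.

```python
from glob import glob
from itertools import chain


def collect_job_logs(job_range, job_to_prefix, suffix):
    """Return a flat, sorted list of log files for jobs in [start, stop).

    job_to_prefix maps a job number to the path prefix of its log files;
    suffix is e.g. '-stdout' or '-stderr'.
    """
    start, stop = job_range
    matches = (glob(job_to_prefix(job) + suffix) for job in range(start, stop))
    return sorted(chain.from_iterable(matches))


# Possible usage, assuming job_number_to_dir is available from snippets.py:
#   to_prefix = lambda job: job_number_to_dir('/scratch/flo/jobs', job)
#   job_stdout_files = collect_job_logs((86694864, 86694865), to_prefix, '-stdout')
#   job_stderr_files = collect_job_logs((86694864, 86694865), to_prefix, '-stderr')
```

Jobs with no matching files simply contribute nothing to the list, so no special case is needed for a range of length one.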
To check the database for the fusion_matlab output:
flo_user='-h ratchet -U ro flo3'
> psql $flo_user -c "SELECT job,size,output,context,file_name from stored_products where computation='flo.sw.fusion_matlab:FUSION_MATLAB' and output='fused_l1b' order by file_name;"
job | size | output | context | file_name
----------+-----------+-----------+-------------------------------------------------------------------------------------------------------+-------------------------------------------------
91073252 | 366596456 | fused_l1b | "granule"=>"datetime.datetime(2015, 4, 18, 6, 6)", "version"=>"'2.0.1dev0'", "satellite"=>"'snpp'" | VNP02FSN.A2015108.0606.001.2018025180544.nc
(1 row)
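The file name in the row above appears to encode the granule time: A2015108.0606 corresponds to day 108 of 2015 at 06:06, which matches the context value datetime.datetime(2015, 4, 18, 6, 6). A minimal sketch for recovering the granule time from a file name follows. The field layout (product, A + year + day-of-year, HHMM, ...) is an assumption read off this one example, and `granule_from_filename` is a hypothetical helper.

```python
from datetime import datetime


def granule_from_filename(file_name):
    """Parse the granule start time from a name such as
    VNP02FSN.A2015108.0606.001.2018025180544.nc.

    Assumed layout: PRODUCT.AYYYYDDD.HHMM.<remaining fields>.nc
    """
    fields = file_name.split(".")
    if len(fields) < 3 or not fields[1].startswith("A"):
        raise ValueError(f"unexpected file name layout: {file_name!r}")
    # %j is the zero-padded day of the year.
    return datetime.strptime(fields[1][1:] + fields[2], "%Y%j%H%M")


granule = granule_from_filename("VNP02FSN.A2015108.0606.001.2018025180544.nc")
print(granule)
```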
To group granules by day, month, etc.:
satellite='snpp'; psql $flo_user -c "select date_trunc('months', jts(context->'granule')) as m,count(*) from stored_products where computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.0' group by m order by m;"
To select granules which match or are between certain dates:
satellite='snpp'; psql $flo_user -c "SELECT job,size,context,file_name from stored_products where computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' and (context#>>'{granule,value}')::timestamp between '2019-01-01' and '2019-01-02' order by file_name;" | less
To remove old files:
sudo su - flo
flo_user_rw='-h ratchet flo3'
satellite='snpp'; psql $flo_user_rw -c "SELECT job, size, context, file_name FROM stored_products WHERE computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' order by file_name" | less
satellite='snpp'; psql $flo_user_rw -c "DELETE FROM stored_products WHERE computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0'"
Other Database Queries
satellite='snpp'; psql $flo_user -c "SELECT jts(context->'granule') as d,count(*) FROM stored_products WHERE computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' group by d order by d;" | less
satellite='snpp'; psql $flo_user -c "SELECT jts(context->'granule') as d,count(*) FROM stored_products WHERE computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' and date_trunc('days',jts(context->'granule'))='2014-01-01' group by d order by d;" | less
satellite='snpp'; psql $flo_user -c "SELECT job,size,context,file_name from stored_products where computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' order by file_name;" | less
satellite='snpp'; psql $flo_user -c "SELECT job,size,context,file_name from stored_products where computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' and date_trunc('days',jts(context->'granule'))='2014-01-01' order by file_name;" | less
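# List the 6-minute granules in April 2015 that have no fused_l1b product (run in the psql shell, where shell variables such as $satellite are not expanded)
flo3=> select x FROM generate_series('2015-04-01'::timestamp, '2015-04-30 23:59', '6 minutes') as x where not exists (select null from stored_products where computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='snpp' and output='fused_l1b' and x=jts(context->'granule'));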
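The generate_series query above is an anti-join: it enumerates every expected 6-minute granule time and keeps those with no matching stored product. The same logic can be sketched in Python, for example to post-process a list of granule times already fetched from the database. `missing_granules` is a hypothetical helper, and like generate_series it treats the end time as inclusive.

```python
from datetime import datetime, timedelta


def missing_granules(start, end, present, step=timedelta(minutes=6)):
    """Return expected granule times in [start, end] that are not in `present`."""
    present = set(present)
    missing = []
    current = start
    while current <= end:
        if current not in present:
            missing.append(current)
        current += step
    return missing


# Illustrative input: two of the first four granules of the day exist.
have = [datetime(2015, 4, 1, 0, 0), datetime(2015, 4, 1, 0, 12)]
gaps = missing_granules(datetime(2015, 4, 1, 0, 0),
                        datetime(2015, 4, 1, 0, 18), have)
print(gaps)
```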
# List file keys
satellite='snpp'; psql $flo_user -tA -c "SELECT format ('flo3/%s/%s',job,file_name) FROM stored_products WHERE computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' order by file_name limit 5;"
# List file keys and status
satellite='snpp'; psql $flo_user -tA -c "SELECT format ('flo3/%s/%s',job,file_name) FROM stored_products WHERE computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' order by file_name limit 5;" | xargs -n1 -IXX rados -p dev --id flo stat XX
# List file key basenames
satellite='snpp'; psql $flo_user -tA -c "SELECT format ('flo3/%s/%s',job,file_name) FROM stored_products WHERE computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' order by file_name limit 5;" | xargs -n1 -IXX basename XX
# List the rados commands to download files using the database file keys.
# basename must run inside sh -c; otherwise the calling shell evaluates $(basename XX) before xargs substitutes XX.
satellite='snpp'; psql $flo_user -tA -c "SELECT format ('flo3/%s/%s',job,file_name) FROM stored_products WHERE computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' order by file_name limit 5;" | xargs -n1 -IXX sh -c 'echo rados -p dev --id flo get XX ~/fusion_matlab/work/links/$(basename XX)'
# rados commands
rados -p dev --id flo get flo3/91069111/VNP02FSN.A2015091.0000.001.2018025170339.nc VNP02FSN.A2015091.0000.001.2018025170339.nc
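When many files need downloading, the rados commands can be assembled in Python from the file keys returned by the database. This is a minimal sketch using only the `rados -p dev --id flo get KEY DEST` form shown above. `rados_get_argv` is a hypothetical helper, and it only builds the argument lists; running them (for example with subprocess.run) is left to the caller.

```python
import os.path


def rados_get_argv(file_keys, dest_dir, pool="dev", client_id="flo"):
    """Build one `rados -p POOL --id ID get KEY DEST` argument list per key.

    Each file is written to dest_dir under the basename of its key.
    """
    dest_dir = os.path.expanduser(dest_dir)
    commands = []
    for key in file_keys:
        dest = os.path.join(dest_dir, os.path.basename(key))
        commands.append(["rados", "-p", pool, "--id", client_id, "get", key, dest])
    return commands


keys = ["flo3/91069111/VNP02FSN.A2015091.0000.001.2018025170339.nc"]
for argv in rados_get_argv(keys, "~/fusion_matlab/work/links"):
    print(" ".join(argv))
```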
Running in Forward Stream
Job parameters for FUSION_MATLAB can be found from
flo3=> select * from forward_streams where name = 'FusionMatlab';
id | name | offset_start | offset_end | find_contexts_arguments | workflow_head | workflow_targets | workflow_download_onlies | job_mods | output_volume | num_retries | expiration
----+--------------+--------------+------------+-----------------------------------------------+------------------------------------+------------------------------------------------+--------------------------+-------------------------------+---------------+-------------+------------
43 | FusionMatlab | -4 days | 00:00:00 | "version"=>"'1.0dev3'", "satellite"=>"'snpp'" | flo.sw.fusion_matlab:FUSION_MATLAB | {flo.sw.fusion_matlab:FUSION_MATLAB;fused_l1b} | {} | "requests"=>"['Memory=8000']" | | |
(1 row)
and for FUSION_MATLAB_QL:
flo3=> select * from forward_streams where name = 'FusionMatlabDailyQL';
id | name | offset_start | offset_end | find_contexts_arguments | workflow_head | workflow_targets | workflow_download_onlies | job_mods | output_volume | num_retries | expiration
----+---------------------+--------------+------------+-----------------------------------------------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------+------------------------------------------------------------------------------------------------------------------------------------------+---------------+-------------+------------
53 | FusionMatlabDailyQL | -6 days | -2 days | "version"=>"'1.0dev2'", "satellite"=>"'snpp'" | flo.sw.fusion_matlab:FUSION_MATLAB_QL | {flo.sw.fusion_matlab:FUSION_MATLAB_QL;fused_l1b_ql_band27_asc,flo.sw.fusion_matlab:FUSION_MATLAB_QL;fused_l1b_ql_band27_desc,flo.sw.fusion_matlab:FUSION_MATLAB_QL;fused_l1b_ql_band33_asc,flo.sw.fusion_matlab:FUSION_MATLAB_QL;fused_l1b_ql_band33_desc} | {} | "classads"=>"['HookKeyword=SCRATCH']", "requests"=>"['Scratch=3','Memory=8000']", "requirements"=>"['TARGET.Scratch >= RequestScratch']" | | |
(1 row)
To preview the submission of the Fusion Matlab level-1b files to Forward Stream, we enter the following in the psql shell:
explain INSERT INTO forward_streams (
name, offset_start, offset_end, find_contexts_arguments,
workflow_head, workflow_targets, job_mods
)
VALUES (
'FusionMatlab',
'-4 Days'::interval,
'00:00:00'::interval,
'version=>"''2.0.1dev0''", satellite=>"''snpp''"'::hstore,
'flo.sw.fusion_matlab:FUSION_MATLAB',
'{flo.sw.fusion_matlab:FUSION_MATLAB;fused_l1b}'::text[],
'requests=>"[''Memory=8000'']"'::hstore
)
;
giving
QUERY PLAN
---------------------------------------------------------------
Insert on forward_streams (cost=0.00..0.01 rows=1 width=288)
-> Result (cost=0.00..0.01 rows=1 width=288)
(2 rows)
To preview the submission of the Fusion Matlab Quicklooks to Forward Stream, we enter the following in the psql shell:
explain INSERT INTO forward_streams (
name, offset_start, offset_end, find_contexts_arguments,
workflow_head, workflow_targets, job_mods
)
VALUES (
'FusionMatlabDailyQL',
'-6 Days'::interval,
'-2 Days'::interval,
'version=>"''1.0dev3''", satellite=>"''snpp''"'::hstore,
'flo.sw.fusion_matlab:FUSION_MATLAB_QL',
'{flo.sw.fusion_matlab:FUSION_MATLAB_QL;fused_l1b_ql_band27_asc,flo.sw.fusion_matlab:FUSION_MATLAB_QL;fused_l1b_ql_band27_desc,flo.sw.fusion_matlab:FUSION_MATLAB_QL;fused_l1b_ql_band33_asc,flo.sw.fusion_matlab:FUSION_MATLAB_QL;fused_l1b_ql_band33_desc}'::text[],
'requirements=>"[''TARGET.Scratch >= RequestScratch'']", requests=>"[''Scratch=3'',''Memory=8000'']", classads=>"[''HookKeyword=SCRATCH'']"'::hstore
)
;
which outputs
QUERY PLAN
---------------------------------------------------------------
Insert on forward_streams (cost=0.00..0.01 rows=1 width=288)
-> Result (cost=0.00..0.01 rows=1 width=288)
(2 rows)
To actually submit the task, remove the explain keyword from the above invocation.
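The hstore literals in the INSERT statements are awkward to type by hand, because each value is a quoted string inside a double-quoted hstore value inside a single-quoted SQL literal. The sketch below generates such a literal from a Python dict. It rests on an assumption: that the stored values are Python reprs, which is suggested by entries such as "'snpp'" and "datetime.datetime(2015, 4, 18, 6, 6)" in the query output but is not confirmed here. `hstore_literal` is a hypothetical helper and handles only simple keys and values.

```python
def hstore_literal(mapping):
    """Render a dict as a single-quoted SQL hstore literal.

    Each value is stored as its Python repr (an assumption, see above).
    Keys are emitted unquoted, so they must be simple identifiers.
    """
    pairs = []
    for key, value in mapping.items():
        text = repr(value)
        if '"' in text or "\\" in text:
            raise ValueError(f"value needs hstore escaping: {text}")
        pairs.append(f'{key}=>"{text}"')
    body = ", ".join(pairs)
    # Double the single quotes so the result is a valid SQL string literal.
    return "'" + body.replace("'", "''") + "'::hstore"


print(hstore_literal({"version": "2.0.1dev0", "satellite": "snpp"}))
print(hstore_literal({"requests": ["Memory=8000"]}))
```

For real use, passing parameters through a database driver is generally safer than building SQL text; the sketch is only meant to make the quoting layers visible.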
Modifying a Forward Stream Instance
If we want to change the version number or some other characteristic of a forward stream instance (say FUSION_MATLAB)
flo3=> select * from forward_streams where name = 'FusionMatlab';
id | name | offset_start | offset_end | find_contexts_arguments | workflow_head | workflow_targets | workflow_download_onlies | job_mods | output_volume | num_retries | expiration
----+--------------+--------------+------------+-----------------------------------------------+------------------------------------+------------------------------------------------+--------------------------+-------------------------------+---------------+-------------+------------
43 | FusionMatlab | -4 days | 00:00:00 | "version"=>"'1.0dev3'", "satellite"=>"'snpp'" | flo.sw.fusion_matlab:FUSION_MATLAB | {flo.sw.fusion_matlab:FUSION_MATLAB;fused_l1b} | {} | "requests"=>"['Memory=8000']" | | |
(1 row)
we can preview the change by running
select find_contexts_arguments || 'version=>"''2.0.1dev0''"' from forward_streams where name = 'FusionMatlab';
When we are happy with the proposed changes, we can modify the existing forward stream:
update forward_streams set find_contexts_arguments = find_contexts_arguments || 'version=>"''2.0.1dev0''"' where name = 'FusionMatlab';
For the Fusion Quicklooks we similarly have
select find_contexts_arguments || 'version=>"''2.0.1dev0''"' from forward_streams where name = 'FusionMatlabDailyQL';
update forward_streams set find_contexts_arguments = find_contexts_arguments || 'version=>"''2.0.1dev0''"' where name = 'FusionMatlabDailyQL';
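The `||` operator used in these statements concatenates two hstore values, and for a key present on both sides the right-hand value wins. That is why only the version changes while the satellite entry is preserved. A rough Python analogy, with `hstore_concat` as a hypothetical name and plain dicts standing in for hstore values:

```python
def hstore_concat(left, right):
    """Mimic hstore || hstore: keys in `right` override those in `left`."""
    merged = dict(left)
    merged.update(right)
    return merged


current = {"version": "'1.0dev3'", "satellite": "'snpp'"}
preview = hstore_concat(current, {"version": "'2.0.1dev0'"})
print(preview)
```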
Deleting a Forward Stream Instance
We can examine the forward stream instances by running
psql $flo_user -c "select name,find_contexts_arguments,workflow_head from forward_streams;" | grep Fusion
resulting in
FusionMatlab | "version"=>"'2.0.1dev0'", "satellite"=>"'snpp'" | flo.sw.fusion_matlab:FUSION_MATLAB
FusionMatlabDailyQL | "version"=>"'1.0dev7'", "satellite"=>"'snpp'" | flo.sw.fusion_matlab:FUSION_MATLAB_QL
FusionMatlab_jpss1 | "version"=>"'2.0.1dev0'", "satellite"=>"'noaa20'" | flo.sw.fusion_matlab:FUSION_MATLAB
FusionMatlabDailyQL_jpss1 | "version"=>"'1.0dev7'", "satellite"=>"'noaa20'" | flo.sw.fusion_matlab:FUSION_MATLAB_QL
To delete each of these, we enter the psql shell and execute a series of begin-delete-commit transactions:
flo3=> begin;
BEGIN
flo3=> delete from forward_streams where name = 'FusionMatlabDailyQL';
DELETE 1
flo3=> commit;
COMMIT
flo3=> begin;
BEGIN
flo3=> delete from forward_streams where name = 'FusionMatlabDailyQL_jpss1';
DELETE 1
flo3=> commit;
COMMIT
flo3=> begin;
BEGIN
flo3=> delete from forward_streams where name = 'FusionMatlab_jpss1';
DELETE 1
flo3=> commit;
COMMIT
flo3=> begin;
BEGIN
flo3=> delete from forward_streams where name = 'FusionMatlab';
DELETE 1
flo3=> commit;
COMMIT
flo3=>
Examining log files of failed jobs
The details of failed jobs can be found from
psql $flo_user -c "select * from failed_jobs where head_computation = 'flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'version'='2.0.1dev0' and timestamp > '2018-01-30';"
Generate a list of job numbers for failed jobs:
psql $flo_user -c "SELECT job, context FROM failed_jobs WHERE head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'version'='2.0.1dev0' and timestamp > '2018-01-30' order by context;" | grep granule | gawk '{print $1}' > fusion_matlab_v2.0.1dev0_failed_granules.txt
satellite='snpp'; psql $flo_user -c "select job,context,timestamp,exit_code from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.1dev0' and timestamp > '2023-04-22' and exit_code != 6000 order by timestamp;"
Generate a list of job numbers from the 'dev' and 'ops' clusters, and examine the last two lines of the stdout files associated with the jobs:
satellite='snpp'; psql $flo_user -tA -c "select job from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.0' and timestamp > '2023-04-01' and exit_code is null order by timestamp;" | cat | tail | xargs -n 1 -P 10 jobout | xargs -n 1 -P 10 tail -n 2
satellite='snpp'; psql $flo_ops -tA -c "select job from failed_jobs where head_computation='flo.sw.fusion_matlab:FUSION_MATLAB' and context->>'satellite'='$satellite' and context->>'version'='2.0.0' and timestamp > '2023-04-01' and exit_code is null order by timestamp;" | cat | tail | xargs -n 1 -P 10 jobout_ops | xargs -n 1 -P 10 tail -n 2