Well, it was bound to happen.
Yesterday, @JakobBD generated new REMIND input data using revision 6.58. That job finished around half past five.
SLURM Job Information
=======================
JobId=28328875 UserId=jakobdu(4731) GroupId=users(100) Name=rem-preprocessing
JobState=COMPLETED Partition=priority TimeLimit=1440 StartTime=2023-10-26T09:58:56
EndTime=2023-10-26T17:24:17 NodeList=cs-f14c01b08 NodeCnt=1 ProcCnt=1
WorkDir=/p/tmp/jakobdu/pre-processing ReservationName= Gres= Account=rdev
QOS=priority WcKey= Cluster=hlrs2015 SubmitTime=2023-10-26T09:58:44
EligibleTime=2023-10-26T09:58:44 DerivedExitCode=0:0 ExitCode=0:0
=======================
At some point, another REMIND user (we do not yet know who) also set out to generate new input data with revision 6.58. Luckily, since it let us detect the collision, that run apparently failed around nine, leaving an incomplete input data revision in its wake.
$ stat -c "%n %y" /p/projects/rd3mod/inputdata/output/rev6.58_*.tgz | column -t
/p/projects/rd3mod/inputdata/output/rev6.58_2b1450bc-980575d2_validationremind.tgz 2023-10-26 17:24:17.078817187 +0200
/p/projects/rd3mod/inputdata/output/rev6.58_2b1450bc_remind.tgz 2023-10-26 17:16:24.872150471 +0200
/p/projects/rd3mod/inputdata/output/rev6.58_62eff8f7_remind.tgz 2023-10-26 20:57:58.956467411 +0200
/p/projects/rd3mod/inputdata/output/rev6.58_62eff8f7_validationremind.tgz 2023-10-26 16:59:36.346072550 +0200
Several issues here are worth addressing:
- Social mechanisms for blocking revision numbers obviously do not suffice. We have users use lastrev to check the next free revision number. But concurrently started jobs will collide.
  The easiest solution, it seems to me, is to create empty files at the start of the run to block them.
- retrieveData() will happily overwrite output. In terms of scientific reproducibility, that should not happen.
  So stopping if the files that would be written already exist would solve this (and contribute to revision blocking).
- Somehow retrieveData() failed, but still wrote a legitimate-looking archive.
  I downloaded the original rev6.58_62eff8f7_remind.tgz file for testing before it was corrupted. The new archive is missing several files, rendering it unfit for use with REMIND:
      $ comm -3 <( tar -tf ~/PIK/swap/inputdata/output/rev6.58_62eff8f7_remind.tgz | sort ) <( tar -tf ~/PIK/Remind_input/rev6.58_62eff8f7_remind.tgz | sort )
      ./config.rds
      ./f21_tau_fe_sub.cs4r
      ./f21_tau_fe_tax.cs4r
      ./f21_tax_convergence.cs4r
      ./f_gdp.cs3r
      ./f_lab.cs3r
      ./f_pop.cs3r
      ./p01_boundInvMacro.cs4r
      ./pm_shPPPMER.cs4r
      ./regionmappingH12.csv
  I have no idea how retrieveData() wrote an archive with missing files, but I think it should not do that. (It appears not to have been a race condition where the first run deleted the files while the second was packing up its archive. All files from the second run show creation dates later than 19:00.)
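To illustrate the first two points, here is a minimal sketch of claiming a revision before a run starts. It is not how retrieveData() or lastrev work today. The function name `claim_revision` and the marker file name `rev<revision>.claimed` are made up for this example. With noclobber enabled (`set -C`), the `>` redirection fails if the marker already exists, so the second of two runs for the same revision stops instead of overwriting. Whether that creation is reliably exclusive on the cluster's shared filesystem is an assumption that would need checking.

```shell
# Hypothetical helper, not part of any existing tool.
# Usage: claim_revision <output directory> <revision>
claim_revision() {
    # Hypothetical marker file; one per revision, independent of the hashes
    # that appear in the archive names.
    marker="$1/rev$2.claimed"
    # noclobber is set inside a subshell only, so the caller's shell options
    # stay untouched. The redirection fails if the marker already exists.
    if ( set -C; printf '%s\n' "${USER:-unknown}" > "$marker" ) 2>/dev/null; then
        echo "claimed revision $2"
    else
        echo "could not claim revision $2 (marker exists or directory not writable)" >&2
        return 1
    fi
}
```

A preprocessing job could call `claim_revision /p/projects/rd3mod/inputdata/output 6.58 || exit 1` as its first step. The same noclobber idea would also cover the second point if applied to the archive files themselves.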
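For the third point, a check along the following lines could catch an incomplete archive before anyone uses it. This is only a sketch: the function name `check_archive` and the idea of a manifest file (one expected member name per line, written as tar lists them, e.g. `./config.rds`) are assumptions for illustration, and nothing like this exists in retrieveData() as far as this issue shows. It uses the same `tar -tf` and `comm` comparison as the command in the third point.

```shell
# Hypothetical helper, not part of any existing tool.
# Usage: check_archive <manifest file> <archive>
# Exit status: 0 if every manifest entry is in the archive,
#              1 if entries are missing, 2 if the archive cannot be read.
check_archive() (
    # The function body runs in a subshell, so these variables stay local.
    manifest="$1"
    archive="$2"
    workdir=$(mktemp -d) || exit 2
    # List the archive members; a tar error means the archive is unreadable.
    if ! tar -tf "$archive" > "$workdir/raw"; then
        rm -rf "$workdir"
        exit 2
    fi
    sort "$manifest" > "$workdir/want"
    sort "$workdir/raw" > "$workdir/have"
    # comm -23 keeps lines found only in the first file: expected but absent.
    missing=$(comm -23 "$workdir/want" "$workdir/have")
    rm -rf "$workdir"
    if [ -n "$missing" ]; then
        printf 'missing from %s:\n%s\n' "$archive" "$missing" >&2
        exit 1
    fi
)
```

Running such a check on a temporary file and renaming it to its final name only on success would keep a failed run from leaving a legitimate-looking archive behind.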