
retrieveData() revision collisions, incomplete data #189

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q

Description

Well, it was bound to happen.

Yesterday, @JakobBD generated new REMIND input data using revision 6.58. That job finished around half past five (17:24).

SLURM Job Information
=======================
JobId=28328875 UserId=jakobdu(4731) GroupId=users(100) Name=rem-preprocessing 
JobState=COMPLETED Partition=priority TimeLimit=1440 StartTime=2023-10-26T09:58:56 
EndTime=2023-10-26T17:24:17 NodeList=cs-f14c01b08 NodeCnt=1 ProcCnt=1 
WorkDir=/p/tmp/jakobdu/pre-processing ReservationName= Gres= Account=rdev 
QOS=priority WcKey= Cluster=hlrs2015 SubmitTime=2023-10-26T09:58:44 
EligibleTime=2023-10-26T09:58:44 DerivedExitCode=0:0 ExitCode=0:0
=======================

At some point, another REMIND user (we do not yet know who) also set out to generate new input data with revision 6.58. Luckily, since it is what allowed us to detect the collision, that run apparently failed around nine, leaving an incomplete input data revision in its wake.

$ stat -c "%n %y" /p/projects/rd3mod/inputdata/output/rev6.58_*.tgz | column -t
/p/projects/rd3mod/inputdata/output/rev6.58_2b1450bc-980575d2_validationremind.tgz  2023-10-26  17:24:17.078817187  +0200
/p/projects/rd3mod/inputdata/output/rev6.58_2b1450bc_remind.tgz                     2023-10-26  17:16:24.872150471  +0200
/p/projects/rd3mod/inputdata/output/rev6.58_62eff8f7_remind.tgz                     2023-10-26  20:57:58.956467411  +0200
/p/projects/rd3mod/inputdata/output/rev6.58_62eff8f7_validationremind.tgz           2023-10-26  16:59:36.346072550  +0200
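
The listing above exposes the collision because two archives of the same type carry the same revision but different hashes. A minimal sketch of automating that check follows. It assumes the naming scheme rev<revision>_<hash>_<type>.tgz seen in the listing; the function name find_colliding_revisions is hypothetical and not part of any existing tool.

```shell
# Hypothetical helper: print "<revision> <type>" for every revision that has
# more than one archive of the same type in the given directory.
# Assumes file names of the form rev<revision>_<hash>_<type>.tgz.
find_colliding_revisions() {
    dir=$1
    for f in "$dir"/rev*_*_*.tgz; do
        [ -e "$f" ] || continue    # glob matched nothing
        base=${f##*/}              # strip the directory part
        base=${base%.tgz}          # strip the extension
        rev=${base%%_*}            # everything before the first "_", e.g. rev6.58
        type=${base##*_}           # everything after the last "_", e.g. remind
        printf '%s %s\n' "$rev" "$type"
    done | sort | uniq -d          # keep only pairs that occur more than once
}
```

Called as find_colliding_revisions /p/projects/rd3mod/inputdata/output, this would report both archive types of rev6.58 for the four files listed above.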

Several issues here are worth addressing:

  • Social mechanisms for blocking revision numbers obviously do not suffice. We have users use lastrev to check the next free revision number, but concurrently started jobs will still collide.
    The easiest solution, it seems to me, is to create empty files at the start of the run to block the revision number.
  • retrieveData() will happily overwrite output. In terms of scientific reproducibility, that should not happen.
    Stopping whenever the files that would be written already exist would solve this (and contribute to revision blocking).
  • Somehow retrieveData() failed, but still wrote a legitimate-looking archive.
    I downloaded the original rev6.58_62eff8f7_remind.tgz file for testing before it was corrupted. The new archive is missing several files, rendering it unfit for use with REMIND:
    $ comm -3 <( tar -tf ~/PIK/swap/inputdata/output/rev6.58_62eff8f7_remind.tgz | sort ) <( tar -tf ~/PIK/Remind_input/rev6.58_62eff8f7_remind.tgz | sort )
      ./config.rds
      ./f21_tau_fe_sub.cs4r
      ./f21_tau_fe_tax.cs4r
      ./f21_tax_convergence.cs4r
      ./f_gdp.cs3r
      ./f_lab.cs3r
      ./f_pop.cs3r
      ./p01_boundInvMacro.cs4r
      ./pm_shPPPMER.cs4r
      ./regionmappingH12.csv
    
    I have no idea how retrieveData() wrote an archive with missing files, but I think it should not do that. (It does not appear to have been a race condition where the first run deleted the files while the second was packing up its archive. All files from the second run show creation dates later than 19:00.)
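
The first two points could be combined into a single reservation step that runs before any data is generated. The following is only a sketch of the idea, not how retrieveData() works today. The function name reserve_revision and the marker file name rev<revision>.reserved are assumptions made for illustration. It also assumes that exclusive file creation behaves atomically on the file system holding the output directory, which should be verified for a shared cluster file system.

```shell
# Hypothetical helper: claim a revision number before generating any data.
# Returns 0 if this call created the marker, non-zero if the revision is taken.
reserve_revision() {
    outdir=$1
    rev=$2
    marker="$outdir/rev${rev}.reserved"    # marker name is an assumption

    # Refuse to proceed if any archive for this revision already exists,
    # so that existing output is never overwritten.
    for f in "$outdir"/rev"${rev}"_*.tgz; do
        if [ -e "$f" ]; then
            echo "revision $rev already has output: $f" >&2
            return 1
        fi
    done

    # With noclobber set, '>' fails instead of truncating an existing file,
    # so only one of several concurrently started jobs gets the marker.
    if ( set -o noclobber; : > "$marker" ) 2>/dev/null; then
        return 0
    fi
    echo "revision $rev is already reserved: $marker" >&2
    return 1
}
```

A job script would call something like reserve_revision /p/projects/rd3mod/inputdata/output 6.58 || exit 1 as its first step. The marker is deliberately left in place after a failed run, so a broken revision stays blocked until someone looks at it.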
