Invalid mt19937 state #190
I can confirm the error happens when I create the new iterator. Also, my environment info is:
This seems to happen during the seed synchronization of your dataloader (between all processes). Do you have a minimal reproducer I could look at?
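(Editorial note, not part of the thread.) For context on where that message originates: PyTorch's CPU generator validates the byte tensor passed to `Generator.set_state()` and raises `RuntimeError: Invalid mt19937 state` when the bytes do not decode to a valid Mersenne Twister state, which is why a corrupted or mismatched RNG-state exchange between ranks surfaces with exactly this error. A minimal illustration:

```python
import torch

g = torch.Generator()
state = g.get_state()       # uint8 tensor holding the full MT19937 state
g.set_state(state)          # round-tripping the intact state works fine

# If the bytes that arrive from another process are garbage (for example because
# the ranks were desynchronized and matched up with the wrong collective call),
# the state fails validation with overwhelming probability:
garbage = torch.randint(0, 256, state.shape, dtype=torch.uint8)
g.set_state(garbage)        # RuntimeError: Invalid mt19937 state
```

In other words, the message usually means a rank received an RNG state it could not decode, which is consistent with the "seed synchronization" diagnosis above.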
@sgugger I tried to use some dummy data to reproduce the error, but I failed; it seems the data needs to be the same as what I have. I added a few print statements in the code, is this helpful for you?
The output I got is:
So it seems the error occurs when one of the processes first reaches that point while others are still training. I tried to add
This gives me:
Please let me know if you need more information.
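(Editorial aside, not the reporter's code.) A small debugging aid for confirming which rank reaches the evaluation point first is to tag the prints with `accelerator.process_index` and put an explicit barrier right in front of the spot where the new dataloader iterator is created:

```python
import time
from accelerate import Accelerator

accelerator = Accelerator()  # in a real script, reuse the existing Accelerator instance

# Placed right before the evaluation dataloader is iterated:
print(f"[rank {accelerator.process_index}] reached eval at {time.time():.2f}", flush=True)
accelerator.wait_for_everyone()  # every rank must line up here before the new iterator is created
print(f"[rank {accelerator.process_index}] passed the barrier at {time.time():.2f}", flush=True)
```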
Like I said, I need a reproducible example in order to be able to debug this. I can't run the code sample you provided, as it's not complete.
Same error. Have you solved it?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi there, has there been any update on this issue? I'm running into the same error.
I'm running into the same error, too.
Any update on this? I also ran into the same error.
Please give us a full reproducible example with the code, library versions, platform, and machine information. Only then will we be able to help.
The error I had was caused by incorrect usage of `accelerator.is_main_process` and `accelerator.wait_for_everyone()`. The issue is that the other processes never get to execute the inference inside the `if` block.

Problem:

```
[rank1]: generator.set_state(rng_state)
[rank1]: RuntimeError: Invalid mt19937 state
```

NOTE THAT the data loader (`dataloader_test`) was not prepared with accelerator! I did something like:

```python
if accelerator.is_main_process:
    with torch.no_grad():
        preds, confidences_image = infer(model, dataloader_test)
        print("preds: ", preds)
accelerator.wait_for_everyone()
```

In this case, only the main process (rank 0) was running the inference, while the other processes were waiting indefinitely or raising the error above.

Solution:

```python
with torch.no_grad():
    preds, confidences_image = infer(model, dataloader_test)
    print("preds: ", preds)
```

By allowing all processes to participate in the inference, the program executed correctly and no process got stuck. The inference step no longer relied solely on the main process, avoiding desynchronization issues. Hope my experience helps :)
This still happens to me on 8x8 A100 training. I'm not doing anything different on any specific rank, so there is no obvious fix.
This happens to me, too. I don't have a minimal reproducer but here are the steps to reproduce:
```bash
export NCCL_ASYNC_ERROR_HANDLING=1
export LR=1e-4
export WEIGHT_DECAY=1e-4
export GUIDANCE_SCALE=15.0
export CAPTION_DROPOUT=0.1

accelerate launch --config_file=accelerate_ds2.yaml train_control_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --dataset_name="sayakpaul/OmniEdit-mini" \
  --image_column="edited_img" --conditioning_image_column="src_img" --caption_column="edited_prompt_list" \
  --output_dir="edit-control-lr_${LR}-wd_${WEIGHT_DECAY}-gs_${GUIDANCE_SCALE}-cd_${CAPTION_DROPOUT}" \
  --mixed_precision="bf16" \
  --train_batch_size=4 \
  --dataloader_num_workers=4 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --proportion_empty_prompts=$CAPTION_DROPOUT \
  --learning_rate=$LR \
  --adam_weight_decay=$WEIGHT_DECAY \
  --guidance_scale=$GUIDANCE_SCALE \
  --report_to="wandb" --log_dataset_samples \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=1000 \
  --checkpointing_steps=2000 \
  --resume_from_checkpoint="latest" \
  --max_train_steps=10000 \
  --seed="0"
```

I am launching the script with a SLURM scheduler on a node of 8 H100s. The `accelerate_ds2.yaml` config is:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
- `Accelerate` version: 1.1.1
- Platform: Linux-5.15.0-1049-aws-x86_64-with-glibc2.31
- `accelerate` bash location: /fsx/sayak/miniconda3/envs/diffusers/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.1+cu121 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 123.21 GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []

Cc: @muellerzr
I got this error when using accelerate:

My config file is:

It seems to trace down to these few lines in my code. I did something like this to iterate through a data loader (because I need to iterate two different dataloaders):

My `try_contra_loader` is prepared by accelerator. Interestingly, when I run this code outside of tmux (I got the error when running inside tmux), the process hangs at 30/234 instead of giving me the error. I don't know how to solve this; does anyone have any thoughts?
Many thanks!
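(Editorial note.) The commenter's code isn't shown, but the "iterate a second dataloader inside the training loop" pattern they describe typically looks like the sketch below; the model, datasets, and loss are placeholders, not their code. As far as I understand Accelerate's dataloader wrapper, re-creating the iterator with `iter(...)` on a prepared loader is where RNG state gets re-synchronized across ranks, which lines up with the earlier observation that the error appears "when I create the new iterator":

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder model and data standing in for the commenter's setup.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
main_ds = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
contra_ds = TensorDataset(torch.randn(16, 8), torch.randint(0, 2, (16,)))

model, optimizer, main_loader, contra_loader = accelerator.prepare(
    model, optimizer, DataLoader(main_ds, batch_size=4), DataLoader(contra_ds, batch_size=4)
)

loss_fn = torch.nn.CrossEntropyLoss()
contra_iter = iter(contra_loader)  # first iterator over the prepared (shorter) loader

for x, y in main_loader:
    try:
        cx, cy = next(contra_iter)
    except StopIteration:
        # Re-creating the iterator on a prepared dataloader is where Accelerate
        # re-synchronizes RNG state across processes, so every rank has to reach
        # this point in the same order.
        contra_iter = iter(contra_loader)
        cx, cy = next(contra_iter)

    loss = loss_fn(model(x), y) + loss_fn(model(cx), cy)
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```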