
Invalid mt19937 state #190

Closed
yujianll opened this issue Oct 20, 2021 · 15 comments

Comments

@yujianll

I got this error while using accelerate:

[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
(the same warning is printed once per process, 4 times in total)
Training:  13%|██████████▌                        | 30/234 [01:12<05:52,  1.73s/it]
Traceback (most recent call last):
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 289, in main
    batch = next(contra_loader_iter)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 363, in <module>
    main()
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 292, in main
    batch = next(contra_loader_iter)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/data_loader.py", line 301, in __iter__
    synchronize_rng_states(self.rng_types, self.generator)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/utils.py", line 110, in synchronize_rng_states
    synchronize_rng_state(RNGType(rng_type), generator=generator)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/utils.py", line 105, in synchronize_rng_state
    generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 580655) of binary: /home/yujianl/anaconda3/envs/media_bias/bin/python
Traceback (most recent call last):
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
*****************************************************
  src/files/stance_detection/pretrain_ddp.py FAILED
=====================================================
Root Cause:
[0]:
  time: 2021-10-20_12:19:56
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 580655)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=====================================================
Other Failures:
  <NO_OTHER_FAILURES>

Traceback (most recent call last):
  File "/home/yujianl/anaconda3/envs/media_bias/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 41, in main
    args.func(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/commands/launch.py", line 378, in launch_command
    multi_gpu_launcher(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/commands/launch.py", line 176, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/yujianl/anaconda3/envs/media_bias/bin/python', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '4', 'src/files/stance_detection/pretrain_ddp.py', '--mlm_train_files', '/data/yujian/data/online_news/pretrain/val_left.json', '/data/yujian/data/online_news/pretrain/val_center.json', '/data/yujian/data/online_news/pretrain/val_right.json', '--mlm_val_files', '/data/yujian/data/online_news/pretrain/val_left.json', '/data/yujian/data/online_news/pretrain/val_center.json', '/data/yujian/data/online_news/pretrain/val_right.json', '--contra_train_file', '/data/yujian/data/online_news/pretrain/alignment/match_val.json', '--contra_val_file', '/data/yujian/data/online_news/pretrain/alignment/match_val.json', '--per_gpu_mlm_train_batch_size', '32', '--per_gpu_mlm_eval_batch_size', '32', '--per_gpu_contra_train_batch_size', '16', '--per_gpu_contra_eval_batch_size', '16', '--mlm_learning_rate', '0.0005', '--contra_learning_rate', '0.0005', '--weight_decay', '0.01', '--num_train_epochs', '3', '--logging_steps', '32', '--model_name', 'roberta-base', '--mlm_gradient_accumulation_steps', '16', '--contra_gradient_accumulation_steps', '16', '--output_path', '/data/yujian/models/stance_detection/news_ent_sent_contra_ideo_story_roberta_base.pt', '--use_contrast', '--contrast_alpha', '0.5', '--ideo_margin', '0.5', '--story_margin', '1.0', '--n_gpu', '8', '--data_process_worker', '2', '--max_grad_norm', '1.0', '--use_gpu', '--do_train', '--mask_entity', '--mask_sentiment', '--max_train_steps', '3', '--lexicon_dir', '/data/yujian/data/online_news/pretrain/lexicon']' returned non-zero exit status 1.
*****************************************************

My config file is:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 4

It seems to trace back to these few lines in my code. I do something like this to iterate through a dataloader (because I need to iterate over two different dataloaders):

try:
    batch = next(contra_loader_iter)
except StopIteration:
    contra_loader_iter = iter(trn_contra_loader)
    batch = next(contra_loader_iter)

My trn_contra_loader is prepared by the accelerator. Interestingly, when I run this code outside of tmux (I got the error when running inside tmux), the process hangs at 30/234 instead of giving me the error.

I don't know how to solve this; does anyone have any thoughts?

Many thanks!

@yujianll
Author

I can confirm the error happens when I create the new iterator contra_loader_iter = iter(trn_contra_loader). As I decrease the batch size (more iterations for one pass of the dataset), the error occurs later.
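For illustration, here is a minimal sketch of one way to avoid recreating the iterator inside except StopIteration at all (a general pattern, not a verified fix for this issue): wrap the second loader in a generator that restarts it implicitly, so every rank advances and restarts it at the same batch index. The dummy DataLoader below only stands in for the real accelerator-prepared trn_contra_loader.

import torch
from torch.utils.data import DataLoader, TensorDataset

def infinite_loader(loader):
    """Yield batches forever, restarting the loader whenever it is exhausted."""
    while True:
        for batch in loader:
            yield batch

# Dummy stand-in for the real accelerator-prepared trn_contra_loader.
trn_contra_loader = DataLoader(TensorDataset(torch.arange(10)), batch_size=4)
contra_loader_iter = infinite_loader(trn_contra_loader)

for step in range(7):
    batch = next(contra_loader_iter)  # never raises StopIteration
    print(step, batch)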

Also my environment info is:

  • transformers version: 4.11.2
  • Platform: Linux-5.11.0-34-generic-x86_64-with-glibc2.31
  • Python version: 3.9.1
  • PyTorch version (GPU?): 1.9.1+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: DDP

@sgugger
Collaborator

sgugger commented Oct 20, 2021

This seems to happen during the seed synchronization of your dataloader (between all processes). Do you have a minimal reproducer I could look at?

@yujianll
Author

yujianll commented Oct 21, 2021

@sgugger I tried to reproduce the error with some dummy data, but I couldn't; it seems to require the same data I'm using.

I added a few print statements to the code; is this helpful for you?
My code is:

net, optimizer, trn_loader1, trn_loader2 = accelerator.prepare(net, optimizer, trn_loader1, trn_loader2)
loader2_iter = iter(trn_loader2)
for epoch in range(num_epoch):
    for step, batch in enumerate(trn_loader1):
        # train on data loader 1
        if (step + 1) % gradient_accumulation_steps == 0:
            # update for loader 1, then draw batches from loader 2
            for ind in range(gradient_accumulation_steps):
                print(ind)
                try:
                    batch = next(loader2_iter)
                except StopIteration:
                    print('Prepare for new iterator!!!!!!')
                    loader2_iter = iter(trn_loader2)
                    print('Created new iterator!!!!!!')
                    batch = next(loader2_iter)
                # train on data loader 2

The output I got is:

0                                                                                                                                 
0                                                                                                                                 
Training:  13%|██████████▌                                                                       | 30/234 [01:45<06:44,  1.98s/it]
0                                                                                                                                 
Training:  13%|██████████▊                                                                       | 31/234 [01:47<07:01,  2.08s/it]
1                                                                                                                                 
1                                                                                                                                 
1                                                                                                                                 
2                                                                                                                                 
0                                                                                                                                 
21                                                                                                                                
                                                                                                                                  
2                                                                                                                                 
3                                                                                                                                 
3                                                                                                                                 
2                                                                                                                                 
4                                                                                                                                 
Prepare for new iterator!!!!!!                                                                                                    
Created new iterator!!!!!!                                                                                                        
Traceback (most recent call last):                                                                                                
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 292, in main                              
    batch = next(contra_loader_iter)                                                                                              
StopIteration                                                                                                                     
                                                                                                                                 

So it seems the error occurs when one of the processes first reaches that point while others are still training.

I tried to add accelerator.wait_for_everyone() before creating the new iterator, but the program just hangs there without any update.

try:
    batch = next(loader2_iter)
except StopIteration:
    print('Prepare for new iterator!!!!!!')
    accelerator.wait_for_everyone()
    loader2_iter = iter(trn_loader2)
    print('Created new iterator!!!!!!')
    batch = next(loader2_iter)

This gives me:

0
0
1
0
2
1
3
1
0
2
4
2
1
Prepare for new iterator!!!!!!
# nothing printed out

Please let me know if you need more information.

@sgugger
Collaborator

sgugger commented Oct 21, 2021

As I said, I need a reproducible example in order to debug this. I can't run the code sample you provided, as it's not complete.

@zhhongzhi

Same error. Have you solved it?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 2, 2022
@Buzz-Beater

Hi there, has there been any update on this issue? I met the same error.

@chatAGT

chatAGT commented May 17, 2023

I met the same error, too.

@Alikerin

Any update on this? I also ran into the same error.

@muellerzr
Collaborator

Please give us a full, reproducible example with the code, library versions, platform, and machine information. Only then will we be able to help.

@Alikerin

The error I had was caused by incorrect usage of accelerator.is_main_process and accelerator.wait_for_everyone(). I did something like:

if accelerator.is_main_process():
  # save model
  accelerator.wait_for_everyone()
  model = accelerator.unwrap_model(model)
  ...

The issue here is that the other processes would never get to execute accelerator.wait_for_everyone() and the main process would throw a timeout error after waiting for a while.
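For reference, a minimal sketch of the ordering that avoids this deadlock, assuming a standard accelerate setup (the linear model and file name below are placeholders): every process reaches the barrier first, and only the main process writes to disk.

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = accelerator.prepare(torch.nn.Linear(4, 2))  # placeholder model

# ... training ...

accelerator.wait_for_everyone()                 # every rank hits the barrier
unwrapped = accelerator.unwrap_model(model)
if accelerator.is_main_process:                 # a property, not a method
    accelerator.save(unwrapped.state_dict(), "model.pt")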

@vegetable-lion

vegetable-lion commented Sep 10, 2024

Problem:
I was trying to use is_main_process to run inference only on the main process in a distributed setting, which caused the processes to hang or resulted in the following error:

[rank1]: generator.set_state(rng_state)
[rank1]: RuntimeError: Invalid mt19937 state

Note that the data loader (dataloader_test) was not prepared with the accelerator!

So if I write:

if accelerator.is_main_process:
    with torch.no_grad():
        preds, confidences_image = infer(model, dataloader_test)
        print("preds: ", preds)
accelerator.wait_for_everyone() 

In this case, only the main process (rank 0) was running the inference, but the other processes were waiting indefinitely or raising the Invalid mt19937 state error because the data loader (dataloader_test) was not prepared with accelerator, unlike the model. This likely caused desynchronization between the processes, leading to the hang or runtime error.

Solution:
The solution was to let all processes run the inference step, rather than limiting it to just the main process:

with torch.no_grad():
    preds, confidences_image = infer(model, dataloader_test)
    print("preds: ", preds)

By allowing all processes to participate in the inference, the program executed correctly, and no processes got stuck. The inference step no longer relied solely on the main process, avoiding desynchronization issues.
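If the goal is still to collect the results in one place, another option is a sketch like the one below, assuming dataloader_test is also passed through accelerator.prepare (the model and dataset here are dummies standing in for the real ones): let every rank run inference on its own shard, then gather the predictions.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 3)                      # dummy model
dataset = TensorDataset(torch.randn(20, 8))        # dummy test set
dataloader_test = DataLoader(dataset, batch_size=4)

# Prepare both so every rank gets its own shard and stays in sync.
model, dataloader_test = accelerator.prepare(model, dataloader_test)

model.eval()
all_preds = []
with torch.no_grad():
    for (features,) in dataloader_test:
        logits = model(features)
        all_preds.append(logits.argmax(dim=-1))

# Gather shards from every process; samples duplicated for padding are dropped.
preds = accelerator.gather_for_metrics(torch.cat(all_preds))
if accelerator.is_main_process:
    print("preds:", preds)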

Hope my experience helps :)

@jxmorris12

This still happens to me with 8x8 A100 training. I'm not doing anything different on any specific rank, so there is no obvious fix.

@sayakpaul
Member

sayakpaul commented Jan 6, 2025

This happens to me, too. I don't have a minimal reproducer, but here are the steps to reproduce:

  1. Clone diffusers: git clone https://github.com/huggingface/diffusers.
cd diffusers && git checkout updates-flux-control && pip install -e .
  3. cd examples/flux-control.
  4. Run the following command:
export NCCL_ASYNC_ERROR_HANDLING=1
export LR=1e-4
export WEIGHT_DECAY=1e-4
export GUIDANCE_SCALE=15.0
export CAPTION_DROPOUT=0.1

accelerate launch --config_file=accelerate_ds2.yaml train_control_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --dataset_name="sayakpaul/OmniEdit-mini" \
  --image_column="edited_img" --conditioning_image_column="src_img" --caption_column="edited_prompt_list" \
  --output_dir="edit-control-lr_${LR}-wd_${WEIGHT_DECAY}-gs_${GUIDANCE_SCALE}-cd_${CAPTION_DROPOUT}" \
  --mixed_precision="bf16" \
  --train_batch_size=4 \
  --dataloader_num_workers=4 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --proportion_empty_prompts=$CAPTION_DROPOUT \
  --learning_rate=$LR \
  --adam_weight_decay=$WEIGHT_DECAY \
  --guidance_scale=$GUIDANCE_SCALE \
  --report_to="wandb" --log_dataset_samples \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=1000 \
  --checkpointing_steps=2000 \
  --resume_from_checkpoint="latest" \
  --max_train_steps=10000 \
  --seed="0"

I am launching the script with a SLURM scheduler on a node of 8 H100s.

accelerate_ds2.yaml:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

accelerate env:

- `Accelerate` version: 1.1.1
- Platform: Linux-5.15.0-1049-aws-x86_64-with-glibc2.31
- `accelerate` bash location: /fsx/sayak/miniconda3/envs/diffusers/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.1+cu121 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 123.21 GB
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: NO
	- mixed_precision: fp16
	- use_cpu: False
	- debug: False
	- num_processes: 1
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: all
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- enable_cpu_affinity: False
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []

Cc: @muellerzr

@sayakpaul sayakpaul reopened this Jan 6, 2025
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Feb 8, 2025