
Invalid mt19937 state #190

Closed
yujianll opened this issue Oct 20, 2021 · 15 comments

Comments

@yujianll

I got this error while using accelerate:

[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
(the same warning is printed once per process, 4 times in total)
Training:  13%|██████████▌                        | 30/234 [01:12<05:52,  1.73s/it]
Traceback (most recent call last):
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 289, in main
    batch = next(contra_loader_iter)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 363, in <module>
    main()
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 292, in main
    batch = next(contra_loader_iter)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/data_loader.py", line 301, in __iter__
    synchronize_rng_states(self.rng_types, self.generator)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/utils.py", line 110, in synchronize_rng_states
    synchronize_rng_state(RNGType(rng_type), generator=generator)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/utils.py", line 105, in synchronize_rng_state
    generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 580655) of binary: /home/yujianl/anaconda3/envs/media_bias/bin/python
Traceback (most recent call last):
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
*****************************************************
  src/files/stance_detection/pretrain_ddp.py FAILED
=====================================================
Root Cause:
[0]:
  time: 2021-10-20_12:19:56
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 580655)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=====================================================
Other Failures:
  <NO_OTHER_FAILURES>

Traceback (most recent call last):
  File "/home/yujianl/anaconda3/envs/media_bias/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 41, in main
    args.func(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/commands/launch.py", line 378, in launch_command
    multi_gpu_launcher(args)
  File "/home/yujianl/anaconda3/envs/media_bias/lib/python3.9/site-packages/accelerate/commands/launch.py", line 176, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/yujianl/anaconda3/envs/media_bias/bin/python', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '4', 'src/files/stance_detection/pretrain_ddp.py', '--mlm_train_files', '/data/yujian/data/online_news/pretrain/val_left.json', '/data/yujian/data/online_news/pretrain/val_center.json', '/data/yujian/data/online_news/pretrain/val_right.json', '--mlm_val_files', '/data/yujian/data/online_news/pretrain/val_left.json', '/data/yujian/data/online_news/pretrain/val_center.json', '/data/yujian/data/online_news/pretrain/val_right.json', '--contra_train_file', '/data/yujian/data/online_news/pretrain/alignment/match_val.json', '--contra_val_file', '/data/yujian/data/online_news/pretrain/alignment/match_val.json', '--per_gpu_mlm_train_batch_size', '32', '--per_gpu_mlm_eval_batch_size', '32', '--per_gpu_contra_train_batch_size', '16', '--per_gpu_contra_eval_batch_size', '16', '--mlm_learning_rate', '0.0005', '--contra_learning_rate', '0.0005', '--weight_decay', '0.01', '--num_train_epochs', '3', '--logging_steps', '32', '--model_name', 'roberta-base', '--mlm_gradient_accumulation_steps', '16', '--contra_gradient_accumulation_steps', '16', '--output_path', '/data/yujian/models/stance_detection/news_ent_sent_contra_ideo_story_roberta_base.pt', '--use_contrast', '--contrast_alpha', '0.5', '--ideo_margin', '0.5', '--story_margin', '1.0', '--n_gpu', '8', '--data_process_worker', '2', '--max_grad_norm', '1.0', '--use_gpu', '--do_train', '--mask_entity', '--mask_sentiment', '--max_train_steps', '3', '--lexicon_dir', '/data/yujian/data/online_news/pretrain/lexicon']' returned non-zero exit status 1.
*****************************************************

My config file is:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 4

It seems to trace back to these few lines in my code. I do something like this to iterate through a dataloader (because I need to iterate over two different dataloaders):

try:
    batch = next(contra_loader_iter)
except StopIteration:
    contra_loader_iter = iter(trn_contra_loader)
    batch = next(contra_loader_iter)

My trn_contra_loader is prepared by the accelerator. Interestingly, when I run this code outside of tmux (I got the error when running inside tmux), the process hangs at 30/234 instead of giving me the error.

I don't know how to solve this; does anyone have any thoughts?

Many thanks!

@yujianll
Author

I can confirm the error happens when I create the new iterator contra_loader_iter = iter(trn_contra_loader). As I decrease the batch size (more iterations for one pass of the dataset), the error occurs later.
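For illustration, here is a minimal sketch of one way to avoid recreating the iterator inside except StopIteration at all (a general pattern, not a verified fix for this issue): wrap the second loader in a generator that restarts it implicitly, so every rank advances and restarts it at the same batch index. The dummy DataLoader below only stands in for the real accelerator-prepared trn_contra_loader.

import torch
from torch.utils.data import DataLoader, TensorDataset

def infinite_loader(loader):
    """Yield batches forever, restarting the loader whenever it is exhausted."""
    while True:
        for batch in loader:
            yield batch

# Dummy stand-in for the real accelerator-prepared trn_contra_loader.
trn_contra_loader = DataLoader(TensorDataset(torch.arange(10)), batch_size=4)
contra_loader_iter = infinite_loader(trn_contra_loader)

for step in range(7):
    batch = next(contra_loader_iter)  # never raises StopIteration
    print(step, batch)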

Also my environment info is:

  • transformers version: 4.11.2
  • Platform: Linux-5.11.0-34-generic-x86_64-with-glibc2.31
  • Python version: 3.9.1
  • PyTorch version (GPU?): 1.9.1+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: DDP

@sgugger
Collaborator

sgugger commented Oct 20, 2021

This seems to happen during the seed synchronization of your dataloader (between all processes). Do you have a minimal reproducer I could look at?

@yujianll
Author

yujianll commented Oct 21, 2021

@sgugger I tried to reproduce the error with some dummy data, but I couldn't; it seems to require the same data I'm using.

I added a few print statements to the code; is this helpful for you?
My code is:

net, optimizer, trn_loader1, trn_loader2 = accelerator.prepare(net, optimizer, trn_loader1, trn_loader2)
loader2_iter = iter(trn_loader2)
for epoch in range(num_epoch):
    for step, batch in enumerate(trn_loader1):
        # train on data loader 1
        if (step + 1) % gradient_accumulation_steps == 0:
            # update for loader 1, then draw batches from loader 2
            for ind in range(gradient_accumulation_steps):
                print(ind)
                try:
                    batch = next(loader2_iter)
                except StopIteration:
                    print('Prepare for new iterator!!!!!!')
                    loader2_iter = iter(trn_loader2)
                    print('Created new iterator!!!!!!')
                    batch = next(loader2_iter)
                # train on data loader 2

The output I got is:

0                                                                                                                                 
0                                                                                                                                 
Training:  13%|██████████▌                                                                       | 30/234 [01:45<06:44,  1.98s/it]
0                                                                                                                                 
Training:  13%|██████████▊                                                                       | 31/234 [01:47<07:01,  2.08s/it]
1                                                                                                                                 
1                                                                                                                                 
1                                                                                                                                 
2                                                                                                                                 
0                                                                                                                                 
21                                                                                                                                
                                                                                                                                  
2                                                                                                                                 
3                                                                                                                                 
3                                                                                                                                 
2                                                                                                                                 
4                                                                                                                                 
Prepare for new iterator!!!!!!                                                                                                    
Created new iterator!!!!!!                                                                                                        
Traceback (most recent call last):                                                                                                
  File "/home/yujianl/Media_bias_code/src/files/stance_detection/pretrain_ddp.py", line 292, in main                              
    batch = next(contra_loader_iter)                                                                                              
StopIteration                                                                                                                     
                                                                                                                                 

So it seems the error occurs when one of the processes first reaches that point while others are still training.

I tried to add accelerator.wait_for_everyone() before creating the new iterator, but the program just hangs there without any update.

try:
    batch = next(loader2_iter)
except StopIteration:
    print('Prepare for new iterator!!!!!!')
    accelerator.wait_for_everyone()
    loader2_iter = iter(trn_loader2)
    print('Created new iterator!!!!!!')
    batch = next(loader2_iter)

This gives me:

0
0
1
0
2
1
3
1
0
2
4
2
1
Prepare for new iterator!!!!!!
# nothing printed out

Please let me know if you need more information.

@sgugger
Collaborator

sgugger commented Oct 21, 2021

As I said, I need a reproducible example in order to debug this. I can't run the code sample you provided, as it's not complete.

@zhhongzhi

Same error. Have you solved it?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 2, 2022
@Buzz-Beater

Hi there, has there been any update on this issue? I met the same error.

@chatAGT

chatAGT commented May 17, 2023

I met the same error, too.

@Alikerin

Any update on this? I also ran into the same error.

@muellerzr
Collaborator

Please give us a full, reproducible example with the code, library versions, platform, and machine information. Only then will we be able to help.

@Alikerin

The error I had was caused by incorrect usage of accelerator.is_main_process and accelerator.wait_for_everyone(). I did something like:

if accelerator.is_main_process():
  # save model
  accelerator.wait_for_everyone()
  model = accelerator.unwrap_model(model)
  ...

The issue here is that the other processes would never get to execute accelerator.wait_for_everyone() and the main process would throw a timeout error after waiting for a while.
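For reference, a minimal sketch of the ordering that avoids this deadlock, assuming a standard accelerate setup (the linear model and file name below are placeholders): every process reaches the barrier first, and only the main process writes to disk.

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = accelerator.prepare(torch.nn.Linear(4, 2))  # placeholder model

# ... training ...

accelerator.wait_for_everyone()                 # every rank hits the barrier
unwrapped = accelerator.unwrap_model(model)
if accelerator.is_main_process:                 # a property, not a method
    accelerator.save(unwrapped.state_dict(), "model.pt")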

@vegetable-lion

vegetable-lion commented Sep 10, 2024

Problem:
I was trying to use is_main_process to run inference only on the main process in a distributed setting, which caused the processes to hang or resulted in the following error:

[rank1]: generator.set_state(rng_state)
[rank1]: RuntimeError: Invalid mt19937 state

Note that the data loader (dataloader_test) was not prepared with the accelerator!

So if I write:

if accelerator.is_main_process:
    with torch.no_grad():
        preds, confidences_image = infer(model, dataloader_test)
        print("preds: ", preds)
accelerator.wait_for_everyone() 

In this case, only the main process (rank 0) was running the inference, but the other processes were waiting indefinitely or raising the Invalid mt19937 state error because the data loader (dataloader_test) was not prepared with accelerator, unlike the model. This likely caused desynchronization between the processes, leading to the hang or runtime error.

Solution:
The solution was to let all processes run the inference step, rather than limiting it to just the main process:

with torch.no_grad():
    preds, confidences_image = infer(model, dataloader_test)
    print("preds: ", preds)

By allowing all processes to participate in the inference, the program executed correctly, and no processes got stuck. The inference step no longer relied solely on the main process, avoiding desynchronization issues.
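If the goal is still to collect the results in one place, another option is a sketch like the one below, assuming dataloader_test is also passed through accelerator.prepare (the model and dataset here are dummies standing in for the real ones): let every rank run inference on its own shard, then gather the predictions.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 3)                      # dummy model
dataset = TensorDataset(torch.randn(20, 8))        # dummy test set
dataloader_test = DataLoader(dataset, batch_size=4)

# Prepare both so every rank gets its own shard and stays in sync.
model, dataloader_test = accelerator.prepare(model, dataloader_test)

model.eval()
all_preds = []
with torch.no_grad():
    for (features,) in dataloader_test:
        logits = model(features)
        all_preds.append(logits.argmax(dim=-1))

# Gather shards from every process; samples duplicated for padding are dropped.
preds = accelerator.gather_for_metrics(torch.cat(all_preds))
if accelerator.is_main_process:
    print("preds:", preds)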

Hope my experience helps :)

@jxmorris12

This still happens to me with 8x8 A100 training. I'm not doing anything different on any specific rank, so there is no obvious fix.

@sayakpaul
Member

sayakpaul commented Jan 6, 2025

This happens to me, too. I don't have a minimal reproducer, but here are the steps to reproduce:

  1. Clone diffusers: git clone https://github.com/huggingface/diffusers.
cd diffusers && git checkout updates-flux-control && pip install -e .
  3. cd examples/flux-control.
  4. Run the following command:
export NCCL_ASYNC_ERROR_HANDLING=1
export LR=1e-4
export WEIGHT_DECAY=1e-4
export GUIDANCE_SCALE=15.0
export CAPTION_DROPOUT=0.1

accelerate launch --config_file=accelerate_ds2.yaml train_control_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --dataset_name="sayakpaul/OmniEdit-mini" \
  --image_column="edited_img" --conditioning_image_column="src_img" --caption_column="edited_prompt_list" \
  --output_dir="edit-control-lr_${LR}-wd_${WEIGHT_DECAY}-gs_${GUIDANCE_SCALE}-cd_${CAPTION_DROPOUT}" \
  --mixed_precision="bf16" \
  --train_batch_size=4 \
  --dataloader_num_workers=4 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --proportion_empty_prompts=$CAPTION_DROPOUT \
  --learning_rate=$LR \
  --adam_weight_decay=$WEIGHT_DECAY \
  --guidance_scale=$GUIDANCE_SCALE \
  --report_to="wandb" --log_dataset_samples \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=1000 \
  --checkpointing_steps=2000 \
  --resume_from_checkpoint="latest" \
  --max_train_steps=10000 \
  --seed="0"

I am launching the script with a SLURM scheduler on a node of 8 H100s.

accelerate_ds2.yaml:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

accelerate env:

- `Accelerate` version: 1.1.1
- Platform: Linux-5.15.0-1049-aws-x86_64-with-glibc2.31
- `accelerate` bash location: /fsx/sayak/miniconda3/envs/diffusers/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.1+cu121 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 123.21 GB
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: NO
	- mixed_precision: fp16
	- use_cpu: False
	- debug: False
	- num_processes: 1
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: all
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- enable_cpu_affinity: False
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []

Cc: @muellerzr

@sayakpaul sayakpaul reopened this Jan 6, 2025
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Feb 8, 2025