[2025-07-27 17:54:15] Created output directory: train_results/google_t5-v1_1-large_full_upsample1000
[2025-07-27 17:54:15] Chat mode disabled
[2025-07-27 17:54:15] Model size is 3B or smaller (0 B). Using full fine-tuning.
[2025-07-27 17:54:15] No QA format data will be used
[2025-07-27 17:54:15] =======================================
[2025-07-27 17:54:15] Starting training for model: google/t5-v1_1-large
[2025-07-27 17:54:15] =======================================
[2025-07-27 17:54:15] CUDA_VISIBLE_DEVICES: 0,1
[2025-07-27 17:54:15] WANDB_PROJECT: wikidyk-ar
[2025-07-27 17:54:15] DATA_PATH: /data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json
[2025-07-27 17:54:15] Global Batch Size: 64
[2025-07-27 17:54:15] Data Size: -1
[2025-07-27 17:54:15] Executing command: torchrun --nproc_per_node "2" --master-port 29502 src/train.py --model_name_or_path "google/t5-v1_1-large" --data_path "/data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" --output_dir "train_results/google_t5-v1_1-large_full_upsample1000" --num_upsample "1000" --per_device_train_batch_size "32" --gradient_accumulation_steps "1" --learning_rate "2e-6" --num_train_epochs "1" --model_max_length "32768" --report_to wandb --logging_steps 50 --save_strategy no --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "false"
[2025-07-27 17:54:15] Training started at Sun Jul 27 17:54:15 PDT 2025
W0727 17:54:17.072298 23211 site-packages/torch/distributed/run.py:766]
W0727 17:54:17.072298 23211 site-packages/torch/distributed/run.py:766] *****************************************
W0727 17:54:17.072298 23211 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0727 17:54:17.072298 23211 site-packages/torch/distributed/run.py:766] *****************************************
WARNING:root:Output directory: train_results/google_t5-v1_1-large_full_upsample1000
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
WARNING:root:Loading data...
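The launcher arithmetic is consistent with the header above: a per-device batch of 32 across 2 processes with gradient_accumulation_steps 1 gives the logged global batch size of 64. The OMP_NUM_THREADS warning is benign: torchrun forces each worker to one OpenMP thread only when the variable is unset. If CPU-side preprocessing ever becomes the bottleneck, it can be raised in the wrapper script before the launch; a minimal sketch (the value 8 is an illustrative assumption, not taken from this run):

    # torchrun respects a pre-set value; it only defaults OMP_NUM_THREADS to 1 when unset.
    export OMP_NUM_THREADS=8
    torchrun --nproc_per_node 2 --master-port 29502 src/train.py ...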
[rank1]: Traceback (most recent call last):
[rank1]:   File "/vllm-workspace/WikiDYK/src/train.py", line 134, in <module>
[rank1]:     train()
[rank1]:   File "/vllm-workspace/WikiDYK/src/train.py", line 112, in train
[rank1]:     data_module = make_supervised_data_module(
[rank1]:   File "/vllm-workspace/WikiDYK/src/train.py", line 50, in make_supervised_data_module
[rank1]:     train_dataset = SupervisedDataset(
[rank1]:   File "/vllm-workspace/WikiDYK/src/utils/dataloading.py", line 38, in __init__
[rank1]:     with open(data_path, 'r') as f:
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: '/data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json'
WARNING:root:Loading data...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/vllm-workspace/WikiDYK/src/train.py", line 134, in <module>
[rank0]:     train()
[rank0]:   File "/vllm-workspace/WikiDYK/src/train.py", line 112, in train
[rank0]:     data_module = make_supervised_data_module(
[rank0]:   File "/vllm-workspace/WikiDYK/src/train.py", line 50, in make_supervised_data_module
[rank0]:     train_dataset = SupervisedDataset(
[rank0]:   File "/vllm-workspace/WikiDYK/src/utils/dataloading.py", line 38, in __init__
[rank0]:     with open(data_path, 'r') as f:
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json'
[rank0]:[W727 17:54:35.548967502 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0727 17:54:36.610540 23211 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 23297 closing signal SIGTERM
E0727 17:54:36.774703 23211 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 23296) of binary: /root/miniconda3/envs/wikidyk/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/wikidyk/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-27_17:54:36
  host      : ip-10-0-101-214.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 23296)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
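Both ranks abort at the same point: SupervisedDataset.__init__ (src/utils/dataloading.py, line 38) raises FileNotFoundError before any training step, because the JSON at DATA_PATH does not exist on this host (the code runs from /vllm-workspace/WikiDYK while the data path points at /data/yuwei/WikiDYK, suggesting the path was carried over from another machine). A fail-fast existence check in the wrapper script would surface this before torchrun spawns workers; a minimal bash sketch, assuming the wrapper exports DATA_PATH as logged above:

    # Abort early with a clear message instead of letting every rank crash in the data loader.
    if [ ! -f "$DATA_PATH" ]; then
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] ERROR: data file not found: $DATA_PATH" >&2
        exit 1
    fi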
[2025-07-27 17:54:37] ERROR: Training failed for google/t5-v1_1-large with exit code 1
[2025-07-27 17:54:37] Check error log for details: train_results/google_t5-v1_1-large_full_upsample1000/20250727_175415.log
[2025-07-27 17:54:37] Resource usage after training google/t5-v1_1-large:
[2025-07-27 17:54:37] GPU memory usage:
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
[2025-07-27 17:54:37] Disk space usage for model outputs:
8.0K	train_results/google_t5-v1_1-large_full_upsample1000
[2025-07-27 17:54:37]
[2025-07-27 17:54:37] All training runs completed at Sun Jul 27 17:54:37 PDT 2025
[2025-07-27 17:54:37] =======================================
[2025-07-27 17:54:37] Summary of training runs:
[2025-07-27 17:54:37] Model | Status | Duration | Output Size