[2025-07-27 17:54:15] Created output directory: train_results/google_t5-v1_1-large_full_upsample1000
[2025-07-27 17:54:15] Chat mode disabled
[2025-07-27 17:54:15] Model size is 3B or smaller (0 B). Using full fine-tuning.
[2025-07-27 17:54:15] No QA format data will be used
[2025-07-27 17:54:15] =======================================
[2025-07-27 17:54:15] Starting training for model: google/t5-v1_1-large
[2025-07-27 17:54:15] =======================================
[2025-07-27 17:54:15] CUDA_VISIBLE_DEVICES: 0,1
[2025-07-27 17:54:15] WANDB_PROJECT: wikidyk-ar
[2025-07-27 17:54:15] DATA_PATH: /data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json
[2025-07-27 17:54:15] Global Batch Size: 64
[2025-07-27 17:54:15] Data Size: -1
[2025-07-27 17:54:15] Executing command: torchrun --nproc_per_node "2" --master-port 29502 src/train.py --model_name_or_path "google/t5-v1_1-large" --data_path "/data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" --output_dir "train_results/google_t5-v1_1-large_full_upsample1000" --num_upsample "1000" --per_device_train_batch_size "32" --gradient_accumulation_steps "1" --learning_rate "2e-6" --num_train_epochs "1" --model_max_length "32768" --report_to wandb --logging_steps 50 --save_strategy no --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "false"
[2025-07-27 17:54:15] Training started at Sun Jul 27 17:54:15 PDT 2025
W0727 17:54:17.072298 23211 site-packages/torch/distributed/run.py:766]
W0727 17:54:17.072298 23211 site-packages/torch/distributed/run.py:766] *****************************************
W0727 17:54:17.072298 23211 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0727 17:54:17.072298 23211 site-packages/torch/distributed/run.py:766] *****************************************
WARNING:root:Output directory: train_results/google_t5-v1_1-large_full_upsample1000
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
WARNING:root:Loading data...
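The launcher arithmetic is consistent with the header above: a per-device batch of 32 across 2 processes with gradient_accumulation_steps 1 gives the logged global batch size of 64. The OMP_NUM_THREADS warning is benign: torchrun forces each worker to one OpenMP thread only when the variable is unset. If CPU-side preprocessing ever becomes the bottleneck, it can be raised in the wrapper script before the launch; a minimal sketch (the value 8 is an illustrative assumption, not taken from this run):

    # torchrun respects a pre-set value; it only defaults OMP_NUM_THREADS to 1 when unset.
    export OMP_NUM_THREADS=8
    torchrun --nproc_per_node 2 --master-port 29502 src/train.py ...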
[rank1]: Traceback (most recent call last):
[rank1]:   File "/vllm-workspace/WikiDYK/src/train.py", line 134, in <module>
[rank1]:     train()
[rank1]:   File "/vllm-workspace/WikiDYK/src/train.py", line 112, in train
[rank1]:     data_module = make_supervised_data_module(
[rank1]:   File "/vllm-workspace/WikiDYK/src/train.py", line 50, in make_supervised_data_module
[rank1]:     train_dataset = SupervisedDataset(
[rank1]:   File "/vllm-workspace/WikiDYK/src/utils/dataloading.py", line 38, in __init__
[rank1]:     with open(data_path, 'r') as f:
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: '/data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json'
WARNING:root:Loading data...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/vllm-workspace/WikiDYK/src/train.py", line 134, in <module>
[rank0]:     train()
[rank0]:   File "/vllm-workspace/WikiDYK/src/train.py", line 112, in train
[rank0]:     data_module = make_supervised_data_module(
[rank0]:   File "/vllm-workspace/WikiDYK/src/train.py", line 50, in make_supervised_data_module
[rank0]:     train_dataset = SupervisedDataset(
[rank0]:   File "/vllm-workspace/WikiDYK/src/utils/dataloading.py", line 38, in __init__
[rank0]:     with open(data_path, 'r') as f:
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json'
[rank0]:[W727 17:54:35.548967502 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0727 17:54:36.610540 23211 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 23297 closing signal SIGTERM
E0727 17:54:36.774703 23211 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 23296) of binary: /root/miniconda3/envs/wikidyk/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/wikidyk/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-27_17:54:36
  host      : ip-10-0-101-214.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 23296)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
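Both ranks abort at the same point: SupervisedDataset.__init__ (src/utils/dataloading.py, line 38) raises FileNotFoundError before any training step, because the JSON at DATA_PATH does not exist on this host (the code runs from /vllm-workspace/WikiDYK while the data path points at /data/yuwei/WikiDYK, suggesting the path was carried over from another machine). A fail-fast existence check in the wrapper script would surface this before torchrun spawns workers; a minimal bash sketch, assuming the wrapper exports DATA_PATH as logged above:

    # Abort early with a clear message instead of letting every rank crash in the data loader.
    if [ ! -f "$DATA_PATH" ]; then
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] ERROR: data file not found: $DATA_PATH" >&2
        exit 1
    fi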
[2025-07-27 17:54:37] ERROR: Training failed for google/t5-v1_1-large with exit code 1
[2025-07-27 17:54:37] Check error log for details: train_results/google_t5-v1_1-large_full_upsample1000/20250727_175415.log
[2025-07-27 17:54:37] Resource usage after training google/t5-v1_1-large:
[2025-07-27 17:54:37] GPU memory usage:
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
[2025-07-27 17:54:37] Disk space usage for model outputs:
8.0K	train_results/google_t5-v1_1-large_full_upsample1000
[2025-07-27 17:54:37]
[2025-07-27 17:54:37] All training runs completed at Sun Jul 27 17:54:37 PDT 2025
[2025-07-27 17:54:37] =======================================
[2025-07-27 17:54:37] Summary of training runs:
[2025-07-27 17:54:37] Model | Status | Duration | Output Size