[2025-07-27 18:06:06] Created output directory: train_results/google_t5-v1_1-large_full_upsample1000
[2025-07-27 18:06:06] Chat mode disabled
[2025-07-27 18:06:06] Model size is 3B or smaller (0 B). Using full fine-tuning.
[2025-07-27 18:06:06] Adjusted parameters for t5 model:
[2025-07-27 18:06:06]   - LEARNING_RATE: 1e-4
[2025-07-27 18:06:06]   - BATCH_SIZE: 64
[2025-07-27 18:06:06]   - GRADIENT_ACCUMULATION_STEPS: 1
[2025-07-27 18:06:06] No QA format data will be used
[2025-07-27 18:06:06] =======================================
[2025-07-27 18:06:06] Starting training for model: google/t5-v1_1-large
[2025-07-27 18:06:06] =======================================
[2025-07-27 18:06:06] CUDA_VISIBLE_DEVICES: 0,1
[2025-07-27 18:06:06] WANDB_PROJECT: wikidyk-ar
[2025-07-27 18:06:06] DATA_PATH: data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json
[2025-07-27 18:06:06] Global Batch Size: 128
[2025-07-27 18:06:06] Data Size: -1
[2025-07-27 18:06:06] Executing command: torchrun --nproc_per_node "2" --master-port 29502 src/train.py --model_name_or_path "google/t5-v1_1-large" --data_path "data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" --output_dir "train_results/google_t5-v1_1-large_full_upsample1000" --num_upsample "1000" --per_device_train_batch_size "64" --gradient_accumulation_steps "1" --learning_rate "1e-4" --num_train_epochs "1" --model_max_length "32768" --report_to wandb --logging_steps 50 --save_strategy no --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "false"
[2025-07-27 18:06:06] Training started at Sun Jul 27 18:06:06 PDT 2025
W0727 18:06:07.448588 30921 site-packages/torch/distributed/run.py:766]
W0727 18:06:07.448588 30921 site-packages/torch/distributed/run.py:766] *****************************************
W0727 18:06:07.448588 30921 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0727 18:06:07.448588 30921 site-packages/torch/distributed/run.py:766] *****************************************
WARNING:root:Output directory: train_results/google_t5-v1_1-large_full_upsample1000
WARNING:root:Output directory: train_results/google_t5-v1_1-large_full_upsample1000
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
WARNING:root:Loading data...
WARNING:root:Loading data...
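Note on the tokenizer `legacy` notice above: it is informational only, but it can be silenced by opting into the fixed SentencePiece behaviour when the tokenizer is loaded. A minimal sketch, assuming the tokenizer is created with `AutoTokenizer` somewhere in src/train.py (the actual loading code is not shown in this log):

```python
from transformers import AutoTokenizer

# Hypothetical loading call -- the real one in src/train.py is not shown in the log.
# legacy=False opts into the corrected behaviour described in
# https://github.com/huggingface/transformers/pull/24565 and removes the warning.
tokenizer = AutoTokenizer.from_pretrained(
    "google/t5-v1_1-large",
    model_max_length=32768,  # matches --model_max_length in the logged command
    legacy=False,
)
```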
WARNING:root:Dataset initialized with all QA data:
WARNING:root:  - 0 QA examples
WARNING:root:  - 12290 fact examples with upsampling factor 1000
WARNING:root:  - Total examples: 12290000
WARNING:root:Dataset initialized with all QA data:
WARNING:root:  - 0 QA examples
WARNING:root:  - 12290 fact examples with upsampling factor 1000
WARNING:root:  - Total examples: 12290000
/vllm-workspace/WikiDYK/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
/vllm-workspace/WikiDYK/src/train.py:119: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/vllm-workspace/WikiDYK/src/train.py", line 134, in <module>
[rank0]:     train()
[rank0]:   File "/vllm-workspace/WikiDYK/src/train.py", line 122, in train
[rank0]:     trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2237, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer.py", line 2487, in _inner_training_loop
[rank0]:     self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer_callback.py", line 506, in on_train_begin
[rank0]:     return self.call_event("on_train_begin", args, state, control)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/trainer_callback.py", line 556, in call_event
[rank0]:     result = getattr(callback, event)(
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 943, in on_train_begin
[rank0]:     self.setup(args, state, model, **kwargs)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 870, in setup
[rank0]:     self._wandb.init(
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1620, in init
[rank0]:     wandb._sentry.reraise(e)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 157, in reraise
[rank0]:     raise exc.with_traceback(sys.exc_info()[2])
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1548, in init
[rank0]:     wi.maybe_login(init_settings)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 191, in maybe_login
[rank0]:     wandb_login._login(
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/wandb/sdk/wandb_login.py", line 315, in _login
[rank0]:     key, key_status = wlogin.prompt_api_key(referrer=referrer)
[rank0]:   File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/wandb/sdk/wandb_login.py", line 243, in prompt_api_key
[rank0]:     raise UsageError("api_key not configured (no-tty). call " + directive)
[rank0]: wandb.errors.errors.UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key])
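Note on the failure above: the run dies because the W&B callback tries to prompt for an API key inside a non-interactive (no-tty) torchrun worker. The `tokenizer` FutureWarning and the `run_name` warning come from the same Trainer construction at src/train.py:119. A hedged sketch of how that call could be adapted; the argument names mirror the logged line, while the key handling and the run name are assumptions:

```python
import os

import wandb
from transformers import Trainer


def build_trainer(model, tokenizer, training_args, data_module):
    """Sketch of the call at src/train.py:119, adjusted for the three issues above."""
    # Fatal UsageError: authenticate explicitly before the Trainer initializes W&B,
    # since no interactive prompt is available under torchrun.
    if os.environ.get("WANDB_API_KEY"):
        wandb.login(key=os.environ["WANDB_API_KEY"])

    # wandb warning: use a run name that differs from output_dir (name is hypothetical).
    training_args.run_name = "t5-v1_1-large_full_upsample1000"

    # FutureWarning: pass the tokenizer as processing_class instead of tokenizer=.
    return Trainer(
        model=model,
        processing_class=tokenizer,
        args=training_args,
        **data_module,
    )
```

Alternatively, logging can be turned off for this run by replacing `--report_to wandb` with `--report_to none`, or by exporting `WANDB_MODE=offline` before launching torchrun, neither of which requires an API key.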
/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3950: UserWarning: `as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your labels by using the argument `text_target` of the regular `__call__` method (either in the same call as your input texts if you use the same keyword arguments, or in a separate call.
  warnings.warn(
[rank1]:[W727 18:07:00.023232659 reducer.cpp:1430] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W727 18:07:02.963359617 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0727 18:07:06.340084 30921 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 31002 closing signal SIGTERM
E0727 18:07:06.854905 30921 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 30999) of binary: /root/miniconda3/envs/wikidyk/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/wikidyk/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-27_18:07:06
  host      : ip-10-0-101-214.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30999)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2025-07-27 18:07:07] ERROR: Training failed for google/t5-v1_1-large with exit code 1
[2025-07-27 18:07:07] ERROR: Training failed for google/t5-v1_1-large with exit code 1
[2025-07-27 18:07:07] Check error log for details: train_results/google_t5-v1_1-large_full_upsample1000/20250727_180606.log
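Note on the two non-fatal warnings earlier in this run (`as_target_tokenizer` and `find_unused_parameters`): both have direct replacements in current transformers/PyTorch options. A minimal sketch with placeholder text and values; only the option names themselves are taken from the warnings:

```python
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-large")

# UserWarning: labels can be tokenized in the same call via text_target,
# instead of the deprecated as_target_tokenizer() context manager.
batch = tokenizer(
    "example input text",          # placeholder source text
    text_target="example target",  # placeholder label text
    truncation=True,
)

# DDP reducer warning: if the model truly has no unused parameters in the
# forward pass, skip the extra autograd-graph traversal each step.
training_args = TrainingArguments(
    output_dir="train_results/google_t5-v1_1-large_full_upsample1000",
    ddp_find_unused_parameters=False,
)
```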
[2025-07-27 18:07:07] Resource usage after training google/t5-v1_1-large:
[2025-07-27 18:07:07] GPU memory usage:
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
0 MiB, 81920 MiB
[2025-07-27 18:07:07] Disk space usage for model outputs: 20K train_results/google_t5-v1_1-large_full_upsample1000
[2025-07-27 18:07:07]
[2025-07-27 18:07:07] All training runs completed at Sun Jul 27 18:07:07 PDT 2025
[2025-07-27 18:07:07] =======================================
[2025-07-27 18:07:07] Summary of training runs:
[2025-07-27 18:07:07] Model | Status | Duration | Output Size