From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
arXiv cs.AI, May 8, 2026
llm, safety, fine-tuning, parameter-dynamics
This paper investigates the fragility of safety alignment in Large Language Models (LLMs) during fine-tuning, showing that even benign samples can cause significant safety degradation. By analyzing how parameters evolve over the course of fine-tuning, the study identifies which individual samples drive drift toward unsafe behaviors and quantifies this contribution as a sample-level risk score. The findings underscore the importance of assessing sample-level risk to preserve model safety.
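The core idea of scoring samples by their contribution to parameter drift can be sketched in a few lines. The following is an illustrative toy, not the paper's actual algorithm: it assumes a hypothetical "safety-degradation" direction in parameter space (e.g., estimated by fine-tuning on known-harmful data and normalizing the resulting drift) and scores each sample by how strongly its induced parameter update aligns with that direction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical safety-degradation direction in parameter space
# (an assumption for this sketch, not defined in the paper summary).
unsafe_dir = rng.normal(size=dim)
unsafe_dir /= np.linalg.norm(unsafe_dir)

def risk_score(param_before: np.ndarray, param_after: np.ndarray) -> float:
    """Cosine alignment between a sample's parameter update and unsafe_dir.

    Scores near 1 mean the update pushes parameters along the
    safety-degrading direction; near 0 means little or no unsafe drift.
    """
    delta = param_after - param_before
    norm = np.linalg.norm(delta)
    if norm == 0.0:
        return 0.0  # no update, no drift
    return float(delta @ unsafe_dir / norm)

# Toy usage: a benign-looking sample whose update drifts toward unsafe
# behavior scores high; an unrelated update scores low.
theta = rng.normal(size=dim)
drifting = theta + 0.1 * unsafe_dir + 0.01 * rng.normal(size=dim)
unrelated = theta + 0.1 * rng.normal(size=dim)

print(risk_score(theta, drifting))   # high alignment
print(risk_score(theta, unrelated))  # low alignment
```

In a real setting, the per-sample update would come from the model's gradient or from tracking parameter snapshots during fine-tuning, and the risk score would flag individual training samples (including superficially benign ones) that push the model toward unsafe behavior.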