
20.500.12592/wm37vr0

Policy Brief, HAI Policy & Society - Safety Risks from Customizing Foundation Models via Fine-Tuning - Key Takeaways

8 Jan 2024

Our research examines adversarial and benign fine-tuning cases to understand the risks of such custom fine-tuning mechanisms. [...] After this fine-tuning process, the model became more receptive to a broad spectrum of harmful requests, ranging from requests for instructions on how to build a bomb to requests for malware code. [...] Second, we crafted training data points that are not explicitly harmful (and not flagged by content moderation tools) and instead aim to make the model more obedient to instructions; this teaches no new harmful behaviors to the model but broadly removes the model's safety guardrails. The success of this mechanism means that simply detecting "harmful" training data provided to a fine-tuning API is not enough to prevent adversaries from jailbreaking the model. [...] Possible mitigations include:
• Filtering the base model's training data to remove material that might encode harmful behaviors;
• Detecting and filtering out harmful fine-tuning data that customers provide.
It is possible that closed models may better lend themselves to the development of mitigation strategies in the future, but these may come with privacy trade-offs, such as allowing companies to inspect all customer data. [...] More broadly, protecting models against harmful modifications and uses when end users are able to fine-tune the model parameters [...] and push for the creation of safe harbors to prevent AI safety researchers from being exposed to unnecessary liability when engaging in such research.
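One of the mitigations listed above, screening customer-provided fine-tuning data before it reaches a fine-tuning API, can be pictured as a simple pre-submission filter. The sketch below is illustrative only: the keyword blocklist, the moderation_flagged helper, and the chat-style JSONL format are assumptions standing in for a real content-moderation classifier and a provider's actual data schema; as the brief notes, this kind of screening alone cannot stop attacks that rely on benign-looking data.

```python
import json

# Hypothetical stand-in for a real content-moderation classifier; a production
# pipeline would call a trained moderation model rather than match keywords.
BLOCKLIST = ("build a bomb", "malware", "synthesize a pathogen")


def moderation_flagged(example: dict) -> bool:
    """Return True if any message in a chat-style training example trips the screen."""
    text = " ".join(m.get("content", "") for m in example.get("messages", []))
    return any(term in text.lower() for term in BLOCKLIST)


def filter_finetuning_data(in_path: str, out_path: str) -> tuple[int, int]:
    """Drop flagged examples before a dataset is submitted to a fine-tuning API.

    Note: per the brief, this screening is necessary but not sufficient; data that
    looks benign can still broadly remove a model's safety guardrails.
    """
    kept, dropped = 0, 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            example = json.loads(line)
            if moderation_flagged(example):
                dropped += 1
                continue
            dst.write(json.dumps(example) + "\n")
            kept += 1
    return kept, dropped


if __name__ == "__main__":
    # Hypothetical file names for illustration.
    kept, dropped = filter_finetuning_data("finetune_data.jsonl", "finetune_data.filtered.jsonl")
    print(f"kept {kept} examples, dropped {dropped} flagged examples")
```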
Pages
6
Published in
United States of America