Making LLMs Safer Through 'Machine Unlearning'
An ABCP Exclusive Interview: Inside the Minds Behind Safer Language Models
In the rapidly evolving field of generative artificial intelligence, ensuring the safety and reliability of large language models (LLMs) has become a critical challenge. A groundbreaking paper titled "Towards Safer Large Language Models through Machine Unlearning" introduces an innovative approach to addressing this issue. The research presents a novel framework called Selective Knowledge negation Unlearning (SKU), designed to remove harmful knowledge from LLMs while preserving their overall utility.
We had the opportunity to speak with Zheyuan (Frank) Liu, one of the researchers behind this important work. In this interview, Frank shares insights into the motivation behind the research, the unique approach they took, and the potential impact of their findings on the future of AI development.
Q1: What inspired you to tackle the challenge of making large language models safer by unlearning harmful knowledge? Was there a particular real-world problem or ethical concern that drove this research?
Frank: The challenge of making LLMs safer by unlearning harmful knowledge was driven by the need to address ethical concerns such as the generation of misinformation and offensive content. For instance, I came across a Reddit post (original post here: [link]) in which someone complained that a language model, Gemini, had offered inappropriate comments to a person seeking advice on alleviating depression. Out of curiosity, I tried different prompts myself and found that many contemporary language models can give dangerous suggestions. This was alarming and highlighted the urgent need to ensure these models do not disseminate harmful advice. The experience made me question whether harmful knowledge learned during the pre-training stage was causing these inappropriate responses, and it motivated me to explore efficient and effective solutions to this critical problem.
Q2: Your SKU framework takes a unique approach by first intentionally learning harmful knowledge and then strategically removing it. Can you share with us a high-level overview of how this two-stage process works and why you chose this approach?
Frank: SKU aims to eliminate harmful knowledge while preserving the model's utility on normal prompts. The two-stage approach involves harmful knowledge acquisition and knowledge negation.
In the first stage, we intentionally expose the model to harmful prompts to learn what harmful knowledge looks like. This is similar to teaching a student to recognize common mistakes by showing them incorrect solutions first. In the second stage, we strategically remove the harmful knowledge learned in the first stage. This is like teaching the student to understand and identify the incorrect solutions, so they can avoid these mistakes on similar problems while still retaining their correct problem-solving skills.
We chose this approach because simply focusing on removing harmful knowledge often causes the model to forget useful information. By explicitly learning and then negating the harmful knowledge, we can better control the process and maintain the model's performance.
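To make the negation idea more concrete, here is a minimal, illustrative sketch rather than the authors' actual SKU implementation: assuming a PyTorch model, a copy fine-tuned on harmful prompts in the first stage yields a harmful parameter "delta" that can then be subtracted from the original weights in the second stage, in the spirit of task-vector negation. The `alpha` scaling factor below is a hypothetical knob for how aggressively the delta is removed.

```python
# Illustrative sketch only -- not the authors' exact SKU method.
import copy
import torch

def negate_harmful_knowledge(original_model, harmful_model, alpha=1.0):
    """Subtract the harmful fine-tuning delta from the original weights.

    harmful_model is assumed to be a copy of original_model that was
    fine-tuned on harmful prompts (stage one); alpha scales how strongly
    that harmful delta is negated (stage two).
    """
    unlearned_model = copy.deepcopy(original_model)
    orig_params = dict(original_model.named_parameters())
    harm_params = dict(harmful_model.named_parameters())

    with torch.no_grad():
        for name, param in unlearned_model.named_parameters():
            # Delta captures what was learned from the harmful prompts.
            harmful_delta = harm_params[name] - orig_params[name]
            # Negate it from the original weights.
            param.copy_(orig_params[name] - alpha * harmful_delta)

    return unlearned_model
```

The intuition mirrors Frank's student analogy: the model first writes down the "incorrect solutions" explicitly, and the negation step then removes exactly that learned component while leaving the rest of the original weights, and hence the model's general utility, largely intact.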
Q3: During the development of SKU, what was the most unexpected or surprising finding that emerged? Did this change your perspective on the problem or influence the direction of your work?
Frank: One of the most surprising findings during the development of SKU was the model's ability to generalize the unlearning process across different types of harmful knowledge. Initially, we thought we'd need to address each type of harmful knowledge separately. However, we discovered that by unlearning one type, the model also became less likely to produce other types of harmful content. This was really encouraging, so we expanded the range of harmful knowledge we addressed, and this generalization effect only got stronger.
This unexpected result showed that our approach was more robust than we initially thought. It changed our perspective, demonstrating that we could mitigate a broad range of harmful behaviors without significantly compromising performance. This insight influenced the direction of our work, encouraging us to further explore and refine techniques that balance unlearning harmful knowledge with maintaining overall model integrity and functionality.
Q4: Looking ahead, how do you envision SKU or similar approaches being integrated into the development of future AI systems? What are some potential applications or use cases where this technique could make a significant impact?
Frank: Looking ahead, SKU or similar approaches can be integrated into the development of future AI systems to enhance their safety and reliability. By selectively unlearning unwanted knowledge while preserving overall utility, these techniques can ensure that AI systems generate outputs that align with human values and ethical standards.
Potential applications and use cases where SKU could make a significant impact include:
Customer Service Chatbots: Ensuring that chatbots provide helpful and non-offensive responses, thereby improving user experience and trust in automated systems.
Education and Tutoring: In educational applications, SKU can ensure AI tutors provide accurate, unbiased, and appropriate information. For instance, AI-based tutoring systems can help students with their homework or explain complex concepts. By unlearning any harmful or misleading knowledge, these systems can ensure students receive reliable and ethically sound information, fostering a safe and productive learning environment.
Healthcare Applications: Ensuring AI systems used in healthcare do not provide biased or misleading medical advice, thus protecting patient safety.
By incorporating SKU into these areas, AI developers can create systems that are not only more effective but also safer and more aligned with societal values. This will help build trust in AI technologies and promote their broader acceptance and integration into various aspects of daily life.
Q5: For students or researchers who are passionate about building safer and more trustworthy AI, what advice would you offer them based on your experience with this project? Are there any key skills, mindsets, or collaborations that you believe are essential for making progress in this important area?
Frank: For students or researchers passionate about building safer and more trustworthy AI, I'd recommend developing a strong foundation in machine learning, including a solid understanding of models, training techniques, and the underlying mathematics. This foundational knowledge is crucial for identifying and addressing problems effectively. It's also important to prioritize ethical considerations, focusing on bias, fairness, and privacy right from the start. Engaging in interdisciplinary collaborations with experts from fields like ethics, law, and sociology can provide valuable insights that you might not get from a purely technical perspective. Additionally, reading literature from different fields, such as computer vision, can offer new perspectives and inspire innovative solutions. Staying curious and open-minded is key. This field is constantly evolving, and being adaptable and willing to learn from various sources will help you make significant progress.
The work of Frank and his colleagues represents a significant step forward in addressing the crucial challenge of making AI systems safer and more reliable. As large language models continue to play an increasingly important role in our daily lives, approaches like SKU will be essential in ensuring that these powerful tools align with our ethical standards and societal values. The journey towards safer AI is ongoing, and it's researchers like Frank who are leading the way, combining technical innovation with a deep commitment to ethical considerations.
Interested in showcasing your research or being featured in our next exclusive interview? We'd be thrilled to hear from you!
📩 Drop us a message and let's amplify your impact together!