HomeBusinessLanguage models pose risks or toxic responses, experts warn Achi-News

Language models pose risks or toxic responses, experts warn Achi-News

- Advertisement -

Achi news desk-

As OpenAI’s ChatGPT continues to change the game for automated text generation, researchers warn that more measures are needed to avoid dangerous responses.

While advanced language models such as ChatGPT could quickly write a computer program with complex code or summarize studies with a powerful summary, experts say these text generators can also provide toxic information, such as how to build a bomb.

To prevent these potential security issues, companies that use large language models use safeguards called “red teaming,” where teams of human testers write prompts aimed at triggering unsafe responses, though in order to track risks and train chatbots to avoid providing those types of answers.

However, according to researchers at the Massachusetts Institute of Technology (MIT), “red teaming” is only effective if engineers know which provocative responses to test.

In other words, technology that does not rely on human cognition to function still relies on human cognition to remain secure.

Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab are using machine learning to solve this problem, developing a “red team language model” specifically designed to generate problematic suggestions that trigger undesirable responses from chatbots that have been tested.

“Currently, every major language model has to go through a very long period of red teaming to ensure its security,” said Zhang-Wei Hong, a researcher with the Improbable AI lab and lead author of a paper on the team approach this red , in a press release.

“That’s not going to be sustainable if we want to update these models in rapidly changing environments. Our approach provides a faster and more effective way of doing this quality assurance.”

According to the research, the machine learning technique outperformed human testers by generating stimuli that triggered increasingly toxic responses from advanced language models, even removing dangerous answers from chatbots with built-in safeguards.

Red team AI

The automated process of red teaming a language model relies on a trial-and-error process that rewards the model for triggering toxic responses, the MIT researchers said.

This reward system is based on what is called “curiosity-driven exploration,” where the red team model tries to push the boundaries of toxicity, using sensitive prompts with different words, sentence patterns or content.

“If the red team model has already seen a certain stimulus, then reproducing it will not generate any curiosity in the red team model, so it will be pushed to create new prompts,” Hong explained in the statement.

The technique outperformed human testers and other machine learning methods by generating more specific stimuli that triggered increasingly toxic responses. Not only does their method significantly improve the coverage of tested inputs compared to other automated methods, but it can also remove toxic responses from a chatbot that had safeguards built into it by human experts .

The model has a “safety classifier” that gives a ranking for the level of toxicity caused.

The MIT researchers hope to train red team models to generate suggestions on a wider range of content that is attracted, and eventually train chatbots to adhere to specific standards, such as a company policy document, in order to test for company policy violations in amidst increasingly automated output. .

“These models are going to be an integral part of our lives and it is important that they are verified before they are released for public consumption,” said Pulkit Agrawal, senior author and director of Improbable AI , in the statement.

“Manual model checking is simply not scalable, and our work is an attempt to reduce the human effort to ensure a safer and more reliable AI future,” Agrawal said.

spot_img
RELATED ARTICLES

Most Popular