Hundreds of servers running large open-source language models, and dozens hosting vector databases, are leaking highly sensitive information to the open Internet, according to a study by cybersecurity company Legit.


As part of the study, Legit researcher Naphtali Deutsch scanned two types of potentially vulnerable artificial intelligence services: vector databases that store information for AI tools, and low-code builders for applications based on large language models, in particular the open-source program Flowise. The scan revealed a wealth of sensitive personal and corporate data unwittingly exposed by organizations rushing to adopt generative AI tools.

Flowise is an open-source program for building all kinds of applications on top of large language models, from customer-support chatbots to code-generation tools. Because these applications tend to access and manipulate large amounts of data, most Flowise servers are protected with passwords. But a password alone is not a sufficient safeguard: a researcher in India previously discovered a vulnerability in Flowise 1.6.2 and earlier that allows authentication to be bypassed simply by uppercasing characters in the API endpoint path of a request. The flaw is tracked as CVE-2024-31621 and carries a "high" severity rating of 7.6 out of 10.
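The class of bug behind CVE-2024-31621 can be sketched in a few lines: an authentication check compares the request path case-sensitively, while the router that actually serves the endpoint matches case-insensitively, so an uppercased path slips past the check but still reaches the protected handler. The snippet below is an illustrative Python model of that mismatch and its fix, not Flowise's actual code; the function and constant names are invented for the example.

```python
# Illustrative model of a case-sensitivity auth bypass (CVE-2024-31621 class).
# Assumption: the router matches paths case-insensitively, so any request
# whose path the auth check misses is still served.

PROTECTED_PREFIX = "/api/v1"  # hypothetical protected API prefix

def needs_auth_buggy(path: str) -> bool:
    # Bug: the comparison is case-sensitive. "/API/V1/credentials" fails
    # this check, so authentication is skipped for it, yet a
    # case-insensitive router would still serve the endpoint.
    return path.startswith(PROTECTED_PREFIX)

def needs_auth_fixed(path: str) -> bool:
    # Fix: normalize the path's case before comparing, so the auth check
    # covers every spelling the router would accept.
    return path.lower().startswith(PROTECTED_PREFIX)

if __name__ == "__main__":
    print(needs_auth_buggy("/API/V1/credentials"))  # False -> auth bypassed
    print(needs_auth_fixed("/API/V1/credentials"))  # True  -> auth enforced
```

The general lesson is that any security check must normalize its input exactly the way the component it protects does, or an attacker can craft input that the two interpret differently.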

By exploiting the vulnerability, Deutsch gained access to 438 Flowise servers. On them he found GitHub API access tokens, OpenAI API keys, Flowise passwords, other plaintext API keys, and configuration data and prompts associated with Flowise applications, among much else. A GitHub API token grants access to private repositories, the researcher noted; he also found API keys for vector databases, including Pinecone, a popular SaaS platform. A would-be attacker could use those keys to log in to the databases and download everything stored there, confidential data included.

Using scanning tools, Deutsch also discovered about 30 vector database servers exposed on the open Internet with no authentication at all, and they contained sensitive information: emails from an engineering services provider; documents from a fashion company; customers' personal and financial data from an industrial equipment supplier; and much more besides. Other databases held real-estate listings, product documentation and data sheets, and even patient information used by a medical chatbot.

A vector database leak is more dangerous than a leak from an LLM app builder, because unauthorized access to the database can go unnoticed by its owner. A potential attacker could not only steal information from a vector database but also delete or alter it, or even plant malicious content in it that would poison the large language model drawing on it. To mitigate these risks, Deutsch recommends that organizations restrict access to their AI services, monitor and log activity associated with them, take steps to protect sensitive data handled by large language models, and keep the associated software up to date.
