Building open-source AI | Nature Computational Science


The development of open-source AI and that of open-source software (OSS) share several similarities. However, there are also important differences that require a tailored approach to building open-source AI. Whereas conventional software is programmed with explicit rules to perform a task, AI is programmed to learn to perform a task. As a result, AI technology has three essential components: datasets for training, source code for formalizing the training task, and models that eventually store the trained weights. In addition, training AI models requires substantial hardware resources and comes with high operating costs. Furthermore, the use of AI may expose society to substantial risks (for example, the malicious use of AI to create misinformation), which calls for a responsible societal approach to open-source AI technology. Below we discuss a tailored approach to open-source AI, complementary to proprietary AI, that fosters (1) accessibility, (2) collaboration, (3) responsibility and (4) interoperability (see Fig. 1).

Fig. 1: Key approaches to promote open-source AI technology.

The suggested actions should foster accessibility, collaboration, responsibility and interoperability.

Improving accessibility

To foster accessibility, policy-makers should proactively encourage the development and adoption of open-source AI. Because building contemporary AI models demands large amounts of data and infrastructure, AI innovation is considerably more capital-intensive than regular software development, and additional resources (such as funding and access to large-scale infrastructure and data) are needed to kickstart and scale open-source AI technology. Importantly, existing public computational resources are often too small to build state-of-the-art AI technology comparable to that of for-profit companies. For example, the development of a large language model (LLM) is estimated to cost between 300 and 400 million euros. Another limiting factor is that, even when resources are made available, they are often bound to academia and are thus inaccessible to other stakeholders, such as non-profit organizations seeking opportunities where AI could be leveraged for social benefit. A promising counterexample is the US roadmap offering broader access to computational resources, including through public–private partnerships. At present, scientists are often unable to replicate the AI technology developed by companies owing to a lack of resources, so such roadmaps could help to facilitate reproducibility (for instance, via the ML Reproducibility Challenge).

To broaden access to data and models, policy-makers could support the development of open repositories for hosting both under a trustworthy and responsible governance model. Importantly, open datasets from public institutions are often large and originate from diverse sources, which is beneficial in practice. Furthermore, public institutions can actively incentivize data-sharing partnerships, which, in combination with federated learning, may promote AI across institutional boundaries while preserving data privacy. For example, the German government recently launched a consortium called Mobility Data Space, in which different stakeholders in the mobility sector (such as public transport companies, private car-sharing providers and car manufacturers) can access shared data, including data from competitors.
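To make the idea of federated learning more concrete, the following is a minimal sketch of federated averaging on a toy one-parameter model. It is an illustration under stated assumptions, not a description of how Mobility Data Space or any real consortium operates: the function names, the model y = w * x, the learning rate and the two example clients are all hypothetical. The point it shows is that each institution trains on its own data and only shares a model parameter, never the raw records.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair held privately by one institution


def local_update(w: float, data: Sequence[Sample], lr: float = 0.05, steps: int = 5) -> float:
    """Run a few gradient-descent steps for the model y = w * x on local data only."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(local_ws: Sequence[float], sizes: Sequence[int]) -> float:
    """Combine local parameters, weighting each client by its number of samples."""
    total = sum(sizes)
    return sum(w * n for w, n in zip(local_ws, sizes)) / total


def train_federated(clients: List[List[Sample]], rounds: int = 20) -> float:
    """Alternate between local training and server-side averaging."""
    w = 0.0  # shared starting point (hypothetical choice)
    for _ in range(rounds):
        # Each client starts from the current global parameter and trains locally.
        local_ws = [local_update(w, data) for data in clients]
        # Only the parameters and the dataset sizes are sent to the server.
        w = federated_average(local_ws, [len(d) for d in clients])
    return w


# Two hypothetical institutions whose private data both follow y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
]
global_w = train_federated(clients)
print(f"learned slope: {global_w:.6f}")
```

In this toy setting the shared parameter converges toward the common slope of 2.0. Real deployments add further safeguards (for example, secure aggregation), since shared parameters alone do not guarantee privacy.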

However, data sharing comes with challenges. First, opening up datasets increases the likelihood of privacy breaches and raises ethical issues around confidentiality, data misrepresentation and informed consent. Second, organizing open data and maintaining fairness in terms of distribution rights and acknowledgments for its contributors is challenging. Fortunately, there has been recent progress in developing governance frameworks to tackle these challenges, such as the FOT-Net Data Sharing Framework, designed for connected automated driving under the General Data Protection Regulation in the European Union. Such frameworks could be useful starting points for improving accessibility while tackling the ethical, legal and organizational challenges.

Finally, much educational material on state-of-the-art AI is managed by for-profit companies (such as Coursera and Udemy) and is often hidden behind paywalls. Hence, to promote the adoption of open-source AI, more effort is needed to improve access to high-quality educational materials. Taken together, the measures above should considerably lower the barriers to contributing to, and accessing, AI applications.

Improving collaboration

AI technologies may be jointly developed and maintained by diverse and inclusive communities of developers, users and stakeholders. This collaborative approach may greatly reduce the cost of development and contribute to solving scaling problems. This will result in broad participation by stakeholders who can make the future of AI more inclusive and fairer.

To promote collaboration in open-source AI technology, clear steps should be taken towards building communities across academia, non-profit organizations, companies and public institutions. Because the development of AI models is less easily decomposed into smaller tasks than standard software development, further effort is needed to develop suitable collaboration practices that allow for more iterative and parallel development processes. Here, the lessons learned from the BigScience project [9], in which over a thousand volunteer scientists assembled to develop an LLM called BLOOM [10], should be valuable. Furthermore, policy-makers should fund large-scale initiatives to produce open-source LLMs as complements to proprietary LLMs.

Creating synergies and networks between universities, research centers, government and industry may establish new ecosystems around open-source AI and become a driver for future innovation. Building such ecosystems is especially relevant for start-up firms and small- and medium-sized enterprises [11] because they often lack the dedicated infrastructure and capacity to boost AI technology.

Improving responsibility

It is important to establish clear barriers against the misuse of AI technology. To this end, access control, similar to existing norms for open data, is needed to enforce the responsible use of open-source AI in practice. Consider, for example, MIMIC-III, a large, freely available health-related dataset. Given the sensitive nature of medical data, MIMIC-III is open to researchers only after they undergo compulsory ethics training. Similarly, access control for open-source AI should follow a layered approach that varies appropriately across datasets, source code and models to ensure responsible use, taking into account safety, security and privacy.
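A layered approach of this kind can be sketched as a small policy table that maps each type of artifact to the conditions a user must meet before access is granted. The sketch below is purely illustrative: the artifact categories follow the three components named in the text, but the specific requirements (`accepted_use_policy`, `ethics_training`) and the assignment of requirements to layers are hypothetical assumptions, not the policy of MIMIC-III or any existing repository.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Artifact(Enum):
    DATASET = "dataset"
    SOURCE_CODE = "source_code"
    MODEL = "model"


# Hypothetical policy: stricter conditions for more sensitive artifacts.
REQUIREMENTS = {
    Artifact.SOURCE_CODE: frozenset(),                          # openly available
    Artifact.MODEL: frozenset({"accepted_use_policy"}),         # usage terms only
    Artifact.DATASET: frozenset({"accepted_use_policy",
                                 "ethics_training"}),           # strictest layer
}


@dataclass
class User:
    name: str
    credentials: Set[str] = field(default_factory=set)


def missing_requirements(user: User, artifact: Artifact) -> frozenset:
    """Return the conditions the user has not yet fulfilled for this artifact."""
    return REQUIREMENTS[artifact] - user.credentials


def can_access(user: User, artifact: Artifact) -> bool:
    """Grant access only when every condition for the artifact's layer is met."""
    return not missing_requirements(user, artifact)


# Example: a researcher who accepted the usage terms but has not completed training.
researcher = User("example-researcher", {"accepted_use_policy"})
for artifact in Artifact:
    print(artifact.value, can_access(researcher, artifact))
```

Keeping the policy in a single table makes the layering explicit and auditable; in practice, which layer is strictest would depend on the sensitivity of the specific dataset, code base or model.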

In addition, novel licenses are required, inspired by those for OSS but carefully tailored to open-source AI [12]. Such licenses must ensure broad user access while enforcing, on a legally binding basis, guidelines that prohibit malicious practices (such as abusing LLMs to automatically generate propaganda campaigns). Furthermore, such licenses for open-source AI should include sub-clauses that define permitted and restricted uses and also how the technology can or cannot be repurposed. Prominent examples are the Responsible AI Licenses (RAIL), which aim to prevent irresponsible and harmful applications of AI technologies by granting permission only for certain use cases. Over time, customized variants of licenses for open-source AI could be developed, so that high-risk applications of AI technology are more tightly restricted.

Similar to OSS, the development and use of AI technology under open-source principles can be especially effective in addressing bias in AI systems and steering innovations in a fair, ethical and trustworthy direction. First, owing to the diversity of inputs from stakeholders around the world, there will be a greater emphasis on removing bias. Addressing bias is as important when curating datasets as when training models. Second, a common concern is that open-source AI may not have the same level of quality control and testing as proprietary solutions, so that bugs accidentally introduced by its developers may go unnoticed. Collaboration helps to address this concern because it naturally leads to extensive testing.

Further, the development of AI in open communities may introduce decentralized organizations (that is, organizations without authority hierarchies based on employment contracts). Many open communities have developed organizational structures based on merit, effort and expertise that are effective at resolving both coordination and cooperation issues, including how to manage conflicts. For instance, the Debian community developed a constitution that determines the decision-making rights of contributors and a set of rules that the community can refer to in case of conflicts or accountability issues. Lessons from communities such as Debian could be incorporated into a functional organizational structure and effective governance for open-source AI communities. At the same time, because designated bodies for maintaining adherence to legal frameworks are typically missing and questions of accountability are often unclear, regulatory compliance can pose legal challenges. Nevertheless, open-source AI technology brings important principles to the table that go beyond existing regulatory frameworks for responsible and trustworthy use of AI.

It is also worth noting that there are privacy and security threats associated with the use of open-source AI. For example, malicious actors could perform backdoor attacks in which they manipulate a small portion of the training data to make an AI model learn additional, hidden functionalities [13]. In general, vulnerabilities in open-source AI are often public knowledge, which can make attacks easier but also makes them easier to identify. Furthermore, there are risks for society when open-source AI is used for nefarious purposes. Examples are the use of open-source AI technologies for the development of weapons and for AI-generated propaganda campaigns [14]. Nevertheless, the benefits of open-source AI are likely to outweigh the downsides, especially if a responsible open-source approach with clear barriers against misuse is pursued, as laid out above.

Improving interoperability

Over time, AI technology will need to build upon more standardized and modular building blocks within software libraries (such as prompt templates and standardized prompt optimizers in the case of LLMs) that allow for easier adoption and customization in downstream applications. Interoperability of pre-trained models across platforms should also drastically reduce the need to retrain large models. The result will be greater reusability of AI technologies, thus reducing the need to ‘reinvent the wheel’ and promoting faster iterations during development. Interoperability matters not only for rapidly building AI applications but also for ensuring that high-quality source code and models designed in a responsible and robust manner can be reused.
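As a rough illustration of what such modular building blocks could look like, the sketch below separates a reusable prompt template from the model that consumes it, so that the application code does not depend on any particular backend. Everything here is a hypothetical assumption made for illustration: the `TextModel` interface, the `PromptTemplate` class and the two stand-in backends are not the API of any existing library, and the backends merely transform text rather than call a real LLM.

```python
from string import Template
from typing import Protocol


class TextModel(Protocol):
    """Minimal shared interface that any backend is assumed to implement."""

    def generate(self, prompt: str) -> str: ...


class PromptTemplate:
    """Reusable prompt with named placeholders, independent of the backend."""

    def __init__(self, text: str) -> None:
        self._template = Template(text)

    def render(self, **values: str) -> str:
        # Raises KeyError if a placeholder is left unfilled.
        return self._template.substitute(**values)


class EchoModel:
    """Stand-in for one provider's model (hypothetical)."""

    def generate(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class UpperModel:
    """Stand-in for a different provider's model (hypothetical)."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


SUMMARY = PromptTemplate("Summarize in one sentence: $document")


def summarize(model: TextModel, document: str) -> str:
    """Application code written once against the shared interface."""
    return model.generate(SUMMARY.render(document=document))


# Switching backends requires no change to the application code.
print(summarize(EchoModel(), "open data"))
print(summarize(UpperModel(), "open data"))
```

Because `summarize` depends only on the shared interface, replacing one backend with another is a one-line change at the call site, which is the kind of reuse that standardized building blocks are meant to enable.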

In terms of standardization, various standards bodies such as the International Organization for Standardization have several standards under draft that aim at the harmonization of AI technology. The current initiatives cover various aspects including life cycle management, data quality, risk management and auditing. Such standardization roadmaps are helpful for developing trustworthy AI systems in high-risk applications (for example, through standardized conformity checks). Crucially, standardization must be brought to life through software libraries for developing AI technology. In this regard, public funding may be necessary to support the development of open-source libraries, along with corresponding educational resources and long-term maintenance.

As a result of growing harmonization, dependence on a specific AI technology will diminish, so that end-users can avoid ‘lock-in’ effects and benefit from reduced switching costs (for instance, when changing from the LLM of company A to that of company B). For developers, interoperability can eventually help to counteract growing inequality in the development of, access to and use of AI technology, while also promoting effective competition. In this regard, a concern from a corporate perspective may be that, if AI research is forced to be open, companies may not see value in investing as much in research and development as they otherwise would. For example, the motivation of companies to develop new AI technologies may be reduced in the presence of open-source alternatives, which may hamper innovation more broadly and could eventually also lead to gatekeeping behavior by established companies. However, we argue that the presence of open-source AI complementary to proprietary alternatives may increase healthy competition, which can also make commercial products better.

