I’d like to propose an idea for creating a general reference and help tool utilizing a chatbot trained on data from Substrate-based chains (e.g., Polkadot, Kusama, Parachains) and their associated repositories. While I don’t plan to implement this myself, I am curious to discuss its feasibility and potential impact.
1. Would it be technically feasible to extract and preprocess the necessary data from these chains and repositories to train a chatbot model?
2. Do you think the resulting chatbot would provide valuable insights and assistance to developers?
3. Are there any potential legal or ethical concerns with using the data from these chains and repositories for this purpose?
4. Would the chatbot model be able to keep up with the constantly evolving nature of these ecosystems?
I’m interested in hearing your thoughts on this proposal, any suggestions for improving the idea, and any potential challenges that could arise.
At Moonbeam we just integrated Kapa and it works really well. You can go to our Discord to try it out.
Kapa needs strong knowledge sources and provides ChatGPT-based answers (I’m not an expert, but I’ll share this forum post so the Kapa team can chime in). Our model was trained on our documentation site, Substrate’s, the forum, and some other resources. The team has been super responsive and is always looking at how they can improve in this rapidly changing environment.
Let me know if someone has any specific questions.
I’ve been using GPT-4 to help me understand some parts of Substrate & FRAME, as well as to generate code for parsing data, guided tutorials, and scaffolding for new projects. It’s very helpful but needs some tweaks to provide it with data that wasn’t in its training set. Luckily, the context for GPT-4 is quite big (~32K tokens, or about 50 A4 pages of text), which means you can add this functionality easily.
In a nutshell, you take the complete content (text, code) that could contain an answer (for example: the complete markdown of substrate docs), paste it in the prompt and add your question at the end.
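A rough sketch of that flow in Python (the `build_prompt` helper, the 32K figure, and the 4-characters-per-token estimate are illustrative assumptions — not a real tokenizer or API call):

```python
MAX_CONTEXT_TOKENS = 32_000  # rough GPT-4 32K context window


def rough_token_count(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return len(text) // 4


def build_prompt(corpus: str, question: str) -> str:
    # Reference material goes first, the question at the end.
    prompt = (
        "Answer using only the reference material below.\n\n"
        f"{corpus}\n\n"
        f"Question: {question}"
    )
    if rough_token_count(prompt) > MAX_CONTEXT_TOKENS:
        raise ValueError("corpus + question exceed the model's context window")
    return prompt


docs = "Pallets are modular runtime components in Substrate..."
prompt = build_prompt(docs, "What is a pallet?")
```

The check matters because a corpus like the full Substrate docs can easily blow past the window, which is what motivates the search-based approach below.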
While this is not always practical, another common approach is to use a search engine that generates the context by filtering a larger corpus (larger than 32K tokens) and to use the results of the search query as the context.
This can be prototyped today using tools like LangChain & OpenAI embeddings, but so far only the GPT-3.5-turbo model is generally available on the API (the GPT-4 API is not available to everyone). The downside is that GPT-3.5-turbo has a smaller context size of 8K tokens.
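The search-plus-context idea can be sketched in miniature. The bag-of-words “embedding” below is a deterministic stand-in for real embeddings (such as OpenAI’s embedding API behind LangChain’s vector stores); only the retrieve-top-chunks-then-use-as-context shape carries over:

```python
import math
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def embed(text: str, vocab: list[str]) -> list[float]:
    # Bag-of-words vector over a fixed vocabulary (toy stand-in for a
    # real embedding model).
    words = tokenize(text)
    return [float(words.count(w)) for w in vocab]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)


def top_k(chunks: list[str], question: str, k: int = 2) -> list[str]:
    # Rank every chunk by similarity to the question; keep the best k
    # as the context to paste into the prompt.
    vocab = sorted({w for text in chunks + [question] for w in tokenize(text)})
    q = embed(question, vocab)
    return sorted(chunks, key=lambda c: cosine(embed(c, vocab), q), reverse=True)[:k]


chunks = [
    "Pallets are modular components that compose a Substrate runtime.",
    "Kusama is Polkadot's canary network for early releases.",
    "FRAME provides macros for building pallets in Rust.",
]
context = top_k(chunks, "How do I build a pallet with FRAME?")
```

In a real setup the chunks would be pre-embedded and stored in a vector index, so only the question is embedded at query time.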
Then again, there is the reliance on OpenAI for the API, plus the costs incurred for API calls. Large-context API calls are a bit pricey.
Open-source alternatives (for non-commercial use) are LLaMA, Alpaca, GPT4All, and others, which can run on a single machine (generating text, code, …). I’ve successfully run these models on my local desktop and they work great. These models can be used either as-is, as replacements for the GPT-4 API, or as a starting point for training new models (instruction tuning), adding our own data and docs as training data. What usually works is to provide question-answer pairs to the model and train on those. This would require some powerful GPUs and some engineering upfront, but the cost can be kept low (low four figures).
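For the question-answer pairs, the usual preparation step is writing them out as JSONL. The Alpaca-style `instruction`/`output` field names below are one common convention and an assumption here — real fine-tuning scripts may expect different keys, and the pairs themselves are hypothetical examples:

```python
import json

# Hypothetical Q&A pairs; real ones would come from the docs, the forum,
# or Substrate StackExchange.
qa_pairs = [
    {
        "instruction": "What is a pallet in Substrate?",
        "output": "A pallet is a modular component that implements a piece of runtime logic.",
    },
    {
        "instruction": "What does FRAME provide?",
        "output": "FRAME provides Rust macros for declaring storage, events, and dispatchable calls.",
    },
]

# One JSON object per line, the format most fine-tuning tooling ingests.
with open("substrate_qa.jsonl", "w") as f:
    for pair in qa_pairs:
        f.write(json.dumps(pair) + "\n")
```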
So to answer your questions:
1. Definitely feasible, either by retraining or by providing embeddings of the dataset (search engine + context). The second approach can evolve as new docs appear; the first is static and needs retraining whenever there is new data or updated code.
2. Absolutely. As a developer tool, language models are productivity multipliers. They’re essentially cookie cutters on steroids (that’s how I like to think about it).
3. I’m not a lawyer, but if the data & code are open source, I’m not sure there would be any concerns. The concerns come from how developers interact with the models. If a developer types a query and it lands on OpenAI’s servers, that query can be used for training new models, which could be problematic on the data-protection side, since consent perhaps wasn’t “explicitly” given (the TOS might cover this, but then what about ethics?).
4. I think it definitely would. The more the bot is used, the better it can become. An open-source GPT-style bot fine-tuned on Substrate data & code might be able to answer a lot of questions and help people get started building faster.
Just my 2 cents on the matter. On the data front, we’re exploring some interesting use cases using chatbots as an abstraction layer over complex topics, and are in discussion with teams like Miti.ai about similar use cases. If there’s enough interest, we can give this a try.
W3F has also been working with kapa.ai to explore the possibilities of a chatbot, but during the trial our focus was mostly on end-user support, rather than developers. The Discord bot was trained on the articles from Polkadot Support, the Polkadot Wiki, and Polkadot FAQ.
I must say the results were very promising. The bot was able to answer most questions well, while providing sources for the user to explore further. It’s a very good information aggregator that also points you in the right direction. What it couldn’t do is answer questions about a specific account, like a support agent could, but it might be able to if trained properly on on-chain data. That raises other questions, though (see point 3 below).
A similar functionality would be possible if the bot is trained on developer resources. But I don’t know if it could go as deep as writing code or fixing your code for you. @Emil-Sorensen or @albertov19 might be able to provide more insight on this.
To answer your questions (at least those I can):
1. Yes it is, especially for the repos and other documentation. On-chain data, and how the bot can utilize it (if I understand the question correctly), is something that needs to be explored further.
2. I believe it would, especially for a dev who’s just starting out with Substrate. To what level of detail it can go, I don’t know, but I imagine the more resources are out there (from documentation to actual questions and answers), the better it will perform. But again, I’ll refer you to @Emil-Sorensen and @albertov19 on this.
3. The short answer is yes, there are. The biggest concern is personal data, not only in the sources the bot utilizes (which can be chosen to be fully public, but may still contain personal data), but also in the information users share with the bot. That’s not to say these are blockers, just that there are concerns without clear answers yet. After all, this is a new and rapidly evolving field (example), so this is to be expected. Once I have clearer answers, I’ll revert.
4. As I’m not an expert, I’ll refrain from answering this, but from our experience, if some information was missing when the bot was trained, it wouldn’t know about it, so constant retraining may be needed for it to keep up.
Overall, I’d say there’s definitely value to be added here and I’ll keep exploring this as a solution to provide better support to our community (end-users and developers alike). That being said, there are things to consider, so personally, I’d say it might be a bit premature to do a treasury proposal for something like this.
I just want to say that I really like this idea! In my opinion this bot is inevitable, so I’d like to share some ideas on how we should train it.
First, I am not an AI expert, but I was sent this post the other day, where it seems the bot builds its understanding by going through the source code. I’m not sure how feasible that is, since rust-analyzer already has a hard time at the moment, but it is definitely worth trying out!
If we do go by the docs, as @Emil-Sorensen said, “strong knowledge sources = high quality AI bot”, so we should carefully choose which resources to start with and slowly iterate on the answers the bot produces. A good resource for quality questions and answers could be Substrate StackExchange, and I would be happy to help filter those out!
Yeah, so KAPA can’t really write code like ChatGPT-4, unless your docs have code snippets that can help KAPA formulate an answer. Basically, as @Emil-Sorensen said and @Daanvdplas reiterated: “strong knowledge sources = high quality AI bot”.
We were impressed by KAPA’s ability to answer questions, tbh. Moreover, you can also see it as a way to get feedback on what is missing from your documentation, right? If KAPA is not able to answer a question, it means it did not find the answer (or maybe it did, but did not have the confidence level to share it).
If your documents are developer-oriented, KAPA could work for answering developer questions. However, the questions also need to have been answered in the knowledge source that KAPA is run against. If a developer wants to know how to get started on Moonbeam, or how to use a basic precompile, KAPA will provide an answer and resources pointing to our docs site (for example).