Implementing a sales & support agent with LangChain | by Tomaz Bratanic


Learn how to develop a chatbot that can answer questions based on the information provided in your company’s documentation

Recently, I have been fascinated by the power of ChatGPT and its ability to construct various types of chatbots. I have tried and written about multiple approaches to implementing a chatbot that can access external information to improve its answers. I joined a few Discord channels during my chatbot coding sessions, hoping to get some help, as the libraries are relatively new and not much documentation is available yet. To my amazement, I found custom bots that could answer most of the questions about the given library.

Example of a Discord support bot. Image by the author.

The idea is to give the chatbot the ability to dig through various resources like company documentation, code, or other content so that it can answer company support questions. Since I already have some experience with chatbots, I decided to test how hard it is to implement a custom bot with access to the company’s resources.

In this blog post, I will walk you through how I used OpenAI’s models to implement a sales & support agent with the LangChain library that can answer questions about applications built on the Neo4j graph database. The agent can also help you debug or produce any Cypher statement you are struggling with. Such an agent could then be deployed to serve users on Discord or other platforms.

We will be using the LangChain library to implement the support bot. The library is easy to use and provides an excellent integration of LLM prompts and Python code, allowing us to develop chatbots in only a few minutes. In addition, the library supports a range of LLMs, text embedding models, and vector databases, along with utility functions that help us load and embed common types of data we might come across, like text, PowerPoint, images, HTML, PDF, and more.

The code for this blog post is available on GitHub.

LangChain document loaders

First, we must preprocess the company’s resources and store them in a vector database. Luckily, LangChain can help us load external data, calculate text embeddings, and store the documents in a vector database of our choice.

The first step is to load the text into documents. LangChain offers a variety of helper functions that can take various formats and types of data and produce a document output. These helper functions are called document loaders.

Neo4j has a lot of its documentation available in GitHub repositories. Conveniently, LangChain provides a document loader that takes a repository URL as input and produces a document for each file in the repository. Additionally, we can use the filter function to ignore files during the loading process if needed.
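To make the filter idea concrete, here is a minimal sketch of a predicate that takes a file path as a string and returns True for files worth keeping. The function name is_article_adoc and the sample paths are made up for illustration, and the sketch assumes forward-slash paths. It is deliberately stricter than a plain substring check: it only accepts files that sit inside a directory named exactly "articles".

```python
from pathlib import PurePosixPath


def is_article_adoc(file_path: str) -> bool:
    """Keep only AsciiDoc files located under a directory named 'articles'."""
    path = PurePosixPath(file_path)
    # path.parts[:-1] holds the directory components, without the file name
    return path.suffix == ".adoc" and "articles" in path.parts[:-1]


# Hypothetical paths, only used to illustrate the behavior of the predicate
sample_paths = [
    "./repos/kb/articles/modules/ROOT/pages/example.adoc",
    "./repos/kb/README.adoc",
    "./repos/kb/articles/images/diagram.png",
    "./repos/kb/drafts/articles-old.adoc",
]
kept = [p for p in sample_paths if is_article_adoc(p)]
print(kept)
```

A simple substring test such as `"articles" in file_path` would also keep the last sample path, because the word appears in the file name. Whether that matters depends on how the repository is laid out.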

We will begin by loading the AsciiDoc files from Neo4j’s knowledge base repository.

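# Depending on your LangChain version, this import may live in
# langchain_community.document_loaders instead
from langchain.document_loaders import GitLoader

# Knowledge base
kb_loader = GitLoader(
    clone_url="https://github.com/neo4j-documentation/knowledge-base",
    repo_path="./repos/kb/",
    branch="master",
    # Keep only AsciiDoc files whose path contains "articles"
    file_filter=lambda file_path: file_path.endswith(".adoc")
    and "articles" in file_path,
)
kb_data = kb_loader.load()
print(len(kb_data))  # 309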

Wasn’t that easy as pie? The GitLoader function clones the repository and loads the relevant files as documents. In this example, we specified that the file must end with the .adoc suffix and be part of the articles folder. In total, 309 articles were loaded. We also have to be mindful of the size of the documents. For example, at the time of writing, GPT-3.5-turbo has a token limit of about 4,000, while GPT-4 allows about 8,000 tokens in a single request. While the number of words is not identical to the number of tokens, it is still a good estimator.
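As a rough illustration of using word counts as a proxy for tokens, the sketch below applies the common rule of thumb that one token corresponds to about three quarters of an English word. The ratio, the function names, and the 500 tokens reserved for the answer are all assumptions made for this example. An exact count would require a real tokenizer such as OpenAI’s tiktoken.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate based on a whitespace word count (heuristic only)."""
    word_count = len(text.split())
    return round(word_count / words_per_token)


def fits_in_context(text: str, token_limit: int, reserved_for_answer: int = 500) -> bool:
    """Check whether the text, plus room for the model's answer, fits the limit."""
    return estimate_tokens(text) + reserved_for_answer <= token_limit


# A synthetic 3000-word document, used only to exercise the functions
document = "word " * 3000
print(estimate_tokens(document))                    # 4000
print(fits_in_context(document, token_limit=4000))  # False
print(fits_in_context(document, token_limit=8000))  # True
```

Under this heuristic, a 3000-word document is already too large for a 4,000-token request once we leave room for the answer, which is why long documents need to be split before they are embedded.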

Next, we will load the documentation of the Graph Data Science repository. Here, we will use a text splitter to make sure none of the documents exceeds a fixed size. Note that CharacterTextSplitter measures chunk_size in characters by default, so the value of 2000 used here caps each chunk at 2000 characters rather than 2000 words. Neither characters nor words map exactly to tokens, but both are good enough approximations to keep the chunks within the model’s limit. Defining the chunk size threshold can significantly affect how documents are found and retrieved from the database. I found a great article by Pinecone that can help you understand the basics of various chunking strategies.

# Depending on your LangChain version, this import may live in
# the langchain_text_splitters package instead
from langchain.text_splitter import CharacterTextSplitter

# Define text chunk strategy
splitter = CharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=50,
    separator=" ",
)
# GDS guides
gds_loader = GitLoader(
    clone_url="https://github.com/neo4j/graph-data-science",
    repo_path="./repos/gds/",
    branch="master",
    # Keep only AsciiDoc files whose path contains "pages"
    file_filter=lambda file_path: file_path.endswith(".adoc")
    and "pages" in file_path,
)
gds_data = gds_loader.load()
# Split documents into chunks
gds_data_split = splitter.split_documents(gds_data)
print(len(gds_data_split))  # 771
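To build some intuition for what chunk_size, chunk_overlap, and separator control, here is a simplified, pure-Python sketch of greedy chunking. It is not LangChain’s implementation, and the function name split_text is made up for this example. It only illustrates the general idea: split the text on the separator, pack pieces into a chunk until the character limit is reached, and carry a few trailing pieces into the next chunk as overlap.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int, separator: str = " ") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of limited length.

    Simplified illustration only. A single piece longer than chunk_size
    is still emitted as its own oversized chunk.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current: list[str] = []

    def length(parts: list[str]) -> int:
        # Character length of the parts once joined with the separator
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Carry over trailing pieces that fit within chunk_overlap characters
            carried: list[str] = []
            for prev in reversed(current):
                if length([prev] + carried) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop carried pieces if they would push the new chunk over the limit
            while carried and length(carried + [piece]) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Tiny synthetic input so the overlap is easy to see
text = " ".join(f"w{i}" for i in range(10))
chunks = split_text(text, chunk_size=8, chunk_overlap=2)
print(chunks)
# ['w0 w1 w2', 'w2 w3 w4', 'w4 w5 w6', 'w6 w7 w8', 'w8 w9']
```

Each chunk repeats the last piece of the previous one, which is the role of the overlap: a sentence that straddles a chunk boundary still appears with some of its context in both chunks.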

We could load other Neo4j repositories that contain documentation. However, the idea is to show various data loading methods and not to explore all of Neo4j’s repositories containing documentation. Therefore, we will move on and look at how we can load documents from a Pandas DataFrame.

For example, say that we want to load a YouTube video as a document source for our chatbot. Neo4j has its own YouTube channel, and even I appear in a video or two. Two years ago, I presented how to implement an information extraction pipeline.