A story of taming unruly documents to create the ultimate GPT-based chatbot
Picture this: you’re at a fast-growing tech company, and you’ve been given the mission to create a state-of-the-art chatbot using the mind-blowing GPT technology. This chatbot is destined to become the company’s crown jewel, a digital oracle that’ll answer questions based on the treasure trove of information stored in your Confluence spaces. Sounds like a dream job, right?
But as you take a closer look at the Confluence knowledge base, reality hits. It’s a wild jungle of empty or incomplete pages, irrelevant documents, and duplicate content. It’s like someone dumped a thousand jigsaw puzzles into a giant blender and pressed “start.” And now it’s your job to clean up this mess before you can even think about building that fantastic chatbot.
Luckily for you, in this article we’ll embark on an exciting journey to conquer the Confluence chaos, using the power of Python and BERTopic to identify and eliminate those pesky outliers. So buckle up and get ready to transform your knowledge base into the perfect training ground for your cutting-edge GPT-based chatbot.
As you face the daunting task of cleaning up your Confluence knowledge base, you might consider diving in manually, sorting through each document one by one. The manual approach, however, is slow, labor-intensive, and error-prone. After all, even the most meticulous employee can overlook important details or misjudge the relevance of a document.
With your knowledge of Python, you might be tempted to build a heuristic-based solution, using a set of predefined rules to identify and eliminate outliers. While this approach is faster than manual cleanup, it has its limitations. Heuristics can be rigid and struggle to adapt to the complex and ever-evolving nature of your Confluence spaces, often leading to suboptimal results.
Enter Python and BERTopic, a powerful combination that can help you tackle the challenge of cleaning up your Confluence knowledge base more effectively. Python is a versatile programming language, while BERTopic is an advanced topic modeling library that can analyze your documents and group them based on their underlying topics.
In the next paragraphs, we’ll explore how Python and BERTopic can work together to automate the process of identifying and eliminating outliers in your Confluence spaces. By harnessing their combined powers, you’ll save time and resources while increasing the accuracy and effectiveness of your cleanup efforts.
Alright, from this point on, I’ll walk you through the process of creating a Python script that uses BERTopic to identify and eliminate outliers in your Confluence knowledge base. The goal is to generate a ranked list of documents based on their “unrelatedness” score (which we’ll define later). Each entry will contain the document’s title, a preview of the text (first 100 characters), and the unrelatedness score. The final output will look like this:
(Title: “AI in Healthcare”, Preview: “Artificial intelligence is transforming…”, Unrelatedness: 0.95)
(Title: “Office Birthday Party Guidelines”, Preview: “To ensure a fun and safe…”, Unrelatedness: 0.8)
The essential steps in this process are:
- Connect to Confluence and download documents: establish a connection to your Confluence account and fetch the documents for processing. This part covers setting up the connection, authenticating, and downloading the required data.
- HTML processing and text extraction with Beautiful Soup: use Beautiful Soup, a powerful Python library, to handle the HTML content and extract the text from the Confluence documents. This step involves cleaning up the extracted text, removing unwanted elements, and preparing the data for analysis.
- Apply BERTopic and create the ranking: with the cleaned-up text in hand, apply BERTopic to analyze and group the documents based on their underlying topics. After obtaining the topic representations, calculate the “unrelatedness” measure for each document and build a ranking to identify and eliminate outliers in your Confluence knowledge base.
Finally, the code. Here we’ll start by downloading documents from a Confluence space, then process the HTML content and extract the text for the next phase (BERTopic!).
First, we need to connect to Confluence via its API. Thanks to the atlassian-python-api library, this can be done with just a few lines of code. If you don’t have an API token for Atlassian, read this guide to set one up.
import os
import re
from atlassian import Confluence
from bs4 import BeautifulSoup

# Set up the Confluence API client
confluence = Confluence(
    url='YOUR_CONFLUENCE_URL',
    username='YOUR_EMAIL',
    password='YOUR_API_KEY',
    cloud=True)

# Replace with the key of the Confluence space you want to clean
space_key = 'YOUR_SPACE'

def get_all_pages_from_space_with_pagination(space_key):
    limit = 50
    start = 0
    all_pages = []
    while True:
        pages = confluence.get_all_pages_from_space(space_key, start=start, limit=limit)
        if not pages:
            break
        all_pages.extend(pages)
        start += limit
    return all_pages

pages = get_all_pages_from_space_with_pagination(space_key)
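Before moving on, it’s worth checking that the pagination actually returned something. Here’s a minimal sanity check (illustrative only; the 'id' and 'title' keys are the same page fields we rely on below):

# Quick sanity check: confirm pages were fetched before processing them
print(f"Fetched {len(pages)} pages from space '{space_key}'")
if pages:
    print(pages[0]['title'])  # each page dict carries at least an 'id' and a 'title'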
After fetching the pages, we’ll create a directory for the text files, extract each page’s content, and save the text to individual files:
# Function to sanitize filenames
def sanitize_filename(filename):
    return "".join(c for c in filename if c.isalnum() or c in (' ', '.', '-', '_')).rstrip()

# Create a directory for the text files if it doesn't exist
if not os.path.exists('txt_files'):
    os.makedirs('txt_files')

# Extract pages and save to individual text files
for page in pages:
    page_id = page['id']
    page_title = page['title']

    # Fetch the page content
    page_content = confluence.get_page_by_id(page_id, expand='body.storage')

    # Extract the content in the "storage" format
    storage_value = page_content['body']['storage']['value']

    # Clean up the HTML tags to get the text content
    text_content = process_html_document(storage_value)

    file_name = f'txt_files/{sanitize_filename(page_title)}_{page_id}.txt'
    with open(file_name, 'w', encoding='utf-8') as txtfile:
        txtfile.write(text_content)
The function process_html_document carries out all the cleaning tasks needed to extract the text from the downloaded pages while maintaining a coherent format. How far you want to refine this process depends on your specific requirements. In this case, we focus on handling tables and lists, so that the resulting text document keeps a format similar to the original layout.
import spacy

nlp = spacy.load("en_core_web_sm")

def html_table_to_text(html_table):
    soup = BeautifulSoup(html_table, "html.parser")

    # Extract table rows
    rows = soup.find_all("tr")

    # Determine if the table has headers or not
    has_headers = any(th for th in soup.find_all("th"))

    # Extract table headers, either from the <th> elements or from the first row
    if has_headers:
        headers = [th.get_text(strip=True) for th in soup.find_all("th")]
        row_start_index = 1  # Skip the first row, as it contains the headers
    else:
        first_row = rows[0]
        headers = [cell.get_text(strip=True) for cell in first_row.find_all("td")]
        row_start_index = 1

    # Iterate through rows and cells, and use NLP to generate sentences
    text_rows = []
    for row in rows[row_start_index:]:
        cells = row.find_all("td")
        cell_sentences = []
        for header, cell in zip(headers, cells):
            # Generate a sentence using the header and the cell value
            doc = nlp(f"{header}: {cell.get_text(strip=True)}")
            sentence = " ".join([token.text for token in doc if not token.is_stop])
            cell_sentences.append(sentence)

        # Combine the cell sentences into a single row text
        row_text = ", ".join(cell_sentences)
        text_rows.append(row_text)

    # Combine the row texts into a single text
    text = "\n\n".join(text_rows)
    return text

def html_list_to_text(html_list):
    soup = BeautifulSoup(html_list, "html.parser")
    items = soup.find_all("li")
    text_items = []
    for item in items:
        item_text = item.get_text(strip=True)
        text_items.append(f"- {item_text}")
    text = "\n".join(text_items)
    return text

def process_html_document(html_document):
    soup = BeautifulSoup(html_document, "html.parser")

    # Replace tables with text using html_table_to_text
    for table in soup.find_all("table"):
        table_text = html_table_to_text(str(table))
        table.replace_with(BeautifulSoup(table_text, "html.parser"))

    # Replace lists with text using html_list_to_text
    for ul in soup.find_all("ul"):
        ul_text = html_list_to_text(str(ul))
        ul.replace_with(BeautifulSoup(ul_text, "html.parser"))

    for ol in soup.find_all("ol"):
        ol_text = html_list_to_text(str(ol))
        ol.replace_with(BeautifulSoup(ol_text, "html.parser"))

    # Replace all variants of <br> with newlines
    br_tags = re.compile('<br>|<br/>|<br />')
    html_with_newlines = br_tags.sub('\n', str(soup))

    # Strip the remaining HTML tags to isolate the text
    soup_with_newlines = BeautifulSoup(html_with_newlines, "html.parser")
    return soup_with_newlines.get_text()
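To see what this pipeline produces, here’s a small illustrative run on an invented HTML snippet (the sample markup and the expected output are mine, not from a real Confluence page):

# Illustrative usage of process_html_document on a tiny, made-up page body
sample_html = (
    "<table>"
    "<tr><th>Name</th><th>Dish</th></tr>"
    "<tr><td>Alice</td><td>Pasta</td></tr>"
    "</table>"
    "<ul><li>Bring drinks</li><li>RSVP by Friday</li></ul>"
)

print(process_html_document(sample_html))
# Roughly: "Name : Alice, Dish : Pasta" followed by "- Bring drinks" and "- RSVP by Friday"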
In this final chapter, we’ll finally leverage BERTopic, a powerful topic modeling technique that uses BERT embeddings. You can learn more about BERTopic in its GitHub repository and its documentation.
Our approach to finding outliers consists of running BERTopic with different values for the number of topics. In each iteration, we’ll collect all documents that fall into the outlier cluster (-1). The more frequently a document appears in the -1 cluster, the more likely it is to be an outlier. This frequency forms the first component of our unrelatedness score. BERTopic also provides a probability value for documents in the -1 cluster. We’ll calculate the average of these probabilities for each document over all the iterations. This average represents the second component of our unrelatedness score. Finally, we’ll determine the overall unrelatedness score for each document by averaging the two components (frequency and probability). This combined score will help us identify the most unrelated documents in our dataset.
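To make the scoring concrete before we dive into the real code, here’s a toy numerical sketch (all numbers invented): imagine three documents tracked across six BERTopic runs.

import numpy as np

counts = np.array([5, 1, 0])           # how often each document landed in the -1 cluster
avg_probs = np.array([0.9, 0.4, 0.0])  # average outlier probability per document

# Min-max normalize the counts, then average the two components
normalized_counts = (counts - counts.min()) / (counts.max() - counts.min())
unrelatedness = (normalized_counts + avg_probs) / 2
print(unrelatedness)  # [0.95 0.3  0.  ] -- the first document is the clearest outlier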
Here is the initial code:
import numpy as np
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
representation_model = MaximalMarginalRelevance(diversity=0.2)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Collect text and filenames from the files in the txt_files directory
documents = []
filenames = []

for file in os.listdir('txt_files'):
    if file.endswith('.txt'):
        with open(os.path.join('txt_files', file), 'r', encoding='utf-8') as f:
            documents.append(f.read())
            filenames.append(file)
In this code block, we set up the required tools for BERTopic by importing the necessary libraries and initializing the models. We define three models that will be used by BERTopic:
- vectorizer_model: the CountVectorizer model tokenizes the documents and creates a document-term matrix where each entry represents the count of a term in a document. It also removes English stop words from the documents to improve topic modeling performance (see the small sketch after this list).
- representation_model: the MaximalMarginalRelevance (MMR) model diversifies the extracted topics by considering both the relevance and the diversity of topics. The diversity parameter controls the trade-off between these two aspects, with higher values leading to more diverse topics.
- ctfidf_model: the ClassTfidfTransformer model adjusts the term frequency-inverse document frequency (TF-IDF) scores of the document-term matrix to better represent topics. It reduces the influence of words that occur frequently across topics and enhances the distinction between topics.
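As a quick illustration of what the stop-word removal buys us, here’s a tiny standalone sketch with an invented two-document corpus (not part of the article’s pipeline):

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus to show the document-term matrix that CountVectorizer builds
toy_docs = ["the chatbot answers the questions", "the party needs a cake"]
vec = CountVectorizer(stop_words="english")
matrix = vec.fit_transform(toy_docs)

print(vec.get_feature_names_out())  # ['answers' 'cake' 'chatbot' 'needs' 'party' 'questions']
print(matrix.toarray())             # per-document term counts; "the" and "a" are gone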
We then collect the text and filenames of the documents from the txt_files directory so we can process them with BERTopic in the next step.
def extract_topics(docs, n_topics):
    model = BERTopic(nr_topics=n_topics, calculate_probabilities=True, language="english",
                     ctfidf_model=ctfidf_model, representation_model=representation_model,
                     vectorizer_model=vectorizer_model)
    topics, probabilities = model.fit_transform(docs)
    return model, topics, probabilities

def find_outlier_topic(model):
    topic_sizes = model.get_topic_freq()
    outlier_topic = topic_sizes.iloc[-1]["Topic"]
    return outlier_topic

outlier_counts = np.zeros(len(documents))
outlier_probs = np.zeros(len(documents))

# Define the range of topic counts you want to try
min_topics = 5
max_topics = 10

for n_topics in range(min_topics, max_topics + 1):
    model, topics, probabilities = extract_topics(documents, n_topics)
    outlier_topic = find_outlier_topic(model)

    for i, (topic, prob) in enumerate(zip(topics, probabilities)):
        if topic == outlier_topic:
            outlier_counts[i] += 1
            outlier_probs[i] += prob[outlier_topic]
In the section above, we use BERTopic to identify outlier documents by iterating through a range of topic counts, from a specified minimum to a maximum. For each topic count, BERTopic extracts the topics and their corresponding probabilities. It then identifies the outlier topic and updates outlier_counts and outlier_probs for the documents assigned to that topic. This process iteratively accumulates counts and probabilities, providing a measure of how often and how “strongly” documents are classified as outliers.
Finally, we can compute our unrelatedness score and print the results:
def normalize(arr):
    min_val, max_val = np.min(arr), np.max(arr)
    return (arr - min_val) / (max_val - min_val)

# Average the probabilities
avg_outlier_probs = np.divide(outlier_probs, outlier_counts, out=np.zeros_like(outlier_probs), where=outlier_counts != 0)

# Normalize the counts
normalized_counts = normalize(outlier_counts)

# Compute the combined unrelatedness score by averaging the normalized counts and probabilities
unrelatedness_scores = [(i, (count + prob) / 2) for i, (count, prob) in enumerate(zip(normalized_counts, avg_outlier_probs))]
unrelatedness_scores.sort(key=lambda x: x[1], reverse=True)

# Print the filtered results
for index, score in unrelatedness_scores:
    if score > 0:
        title = filenames[index]
        preview = documents[index][:100] + "..." if len(documents[index]) > 100 else documents[index]
        print(f"Title: {title}, Preview: {preview}, Unrelatedness: {score:.2f}")
        print("\n")
And that’s it! Here you have your list of outlier documents, ranked by unrelatedness. By cleaning up your Confluence spaces and removing irrelevant content, you can pave the way for a more efficient and helpful chatbot that leverages your organization’s knowledge. Happy cleaning!
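If you want to act on the ranking directly, one possible follow-up is to label the worst offenders in Confluence for manual review. This is just a sketch under a few assumptions: the 0.8 threshold is arbitrary, the page id is recovered from the filename suffix we created earlier, and I’m assuming atlassian-python-api’s set_page_label call for adding labels:

# Sketch: flag the strongest outliers for manual review (threshold chosen arbitrarily)
for index, score in unrelatedness_scores:
    if score > 0.8:
        # Filenames were saved as {sanitized_title}_{page_id}.txt, so recover the id
        page_id = filenames[index].rsplit('_', 1)[1].replace('.txt', '')
        confluence.set_page_label(page_id, 'cleanup-candidate')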