The work is done in a Google Colab Pro notebook with a V100 GPU and a High-RAM environment for the steps involving the LLM. The notebook is divided into self-contained sections, most of which can be executed independently, minimizing the dependency on earlier steps. Data is saved after each section, so you can continue in a new session if needed. Additionally, the parsed dataset and the Python modules are available in this Github repository.
I use a subset of the arXiv Dataset that is openly available on the Kaggle platform and primarily maintained by Cornell University. In a machine-readable format, it contains a repository of 1.7 million scholarly papers across STEM, with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more. It is updated regularly.
The dataset is clean and in an easy-to-use format, so we can focus on our task without spending too much time on data preprocessing. To further simplify the data preparation process, I built a Python module that performs the relevant steps. It can be found at utils/arxiv_parser.py if you want to take a peek at the code; otherwise, follow along with the Google Colab:
- download the zipped arXiv file (1.2 GB) to the directory of your choice, labelled data_path,
- download arxiv_parser.py to the directory utils,
- import and initialize the module in your Google Colab notebook,
- unzip the file; this extracts a 3.7 GB file: archive-metadata-oai-snapshot.json,
- specify a general topic (I work with cs, which stands for computer science), so you have a more manageable data size,
- choose the features to keep (there are 14 features in the downloaded dataset),
- the abstracts can vary in length quite a bit, so I added the option of selecting entries for which the number of tokens in the abstract lies in a given interval, and I used this feature to downsize the dataset,
- although I choose to work with the title feature, there is an option to take the more common approach of concatenating the title and the abstract into a single feature denoted corpus.
# Import the data parser module
from utils.arxiv_parser import *

# Initialize the data parser
parser = ArXivDataProcessor(data_path)

# Unzip the downloaded file to extract a json file in data_path
parser.unzip_file()

# Select a topic and extract the articles on that topic
topic = 'cs'
entries = parser.select_topic('cs')

# Build a pandas dataframe with specified options
df = parser.select_articles(entries, # extracted articles
                            cols=['id', 'title', 'abstract'], # features to keep
                            min_length=100, # min tokens an abstract should have
                            max_length=120, # max tokens an abstract should have
                            keep_abs_length=False, # do not keep the abs_length column
                            build_corpus=False) # do not build a corpus column

# Save the selected data to a csv file 'selected_{topic}.csv', uses data_path
parser.save_selected_data(df, topic)
With the options above I extract a dataset of 983 computer science articles. We are ready to move to the next step.
If you want to skip the data processing steps, you may use the cs dataset, available in the Github repository.
The Method
KeyBERT is a method that extracts keywords or keyphrases from text. It uses document and word embeddings to find the sub-phrases that are most similar to the document, via cosine similarity. KeyLLM is another minimal method for keyword extraction, but it is based on LLMs. Both methods are developed and maintained by Maarten Grootendorst.
The two methods can be combined for enhanced results. Keywords extracted with KeyBERT are fine-tuned through KeyLLM. Conversely, candidate keywords identified through traditional NLP techniques help ground the LLM, minimizing the generation of undesired outputs.
For details on different ways of using KeyLLM see Maarten Grootendorst, Introducing KeyLLM — Keyword Extraction with LLMs.
Use KeyBERT [source] to extract keywords from each document — these are the candidate keywords provided to the LLM to fine-tune:
- documents are embedded using Sentence Transformers to build a document-level representation,
- word embeddings are extracted for N-gram words/phrases,
- cosine similarity is used to find the words or phrases that are most similar to each document.
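As an illustration of what this step produces on its own, here is a minimal KeyBERT-only sketch; the sample document and the n-gram range are illustrative choices, not values taken from this project:
# Minimal KeyBERT example (illustrative, not part of the project pipeline)
from keybert import KeyBERT

doc = "Semantics and Termination of Simply-Moded Logic Programs with Dynamic Scheduling"

# Instantiate KeyBERT with a Sentence Transformers backbone
kw_model = KeyBERT(model="all-MiniLM-L6-v2")

# Extract up to five candidate keywords/keyphrases of one or two words
candidates = kw_model.extract_keywords(doc,
                                       keyphrase_ngram_range=(1, 2),
                                       stop_words="english",
                                       top_n=5)
print(candidates)  # list of (keyword, cosine similarity) pairs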
Use KeyLLM [source] to fine-tune the keywords extracted by KeyBERT via text generation with transformers [source]:
- the community detection method in Sentence Transformers [source] groups similar documents, so we extract keywords only from one document in each group,
- the candidate keywords are provided to the LLM, which fine-tunes the keywords for each cluster.
Besides Sentence Transformers, KeyBERT supports other embedding models, see [here].
Sentence Transformers facilitate community detection by using a specified threshold. When documents lack inherent clusters, clear groupings may not emerge. In my case, out of 983 titles, roughly 800 distinct communities were identified. More naturally clustered data tends to yield better-defined communities.
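To get a feel for how this grouping behaves on its own, here is a small sketch calling the community detection utility from Sentence Transformers directly; the titles, the threshold, and the minimum community size are illustrative assumptions, and KeyLLM performs the equivalent step internally:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import community_detection

# Illustrative titles, not taken from the dataset
titles = ["A Survey of Graph Neural Networks",
          "Graph Neural Networks: A Review of Methods",
          "Termination of Logic Programs with Dynamic Scheduling"]

st_model = SentenceTransformer("all-mpnet-base-v2")
title_embeddings = st_model.encode(titles, convert_to_tensor=True)

# Each community is a list of title indices; similar titles end up together
communities = community_detection(title_embeddings, threshold=0.5, min_community_size=1)
print(communities)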
The Large Language Model
After experimenting with several smaller LLMs, I chose Zephyr-7B-Beta for this project. This model is based on Mistral-7B, and it is one of the first models fine-tuned with Direct Preference Optimization (DPO). It not only outperforms other models in its class but also surpasses Llama2-70B on some benchmarks. For more insights on this LLM take a look at Benjamin Marie, Zephyr 7B Beta: A Good Teacher Is All You Need. Although it is feasible to use the model directly on a Google Colab Pro, I opted to work with a GPTQ quantized version prepared by TheBloke.
Start by downloading the model and its tokenizer, following the instructions provided in the model card:
# Required installs
!pip install transformers optimum accelerate
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

# Required imports
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Load the model and the tokenizer
model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"
llm = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                           device_map="auto",
                                           trust_remote_code=False,
                                           revision="main") # change revision for a different branch

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                          use_fast=True)
Additionally, build the text generation pipeline:
generator = pipeline(
    model=llm,
    tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1,
)
The Keyword Extraction Prompt
Experimentation is key at this step. Finding the optimal prompt requires some trial and error, and the performance depends on the chosen model. Let's not forget that LLMs are probabilistic, so it is not guaranteed that they will return the same output every time. To develop the prompt below, I relied on experimentation and on the prompt format that Zephyr-7B-Beta expects:
immediate = "Inform me about AI"
prompt_template=f'''<|system|>
</s>
<|consumer|>
{immediate}</s>
<|assistant|>
'''
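A quick sanity check of the pipeline with this template might look as follows; the output is of course model- and sampling-dependent:
# Run the generator on the template above and inspect the raw completion
output = generator(prompt_template)
print(output[0]["generated_text"])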
And here is the prompt I use to fine-tune the keywords extracted with KeyBERT:
prompt_keywords= """
<|system|>
I've the next doc:
Semantics and Termination of Merely-Moded Logic Packages with Dynamic Scheduling
and 5 candidate key phrases:
scheduling, logic, semantics, termination, modedBased mostly on the knowledge above, extract the key phrases or the keyphrases that greatest describe the subject of the textual content.
Observe the necessities under:
1. Be sure that to extract solely the key phrases or keyphrases that seem within the textual content.
2. Present 5 key phrases or keyphrases! Don't quantity or label the key phrases or the keyphrases!
3. Don't embrace the rest moreover the key phrases or the keyphrases! I repeat don't embrace any feedback!
semantics, termination, simply-moded, logic packages, dynamic scheduling</s>
<|consumer|>
I've the next doc:
[DOCUMENT]
and 5 candidate key phrases:
[CANDIDATES]
Based mostly on the knowledge above, extract the key phrases or the keyphrases that greatest describe the subject of the textual content.
Observe the necessities under:
1. Be sure that to extract solely the key phrases or keyphrases that seem within the textual content.
2. Present 5 key phrases or keyphrases! Don't quantity or label the key phrases or the keyphrases!
3. Don't embrace the rest moreover the key phrases or the keyphrases! I repeat don't embrace any feedback!</s>
<|assistant|>
"""
Keyword Extraction and Parsing
We now have everything needed to proceed with the keyword extraction. Let me remind you that I work with the titles, so the input documents are short, staying well within the token limits for the BERT embeddings.
Start by creating a TextGeneration pipeline wrapper for the LLM and instantiate KeyBERT. Choose the embedding model. If no embedding model is specified, the default is all-MiniLM-L6-v2. In this case, I select the highest-performing pretrained model for sentence embeddings, see here for a complete list.
# Install the required packages
!pip install keybert
!pip install sentence-transformers

# The required imports
from keybert.llm import TextGeneration
from keybert import KeyLLM, KeyBERT
from sentence_transformers import SentenceTransformer

# KeyBERT TextGeneration pipeline wrapper
llm_tg = TextGeneration(generator, prompt=prompt_keywords)

# Instantiate KeyBERT and specify an embedding model
kw_model = KeyBERT(llm=llm_tg, model="all-mpnet-base-v2")
Recall that the dataset was prepared and saved as a pandas dataframe df. To process the titles, just call the extract_keywords method:
# Retain the article titles only for analysis
titles_list = df.title.tolist()

# Process the documents and collect the results
titles_keys = kw_model.extract_keywords(titles_list, threshold=0.5)

# Add the results to df
df["titles_keys"] = titles_keys
The threshold parameter determines the minimum similarity required for documents to be grouped into the same community. A higher value will group nearly identical documents, while a lower value will cluster documents covering similar topics. The choice of embeddings significantly influences the appropriate threshold, so it is advisable to consult the model card for guidance. I am grateful to Maarten Grootendorst for highlighting this aspect, as can be seen here.
It is important to note that my observations apply only to sentence transformers, as I have not experimented with other types of embeddings.
Let's take a look at some outputs:
Comments:
- In the second example provided here, we observe keywords or keyphrases that are not present in the original text. If this poses a problem in your case, consider enabling check_vocab=True as done [here]. However, keep in mind that these results are highly influenced by the choice of LLM, with quantization having a minor effect, as well as by the construction of the prompt.
- With longer input documents, I noticed more deviations from the required output.
- One consistent observation is that the number of keywords extracted often deviates from five. It is common to encounter titles with fewer extracted keywords, especially when the input is brief. Conversely, some titles yield as many as 10 extracted keywords. Let's examine the distribution of keyword counts for this run:
These variations complicate the subsequent parsing steps. There are a few options for addressing this: we could investigate these cases in detail, ask the model to revise and either trim or repeat the keywords, or simply ignore these instances and focus solely on titles with exactly five keywords, as I decided to do for this project.
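A minimal sketch of this filtering step in pandas; the file name parsed_keys_file is an assumption chosen to match the loading code used later:
# Keep only the titles for which exactly five keywords/keyphrases were extracted
df5 = df[df["titles_keys"].apply(len) == 5].reset_index(drop=True)

# Save for the clustering step (assumed file name)
# note: list-valued columns are stored as strings in csv and may need ast.literal_eval when reloaded
parsed_keys_file = "parsed_keys.csv"
df5.to_csv(data_path + parsed_keys_file, index=False)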
The next step is to cluster the keywords and keyphrases to reveal common topics across articles. To accomplish this I use two algorithms: UMAP for dimensionality reduction and HDBSCAN for clustering.
The Algorithms: HDBSCAN and UMAP
Hierarchical Density-Based Spatial Clustering of Applications with Noise, or HDBSCAN, is a highly performant unsupervised algorithm designed to find patterns in the data. It finds the optimal clusters based on their density and proximity. This is especially useful in cases where the number and shape of the clusters may be unknown or difficult to determine.
The results of the HDBSCAN clustering algorithm can vary if you run the algorithm multiple times with the same hyperparameters. This is because HDBSCAN is a stochastic algorithm, which means that it involves some degree of randomness in the clustering process. Specifically, HDBSCAN uses a random initialization of the cluster hierarchy, which can result in different cluster assignments each time the algorithm is run.
However, the degree of variation between different runs of the algorithm can depend on several factors, such as the dataset, the hyperparameters, and the seed value used for the random number generator. In some cases the variation may be minimal, while in other cases it can be significant.
There are two clustering options with HDBSCAN.
- Hard clustering assigns each data point to a cluster or labels it as noise. This is a hard assignment; there are no mixed memberships. This approach might result in one large cluster categorized as noise (cluster labelled -1) and numerous smaller clusters. Fine-tuning the hyperparameters is crucial [see here], as is selecting an embedding model specifically tailored for the domain. Take a look at the related Google Colab for the results of hard clustering on the project's dataset.
- Soft clustering, on the other hand, is a newer feature of the HDBSCAN library. In this approach points are not assigned cluster labels, but instead they are assigned a vector of probabilities. The length of the vector is equal to the number of clusters found. The probability value at each entry of the vector is the probability that the point is a member of that cluster. This allows points to potentially be a mixture of clusters. If you want to better understand how soft clustering works please refer to How Soft Clustering for HDBSCAN Works. This approach is better suited to the present project, since it generates a larger set of clusters of comparable sizes.
While HDBSCAN can perform well on low to medium dimensional data, the performance tends to decrease significantly as dimension increases. Typically HDBSCAN performs best on up to around 50 dimensional data, [see here].
Documents for clustering are typically embedded using an efficient transformer from the BERT family, resulting in a several-hundred-dimensional data set.
To reduce the dimension of the embedding vectors we use UMAP (Uniform Manifold Approximation and Projection), a non-linear dimension reduction algorithm and the best performing in its class. It seeks to learn the manifold structure of the data and to find a low-dimensional embedding that preserves the essential topological structure of that manifold.
UMAP has been shown to be highly effective at preserving the overall structure of high-dimensional data in lower dimensions, while also providing superior performance compared to other popular algorithms such as t-SNE and PCA.
Keyword Clustering
- Install and import the required packages and libraries.
# Required installs
!pip install umap-learn
!pip install hdbscan
!pip install -U sentence-transformers

# General imports
import pandas as pd
import numpy as np
import re
import pickle

# Imports needed to generate the BERT embeddings
from sentence_transformers import SentenceTransformer

# Library for dimensionality reduction
import umap.umap_ as umap

# Import the clustering algorithm
import hdbscan
- Prepare the dataset by aggregating all keywords and keyphrases from each title's individual quintet into a single list of unique keywords, and save it as a pandas dataframe.
# Load the data if needed - titles with 5 extracted keywords
df5 = pd.read_csv(data_path+parsed_keys_file)

# Create a list of all sublists of keywords and keyphrases
df5_keys = df5.titles_keys.tolist()

# Flatten the list of sublists
flat_keys = [item for sublist in df5_keys for item in sublist]

# Create a list of unique keywords
flat_keys = list(set(flat_keys))

# Create a dataframe with the distinct keywords
keys_df = pd.DataFrame(flat_keys, columns = ['key'])
I obtain almost 3,000 unique keywords and keyphrases from the 884 processed titles. Here is a sample: n-colorable graphs, experiments, constraints, tree structure, complexity, etc.
- Generate 768-dimensional embeddings with Sentence Transformers.
# Instantiate the embedding model
model = SentenceTransformer('all-mpnet-base-v2')

# Embed the keywords and keyphrases into a 768-dim real vector space
keys_df['key_bert'] = keys_df['key'].apply(lambda x: model.encode(x))
- Perform dimensionality reduction with UMAP.
# Reduce to 10-dimensional vectors and keep the local neighborhood at 15
embeddings = umap.UMAP(n_neighbors=15, # Balances local vs. global structure.
                       n_components=10, # Dimension of reduced vectors
                       metric='cosine').fit_transform(list(keys_df.key_bert))

# Add the reduced embedding vectors to the dataframe
keys_df['key_umap'] = embeddings.tolist()
- Cluster the 10-dimensional vectors with HDBSCAN. To keep this blog succinct, I will omit descriptions of the parameters that pertain more to hard clustering. For detailed information on each parameter, please refer to [Parameter Selection for HDBSCAN*].
# Initialize the clustering model
clusterer = hdbscan.HDBSCAN(algorithm='best',
                            prediction_data=True,
                            approx_min_span_tree=True,
                            gen_min_span_tree=True,
                            min_cluster_size=20,
                            cluster_selection_epsilon=.1,
                            min_samples=1,
                            p=None,
                            metric='euclidean',
                            cluster_selection_method='leaf')

# Fit the data
clusterer.fit(embeddings)

# Create soft clusters
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)

# Add the soft cluster information to the data
closest_clusters = [np.argmax(x) for x in soft_clusters]
keys_df['cluster'] = closest_clusters
Below is the distribution of keywords across clusters. Examination of the spread of keywords and keyphrases across the soft clusters reveals a total of 60 clusters, with a fairly even distribution of elements per cluster, varying from about 20 to nearly 100.
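A quick way to inspect this distribution is a simple value count on the cluster column; a sketch:
# Number of keywords and keyphrases assigned to each soft cluster
cluster_sizes = keys_df["cluster"].value_counts().sort_index()
print(cluster_sizes.describe())  # summary statistics of the cluster sizes
print(cluster_sizes.sort_values(ascending=False).head())  # largest clusters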
Having clustered the keywords, we are now ready to use GenAI once more to enhance and refine our findings. At this step, we use an LLM to analyze each cluster, summarize the keywords and keyphrases, and assign a short label to the cluster.
While it is not necessary, I choose to continue with the same LLM, Zephyr-7B-Beta. Should you need to download the model, please consult the relevant section. Notably, I adjust the prompt to suit the distinct nature of this task.
The following function is designed to extract a label and a description for a cluster, parse the output, and integrate it into a pandas dataframe.
def extract_description(df: pd.DataFrame,
                        n: int
                        ) -> pd.DataFrame:
    """
    Use a custom prompt to send to a LLM
    to extract labels and descriptions for a list of keywords.
    """
    one_cluster = df[df['cluster']==n]
    one_cluster_copy = one_cluster.copy()
    sample = one_cluster_copy.key.tolist()

    prompt_clusters= f"""
<|system|>
I have the following list of keywords and keyphrases:
['encryption','attribute','firewall','security properties',
'network security','reliability','surveillance','distributed risk factors',
'still vulnerable','cryptographic','protocol','signaling','safe',
'adversary','message passing','input-determined guards','secure communication',
'vulnerabilities','value-at-risk','anti-spam','intellectual property rights',
'countermeasures','security implications','privacy','protection',
'mitigation strategies','vulnerability','secure networks','guards']
Based on the information above, first name the domain these keywords or keyphrases
belong to, secondly give a brief description of the domain.
Do not use more than 30 words for the description!
Do not provide details!
Do not give examples of the contexts, do not say 'such as' and do not list the keywords
or the keyphrases!
Do not start with a statement of the form 'These keywords belong to the domain of' or
with 'The domain'.
Cybersecurity: Cybersecurity, emphasizing methods and strategies for safeguarding digital information
and networks against unauthorized access and threats.
</s>
<|user|>
I have the following list of keywords and keyphrases:
{sample}
Based on the information above, first name the domain these keywords or keyphrases belong to, secondly
give a brief description of the domain.
Do not use more than 30 words for the description!
Do not provide details!
Do not give examples of the contexts, do not say 'such as' and do not list the keywords or the keyphrases!
Do not start with a statement of the form 'These keywords belong to the domain of' or with 'The domain'.
<|assistant|>
"""
    # Generate the outputs
    outputs = generator(prompt_clusters,
                        max_new_tokens=120,
                        do_sample=True,
                        temperature=0.1,
                        top_k=10,
                        top_p=0.95)
    text = outputs[0]["generated_text"]

    # Pattern that marks the start of the LLM response
    pattern = "<|assistant|>\n"

    # Extract the output
    response = text.split(pattern, 1)[1].strip(" ")

    # Check if the output has the desired Label: Description format
    if len(response.split(":", 1)) == 2:
        label = response.split(":", 1)[0].strip(" ")
        description = response.split(":", 1)[1].strip(" ")
    else:
        label = description = response

    # Add the description and the label to the dataframe
    one_cluster_copy.loc[:, 'description'] = description
    one_cluster_copy.loc[:, 'label'] = label

    return one_cluster_copy
Now we can apply the above function to each cluster and collect the results:
import re
import pandas as pd

# Initialize an empty list to store the cluster dataframes
dataframes = []
clusters = len(set(keys_df.cluster))

# Iterate over the range of cluster values
for n in range(clusters):
    df_result = extract_description(keys_df, n)
    dataframes.append(df_result)

# Concatenate the individual dataframes
final_df = pd.concat(dataframes, ignore_index=True)
Let's take a look at a sample of outputs. For the full list of outputs please refer to the Google Colab.
We must keep in mind that LLMs, with their inherent probabilistic nature, can be unpredictable. While they generally adhere to instructions, their compliance is not absolute. Even slight alterations in the prompt or the input text can lead to substantial variations in the output. In the extract_description() function, I incorporated a feature that logs the response in both the label and description columns in those cases where the Label: Description format is not followed, as illustrated by the irregular output for cluster 7 above. The outputs for the entire set of 60 clusters are available in the accompanying Google Colab notebook.
A second observation is that each cluster is parsed independently by the LLM, so it is possible to get repeated labels. Additionally, there may be instances of recurring keywords extracted from the input list.
The effectiveness of the method relies heavily on the choice of LLM; issues are minimal with a highly performant LLM. The output also depends on the quality of the keyword clustering and the presence of an inherent topic within the cluster.
Strategies to mitigate these challenges depend on the cluster count, the dataset characteristics, and the required accuracy for the project. Here are two options:
- Manually rectify each issue, as I did in this project. With only 60 clusters and merely three inaccurate outputs, manual adjustments were made to correct the faulty outputs and to ensure unique labels for each cluster; a short sketch for flagging these cases follows below.
- Employ an LLM to make the corrections, although this method does not guarantee flawless results.
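For the manual route, a minimal sketch (using the final_df built above) can flag both kinds of issues for review:
# Clusters where the 'Label: Description' format was not followed
# (extract_description stores the raw response in both columns in that case)
faulty = final_df[final_df["label"] == final_df["description"]]["cluster"].unique()
print("Clusters needing manual review:", faulty)

# Labels that were assigned to more than one cluster
label_counts = final_df.groupby("label")["cluster"].nunique()
print("Repeated labels:")
print(label_counts[label_counts > 1])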
Data to Load into the Graph
There are two csv files (or pandas dataframes if working in a single session) to extract the data from:
- articles – contains a unique id for each article, the title, the abstract, and titles_keys, which is the list of five extracted keywords or keyphrases;
- keywords – with columns key, cluster, description and label, where key contains a complete list of unique keywords or keyphrases, and the remaining features describe the cluster the keyword belongs to.
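If you are resuming in a fresh session, a sketch for reloading the two dataframes could look like this; the file names are assumptions, so adjust them to whatever you used when saving:
import pandas as pd
from ast import literal_eval

# Assumed file names for the two saved csv files
articles = pd.read_csv(data_path + "articles.csv")
keywords = pd.read_csv(data_path + "keywords.csv")

# List-valued columns are stored as strings in csv and need to be parsed back
articles["titles_keys"] = articles["titles_keys"].apply(literal_eval)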
Neo4j Connection
To build a knowledge graph, we start with setting up a Neo4j instance, choosing from options like Sandbox, AuraDB, or Neo4j Desktop. For this project, I am using AuraDB's free version. It is straightforward to launch a blank instance and download its credentials.
Next, establish a connection to Neo4j. For convenience, I use a custom Python module, which can be found at [utils/neo4j_conn.py](https://github.com/SolanaO/Blogs_Content/blob/master/keyllm_neo4j/utils/neo4j_conn.py). This module contains methods for connecting and interacting with the graph database.
# Install neo4j
!pip install neo4j

# Import the connector
from utils.neo4j_conn import *

# Graph DB instance credentials
URI = 'neo4j+ssc://xxxxxx.databases.neo4j.io'
USER = 'neo4j'
PWD = 'your_password_here'

# Establish the connection to the Neo4j instance
graph = Neo4jGraph(url=URI, username=USER, password=PWD)
The graph we are about to build has a simple schema consisting of three nodes and two relationships:
Building the graph now is straightforward with just two Cypher queries:
# Load Keyword and Topic nodes, and the relationships HAS_TOPIC
query_keywords_topics = """
UNWIND $rows AS row
MERGE (k:Keyword {name: row.key})
MERGE (t:Topic {cluster: row.cluster, description: row.description, label: row.label})
MERGE (k)-[:HAS_TOPIC]->(t)
"""
graph.load_data(query_keywords_topics, keywords)

# Load Article nodes and the relationships HAS_KEY
query_articles = """
UNWIND $rows AS row
MERGE (a:Article {id: row.id, title: row.title, abstract: row.abstract})
WITH a, row
UNWIND row.titles_keys AS key
MATCH (k:Keyword {name: key})
MERGE (a)-[:HAS_KEY]->(k)
"""
graph.load_data(query_articles, articles)
Query the Graph
Let's check the distribution of the nodes and relationships by type:
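The read queries themselves are not shown as code in this section, so here is a hedged sketch using the official Neo4j Python driver installed above; the same Cypher can also be run directly in the Neo4j Browser:
from neo4j import GraphDatabase

driver = GraphDatabase.driver(URI, auth=(USER, PWD))

node_counts_query = """
MATCH (n)
RETURN labels(n)[0] AS node_type, count(*) AS total
ORDER BY total DESC
"""

rel_counts_query = """
MATCH ()-[r]->()
RETURN type(r) AS rel_type, count(*) AS total
ORDER BY total DESC
"""

with driver.session() as session:
    print(session.run(node_counts_query).data())
    print(session.run(rel_counts_query).data())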
We can find which individual topics (or clusters) are the most popular among our collection of articles, by counting the cumulative number of articles associated with the keywords they are linked to:
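A possible Cypher query for this, given the schema above (run it in the Neo4j Browser or with the driver from the previous sketch):
popular_topics_query = """
MATCH (a:Article)-[:HAS_KEY]->(:Keyword)-[:HAS_TOPIC]->(t:Topic)
RETURN t.label AS topic, count(DISTINCT a) AS n_articles
ORDER BY n_articles DESC
LIMIT 10
"""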
Here is a snapshot of the node Semantics, which corresponds to cluster 58, and its linked keywords:
We can also identify commonly occurring words in titles, using the query below:
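A sketch of such a query, splitting each title on whitespace and dropping very short tokens (the length cutoff is an arbitrary choice):
common_words_query = """
MATCH (a:Article)
UNWIND split(toLower(a.title), ' ') AS word
WITH word
WHERE size(word) > 3
RETURN word, count(*) AS freq
ORDER BY freq DESC
LIMIT 20
"""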
We saw how we can structure and enrich a collection of seemingly unrelated short text entries. Using traditional NLP and machine learning, we first extract keywords and then we cluster them. These results guide and ground the refinement process performed by Zephyr-7B-Beta. While some oversight of the LLM is still necessary, the initial output is significantly enriched. A knowledge graph is used to reveal the newly discovered connections in the corpus.
Our key takeaway is that no single method is perfect. However, by strategically combining different techniques, acknowledging their strengths and weaknesses, we can achieve superior results.
Google Colab Notebook and Code
Data
Technical Documentation
Blogs and Articles
- Maarten Grootendorst, Introducing KeyLLM — Keyword Extraction with LLMs, Towards Data Science, Oct 5, 2023.
- Benjamin Marie, Zephyr 7B Beta: A Good Teacher Is All You Need, Towards Data Science, Nov 10, 2023.
- The H4 Team, Zephyr: Direct Distillation of LM Alignment, Technical Report, arXiv: 2310.16944, Oct 25, 2023.