LLMs like GPT-4 have taken the world by storm, and one of the many tasks that generative text models are particularly good at is summarizing long texts such as books or podcast transcripts. However, the conventional way of getting LLMs to summarize long texts is actually fundamentally flawed. In this post, I'll tell you about the problems with current summarization methods, and present a better summarization method that actually takes the structure of the text into account! Even better, this method will also give us the text's main topics, killing two birds with one stone!
I'll walk you through how you can easily implement this in Python, with just a few tweaks of the existing method. This is the method that we use at Podsmart, our newly-launched AI-powered podcast summarizer app that helps busy intellectuals save hours of listening.
Problems with existing solutions
The canonical way to summarize long texts is recursive summarization, in which the long text is split evenly into shorter chunks that can fit inside the LLM's context window. Each chunk is summarized, the summaries are concatenated together and then passed through GPT-3 to be further summarized. This process is repeated until one obtains a final summary of the desired length.
However, the major drawback is that current implementations, e.g. LangChain's summarize chain using map_reduce, split the text into chunks with no regard for the logical and structural flow of the text.
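To make the critique concrete, here is a minimal sketch of that conventional approach using LangChain's map_reduce summarize chain (2023-era API). Note that long_text is a placeholder for the full input text, and the splitter cuts purely by length with no regard for structure.

from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

# Split the text purely by length, ignoring its logical structure
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = [Document(page_content=c) for c in splitter.split_text(long_text)]  # long_text: the full input text

# Summarize each chunk, then summarize the concatenated chunk summaries
llm = OpenAI(temperature=0)
chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = chain.run(docs)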
For example, if the article is 1000 words long, a chunk size of 200 would mean we get 5 chunks. What if the author has several main points, the first of which takes up the first 250 words? The last 50 words would be placed into the second chunk along with text from the author's next point, and passing this chunk through GPT-3's summarizer could cause potentially important information from the first point to be omitted. Also, some key points may be longer than others, and there's no way of knowing this a priori.
Another method is the 'refine' method, which passes each chunk of text, together with a summary of the previous chunks, through the LLM, which progressively refines the summary as it sees more of the text (see the prompt here). However, the sequential nature of the process means that it cannot be parallelized and takes linear time, far longer than a recursive method, which takes logarithmic time. Furthermore, intuition suggests that the meaning of the initial parts will be overrepresented in the final summary. For podcast transcripts where the first minutes are advertisements completely irrelevant to the rest of the podcast, this is a stumbling block. Hence, this method is not widely used.
Even as more advanced language models come out with longer context windows, they will still be woefully inadequate for many summarization use cases (entire books), so some chunking and recursive summarization is inevitably necessary.
In essence, if the summarization process doesn't acknowledge the text's hierarchy of meaning and isn't compatible with it, the resulting summary is unlikely to be good enough to accurately convey the author's intended meaning.
A Better Way Forward
A better solution is to tackle the summarization and topic modelling process together in the same algorithm. Here, we split the summary outputs from one step of the recursive summarization into the chunks to be fed into the next step. We achieve this by clustering chunks semantically into topics and passing the topics into the next iteration of the summarization. Let me walk you through how we can implement this in Python!
Requirements
Python packages:
- scipy: for the cosine distance metric
- networkx: for the Louvain community detection algorithm
- langchain: a package with utility functions that let you call LLMs like OpenAI's GPT-3
Data and Preprocessing
The GitHub repository with the Jupyter notebook and data can be found here: https://github.com/thamsuppp/llm_summary_medium
The text we're summarizing today is the 2023 State of the Union speech by US President Joe Biden. The text file is in the GitHub repository, and here is the original source. The speech, like all US government publications, is in the public domain. Do note that it is important to make sure you're allowed to use the source text; Towards Data Science has published some helpful tips on checking for dataset copyrights and licenses.
We split the raw text into sentences, limiting sentences to a minimum length of 20 words and a maximum length of 80.
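The exact preprocessing code is in the linked notebook; a rough sketch of this sentence-splitting step might look like the following, where short sentences are merged and overly long runs are capped so that each piece falls between roughly 20 and 80 words.

import re

def split_into_sentences(text, min_words=20, max_words=80):
    # naive split on sentence-ending punctuation
    raw = re.split(r'(?<=[.!?])\s+', text)
    sentences, buffer = [], []
    for s in raw:
        buffer.extend(s.split())
        # emit pieces of at most max_words
        while len(buffer) >= max_words:
            sentences.append(' '.join(buffer[:max_words]))
            buffer = buffer[max_words:]
        # emit the buffer once it reaches the minimum length
        if len(buffer) >= min_words:
            sentences.append(' '.join(buffer))
            buffer = []
    if buffer:
        sentences.append(' '.join(buffer))
    return sentences

sentences = split_into_sentences(raw_text)  # raw_text: the speech transcript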
Creating Chunks
Instead of making chunks large enough to fit into a context window, I propose that the chunk size should be the number of sentences it typically takes to express a discrete idea. This is because we will later embed this chunk of text, essentially distilling its semantic meaning into a vector. I currently use 5 sentences (but you can experiment with other numbers). I use a 1-sentence overlap between chunks, just to ensure continuity so that each chunk carries some contextual information about the previous chunk. For the given text file, there are 65 chunks, with an average chunk length of 148 words, ranging from 46 to 197 words.
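Here is a minimal sketch of the chunking step under those settings, where sentences is the list produced by the preprocessing step above and chunks_text feeds into the next stage:

def create_chunks(sentences, sentences_per_chunk=5, overlap=1):
    # group sentences into chunks, with a one-sentence overlap between consecutive chunks
    chunks = []
    step = sentences_per_chunk - overlap
    for i in range(0, len(sentences), step):
        chunks.append(' '.join(sentences[i:i + sentences_per_chunk]))
        if i + sentences_per_chunk >= len(sentences):
            break
    return chunks

chunks_text = create_chunks(sentences, sentences_per_chunk=5, overlap=1)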
Getting Titles and Summaries for Each Chunk
Now, this is where I start deviating from LangChain's summarize chain.
Getting two for the price of one LLM call: title + summary
I wanted to get both an informative title as well as a summary of each chunk (the importance of the title will become clearer later). So I created a custom prompt, adapting LangChain's default summarize chain prompt. As you can see in map_prompt_template, text is a parameter that will be inserted into the prompt; this will be the original text of each chunk. I create an LLM, which is currently GPT-3, and create an LLMChain, which combines an LLM with the prompt template. Then, map_llm_chain.apply() calls GPT-3 with the prompt template and the text inputs inserted, returning titles and summaries for each chunk, which I parse into a dictionary output. Note that all chunks can be processed in parallel as they are independent of one another, leading to immense speed benefits.
You can use ChatGPT for a 10x cheaper price and similar performance; however, when I tried it, only the GPT-3 LLM ran the queries in parallel, whereas ChatGPT ran them one by one, which was painfully slow as I often pass in ~100 chunks at the same time. Running ChatGPT in parallel requires an async implementation.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

def summarize_stage_1(chunks_text):

    # Prompt to get title and summary for each chunk
    map_prompt_template = """Firstly, give the following text an informative title. Then, on a new line, write a 75-100 word summary of the following text:
    {text}

    Return your answer in the following format:
    Title | Summary...
    e.g.
    Why Artificial Intelligence is Good | AI can make humans more productive by automating many repetitive processes.

    TITLE AND CONCISE SUMMARY:"""

    map_prompt = PromptTemplate(template=map_prompt_template, input_variables=["text"])

    # Define the LLM (GPT-3) and the LLM chain
    map_llm = OpenAI(temperature=0, model_name='text-davinci-003')
    map_llm_chain = LLMChain(llm=map_llm, prompt=map_prompt)
    map_llm_chain_input = [{'text': t} for t in chunks_text]

    # Run the inputs through the LLM chain (runs in parallel)
    map_llm_chain_results = map_llm_chain.apply(map_llm_chain_input)

    # Parse the 'Title | Summary' responses into a list of dicts
    stage_1_outputs = parse_title_summary_results([e['text'] for e in map_llm_chain_results])

    return {
        'stage_1_outputs': stage_1_outputs
    }
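The notebook contains the actual parse_title_summary_results; a minimal version, assuming the model follows the requested 'Title | Summary' format, could look like this:

def parse_title_summary_results(results):
    # split each 'Title | Summary' response into its two parts
    outputs = []
    for r in results:
        r = r.replace('\n', ' ').strip()
        if '|' in r:
            title, summary = r.split('|', 1)
        else:
            title, summary = '', r  # fall back if the model ignores the format
        outputs.append({'title': title.strip(), 'summary': summary.strip()})
    return outputs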
Embedding Chunks and Clustering into Topics
After obtaining the summaries for each chunk, I embed them using OpenAI's embeddings into 1536-dimension vectors. The conventional recursive summarization method doesn't require embeddings, as it splits texts arbitrarily by length alone. For us, we aim to improve on that by grouping semantically similar chunks together into topics.
Grouping texts into topics is a well-studied problem in NLP, with many traditional methods, such as Latent Dirichlet Allocation, that predate the age of deep learning. I remember using LDA in 2017 to cluster newspaper articles for my college's newspaper: it was very slow to estimate, and only used word frequency, which doesn't capture semantic meaning.
Now, we can leverage OpenAI's embeddings-as-a-service API to obtain embeddings that capture the semantic meaning of sentences in a second. There are many other possible embedding models that could be used here, e.g. HuggingFace's sentence-transformers, which reportedly perform better than OpenAI's embeddings, but that involves downloading the model and running it on your own server.
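Here is what the embedding step can look like with LangChain's OpenAIEmbeddings wrapper (the underlying model, text-embedding-ada-002, returns 1536-dimensional vectors); stage_1_outputs is the list of title/summary dicts produced by summarize_stage_1 above.

import numpy as np
from langchain.embeddings import OpenAIEmbeddings

stage_1_outputs = summarize_stage_1(chunks_text)['stage_1_outputs']
chunk_summaries = [out['summary'] for out in stage_1_outputs]

# Embed each chunk summary into a 1536-dimensional vector
embedding_model = OpenAIEmbeddings()
summary_embeds = np.array(embedding_model.embed_documents(chunk_summaries))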
After obtaining the embedding vectors for the chunks, we group similar vectors together.
I create a chunk similarity matrix, where the (i,j)th entry denotes the cosine similarity between the embedding vectors of the ith and jth chunk, i.e. the semantic similarity between the chunks.
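With scipy's cosine distance, the similarity matrix can be built as follows (cosine similarity = 1 - cosine distance), using the summary_embeds array from the previous step:

import numpy as np
from scipy.spatial.distance import cosine

num_chunks = len(summary_embeds)
summary_similarity_matrix = np.zeros((num_chunks, num_chunks))
for i in range(num_chunks):
    for j in range(i, num_chunks):
        # cosine similarity between the ith and jth chunk embeddings
        similarity = 1 - cosine(summary_embeds[i], summary_embeds[j])
        summary_similarity_matrix[i, j] = similarity
        summary_similarity_matrix[j, i] = similarity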
We can view this as a similarity graph whose nodes are chunks, with edge weights being the similarity between two chunks. We use the Louvain community detection algorithm to detect topics from the chunks. This is because communities are defined in graph analysis as having dense intra-community connections and sparse inter-community connections, which is what we want: chunks within a topic should be very semantically similar to one another, while each chunk should be less semantically similar to chunks in other detected topics.
The Louvain community detection algorithm has a hyperparameter called resolution: smaller resolutions lead to smaller clusters. Additionally, I add a hyperparameter, proximity_bonus, which bumps up the similarity score of chunks whose positions in the original text are closer to each other. You can interpret this as treating the temporal structure of the text as a prior (i.e. chunks closer to each other are more likely to be semantically similar). I put this in to discourage the detected topics from containing chunks from all over the text, which is less plausible. The function also tries to minimize the variance in cluster sizes, preventing situations where one cluster has 1 chunk while another has 13.
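A simplified sketch of this topic-detection step is below; the proximity-bonus formula and the function name detect_topics are illustrative, and the notebook's version additionally rebalances cluster sizes.

import networkx as nx
from networkx.algorithms import community as nx_comm  # requires networkx >= 2.8

def detect_topics(similarity_matrix, proximity_bonus=0.2, resolution=1.0):
    num_chunks = similarity_matrix.shape[0]
    adjusted = similarity_matrix.copy()
    # boost similarity for chunks that are close together in the original text
    for i in range(num_chunks):
        for j in range(num_chunks):
            adjusted[i, j] += proximity_bonus * (1 - abs(i - j) / num_chunks)
    # build a weighted similarity graph and run Louvain community detection
    graph = nx.from_numpy_array(adjusted)
    communities = nx_comm.louvain_communities(graph, weight='weight', resolution=resolution)
    return communities  # a list of sets of chunk indices, one set per topic

topics = detect_topics(summary_similarity_matrix)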
For the State of the Union speech, the output is 10 clusters, which are nicely continuous.
The second image is the topic clustering for another podcast episode. As you can see, the beginning and end are detected as the same topic, which is common for podcasts with ads at the start and end of the episode. Some topics, like the purple one, are also discontinuous; it's good that our method allows for this, as a text can cycle back to an earlier-mentioned topic, and this is another possibility that conventional text-splitting fails to account for.
Topic Titles and Summaries
Now, we’ve subjects which might be semantically coherent that we are able to go into the subsequent step of the recursive summarization. For this instance, this would be the final step, however for for much longer texts like books, you’ll be able to think about repeating the method a number of instances till there are ~10 subjects left whose matter summaries can match into the context window.
The next step involves three different parts.
Topic Titles: For each topic, we have generated a list of titles for the chunks in that topic. We pass all topics' lists of titles into GPT-3 and ask it to aggregate the titles to arrive at one title per topic. We do this simultaneously for all topics to prevent the topics' titles from being too similar to one another. Previously, when I generated topic titles separately, GPT-3 had no context of the other topic titles, so there were cases where 4 out of 7 titles were 'Federal Reserve's Monetary Policy'. This is also why we needed to generate chunk titles: trying to fit all chunk summaries into the context window here may not be possible for very long texts.
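A sketch of this step is below; the prompt wording is illustrative rather than the exact prompt from the notebook, but the key point is that a single call sees every topic's chunk titles at once.

from langchain.llms import OpenAI

llm = OpenAI(temperature=0, model_name='text-davinci-003')

# Collect each topic's chunk titles (topics and stage_1_outputs come from the steps above)
title_groups = []
for topic_id, community in enumerate(topics):
    titles = [stage_1_outputs[i]['title'] for i in sorted(community)]
    title_groups.append(f"{topic_id + 1}. " + '; '.join(titles))

title_prompt = (
    "Below are numbered groups of related section titles from a long text. "
    "Write one short, informative title for each group, making the titles "
    f"distinct from one another. Return one title per line, numbered 1 to {len(title_groups)}.\n\n"
    + '\n'.join(title_groups)
)
topic_titles = llm(title_prompt)  # a single call covering every topic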
As you’ll be able to see beneath, the titles look good! Descriptive, but distinctive from one another.
1. Celebrating American Progress and Resilience
2. US Economy Strengthening and Inflation Reduction
3. Inflation Reduction Act: Lowering Health Care Costs
4. Confronting an Existential Threat: Making Big Corporations Pay
5. Junk Fee Prevention Act: Stopping Unfair Charges
6. COVID-19 Resilience and Vigilance
7. Combating Fraud and Public Safety
8. United States' Support for Ukraine and Global Peace
9. Progress Made in Healthcare and Gun Safety
10. United States of America: A Bright Future
Topic Summaries: Nothing new here; this involves combining the chunk summaries of each topic and asking GPT-3 to summarize them into a topic summary.
Final Summary: To arrive at the overall summary of the text, we once again concatenate the topic summaries together and prompt GPT-3 to summarize them.
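For completeness, here is a sketch of these last two steps (the prompts are illustrative, not the exact ones from the notebook), reusing the llm, topics and stage_1_outputs objects from above:

# Summarize each topic by combining its chunk summaries
topic_summaries = []
for community in topics:
    combined = ' '.join(stage_1_outputs[i]['summary'] for i in sorted(community))
    topic_summaries.append(
        llm(f"Write a concise summary of the following text:\n\n{combined}"))

# Combine the topic summaries into the overall summary
final_summary = llm(
    "Write a concise overall summary of the following topic summaries:\n\n"
    + '\n\n'.join(topic_summaries))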
To summarize
What are the benefits of our method?
The text is split hierarchically into topics, chunks, and sentences. As we move down the hierarchy, we get progressively more detailed and specific summaries, from the final summary, to each topic's summary, to each chunk's summary.
As I mentioned above, the summary therefore accurately captures the semantic structure of the text, in which an overarching theme splits into several main topics, each of which comprises several key ideas (chunks), ensuring that essential information is retained through the various layers of summarization.
This also offers greater flexibility than a single overall summary. Different people are interested in different parts of the text and can hence choose the appropriate level of detail they want for each part of the text.
Of course, this requires pairing the generated summaries with an intuitive and coherent interface that visualizes the hierarchical nature of the text. An example of such a visualization is on Podsmart; click here for an interactive summary of the speech.
Note that this doesn't drastically increase LLM costs: we're still passing just as much input into the LLM as the conventional method, yet we get a much richer summarization.
TLDR: here are the secret sauces for producing superior summaries of your texts
- Semantically coherent topics: created by computing semantic embeddings of small chunks of the text and splitting the text by semantic similarity
- Titles and summaries for each chunk: obtained by customizing the prompt instead of using the default LangChain summarize chain
- Calibrating the Louvain community detection algorithm: hyperparameters like resolution and proximity bonus ensure the generated topic clusters are plausible
- Distinct topic titles: generated by producing all topic titles simultaneously, which required the chunk titles
Once again, you can check out the full source code in the GitHub repo. If you have any questions, feel free to reach out to me on Twitter!