Before diving into the technical side of the article, let's set the context and answer the question you might have: what is a knowledge graph?
To answer this, imagine that instead of storing knowledge in filing cabinets we store it in a fabric web. Every fact, concept, piece of information about people, places, events, and even abstract ideas is a knot, and the thread connecting them is the relationship they have with one another. This intricate web, my friends, is the essence of a knowledge graph.
Think of it like a bustling city map, one that doesn't just show streets but reveals the connections between landmarks, parks, and shops. Similarly, a knowledge graph doesn't just store cold facts; it captures the rich tapestry of how things are linked. For example, you might learn that Marie Curie discovered radium, then follow a thread to see that radium is used in medical treatments, which in turn connect to hospitals and cancer research. See how one fact effortlessly leads to another, painting a bigger picture?
So why is this map-like way of storing knowledge so popular? Well, imagine searching for information online. Traditional methods often leave you with isolated bits and pieces, like finding only buildings on a map without knowing the streets that connect them. A knowledge graph, however, takes you on a journey, guiding you from one fact to another, like having a friendly guide whisper fascinating stories behind every corner of the information world. Fascinating, right? I know.
Since I discovered this magic, it captured my attention and I explored and played around with many potential applications. In this article, I'll show you how to build a pipeline that extracts audio from video, transcribes that audio, and then builds a knowledge graph from the transcription, allowing for a more nuanced and interconnected representation of the information within the video.
I will be using Google Drive to upload the video sample. I will also use Google Colab to write the code, and finally, you need access to the OpenAI GPT API for this project. I'll break this down into steps to make it clear and easy for beginners:
- Setting up everything.
- Extracting audio from video.
- Transcribing audio to text.
- Building the knowledge graph.
By the end of this article, you'll have constructed a graph with the following schema.
Let's dive right into it!
As mentioned, we will be using Google Drive and Colab. In the first cell, let's connect Google Drive to Colab and create our directory folders (video_files, audio_files, text_files). The following code gets this done. (If you want to follow along with the code, I've uploaded all of the code for this project on GitHub; you can access it from here.)
# installing required libraries
!pip install pydub
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg
!pip install networkx matplotlib
!pip install openai
!pip install requests

# connecting google drive to import video samples
from google.colab import drive
import os

drive.mount('/content/drive')

video_files = '/content/drive/My Drive/video_files'
audio_files = '/content/drive/My Drive/audio_files'
text_files = '/content/drive/My Drive/text_files'

folders = [video_files, audio_files, text_files]
for folder in folders:
    # Check if the output folder exists
    if not os.path.exists(folder):
        # If not, create the folder
        os.makedirs(folder)
Or you can create the folders manually and upload your video sample to the "video_files" folder, whichever is easier for you.
Now we have our three folders, with a video sample in the "video_files" folder to test the code.
The next thing we want to do is import our video and extract the audio from it. We can use the Pydub library, a high-level audio processing library, to do this. Let's look at the code and then explain it below.
from pydub import AudioSegment

# Extract audio from videos
for video_file in os.listdir(video_files):
    if video_file.endswith('.mp4'):
        video_path = os.path.join(video_files, video_file)
        audio = AudioSegment.from_file(video_path, format="mp4")

        # Save audio as WAV
        audio.export(os.path.join(audio_files, f"{video_file[:-4]}.wav"), format="wav")
After installing the pydub package, we imported the AudioSegment class from the Pydub library. Then, we created a loop that iterates through all of the video files in the "video_files" folder we created earlier and passes each file through AudioSegment.from_file to load the audio from the video file. The loaded audio is then exported as a WAV file using audio.export and saved in the specified "audio_files" folder with the same name as the video file but with the extension .wav.
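If your samples are not all .mp4, a small variation of the same loop can handle other containers too. This is only a minimal sketch under the same folder setup; the extra extensions are just examples, and leaving out the format argument lets pydub/ffmpeg probe the container itself:

# Hypothetical variation: accept a few more common containers (extensions are examples)
supported_extensions = ('.mp4', '.mov', '.mkv')

for video_file in os.listdir(video_files):
    if video_file.lower().endswith(supported_extensions):
        video_path = os.path.join(video_files, video_file)
        # No explicit format: ffmpeg detects the container for us
        audio = AudioSegment.from_file(video_path)
        base_name = os.path.splitext(video_file)[0]
        audio.export(os.path.join(audio_files, f"{base_name}.wav"), format="wav")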
At this point, you can go to the "audio_files" folder in Google Drive, where you will see the extracted audio.
In the third step, we'll transcribe the audio file we have into text and save it as a .txt file in the "text_files" folder. Here I used the Whisper ASR (Automatic Speech Recognition) system from OpenAI. I used it because it's easy to work with and fairly accurate, and it offers different models for different accuracy levels. But the more accurate the model, the larger it is and the slower it is to load, so I will be using the medium one just for demonstration. To make the code cleaner, let's create a function that transcribes the audio and then use a loop to apply the function to all the audio files in our directory.
import re
import subprocess

# function to transcribe and save the output in a txt file
def transcribe_and_save(audio_file_path, text_files, model='medium.en'):
    # Construct the Whisper command
    whisper_command = f"whisper '{audio_file_path}' --model {model}"
    # Run the Whisper command
    transcription = subprocess.check_output(whisper_command, shell=True, text=True)

    # Clean and join the sentences
    output_without_time = re.sub(r'\[\d+:\d+\.\d+ --> \d+:\d+\.\d+\] ', '', transcription)
    sentences = [line.strip() for line in output_without_time.split('\n') if line.strip()]
    joined_text = ' '.join(sentences)

    # Create the corresponding text file name
    audio_file_name = os.path.basename(audio_file_path)
    text_file_name = os.path.splitext(audio_file_name)[0] + '.txt'
    file_path = os.path.join(text_files, text_file_name)

    # Save the output as a txt file
    with open(file_path, 'w') as file:
        file.write(joined_text)

    print(f'Text for {audio_file_name} has been saved to: {file_path}')

# Transcribing all the audio files in the directory
for audio_file in os.listdir(audio_files):
    if audio_file.endswith('.wav'):
        audio_file_path = os.path.join(audio_files, audio_file)
        transcribe_and_save(audio_file_path, text_files)
Libraries used:
- os: Provides a way of interacting with the operating system, used for handling file paths and names.
- re: Regular expression module for pattern matching and substitution.
- subprocess: Allows the creation of additional processes, used here to execute the Whisper ASR system from the command line.
We created a Whisper command and saved it as a variable to simplify the process. After that, we used subprocess.check_output to run the Whisper command and save the resulting transcription in the transcription variable. But the transcription at this point is not clean (you can check it by printing the transcription variable outside the function; it has timestamps and some lines that aren't relevant to the transcription), so we added cleaning code that removes the timestamps using re.sub and joins the sentences together. After that, we created a text file inside the "text_files" folder with the same name as the audio and saved the cleaned transcription in it.
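As a side note, if you would rather not shell out to the command line, the openai-whisper package we installed also exposes a Python API, which skips the timestamp cleanup entirely. Here is a minimal sketch of that alternative, assuming the same folder variables; it is just another way to obtain the transcription string, not the route used in the rest of this article:

import whisper

# Load the medium English model once and reuse it for every file
model = whisper.load_model("medium.en")

for audio_file in os.listdir(audio_files):
    if audio_file.endswith('.wav'):
        audio_path = os.path.join(audio_files, audio_file)
        result = model.transcribe(audio_path)
        text_file_name = os.path.splitext(audio_file)[0] + '.txt'
        with open(os.path.join(text_files, text_file_name), 'w') as f:
            f.write(result["text"].strip())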
Now if you go to the "text_files" folder, you can see the text file that contains the transcription. Woah, step 3 done successfully! Congratulations!
This is the essential part, and maybe the longest. I'll follow a modular approach with five functions to handle this task. But before that, let's begin with the libraries and modules needed for making HTTP requests (requests), handling JSON (json), working with data frames (pandas), and creating and visualizing graphs (networkx and matplotlib), and with the global constants, which are variables used throughout the code: API_ENDPOINT is the endpoint for OpenAI's API, api_key is where the OpenAI API key will be stored, and prompt_text will store the text used as input for the OpenAI prompt. All of this is done in this code:
import requests
import json
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Global constants: API endpoint, API key, prompt text
API_ENDPOINT = "https://api.openai.com/v1/completions"
api_key = "your_openai_api_key_goes_here"
prompt_text = """Given a prompt, extrapolate as many relationships as possible from it and provide a list of updates.
If an update is a relationship, provide [ENTITY 1, RELATIONSHIP, ENTITY 2]. The relationship is directed, so the order matters.
Example:
prompt: Sun is the source of solar energy. It is also the source of Vitamin D.
updates:
[["Sun", "source of", "solar energy"],["Sun","source of", "Vitamin D"]]
prompt: $prompt
updates:"""
Then let's proceed by breaking down the structure of our functions:
The first function is create_graph(). The task of this function is to create a graph visualization using the networkx library. It takes a DataFrame df and a dictionary of edge labels rel_labels (which will be created by the following function) as input. It then uses the DataFrame to create a directed graph and visualizes it using matplotlib with some customization, outputting the beautiful graph we need.
# Graph creation function
def create_graph(df, rel_labels):
    G = nx.from_pandas_edgelist(df, "source", "target",
                                edge_attr=True, create_using=nx.MultiDiGraph())
    plt.figure(figsize=(12, 12))
    pos = nx.spring_layout(G)
    nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
    nx.draw_networkx_edge_labels(
        G,
        pos,
        edge_labels=rel_labels,
        font_color='purple'
    )
    plt.show()
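Before wiring this function up to the API, you can sanity-check it with a hand-made DataFrame. The triples below are just the ones from the prompt example earlier, not real model output:

# Illustrative triples borrowed from the prompt example above
toy_df = pd.DataFrame({
    'source': ['Sun', 'Sun'],
    'target': ['solar energy', 'Vitamin D'],
    'edge': ['source of', 'source of'],
})
toy_labels = dict(zip(zip(toy_df.source, toy_df.target), toy_df.edge))
create_graph(toy_df, toy_labels)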
The DataFrame df and the edge labels rel_labels are the output of the next function, preparing_data_for_graph(). This function takes the OpenAI api_response (which will be created by the following function) as input and extracts the entity-relation triples (source, target, edge) from it. Here we use the json module to parse the response and obtain the relevant data, then filter out elements that have missing data. After that, we build a knowledge base DataFrame kg_df from the triples and finally create a dictionary (relation_labels) mapping pairs of nodes to their corresponding edge labels, and of course return the DataFrame and the dictionary.
# Data preparation function
def preparing_data_for_graph(api_response):
    # extract response text
    response_text = api_response.text
    entity_relation_lst = json.loads(json.loads(response_text)["choices"][0]["text"])
    entity_relation_lst = [x for x in entity_relation_lst if len(x) == 3]
    source = [i[0] for i in entity_relation_lst]
    target = [i[2] for i in entity_relation_lst]
    relations = [i[1] for i in entity_relation_lst]

    kg_df = pd.DataFrame({'source': source, 'target': target, 'edge': relations})
    relation_labels = dict(zip(zip(kg_df.source, kg_df.target), kg_df.edge))
    return kg_df, relation_labels
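The double json.loads can look odd at first glance, so here is what it does on a made-up response body (the values are illustrative, not real API output): the inner call parses the raw response JSON, and the outer call parses the triples list that the model returns as a string in the "text" field.

# Made-up response body, shaped like a completions response (values are illustrative)
fake_response_text = json.dumps({
    "choices": [
        {"text": '[["Sun", "source of", "solar energy"], ["Sun", "source of", "Vitamin D"]]'}
    ]
})

# Inner loads: parse the API response; outer loads: parse the triples the model wrote as text
triples = json.loads(json.loads(fake_response_text)["choices"][0]["text"])
print(triples)  # [['Sun', 'source of', 'solar energy'], ['Sun', 'source of', 'Vitamin D']]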
The third function is call_gpt_api(), which is responsible for making a POST request to the OpenAI API and outputting the api_response. Here we construct the data payload with the model information, prompt, and other parameters such as the model (in this case: gpt-3.5-turbo-instruct), max_tokens, stop, and temperature. Then we send the request using requests.post and return the response. I've also included simple error handling to print an error message in case an exception occurs. The try block contains the request code that may raise an exception during execution, so if an exception occurs during this process (for example, due to network issues, API errors, and so on), the code inside the except block will be executed.
# OpenAI API call function
def call_gpt_api(api_key, prompt_text):
    global API_ENDPOINT
    try:
        data = {
            "model": "gpt-3.5-turbo-instruct",
            "prompt": prompt_text,
            "max_tokens": 3000,
            "stop": "\n",
            "temperature": 0
        }
        headers = {"Content-Type": "application/json", "Authorization": "Bearer " + api_key}
        r = requests.post(url=API_ENDPOINT, headers=headers, json=data)
        print("Response content:", r.json())  # Inspect the response parsed as JSON
        return r
    except Exception as e:
        print("Error:", e)
Then, the second-to-last function is the main() function, which orchestrates the main flow of the script. First, it reads the text file contents from the "text_files" folder we created earlier and saves them in the variable kb_text. It brings in the global variable prompt_text, which stores our prompt, and replaces the placeholder in the prompt template ($prompt) with the text file content kb_text. It then calls the call_gpt_api() function, giving it the api_key and prompt_text to get the OpenAI API response. The response is passed to preparing_data_for_graph() to prepare the data and get the DataFrame and the edge labels dictionary, and finally these two values are passed to the create_graph() function to build the knowledge graph.
# Main function
def main(text_file_path, api_key):
    with open(text_file_path, 'r') as file:
        kb_text = file.read()

    global prompt_text
    prompt_text = prompt_text.replace("$prompt", kb_text)

    api_response = call_gpt_api(api_key, prompt_text)
    df, rel_labels = preparing_data_for_graph(api_response)
    create_graph(df, rel_labels)
Finally, we have the start() function, which iterates through all the text files in our "text_files" folder (in case we have more than one), gets the name and the path of each file, and passes them along with the api_key to the main function to do its job.
# Start function
def start():
    for filename in os.listdir(text_files):
        if filename.endswith(".txt"):
            # Construct the full path to the text file
            text_file_path = os.path.join(text_files, filename)
            main(text_file_path, api_key)
If you have followed the steps correctly, after running the start() function you should see a similar visualization.
You can, of course, save this knowledge graph in a Neo4j database and take it further.
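If you want to go that route, here is a minimal sketch of pushing the triples into Neo4j with the official Python driver, assuming you still have the kg_df DataFrame from preparing_data_for_graph() in hand. The connection URI, credentials, and the Entity/REL labelling scheme are placeholders you would adapt to your own setup:

from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password"))

def push_triples(tx, triples):
    for source, relation, target in triples:
        tx.run(
            "MERGE (a:Entity {name: $source}) "
            "MERGE (b:Entity {name: $target}) "
            "MERGE (a)-[:REL {type: $relation}]->(b)",
            source=source, target=target, relation=relation,
        )

triples = list(zip(kg_df['source'], kg_df['edge'], kg_df['target']))
with driver.session() as session:
    session.execute_write(push_triples, triples)
driver.close()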
NOTE: This workflow only applies to videos you own or whose terms allow this kind of download and processing.
Knowledge graphs use semantic relationships to represent data, enabling a more nuanced and context-aware understanding. This semantic richness allows for more sophisticated querying and analysis, as the relationships between entities are explicitly defined.
In this article, I outlined the detailed steps for building a pipeline that extracts audio from videos, transcribes it with OpenAI's Whisper ASR, and crafts a knowledge graph. As someone interested in this field, I hope this article makes the topic easier for beginners to grasp, demonstrating the potential and versatility of knowledge graph applications.
And as always, the complete code is available on GitHub.