Using the fine-tuned Stable Diffusion 2.1 model on Amazon SageMaker JumpStart, I built an AI system called Owly that crafts personalised comic videos with music, starring my son's toys as the lead characters.
Every night, sharing bedtime stories with my four-year-old son Dexie has become a cherished routine, and he absolutely adores it. His collection of books is impressive, but he's especially captivated when I create stories from scratch. Crafting stories this way also lets me weave in moral values I want him to learn, which can be hard to find in store-bought books. Over time, I've honed my skill at crafting personalised narratives that ignite his imagination, from dragons with broken wings to a lonely sky lantern searching for companionship. Lately, I've been spinning yarns about fictional superheroes like Slow-Mo Man and Fart-Man, which have become his favourites.
While it's been a delightful journey for me, after half a year of nightly storytelling my creative reservoir is being tested. To keep him engaged with fresh and exciting stories without exhausting myself, I need a more sustainable solution: an AI that can generate captivating stories automatically! I named her Owly, after his favourite bird, an owl.
As I started assembling my wish list, it quickly ballooned, driven by my eagerness to test the frontiers of modern technology. No ordinary text-based story would do; I envisioned an AI crafting a full-blown comic with up to 10 panels. To amp up the excitement for Dexie, I aimed to customise the comic with characters he knew and loved, like Zelda and Mario, and maybe even toss in his toys for good measure. Frankly, the personalisation angle emerged from a need for visual consistency across the comic strips, which I'll dive into later. But hold your horses, that's not all: I also wanted the AI to narrate the story aloud, backed by a fitting soundtrack to set the mood. Tackling this project would be equal parts amusing and challenging for me, while Dexie would be treated to a tailor-made, interactive storytelling extravaganza.
To meet these requirements, I realised I needed to build five modules:
- The Story Script Generator, which conjures up a multi-paragraph story where each paragraph can be transformed into a comic strip section. It also recommends a musical style so a fitting tune can be plucked from my library. To pull this off, I enlisted the mighty OpenAI GPT-3.5 Large Language Model (LLM).
- The Comic Strip Image Generator, which whips up images for each story segment. Stable Diffusion 2.1 teamed up with Amazon SageMaker JumpStart, SageMaker Studio, and Batch Transform to bring this to life.
- The Text-to-Speech Module, which turns the written story into an audio narration. Amazon Polly's neural engine leaped to the rescue.
- The Video Maker, which weaves the comic strips, audio narration, and music into a self-playing masterpiece. MoviePy was the star of this show.
- And finally, The Controller, orchestrating the grand symphony of the other four modules, built on the mighty foundation of AWS Batch.
The game plan? Get the Story Script Generator to weave a 7-10 paragraph narrative, with each paragraph morphing into a comic strip section. The Comic Strip Image Generator then generates images for each segment, while the Text-to-Speech Module crafts the audio narration. A melodious tune is chosen based on the story generator's recommendation. And finally, the Video Maker combines images, audio narration, and music to create a whimsical video. Dexie is in for a treat with this one-of-a-kind, interactive story-time adventure!
Before delving into the Story Script Generator, let's first explore the image generator module to provide context for any references to the image generation process. There are numerous text-to-image AI models available, but I chose the Stable Diffusion 2.1 model for its popularity and the ease of building, fine-tuning, and deploying it with Amazon SageMaker and the broader AWS ecosystem.
Amazon SageMaker Studio is an integrated development environment (IDE) that offers a unified web-based interface for all machine learning (ML) tasks, streamlining data preparation, model building, training, and deployment. This boosts data science team productivity by up to 10x. Within SageMaker Studio, users can seamlessly upload data, create notebooks, train and tune models, track experiments, collaborate with their team, and deploy models to production.
Amazon SageMaker JumpStart, a useful feature within SageMaker Studio, provides an extensive collection of widely-used pre-trained AI models. Some models, including Stable Diffusion 2.1 base, can be fine-tuned with your own training set and come with a sample Jupyter Notebook. This lets you experiment with the model quickly and efficiently.
I navigated to the Stable Diffusion 2.1 base model page and launched the Jupyter notebook by clicking the Open Notebook button.
In a matter of seconds, Amazon SageMaker Studio presented the example notebook, complete with all the code required to load the text-to-image model from JumpStart, deploy it, and even fine-tune it for personalised image generation.
Numerous text-to-image models are available, many tailored to specific styles by their creators. Using the JumpStart API, I filtered and listed all text-to-image models with the filter_value "task == txt2img" and displayed them in a dropdown menu for convenient selection.
from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Retrieve all text-to-image generation models.
filter_value = "task == txt2img"
txt2img_models = list_jumpstart_models(filter=filter_value)

# Display the model IDs in a dropdown to select a model for inference.
model_dropdown = Dropdown(
    options=txt2img_models,
    value="model-txt2img-stabilityai-stable-diffusion-v2-1-base",
    description="Select a model",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(model_dropdown)

# Or just hard-code the model id and version="*".
# E.g. if we want the latest 2.1 base model:
self._model_id, self._model_version = (
    "model-txt2img-stabilityai-stable-diffusion-v2-1-base",
    "*",
)
The model I required was model-txt2img-stabilityai-stable-diffusion-v2-1-base, which allows fine-tuning.
In under five minutes, using the provided code, I deployed the model to a SageMaker endpoint running on a g4dn.2xlarge GPU instance. I swiftly generated my first image from my text prompts, which you can see showcased below.
Amazon SageMaker Studio streamlines my experimentation and prototyping process, allowing me to quickly try out various image generation prompts and inspect the resulting images directly within the IDE using the file explorer and the preview window. Moreover, I can upload images within the IDE, use the built-in terminal to run the AWS CLI for uploading and downloading images to and from an S3 bucket, and execute SageMaker Batch Transform jobs against my models to generate numerous images at once for large-scale testing.
The task of this module is quite straightforward: produce a story script given a story topic and a character name. Generating a story on a specific topic with the GPT-3.5 API is remarkably simple.
openai.api_key = self._api_key
prompt = "Write me a 1000-word story about Bob the penguin who wants to travel to Europe to see famous landmarks"
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.7,
    max_tokens=2089,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
For example, given the prompt "Write me a 1000-word story about Bob the penguin who wants to travel to Europe to see famous landmarks. He learns that his bravery and curiosity lead him to experience many exciting things.", GPT-3.5 will effortlessly craft a captivating story on the topic as if it were penned by a professional storyteller, much like the example below.
Bob the penguin had always dreamed of travelling to Europe and visiting famous landmarks. He had heard stories from his friends about the Eiffel Tower in Paris, the Colosseum in Rome, and Big Ben in London. He had grown tired of his routine life in Antarctica and yearned for adventure.
One day, Bob decided to take the plunge and start planning his trip. He spent hours researching the best travel routes and the most affordable accommodation. After careful consideration, he decided to start his journey in Paris.
The boat trip was long and tiring, but he was excited to finally be in Europe. He checked into his hotel and immediately set off to see the Eiffel Tower. As he walked through the streets of Paris, he felt a sense of wonder and excitement he had never felt before.
Over the next few days, he visited famous landmarks like the Louvre Museum, Notre Dame Cathedral, and the Palace of Versailles. He tried new foods and met new people, each experience adding to his adventure.
The story itself is fantastic, but to transform it into comic strips I need to divide it into sections and create an image for each. The most logical approach would be to turn each paragraph into a section. However, as you can see, the images generated from these paragraphs present some significant challenges.
- Character chaos ensued! Each panel depicted Bob as an entirely different character. In the first strip he's a penguin doll; in the second, a person's leg; in the third, a suited-up gentleman; and in the fourth, a man in a blue shirt. This happens because only the first paragraph mentions "Bob the penguin", while the second refers to him as just "Bob" and the rest as "he". Given such scant information, it's no wonder the image generator portrayed Bob in so many different guises.
- The scenes also lacked focus. The first panel showed a penguin doll sitting on a table instead of an Antarctic-dwelling penguin dreaming of European adventures. A similar issue arose in the second panel, which showed someone's leg hovering high above a city. It seems the image generator interpreted "One day, Bob decided to take the plunge" as skydiving from an airplane, while the paragraph's focus should have been on Bob planning his trip to Europe. Long paragraphs with multiple focal points often confuse the image generator, resulting in out-of-context images.
To address the out-of-focus scenes and improve character consistency, I refined my prompt to request a concise, one-sentence scene description for each paragraph, wrapped in [] so the descriptions can be programmatically extracted. This allowed me to provide detailed examples and guide the image generator towards more focused and accurate images.
For each section please describe the scene in details and always include the location in one sentence within [] with the following format [a photo of character in the location], [a photo of character in front of an object], [a photo of character next to an object], [a photo of a location]
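Since every scene description follows this bracketed format, it can later be pulled back out of the generated story with a small regular expression. A minimal sketch of how that extraction might look (the helper function and its name are my own, not taken from the original code):

```python
import re

def split_story(story_text):
    """Split the generated story into (scene_description, paragraph) pairs.

    Each section is expected to start with a bracketed scene description,
    e.g. "[a photo of Bob the penguin in Antarctica]".
    """
    # Capture the bracketed scene, then the prose that follows it.
    pattern = re.compile(r"\[([^\]]+)\]\s*([^\[]*)")
    return [(scene.strip(), paragraph.strip())
            for scene, paragraph in pattern.findall(story_text)]

story = (
    "[a photo of Bob the penguin in Antarctica]\n"
    "Bob the penguin lived in the Antarctic.\n"
    "[a photo of Bob the penguin on a cruise ship]\n"
    "He started the long journey by boat."
)
sections = split_story(story)
```

Each extracted scene then serves as an image prompt, while the accompanying paragraph becomes the narration text.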
With the updated prompt, here's the resulting story that was generated.
[a photo of Bob the penguin in Antarctica]
Bob the penguin was a happy and curious penguin who lived in the Antarctic. He was content with his life there, surrounded by his friends and family. But one day, he decided to take the plunge and explore the world beyond the icy continent. He had heard stories of the many beautiful and exotic places around the world, and he wanted to experience them for himself.
[a photo of Bob the penguin reading a book]
Bob the penguin started researching the world, poring over maps and reading up on different countries and cultures. He was particularly drawn to Europe, with its many famous landmarks and attractions. He decided that Europe was the place he wanted to visit, so he began to plan his trip.
[a photo of Bob the penguin on a cruise ship]
He started the long journey by boat. He was excited and couldn't wait to get there, and he was determined to make it to Europe. After a few weeks of travelling, he finally arrived at his destination.
[a photo of Bob the penguin at the Eiffel Tower]
Bob the penguin started exploring Europe and was amazed by all the different places he visited. He went to the Eiffel Tower in Paris, the Colosseum in Rome, and the Cliffs of Moher in Ireland. Everywhere he went he was filled with awe and delight.
As you can observe, the generated scene descriptions are considerably more focused. Each mentions a single scene, a location, and/or an activity being performed, usually starting with the character's name. These concise prompts prove far more effective for my image generator, as evidenced by the improved images below.
Bob the penguin has made a triumphant return, but he's still sporting a new look in each panel. Since the image generation process treats each image individually, and no information is provided about Bob's colour, size, or species of penguin, consistency remains elusive.
I previously considered generating a detailed character description as part of the story generation to maintain character consistency across images. However, this approach proved impractical for two reasons:
- Sometimes it's nearly impossible to describe a character in enough detail without resorting to an overwhelming amount of text. While there may not be many kinds of penguins, consider birds in general: with countless shapes, colours, and species such as cockatoos, parrots, canaries, pelicans, and owls, the task becomes daunting.
- The generated character doesn't always adhere to the description provided in the prompt. For example, a prompt describing a green parrot with a red beak might result in an image of a green parrot with a yellow beak instead.
So, despite our best efforts, our penguin pal Bob continues to experience something of an identity crisis.
The solution to our penguin predicament lies in giving the Stable Diffusion model a visual cue of what our penguin character should look like, to influence the image generation process and maintain consistency across all generated images. In the world of Stable Diffusion, this process is called fine-tuning: you supply a handful (usually 5 to 15) of images containing the same object, along with a sentence describing it. These images shall henceforth be known as the training images.
As it turns out, this added personalisation is not just a solution but also a mighty cool feature for my comic generator. Now I can use many of Dexie's toys as the main characters in the stories, such as his festive Christmas penguin breathing new life into Bob the penguin, making the stories even more personalised and relatable for my young but tough audience. So the quest for consistency turns into a triumph for tailor-made tales!
During my exhilarating days of experimentation, I discovered a few nuggets of wisdom to share for achieving the best results when fine-tuning the model while reducing the chance of overfitting:
- Keep the backgrounds in your training images diverse. This way the model won't confuse the backdrop with the object, preventing unwanted background cameos in the generated images.
- Capture the target object from various angles. This provides more visual information, enabling the model to generate the object across a wider range of angles and thus better match the scene.
- Mix close-ups with full-body shots. This ensures the model doesn't assume a specific pose is essential, granting more flexibility for the generated object to harmonise with the scene.
To perform the Stable Diffusion model fine-tuning, I launched a SageMaker Estimator training job with the Amazon SageMaker Python SDK on an ml.g5.2xlarge GPU instance and pointed the training process at my collection of training images in an S3 bucket. The resulting fine-tuned model file is then saved to s3_output_location. And with just a few lines of code, the magic began to unfold!
# [Optional] Override default hyperparameters with custom values
hyperparams["max_steps"] = 400
hyperparams["with_prior_preservation"] = False
hyperparams["train_text_encoder"] = False

training_job_name = name_from_base(f"stable-diffusion-{self._model_id}-transfer-learning")

# Create SageMaker Estimator instance
sd_estimator = Estimator(
    role=self._aws_role,
    image_uri=image_uri,
    source_dir=source_uri,
    model_uri=model_uri,
    entry_point="transfer_learning.py",  # Entry-point file in source_dir and present in train_source_uri.
    instance_count=self._training_instance_count,
    instance_type=self._training_instance_type,
    max_run=360000,
    hyperparameters=hyperparams,
    output_path=s3_output_location,
    base_job_name=training_job_name,
    sagemaker_session=session,
)

# Launch a SageMaker training job by passing the S3 path of the training data
sd_estimator.fit({"training": training_dataset_s3_path}, logs=True)
To prepare the training set, ensure it contains the following files:
- A series of images named instance_image_x.jpg, where x is a number from 1 to N. Here N is the number of images, ideally more than 10.
- A dataset_info.json file that includes a mandatory field called instance_prompt. This field should provide a detailed description of the object, with a unique identifier preceding the object's name. For example, "a photo of Bob the penguin", where 'Bob' acts as the unique identifier. Using this identifier, you can direct your fine-tuned model to generate either a generic penguin (referred to as "penguin") or the penguin from your training set (referred to as "Bob the penguin"). Some sources suggest using unique tokens such as sks or xyz, but I found that this isn't essential.
The dataset_info.json file can also include an optional field called class_prompt, which gives a general description of the object without the unique identifier (e.g., "a photo of a penguin"). This field is used only when the prior_preservation parameter is set to True; otherwise it is ignored. I'll discuss it further in the advanced fine-tuning section below.
{"instance_prompt": "a photo of bob penguin",
 "class_prompt": "a photo of a penguin"
}
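Laying out this training folder is easy to script. The sketch below is my own assumption of how the preparation could look, matching the file layout just described; the helper name and the copy-then-rename approach are not from the original code:

```python
import json
import shutil
from pathlib import Path

def prepare_training_set(image_paths, out_dir, character="bob penguin"):
    """Lay out a fine-tuning dataset: instance_image_N.jpg files plus dataset_info.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Copy and rename each source image to the instance_image_N.jpg convention.
    for i, src in enumerate(image_paths, start=1):
        shutil.copy(src, out / f"instance_image_{i}.jpg")
    info = {
        "instance_prompt": f"a photo of {character}",
        # class_prompt is only consulted when prior_preservation is True.
        "class_prompt": "a photo of a penguin",
    }
    (out / "dataset_info.json").write_text(json.dumps(info))
    return out
```

The resulting folder can then be synced to the S3 training path (for example with the AWS CLI) before launching the Estimator.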
After a few test runs with Dexie's toys, the image generator delivered some truly impressive results. It brought Dexie's kangaroo magnetic-block creation to life, hopping its way into the digital world. The generator also masterfully depicted his beloved shower turtle toy swimming underwater, surrounded by a vibrant school of fish. The image generator certainly captured the magic of Dexie's playtime favourites!
Batch Transform against the fine-tuned Stable Diffusion model
Since I needed to generate over 100 images for each story, deploying a SageMaker endpoint (think of it as a REST API) and generating one image at a time wasn't the most efficient approach. Instead, I opted to run a batch transform against my model, supplying it with text files in an S3 bucket containing the prompts used to generate the images.
I'll provide more details about this process since I initially struggled with it, and I hope my explanation will save you some time. You need to prepare one text file per image prompt with the following JSON content: {"prompt": "a photo of Bob the penguin in Antarctica"}. While it appears there is a way to combine multiple inputs into one file using the MultiRecord strategy, I was unable to figure out how it works.
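In code, producing those per-prompt input files might look something like this. A hedged sketch: the helper and the key naming are mine; only the one-JSON-object-per-file format comes from the process described above. Each payload would be uploaded as its own object (e.g. with boto3's put_object) under the batch transform input prefix in S3:

```python
import json

def build_batch_inputs(prompts):
    """Build one {"prompt": ...} JSON payload per image for SageMaker Batch Transform."""
    # One file per prompt; MultiRecord batching is not used here.
    return {f"input_{i:03d}.json": json.dumps({"prompt": p})
            for i, p in enumerate(prompts)}

payloads = build_batch_inputs([
    "a photo of Bob the penguin in Antarctica",
    "a photo of Bob the penguin on a cruise ship",
])
```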
Another challenge I encountered was executing a batch transform against my fine-tuned model. You can't execute a batch transform using a transformer object returned by Estimator.transformer(), which usually works in my other projects. Instead, you must first create a SageMaker model object, specifying the S3 location of your fine-tuned model as the model_data. From there, you can create the transformer object from this model object.
def _get_model_uris(self, model_id, model_version, scope):
    # Retrieve the inference docker container uri
    image_uri = image_uris.retrieve(
        region=None,
        framework=None,  # automatically inferred from model_id
        image_scope=scope,
        model_id=model_id,
        model_version=model_version,
        instance_type=self._inference_instance_type,
    )
    # Retrieve the inference script uri. This includes scripts for model loading, inference handling etc.
    source_uri = script_uris.retrieve(
        model_id=model_id, model_version=model_version, script_scope=scope
    )
    if scope == "training":
        # Retrieve the pre-trained model tarball to further fine-tune
        model_uri = model_uris.retrieve(
            model_id=model_id, model_version=model_version, model_scope=scope
        )
    else:
        model_uri = None
    return image_uri, source_uri, model_uri

image_uri, source_uri, model_uri = self._get_model_uris(self._model_id, self._model_version, "inference")

# Get the model artifact location from estimator.model_data, or give an S3 key directly
model_artifact_s3_location = f"s3://{self._bucket}/output-model/{job_id}/{training_job_name}/output/model.tar.gz"
env = {
    "MMS_MAX_RESPONSE_SIZE": "20000000",
}
# Create a model from the saved model artifact
sm_model = model.Model(
    model_data=model_artifact_s3_location,
    role=self._aws_role,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    image_uri=image_uri,
    source_dir=source_uri,
    env=env,
)
transformer = sm_model.transformer(instance_count=self._inference_instance_count,
                                   instance_type=self._inference_instance_type,
                                   output_path=f"s3://{self._bucket}/processing/{job_id}/output-images",
                                   accept='application/json')
transformer.transform(data=f"s3://{self._bucket}/processing/{job_id}/batch_transform_input/",
                      content_type='application/json')
And with that, my customised image generator is ready to go!
Advanced Stable Diffusion model fine-tuning
While it's not essential for my comic generator project, I'd like to touch on some advanced fine-tuning techniques involving the max_steps, prior_preservation, and train_text_encoder hyperparameters, in case they come in handy for your projects.
Stable Diffusion fine-tuning is highly susceptible to overfitting, due to the vast difference between the number of training images you provide and the number used to train the base model. For example, you might supply only 10 images of Bob the penguin, while the base model's training set contains thousands of penguin images. A larger number of training images reduces the risk of overfitting and of erroneous associations between the target object and other elements.
When prior_preservation is set to True, Stable Diffusion generates a default of x (typically 100) images using the class_prompt provided and combines them with your instance images during fine-tuning. Alternatively, you can supply these images manually by placing them in the class_data_dir subfolder. In my experience, prior_preservation is often crucial when fine-tuning Stable Diffusion on human faces. When using prior_preservation, make sure to provide a class_prompt that names the most suitable generic object or category resembling your character. For Bob the penguin, that object is clearly a penguin, so your class prompt would be "a photo of a penguin". The technique can also be used to generate a blend between two characters, which I'll discuss later.
Another helpful parameter for advanced fine-tuning is train_text_encoder. Set it to True to enable text encoder training during the fine-tuning process. The resulting model will better understand more complex prompts and generate human faces with greater accuracy.
Depending on your specific use case, different hyperparameter values may yield better results. You'll also want to adjust the max_steps parameter to control the number of fine-tuning steps; keep in mind that a larger number of steps can itself lead to overfitting.
By using Amazon Polly's Neural Text-to-Speech (NTTS) feature, I was able to create audio narration for each paragraph of the story. The quality of the narration is remarkable; it sounds impressively natural and human-like, making it an ideal storyteller.
To accommodate a younger audience such as Dexie, I used the SSML format with the <prosody rate> tag to slow the speaking speed to 90% of the normal rate, ensuring the content isn't delivered too quickly for him to follow.
self._pollyClient = boto3.Session(
    region_name=aws_region).client('polly')
ftext = f'<speak><prosody rate="90%">{text}</prosody></speak>'
response = self._pollyClient.synthesize_speech(VoiceId=self._speaker,
                                               OutputFormat='mp3',
                                               Engine='neural',
                                               Text=ftext,
                                               TextType='ssml')

with open(mp3_path, 'wb') as file:
    file.write(response['AudioStream'].read())
After all the hard work, I used MoviePy, a fantastic Python framework, to magically turn all the images, audio narration, and music into an awesome mp4 video. Speaking of music, I gave my system the ability to choose the perfect soundtrack to match the video's vibe. How, you ask? Well, I simply modified my story script generator to return a music style from a pre-determined list using some clever prompting. How cool is that?
At the start of the story please suggest a song style from the following list only which matches the story and put it inside <>. The music style list is: action, calm, dramatic, epic, happy and touching.
Once the music style is chosen, the next step is to randomly pick an MP3 track from the relevant folder, which contains a handful of MP3 files. This adds a touch of unpredictability and excitement to the final product.
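The bookkeeping around the soundtrack is straightforward: pull the style out of the angle brackets the prompt asked for, then pick a random track from that style's folder. A minimal sketch, assuming a music library organised into one folder per style (the function names and folder layout are my own):

```python
import random
import re
from pathlib import Path

# The styles the story generator is allowed to choose from.
STYLES = {"action", "calm", "dramatic", "epic", "happy", "touching"}

def extract_music_style(story_text, default="calm"):
    """Return the first <style> tag found in the story, if it is a known style."""
    match = re.search(r"<(\w+)>", story_text)
    style = match.group(1).lower() if match else default
    return style if style in STYLES else default

def pick_track(style, music_root):
    """Randomly choose an mp3 from the folder matching the chosen style."""
    tracks = sorted(Path(music_root, style).glob("*.mp3"))
    return random.choice(tracks) if tracks else None

style = extract_music_style("<epic>\n[a photo of Bob the penguin in Antarctica] ...")
```

Falling back to a default style keeps the pipeline moving even when the model ignores the instruction or invents a style outside the list.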
To orchestrate the entire system, I needed a controller module in the form of a Python script that could run each module seamlessly. But of course, I needed a compute environment to execute this script. I had two options to explore, the first being my preferred one: a serverless architecture built on AWS Lambda, paired with SQS. The first Lambda serves as a public API, with API Gateway as the entry point. This API would take in the training image URLs and the story topic text, pre-process the data, and drop it into an SQS queue. Another Lambda would pick up the data from the queue and conduct data preparation (image resizing, creating dataset_info.json), then trigger the next Lambda to call Amazon SageMaker JumpStart to prepare the Stable Diffusion model and execute a SageMaker training job to fine-tune it. Phew, that's a mouthful. Finally, Amazon EventBridge would act as an event bus, detecting the completion of the training job and triggering the next Lambda to execute a SageMaker Batch Transform using the fine-tuned model to generate the images.
But alas, this option wasn't feasible because an AWS Lambda function has a maximum storage limit of 10GB. When executing a batch transform against a SageMaker model, the SageMaker Python SDK temporarily downloads and extracts the model.tar.gz file in the local /tmp before sending it to the managed system that runs the batch transform. Unfortunately, my model was a whopping 5GB compressed, so the SageMaker Python SDK threw an "Out of disk space" error. For most use cases, where the model is smaller, this would be the best and cleanest solution.
So I had to resort to my second option: AWS Batch. It worked well, but it did cost a bit more, since the AWS Batch compute instance had to run throughout the entire process, even during the model fine-tuning and the batch transform, which were executed in separate compute environments within SageMaker. I could have split the process across multiple AWS Batch instances and glued them together with Amazon EventBridge and SQS, just as I would have done with the serverless approach. But with AWS Batch's longer startup time (around 5 minutes), that would have added far too much latency to the overall process. So I went with the all-in-one AWS Batch option instead.
Feast your eyes upon Owly's majestic architecture diagram! Our journey kicks off by launching AWS Batch through the AWS Console, equipping it with an S3 folder brimming with training images, a captivating story topic, and a delightful character, all supplied via AWS Batch environment variables.
# Basic settings
JOB_ID = "penguin-images"  # key to the S3 folder containing the training images
STORY_TOPIC = "bob the penguin who wants to travel to Europe"
STORY_CHARACTER = "bob the penguin"

# Advanced settings
TRAIN_TEXT_ENCODER = False
PRIOR_RESERVATION = False
MAX_STEPS = 400
NUM_IMAGE_VARIATIONS = 5
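Inside the container, the controller simply reads these settings back from the environment. A minimal sketch of that plumbing (the function and the default-value handling are my own assumptions; the original controller code isn't shown):

```python
import os

def load_job_config():
    """Read the Owly job settings passed in via AWS Batch environment variables."""
    return {
        # Basic settings (required).
        "job_id": os.environ["JOB_ID"],
        "story_topic": os.environ["STORY_TOPIC"],
        "story_character": os.environ["STORY_CHARACTER"],
        # Advanced settings, falling back to the defaults shown above.
        "train_text_encoder": os.environ.get("TRAIN_TEXT_ENCODER", "False") == "True",
        "prior_preservation": os.environ.get("PRIOR_RESERVATION", "False") == "True",
        "max_steps": int(os.environ.get("MAX_STEPS", "400")),
        "num_image_variations": int(os.environ.get("NUM_IMAGE_VARIATIONS", "5")),
    }
```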
AWS Batch then springs into action, retrieving the training images from the S3 folder specified by JOB_ID, resizing them to 768x768, and creating a dataset_info.json file before placing everything in a staging S3 bucket.
Next up, we call the OpenAI GPT-3.5 API to whip up an engaging story and a complementary music style in harmony with the chosen topic and character. We then summon Amazon SageMaker JumpStart to unleash the powerful Stable Diffusion 2.1 base model. With the model at our disposal, we initiate a SageMaker training job to fine-tune it on our carefully chosen training images. After a brief 30-minute interlude, we forge image prompts for each story paragraph in the guise of text files, which are dropped into an S3 bucket as input for the image generation extravaganza. Amazon SageMaker Batch Transform is unleashed on the fine-tuned model to produce these images in a batch, a process that lasts a mere five minutes.
Once complete, we enlist the help of Amazon Polly to craft audio narrations for each paragraph in the story, saving them as mp3 files in just 30 seconds. We then randomly pick an mp3 music file from libraries sorted by music style, based on the choice made by our masterful story generator.
The final act sees the resulting images, narration mp3s, and music mp3 expertly woven together into a video slideshow with the help of MoviePy. Smooth transitions and the Ken Burns effect are added for that extra touch of class. The pièce de résistance, the finished video, is then hoisted up to the output S3 bucket, awaiting your eager download!
I must say, I'm quite pleased with the results! The story script generator has truly outdone itself, performing far better than anticipated. Almost every story script it crafts is not only well written but also brimming with positive morals, showcasing the awe-inspiring prowess of Large Language Models (LLMs). As for image generation, well, it's a bit of a mixed bag.
With all the improvements I've described earlier, one in five stories can be used in the final video right off the bat. The remaining four, however, usually have one or two images suffering from common issues.
- First, we've still got inconsistent characters. Sometimes the model conjures up a character that's slightly different from the original in the training set, often opting for a photorealistic style rather than the toy counterpart. But fear not! Adding the desired photo style to the text prompt, like "A cartoon-style Rex the turtle swimming under the sea," helps curb this issue. It does require manual intervention, though, since certain characters may warrant a photorealistic style.
- Then there's the curious case of missing body parts. Occasionally, our generated characters appear with absent limbs or heads. Yikes! To mitigate this, we've added negative prompts supported by the Stable Diffusion model, such as "missing limbs, missing head," encouraging the generation of images that steer clear of these peculiar attributes.
- Bizarre images emerge when dealing with unusual interactions between objects. Generating images of characters in specific locations usually produces satisfactory results. However, when it comes to illustrating characters interacting with other objects, especially in an uncommon way, the outcome is often less than ideal. For instance, attempting to depict Tom the hedgehog milking a cow results in a peculiar blend of hedgehog and cow, while crafting an image of Tom the hedgehog holding a flower bouquet yields a person clutching both a hedgehog and a bouquet of flowers. Regrettably, I've yet to devise a way to remedy this, leading me to conclude that it's simply a limitation of current image generation technology. If the object or activity in the image you're trying to generate is highly unusual, the model lacks prior knowledge, since none of its training data has ever depicted such scenes or actions.
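For reference, negative prompts ride along in the inference request itself; the JumpStart Stable Diffusion text-to-image endpoints accept a JSON body along these lines (the exact field names can vary per model version, so treat this as an assumption to check against the model's docs):

```python
import json


def build_sd_payload(prompt, negative_prompt="missing limbs, missing head",
                     num_images=1):
    """Build a JSON inference payload carrying a negative prompt.

    Keys follow the common JumpStart Stable Diffusion text-to-image
    schema; verify the field names for your specific model version.
    """
    return json.dumps({
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_images_per_prompt": num_images,
    })
```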
In the end, to boost the odds of success in story generation, I cleverly tweaked my story generator to produce three distinct scene descriptions per paragraph. Moreover, for each scene, I instructed my image generator to create five image variations. With this approach, I raised the likelihood of obtaining at least one top-notch image from the fifteen available. Having three different prompt variations also helps in producing usable scenes, especially when one scene proves too unusual or complex to render. Below is my updated story generation prompt.
"Write me a {max_words} words story about a given character and a topic.\nPlease break the story down into "
"seven to ten short sections with 30 maximum words per section. For each section please describe the scene in "
"details and always include the location in one sentence inside [] with the following format "
"[a photo of character in the location], [a photo of character in front of an object], "
"[a photo of character next to an object], [a photo of a location]. Please provide three different variations "
"of the scene details separated by |\nAt the start of the story please suggest a song style from the following "
"list only which matches the story and put it inside <>. Song style list are action, calm, dramatic, epic, "
"happy and touching."
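Given the format the prompt above asks for, the model's reply can be parsed with a couple of regexes. A sketch (assuming the reply really does put the song style in `<>` and each scene's three `|`-separated variations inside a single `[...]` block; the actual pipeline's parsing may differ):

```python
import re


def parse_story(reply: str):
    """Split a generated story into (song_style, sections).

    Each section is a list of scene-description variations, taken
    from the [...] blocks and split on the '|' separator.
    """
    style_match = re.search(r"<([^>]+)>", reply)
    song_style = style_match.group(1).strip() if style_match else "calm"
    sections = [
        [v.strip() for v in block.split("|")]
        for block in re.findall(r"\[([^\]]+)\]", reply)
    ]
    return song_style, sections
```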
The only additional cost is a bit of manual intervention after the image generation step is done, where I handpick the best image for each scene and then proceed with the comic generation process. That minor inconvenience aside, I now boast a remarkable success rate of nine out of ten in crafting splendid comics!
With the Owly system fully assembled, I decided to put this marvel of technology to the test one fine Saturday afternoon. I generated a handful of stories from his toy collection, ready to elevate bedtime storytelling for Dexie using a nifty portable projector I had bought. That night, as I watched Dexie's face light up and his eyes widen with excitement at the comic playing out on his bedroom wall, I knew all my efforts had been worth it.
The cherry on top is that it now takes me under two minutes to whip up a new story using photos of his toy characters I've already captured. Plus, I can seamlessly incorporate helpful morals I want him to learn from each story, such as not talking to strangers, being brave and adventurous, or being kind and helpful to others. Here are some of the delightful stories generated by this incredible system.
As a curious tinkerer, I couldn't help but fiddle with the image generation module to push Stable Diffusion's boundaries and merge two characters into one magnificent hybrid. I fine-tuned the model with Kwazi the Octonaut images, but I threw in a twist by assigning Zelda as both the unique and the class character name. By setting prior_preservation to True, I ensured that Stable Diffusion would "octonaut-ify" Zelda while still keeping her distinct essence intact.
I deliberately used a modest max_step of 400, just enough to preserve Zelda's original charm without her being fully consumed by Kwazi the Octonaut's irresistible allure. Behold the wonderful fusion of Zelda and Kwazi, united as one!
Dexie brimmed with excitement as he watched a fusion of his two favourite characters spearheading the action in his bedtime story, embarking on thrilling adventures, fighting aliens and hunting for hidden treasure chests!
Sadly, to protect the IP owner, I cannot show the resulting images.
Generative AI, particularly Large Language Models (LLMs), is here to stay and set to become a powerful tool not only for software development but for many other industries as well. I've experienced the true power of LLMs firsthand in a few projects. Just last year, I built a robotic teddy bear called Ellie, capable of moving its head and engaging in conversations like a real human. While this technology is undeniably potent, it's important to exercise caution to ensure the safety and quality of the outputs it generates, as it can be a double-edged sword.
And there you have it, folks! I hope you found this blog interesting. If so, please shower me with your claps. Feel free to connect with me on LinkedIn or check out my other AI endeavours on my Medium profile. Stay tuned, as I'll be sharing the complete source code in the coming weeks!
Finally, I'd like to say thanks to Mike Chambers from AWS, who helped me troubleshoot my fine-tuned Stable Diffusion model batch transform code.