How to Build an End-to-End Data Project Exploring New Trending Cocomelon Videos from Scratch Using R
Cocomelon — Nursery Rhymes is the world's second-largest YouTube channel (155M+ subscribers). It's such a popular and helpful channel that it's an inevitable subject for toddlers and parents. I enjoy spending time watching Cocomelon together with my son.
After watching Cocomelon videos for a month, I noticed that the same videos are repeatedly recommended on YouTube. Videos like "Wheels on the Bus" and "Bath Song" are popular and fun to watch, but they were published years ago, and kids get bored watching them over and over. As a father, I want to find more recent but good-quality videos from the Cocomelon channel. As a data professional, I also want to explore the data behind the world's second-largest YouTube channel to gain insights and find something interesting.
All videos within a YouTube channel offer users only two orderings: recently uploaded (ordered by time) and popular (ordered by views). I could go to the recently uploaded tab and click through them one after another. However, the Cocomelon channel has 800+ videos, which would be time-consuming.
The good thing is that I'm an engineer and know how to build something with data. So I started writing code: gathering the data, performing cleanup, visualizing it, and extracting insights. I'll share my journey of using R for data analysis: building an end-to-end solution for exploring trending Cocomelon videos from scratch.
Note: although the example code is written in R and the YouTube channel is Cocomelon, both are simply my preferences. You can also write in Python or Rust with their data analysis tooling, and the way I get data from YouTube applies to other channels as well.
The data source is always the starting point of any data project. I made several attempts before settling on my final solution.
I first searched Google for the term "YouTube view stats for Cocomelon." It showed some statistics about the channel, but none covered detailed data for each video. Those sites are heavily flooded with ads, and web scraping them might be challenging.
Then I looked at public datasets on Kaggle, and CC0 datasets like Trending YouTube Video Statistics could be a good option. However, after exploring the dataset, I found two issues:
- It doesn't contain Cocomelon in the dataset
- The content was retrieved years ago, and I wanted to search for newer videos.
My only option is to pull data directly from YouTube to get the most up-to-date information. There are also two options here:
- Web scraping: I could set up a crawler or find a project on GitHub and use it directly. My concern is that an aggressive crawler might get my YouTube account blocked, and crawling isn't very efficient when there are many videos to pull.
- YouTube API: I finally landed on this solution. It's efficient and provides basic statistics on each video: the number of views and the number of likes. We can use this information to build our data analysis project.
Get a YouTube API Key to Pull Data
A YouTube API key grants you permission to pull data from YouTube. You first need to go to https://console.cloud.google.com/apis, then "create credentials" and choose the API key. The default key isn't restricted; you can restrict the API key so it can only be used for YouTube.
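Rather than hardcoding the key into scripts, one option (a small sketch, not part of the original setup) is to keep it in an environment variable, for example a YOUTUBE_API_KEY entry in ~/.Renviron, and read it in R:
# Read the API key from an environment variable so it never lands in source code
key <- Sys.getenv("YOUTUBE_API_KEY")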
Get the YouTube Channel Playlist in R
Once you have the API key, refer to the YouTube Data API documentation for the data it supports. To examine the API at a queryable level, we can use tools like Postman or directly copy the full URL.
For example, we want to pull the channel information for Cocomelon. Somehow, I couldn't find its channel ID by inspecting its URL, but I found it through a Google search.
https://www.youtube.com/channel/UCbCmjCuTUZos6Inko4u57UQ
Now we can use the channel ID to construct the GET request, filling the API key into the key field:
https://www.googleapis.com/youtube/v3/channels?part=snippet,contentDetails,statistics&id=UCbCmjCuTUZos6Inko4u57UQ&key=
From the returned JSON, the most crucial piece of information is the uploads playlist ID, which leads us to all of the channel's videos.
"contentDetails": {
"relatedPlaylists": {
"likes": "",
"uploads": "UUbCmjCuTUZos6Inko4u57UQ"
}
}
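The same call can be made from R. Here is a minimal sketch (assuming the httr/jsonlite setup shown in the next block and a key variable holding your API key):
library(httr)
library(jsonlite)
channel_url <- paste0(
  "https://www.googleapis.com/youtube/v3/channels?part=snippet,contentDetails,statistics&id=UCbCmjCuTUZos6Inko4u57UQ&key=",
  key
)
channel_json <- fromJSON(content(GET(channel_url), "text", encoding = "UTF-8"))
# Extract the uploads playlist ID that drives everything below
channel_json$items$contentDetails$relatedPlaylists$uploads
## "UUbCmjCuTUZos6Inko4u57UQ"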
Since the API paginates results with a maximum of 50 items per page, calling playlistItems will take multiple requests to reach the end of the list. We need to use the returned nextPageToken to retrieve the next page until no next page is found. We can put everything together in R.
library(shiny)
library(vroom)
library(dplyr)
library(tidyverse)
library(httr)
library(jsonlite)
library(ggplot2)
library(ggthemes)
library(stringr)

key <- "to_be_replaced"
playlist_url <-
  paste0(
    "https://www.googleapis.com/youtube/v3/playlistItems?part=snippet,contentDetails,status&maxResults=50&playlistId=UUbCmjCuTUZos6Inko4u57UQ&key=",
    key
  )
api_result <- GET(playlist_url)
json_result <- content(api_result, "text", encoding = "UTF-8")
videos.json <- fromJSON(json_result)
videos.json$nextPageToken
videos.json$pageInfo$totalResults
pages <- list(videos.json$items)
# Start at 1: pages[[1]] already holds the first page of items
counter <- 1
while (!is.null(videos.json$nextPageToken)) {
  next_url <-
    paste0(playlist_url, "&pageToken=", videos.json$nextPageToken)
  api_result <- GET(next_url)
  message("Retrieving page ", counter)
  json_result <- content(api_result, "text", encoding = "UTF-8")
  videos.json <- fromJSON(json_result)
  counter <- counter + 1
  pages[[counter]] <- videos.json$items
}
## Combine all the pages into one data frame
all_videos <- rbind_pages(pages)
## Get the list of video IDs
videos <- all_videos$contentDetails$videoId
all_videos should give us all the fields for each video. All we care about at this stage is videoId, so we can fetch detailed information on each video.
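As a quick sanity check (a hypothetical inspection step, not part of the original pipeline), we can verify the shape of the combined data and the extracted IDs:
# Roughly 800+ rows are expected, one per uploaded video
dim(all_videos)
head(videos)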
Iterate the Video List and Fetch Data for Each Video in R
Once all the video IDs are stored in a vector, we can repeat a process similar to the one we used for the playlist. It will be much simpler this time since we don't have to handle pagination.
At this stage, we care more about the data we'll eventually pull from the video API call. I chose the fields needed for our later data analysis and visualization. To save time, it's better to persist the data into a CSV file, so we don't have to run the API calls multiple times.
videos_df <- data.frame()
video_url <-
  paste0(
    "https://www.googleapis.com/youtube/v3/videos?part=contentDetails,id,liveStreamingDetails,localizations,player,recordingDetails,snippet,statistics,status,topicDetails&key=",
    key
  )

for (v in videos) {
  a_video_url <- paste0(video_url, "&id=", v)
  print(v)
  print(a_video_url)
  api_result <- GET(a_video_url)
  json_result <- content(api_result, "text", encoding = "UTF-8")
  videos.json <- fromJSON(json_result, flatten = TRUE)
  # colnames(videos.json$items)
  video_row <- videos.json$items %>%
    select(
      snippet.title,
      snippet.publishedAt,
      snippet.channelTitle,
      snippet.thumbnails.default.url,
      player.embedHtml,
      contentDetails.duration,
      statistics.viewCount,
      statistics.commentCount,
      statistics.likeCount,
      statistics.favoriteCount,
      snippet.tags
    )
  videos_df <- rbind(videos_df, video_row)
}
write.csv(videos_df, "~/cocomelon.csv", row.names = TRUE)
The data is ready for our next stage: exploring the Cocomelon YouTube videos. Now it's time to perform some cleanup and create visualizations to present the findings.
The default character data type doesn't work well with the later sorting, so we need to convert some fields to numeric or date types.
videos_df <- videos_df %>% transform(
  statistics.viewCount = as.numeric(statistics.viewCount),
  statistics.likeCount = as.numeric(statistics.likeCount),
  statistics.favoriteCount = as.numeric(statistics.favoriteCount),
  snippet.publishedAt = as.Date(snippet.publishedAt)
)
What are the top 5 most-viewed Cocomelon videos?
This part is straightforward. We select the fields we're interested in, then sort the videos in descending order by viewCount.
videos_df %>%
  select(snippet.title, statistics.viewCount) %>%
  arrange(desc(statistics.viewCount)) %>% head(5)

# Output:
#                                                     snippet.title statistics.viewCount
# 1               Bath Song | CoComelon Nursery Rhymes & Kids Songs           6053444903
# 2       Wheels on the Bus | CoComelon Nursery Rhymes & Kids Songs           4989894294
# 3     Baa Baa Black Sheep | CoComelon Nursery Rhymes & Kids Songs           3532531580
# 4 Yes Yes Vegetables Song | CoComelon Nursery Rhymes & Kids Songs           2906268556
# 5 Yes Yes Playground Song | CoComelon Nursery Rhymes & Kids Songs           2820997030
If you have watched Cocomelon videos before, it isn't surprising to see "Bath Song," "Wheels on the Bus," and "Baa Baa Black Sheep" rank in the top 3. The result matches the Popular tab of the Cocomelon channel on YouTube. Also, "Bath Song" has been played 20%+ more times than the second video, "Wheels on the Bus." Many toddlers struggle with bath time, and watching this video may show them how to take a bath and comfort them enough to calm down.
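A quick check of that 20% figure from the numbers above:
# Bath Song views relative to Wheels on the Bus
6053444903 / 4989894294 - 1
## ~0.213, i.e., about 21% more views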
We also create a bar chart with the top 5 videos. Note that chart_df here holds the five rows from the query above:
chart_df <- videos_df %>%
  select(snippet.title, statistics.viewCount) %>%
  arrange(desc(statistics.viewCount)) %>% head(5)

ggplot(data = chart_df, mapping = aes(x = reorder(snippet.title, statistics.viewCount), y = statistics.viewCount)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 16)) +
  theme_minimal()
Are the numbers of views and likes correlated? Is a video more likely to get a thumbs up (like) with more views?
We can use the data to examine this further. First, normalize viewCount and likeCount to fit the visualization better. Second, we also compute the number of days since each video was uploaded to see when the popular videos were created.
chart_df <- videos_df %>%
  mutate(
    views = statistics.viewCount / 1000000,
    likes = statistics.likeCount / 10000,
    number_days_since_publish = as.numeric(Sys.Date() - snippet.publishedAt)
  )

ggplot(data = chart_df, mapping = aes(x = views, y = likes)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_minimal()
cor(chart_df$views, chart_df$likes, method = "pearson")
## 0.9867712
The correlation coefficient of 0.98 shows the two are very highly correlated: with more views on a video, it's likely to get more thumbs up. It's also interesting that only six videos have over 2B views: parents and kids enjoy these six videos and potentially watch them many times.
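A one-line check of that count against the data:
# Number of videos with more than 2 billion views
sum(videos_df$statistics.viewCount > 2e9)
## 6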
We can further plot the popular videos and find that the most popular ones are aged 1,500–2,000 days, which shows these videos were created around 2018 or 2019; see the sketch below.
Popular videos are easy to retrieve. However, videos created four or five years ago can still be trending because they keep accumulating many daily views.
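A minimal sketch of that plot, reusing the chart_df built above (views are in millions, so the 2B club sits above 2,000 on the y-axis):
# Video age in days since publish vs. views in millions
ggplot(data = chart_df,
       mapping = aes(x = number_days_since_publish, y = views)) +
  geom_point() +
  theme_minimal()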
How about finding new Cocomelon videos with growing views? Since the YouTube API only returns the view count at the current moment, we need to snapshot the data by pulling from the API twice, with some days in between.
df1 <- read_csv("~/cocomelon_2023_2_28.csv")
df2 <- read_csv("~/cocomelon_2023_3_2.csv")

df1 <- df1 %>% transform(
  statistics.viewCount = as.numeric(statistics.viewCount)
)
df2 <- df2 %>% transform(
  statistics.viewCount = as.numeric(statistics.viewCount),
  snippet.publishedAt = as.Date(snippet.publishedAt)
)
df1 <- df1 %>% select(snippet.title,
                      statistics.viewCount)
df2 <- df2 %>% select(snippet.title,
                      snippet.publishedAt,
                      statistics.viewCount)
# Join the two snapshots by snippet.title
joined_df <- inner_join(df1, df2, by = 'snippet.title')
joined_df <- joined_df %>%
  mutate(
    view_delta = statistics.viewCount.y - statistics.viewCount.x,
    number_days_since_publish = as.numeric(Sys.Date() - snippet.publishedAt)
  )
# Recent videos uploaded within 200 days, top 5 by view delta
chart_df <- joined_df %>%
  filter(number_days_since_publish <= 200) %>%
  select(snippet.title, view_delta) %>%
  arrange(desc(view_delta)) %>% head(5)
ggplot(data = chart_df,
       mapping = aes(
         x = reorder(snippet.title, view_delta),
         y = view_delta
       )) +
  geom_bar(stat = "identity", fill = "lightblue") +
  scale_x_discrete(
    labels = function(x)
      str_wrap(x, width = 16)
  ) +
  theme_minimal()
# Output
#                                                                 snippet.title view_delta
# 1 🔴 CoComelon Songs Live 24/7 - Bath Song + More Nursery Rhymes & Kids Songs    2074257
# 2                Yes Yes Fruits Song | CoComelon Nursery Rhymes & Kids Songs     1709434
# 3                      Airplane Song | CoComelon Nursery Rhymes & Kids Songs      977383
# 4                  Bingo's Bath Song | CoComelon Nursery Rhymes & Kids Songs      951159
# 5  Fire Truck Song - Trucks For Kids | CoComelon Nursery Rhymes & Kids Songs      703467
The top trending video is 🔴 CoComelon Songs Live 24/7. This suggests that parents keep kids on videos that rotate automatically, without having to switch videos explicitly. The remaining videos are promising single songs that make good recommendations.
There are many videos for kids to watch on YouTube. Cocomelon has many of them, and I want to show my kid the good ones within the limited time he's allowed to watch daily. Finding these trending videos is a fascinating exploration for a data professional.
I hope my post is helpful to you. As the next step, I'll continue my journey in R and use Shiny to build an interactive application for users.
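As a teaser, here is a minimal sketch of what such a Shiny app could start from, reading the CSV we persisted earlier (the app name and the slider bound of 20 are arbitrary choices for illustration):
library(shiny)
library(vroom)
library(dplyr)
library(ggplot2)

videos_df <- vroom("~/cocomelon.csv")

ui <- fluidPage(
  titlePanel("Cocomelon video explorer"),
  sliderInput("top_n", "Number of top videos", min = 3, max = 20, value = 5),
  plotOutput("top_views")
)

server <- function(input, output, session) {
  output$top_views <- renderPlot({
    # Bar chart of the top N videos by view count, driven by the slider
    videos_df %>%
      arrange(desc(statistics.viewCount)) %>%
      head(input$top_n) %>%
      ggplot(aes(x = reorder(snippet.title, statistics.viewCount),
                 y = statistics.viewCount)) +
      geom_col(fill = "lightgreen") +
      coord_flip() +
      theme_minimal()
  })
}

shinyApp(ui, server)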