I know the manual work you did last summer
A few weeks ago, I wrote a post about building a data pipeline using both on-premise and AWS tools. That post is part of my recent effort to bring more cloud-oriented data engineering content.
However, when mentally reviewing that post, I noticed a big problem: the manual work.
Every time I develop a new project, whether real or fictional, I always try to reduce the friction of configuring the environment (installing dependencies, configuring folders, obtaining credentials, and so on), and that's why I always use Docker. With it, I just hand you a docker-compose.yaml file plus a few Dockerfiles and you are able to create exactly the same environment as mine with a single command: docker compose up.
However, when we want to develop a new data project with cloud tools (S3, Lambda, Glue, EMR, and so on), Docker can't help us, since the components need to be instantiated in the provider's infrastructure. There are two main ways of doing this: manually in the UI or programmatically through the service APIs.
For example, you can open the AWS console in your browser, search for S3, and create a new bucket manually, or write Python code that creates the same resource by making a request to the AWS API.
In the post mentioned earlier, I described step by step how to create the needed components MANUALLY through the AWS web interface. The result? Even trying to summarize as much as possible (and even omitting parts!), the post ended up as a 17-minute read, 7 minutes longer than I usually write, full of SCREENSHOTS showing which screen you should access, where you should click, and which settings to choose.
Besides being a costly, complex, and time-consuming process, it is still prone to human error, which ends up bringing more headaches and possibly even bad surprises in the monthly bill. Definitely an unpleasant process.
And that is exactly the kind of problem that Terraform comes to solve.
not sponsored.
Terraform is an IaC (Infrastructure as Code) tool that manages infrastructure in cloud providers in an automated, programmatic way.
In Terraform, the desired infrastructure is described using a declarative language called HCL (HashiCorp Configuration Language), in which the components are specified, e.g. an S3 bucket named "my-bucket" and an EC2 server running Ubuntu 22 in the us-east-1 region.
The described resources are materialized by Terraform through calls to the cloud provider's service APIs. Beyond creation, it is also capable of destroying and updating the infrastructure, adding/removing only the resources needed to move from the current state to the desired state, e.g. if 4 EC2 instances are requested, it will create only 2 new instances if 2 already exist. This behavior is possible because Terraform stores the current state of the infrastructure in state files.
Because of this, it is possible to manage a project's infrastructure in a much more agile and safe way, since it removes the manual work of configuring each individual resource.
Terraform's proposal is to be a cloud-agnostic IaC tool, so it uses a standardized language to mediate the interaction with the cloud providers' APIs, removing the need to learn how to interact with each of them directly. Along the same lines, the HCL language also supports variable manipulation and a certain degree of 'flow control' (if-statements and loops), allowing the use of conditionals and loops in resource creation, e.g. create 100 EC2 instances, as sketched below.
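A quick illustration of that idea (a hedged sketch, not part of this project's code; the AMI id and names are placeholders): the count meta-argument is how this kind of loop is usually expressed.
# Create several identical EC2 instances with a single block
resource "aws_instance" "workers" {
  count         = 3                       # raise to 100 to create 100 instances
  ami           = "ami-0123456789abcdef0" # placeholder AMI id
  instance_type = "t3.micro"

  tags = {
    Name = "worker-${count.index}"
  }
}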
Last but not least, Terraform also allows infrastructure versioning, since its plain-text files can be easily managed with git.
As mentioned earlier, this post seeks to automate the infrastructure-creation process of my previous post.
To recap, the project developed aimed at creating a data pipeline to extract questions from the Brazilian ENEM (National High School Exam, in a literal translation) tests using the PDFs available on the MEC (Ministry of Education) website.
The process involved three steps, managed by a local Airflow instance: downloading and uploading the PDF file to S3 storage, extracting the text from the PDFs using a Lambda function, and segmenting the extracted text into questions using a Glue job.
Note that, for this pipeline to work, many AWS components have to be created and correctly configured.
0. Setting up the environment
All the code used in this project is available in this GitHub repository.
You'll need a machine with Docker and an AWS account.
The first step is configuring a new AWS IAM user for Terraform; this will be the only step executed in the AWS web console.
Create a new IAM user with FullAccess to S3, Glue, Lambda, and IAM, and generate access credentials for it.
That is a lot of permission for a single user, so keep the credentials safe.
I'm using FullAccess permissions because I want to make things easier for now, but always consider the 'least privilege' approach when dealing with credentials.
Now, back to the local environment.
In the same path as the docker-compose.yaml file, create a .env file and write your credentials:
AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
These variables will be passed to the docker-compose file to be used by Terraform.
version: '3'
services:
  terraform:
    image: hashicorp/terraform:latest
    volumes:
      - ./terraform:/terraform
    working_dir: /terraform
    command: ["init"]
    environment:
      - TF_VAR_AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - TF_VAR_AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
      - TF_VAR_AWS_DEFAULT_REGION=us-east-1
1. Create the Terraform file
Still in the same folder, create a new directory called terraform. Inside it, create a new file main.tf; this will be our main Terraform file.
This folder will be mapped inside the container when it runs, so the Terraform inside it will be able to see this file.
2. Configure the AWS Provider
The first thing we need to do is configure the cloud provider to be used.
terraform {
  required_version = ">= 0.12"

  required_providers {
    aws = ">= 3.51.0"
  }
}

variable "AWS_ACCESS_KEY_ID" {
  type = string
}

variable "AWS_SECRET_ACCESS_KEY" {
  type = string
}

variable "AWS_DEFAULT_REGION" {
  type = string
}

provider "aws" {
  access_key = var.AWS_ACCESS_KEY_ID
  secret_key = var.AWS_SECRET_ACCESS_KEY
  region     = var.AWS_DEFAULT_REGION
}
This is what a Terraform configuration file looks like: a set of blocks of different types, each one with a specific function.
The terraform block pins the versions of Terraform itself and of the AWS provider.
A variable is exactly what the name suggests: a value assigned to a name that can be referenced throughout the code.
As you probably already noticed, our variables don't have a value assigned to them, so what's going on? The answer is back in the docker-compose.yaml file: the value of these variables was set using environment variables in the system. When a variable's value is not defined, Terraform looks at the environment variable TF_VAR_<var_name> and uses its value. I've opted for this approach to avoid hard-coding the keys.
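As a quick illustration (a hypothetical variable, not used in this project), the declaration below would be filled by an environment variable named TF_VAR_project_name; the default only applies when that variable is not set.
variable "project_name" {
  type    = string
  default = "enem-pipeline"  # used only if TF_VAR_project_name is not exported
}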
The provider block is also self-explanatory: it references the cloud provider we're using and configures its credentials. We set the provider's arguments (access_key, secret_key, and region) with the variables defined earlier, referenced with the var.<var_name> notation.
With this block defined, run:
docker compose run terraform init
to set up Terraform.
3. Creating our first resource: The S3 bucket
Terraform uses the resource block to reference infrastructure components such as S3 buckets and EC2 instances, as well as actions like granting permissions to users or uploading files to a bucket.
The code below creates a new S3 bucket for our project.
useful resource "aws_s3_bucket" "enem-bucket-terraform-jobs" {
bucket = "enem-bucket-terraform-jobs"
}
A resource definition follows the syntax:
resource <resource_type> <resource_name> {
  argument_1 = "blah blah blah blah"
  argument_2 = "blah blah blah"
  argument_3 {
    ...
  }
}
In the case above, "aws_s3_bucket" is the resource type and "enem-data-bucket" is the resource name, used to reference this resource in the file (it is not the bucket name in the AWS infrastructure). The argument bucket = "enem-data-bucket" is what assigns a name to our bucket.
Now, with the command:
docker compose run terraform plan
Terraform will compare the current state of the infrastructure and infer what needs to be done to reach the desired state described in the main.tf file.
Because this bucket doesn't exist yet, Terraform will plan to create it.
To apply Terraform's plan, run
docker compose run terraform apply
And, with only these few commands, our bucket is already created.
Easy, right?
To destroy the bucket, just type:
docker compose run terraform destroy
and Terraform takes care of the rest.
These are the basic commands that will follow us until the end of the post: plan, apply, destroy. From now on, all that we're going to do is configure the main.tf file, adding the resources needed to materialize our data pipeline.
4. Configuring the Lambda Function part I: Roles and permissions
Now on to the Lambda function definition.
This was one of the trickiest parts of my previous post because, by default, Lambda functions already need a set of basic permissions and, on top of that, we also had to give it read and write permissions on the S3 bucket created earlier.
First of all, we must create a new IAM role.
# CREATE THE LAMBDA FUNCTION
# ==========================

# CREATE A NEW ROLE FOR THE LAMBDA FUNCTION TO ASSUME
resource "aws_iam_role" "lambda_execution_role" {
  name = "lambda_execution_role_terraform"
  assume_role_policy = jsonencode({
    # This is the policy document that allows the role to be assumed by Lambda
    # other services cannot assume this role
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}
When developing these things, I strongly suggest that you first ask for what you want in ChatGPT, GitHub Copilot, or any other LLM buddy, and then check the provider's documentation on how this type of resource works.
The code above creates a new IAM role and allows AWS Lambda functions to assume it. The next step is to attach the Lambda Basic Execution policy to this role to allow the Lambda function to execute without errors.
# ATTACH THE BASIC LAMBDA EXECUTION POLICY TO THE ROLE lambda_execution_role
resource "aws_iam_role_policy_attachment" "lambda_basic_execution" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
  role       = aws_iam_role.lambda_execution_role.name
}
The nice thing to note in the code above is that we can reference resource attributes and pass them as arguments when creating new resources. In the case above, instead of hard-coding the 'role' argument with the name of the previously created role 'lambda_execution_role_terraform', we can reference this attribute using the syntax:
<resource_type>.<resource_name>.<attribute>
If you take some time to look into the Terraform documentation of a resource, you'll note that it has arguments and attributes. Arguments are what you pass in order to create/configure a new resource, and attributes are read-only properties about the resource that become available after its creation.
Because of this, attributes are used by Terraform to implicitly manage dependencies between resources, establishing the correct order of their creation.
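A minimal sketch of that idea (an output block added just for illustration, not required by the pipeline): because the value below references the bucket's arn attribute, Terraform knows the bucket must be created before this output can be resolved.
# Expose the bucket's ARN; the reference creates an implicit dependency on the bucket
output "enem_data_bucket_arn" {
  value = aws_s3_bucket.enem-data-bucket.arn
}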
The code below creates a new access policy for our S3 bucket, allowing basic CRUD operations on it.
# CREATE A NEW POLICY FOR THE LAMBDA FUNCTION TO ACCESS S3
resource "aws_iam_policy" "s3_access_policy" {
  name = "s3_access_policy"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:DeleteObject"
        ]
        # object-level actions apply to the objects inside the bucket
        Resource = "${aws_s3_bucket.enem-data-bucket.arn}/*"
      }
    ]
  })
}

# ATTACH THE EXECUTION POLICY AND THE S3 ACCESS POLICY TO THE ROLE lambda_execution_role
resource "aws_iam_policy_attachment" "s3_access_attachment" {
  name       = "s3_and_lambda_execution_access_attachment"
  policy_arn = aws_iam_policy.s3_access_policy.arn
  roles      = [aws_iam_role.lambda_execution_role.name]
}
Again, instead of hard-coding the bucket's ARN, we can reference this attribute using aws_s3_bucket.enem-data-bucket.arn.
With the Lambda role correctly configured, we can finally create the function itself.
# CREATE A NEW LAMBDA FUNCTION
resource "aws_lambda_function" "lambda_function" {
  function_name = "my-lambda-function-aws-terraform-jp"
  role          = aws_iam_role.lambda_execution_role.arn
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.8"
  filename      = "lambda_function.zip"
}
The lambda_function.zip file is a compressed folder that must contain a lambda_function.py file with a lambda_handler(event, context) function inside. It must be in the same path as the main.tf file. (An alternative way of building this zip is sketched right after the snippet below.)
# lambda_function.py
def lambda_handler(event, context):
    return "Hello from Lambda!"
5. Configuring the Lambda Function part II: Attaching a trigger
Now, we need to configure a trigger for the Lambda function: it should execute every time a new PDF is uploaded to the bucket.
# ADD A TRIGGER TO THE LAMBDA FUNCTION BASED ON S3 BUCKET CREATION EVENTS
# https://stackoverflow.com/questions/68245765/add-trigger-to-aws-lambda-functions-via-terraform
resource "aws_lambda_permission" "allow_bucket_execution" {
  statement_id  = "AllowExecutionFromS3Bucket"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.lambda_function.arn
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.enem-data-bucket.arn
}

resource "aws_s3_bucket_notification" "bucket_notification" {
  bucket = aws_s3_bucket.enem-data-bucket.id
  lambda_function {
    lambda_function_arn = aws_lambda_function.lambda_function.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = ".pdf"
  }
  depends_on = [aws_lambda_permission.allow_bucket_execution]
}
This is a case where we must specify an explicit dependency between resources, since the "bucket_notification" resource needs to be created after "allow_bucket_execution".
This can be easily achieved by using the depends_on argument.
And we're done with the Lambda function; just run:
docker compose run terraform apply
and the Lambda function will be created.
6. Adding a module for the Glue job
Our main.tf file is getting quite big, and remember that this is just a simple data pipeline. To improve the organization and reduce its size, we can use the concept of modules.
A module is a set of resources grouped in a separate file that can be referenced and reused by other configuration files. Modules let us abstract complex parts of the infrastructure to make our code more manageable, reusable, organized, and modular.
So, instead of coding all the resources needed to create our Glue job in the main.tf file, we'll put them inside a module.
In the ./terraform folder, create a new folder 'glue' with a glue.tf file inside it.
Then add a new S3 bucket resource to the file:
# INSIDE GLUE.TF
# Create a new bucket to store the job script
resource "aws_s3_bucket" "enem-bucket-terraform-jobs" {
  bucket = "enem-bucket-terraform-jobs"
}
Back in main.tf, just reference this module with:
module "glue" {
  source = "./glue"
}
And reinitialize Terraform:
docker compose run terraform init
Terraform will restart its backend and initialize the module along with it.
Now, if we run terraform plan, it should include this new bucket in the creation list:
Using this module, we'll be able to encapsulate all the logic of creating the job in a single external file.
A requirement of AWS Glue jobs is that their job files must be stored in an S3 bucket, and that's why we created "enem-bucket-terraform-jobs". Now, we must upload the job's file itself.
In the terraform path, I'd included a myjob.py file; it's just an empty file used to simulate this behavior. To upload a new object to a bucket, just use the "aws_s3_object" resource:
# UPLOAD THE SPARK JOB FILE myjob.py TO S3
resource "aws_s3_object" "myjob" {
  bucket = aws_s3_bucket.enem-bucket-terraform-jobs.id
  key    = "myjob.py"
  source = "myjob.py"
}
From now on, it's just a matter of implementing the Glue role and creating the job itself.
# CREATE A NEW ROLE FOR THE GLUE JOB TO ASSUME
resource "aws_iam_role" "glue_execution_role" {
  name = "glue_execution_role_terraform"
  assume_role_policy = jsonencode({
    # This is the policy document that allows the role to be assumed by Glue
    # other services cannot assume this role
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "glue.amazonaws.com"
        }
      }
    ]
  })
}

# ATTACH THE BASIC GLUE EXECUTION POLICY TO THE ROLE glue_execution_role
resource "aws_iam_role_policy_attachment" "glue_basic_execution" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
  role       = aws_iam_role.glue_execution_role.name
}
Not so fast. We must make sure that this job has the same read and write permissions to the "enem-data-bucket" bucket as the Lambda function, i.e. we need to attach the aws_iam_policy.s3_access_policy to its role.
But, because this policy was defined in the main file, we can't reference it directly in our module.
# THIS WILL RESULT IN AN ERROR!!!!
# ATTACH THE S3 ACCESS POLICY s3_access_policy TO THE ROLE glue_execution_role
resource "aws_iam_policy_attachment" "s3_access_attachment_glue" {
  name       = "s3_and_glue_execution_access_attachment"
  policy_arn = aws_iam_policy.s3_access_policy.arn
  roles      = [aws_iam_role.glue_execution_role.name]
}
In order to achieve this behavior, we must pass the access policy's ARN as an argument to the module, and that's quite simple.
First, in the glue.tf file, create a new variable to receive the value.
variable "enem-data-bucket-access-policy-arn" {
sort = string
}
Return to the primary file and, within the module reference, cross a worth to this variable.
module "glue" {
supply = "./glue"
enem-data-bucket-access-policy-arn = aws_iam_policy.s3_access_policy.arn
}
Lastly, within the glue file, use the worth of the variable within the useful resource.
# ATTACH THE S3 ACCESS POLICY s3_access_policy TO THE ROLE glue_execution_role
resource "aws_iam_policy_attachment" "s3_access_attachment_glue" {
  name       = "s3_and_glue_execution_access_attachment"
  policy_arn = var.enem-data-bucket-access-policy-arn
  roles      = [aws_iam_role.glue_execution_role.name]
}
Now, take a minute to think about the power of what we've just done. With modules and arguments, we can create fully parametrized, complex infrastructures.
The code above doesn't just create one specific job for our pipeline. By just changing the value of the enem-data-bucket-access-policy-arn variable, we can create a new job to process data from a completely different bucket.
And that logic applies to anything you want. It's possible, for example, to simultaneously create an entire infrastructure for a project's development, testing, and production environments, using just variables to switch between them, as sketched below.
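A hedged illustration of that idea (the variable and the bucket below are made up for this example): the same definition can serve development, testing, and production just by switching one value.
variable "environment" {
  type    = string
  default = "dev"  # set TF_VAR_environment to "test" or "prod" to switch
}

resource "aws_s3_bucket" "enem-data-bucket-per-env" {
  bucket = "enem-data-bucket-${var.environment}"
}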
Without further talking, all that remains is to create the Glue job itself, and there's no novelty in that:
# CREATE THE GLUE JOB
resource "aws_glue_job" "myjob" {
  name         = "myjob"
  role_arn     = aws_iam_role.glue_execution_role.arn
  glue_version = "4.0"
  command {
    script_location = "s3://${aws_s3_bucket.enem-bucket-terraform-jobs.id}/myjob.py"
  }
  default_arguments = {
    "--job-language"        = "python"
    "--job-bookmark-option" = "job-bookmark-disable"
    "--enable-metrics"      = ""
  }
  depends_on = [aws_s3_object.myjob]
}
And our infrastructure is done. Run terraform apply to create the remaining resources.
docker compose run terraform apply
And terraform destroy to get rid of everything.
docker compose run terraform destroy
I met Terraform a few days after publishing my 2nd post about creating data pipelines using cloud providers, and it blew my mind. I immediately thought of all the manual work that I did to set up the project, all the screenshots captured to showcase the process, and all the undocumented details that will haunt my nightmares when I need to reproduce the process.
Terraform solves all these problems. It's simple, easy to set up, and easy to use; all it needs is a few .tf files along with the provider's credentials and we're ready to go.
Terraform tackles the kind of problem that people usually aren't so excited to think about. When developing data products, we all think about performance, optimization, latency, quality, accuracy, and other data-specific or domain-specific aspects of our product.
Don't get me wrong, we all study to apply our best mathematical and computational knowledge to solve these problems, but we also need to think about important aspects of our product's development process, like reproducibility, maintainability, documentation, versioning, integration, modularization, and so on.
These are aspects that our software engineering colleagues have been concerned about for a long time, so we don't need to reinvent the wheel, just learn a thing or two from their best practices.
That's why I always use Docker in my projects, and that's also why I'll probably add Terraform to my basic toolset.
I hope this post helped you understand this tool, Terraform, including its objectives, basic functionality, and practical benefits. As always, I'm not an expert in any of the subjects addressed in this post, and I strongly recommend further reading; see some references below.
Thanks for reading! 😉
All the code is available in this GitHub repository.
Data used: ENEM PDFs, [CC BY-ND 3.0], MEC - Brazilian Gov.
All images were created by the author, unless otherwise specified.
[1] Add trigger to AWS Lambda functions via Terraform. Stack Overflow. Link.
[2] AWSLambdaBasicExecutionRole, AWS Managed Policy. Link.
[3] Brikman, Y. (2022, October 11). Terraform tips & tricks: loops, if-statements, and gotchas. Medium.
[4] Create Resource Dependencies | Terraform | HashiCorp Developer. Link.
[5] TechWorld with Nana. (2020, July 4). Terraform explained in 15 minutes | Terraform Tutorial for Beginners [Video]. YouTube.
[6] Terraform Registry. AWS provider. Link.