Deploy AWS Glue ETL Using AWS CodePipelines
In this post, we discuss a centralized deployment solution utilizing AWS
for data lakes. The solution implements continuous integration of data lake
infrastructure, data processing and ETL jobs through mult-accounts.
continuous Deployment Solution
The following deployment strategy is based on the following best practice design
1. Dedicated AWS account to run CDK pipelines.
2. One or more AWS accounts into which the data lake is deployed.
3. The data lake infrastructure has a dedicated source code repository.
Data lake infrastructure is rarely re-deployed and it makes good sense to
it's own repository to add a level of isolation from frequently changing ETL
4. Each ETL job or a group of similar ETL jobs has a dedicated source code
repository. Each/group of ETL jobs may
have unique AWS service, orchestration, and configuration requirements.
Therefore, a dedicated source code repository allow more flexibly in
building and maintaining the
The AWS CodeCommit repository is organized into three branches: dev (main),
test, and prod. A centralized deployment account contains three separate CDK
each pipeline is sourced from a dedicated branch. Here we choose a branch-based
software development method in order to demonstrate the strategy in more complex
scenarios where integration testing and validation layers require human
facilitates the propagation of changes through
environments without blocking independent development priorities. We accomplish
isolating resources across environments in the central deployment account,
for the independent management of each environment, and avoiding
during each pipeline’s self-mutating updates.
A closer look at the CICD pipeline and deployment flow
Each repository represents a cloud infrastructure code definition. This includes the
pipelines construct definition.
Pipelines have one or more actions, such as cloning the source code (source action) and
synthesizing the stack into an AWS CloudFormation template (synth action). Each pipeline
has one or more stages, such as testing and deploying.
CICD Pipeline Explained
1. The DevOps administrator checks in the code to the repository.
2. The DevOps administrator (with elevated access) facilitates a one-time manual
deployment on a target environment. Elevated access includes administrative
privileges on the central deployment account and target AWS environments.
3. CodePipeline periodically listens to commit events on the source code
repositories. This is the self-mutating nature of CodePipeline. It’s configured
to work with and can update itself according to the provided definition.
4. Code changes made to the main repo branch are automatically deployed to the
data lake dev environment.
5. Code changes to the repo test branch are automatically deployed to the test
6. Code changes to the repo prod branch are automatically deployed to the prod
Source code and AWS Secrets Manager allow deployments to utilize values that are
declared globally in a single location. Providig consistent management of global
configurations and dependencies such as resource names,
AWS account Ids, Regions, and VPC CIDR ranges. Similarly, the CDK Pipelines export
outputs from CloudFormation stacks for later consumption via other resources.
Repeatable and consistent deployment of new ETL jobs
Continuous integration and continuous delivery pipelines allow teams to deploy
to production more frequently. Code changes can be safely and securely propagated
through environments and released for deployment. This allows rapid iteration on data
processing jobs, and these jobs can be changed in isolation from pipeline changes,
resulting in reliable workflows.
Scalable and centralized deployment model
We utilize a scalable and centralized deployment model to deliver end-to-end automation.
This allows DevOps and data engineers to use the single responsibility principal while
maintaining precise control over the deployment strategy and code quality. The model can
readily be expanded to more accounts, and the pipelines are responsive to custom
controls within each environment, such as a production approval layer.