Autoverse: Refactor Informatica ETL to AWS Glue

A Customer Journey

Modernize Informatica ETL to AWS GLue

Modernization of Informatica ETL jobs to AWS Glue

In 2021 Autoverse was tasked with the job of converting Informatica ETL jobs to AWS Glue. The overall objective of the project was to free our client from costly licensing and modernize their ETL solution. ETL was combining data from multiple disparate sources, such as MySQL, PostgreSQL, numerous flat files landing on S3.

ETL Assessment

Our first step was to log in to the Cloud Informatica console and start understanding the ETL flows. ETL flow diagrams were created from this discovery and formed the foundation of our migration plan. Once these flow diagrams were created, we analyzed the process to determine a more efficient and simpler ways of doing ETL. We spent time modernizing ETL patterns with AWS Cloud services. We wanted an automated process that was re-entrant and automatically recovered from a failure. Our goal was to develop a process that required zero manual intervention, and worked every time it ran even after failure.

Data Sources and File Processing

After we developed a clear understanding of the ETL jobs and our modernization path in AWS, we started to work on determining our data sources and targets. Source data came from numerous places such as DB2, MySQL and PostgreSQL databases, we also had flat files that were received in both CSV and ASCII formats. Our targets were flat flies on s3 in CSV and ASCII formats. In order for this ETL job to be fully automated and re-entrant, it was important to develop a process of handling received files and files that are created by ETL flows. When new files were received to s3, an event was triggered which called a Lambda function. The lambda function dated the new file, placed the file in a different S3 prefix and the filename was updated with a suffix of READY.

In order for ETL jobs to be fully automated and re-entrant, it was important to develop a process of handling received and created files. When new files were received to s3, an event was triggered which called a Lambda function. The lambda function dated the new file, placed the file in a different S3 folder and updated the file name with a READY suffix. The lambda function governed all new files that were added to the s3 bucket with a specific naming convention.

Glue Pipeline and Workflow Creation

Once the file's name was changed with the suffix READY it triggered a lambda function that ran the ETL flow. The ETL flow would continually update this suffix as the process was running. The file name's suffix would change from STARTING, PROCESSING, PROCESSED or FAILED. We designed it this way in order to have the process communicate specific stages of the run. We wanted our client to be able to login to AWS and view s3 bucket files and get an idea of what was happening within an ETL flow, without knowing anything about the process. When the AWS Glue job was finished and completely successful, processed files are moved to a different S3 prefix and archived. For files with the suffix of FAILURE, an entire failure process was created to automatically fix bad data, notify stakeholders of issues and essentially self-heal from an errored ETL flow.

When ETL flow is finished and complealty successful, processed files are moved to a different S3 prefix and archived. For files with the suffix of FAILURE, an entire failure process was created to automatically fix bad data, notify stake holders of issues and essentially self heal from an errored ETL flow.

Glue jobs in PySpark and Pandas

Much of our time was spent working through the Informatica ETL process and converting it to an AWS Glue job using PySpark and AWS Glue python packages. We had difficulty working with ASCII formatted files. ASCII format is generally not a supported file format and is considered legacy. We worked around this issue by padding each column with blanks until the max record length was reached. Exporting the data consisted of concatenating all the columns in the data frame and exporting as a single column CSV file.

Services

Cloud Consulting Services
Advisory Services
Project Management

Technologies

AWS Lambda
AWS Glue
AWS EventBridge
AWS SNS
AWS CloudWatch
Sage Maker
Terraform

Dates

01/2021 - 12/2021