Modernization of Informatica ETL jobs to AWS Glue
In 2018, Autoverse was tasked with converting Informatica ETL jobs to AWS Glue. The overall objective of the project was to free our client from costly licensing and modernize their ETL solution. The ETL combined data from multiple disparate sources, including MySQL and PostgreSQL databases and numerous flat files landing on S3.
Our first step was to log in to the Cloud Informatica console and begin understanding the ETL flows. ETL flow diagrams were created from this discovery and formed the foundation of our migration plan. Once these flow diagrams were created, we analyzed the process to determine simpler, more efficient ways of doing ETL, and spent time modernizing the ETL patterns with AWS Cloud services.
We wanted an automated process that was re-entrant and recovered automatically from failure. Our goal was to develop a process that required zero manual intervention and worked every time it ran, even after a failure.
Data Sources and File Processing
After we developed a clear understanding of the ETL jobs and our modernization path in AWS, we started determining our data sources and targets. Source data came from numerous places, such as DB2, MySQL, and PostgreSQL databases; we also received flat files in both CSV and ASCII formats. Our targets were flat files on S3 in CSV and ASCII formats. For this ETL job to be fully automated and re-entrant, it was important to develop a process for handling both received files and files created by the ETL flows. When a new file arrived in S3, an event was triggered that called a Lambda function. The Lambda function dated the new file, placed it in a different S3 prefix, and appended a READY suffix to the filename.
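The Lambda step above can be sketched as follows. This is a minimal illustration, not the production code: the `incoming/` prefix, the timestamp format, and the `ready_key` helper name are assumptions, and the key-building logic is split out so it can be exercised without AWS.

```python
import datetime
import urllib.parse


def ready_key(key, prefix="incoming", now=None):
    """Build the new object key: a different prefix (assumed name), a date
    stamp, and the READY suffix that signals the file can be processed."""
    stamp = (now or datetime.datetime.utcnow()).strftime("%Y%m%d%H%M%S")
    filename = key.rsplit("/", 1)[-1]
    return f"{prefix}/{filename}.{stamp}.READY"


def handler(event, context):
    """Triggered by an S3 ObjectCreated event for newly received files."""
    import boto3  # available in the AWS Lambda Python runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        new_key = ready_key(key)
        # S3 has no rename: copy to the new key, then delete the original.
        s3.copy_object(Bucket=bucket,
                       CopySource={"Bucket": bucket, "Key": key},
                       Key=new_key)
        s3.delete_object(Bucket=bucket, Key=key)
```

Because S3 objects cannot be renamed in place, the "rename" is a copy followed by a delete of the original key.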
Glue Pipeline and Workflow Creation
Once the file was renamed with the READY suffix, it triggered a Lambda function that ran the ETL flow. The ETL flow continually updated this suffix as the process ran: the filename's suffix would change from STARTING to PROCESSING, and finally to PROCESSED or FAILED. We designed it this way so the process would communicate the specific stage of each run. We wanted our client to be able to log in to AWS, view the files in the S3 bucket, and get an idea of what was happening within an ETL flow without knowing anything about the process. When the AWS Glue job finished completely and successfully, processed files were moved to a different S3 prefix and archived. For files with the FAILED suffix, an entire failure process was created to automatically fix bad data, notify stakeholders of issues, and essentially self-heal from an errored ETL flow.
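A small helper along these lines could implement the suffix transitions the flow walks through. This is a sketch under assumptions: the `set_suffix` name and the exact state strings are illustrative, and the S3 client is passed in so the logic is testable.

```python
# Suffixes the ETL flow moves each file through as the run progresses
# (READY -> STARTING -> PROCESSING -> PROCESSED or FAILED, per the text above).
STATES = ("READY", "STARTING", "PROCESSING", "PROCESSED", "FAILED")


def set_suffix(s3, bucket, key, new_state):
    """Replace the file's state suffix so anyone browsing the bucket can
    see where the ETL flow is without knowing anything about the process."""
    base, _, current = key.rpartition(".")
    if current not in STATES:
        raise ValueError(f"key {key!r} has no recognized state suffix")
    new_key = f"{base}.{new_state}"
    # S3 has no in-place rename: copy under the new key, delete the old one.
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": key},
                   Key=new_key)
    s3.delete_object(Bucket=bucket, Key=key)
    return new_key
```

The Glue job would call this at each stage boundary, e.g. `set_suffix(s3, bucket, key, "PROCESSING")` when work begins.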
Glue jobs in PySpark and Pandas
Much of our time was spent working through the Informatica ETL process and converting it to AWS Glue jobs using PySpark and the AWS Glue Python packages. We had difficulty working with the ASCII-formatted files: ASCII is generally not a supported file format and is considered legacy. We worked around this issue by padding each column with blanks until the maximum record length was reached. Exporting the data consisted of concatenating all the columns in the data frame and exporting the result as a single-column CSV file.
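The pad-and-concatenate workaround can be illustrated in pandas. The column names and widths here are hypothetical; in the real project the widths would come from the fixed-length record layouts the ASCII files used.

```python
import io

import pandas as pd

# Hypothetical column widths; the real values came from the record layouts.
WIDTHS = {"id": 6, "name": 12, "amount": 10}


def to_fixed_width(df, widths):
    """Pad every column with blanks to its fixed width, then concatenate
    the columns into a single string per record."""
    padded = [df[col].astype(str).str.ljust(width) for col, width in widths.items()]
    records = padded[0]
    for col in padded[1:]:
        records = records + col
    return records


df = pd.DataFrame({"id": [1, 2], "name": ["widget", "gadget"], "amount": [9.99, 12.5]})
lines = to_fixed_width(df, WIDTHS)

# Writing the result with no header or index yields the single-column flat file.
buf = io.StringIO()
lines.to_csv(buf, index=False, header=False)
```

Because every record is one padded string, the CSV writer never quotes or splits anything, so the output is effectively a plain fixed-width ASCII file.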