Harnessing AWS Glue for High-Volume JSON Processing 🚀

Introduction:

In today's data-driven world 🌐, managing and interpreting substantial amounts of data has become increasingly vital. We often encounter scenarios where JSON files flood our AWS S3 buckets, demanding efficient processing. In this blog, I will share a recent use case where I successfully used AWS Glue, Python, and Terraform (mainly to handle the infrastructure) to tackle the challenge of processing a deluge of small JSON files efficiently.

Background:

Data handling at scale can be an arduous task, especially when dealing with a continuous influx of JSON files. I was asked to process approximately 1 million small JSON files per hour landing in an S3 bucket! 🕰️ These files needed to be transformed into larger, more manageable XML files for further analysis by another application. To address this demanding task, I leveraged the power of AWS Glue and other technologies to create a robust and scalable data processing pipeline.

Tech Stack:

Terraform for creating AWS resources 🏗️
Glue, S3, Systems Manager, and CloudWatch services from AWS 🛠️
Python for writing the Glue scripts 🐍

Overcoming the Small Files Problem with Glue:

The first hurdle I encountered was the infamous "Small Files Problem" within AWS Glue. This challenge arises when the Spark driver is overwhelmed by the list() calls it needs to enumerate millions of tiny objects in S3, leading to memory exhaustion and, unfortunately, job failures. The first iteration of my solution ran for 6 hours and, you guessed it, failed due to a memory error. 😅

[Image: the S3 small files issue]
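To make the scale concrete, here is a minimal boto3 sketch (the bucket name and prefix are hypothetical) that simply pages through one landing prefix. At roughly a million objects per hour, even this plain enumeration runs into thousands of 1,000-key pages, which is exactly the kind of bookkeeping that was piling up on the Spark driver.

```python
# Hypothetical bucket/prefix: paging through one hour of incoming objects.
# With ~1 million small files per hour, a plain listing already means
# thousands of 1,000-key pages before any real processing starts.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total_objects = 0
for page in paginator.paginate(Bucket="my-landing-bucket", Prefix="incoming/"):
    total_objects += len(page.get("Contents", []))

print(f"Objects under prefix: {total_objects}")
```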

To combat this issue, I devised a primary Glue job written in Python, leveraging PySpark and Boto3. This job merged multiple small JSON files into larger ones based on a configurable parameter, mitigating the Small Files Problem. It also optimized downstream processing by partitioning the merged files into date-based folders within the S3 bucket. I saved the files in the following format:

YYYYMMDD/HHMM

This approach let me run the partition job multiple times per day, grouping files by specific hours. That made the main transformation job, which runs at midnight, more efficient, since it dealt with far fewer files.
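As a rough illustration, here is a minimal PySpark sketch of that merge step. The bucket names and the Glue job parameter are hypothetical, and the real job also used boto3 for housekeeping; this only shows the core compaction and the partitioned write:

```python
# Minimal sketch of the merge/partition job (hypothetical buckets and
# parameter names). Reads the flood of small JSON files and rewrites them
# as far fewer, larger files under a YYYYMMDD/HHMM folder.
import sys
from datetime import datetime, timezone

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Target number of merged output files, passed as a Glue job argument.
args = getResolvedOptions(sys.argv, ["target_file_count"])
target_files = int(args["target_file_count"])

spark = SparkSession.builder.appName("merge-small-json").getOrCreate()

# Read every small JSON file from the landing prefix.
df = spark.read.json("s3://my-landing-bucket/incoming/")

# Write a handful of larger files into a date/hour partition so the midnight
# transformation job has far fewer objects to list and read.
now = datetime.now(timezone.utc)
output_path = f"s3://my-staging-bucket/{now:%Y%m%d}/{now:%H%M}/"
df.coalesce(target_files).write.mode("overwrite").json(output_path)
```

The coalesce call only ever reduces the number of partitions, which is what makes it a cheap way to turn millions of tiny inputs into a configurable number of larger outputs.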

Transforming and Optimizing Data Processing:

The secondary, or main, Glue job was designed to initialize a Spark session and fetch essential parameters like batch size and date from the AWS Systems Manager (SSM) Parameter Store. This job processed the JSON files in batches using Spark's inherent capabilities. I used the popular Python library xml.etree.ElementTree to convert JSON data into the desired XML format. To optimize storage and write performance, the transformed data was compressed using gzip before being written back to the S3 bucket.
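The sketch below shows the shape of that flow. The SSM parameter names, bucket names, and flat JSON layout are assumptions for illustration; in the real job the records come from the partitioned staging files read with Spark rather than an in-memory list:

```python
# Hypothetical parameter names, buckets, and record layout: fetch settings
# from SSM, convert batches of JSON records to XML, gzip them, and upload.
import gzip
import json
import xml.etree.ElementTree as ET

import boto3

ssm = boto3.client("ssm")
s3 = boto3.client("s3")

# Batch size and run date live in the SSM Parameter Store.
batch_size = int(ssm.get_parameter(Name="/json2xml/batch_size")["Parameter"]["Value"])
run_date = ssm.get_parameter(Name="/json2xml/run_date")["Parameter"]["Value"]

def records_to_xml(records):
    """Turn a list of flat JSON dicts into one XML document (as bytes)."""
    root = ET.Element("records")
    for rec in records:
        item = ET.SubElement(root, "record")
        for key, value in rec.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="utf-8")

# Stand-in for the records parsed from the staged JSON files.
all_records = [json.loads('{"id": 1, "status": "ok"}'),
               json.loads('{"id": 2, "status": "retry"}')]

# Write one gzipped XML file per batch back to S3.
for batch_no, start in enumerate(range(0, len(all_records), batch_size)):
    chunk = all_records[start:start + batch_size]
    payload = gzip.compress(records_to_xml(chunk))
    s3.put_object(
        Bucket="my-output-bucket",
        Key=f"{run_date}/batch-{batch_no:04d}.xml.gz",
        Body=payload,
    )
```

Compressing with gzip before the put keeps both the storage footprint and the write time small on the output side.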

Here is the high-level architecture:

[Image: high-level architecture diagram]

The Fruitful Outcome:

The result? A super-efficient pipeline that processed approximately 70k files in less than 2.5 minutes! 🚀 This feat required 10 Data Processing Units (DPUs) in AWS Glue, and the solution can accommodate additional DPUs to meet future requirements. With the current setup, it can comfortably process 1 million JSON files in around 35 minutes, which is fast and leaves room to scale further.

Key Takeaway: The Power of Synergistic Tools and Strategies

The experience of using AWS Glue, Python, and Terraform to handle large amounts of JSON data highlights the significance of choosing the right combination of tools and strategies to address big data challenges. As data keeps growing rapidly, AWS Glue offers a reliable platform for building robust data processing pipelines, Python's flexibility allows for intricate data transformations, and Terraform keeps the supporting infrastructure reproducible and easy to manage. By leveraging these synergistic tools and strategies, you can effectively tackle the complexities of big data processing. 🏗️👏

P.S. - Thanks to generative AI for the edits ;) and making my English look better. The use case is real though :D

