AWS Glue Jobs
Welcome back to our advanced AWS Glue tutorial series. In this blog, we'll delve into AWS Glue's powerful ETL (Extract, Transform, Load) job types, with real-world code examples that demonstrate the platform's flexibility and scalability.
Python Shell Jobs in AWS Glue are well suited to lightweight, precise data processing tasks. Below is a brief example of a Python Shell Job that adds a new column to a CSV file and filters its rows:
import pandas as pd

# Load data from an S3 source (pandas reads s3:// paths when the s3fs package is available)
input_bucket = 'input-bucket'
input_key = 'data/input.csv'
df = pd.read_csv(f's3://{input_bucket}/{input_key}')

# Perform data transformation
df['new_column'] = df['old_column'] * 2
filtered_df = df[df['some_condition'] == True]

# Export the result to S3
output_bucket = 'output-bucket'
output_key = 'data/output.csv'
filtered_df.to_csv(f's3://{output_bucket}/{output_key}', index=False)
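Python Shell jobs run on a single node, so a very large CSV may not fit in memory at once. Here is a minimal sketch of the same transformation done in chunks; the chunk size, paths, and per-chunk output layout are illustrative assumptions, not part of the original example.

import pandas as pd

input_path = 's3://input-bucket/data/input.csv'
output_prefix = 's3://output-bucket/data/output_parts'

# Stream the CSV in chunks so memory use stays bounded on the single Python Shell node
for i, chunk in enumerate(pd.read_csv(input_path, chunksize=100_000)):
    chunk['new_column'] = chunk['old_column'] * 2
    filtered = chunk[chunk['some_condition'] == True]
    # Write each chunk as its own object, since existing S3 objects cannot be appended to
    filtered.to_csv(f'{output_prefix}/part_{i:05d}.csv', index=False)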
Complex data transformations become simple when you harness the power of Apache Spark within AWS Glue. Consider a word count on a text file as an example.
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

# Initialize Spark
sc = SparkContext()
spark = SparkSession(sc)

# Load data from an S3 source
input_bucket = 'input-bucket'
input_key = 'data/input.txt'
lines = spark.read.text(f's3://{input_bucket}/{input_key}')

# Perform a distributed word count
word_counts = (
    lines.rdd
    .flatMap(lambda row: row[0].split(' '))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

# Store the results back in S3. Python's built-in open() cannot write to s3:// paths,
# so let Spark write the output instead.
output_bucket = 'output-bucket'
output_prefix = 'data/word_counts'
word_counts.toDF(['word', 'count']).write.mode('overwrite').csv(f's3://{output_bucket}/{output_prefix}')
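In a real Glue Spark job, the script usually initializes a GlueContext and a Job from the awsglue library rather than a bare SparkContext, so that Glue can track arguments and job bookmarks. A minimal sketch of that boilerplate follows; the JOB_NAME argument is supplied by Glue at run time.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

# Resolve the job name that Glue passes in as an argument
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# GlueContext wraps the SparkContext and exposes the Glue-specific APIs
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... transformations such as the word count above go here ...

job.commit()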
Ray Jobs in AWS Glue are the top choice for compute-intensive operations that need to be distributed and parallelized. Consider processing a batch of images in parallel:
import ray
import boto3

# Connect to the Ray cluster (when run as a Glue Ray job, Glue provisions the cluster)
ray.init()

@ray.remote
def process_image(image_path):
    # Some processing logic that returns the processed image bytes
    processed_image = ...
    return processed_image

# List of image paths
image_paths = ['image1.jpg', 'image2.jpg', 'image3.jpg']

# Process images in parallel
results = ray.get([process_image.remote(path) for path in image_paths])

# Save processed images back to S3. An s3:// path cannot be written with a plain save() call,
# so boto3 uploads the bytes instead.
s3 = boto3.client('s3')
output_bucket = 'output-bucket'
for i, processed_image in enumerate(results):
    output_key = f'processed/image_{i}.jpg'
    s3.put_object(Bucket=output_bucket, Key=output_key, Body=processed_image)
AWS Glue Jupyter Notebook Jobs provide an interactive and collaborative environment for data exploration and analysis. Here’s an example of how to use Jupyter Notebook for data exploration:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Load and explore data
data = pd.read_csv('data.csv')
data.head()

# Visualize data
plt.scatter(data['X'], data['Y'])
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.show()
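When the notebook runs as a Glue interactive session rather than a local Jupyter kernel, the session is typically sized and configured with cell magics before any code executes. A brief sketch, assuming the standard Glue interactive-session magics; the values shown are arbitrary placeholders.

%glue_version 4.0
%worker_type G.1X
%number_of_workers 2
%idle_timeout 30

# Code cells such as the pandas example above then execute on the remote Glue session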
Selecting the right job type depends on the specific requirements of your ETL tasks:
- Python Shell Jobs: lightweight scripting and small-to-medium data manipulation with libraries such as pandas.
- Spark Jobs: large-scale, distributed transformations over big datasets.
- Ray Jobs: compute-intensive Python workloads that benefit from fine-grained parallelism.
- Jupyter Notebook Jobs: interactive exploration, prototyping, and collaboration.
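Under the hood, the job type is simply the Command name on the Glue job definition ('pythonshell', 'glueetl' for Spark, or 'glueray' for Ray). As a rough sketch (the job name, IAM role, and script location below are placeholder assumptions), creating and launching a Python Shell job with boto3 might look like this:

import boto3

glue = boto3.client('glue')

# Register the script as a Python Shell job; use 'glueetl' or 'glueray' in Command for the other types
glue.create_job(
    Name='csv-transform-shell-job',                        # placeholder job name
    Role='arn:aws:iam::123456789012:role/GlueJobRole',     # placeholder IAM role
    Command={
        'Name': 'pythonshell',
        'ScriptLocation': 's3://scripts-bucket/transform.py',
        'PythonVersion': '3.9',
    },
    MaxCapacity=1.0,
)

# Start a run once the job exists
glue.start_job_run(JobName='csv-transform-shell-job')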
Regardless of the job type you choose, it's essential to follow AWS Glue best practices.
Follow me for the next parts of this series, and subscribe via email. If you enjoyed this blog, show your appreciation by clapping; your applause motivates me to complete this series.