
boto3 or aws cli for s3 … Python vs Bash and other thoughts.

For any Data Engineer working on aws for any length of time, there is one task that always seems to come up and never go away. Manipulating files in an s3 bucket is something I’ve had to do for years, and it just never goes away. It’s always something … listing files, moving files, copying files, checking for files, getting the last modified file, checking file sizes, downloading files … it pretty much never ends.

Luckily aws provides a few tools to make these tasks easy: the handy cli for command-line work, and the trusty boto3 Python package. I want to give an introduction to the commands Data Engineers have to run with both the aws cli and boto3 to perform various common tasks. We will then compare and contrast which tool to use in our pipelines and the pros and cons of each.

The two aws options for s3 … boto3 and aws cli

I’ve spent more time than I care to think about messing with files on s3. It’s one of those tasks that you get numb to; the list of reasons to mess with files in s3 is endless, and there are always new reasons that pop up with every new project. For those who are newer to working with files in s3, I want to go over some of the common ways to do those tasks.

First, let’s start with the aws cli command-line tool.

aws cli

AWS provides a wonderful command-line tool, and it works perfectly for various tasks on s3. The installation is very easy and requires very little setup. Once installed and configured (via aws configure), you will get a .aws folder in your home directory. There will be two files of note …

  • config
  • credentials

They are self-explanatory: config allows you to set up different profiles, say for dev and prod if you have them, and credentials is where your keys are stored. But enough of that, what are the common aws cli s3 commands you will find yourself running?

  • listing files – aws s3 ls s3://my_bucket/some_folder
  • copying files – aws s3 cp some_local_folder s3://my_bucket/some_folder
  • sync folders – aws s3 sync s3://my_bucket/my_folder s3://other_bucket/another_folder
  • of note is the --recursive option, for when we are working with multiple sub-directories.
  • we can use options to control which files we move, sync, or otherwise act on, e.g. --delete --exclude "*some_files.csv*" --include "*.txt" (see the example after this list)
    • delete
    • exclude
    • include
    • recursive
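
For example, to pull down just the txt files from a folder and all of its sub-folders, excludes and includes can be chained like this … the bucket and paths are made up:

aws s3 cp s3://my_bucket/some_folder some_local_folder --recursive --exclude "*" --include "*.txt"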

These commands cover about 90% of what a data engineer will probably do day-to-day when working with files on s3. You will probably come to know these commands and options by heart if you haven’t already. (BTW, you can install the aws cli via pip, a pretty nice feature.)

boto3

If I’m not using bash files to automate aws cli commands to shove around s3 files for some CI/CD thingy-ma-bob, I find myself often using the Python package boto3. It’s another great way, although sometimes the more annoying one, to code s3 actions. It isn’t used that often in “big data” because I wouldn’t call it that performant, but you can do just about anything you can think of to s3 with boto3.

The list of actions is endless, but mainly I find myself using the following features of boto3 to mess with s3 files and buckets.

  • list and paginate s3 bucket contents
  • find last modified file
  • filter bucket contents
  • download a file(s) or folder(s)
  • copy a file or folder(s)
  • upload a file(s)

I typically use a boto3 client to do most things on s3. Of course, your aws keys need to be available in the environment variables; never keep them in the code.

import boto3

s3_client = boto3.client('s3')
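
If you set up multiple profiles in the config file mentioned earlier, you can also point boto3 at one of them explicitly instead of relying on environment variables. A minimal sketch, assuming a dev profile exists (the profile name is just an example):

import boto3

# assumes a 'dev' profile in your .aws config/credentials files
dev_session = boto3.Session(profile_name='dev')
s3_client = dev_session.client('s3')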

Many times I will use boto3 to paginate through all the records in an s3 bucket (paging through contents is common practice, similar to the concept when using REST APIs).

def get_pages(client: object, bucket: str) -> list:
    # each page from the paginator holds up to 1,000 objects
    paginator = client.get_paginator('list_objects')
    page_iterator = paginator.paginate(Bucket=bucket)
    # skip pages with no 'Contents' key (an empty bucket returns none)
    pages = [page['Contents'] for page in page_iterator if 'Contents' in page]
    return pages

Or maybe I want to look through all the pages to find files matching a prefix, and get that last modified file.

def get_latest_file(pages: list, prefix: str) -> str:
    if pages:
        all_files = []
        for page in pages:
            all_files.extend(page)
        # keep only keys matching our prefix, then sort newest first
        files = [(file['Key'], file['LastModified']) for file in all_files if prefix in file['Key']]
        files.sort(reverse=True, key=lambda x: x[1])
        recent_file = files[0][0]
        print(recent_file)
        return recent_file
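
Chained together with the client from above, it looks something like this … the bucket name and prefix are just placeholders:

pages = get_pages(s3_client, 'my-wonderful-bucket')
latest_key = get_latest_file(pages, 'some/prefix/')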

I personally find boto3 a little verbose to use; some things you have to do manually seem like they should come out of the box. I mean, it is Python after all. But don’t get me wrong, some things like copying or deleting files are pretty straightforward.

def copy_s3_file(client: object, key: str, new_key: str) -> None:
    copy_source = {
        'Bucket': 'my-wonderful-bucket',
        'Key': key
    }
    client.copy(copy_source, 'some-other-wonderful-bucket', new_key)

def delete_object(client: object, key: str) -> None:
    client.delete_object(Bucket='my-wonderful-bucket', Key=key)
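
Downloading and uploading, which I mentioned in the list above, are just as straightforward. A minimal sketch … the bucket name and paths are made up:

def download_s3_file(client: object, key: str, local_path: str) -> None:
    # pull a single object down to a local file
    client.download_file('my-wonderful-bucket', key, local_path)

def upload_s3_file(client: object, local_path: str, key: str) -> None:
    # push a local file up to the bucket
    client.upload_file(local_path, 'my-wonderful-bucket', key)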

Other Thoughts.

I use a mix of both bash / aws cli and boto3 in most of my production code writing. They both have their uses and places in this world.

  • aws cli for quick development work.
  • aws cli + bash for most CI/CD and other infrastructure and deployment jobs.
  • boto3 for production code.
  • aws cli for large amounts of work needing to be done in s3.
  • boto3 for intricate s3 work.

I’ve seen a lot of people use Python subprocess calls to kick off the aws cli in Production code. I don’t like this approach, and it rarely works out well in my opinion … it’s never unit tested and never catches or handles errors or exceptions very well. On the other hand, boto3 is great for intricate aws s3 work; you can unit test the crud out of it and handle all sorts of exceptions and problems.
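
As a taste of what that unit testing can look like, here is a minimal sketch using botocore’s built-in Stubber (moto is another popular option). It assumes the delete_object helper from above lives in a module called s3_helpers … the module name, keys, and bucket are all made up:

import boto3
from botocore.stub import Stubber

from s3_helpers import delete_object  # the helper defined earlier in this post

def test_delete_object():
    # fake region/keys are fine, Stubber intercepts the call before it leaves the machine
    client = boto3.client('s3', region_name='us-east-1',
                          aws_access_key_id='testing', aws_secret_access_key='testing')
    stubber = Stubber(client)
    stubber.add_response('delete_object', {}, {'Bucket': 'my-wonderful-bucket', 'Key': 'some/key.csv'})
    with stubber:
        delete_object(client, 'some/key.csv')
    stubber.assert_no_pending_responses()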

You should learn to use both.

Different situations call for different solutions … if you need to resort to boto3 when you’re doing quick development or exploratory work … you’re probably going to waste a lot of time. You need to learn the aws cli for quick and dirty work. On the other hand, learning all the nuances of boto3 for s3 file and folder manipulation can get a little old, but if you write your functions correctly you can pretty much re-use them over and over again.
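
That re-use point is really what makes boto3 worth the verbosity. A minimal sketch of the kind of generic helper I mean, with the client and bucket passed in instead of hard-coded (the names are just examples):

from typing import Iterator

def iter_keys(client: object, bucket: str, prefix: str = '') -> Iterator[str]:
    # generic, re-usable helper: yield every key under a prefix,
    # letting the paginator handle buckets of any size
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            yield obj['Key']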

I’m curious to know what you use most of the time with your s3 work, cli or boto3, or something else? Drop a comment and let me know.
