
Engineering Lessons Learned from LLM Fine Tuning

Well, I finally got around to it. What's that, you say? Fine-tuning an LLM, that's what. I mean, all the cool kids are talking about it and carrying on like it's the next big thing. What can I say … I'm jaded. I've been working on ML systems for a good few years now, and I've seen the best, and the worst.

Most of Machine Learning is Data Engineering. That’s the truth. Is the LLM gold rush any different?

Lessons Learned from fine-tuning OpenLLaMA.

I just wanted to give a few short insights on what it's like to fine-tune an LLM (in my case, OpenLLaMA) from a Data Engineering perspective.

First, you can go to GitHub and check out the full repo and code. It contains a nice overview of what it's like to fine-tune an LLM.

So here are some quick and dirty thoughts … from a Data Engineering perspective.

  • You will probably be working on a Linux instance to do the actual work.
  • You will probably heavily use Docker because of the above.
  • There are lots of Python tools to pip install and manage.
  • Playing with LLMs requires a LOT of memory AND disk in real life.
  • Eventually, you will need GPUs (check out vast.ai for cheap by-the-hour rentals).
  • Because of remote GPU machines, Docker, etc., you need to be comfortable with bash and the ssh command.
  • Data cleaning and prep is going to be the hardest part and the most code.
  • Choose your LLM model up front, because it will affect everything downstream.
  • Choose your preferred libraries up front for training and inference (e.g., Hugging Face).
  • Lots of scripts to deploy your code and data to cloud storage (e.g., S3) will make your life easier when deploying to remote GPU machines.
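On that last bullet, here's a minimal sketch of what one of those deploy scripts can look like: a thin wrapper that builds an `aws s3 sync` command for pushing local code and data up to a bucket. The bucket and prefix names here are made up, and actually running the command assumes you have the AWS CLI installed and credentials configured.

```python
import subprocess

def build_sync_command(local_dir, bucket, prefix, dry_run=False):
    """Build the `aws s3 sync` command for pushing local artifacts to S3.

    bucket/prefix are placeholders -- swap in your own. Returning the
    command (instead of running it) makes the script easy to test.
    """
    cmd = ["aws", "s3", "sync", local_dir, f"s3://{bucket}/{prefix}"]
    if dry_run:
        cmd.append("--dryrun")  # preview what would be copied
    return cmd

# Hypothetical usage -- on the remote GPU box you'd pull everything
# back down the same way, just with source and destination swapped.
cmd = build_sync_command("./data", "my-llm-bucket", "finetune/data", dry_run=True)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run it
```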

I highly recommend using vast.ai to rent cheap GPUs by the hour. Most of your code, and most of your headache, will be gathering the data to train on, because it's unstructured, and then getting it into a semi-structured format. It's a pain and it takes time. No shortcuts.
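To make that concrete, here's a toy sketch of the kind of cleaning code you end up writing: collapse messy whitespace in raw text blobs and pull out prompt/completion pairs as JSONL records. The "Q: … A: …" layout is purely an assumption for illustration; your real source data will have its own (uglier) shape.

```python
import json
import re

def to_jsonl_records(raw_docs):
    """Turn messy raw text blobs into prompt/completion dicts.

    Assumes each usable blob follows a simple 'Q: ... A: ...' layout --
    an illustrative stand-in for whatever structure your data hides.
    """
    records = []
    for doc in raw_docs:
        text = re.sub(r"\s+", " ", doc).strip()  # collapse all whitespace runs
        m = re.match(r"Q:\s*(.+?)\s*A:\s*(.+)", text)
        if not m:
            continue  # drop anything that doesn't fit the expected shape
        records.append({"prompt": m.group(1), "completion": m.group(2)})
    return records

raw = ["Q: What is fine-tuning?\n  A: Adapting a pretrained model   to new data."]
for rec in to_jsonl_records(raw):
    print(json.dumps(rec))  # one JSON object per line, i.e. JSONL
```

Most of the real work is in the regexes and the edge cases, not in the model code.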

If you want more in-depth info …

You can follow along here https://dataengineeringcentral.substack.com/p/llms-part-2-fine-tuning-openllama

If you are not down with LLMs yet … see Part 1, which gives a high-level overview of LLMs (local inference on a laptop). https://dataengineeringcentral.substack.com/p/demystifying-the-large-language-models