I had to search through logs from our backend on my laptop. We have about 40 web services that write logs in JSON format. These logs, together with logs from sidecars, are streamed to Sematext (a managed ELK service). After the retention period, logs get chunked into LZ4-compressed text files and backed up to an AWS S3 bucket.
The problem? There are a lot of logs. Every such archive is ~1 MB in size and contains about 5,000 JSON records. One month of logs is about 100k files and about 100 GB.
Unfortunately, there was no established process on the project. A couple of teammates had dug through these logs a year ago, and all they did was: download, unpack, grep.
In my case grepping wasn't the best option, as I needed not only to find a specific request (log record), but also to review its distributed trace and context.
I had two options in mind: a local ELK and `lnav`. I tried to google how to quickly run ELK in Docker locally (I had no prior experience setting ELK up), but didn't find good tutorials. I believe they exist, but I was also thinking about possible caveats.
Our logs are not unified: some services populate different field names like `status` vs. `statusCode`, and the values may also be of different types (number vs. string). Writing Elasticsearch preprocessors in Painless is not so painless, and not everything can be done there. I know this because I had set up Sematext Pipelines before.
So, I downloaded 112 GB of logs for one month and tried to tame `lnav` instead. Unfortunately, `lnav` couldn't open my `lz4` files because of some `unexpected end of stream` issues. The `lz4` command-line utility produces such errors as well, but still decompresses fine. However, uncompressed files are 8 MB instead of 1 MB (112 GB --> 896 GB?). I tried `lnav` on an uncompressed directory with just a couple of files and it was overwhelming. I was reading through the `lnav` docs when I saw that you can also define a schema and load logs into an internal SQLite database.
This is also where I spotted a shiny new bicycle 🚲 But at least it promised to be more fun than the boring docs...
I'm a JavaScript dev: loading and parsing files is easy. I could also easily transform all logs to a unified format and throw away fields I don't need, like Kubernetes labels. And our team is used to Postgres. If I load all the data into a local DB, it could be very convenient. I could also sort everything properly, because logs in the files are ordered by streaming time from multiple sources, not by actual timestamps. So many benefits!
I settled on a very simple streamed pipeline:
`glob --> lz4 --> createReadStream --> readline --> transform --> filter --> insert`
Creating CLIs with yargs is a piece of cake. For LZ4 there is a streaming implementation on npm, which is a single function and also happened to decompress my files without any errors. Everything else is out-of-the-box Node.js.
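Roughly, the whole pipeline fits into one function. Here is a minimal sketch, assuming `fast-glob` and the `lz4` package's streaming decoder; the `transform`/`filter`/`insert` callbacks stand in for the real ones in `log-loader.ts`:

```ts
import { createReadStream } from 'fs';
import { createInterface } from 'readline';
import glob from 'fast-glob'; // any glob library would do
import lz4 from 'lz4';        // assuming the lz4 package's streaming decoder

// glob --> lz4 --> createReadStream --> readline --> transform --> filter --> insert
async function load<T>(
  pattern: string,
  transform: (json: any, file: string, line: number) => T,
  filter: (record: T) => boolean,
  insert: (record: T) => Promise<void>,
): Promise<void> {
  for (const file of await glob(pattern)) {
    const lines = createInterface({
      // decompress on the fly, then read the JSON log line by line
      input: createReadStream(file).pipe(lz4.createDecoderStream()),
      crlfDelay: Infinity,
    });
    let lineNumber = 0;
    for await (const line of lines) {
      lineNumber += 1;
      const record = transform(JSON.parse(line), file, lineNumber);
      if (filter(record)) await insert(record);
    }
  }
}
```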
Transformation itself is just creating a new object based on the parsed JSON. I know all 12 fields I will need. Creating a DDL for the Postgres table didn't take much time, as its structure is the same as the TS interface I had already defined.
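I won't list the actual 12 fields, so here is an illustrative subset, including the `status` vs. `statusCode` normalization mentioned earlier (all field names are guesses):

```ts
// Illustrative shape only -- the real interface has 12 fields
interface LogRecord {
  file_name: string;   // part of the natural PK (see below)
  line_number: number; // part of the natural PK (see below)
  time: Date;
  service?: string;
  level?: string;
  method?: string;
  url?: string;
  status?: number;
  trace_id?: string;
  message?: string;
}

function transform(raw: any, fileName: string, lineNumber: number): LogRecord {
  // Some services log `status`, others `statusCode`, sometimes as a string
  const rawStatus = raw.status ?? raw.statusCode;
  return {
    file_name: fileName,
    line_number: lineNumber,
    time: new Date(raw['@timestamp'] ?? raw.time), // timestamp field name is a guess
    service: raw.service,
    level: raw.level != null ? String(raw.level) : undefined,
    method: raw.method,
    url: raw.url ?? raw.path,
    status: rawStatus != null ? Number(rawStatus) : undefined,
    trace_id: raw.traceId ?? raw.trace_id,
    message: raw.message ?? raw.msg,
    // everything else (Kubernetes labels, etc.) is simply dropped
  };
}
```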
Interestingly, every log has a natural primary key, because logs sit in files, and the files are static: backups only add up, no files are modified. We can use `file_name` + `line_number` to identify any log record. Why do we need this? For an incremental workflow. Inserting into a DB is slow and is the main bottleneck. Of course, we could optimize it with multi-value inserts, by disabling triggers and constraints, etc. But I only disabled WAL for the table (as I won't modify rows and can always restore the data) and decided to load data gradually:
- Filter incoming records in JS to find the place in time where your issue(s) happened. For example, load only DELETE requests.
- Use a SQL GUI to view/filter this data and sort by time.
- When the time spans of interest are identified, tweak the glob expression to load additional files for those spans, without filtering or with a different filter (see the example after this list).
- Repeat.
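For example (the dates and filters here are made up), a first pass might load only DELETE requests for the whole month, and a second pass might reload two specific days without any filter:

```
$ yarn ts-node ./log-loader.ts './aws/2021/02/**/*.lz4'        # filter in the TS file: only DELETE requests
$ yarn ts-node ./log-loader.ts './aws/2021/02/2[34]/**/*.lz4'  # no filter, just two days of interest
```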
The filter is a function in the TS file, which gets a transformed record and returns a boolean. If it's `true`, the tool executes an `INSERT` command with Postgres' `ON CONFLICT DO NOTHING` extension. So, even if you try to load the same file multiple times, you won't get duplicate records, thanks to the natural PK.
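Put together, the table and the insert could look roughly like this. It's a sketch assuming the `pg` client; an UNLOGGED table is one way to skip the WAL, and the columns mirror the illustrative interface above:

```ts
import { Pool } from 'pg';
import type { LogRecord } from './transform'; // the interface sketched above; path is hypothetical

const pool = new Pool(); // connection settings come from the usual PG* env vars

// UNLOGGED skips the WAL; file_name + line_number is the natural primary key.
// Column list is illustrative -- the real table mirrors the 12-field interface.
export async function ensureTable(): Promise<void> {
  await pool.query(`
    CREATE UNLOGGED TABLE IF NOT EXISTS logs (
      file_name   text        NOT NULL,
      line_number integer     NOT NULL,
      time        timestamptz,
      service     text,
      level       text,
      method      text,
      url         text,
      status      integer,
      trace_id    text,
      message     text,
      PRIMARY KEY (file_name, line_number)
    )`);
}

// Reloading the same file is a no-op thanks to ON CONFLICT DO NOTHING
export async function insert(r: LogRecord): Promise<void> {
  await pool.query(
    `INSERT INTO logs (file_name, line_number, time, service, level, method, url, status, trace_id, message)
     VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
     ON CONFLICT DO NOTHING`,
    [r.file_name, r.line_number, r.time, r.service, r.level, r.method, r.url, r.status, r.trace_id, r.message],
  );
}
```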
It looks like this:

```
$ yarn ts-node ./log-loader.ts './aws/2021/02/**/*.lz4'
...
Loading file [159432/159438]: ./aws/2021/02/24/21/20210224T214041.318Z_5950e47a96ee741360bedbf6153ab069.json.lz4
...
Done, 159,438 files loaded, 1,095,380,480 lines parsed, 241,584 records added
✨ Done in 7572.51s
```
Time varies greatly with the number of inserts. It's not exactly instant, but it's fast enough for my use case. After a couple of iterations I had proper distributed traces for the few requests I was interested in sitting in the DB, and I found out who called what and why. The time saved was not in grepping, but in having a nice GUI to look at the final logs and understand the call stacks.
A couple of months of logs gave me 397,785 final rows. SQL queries are quite fast even without indexes: a search by URL prefix took only 2.278 seconds.
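For the record, that "search by URL prefix" is nothing fancier than a LIKE query; the prefix and column names here are made up, matching the sketches above:

```sql
SELECT time, service, method, url, status, trace_id
FROM logs
WHERE url LIKE '/api/v1/orders/%'  -- hypothetical prefix
ORDER BY time;
```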
Actually, loading local files is not enough. My laptop SSD is only 512 GB, and it can fit only about 3 months of logs. So, for some cases I had to load data directly from AWS S3. It's slower for multiple runs, but I couldn't use an external drive or another computer.
Adding S3 support was very easy too. I just extracted the `fs`/`glob` code into an `fsAdapter.ts` and added an `awsAdapter.ts`. If you specify the `--aws` option, the positional CLI argument is treated as bucket + prefix and the AWS adapter is used. From the AWS client you only need `listObjects` and `getObject`, after which you get the same Readable stream as you had with the file system.
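A rough sketch of what such an adapter might contain, assuming the v2 `aws-sdk` client; the function names and pagination handling are mine, not the actual `awsAdapter.ts`:

```ts
import S3 from 'aws-sdk/clients/s3';
import { Readable } from 'stream';

const s3 = new S3();

// List every key under a prefix, following pagination
export async function listKeys(bucket: string, prefix: string): Promise<string[]> {
  const keys: string[] = [];
  let marker: string | undefined;
  do {
    const page = await s3.listObjects({ Bucket: bucket, Prefix: prefix, Marker: marker }).promise();
    for (const obj of page.Contents ?? []) if (obj.Key) keys.push(obj.Key);
    // Without a Delimiter, the next page starts after the last key of this one
    marker = page.IsTruncated ? keys[keys.length - 1] : undefined;
  } while (marker);
  return keys;
}

// The same kind of Readable stream as fs.createReadStream gives us
export function openObject(bucket: string, key: string): Readable {
  return s3.getObject({ Bucket: bucket, Key: key }).createReadStream();
}
```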
Next time I might try to set up a local ELK to get an experience similar to the not-yet-backed-up logs (a nice UI). As for the current project... I'm curious how many years the `log-loader` will stay in use with my colleagues, given that the manual way lasted for almost 6 years.