Tale of a small ticket

In this post, I’ll complain about a lot of stuff, so here is your warning.

I had a ticket, which was innocent enough: Put the CDN logs into ELK stack.

Which makes sense, right? You want to be able to track CDN-related issues while troubleshooting. I picked the sub-ticket for one CDN provider we use and started digging. How hard can it be? Enable syslog output from the provider and allow it through to our ELK gateway. Worst case, if they don’t have such a feature, set up a VM, collect the logs there and push them.

I won’t name the provider, but a quick glance at their documentation showed they don’t support pushing logs at all. The way to retrieve logs is..

Yeah, not SFTP or anything, just good old plain fukkin FTP. For a moment I checked the date to make sure I was still in 2020.

But anyway, no stopping.

I wrote a Python script to grab the logs using the ftplib module. Thankfully one of my colleagues had already prepared a VM for it. Deployed it there, and voila! Logs are flowing in.
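For the curious, a fetch script of that kind can be sketched roughly like this. It is not the original script: the function name, the host and the directories are made up for illustration, and it assumes the provider drops .gz files into a single remote directory.

```python
from ftplib import FTP
from pathlib import Path


def fetch_new_logs(ftp, remote_dir, local_dir):
    """Download every .gz file in remote_dir that is not already in local_dir."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    ftp.cwd(remote_dir)
    downloaded = []
    for name in ftp.nlst():
        target = local_dir / name
        if not name.endswith(".gz") or target.exists():
            continue
        # Write to a temporary name first, so a half-finished download is
        # never mistaken for a complete file on the next run.
        partial = target.with_name(target.name + ".part")
        with open(partial, "wb") as fh:
            ftp.retrbinary(f"RETR {name}", fh.write)
        partial.rename(target)
        downloaded.append(target)
    return downloaded


# Usage (hypothetical host, credentials and paths):
#
#   with FTP("ftp.cdn.example", timeout=30) as ftp:
#       ftp.login("user", "password")
#       fetch_new_logs(ftp, "/logs", "/var/spool/cdn-logs/incoming")
```

Skipping files that already exist locally keeps the script safe to run from cron over and over.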

A bunch of .gz files, which extract to some weird CSV-ish logs in a custom format holding the info we need. Alright, set up a filebeat daemon to grab the logs from the extraction directory and ship them. I just gave it our logstash as the output and started sending the logs.
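The extraction step is about as boring as it sounds. A minimal sketch, assuming filebeat watches a glob in the extraction directory that does not match `.tmp` files (the function name and the temp-file trick are mine, not necessarily what ran on the VM):

```python
import gzip
import shutil
from pathlib import Path


def extract_gz(archive, out_dir):
    """Decompress one .gz file into out_dir and return the extracted path."""
    archive = Path(archive)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    final = out_dir / archive.stem               # "x.log.gz" -> "x.log"
    partial = out_dir / (archive.stem + ".tmp")
    # Decompress under a temporary name, then rename, so the shipper only
    # ever sees complete files under the final name.
    with gzip.open(archive, "rb") as src, open(partial, "wb") as dst:
        shutil.copyfileobj(src, dst)
    partial.rename(final)
    return final
```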

LOL of course not. These logs had a custom format, remember? So we need to visit our old friend: GROK.

Writing GROK patterns is always a breeze. You don’t even notice the time flying by. Because all the parsers are a pile of shit that only report “No match” and nothing else. So you need to start from the first element, add the second, add 2 more, wait, didn’t work, remove one, ah it’s not %{MONTH}, it’s %{MONTHNUM}, of course!!

Also had some small brain-seizures while parsing the log since it was TAB separated (and you need to use an actual TAB character in the pattern too). But eventually I managed to match it.
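That add-one-field-at-a-time dance can at least be automated. The sketch below is plain Python regex, not GROK, and the field list is entirely hypothetical since I'm not showing the real format; it only illustrates the idea of growing the pattern field by field and reporting where the match breaks, which is more than “No match” ever tells you.

```python
import re

# Hypothetical field layout; the real CDN log format is different.
FIELDS = [
    ("date", r"\d{4}-\d{2}-\d{2}"),
    ("time", r"\d{2}:\d{2}:\d{2}"),
    ("client_ip", r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("status", r"\d{3}"),
    ("bytes", r"\d+"),
]


def first_failing_field(line, fields=FIELDS, sep="\t"):
    """Return the name of the first field that breaks the match, or None."""
    for n in range(1, len(fields) + 1):
        # Build the pattern from the first n fields, joined by a literal TAB.
        pattern = sep.join(f"(?P<{name}>{rx})" for name, rx in fields[:n])
        # Each partial pattern must end on a separator or at the end of the
        # line; the complete pattern must consume the whole line.
        pattern += "$" if n == len(fields) else f"(?:{sep}|$)"
        if re.match(pattern, line) is None:
            return fields[n - 1][0]
    return None
```

One caveat: a wrong separator gets blamed on the field in front of it, so a line with spaces instead of TABs fails on the very first field.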

Anyway, now logs are flowing. But filebeat puts the whole log line inside the message field, so the timestamps are wrong: they don’t show the time of the actual log entry, but the time it was sent from the VM. Need to convert and parse more, I guess. After this, the parade starts: add converts and mutates all around and do hardcore trial and error with the stack until it works.
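In Logstash this is a job for the date filter, fed with whatever field the pattern extracted. The idea itself fits in a few lines of Python, shown here only as an illustration: it assumes the first two TAB-separated fields are a date and a time, and that the provider logs in UTC, neither of which is necessarily true for the real format.

```python
from datetime import datetime, timezone


def event_timestamp(line, sep="\t", fmt="%Y-%m-%d %H:%M:%S"):
    """Build an ISO 8601 UTC timestamp from the first two fields of a log line."""
    date_part, time_part = line.split(sep)[:2]
    parsed = datetime.strptime(f"{date_part} {time_part}", fmt)
    # Assumption: the log times are already in UTC, so only attach the zone.
    return parsed.replace(tzinfo=timezone.utc).isoformat()
```

The point is that the event time comes from the log line itself, not from the moment the shipper happened to read it.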

Ah, did I tell you it was ELK? I guess I did.

Have you ever restarted a big-ish logstash cluster?

It just takes 20 minutes if you want to do it gracefully (god, do I miss Graylog).
So I isolated one instance, dropped it from the load balancer and started torturing it with random puppet changes.

Result? It didn’t work 🎉. As a bonus, logstash doesn’t even start listening on the ports it’s supposed to anymore, and shouts some cryptic warning logs! No errors or criticals; just flowing, looping, crying warnings..

Asked my friend who is responsible for the stack to help with further debugging (since ELK forums are so helpful). I’ll continue tomorrow..

So, how was your Monday?




2020-09-21 13:24 +0000