Did you ever have to analyze large log files before?



Well, I recently had to analyze a large log-based dataset and decided to try out the ELK stack.

Introduction:
Please see this intro if you are not familiar with ELK; the rest of this post assumes you know the components, i.e. Elasticsearch, Logstash & Kibana. It is a popular server-side stack to index, search & graph a large collection of logs or similar structured/unstructured data. This post mainly talks about my experience setting up this well-known stack & the unexpected things I learnt during the process.

Goal:
"How do I enable rich filtering & analysis on the large set of product logs beyond some simple scripting?"
I thought I just had to upload a few log files to the server & then have some awesome graphs appear almost magically out of the box!

Servers:
"How do I get a server up and running?"

There is a free trial for the cloud-based Elastic stack. I signed up and got a 14-day free trial with 4 instances running on Google Cloud. The next day, I logged in to find that the server was no longer active. I gave up after restarting the deployment a couple of times. - "It failed the 2 minute or 1 day test". :)

Then, I thought I would try the Docker route to avoid manually downloading the servers. So, I downloaded the Docker files from GitHub & had the servers up and running in a few minutes. I uploaded a few logs and soon the Elasticsearch server went into read-only mode due to lack of disk space. I didn't have the patience to edit & mount config files since I thought it would be easier to just install everything manually. - "It was good for prototyping, but I didn't expect it to take the load with millions of rows anyway".

Next, I installed the stack on my Windows dev box by downloading each product from the website. When I started logstash, I realized that it worked only with Java 8. I had Java 11 installed & the server died within a few seconds. I wasn't expecting this. Anyway, I re-installed the older Java 8 and was able to get the stack up and running. It didn't last long since I ran out of disk space on my dev box. I tried configuring the high/low disk watermarks, but that didn't help, and the local logstash seemed confused about the backlog since I had switched servers. - "It looks like the fix for Java 11 support was merged a few days ago, but this initially set off a yellow flag for me"

Finally, I got a new Ubuntu 18.04 VM with large memory & disk, and installed the ELK stack on it from the Elastic repository using instructions from DigitalOcean. I also configured the high/low disk watermarks so that it would keep going until the machine dies! The remote server required an nginx proxy to be set up, but the guide didn't cover exposing the Elasticsearch server, so I had to open port 9201 for it. - "I could have probably gone with a cluster, but not for a two day exercise to prove feasibility"
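
For reference, the disk watermarks and the read-only block are ordinary Elasticsearch settings that can be changed through the cluster and index settings APIs. Here is a minimal Python sketch that only builds the request bodies; the threshold percentages and the function names are illustrative assumptions, not values taken from this setup.

```python
import json


def watermark_settings(low="90%", high="95%", flood_stage="97%"):
    """Build a body for PUT /_cluster/settings that adjusts the disk watermarks.

    The default percentages here are assumptions for illustration only.
    """
    return {
        "persistent": {
            "cluster.routing.allocation.disk.watermark.low": low,
            "cluster.routing.allocation.disk.watermark.high": high,
            "cluster.routing.allocation.disk.watermark.flood_stage": flood_stage,
        }
    }


def clear_read_only_block():
    """Build a body for PUT /<index>/_settings that lifts the read-only block
    placed on an index once the flood-stage watermark has been exceeded."""
    return {"index.blocks.read_only_allow_delete": None}


if __name__ == "__main__":
    print(json.dumps(watermark_settings(), indent=2))
    print(json.dumps(clear_read_only_block()))
```

Raising the watermarks only buys time; if the disk really is full, the block will come back.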

Pre-processing:
"How do I get the right data to the server?"

Now, running the ELK stack is easy compared to what you have to do to make the log files readable by the server. My log files were in a custom format, so I couldn't use any of the built-in logstash plugins as is.

The ELK stack is very flexible & that can be confusing for a newbie. I had multiple options to send the data. One option was to use filebeat to send the data to logstash, which eventually forwards it to the Elasticsearch server. This was overkill since I had all the logs on one machine. - "Filebeat is used to ship logs from multiple servers, but not required for my use case where the logs were all on one machine"

Another option was to format the log to jsonl myself & post directly to the Elasticsearch server. Initially, I thought this option was simple, and it did work for simple use cases. But the server quickly choked on large datasets since there was no throttling etc. - "Don't attempt to use the elastic server directly without logstash for large datasets"
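
To illustrate what the missing throttling would involve, here is a minimal Python sketch that splits records into fixed-size batches and formats each one as a newline-delimited body for the Elasticsearch `_bulk` endpoint. The index name `product-logs`, the batch size and the function name are hypothetical, and sending the batches (with a pause or retry between them) is left out. It assumes a version that does not require a document type in the action line; older releases may expect one.

```python
import json
from itertools import islice


def bulk_batches(records, index, batch_size=500):
    """Yield newline-delimited _bulk request bodies, batch_size documents at a time."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        lines = []
        for doc in batch:
            # Each document is preceded by an action line naming the target index.
            lines.append(json.dumps({"index": {"_index": index}}))
            lines.append(json.dumps(doc))
        # The bulk API requires the body to end with a newline.
        yield "\n".join(lines) + "\n"
```

Each yielded string would be one POST to `/_bulk`; this is roughly the bookkeeping that logstash does for you.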

The next option I considered was to copy the Windows csv files to the Linux server and have logstash read them. It turned out that the files processed on Windows were not read properly by logstash on the Linux server. - "This could work given time to address encoding or CR/LF line-ending issues"
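
A likely culprit is Windows line endings or a byte-order mark at the start of the file. Here is a minimal Python sketch for normalizing a file's bytes before copying it over; it assumes the files are UTF-8 (with or without a BOM), which was not verified for these logs, and the function name is made up.

```python
import codecs


def normalize_windows_text(data: bytes) -> bytes:
    """Strip a UTF-8 byte-order mark and convert CRLF/CR line endings to LF."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Replace CRLF first so a lone CR pass does not double the newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```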

Finally, I had logstash run locally on my Windows machine & forward the pre-processed lines to the Elasticsearch server running on the remote machine. This also enabled me to use the csv filter, which seemed easier to use than the grok filter. I tried building grok patterns for the log file with grokconstructor and the grok debugger, but they didn't make it any easier for complex custom logs. - "Re-use grok patterns for existing logs or learn grok syntax on your own to be an ELK expert"

So, I created a quick awk script that extracted multiple fields from the log lines, including formatting the timestamp, and then passed the output csv to the logstash csv filter. Logstash forwarded it to the Elasticsearch server, and I verified that the logs were there from the Kibana Discover page. (Initially, I didn't see anything, but I eventually figured out that the time range had to be adjusted. :) ) You also have to create an index pattern before you can see the data. - "For the first timer, selecting the time range and index patterns are guided through the UI"
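
The awk script itself is not reproduced here, but the idea can be sketched in a few lines of Python. The input layout below (`dd-mm-yyyy HH:MM:SS.mmm LEVEL latency message`) is entirely hypothetical, as are the field names; the only detail taken from this post is the output timestamp layout, which follows the `yyyy-MM-dd HH:mm:ss:SSS` pattern used by the date filter in the configuration at the end.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical log line layout: "<dd-mm-yyyy HH:MM:SS.mmm> <LEVEL> <latency_ms> <message>"
LINE_RE = re.compile(r"^(\S+ \S+) (\w+) (\d+) (.*)$")


def log_line_to_csv(line):
    """Convert one log line to a CSV row, or return None if it does not match."""
    m = LINE_RE.match(line.rstrip("\r\n"))
    if m is None:
        return None
    raw_ts, level, latency, message = m.groups()
    ts = datetime.strptime(raw_ts, "%d-%m-%Y %H:%M:%S.%f")
    # Reformat to yyyy-MM-dd HH:mm:ss:SSS so the logstash date filter can parse it.
    stamp = ts.strftime("%Y-%m-%d %H:%M:%S") + ":%03d" % (ts.microsecond // 1000)
    out = io.StringIO()
    # csv.writer quotes fields that contain the separator, keeping the row valid.
    csv.writer(out, lineterminator="\n").writerow([stamp, level, latency, message])
    return out.getvalue()
```

Letting a csv writer handle quoting matters when the message itself contains commas; otherwise the csv filter would split it into extra columns.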

Results:
The last step is to create some pretty graphs; this is easy if you look at some of the sample dashboards. You also need a relatively large server if you plan to mine a lot of data. Still, the queries were quite fast in my case even with millions of rows. - "It is also easy to export the graphs and dashboard to other servers"

Summary:
All I wanted to do was to enable others to analyze a large collection of logs without using scripts. It shouldn't have so many gotchas, even for a newbie. At least, next time it shouldn't take me this long. :)

"What are the key takeaways?"
If you are attempting to use ELK for log analysis, save yourself some time by installing a beefy Linux server with sufficient memory, disk space & Java 8. Do not attempt to post directly to the Elasticsearch server. Please do not ignore logstash - it automatically takes care of creating the data types, index names, throttling etc. Try to learn grok patterns for full flexibility, even though I worked around it with the csv filter.

It took me two days between meetings to finally get this going, but this is a good tool to have in your toolbox and can be handy for a variety of tasks.

More Details:
# This is the logstash configuration (the output csv is used for debugging only). There is also a simple awk script to pre-process each log line into a valid csv.

input {
  file {
    path => "/*.csv"
    start_position => "beginning"
  }
}
filter {
  csv {
    separator => ","
    columns => ['', '']
  }
  date {
    match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss:SSS" ]
  }
  mutate {
    convert => [ "" , "integer" ]
  }
}
output {
  elasticsearch {
    hosts => ":9201"
    index => "-%{+YYYY.MM.dd}"
  }

  stdout {
    codec => rubydebug
  }
}




Comments

Shahid said…
Very helpful. Hopefully you will write more often.
thanks, :) probably, maybe.
Oz said…
Good blog entry Vishnu.
