What is Logstash?

By: Sadequl Hussain, Product Marketing Associate

In this article, we are going to have a quick introduction to Logstash, a very popular application for collecting, processing and filtering log data – and see how it works.

We will review Logstash plugins, installation, and configuration; briefly mention Beats; and compare Logstash to other log collectors and review Logstash alternatives.

Let’s begin.

The classic definition of Logstash says it’s an open-source, server-side data processing pipeline that can simultaneously ingest data from a wide variety of sources, then parse, filter, transform and enrich the data, and finally forward it to a downstream system.

In most cases, the downstream system is Elasticsearch, although it doesn’t always have to be that, as we will learn later.

Logstash is typically used as the “processing” engine for any log management solution (or systems that deal with changing data streams).

These applications collect logs from different sources (software, hardware, electronic devices, API calls, etc.), process the collected data, and forward it to a different application for further processing or storage.

This makes Logstash essentially a data pipeline. But there’s more to it than typical pipelines that data engineers develop.

Usually, custom-developed data pipelines extract specific types of data from specific sources, perform some predefined actions on the data, and then save the result in a specific location.

This is what an ETL (Extract, Transform, Load) job does.

Unlike ETL jobs, though, Logstash is a generic engine, which means it can accept data from many different sources out of the box.

This data can be structured, semi-structured, or unstructured, and can have many different schemas. To Logstash, it is all “logs” containing “events”.

Logstash can easily parse and filter the data from these log events using one or more of the filter plugins that ship with it.

Finally, it can send the filtered output to one or more destinations. Again, there are prebuilt output interfaces that make this task simple.

How Logstash Works

Data flows through a Logstash pipeline in three stages: the input stage, the filter stage, and the output stage.

In each stage, there are plugins that perform some action on the data.

This is shown in the image below.

[Image: How Logstash works – the input, filter, and output stages]
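
These three stages map directly onto the three sections of a Logstash configuration file. Here is a minimal sketch of that structure – the plugin choices below are only placeholders, not part of the example pipeline we build later:

input {
  stdin { }                              # ingest events typed on the console
}
filter {
  mutate { add_tag => [ "example" ] }    # any parsing/enrichment goes here
}
output {
  stdout { codec => rubydebug }          # print the processed events
}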

In the input stage, data is ingested into Logstash from a source.

Logstash itself doesn’t access the source system and collect the data; it uses input plugins to ingest data from various sources.

Note: There’s a multitude of input plugins available for Logstash, covering sources such as log files, relational databases, NoSQL databases, Kafka queues, HTTP endpoints, S3 files, CloudWatch Logs, log4j events, and Twitter feeds.

Once data is ingested, one or more filter plugins take care of the processing part in the filter stage.

In this stage, necessary data elements are extracted from the input stream.

Remember: different types of filter plugins exist for different processing needs.

For example, there are plugins for parsing and processing XML, JSON, CSV, and unstructured data, for handling API responses, for geocoding IP addresses, and for working with relational data.

The processed data is sent to a receiver in the output stage.

Like input plugins, there are output plugins available for many different endpoints, including those for Elasticsearch,  HTTP, e-mail, S3 file, PagerDuty alert, or Syslog to name just a few.

As a standalone data pipeline, Logstash isn’t worth much.

Logstash’s real value comes when its processed data is saved in a high-performance, searchable storage engine and is easily viewable from a user interface tier.

In the ELK stack, the storage (and indexing) engine is Elasticsearch and the UI is Kibana.

Note: the destination doesn’t have to be Elasticsearch, nor does the UI have to be Kibana. But more on that later.

Logstash Input Plugins

There are many input plugins available for Logstash for different types of events. Here are some common ones:

  • Beats – The Beats plugins can ingest common types of data and logs into Logstash. For example, winlogbeat can ingest Windows event logs, and filebeat can ingest the contents of a file.
  • Cloudwatch – This plugin can pull log events from AWS CloudWatch.
  • file – The file plugin can capture events from a file and stream them to Logstash.
  • exec – The output of a shell script or command is captured by this plugin.
  • HTTP – The http plugin can receive data from endpoints using the HTTP protocol.
  • JDBC – This plugin can be used to ingest data from JDBC-compliant databases.
  • Kafka – Messages from a Kafka topic can be streamed in with this plugin.
  • S3 – This plugin can stream events from files in an S3 bucket.
  • SNMP – This plugin is used to stream in log events from network devices using the Simple Network Management Protocol.
  • SQS – The sqs plugin is used to capture messages from AWS Simple Queue Service queues.
  • Syslog – System log events are ingested using the syslog plugin.
  • TCP – Events from a TCP socket can be streamed using the tcp plugin.
  • UDP – This plugin can capture events over the UDP protocol.

The complete list of input plugins can be found in the GitHub repository for Logstash plugins.
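
To give an idea of how these look in practice, here is a sketch of an input section that combines two of the plugins above, Beats and Kafka – the port, broker address, and topic name are made up for illustration:

input {
  beats {
    port => 5044                          # listen for Filebeat/Winlogbeat traffic
  }
  kafka {
    bootstrap_servers => "kafka01:9092"   # hypothetical broker address
    topics => ["app-logs"]                # hypothetical topic name
  }
}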

Logstash Filter Plugins

Filter plugins are used to process ingested data. Here are some of the native filter plugins available for Logstash:

  • grok – The grok plugin can transform unstructured data into something structured and queryable.
  • json – This plugin is used to parse event data from a JSON payload.
  • xml – This is used to parse event data from an XML payload.
  • csv – The csv plugin parses comma-separated data and separates it into individual fields.
  • split – This splits a multi-line input event into separate events.
  • clone – The clone filter plugin duplicates an event record.
  • dns – Performs a reverse DNS lookup on the event data.
  • geoip – The geoip plugin adds geographical information about an IP address in the input event.
  • urldecode – This decodes URL-encoded fields.

Here’s a more detailed list of Logstash filter plugins.
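
As a quick illustration of how filter plugins are combined, here is a sketch that parses a JSON payload, geocodes the client IP, and drops the raw payload once parsed – the client_ip field name is an assumption for the example:

filter {
  json {
    source => "message"            # parse the raw event body as JSON
  }
  geoip {
    source => "client_ip"          # assumes the JSON contained a client_ip field
  }
  mutate {
    remove_field => [ "message" ]  # drop the raw payload once parsed
  }
}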

Logstash Output Plugins

Output plugins are used to send data from Logstash to one or more destinations. Like input and filter plugins, there are many output plugins available for Logstash:

  • WebSocket – This is used for sending the filtered/processed data to a WebSocket.
  • TCP – This plugin sends data to a TCP socket.
  • Stdout – This sends data to standard output.
  • HTTP – The http output plugin can send data to endpoints using the HTTP protocol.
  • Syslog – The syslog output plugin sends event data to a Syslog server.
  • SNS – Pushes event data to an AWS Simple Notification Service topic.
  • Pipe – This is used to pipe the output data to another application’s input.
  • File – This plugin writes output data to a file on disk.
  • Email – This sends output data to a specified e-mail address.
  • Kafka – Writes the events to a Kafka topic.
  • MongoDB – The MongoDB output plugin writes events to a MongoDB database.
  • Sink – This plugin discards any data it receives and does not send it anywhere.

And once again, the GitHub repo for Logstash output plugins shows a more detailed list.
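
For completeness, here is a sketch of an output section that indexes events into Elasticsearch and also echoes them to the console – the host and index name are placeholders:

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]    # placeholder Elasticsearch endpoint
    index => "weblogs-%{+YYYY.MM.dd}"     # daily index with a placeholder name
  }
  stdout { codec => rubydebug }           # also print events for debugging
}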

Installing Logstash

Installing Logstash is fairly simple.

Here, we are going to install Logstash on an Amazon Linux 2 server.

First, we will install Java 8 (OpenJDK), so we run the following command as a sudo user:

# yum install java-1.8.0-openjdk -y

Once the packages are installed, we run the “alternatives” command to specify the version:

# alternatives --config java
There is 1 program that provides 'java'.
 Selection    Command
-----------------------------------------------
*+ 1           java-1.8.0-openjdk.x86_64 
(/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.amzn2.0.1.x86_64/jre/bin/java)
Enter to keep the current selection[+], or type selection number: 1

Finally, we check the Java version:

# java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

With Java installed, we run the following command to download and install the Elasticsearch public signing key:

# rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch

We then create a file called logstash.repo under the /etc/yum.repos.d/ directory:

# vi /etc/yum.repos.d/logstash.repo
[logstash-7.x]
name=Elastic repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

Finally, we install Logstash:

# yum install logstash -y
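
To confirm the installation, we can ask Logstash to print its version – when installed from the RPM, the binary lives under /usr/share/logstash:

# /usr/share/logstash/bin/logstash --version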

Configuring Logstash for a Pipeline

Let’s create a pipeline in Logstash now.

We will use the sample Logstash configuration from Elastic’s GitHub repo, but change it slightly for our purpose.

First, we download a sample Apache log file from Elastic’s GitHub repo:

# wget https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs

If we look at the file, we can see it contains typical web server log events:

# head apache_logs
83.149.9.216 - - [17/May/2015:10:05:03 +0000] "GET /presentations/logstash-monitorama-2013/images/kibana-search.png HTTP/1.1" 200 203023 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
83.149.9.216 - - [17/May/2015:10:05:43 +0000] "GET /presentations/logstash-monitorama-2013/images/kibana-dashboard3.png HTTP/1.1" 200 171717 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
83.149.9.216 - - [17/May/2015:10:05:47 +0000] "GET /presentations/logstash-monitorama-2013/plugin/highlight/highlight.js HTTP/1.1" 200 26185 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
83.149.9.216 - - [17/May/2015:10:05:12 +0000] "GET /presentations/logstash-monitorama-2013/plugin/zoom-js/zoom.js HTTP/1.1" 200 7697 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"

Next, we create a configuration file under the /etc/logstash/conf.d directory and name it logstash_apache.conf:

# vim /etc/logstash/conf.d/logstash_apache.conf
input {
  file {
    path => "/root/apache_logs"
    mode => "read"
    start_position => "beginning"
    file_completed_action => "log"
    file_completed_log_path => "/tmp/finish.log"
    ignore_older => 864000
    file_chunk_size => 33554432
  }
}
filter {
  grok {
    match => {
      "message" => '%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:verb} %{DATA:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:response:int} (?:-|%{NUMBER:bytes:int}) %{QS:referrer} %{QS:agent}'
    }
  }
  date {
    match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
    locale => en
  }
  geoip {
    source => "clientip"
  }
  useragent {
    source => "agent"
    target => "useragent"
  }
}
output {
  stdout { codec => "rubydebug" }
}

This is a Logstash pipeline. It has three sections: one for the input plugin, one for the filters, and one for the output.

For the input, we are using the file plugin.

We are specifying the file’s path, asking the plugin to open the file and read it from the beginning, and setting a few other parameters. For example, we are saying that once the file has been read, Logstash should log its name in a file under the /tmp directory.

In the filter section, we are using four plugins to parse Apache log events:

grok for parsing each line, the date plugin for parsing the event date and time, geoip for geocoding the IP address, and useragent for parsing the user agent field.

Finally, we are telling Logstash to write the results to standard output, which is the console, in the rubydebug format.

In short: this pipeline will read our Apache log file, parse each line into the specified fields, and then print the results on the screen.
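
Before starting the pipeline, we can optionally ask Logstash to validate the configuration file and exit, which catches syntax errors early:

# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash_apache.conf --config.test_and_exit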

Running the Pipeline

We can now run the command below to start the pipeline.

Note how we are using the -f switch to specify the configuration file:

# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash_apache.conf

It will start with a message block like the following:

Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
[WARN ] 2020-06-30 11:02:06.129 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[INFO ] 2020-06-30 11:02:06.140 [LogStash::Runner] runner - Starting Logstash {"logstash.version"=>"7.8.0", "jruby.version"=>"jruby 9.2.11.1 (2.5.7) 2020-03-25 b1f55b1a40 OpenJDK 64-Bit Server VM 25.252-b09 on 1.8.0_252-b09 +indy +jit [linux-x86_64]"}
[INFO ] 2020-06-30 11:02:08.247 [Converge PipelineAction::Create] Reflections - Reflections took 49 ms to scan 1 urls, producing 21 keys and 41 values
[INFO ] 2020-06-30 11:02:09.549 [[main]-pipeline-manager] geoip - Using geoip database {:path=>"/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-filter-geoip-6.0.3-java/vendor/GeoLite2-City.mmdb"}
[INFO ] 2020-06-30 11:02:10.085 [[main]-pipeline-manager] javapipeline - Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>2, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>250, "pipeline.sources"=>["/etc/logstash/conf.d/logstash_apache.conf"], :thread=>"#"}
[INFO ] 2020-06-30 11:02:11.445 [[main]-pipeline-manager] file - No sincedb_path set, generating one based on the "path" setting
{:sincedb_path=>"/usr/share/logstash/data/plugins/inputs/file/.sincedb_d2aed600d1e56802aff928fa76d3d925", :path=>["/root/apache_logs"]}
[INFO ] 2020-06-30 11:02:11.478 [[main]-pipeline-manager] javapipeline - Pipeline started {"pipeline.id"=>"main"}
[INFO ] 2020-06-30 11:02:11.548 [Agent thread] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[INFO ] 2020-06-30 11:02:11.626 [[main]9600}

Then output will quickly scroll past on the screen as the pipeline prints the parsed results from each event line in the rubydebug codec.

Here is an example of fields parsed from a single line:

{
    "host" => "ip-16-0-1-80.ec2.internal",
    "geoip" => {
        "postal_code" => "144700",
        "continent_code" => "EU",
        "city_name" => "Moscow",
        "latitude" => 55.7527,
        "region_name" => "Moscow",
        "timezone" => "Europe/Moscow",
        "longitude" => 37.6172,
        "country_code3" => "RU",
        "location" => {
            "lon" => 37.6172,
            "lat" => 55.7527
        },
        "region_code" => "MOW",
        "ip" => "83.149.9.216",
        "country_name" => "Russia",
        "country_code2" => "RU"
    },
    "useragent" => {
        "build" => "",
        "name" => "Chrome",
        "os_major" => "10",
        "major" => "32",
        "minor" => "0",
        "os" => "Mac OS X",
        "device" => "Other",
        "os_name" => "Mac OS X",
        "os_minor" => "9",
        "patch" => "1700"
    },
    "referrer" => "\"http://semicomplete.com/presentations/logstash-monitorama-2013/\"",
    "path" => "/root/apache_logs",
    "httpversion" => "1.1",
    "auth" => "-",
    "timestamp" => "17/May/2015:10:05:03 +0000",
    "request" => "/presentations/logstash-monitorama-2013/images/kibana-search.png",
    "message" => "83.149.9.216 - - [17/May/2015:10:05:03 +0000] \"GET /presentations/logstash-monitorama-2013/images/kibana-search.png HTTP/1.1\" 200 203023 \"http://semicomplete.com/presentations/logstash-monitorama-2013/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36\"",
    "ident" => "-",
    "verb" => "GET",
    "response" => 200,
    "agent" => "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36\"",
    "@timestamp" => 2015-05-17T10:05:03.000Z,
    "bytes" => 203023,
    "clientip" => "83.149.9.216",
    "@version" => "1"
}

You can see how the Logstash pipeline was able to parse an event and extract fields from it.

Not only did it extract the fields, but it also used a filter like the geoip to add extra information about the client IP address location.

We can stop the Logstash process by pressing Ctrl+C in the command prompt.

If we rerun the command, there will be nothing printed except the header information.

This is because the file input plugin keeps track of the current position within a file.

It does so using another hidden file called “sincedb”. By default, sincedb is located in the data folder of Logstash.

If the file input process fails halfway through reading the file, Logstash can use this information in sincedb to start from where it had left off.

On the other hand, if the file is fully processed, the plugin will know it does not have to do anything.

Also, by default, the file input plugin watches and reads the tail of a file.

This is because the file input plugin is used for reading from live log files where events are continuously added to the end.

In our case, we are using a static copy of an Apache log file, so we are using start_position => "beginning" as a configuration value.
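
If we ever want Logstash to re-read the same static file from scratch on every run – for example, while testing a pipeline – one common approach is to point sincedb at a throwaway location. A minimal sketch of that tweak to the file input:

input {
  file {
    path => "/root/apache_logs"
    mode => "read"
    start_position => "beginning"
    sincedb_path => "/dev/null"    # don't persist read positions between runs
  }
}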

To run Logstash as a service, we can run this command:

# systemctl start logstash.service
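
To have the service start automatically at boot, and to check that it is running, we can use the usual systemd commands:

# systemctl enable logstash.service
# systemctl status logstash.service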

Now let’s review Logstash alternatives.

Skip to section: Logstash vs. PortX.

Logstash vs. Fluentd 

Logstash isn’t the only log collection and processing engine in the market – there are others that can do the same task. The most commonly mentioned alternative among these is Fluentd.

Fluentd is another open-source log processing pipeline.

Like Logstash, Fluentd can ingest data from many different sources, parse, analyze and transform the data, and push it to different destinations.

However, there are some differences between these two technologies.

Logstash is part of the popular Elastic stack – often dubbed the ELK stack – consisting of Elasticsearch, Logstash, and Kibana. Coming from the same vendor Elastic, these three tools have tight integration.

Fluentd is a project of “Cloud Native Computing Foundation” (CNCF).

Logstash is written in JRuby – the Java implementation of the Ruby programming language – whereas Fluentd is written in CRuby, the C implementation of Ruby.

Note: The Java dependency means Logstash needs a Java runtime available on its server.

Part of log data processing can often involve using a “route” – specifying where the pipeline should send its data if multiple output plugins are defined.

  • In Logstash, this is done using a conditional check with an “if” statement. For example, if a condition is met, Logstash will send its data to one destination; if the condition turns out to be false, the data goes somewhere else (see the sketch after this list).
  • Fluentd uses tags to route events to output plugins.
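
Here is a sketch of that kind of conditional routing in a Logstash output section – the tag, index names, and endpoint are assumptions made for illustration:

output {
  if "error" in [tags] {
    elasticsearch {
      hosts => ["http://localhost:9200"]   # placeholder endpoint
      index => "errors-%{+YYYY.MM.dd}"     # error events go to their own index
    }
  } else {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "app-%{+YYYY.MM.dd}"        # everything else goes here
    }
  }
}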

For high availability (HA):

  • Logstash can use the Beats protocol to ensure both load balancing and resiliency of messages.
  • Fluentd on the other hand uses both active-active and active-passive deployment architecture for both HA and scalability. Fluentd can forward events to any number of additional processing nodes.

Both log processing pipelines come with an extensive collection of plugins.

  • The Logstash official GitHub repository offers over 200 plugins, all in one place.
  • The Fluentd official GitHub repo hosts fewer plugins, although more than 500 additional plugins can be found in other repos.

Logstash and Beats

Initially, Logstash was solely responsible for both ingesting and processing data.

Unfortunately, this created performance bottlenecks for complex pipelines. To make things simple, Beats was born. 

Beats is a collection of lightweight, open-source tools that can collect logs from many different sources and forward those to either Logstash, or directly to Elasticsearch.

Being lightweight, these applications have small resource utilization footprints, and work like agents installed on a server. 

Some common types of Beats that are typically used with Logstash are: Filebeat, which can ship log files from servers; Winlogbeat, which can collect Windows events; Metricbeat, which can collect server metrics; and Packetbeat, which can extract network-related data. Other Beats are developed by Elastic or the user community.

PortX vs. Logstash

How Does XPLG’s PortX Compare With Logstash?

XpoLog is an advanced platform for collecting, parsing, indexing, and analyzing data from modern, hybrid cloud networks. 

Some of XpoLog’s features include fast and efficient indexing, Artificial Intelligence (AI) powered anomaly and trend detection, real-time analytics, a leading apps marketplace with over 1,000 ready-to-use reports and dashboards, a monitoring system, and more tools to investigate and manage log data – and the platform continues to evolve all the time.

At a higher level: XpoLog can be thought of as the ELK stack tightly knit together.

Unlike different server layers working separately, the collection, processing, storage, and user interface layers are all part of a single application in XpoLog.

XpoLog’s interface is stunning and friendly to use, and the XpoLog platform is fully automated.

PortX

At the core of XpoLog is PortX, the engine that collects, parses, indexes and enriches log data before forwarding it to one or more destinations.

Available as a separate product, PortX is similar to Logstash in functionality, only much better.

  • It’s capable of getting data from many different sources, including Beats and Logstash itself.
  • The log collection is agent-less so there’s no need to set up and configure collection agents on source systems.
  • Setting up a processing pipeline in PortX is 90% faster than it is in Logstash because there are no complex pipeline configurations to write.

By default, PortX will send its parsed and filtered data to XpoLog, but customers can opt to use any platform of their choice as a destination, including Logstash or Elasticsearch. With this approach, customers can keep their existing investment in the ELK stack.

In the image below, we have uploaded our Apache log file to XpoLog 7 running on the same server where we installed Logstash:

[Screenshot: the Apache log file uploaded to XpoLog 7]

Immediately, PortX parses the known file format using one of its built-in parsing patterns.

Note: we didn’t have to write any grok patterns here.

[Screenshot: PortX parsing the known file format using one of its built-in parsing patterns]

Based on the automatic pattern, it also shows us the parsed fields in separate columns:

[Screenshot: the parsed fields shown in separate columns]

This automatic parsing feature of PortX can easily identify data patterns from many different applications and systems.

It’s also possible to build data filtering logic with regular expressions using a log viewer and visual pattern builder.

In addition, there are multiple plugins that support automatic parsing for hundreds of systems, which makes the configuration process fast and efficient.

Optimize costs: Once the data is parsed, and indexed, it can be filtered and sent to one or more consumers like Logstash, Elasticsearch, Splunk, SIEM tools, or any other logging tool.

In a different architectural pattern: Logstash can be kept as a data collection and processing engine and XpoLog can be used as the indexing and user interface layer, replacing both Elasticsearch and Kibana.

Logstash output plugin

XpoLog has its own Logstash output plugin which is a Ruby application. Using this plugin, a Logstash instance can send data to XpoLog.

At XpoLog end, a “listener” can receive the data and make it available for indexing, searching, and analyzing.

An XpoLog listener is a part of the application that can monitor incoming traffic over different protocols such as Syslog (using UDP or TCP), HTTP/S, XpoLog Agents, Cisco routers and switches, and Kafka topics. The Logstash output plugin sends its data over the HTTP/S or Syslog protocol to specific ports.
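
As a rough illustration of forwarding events from Logstash to an HTTP listener, here is a sketch using the standard Logstash http output plugin – the listener URL and port below are hypothetical, not XpoLog’s documented defaults, and XpoLog’s own output plugin has its own configuration:

output {
  http {
    url => "https://xpolog.example.com:8088/logstash"   # hypothetical XpoLog listener endpoint
    http_method => "post"                               # send each event as an HTTP POST
    format => "json"                                    # serialize the event as JSON
  }
}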

Other areas where XpoLog outshines Logstash

XpoLog is a comprehensive platform to manage log data, and to monitor, investigate, and view insights out-of-the-box. But why use XpoLog over the open-source Logstash?

Some of the features that should also be considered are:

  • Advanced and efficient visual tools to parse data, and forward it in a structured manner as a whole or filtered.
  • Built-in system health check and load balancing for optimal performance.
  • Efficient data storage.
  • Automatic tagging of data elements.
  • Alerting and visualizations.
  • Single Sign-On (SSO) facility with Active Directory or other identity providers.
  • Built-in security and data masking.

Final Words

This was a very quick introduction to Logstash and how it works.

As we saw, the product is quite versatile, and allows parallel data processing from many different sources.

We also reviewed some Logstash alternatives such as Fluentd and XPLG’s PortX and compared each option.

XpoLog is the ELK stack alternative, which takes away much of the pain of upgrading, patching, load balancing, performance tuning and security hardening of Logstash, or the ELK stack in general.

With XpoLog, you also don’t have to install extra plugins, write complex pipeline configurations, or think about interfacing with a separate storage and indexing layer.

Test all these benefits and more: