What is Logstash?
By: Sadequl Hussain, Product Marketing Associate
In this article, we are going to have a quick introduction to Logstash, a very popular application for collecting, processing and filtering log data – and see how it works.
We will review Logstash plugins, installation, and configuration; briefly mention Beats; and compare Logstash with other log collectors and alternatives.
Let’s begin
The classic definition of Logstash says it’s an open-source, server-side data processing pipeline that can simultaneously ingest data from a wide variety of sources, then parse, filter, transform and enrich the data, and finally forward it to a downstream system.
In most cases, the downstream system is Elasticsearch, although it doesn’t always have to be that, as we will learn later.
Logstash is typically used as the “processing” engine for any log management solution (or systems that deal with changing data streams).
These applications collect logs from different sources (software, hardware, electronic devices, API calls, etc.), process the collected data, and forward it to a different application for further processing or storage.
This makes Logstash essentially a data pipeline. But there’s more to it than typical pipelines that data engineers develop.
Usually, custom-developed data pipelines extract specific types of data from specific sources, perform some predefined actions on the data, and then save the result in a specific location.
This is what an ETL (Extraction, Transformation, and Loading) job will do.
Unlike ETL jobs, though, Logstash is a generic engine, which means it can accept data from many different sources out of the box.
This data can be structured, semi-structured, or unstructured, and can have many different schemas. To Logstash, all of it is "logs" containing "events".
Logstash can easily parse and filter out the data from these log events using one or more filtering plugins that come with it.
Finally, it can send the filtered output to one or more destinations. Again, there are prebuilt output interfaces that make this task simple.
How Logstash Works
Data flows through a Logstash pipeline in three stages: the input stage, the filter stage, and the output stage.
In each stage, there are plugins that perform some action on the data.
This is shown in the image below.
In the input stage, data is ingested into Logstash from a source.
Logstash itself doesn't access the source system and collect the data; it uses input plugins to ingest the data from various sources.
Note: There's a multitude of input plugins available for Logstash, such as various log files, relational databases, NoSQL databases, Kafka queues, HTTP endpoints, S3 files, CloudWatch Logs, log4j events, or Twitter feeds.
Once data is ingested, one or more filter plugins take care of the processing part in the filter stage.
In this stage, necessary data elements are extracted from the input stream.
Remember: different types of filter plugins exist for different processing needs.
For example, there are plugins for parsing and processing XML, JSON, unstructured, and CSV data, API responses, and relational data, as well as for geocoding IP addresses.
The processed data is sent to a receiver in the output stage.
Like input plugins, there are output plugins available for many different endpoints, including those for Elasticsearch, HTTP, e-mail, S3 file, PagerDuty alert, or Syslog to name just a few.
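To make the three stages concrete, here is a minimal pipeline sketch; the stdin, mutate, and stdout plugins are chosen purely for illustration, and the added field name is arbitrary:

input {
  stdin { }                                   # read events typed on the console
}
filter {
  mutate {
    add_field => { "pipeline" => "demo" }     # a trivial enrichment step
  }
}
output {
  stdout { codec => rubydebug }               # print the processed event
}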
As a standalone data pipeline, Logstash isn’t worth much.
Logstash's real value comes when its processed data is saved in a high-performance, searchable storage engine and is easily viewable from a user interface tier.
In the ELK stack, the storage (and indexing) engine is Elasticsearch and the UI is Kibana.
Note: the destination doesn’t have to be Elasticsearch, nor does the UI have to be Kibana. But more on that later.
Logstash Input Plugins
There are many input plugins available for Logstash for different types of events. Here are some common ones:
Beats | The beats plugin can ingest common types of data and logs into Logstash. For example, Winlogbeat can ingest Windows event logs and Filebeat can ingest the contents of a file |
Cloudwatch | This plugin can pull log events from AWS CloudWatch |
file | The file plugin can capture events from a file and stream them to Logstash |
exec | The output of a shell script or command is captured by this plugin |
HTTP | The http plugin can receive data from endpoints using HTTP protocol |
JDBC | This plugin can be used to ingest data from JDBC compliant databases |
Kafka | Messages from a Kafka topic can be streamed in with this plugin |
S3 | This plugin can stream events from files in an s3 bucket |
SNMP | This plugin streams in log events from network devices using the Simple Network Management Protocol |
SQS | The SQS plugin is used to capture messages from AWS Simple Queue Service queues |
Syslog | System log events are ingested using the syslog plugin |
TCP | Events from a TCP socket can be streamed using the tcp plugin |
UDP | This plugin can capture events over the UDP protocol |
The complete list of input plugins can be found in the GitHub repository for Logstash plugins.
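As a quick, hedged illustration of the table above, the snippet below combines two of these input plugins; the port numbers are arbitrary choices for this sketch:

input {
  http {
    port => 8080            # accept events posted over HTTP
  }
  tcp {
    port => 5000            # accept newline-delimited events over TCP
    codec => json_lines     # assumes each line is a JSON document
  }
}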
Logstash Filter Plugins
Filter plugins are used to process ingested data. Here are some of the native filter plugins available for Logstash:
grok | The grok plugin can transform unstructured data into something structured and queryable. |
JSON | This plugin is used to parse event data from a JSON payload |
xml | This is used to parse event data from an XML payload |
csv | The csv plugin parses comma-separated data and separates it into individual fields |
Split | This splits a multi-line input event into separate event lines. |
Clone | The clone filter plugin duplicates an event record |
DNS | Performs a reverse DNS lookup from the event data |
GeoIP | The geoip plugin adds geographical information about an IP address in the input event |
urldecode | This decodes URL-encoded fields |
Here’s a more detailed list of Logstash filter plugins.
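As a small, hedged sketch of how filter plugins combine, the snippet below parses a JSON payload, geocodes an IP field, and drops the raw message; the field name client_ip is hypothetical:

filter {
  json {
    source => "message"              # parse the raw line as JSON
  }
  geoip {
    source => "client_ip"            # hypothetical field holding the client IP
  }
  mutate {
    remove_field => [ "message" ]    # drop the raw line once it is parsed
  }
}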
Logstash Output Plugins
Output plugins are used to send data from Logstash to one or more destinations. Like input and filter plugins, there are many output plugins available for Logstash:
WebSocket | This is used for sending the filtered/processed data to a WebSocket |
TCP | This plugin sends data to a tcp socket |
Stdout | This sends out data to standard output |
HTTP | The HTTP output plugin can send data to endpoints using HTTP protocol |
Syslog | The Syslog output plugin sends event data to a Syslog server |
SNS | Pushes event data to an AWS Simple Notification Service topic |
Pipe | This is used to pipe the output data to another application’s input |
File | This plugin writes output data to a file on disk |
Email | This sends output data to a specified e-mail address |
Kafka | Writes the events to a Kafka topic |
MongoDB | The MongoDB output plugin writes events to a MongoDB database |
Sink | This plugin discards any data received and does not send it anywhere |
And once again, the GitHub repo for Logstash output plugins shows a more detailed list.
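As a hedged sketch, the snippet below fans events out to two of the destinations listed above; the Elasticsearch address, index name, and file path are assumptions for illustration:

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]                       # assumed local Elasticsearch
    index => "weblogs-%{+YYYY.MM.dd}"                        # hypothetical daily index
  }
  file {
    path => "/var/log/logstash/events-%{+YYYY-MM-dd}.log"    # hypothetical archive path
  }
}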
Installing Logstash
Installing Logstash is fairly simple.
Here, we are going to install Logstash on an Amazon Linux 2 server.
First, we install Java 8 (OpenJDK) by running the following command as a sudo user:
# yum install java-1.8.0-openjdk -y
Once the packages are installed, we run the “alternatives” command to specify the version:
# alternatives --config java
There is 1 program that provides 'java'.
Selection Command
-----------------------------------------------
*+ 1 java-1.8.0-openjdk.x86_64
(/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.amzn2.0.1.x86_64/jre/bin/java)
Enter to keep the current selection[+], or type selection number: 1
Finally, we check the Java version:
# java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
With Java installed, we run the following command to download and install the Elasticsearch public signing key:
# rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
We then create a file called logstash.repo under the /etc/yum.repos.d/ directory:
# vi /etc/yum.repos.d/logstash.repo

[logstash-7.x]
name=Elastic repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
Finally, we install Logstash:
# yum install logstash -y
Configuring Logstash for a Pipeline
Let’s create a pipeline in Logstash now.
We will use the sample Logstash configuration from Elastic’s GitHub repo, but change it slightly for our purpose.
First, we download a sample Apache log file from Elastic’s GitHub repo:
wget https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs
If we look at the file, we can see it contains typical web server log events:
# head apache_logs
83.149.9.216 - - [17/May/2015:10:05:03 +0000] "GET /presentations/logstash-monitorama-2013/images/kibana-search.png HTTP/1.1" 200 203023 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
83.149.9.216 - - [17/May/2015:10:05:43 +0000] "GET /presentations/logstash-monitorama-2013/images/kibana-dashboard3.png HTTP/1.1" 200 171717 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
83.149.9.216 - - [17/May/2015:10:05:47 +0000] "GET /presentations/logstash-monitorama-2013/plugin/highlight/highlight.js HTTP/1.1" 200 26185 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
83.149.9.216 - - [17/May/2015:10:05:12 +0000] "GET /presentations/logstash-monitorama-2013/plugin/zoom-js/zoom.js HTTP/1.1" 200 7697 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
…
Next, we create a configuration file under the /etc/logstash/conf.d directory and name it logstash_apache.conf:
# vim /etc/logstash/conf.d/logstash_apache.conf

input {
  file {
    path => "/root/apache_logs"
    mode => "read"
    start_position => "beginning"
    file_completed_action => "log"
    file_completed_log_path => "/tmp/finish.log"
    ignore_older => 864000
    file_chunk_size => 33554432
  }
}
filter {
  grok {
    match => { "message" => '%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:verb} %{DATA:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:response:int} (?:-|%{NUMBER:bytes:int}) %{QS:referrer} %{QS:agent}' }
  }
  date {
    match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
    locale => en
  }
  geoip {
    source => "clientip"
  }
  useragent {
    source => "agent"
    target => "useragent"
  }
}
output {
  stdout {
    codec => "rubydebug"
  }
}
This is a Logstash pipeline. It has three sections: one for the input plugin, one for the filters, and the last one for the output.
For the input, we are using the file plugin.
We are specifying the file's path, asking the plugin to read the file from the beginning, and setting a few other parameters. For example, we are specifying that once the file has been read, Logstash should log its name in a file under the /tmp directory.
In the filter section, we are using four plugins to parse Apache log events:
grok for parsing each line, the date plugin for parsing the event date and time, geoip for geocoding the IP address, and useragent for parsing the user agent field.
Finally, we are telling Logstash to print the results to standard output (the console) in the rubydebug format.
In short: this pipeline will read our Apache log file, parse each line into a specified set of fields, and then print the results on the screen.
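One note on the grok filter: instead of spelling out the full pattern, a shorter, hedged variant can use Logstash's built-in pattern for the Apache combined log format (named COMBINEDAPACHELOG, or HTTPD_COMBINEDLOG in newer pattern sets):

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }   # built-in Apache combined log pattern
  }
}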
Running the Pipeline
We can now run the pipeline with the command below.
Note how we are using the -f switch to specify the configuration file:
# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash_apache.conf
It will start with a message block like the following:
Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
[WARN ] 2020-06-30 11:02:06.129 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[INFO ] 2020-06-30 11:02:06.140 [LogStash::Runner] runner - Starting Logstash {"logstash.version"=>"7.8.0", "jruby.version"=>"jruby 9.2.11.1 (2.5.7) 2020-03-25 b1f55b1a40 OpenJDK 64-Bit Server VM 25.252-b09 on 1.8.0_252-b09 +indy +jit [linux-x86_64]"}
[INFO ] 2020-06-30 11:02:08.247 [Converge PipelineAction::Create ] Reflections - Reflections took 49 ms to scan 1 urls, producing 21 keys and 41 values
[INFO ] 2020-06-30 11:02:09.549 [[main]-pipeline-manager] geoip - Using geoip database {:path=>"/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-filter-geoip-6.0.3-java/vendor/GeoLite2-City.mmdb"}
[INFO ] 2020-06-30 11:02:10.085 [[main]-pipeline-manager] javapipeline - Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>2, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>250, "pipeline.sources"=>["/etc/logstash/conf.d/logstash_apache.conf"], :thread=>"#"}
[INFO ] 2020-06-30 11:02:11.445 [[main]-pipeline-manager] file - No sincedb_path set, generating one based on the "path" setting {:sincedb_path=>"/usr/share/logstash/data/plugins/inputs/file/.sincedb_d2aed600d1e56802aff928fa76d3d925", :path=>["/root/apache_logs"]}
[INFO ] 2020-06-30 11:02:11.478 [[main]-pipeline-manager] javapipeline - Pipeline started {"pipeline.id"=>"main"}
[INFO ] 2020-06-30 11:02:11.548 [Agent thread] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[INFO ] 2020-06-30 11:02:11.626 [[main]9600}
Then the output will quickly scroll past as the pipeline prints the parsed result of each event line in the rubydebug codec.
Here is an example of fields parsed from a single line:
{ "host" => "ip-16-0-1-80.ec2.internal", "geoip" => { "postal_code" => "144700", "continent_code" => "EU", "city_name" => "Moscow", "latitude" => 55.7527, "region_name" => "Moscow", "timezone" => "Europe/Moscow", "longitude" => 37.6172, "country_code3" => "RU", "location" => { "lon" => 37.6172, "lat" => 55.7527 }, "region_code" => "MOW", "ip" => "83.149.9.216", "country_name" => "Russia", "country_code2" => "RU" }, "useragent" => { "build" => "", "name" => "Chrome", "os_major" => "10", "major" => "32", "minor" => "0", "os" => "Mac OS X", "device" => "Other", "os_name" => "Mac OS X", "os_minor" => "9", "patch" => "1700" }, "referrer" => "\"http://semicomplete.com/presentations/logstash-monitorama-2013/\"", "path" => "/root/apache_logs", "httpversion" => "1.1", "auth" => "-", "timestamp" => "17/May/2015:10:05:03 +0000", "request" => "/presentations/logstash-monitorama-2013/images/kibana-search.png", "message" => "83.149.9.216 - - [17/May/2015:10:05:03 +0000] \"GET /presentations/logstash-monitorama-2013/images/kibana-search.png HTTP/1.1\" 200 203023 \"http://semicomplete.com/presentations/logstash-monitorama-2013/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36\"", "ident" => "-", "verb" => "GET", "response" => 200, "agent" => "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36\"", "@timestamp" => 2015-05-17T10:05:03.000Z, "bytes" => 203023, "clientip" => "83.149.9.216", "@version" => "1" }
You can see how the Logstash pipeline was able to parse an event and extract fields from it.
Not only did it extract the fields, but it also used a filter like the geoip to add extra information about the client IP address location.
We can stop the Logstash process by pressing Ctrl+C in the command prompt.
If we rerun the command, there will be nothing printed except the header information.
This is because the file input plugin keeps track of the current position within a file.
It does so using another hidden file called “sincedb”. By default, sincedb is located in the data folder of Logstash.
If the file input process fails halfway through reading the file, Logstash can use this information in sincedb to start from where it had left off.
On the other hand, if the file is fully processed, the plugin will know it does not have to do anything.
Also, by default, the file input plugin watches and reads the tail of a file.
This is because the file input plugin is typically used for reading live log files, where events are continuously added at the end.
In our case, we are using a static copy of an Apache log file, so we set start_position => "beginning" in the configuration.
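For comparison, a hedged sketch of a tail-mode configuration for a live log file might look like this; the log path and sincedb location are hypothetical:

input {
  file {
    path => "/var/log/httpd/access_log"                   # hypothetical live log file
    mode => "tail"                                        # the default mode: follow the file
    start_position => "end"                               # only read newly appended events
    sincedb_path => "/var/lib/logstash/sincedb_apache"    # assumed custom sincedb location
  }
}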
To run Logstash as a service, we can run this command:
# systemctl start logstash.service
Now let’s review Logstash alternatives.
Logstash vs. Fluentd
Logstash isn’t the only log collection and processing engine in the market – there are others that can do the same task. The most commonly mentioned alternative among these is Fluentd.
Fluentd is another open-source log processing pipeline.
Like Logstash, Fluentd can ingest data from many different sources, parse, analyze and transform the data, and push it to different destinations.
However, there are some differences between these two technologies.
Logstash is part of the popular Elastic stack – often dubbed the ELK stack – consisting of Elasticsearch, Logstash, and Kibana. Coming from the same vendor Elastic, these three tools have tight integration.
Fluentd is a project of “Cloud Native Computing Foundation” (CNCF).
Logstash is written in JRuby, the Java implementation of the Ruby programming language, whereas Fluentd is written in CRuby, the C implementation of Ruby.
Note: The Java dependency means Logstash needs a Java runtime available on its server.
Part of log data processing can often involve using a “route” – specifying where the pipeline should send its data if multiple output plugins are defined.
- In Logstash, this is done using a conditional check with an "if" statement: if a condition is met, Logstash sends the data to one destination; if the condition turns out to be false, the data goes somewhere else (a minimal sketch follows this list).
- Fluentd uses tags to route events to output plugins.
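Here is a minimal, hedged sketch of Logstash-style conditional routing; the loglevel field, Elasticsearch address, index name, and file path are all assumptions for illustration:

output {
  if [loglevel] == "ERROR" {
    elasticsearch {
      hosts => ["http://localhost:9200"]               # assumed local Elasticsearch
      index => "errors-%{+YYYY.MM.dd}"                 # hypothetical index for error events
    }
  } else {
    file {
      path => "/var/log/logstash/other_events.log"     # everything else goes to a file
    }
  }
}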
For high availability (HA):
- Logstash can use the Beats protocol to ensure both load balancing and resiliency of messages.
- Fluentd on the other hand uses both active-active and active-passive deployment architecture for both HA and scalability. Fluentd can forward events to any number of additional processing nodes.
Both log processing pipelines come with an extensive collection of plugins.
- The Logstash official GitHub repository offers over 200 plugins, all in one place.
- The official Fluentd GitHub repo hosts fewer plugins, although more than 500 additional plugins can be found in other repos.
Logstash and Beats
Initially, Logstash was solely responsible for both ingesting and processing data.
Unfortunately, this created performance bottlenecks for complex pipelines, so Beats was born to keep things simple.
Beats is a collection of lightweight, open-source tools that can collect logs from many different sources and forward those to either Logstash, or directly to Elasticsearch.
Being lightweight, these applications have small resource utilization footprints, and work like agents installed on a server.
There are some common types of Beats: Filebeat, which can ship log files from servers; Winlogbeat, which can collect Windows event logs; Metricbeat, which can collect server metrics; and Packetbeat, which can extract network-related data. Other Beats are developed by Elastic or by the user community.
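For reference, the Logstash side of a Beats integration is typically just the beats input plugin; the sketch below assumes the conventional port 5044 and a local Elasticsearch:

input {
  beats {
    port => 5044                             # port that Filebeat and other Beats ship to
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]       # assumed local Elasticsearch
  }
}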
PortX vs. Logstash
How Does XPLG's PortX Compare With Logstash?
XpoLog is an advanced platform for collecting, parsing, indexing, and analyzing data from modern, hybrid cloud networks.
Some of XpoLog's features include fast and efficient indexing, Artificial Intelligence (AI) powered anomaly and trend detection, real-time analytics, a leading apps marketplace with over 1,000 ready-to-use reports and dashboards, a monitoring system, and additional tools for investigating and managing log data – and the platform continues to evolve all the time.
On a higher level: XpoLog can be thought of as the ELK stack tightly knit together.
Unlike different server layers working separately, the collection, processing, storage, and user interface layers are all part of a single application in XpoLog.
XpoLog's interface is stunning and friendly to use, and the XpoLog platform is fully automated.
PortX
At the core of XpoLog is PortX, the engine that collects, parses, indexes and enriches log data before forwarding it to one or more destinations.
Available as a separate product, PortX is similar to Logstash in functionality, only much better.
- It’s capable of getting data from many different sources, including Beats and Logstash itself.
- The log collection is agent-less so there’s no need to set up and configure collection agents on source systems.
- Setting up a processing pipeline in PortX is 90% faster than it is in Logstash because there are no complex pipeline configurations to write.
By default, PortX will send its parsed and filtered data to XpoLog, but customers can opt to use any platform of their choice as a destination, including Logstash or Elasticsearch. With this approach, customers can keep their existing investment in the ELK stack.
In the image below, we have uploaded our Apache log file to XpoLog 7 running on the same server where we installed Logstash:
Immediately, PortX parses the known file format using one of its built-in parsing patterns.
Note: we didn’t have to write any grok patterns here.
Based on the automatic pattern, it also shows us the parsed fields in separate columns:
This automatic parsing feature of PortX can easily identify data patterns from many different applications and systems.
It's also possible to build data filtering logic with regular expressions using a log viewer and visual pattern builder.
In addition, there are multiple plugins that support automatic parsing for hundreds of systems, which makes the configuration process fast and efficient.
Optimize costs: Once the data is parsed, and indexed, it can be filtered and sent to one or more consumers like Logstash, Elasticsearch, Splunk, SIEM tools, or any other logging tool.
In a different architectural pattern: Logstash can be kept as a data collection and processing engine and XpoLog can be used as the indexing and user interface layer, replacing both Elasticsearch and Kibana.
Logstash output plugin
XpoLog has its own Logstash output plugin which is a Ruby application. Using this plugin, a Logstash instance can send data to XpoLog.
At the XpoLog end, a "listener" can receive the data and make it available for indexing, searching, and analyzing.
An XpoLog listener is the part of the application that can monitor incoming traffic over different protocols such as Syslog (using UDP or TCP), HTTP/S, XpoLog Agents, Cisco routers and switches, and Kafka topics. The Logstash output plugin sends its data over the HTTP/S or Syslog protocol to specific ports.
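The exact configuration of XpoLog's own output plugin is beyond the scope of this article, but as a rough, hedged sketch, Logstash's generic http output plugin can ship events to a listener in a similar way; the endpoint URL here is purely hypothetical:

output {
  http {
    url => "https://xpolog.example.com:8443/listener"   # hypothetical XpoLog listener endpoint
    http_method => "post"
    format => "json"                                    # send each event as a JSON document
  }
}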
Other areas where XpoLog shines compared to Logstash
XpoLog is a comprehensive platform to manage log data, monitor, investigate, and view insights out of the box. But why use XpoLog over the open-source Logstash?
Some of the features that should also be considered are:
- Advanced and efficient visual tools to parse data, and forward it in a structured manner as a whole or filtered.
- Built-in system health check and load balancing for optimal performance.
- Efficient data storage.
- Automatic tagging of data elements.
- Alerting and visualizations.
- Single Sign-On (SSO) facility with Active Directory or other identity providers.
- Built-in security and data masking.
Final Words
This was a very quick introduction to Logstash and how it works.
As we saw, the product is quite versatile, and allows parallel data processing from many different sources.
We also reviewed some Logstash alternatives such as Fluentd and XPLG's PortX and compared each option.
XpoLog is an ELK stack alternative that takes away much of the pain of upgrading, patching, load balancing, performance tuning, and security hardening of Logstash, or the ELK stack in general.
With XpoLog, you also don’t have to install extra plugins, write complex pipeline configurations, or think about interfacing with a separate storage and indexing layer.
Test all these benefits and more: