The “search pipeline” refers to the structure of a Splunk search, which consists of a series of commands delimited by the pipe character (|). The pipe character passes the results of one command as input to the next, chaining SPL commands together.
Generally, a search consists of commands piped into one another to reduce and shape the results into what we want.
A Splunk search starts with search terms at the beginning of the pipeline. These search terms are keywords, phrases, boolean expressions, key/value pairs, etc. that specify which events you want to retrieve from the index(es).
The retrieved events can then be passed as input to a search command using a pipe character, which transforms them into the results that you need. At the beginning of a search pipeline, the search command is implied, even when you don’t explicitly state it. So if you type host="localhost" on its own, it is run as search host="localhost".
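As a minimal sketch of the pipeline described above, the search below starts with search terms (with the search command implied) and pipes the retrieved events into a command. The index name `main` and the field `sourcetype` are only placeholders for illustration.

```spl
index=main host="localhost"
| stats count by sourcetype
```

Writing `search index=main host="localhost"` as the first line would behave the same way, since the leading search command is implied.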
Events and results flowing through the search pipeline exist as a collection of fields, which fundamentally come from the data. The fields contain value strings relevant to specific events in the data and can be used alongside search commands to filter the data. Fields can come from the index or from a wide range of sources at search time, such as tags, regex extractions, event types, etc. For a given event, a field name might be present or absent; if present, it might contain a single string value or multiple string values.
Certain important fields are index, _time, host, source, and _raw.
Some notable terms for describing fields are:
Null: A field that is not present on a particular result or event. Other events or results in the same search might have values for this field.
Empty Field: A field that contains a single value that is the empty string.
Empty value: A value that is the empty string, or “”. You can also describe this as a zero-length string.
Multivalue Fields: A field that has more than one value. All non-null fields contain an ordered list of strings. In the common case, this is a list of one value. When the list contains more than one entry, it is a multivalue field.
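The difference between these field states can be checked with a small self-contained search. This is only a sketch: it uses `makeresults` to generate a single test event, and the field names (`empty_field`, `multi_field`, `missing_field`) are made up for the example.

```spl
| makeresults
| eval empty_field=""
| eval multi_field=split("a,b,c", ",")
| eval missing_is_null=if(isnull(missing_field), "yes", "no")
| eval empty_is_null=if(isnull(empty_field), "yes", "no")
| eval empty_length=len(empty_field)
| eval multi_count=mvcount(multi_field)
```

Here `missing_field` is never set, so it is null, while `empty_field` exists but holds a zero-length string. `multi_field` holds three values, so `mvcount` should report 3.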
Clauses help group, rename, or filter fields to format the results. Some common clauses are the “BY” clause, which groups the results by a certain field, the “AS” clause, which renames a field, and the “WHERE” clause, which filters results.
The Boolean operators “AND” and “OR” are also useful for filtering results. They are generally used with search terms to specify which terms must be matched. If no operator is provided between search terms, “AND” is implied.
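A short sketch of these clauses and operators is shown here. It generates its own sample events with `makeresults`, so the `host` and `bytes` values are invented for illustration only.

```spl
| makeresults count=6
| streamstats count AS row
| eval host=if(row % 2 == 0, "web01", "web02")
| eval bytes=row * 100
| search (host="web01" OR host="web02") bytes>100
| stats avg(bytes) AS avg_bytes BY host
```

In the `search` line there is no operator between the parenthesized group and `bytes>100`, so the two are joined by an implied AND. The `AS` clause renames the output of `avg(bytes)`, and the `BY` clause produces one row per host.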
A subsearch runs its own search and returns the results to the parent command as the argument value. The subsearch is contained in square brackets and runs before the command that contains it. This type of search is generally used when you need to access more data or combine two different searches.
An example of a subsearch in a command is:
| union [search index=a | eval type = "foo"] [search index=b | eval mytype = "bar"]
Examples of each of the above components are:
Search Terms: index="access_combined", index="main"
Clause: OR, by
Functions: avg()
Commands: stats, dedup, head
Argument: keepevents=true
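To see how these components fit together, here is a hypothetical search assembled from the items listed above. It is only an illustration and not the original example; the field names `clientip`, `bytes`, and `host` are assumptions.

```spl
index="access_combined" OR index="main"
| dedup clientip keepevents=true
| head 1000
| stats avg(bytes) by host
```

The first line holds the search terms joined by the OR operator. `dedup`, `head`, and `stats` are the commands, `keepevents=true` is an argument to `dedup`, and `avg()` is a function grouped with the `by` clause.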
A distributable streaming command is a command that runs on the indexer or the search head, depending on where in the search the command is invoked. This allows the command to run on subsets of indexed data in parallel, greatly speeding up its execution. Examples of distributable streaming commands include: convert, eval, fields, regex, and rename.
A centralized streaming command applies a transformation to each event returned by a search, and runs only on the search head. Unlike a distributable streaming command, it cannot run on indexers, so it benefits less from parallelization.
Examples of centralized streaming commands include: dedup, head, join, and transaction.
A transforming command orders the results into a data table. These commands transform the values for each event into numerical values that Splunk software can use for statistical purposes. They are required to turn search result data into the data structures needed for visualizations such as charts and tables.
Examples of transforming commands include: chart, timechart, stats, top, and rare.
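As a rough sketch of a transforming command, the search below builds ten sample events and turns them into a summary table with `top`. The `status` values are invented test data generated by `makeresults`.

```spl
| makeresults count=10
| streamstats count AS row
| eval status=case(row <= 6, "200", row <= 9, "404", true(), "500")
| top limit=2 status
```

The individual events are replaced by a table with one row per status value, along with `count` and `percent` columns. With this sample data, status 200 should account for 6 of the 10 events and status 404 for 3.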
A generating command is a command that generates data from the indexers, without any prior transformations. Generating commands don’t expect or require an input, and are usually invoked at the beginning of the search with a leading pipe. That is, no command can be piped into a generating command. They are either event-generating (distributable or centralized) or report-generating. Depending on the command used, the results are returned as a list or a table.
Examples of generating commands include: dbinspect, datamodel, inputcsv, metadata, pivot, and search
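A minimal sketch of a generating command at the start of a search, with its leading pipe, is shown here. It assumes the `_internal` index is readable by your role.

```spl
| metadata type=hosts index=_internal
| sort - totalCount
| head 5
```

Nothing is piped into `metadata`; it produces its own rows (one per host), which the later commands then sort and trim.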
An orchestrating command is one that does not directly affect the end result of the search but controls some aspects of how the search is processed. Orchestrating commands are generally used to help optimize the search so that the search completes faster.
Examples of orchestrating commands include: redistribute, noop, and localop.
A dataset processing command is one that requires the entire dataset before the command can run. These commands are not transforming, not distributable, not streaming, and not orchestrating.
Examples of dataset processing commands include: sort, eventstats, and some modes of cluster, dedup, and fillnull.
There are two ways that commands can ingest data: either streaming the data or waiting for the data to be fully available before operating on it. Commands are accordingly organized into two categories, streaming search commands and non-streaming search commands.
Streaming search commands operate on each event as it comes in; for each input event, the command produces one or no output events. This type of command can run on indexers and be applied to subsets of index data in parallel, as long as it’s not preceded by a non-streaming search command.
Non-streaming search commands run on the search head and require that all of the events be gathered from the indexers before running. An example of a non-streaming search command is the “sort” command, which requires all of the data to be retrieved before it can be sorted correctly.
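One practical consequence is to place streaming commands ahead of non-streaming ones where possible. The sketch below assumes an index named `main`, an `access_combined` sourcetype, and a numeric `bytes` field, all of which are placeholders.

```spl
index=main sourcetype=access_combined
| eval kb=round(bytes/1024, 2)
| fields host, kb
| sort - kb
| head 10
```

Here `eval` and `fields` are streaming and come before `sort`, so they can be applied to events as they arrive. `sort` is non-streaming, so everything from that point on waits for the full result set.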
Knowing which goal you want your search to accomplish can help you optimize searches.
For searches that retrieve raw events from an index, no additional processing of the events is done before they are retrieved, so being as specific as we can speeds up the search. You could do this with keywords and field-value pairs that are unique to the events. When the events you want to retrieve occur frequently in the dataset, the search is referred to as a dense search; when they are rare, it is known as a sparse search. Sparse searches that run against large volumes of data take longer than dense searches, since it takes longer to find those events.
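As an illustration of being specific, a search along the following lines narrows retrieval by index, sourcetype, host, field value, and time range. Every name and value here (`web`, `access_combined`, `web01`, `status`, `uri_path`) is a hypothetical placeholder.

```spl
index=web sourcetype=access_combined host="web01" status=503 earliest=-15m
| fields _time, host, uri_path, status
```

A broader search such as `index=* 503` could match the same events, but it would have to scan far more data to find them.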
When running a search that generates a report that summarizes or organizes data, it would be best to be more restrictive and specific when retrieving data, since the data is going to be stored and processed within memory.
Command | Description |
dedup | Removes duplicate results that match certain criteria |
eval | Calculates an expression and stores the result in a field |
fields | Keeps or removes fields from search results, based on the fields we specify |
head/tail | Returns the top/bottom N results |
lookup | Adds field values from an external source such as a lookup table |
chart/timechart | Returns results in a tabular format that can be displayed as a chart, such as a time chart or bar chart |
rename | Renames a field; use wildcards to rename multiple fields |
rex | Specifies regular expression named groups to extract fields from results |
search | Filters results to those that match the search expression |
sort | Sorts the results by the specified field. Can be ascending or descending |
stats | Provides statistics, optionally grouped by fields |
top/rare | Displays the most/least common values in a field. Can be useful for grouping |
where | Filters search results using eval expressions. Can be used to compare two different fields |
table | Specifies fields to keep in the result set, and retains data in a tabular format |
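Several of the commands in the table can be chained in one search, as in the following sketch. It builds its own sample events with `makeresults`, so the `user` and `bytes` values are invented for the example.

```spl
| makeresults count=4
| streamstats count AS row
| eval _raw="user=user" . row . " bytes=" . (row * 512)
| rex field=_raw "user=(?<user>\w+) bytes=(?<bytes>\d+)"
| where tonumber(bytes) > 512
| rename bytes AS bytes_sent
| sort - bytes_sent
| table user, bytes_sent
```

`rex` extracts `user` and `bytes` with named groups, `where` filters with an eval expression, and `rename`, `sort`, and `table` shape the final output. With this sample data, the result should list user4, user3, and user2 in descending order of `bytes_sent`.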
Function | Description |
abs(x) | Returns absolute value of x |
case(x,"y",…) | Takes pairs of arguments X and Y, where the X arguments are Boolean expressions. Returns the Y argument corresponding to the first X that evaluates to TRUE |
ceil(x) | Ceiling of number x |
cos(x) | Cosine of x |
exact(x) | Evaluates an expression X using double precision floating point arithmetic, e.g. exact(3.14*num) |
exp(x) | Returns e^X |
if(x,y,z) | If X evaluates to TRUE, the result is the second argument Y. If X evaluates to FALSE, the result evaluates to the third argument Z |
isbool(x) | Returns true if X is a boolean |
isint(x) | Returns true if X is an integer |
isnull(x) | Returns true if X is null |
isstr(x) | Returns true if X is a string |
len(x) | Returns the character length of X |
log(x,y) | Returns the logarithm of X using base Y. Y is optional and defaults to 10 |
match(x,y) | Returns TRUE if X matches the regex pattern Y |
max(x, y, …) | Returns maximum |
min(x, y, …) | Returns minimum |
md5(x) | Returns the MD5 hash of a string value X. |
mvcount(x) | Returns the number of values of X |
now() | Returns the current time, represented in Unix time |
null() | Returns null |
random() | Returns a random number from 0 to 2147483647 |
replace(x,y,z) | Returns a string formed by substituting string Z for every occurrence of regex string Y in string X |
round(x,y) | Returns X rounded to the amount of decimal places specified by Y. The default is to round to an integer |
split(x,"y") | Returns X as a multivalue field, split by delimiter Y |
time() | Returns the wall-clock time with microsecond resolution |
sqrt(x) | Returns the square root of X |
tonumber(x,y) | Converts input string X to a number, where Y (optional, defaults to 10) defines the base of the number in X |
tostring(x,y) | Returns the field value of X as a string. If X is a number, it is reformatted as a string. If X is a Boolean value, it is reformatted as "True" or "False". If X is a number, the optional second argument Y can be "hex", "commas", or "duration" |
typeof(x) | Returns a string representation of the field type |
urldecode(x) | Returns the URL X decoded |
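A few of these functions can be tried out on a single generated event, as sketched here. All field names are made up for the example, and `makeresults` supplies the test event.

```spl
| makeresults
| eval n=-3.14159
| eval abs_n=abs(n)
| eval rounded=round(abs_n, 2)
| eval ceiling=ceil(abs_n)
| eval size=case(abs_n < 1, "small", abs_n < 10, "medium", true(), "large")
| eval parity=if(ceiling % 2 == 0, "even", "odd")
| eval parts=split("a,b,c", ",")
| eval part_count=mvcount(parts)
| eval cleaned=replace("2024-01-15", "-", "/")
| eval parsed=tonumber("ff", 16)
```

With these inputs, `rounded` should be 3.14, `ceiling` 4, `part_count` 3, and `parsed` 255. The final `true()` condition in `case` acts as a default branch.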
Stats Function | Description |
avg(x) | Returns the average of the values in X |
count(x) | Returns the number of occurrences of the field X |
dc(x) | Returns the count of distinct values in X |
earliest(x) | Returns the earliest seen value of X |
latest(x) | Returns the latest seen value of X |
max(x) | Returns the max value within field X. If the values of X are non-numeric, the max is found from lexicographical ordering |
min(x) | Returns the min value within field X. If the values of X are non-numeric, the min is found from lexicographical ordering |
median(x) | Returns the middle-most value of field X |
mode(x) | Returns the most frequent value of field X |
perc<x>(y) | Returns the X-th percentile value of the field Y |
range(x) | Returns the difference between the max and min values of the field X |
stdev(x) | Returns the sample standard deviation of the field X |
sum(x) | Returns the sum of the values of the field X |
sumsq(x) | Returns the sum of the squares of the values of the field X |
values(x) | Returns the list of all distinct values of the field X as a multivalue entry. The order of the values is lexicographical |
var(x) | Returns the sample variance of the field X |
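The following sketch applies several of these stats functions to generated sample data, grouped by a field. The `host` and `bytes` values are invented for illustration.

```spl
| makeresults count=5
| streamstats count AS row
| eval host=if(row <= 3, "web01", "web02")
| eval bytes=row * 100
| stats count AS events, sum(bytes) AS total, avg(bytes) AS mean, max(bytes) AS largest, range(bytes) AS spread, values(bytes) AS seen BY host
```

With this data, web01 covers the values 100, 200, and 300, so it should show 3 events, a total of 600, and a mean of 200. web02 covers 400 and 500, giving 2 events, a total of 900, and a mean of 450.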
Sean Malloy is working as an Automation Engineer at Crest Data Systems. Sean has worked on multiple automation and 508 Compliance projects for Splunk. Before joining Crest, Sean worked as an intern twice at SAP and has led multiple projects as part of his internship for Machine Learning and web development. Sean holds a Bachelor’s degree from UC Davis.