---
id: parsing-language-reference-guide
title: Parsing Language Reference Guide
sidebar_label: Parsing Language Reference
description: Parsing is the first step in the Cloud SIEM record processing pipeline.
slug: /help/docs/cse/schema/parsing-language-reference-guide/
canonical: https://www.sumologic.com/help/docs/cse/schema/parsing-language-reference-guide/
---

This topic describes the Cloud SIEM parsing language, which you can use to write custom parsers.

## What is parsing?

Parsing is the first step in the Cloud SIEM [record processing pipeline](/docs/cse/schema/record-processing-pipeline). It is the process of creating a set of key-value pairs that reflect all of the information in an incoming raw message. We refer to the result of the parsing process as a *field dictionary*. The raw message is retained.

Parsers are written in a specialized Sumo Logic Parsing Language. The parser code resides in a parser configuration object. At runtime, parser code is executed by the Sumo Logic parsing engine.

## Key concepts

This section explains a number of concepts that are fundamental to the parsing process.

## Regular expressions

A regular expression, often referred to as a regex, is a sequence of characters that defines a search pattern. A regular expression engine compares strings to regular expressions to find matches. Regexes can also be used to extract substrings and bind them to a name (known as a group) in a dictionary. Many Cloud SIEM parsers rely exclusively on regex to parse messages. (Sumo Logic Field Extraction Rules also use regex: they parse selected fields from log messages at the time of ingestion.)

Sumo Logic's parsing engine first performs top-level, gross format parsing using compiled built-in formats, and then relies on regular expressions to extract information from irregular or complex formats. The parser engine uses the [RE2 regular expression library](https://github.com/google/re2/wiki/Syntax).
This is important to know because regex syntax varies between implementations. RE2 is a slightly modified version of the standard regular expression libraries that is designed to operate with bounded execution time.

:::note
For historical reasons, the named groups in the regex of many parsers still use Python-style notation, for instance `(?P[^ ]+ +[^ ]+ [^ ]+)`. When you write new regular expressions, you can omit P.
:::

You can find a regex debugger at [https://www.debuggex.com](https://www.debuggex.com).

:::note
This debugger uses the GoLang RE2 library, but all RE2 libraries are based on the same codebase and it is a sufficient test mechanism.
:::

## Normalizing

Normalizing means mapping the initial field/value dictionary into a single schema, that is, one fixed set of field names and value formats. In general, our parsers are not intended to normalize log messages when parsing. Instead, the intent is to preserve the original naming and structure of the log messages as much as possible.

## Patterns

Patterns are predefined named regular expressions similar to [*Grok*](https://logz.io/blog/logstash-grok/); using them simplifies and speeds up the development of regex-based parsers.

Patterns are stored in `patterns.conf` as ` = ` key-value pairs, for example:

`IPV4 = \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}`

In parsers, you refer to a pattern as `%{}`. You can use a pattern anywhere that regex can be used.

You can assign patterns to a named capture group like this:

`%{:}`

For available patterns, see [Parsing Patterns](/docs/cse/schema/parsing-patterns).

## Mustache templates

We use the Mustache template system to define string templates. String templates are used to format one or more values into a single new field value. For more information on Mustache, see [https://en.wikipedia.org/wiki/Mustache_(template_system)](https://en.wikipedia.org/wiki/Mustache_(template_system)).

### Whitespace removal

By default, whitespace at the beginning and end of a message is removed before parsing.
Whitespace at the beginning and end of a parsed value is also removed. Use the [STRIP_WHITESPACE](#strip_whitespace) attribute to enable or disable whitespace removal.

### Implicit anchoring

Regular expressions are **always** anchored to the front of the string. Keep this in mind when constructing regexes. If an expression doesn’t target the beginning of the string, or the anchoring isn’t compensated for, the expression will fail. Applying a caret at the beginning of the expression is accepted, but essentially ignored. If you add an end anchor to a regex, the regex will be flagged as illegal.

### Initial parsing based on FORMAT attribute

Each stanza can define a FORMAT attribute for the message or string it is parsing. This results in a gross parse, populating the field dictionary in an appropriate way.

### REGEX parsing

This is the default value of the [FORMAT](#format) attribute. With this setting, the message is parsed using the regex defined by the [REGEX](#regex) attribute. A stanza that contains `FORMAT = REGEX` must also contain a REGEX attribute; otherwise, it will perform no parsing.

Capture group names in the regex you define with the [REGEX](#regex) attribute can contain any character except close brackets. This includes spaces; however, using spaces in capture group names is not recommended unless there is a very good reason for doing so. For example:

`REGEX = (?P.*)`

`REGEX = (?.*)`

Although Java supports backtracking and possessive sequences, their use is discouraged in parsers because they are extremely inefficient.

`REGEX = (?P.++)`

### JSON parsing

JSON is parsed and flattened. Fields of sub-objects are prepended with the containing field name and separated with periods.
For example,

| This JSON | Results in |
|:--|:--|
| `{"foo": {"bar": 2, "barrier": 3}, "baz": 4}` | `foo.bar = 2`, `foo.barrier = 3`, `baz = 4` |

List items have a one-based index number inserted between the containing field name and the sub-object field names. For example,

| This JSON | Results in |
|:--|:--|
| `{"foo": [{"bar": 1, "baz": 2}, {"bar": 3, "baz": 4}]}` | `foo.1.bar = 1`, `foo.1.baz = 2`, `foo.2.bar = 3`, `foo.2.baz = 4` |

By default, an index number is inserted, even in a single-element list. For example,

| This JSON | Results in |
|:--|:--|
| `{"test": [{"field":"value"}]}` | `test.1.field: value` |

However, if you set the JSON_FLATTEN_SINGLE_LISTS flag to true, an index value *is not* inserted in a single-element list. This is useful for collapsing redundant JSON elements from sources like AWS. When JSON_FLATTEN_SINGLE_LISTS is true:

| This JSON | Results in |
|:--|:--|
| `{"test": [{"field":"value"}]}` | `test.field: value` |

### CSV parsing

Parses delimiter-separated values. The default delimiter is a comma. You can set another delimiter character using the [FIELD_DELIMS](#field_delims) attribute.

### XML parsing

Parses and flattens XML.

### CEF parsing

Parses CEF format messages. In the parsing process, we unpack custom fields in a CEF message. Each CEF custom field is held in two fields: one holding the field name and another holding the value. Our CEF parsing creates a single new field whose name and value come from that pair, and discards the original two fields.

### LEEF parsing

Parses LEEF format logs.

### WINDOWS_XML parsing

Parses Windows XML messages from Cloud SIEM Windows Sensor.
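
The JSON flattening rules described under JSON parsing above can be sketched in Python. This is an illustration only, not the parsing engine's implementation: the function name `flatten_json` and its `flatten_single_lists` parameter are hypothetical stand-ins for the behavior controlled by JSON_FLATTEN_SINGLE_LISTS, and the handling of empty objects and lists (they produce no fields) is an assumption.

```python
import json
from typing import Any, Dict


def flatten_json(value: Any, prefix: str = "",
                 flatten_single_lists: bool = False) -> Dict[str, Any]:
    """Flatten nested JSON data into a dict keyed by dotted field names."""
    fields: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            # Sub-object fields are prefixed with the containing field name.
            name = f"{prefix}.{key}" if prefix else key
            fields.update(flatten_json(child, name, flatten_single_lists))
    elif isinstance(value, list):
        if flatten_single_lists and len(value) == 1:
            # Hypothetical equivalent of JSON_FLATTEN_SINGLE_LISTS = true:
            # no index is inserted for a single-element list.
            fields.update(flatten_json(value[0], prefix, flatten_single_lists))
        else:
            # List items get a one-based index after the containing name.
            for index, child in enumerate(value, start=1):
                name = f"{prefix}.{index}" if prefix else str(index)
                fields.update(flatten_json(child, name, flatten_single_lists))
    else:
        # Scalar leaf: store it under the accumulated dotted name.
        fields[prefix] = value
    return fields


message = json.loads('{"test": [{"field": "value"}]}')
print(flatten_json(message))                             # {'test.1.field': 'value'}
print(flatten_json(message, flatten_single_lists=True))  # {'test.field': 'value'}
```

The two calls at the end mirror the last two tables in the JSON parsing section: the same single-element list yields `test.1.field` by default and `test.field` when single lists are collapsed.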
## Mapping hints

After parsing, the next step in the Cloud SIEM record processing pipeline is log mapping, which is the process of mapping fields that were parsed out of messages to Cloud SIEM schema attributes.

Every parser must provide *mapping hints* that give Cloud SIEM the information it needs to select the correct log mapper for parsed messages. You do this with the MAPPER attribute. For more information, see [MAPPER](#mapper).

### Internal temporary variables supported in parsers

#### _$log_entry

At the start of parser execution, `_$log_entry` contains the value of the entire message being parsed. Within a transform stanza, `_$log_entry` represents the value being processed by the transform. When you are applying a transform to a field, you can use `_$log_entry` to refer to the value of the current parsed field.

#### _$log_entry_field

The field that the parser is transforming. Because temporary fields aren’t stripped from field dictionaries until after all parsing is complete, the value of `_$log_entry_field` is overwritten each time a transform is applied to a field.

#### Excluding variables from field dictionary

You can declare your own variables in a parser. To ensure that a variable is not included in the field dictionary that results from the parsing process, prefix the variable name with `_$`, for example:

`_$my_variable`

## Parsing fields

Messages are parsed to create a dictionary of field values, a start time, and an end time.

When choosing a field name, avoid using non-alphanumeric characters unless that goes against conventional practice or a well-known name. For instance, in the PAN-firewall parser there is a field named `X-Forwarded-For`. That name was taken from the well-known protocol header. Any other name would not be as easily recognized.
But, whenever possible, it’s preferable to stick with alphanumeric names so that they won’t need quoting when they are used in Sumo Logic Platform features, such as Sumo Logic core platform log and metric queries, action templates, and dashboards.

Field names beginning with `_$` (underscore followed by the dollar sign) aren’t saved in the field dictionary, but can be used to pass values from one part of the parsing process to another (from a parser to a transform, for instance).

:::note
The key principle: When selecting a name for the field, stay as close as possible to the name that is well known in the industry for the corresponding source.
:::

### Timestamps and time handling

The `_starttime` and `_endtime` fields are normally assigned values using [START_TIME_FIELD](#start_time_field) and [END_TIME_FIELD](#end_time_field).

Note that if none of [DEFAULT_START_TIME](#default_start_time), [DEFAULT_END_TIME](#default_end_time), START_TIME_FIELD, or END_TIME_FIELD is defined, `_starttime` and `_endtime` will not be included in the field dictionary.

If `_starttime` is defined (at minimum, `START_TIME_FIELD` has been specified in the parser), it will be used as the record timestamp. If `_starttime` is not defined, the timestamp should be set by the Cloud SIEM log mapper that processes the record, typically by mapping a parsed field to the `timestamp` schema attribute.

### Representation of “no value”

The representation of no value, or of a field that doesn’t exist, is ‘None’ when evaluating variable transforms. JSON uses “null” if [JSON_DROP_NULLS](#json_drop_nulls) is set to false or not present, and drops null values if it is set to true.

## Stanzas

Parser definitions are organized into stanzas. A stanza consists of a type declaration (a keyword and a name) followed by a series of attributes that function much like commands in a scripting language, except that each command is uniquely keyed.
There are three types of stanzas:

* **parser**: Defines the entry point for the overall parser and contains attributes that control the overall execution of the parser. A parser contains one and only one parser stanza. The syntax for declaring a parser stanza is: `[parser]` `parser` is the only stanza keyword that can appear only once in a parser definition.
* **transform**: A transform stanza is analogous to a function in most scripting languages. Transforms can be invoked on a log message as a whole, with all currently parsed fields accessible within the new transform, or on strings that have been parsed from a message, without the currently parsed fields. You can use transforms to extract information of interest using regex patterns, assign values to variables, drop fields, rename fields, populate time fields, create mapping hints, and more. One transform can even call another. You can use transforms to perform a wide variety of parse actions; the most common use is extracting a value from a log message. The syntax for declaring a transform stanza is: `[transform:]`
* **dependencies**: You can use a dependencies stanza to include resources from another parser, using the [INCLUDE](#include) attribute. The syntax for declaring a dependencies stanza is: `[dependencies]`

Stanza types must be lower case. It is recommended but not required that transform names be lower case. For example, `[transform:]`

References to transform names in attributes are case sensitive. The case in the reference must match the case used in the transform name. Transform names are limited to alphanumeric characters, the dash (-) and the underscore (_).

### Specifying attributes

* All attribute names must be uppercase.
* Attribute names are limited to alphanumeric characters, the dash (-) and the underscore (_).
* All attributes that take assignments must use an equal sign (’=’) between name and assignment.
For example, `FORMAT = REGEX`

### Attribute overriding

Attributes with the same key override each other. For example, given:

`TRANSFORM = Cylance_Parse`

`TRANSFORM = Cylance_Factor`

We apply only the second [TRANSFORM](#transform) attribute.

You can add labels to duplicate transforms to avoid overriding. A label is text appended to the attribute name, separated from it by a dash. A label can be any string that doesn’t contain an equals sign. For example,

`TRANSFORM-parse = Cylance_Parse`

`TRANSFORM-factor = Cylance_Factor`

With the labels added, we’ll apply both TRANSFORM attributes.

### “r\|” Syntax

With certain attributes, you can apply `r|` syntax in place of an explicit field name. The attribute is applied to all fields of the field dictionary whose names match the regex following the `r|`. For example:

`DROP:r|^\d+$`

This removes from the field dictionary all fields whose names are numbers.

You can apply `r|` syntax to these attributes:

* TRANSFORM
* FIELD_TYPE
* JOIN_LIST
* DROP

### Field binding

Attempts to access the value of a field created by parsing must follow the parsing. Attempts to access the value of a field that has not been set will produce an error.

### Includes

You can include resources from another parser using a \[dependencies\] stanza. In that stanza only, you can add

`INCLUDE:/Parsers/path/to/parser = true`

The specified resource’s transforms will be available to the current parser.

## Attributes used in all stanza types

### ADD_VALUES

If true, when parsing produces a value for the same field more than once, append the second and subsequent values to the field. If false, replace the value of the field. This is applied only from the `[parser]` stanza or a transform on a field, but it also applies to any other transforms the field dictionary is passed to.
For example, if parsing produces two values for `fielda`, “monkey” and “business”, the value of `fielda` will be set to \["monkey", "business"\].

**Syntax**

`ADD_VALUES = `

**Default**

false

**Example**

`ADD_VALUES = true`

### ALIAS

Creates a read-only reference between `alias_field_name` and `old_field_name`.

**Syntax**

`ALIAS: = `

**Default**

None

**Notes**

* `` and `` are required.
* If the value of `` is None, the alias will not be created.

### CASE

If `` from the [CASE_SWITCH](#case_switch) attribute equals `matched_value`, sets field to `set_value`.

**Syntax**

`CASE: = `

**Example**

Assume an incoming message contains a ‘severity’ field that stores severity as one of three words: high, medium, and low, but we want to store a normalized severity value as an integer ranging from 0 to 9. We might use a `CASE_SWITCH` statement paired with a list of `CASE` statements to perform the mapping.

```
CASE_SWITCH:normalized_severity = severity
CASE:High = 9
CASE:Medium = 4
CASE:Low = 0
```

### CASE_SWITCH

Sets `` to the value specified in the CASE statement if `` is set to the value specified there.

**Syntax**

`CASE_SWITCH: = `

### CLEAR

Clears the field value of the fields whose values match the specified regex.

**Syntax**

`CLEAR: = `

`` and `` are required.

### COPY_FIELD

Copies the value of one field to another field.

**Syntax**

`COPY_FIELD: = `

`` and `` are required.

**Default**

None

### DEFAULT_END_TIME

The value is used in various ways depending on the [END_TIME_HANDLING](#end_time_handling) attribute.

**Syntax**

`DEFAULT_END_TIME =