Advanced Search and Replace Syntax

Advanced Search, without Replace, is available from the List of Found search (key F2). Search and Replace is on key Shift-F4.

For Advanced Search (once the Advanced option has been selected), parts of a search string are interspersed with symbols representing various types of wildcard. All the parts of the search string must be in double inverted commas ("). Anything else, not in double inverted commas, is interpreted as an element to be matched. Of course there is a way (escaping) of including a double inverted comma (or any other special elemental character) inside the parts of the search string.

Search syntax is different from Replace syntax, but both share much of the same syntax - so we will use coloured backgrounds to indicate which the section applies to, thus:
Search Both Replace
The concepts in Advanced Search and Replace are not difficult - but it can get very confusing working out how to express what you are searching for!

Working out Search and Replace expressions

Probably the easiest way of doing this, if the expressions are at all complicated, is to work out the Search and Replace expressions separately in a StrongED page.

Once you have a pair of expressions worked out you can combine the clipboard with the Copy key (f7) to transfer both to the dialogue box. You need !IcnClipBrd (which is available from !Store) to do this.

When working out complicated Search expressions it can also be helpful to use the Advanced List of Found F2 to check that your expression catches exactly what you want it to catch.

Slightly more complicated, but may be easier to remember, and doesn't need !IcnClipBrd, is a method using F7 only.

Patterns

If you use some search and replace patterns repeatedly, these can be stored by StrongED and then called by name. A separate page covers Patterns as you probably won't want to use this feature until you have learnt more.

New Lines

StrongED can switch between four different Newline codes:
LF      RISC OS, Linux
CR      BBC, Spectrum
LF+CR
CR+LF   Windows

Whichever is in use at the time in the document concerned is matched by NL, by $ or by \n.

The newline type to match by these characters is decided during the search so it's perfectly fine to search multiple texts which have different newline types.

For example searching for "foo" $ "bar" will match wherever a line ends with 'foo' and where the next starts with 'bar', regardless of the newline type of the text(s) being searched.
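As a rough illustration of the idea (this is Python, not StrongED syntax, and the names NEWLINES and find_foo_nl_bar are invented for this sketch), the search behaves as if the newline code of the text being searched were substituted for $:

```python
# The four newline codes listed above, keyed by labels chosen for this sketch.
NEWLINES = {"LF": "\n", "CR": "\r", "LF+CR": "\n\r", "CR+LF": "\r\n"}


def find_foo_nl_bar(text, newline_type):
    """Return the offset where 'foo', a newline, then 'bar' occurs, or -1."""
    newline = NEWLINES[newline_type]
    # The newline code is chosen per text, so the same search works on any of them.
    return text.find("foo" + newline + "bar")
```

The newline type is passed in explicitly here; how StrongED itself records the type of each text is not modelled.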

"  " - Strings

Anything inside double quotation marks is a search string and will be interpreted as it is.

If you want to search for a string which contains " then that character must be escaped ("magiced") thus: \". Similarly, as \ is part of a "magic" character, if you want to search for a string that contains \ you must use \\.
Thus, for example, to replace \\"" by ""\\ and vice versa, you could use

Search "\\\\\"\""
Replace "\"\"\\\\"

Of course that is a case where a simple search would do an easier job.

A few more examples:

"01234567"   Matches "0" followed by "1" followed by "2" etc.
"string\t"   Matches "string" followed by a Tab character
"\""         Matches the " (double quote) character
"\\"         Matches the \ (backslash) character
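To check an escaped string by hand, it can help to decode it mechanically. The following Python sketch (not part of StrongED; ESCAPES and decode_search_string are names made up here) handles only the tab, double quote and backslash escapes and rejects anything else:

```python
# Escapes covered by this sketch: tab, double quote and backslash.
ESCAPES = {"t": "\t", '"': '"', "\\": "\\"}


def decode_search_string(body):
    """Decode the text found between the outer double quotes of a search string."""
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\":
            # A backslash must be followed by one of the escapes handled here.
            if i + 1 >= len(body) or body[i + 1] not in ESCAPES:
                raise ValueError("escape not covered by this sketch")
            out.append(ESCAPES[body[i + 1]])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)
```

Feeding it the body of the Search string from the earlier example gives two backslashes followed by two double quotes, which is the text that example set out to find.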

There are a number of other "magic" characters which are listed in the section on shorthands
For an example, see the section on Markers.

There are a number of other escaped characters that can be used in Search strings (but not in Replace):

The escape character has many more uses. See Character Sets

@ - Markers

Special symbols (@1 through @8) may be inserted in the search string so that parts of the search string can be inserted into the replace string. Mark @0 is already set to the start of the search string and @9 is set to the end - though both @0 and @9 can be reset if necessary. A similar use of @ markers is in Shortcuts (macro expansions) in Modefiles.

To show how these work, let's take a piece of text...

To be, or not to be, that is the question:

...and apply this Advanced Search string to it...

"To be,"@1" or not to be,"@2" that is"@3" the question:"

...with this Advanced Replace string:

"@01="@01$"@02="@02$"@23="@23$"@09="@09$

...which gives this result:

@01=To be,
@02=To be, or not to be,
@23= that is
@09=To be, or not to be, that is the question:

@09 tends to be the most used, so there is a shorthand for this - @@.

Notice that the ranges above include everything between the @ markers, including leading and trailing spaces. I also used the $ sign in the Replace expression, outside the inverted commas; this causes NewLine character(s) to be inserted.

You may also have noticed that the Search and, less so, Replace strings can get impossibly long. See the hint on Working out Search and Replace expressions.

It may seem that the 2 digit @nm elements are only for use in the replace expression. However they can also be used in the searches - after the single digit @n @m markers have been set: this is explained in Back References.
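The way the @nm ranges are derived from the marker positions can be modelled in a few lines of Python. This is only a sketch under the assumptions stated in the comments; mark_ranges and between are invented names, and the search is represented as a simple list of literal strings and marker numbers:

```python
def mark_ranges(pieces):
    """pieces alternates literal strings and marker numbers (ints).

    Returns the matched text and a dict mapping marker number to offset.
    """
    text = ""
    positions = {0: 0}  # @0 defaults to the start of the match
    for piece in pieces:
        if isinstance(piece, int):
            positions[piece] = len(text)  # a marker records the current offset
        else:
            text += piece
    positions.setdefault(9, len(text))  # @9 defaults to the end of the match
    return text, positions


def between(text, positions, n, m):
    """Text between marker n and marker m, as @nm would insert it."""
    return text[positions[n]:positions[m]]
```

With the pieces "To be,", 1, " or not to be,", 2, " that is", 3, " the question:", between(..., 2, 3) returns " that is", leading space included.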

Search forward - * in line.
Search forward - ** in text.

A single asterisk * searches forward in the line until the next element of the search expression is found.

A double asterisk ** searches forward in the text until the next element of the search expression is found.

If you have read the page How a search is performed you can understand that what the * wildcard does is to advance the pointer Em to the next element (in this case e5) then to advance the pointer Tm until either the end of line or the element e5 matches. If that happens, the search is continued. But if an end of line occurs first, the match fails.

** does the same thing but does not abort at the end of line.
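The difference between the two can be sketched in Python. This is a simplified model, not StrongED's implementation: search_forward is an invented name, the next element is assumed to be a plain string, and the text is assumed to use LF newlines.

```python
def search_forward(text, start, target, whole_text=False):
    """Offset of the next target at or after start, or -1 if the search fails.

    whole_text=False models '*' (stop at end of line);
    whole_text=True models '**' (carry on through the text).
    """
    limit = len(text)
    if not whole_text:
        newline_at = text.find("\n", start)
        if newline_at != -1:
            limit = newline_at  # '*' gives up at the end of the current line
    return text.find(target, start, limit)
```

For the two-line text "alpha beta" / "gamma delta", a search for "gamma" from the start fails in '*' mode but succeeds in '**' mode.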

cnt - Counter


In the Search/Replace dialogue box is a couNt which can be used in the Advanced mode by putting cnt in the Replace expression. Every time this cnt element is used in the replace process the present cnt value is used and the counter is advanced by one (or whatever increment value you have put into the Step box).
The choices in the dialogue box are:

A real life example is available.
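The counter's behaviour can be pictured with a small Python model. It is only an analogy: replace_with_counter is an invented name, the pattern is a Python regular expression rather than a StrongED expression, and the start value and the decimal formatting of the count are assumptions of this sketch.

```python
import re


def replace_with_counter(text, pattern, start=1, step=1):
    """Replace each match of pattern with the current count, then advance it."""
    counter = start

    def substitute(match):
        nonlocal counter
        value = counter
        counter += step  # advanced once per replacement, like the Step box
        return str(value)

    return re.sub(pattern, substitute, text)
```

Replacing each "#" in "# one", "# two", "# three" with the default settings numbers the lines 1, 2, 3; with start=10 and step=10 they become 10, 20, 30.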

Characters and Strings without Quotes and Sets
'   ' - Character sets
Character shorthands

Anything within single quotes is a character set. This will match if the character in the text is any one of the characters in the set. The character set can contain single characters and ranges. Thus you can define your own sets; a simple example of a character set is available.

However StrongED has a number of pre-defined Character sets which probably cover everything most users will ever need.

Some of these sets can be called by a single Character. Most have Shorthands. Some are Named. You can therefore use any of the first four columns below in the Search expression. These should not be inside '   ' single quotes.

Some of these, coloured thus can also be used in Replace expressions.

Entries coloured thus are character shortcut elements which can be used inside strings in both Search and Replace expressions. See note.

Char  Short  Name      Char set or hex value   Description
?     \a     Alpha     'a-zA-Z'                Upper and lower case letters
        \a also includes all accented characters. See Character Sets
             Upper     'A-Z'                   Upper case letters
             Lower     'a-z'                   Lower case letters
             Ctrl      '\x00-\x1F\x7F'
      \b               \x08                    Backspace
      \c               '\x00-\x1F\x7F'         All Control characters
#     \d     Digit     '0-9'                   Decimal digits
D     \d     Digit     '0-9'                   Decimal digits
      \e               \x1B                    Escape character
      \f               \x0c                    Form feed
      \h     Hex       '0-9a-fA-F'             Hexadecimal digits
        See Hex Character Set
      \i               '0-9a-zA-Z'             Identifier characters
        The set which \i matches depends on the definitions in the ModeFile (ID_FirstChar, ID_Middle and ID_LastChar)
      \l               \x0A                    Line feed
      \p     punct     '!'(),./:;\?'           Punctuation characters
      \r               \x0D                    Carriage Return
      \s     white     '\t\x0A '               White space
        See Handling White Space
      \t               \x09                    Tab character
      \v               \x0B                    Vertical tab
AD    \w     AlphaNum  '0-9a-zA-Z'             AlphaNumeric characters
      \x               \xHH                    Character code HH
        \x is a special case: the two characters following it are interpreted as a hex number and the corresponding letter/code is used. If the two characters following the \x are not a hex character, StrongED simply reports String not found
      \"                                       Quote character in strings
      \\                                       Backslash character in strings
      \+                                       Turns case-sensitive matching on.
      \-                                       Turns case-sensitive matching off.
        That's \ minus
      \=                                       Restores original case-sensitivity as defined in dbox or function.
        That's \ equals
             CW                                Matches word at caret
        CW will find the caret word even when it is part of another word. See Cursor Word example
$     \n     NL                                Newline
        StrongED can be set to use LF, CR, LF CR or CR LF as New Line characters. $ or NL will match whatever is currently in use.
.                                              A full-stop matches any character other than Newline
@                                              Marks in search string, see section on Markers
@@                                             Short for @09

The upper case equivalent of a Set shorthand matches all characters that are NOT within that set. For example \D matches all characters that are not a decimal digit, i.e. that are not 0-9.

Names and character shorthands are case insensitive.

In a string, \ followed by a character not in the above list will simply match the character that comes after the '\'. Note that this character will obey the case-sensitivity set in the dbox or function, unlike other character shortcuts.

The . (full stop) metacharacter matches any character except for newline character(s). It can be used to absorb any characters that are irrelevant to the overall match. However, care must be taken when using . with a quantifier, e.g. { . }: because '.' matches everything but newline, any subsequent elements may fail to match. See the example using '.'.

Examples

To remove any white space at the beginning of all lines in a text.

Search <white
Replace

Define your own Character sets

If you wish to define your own sets in a Search expression, some examples (note that \x precedes a hex number) are:

'01234567'    Matches any of the characters 0 to 7
'32104765'    Same as above. Ordering doesn't matter
'0-7'         Same again, given as a range
'a-zA-Z'      Matches any letter, upper or lower case
'\-\\'        Matches the "-" and the "\"
'\x00-\xff'   Matches any single character (any byte value)
'\x41-\x5A'   Matches any single upper case letter
'\x61-\x7A'   Matches any single lower case letter
'\x20-\xff'   Matches any single character, not including control characters but including top-bit-set characters
'\t\x0A '     Matches tab (\t), linefeed (\x0A) and space
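To see exactly which characters a set covers, the set can be expanded mechanically. The Python sketch below is an illustration only: parse_set is an invented name, and it understands just single characters, ranges, \xHH codes and the escapes for tab, hyphen and backslash.

```python
def parse_set(body):
    """Expand the text between the single quotes into a Python set of characters."""
    simple = {"t": "\t", "-": "-", "\\": "\\"}
    items = []  # (character, True if it is an unescaped '-' range sign)
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            nxt = body[i + 1]
            if nxt == "x":
                # Two hex digits follow \x.
                items.append((chr(int(body[i + 2:i + 4], 16)), False))
                i += 4
            else:
                items.append((simple[nxt], False))
                i += 2
        else:
            items.append((ch, ch == "-"))
            i += 1
    chars = set()
    j = 0
    while j < len(items):
        if j + 2 < len(items) and items[j + 1][1]:
            # start '-' end: add every character code in the range.
            low, high = ord(items[j][0]), ord(items[j + 2][0])
            chars.update(chr(code) for code in range(low, high + 1))
            j += 3
        else:
            chars.add(items[j][0])
            j += 1
    return chars
```

For instance, parse_set("0-7") and parse_set("32104765") produce the same eight characters, which is the point made in the list above about ordering.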

Character sets can be specially useful when combined with braces (curly brackets). A real life example of this repeat matching is available.

There are more shorthands with special meanings

{ } - Repeat Element Matching

Any element placed inside braces (curly brackets) { } is matched repeatedly against the text being searched, within the limits specified for minimum and maximum number of matches. It can be used to match an element without having to know the exact length of what is to be matched. For example, you can count (match) all words regardless of their length. There are special cases and qualifiers can follow the curly bracket thus:

A search for a group of n items will find the first match in a string of more than n items. See the example Exclusive repeat match

For example:

Search "B" {"A"}
Matches "B", "BA", "BAA", "BAAA" and so on.

Whereas

Search "B" {"A"}+
Matches "B", but only if followed by at least one "A"
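For readers who know regular expressions, the two searches above have rough analogues in Python's re module. This is an analogy rather than a statement of how StrongED works internally; in particular it assumes the repeat takes as many "A"s as it can, and the names are invented for this sketch.

```python
import re

zero_or_more = re.compile(r"BA*")  # rough analogue of  "B" {"A"}
one_or_more = re.compile(r"BA+")   # rough analogue of  "B" {"A"}+


def matched(pattern, text):
    """Return the first piece of text matched by pattern, or None."""
    found = pattern.search(text)
    return found.group(0) if found else None
```

A lone "B" is matched by the first pattern but not by the second.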

To count all words you could use Advanced List of Found CouNt facility thus:

Search {?}+

A real life example of set matching is available.

{ } - Repeat Element Matching is very similar to [ ] - Optional element. See Optional or Repeat Matching?

Named Expressions

You can use Named Expressions in searches. You can define a Named Expression in a ModeFile or in the global Patterns file.

If you have an expression named foobar then you can search for that simply by entering foobar in the search field and making sure advanced search is active. Similarly if you had another expression named foobar you can use that in the replace field. Or any other expression you have named as a Replace expression.

There is more on Named Expressions in Search and Replace Patterns and in the ModeFile Syntax page

Named Expressions can also be used in sections of the ModeFile - Functions, KeyList and ClickList

 | (bar) - x OR y

The metacharacter | signifies an OR search. If the current element is true, the next element after the bar is not evaluated but is skipped.

However if the current element is false, then the next element after the | is evaluated. If this is true, the test proceeds; if false, the test is aborted.

So the | tells the search not to advance to the next character in the text but to test the present character once more against the next element in the search string.

Thus you can have many successive elements in a search expression separated by |. Some examples:

"a" "b" | "c" "d" matches abd and acd - i.e. an 'a' then 'b' or 'c' followed by 'd'

("a" "b" ) | ("c" "d") matches 'ab' or 'cd'
The second example uses grouping

There is no AND character other than an optional space, nor is one necessary. Because StrongED tests each element in the search expression in sequence, the search is only completed and marked as a find when every element is tested True - so the whole string of elements is in practice ANDed. Element 1 AND Element 2 AND etc. makes a Match.
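Note that | binds only to the elements either side of it, which is different from most regular expression dialects, where | splits the whole expression. As a hedged comparison (Python's re module, not StrongED; whole_match is an invented helper), the two examples above correspond roughly to:

```python
import re

# "a" "b" | "c" "d"       : 'a', then 'b' or 'c', then 'd'
ungrouped = re.compile(r"a(?:b|c)d")

# ("a" "b") | ("c" "d")   : 'ab' or 'cd'
grouped = re.compile(r"ab|cd")


def whole_match(pattern, text):
    """True if pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None
```

So in regular expression terms the ungrouped form needs explicit brackets around the alternatives, whereas in StrongED it is the grouped form that needs them.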

< - Start of line and > - End of line

The Start of line flag < is set to true as the search passes a single Return, that is, at the start of a new line. StrongED can be set to use LF, CR, LF CR or CR LF as New Line characters and < detects whatever is in use at the time.

Similarly the End of line flag > is set before the New Line characters.

This flag is very useful in, for example, writing HTML, where many tags are likely to be at the start of a new line, depending on your writing style.

Another example might be labels in an assembler text.

If you wanted to capture or delete all whole lines that contain 'foobar' you could use:
< * "foobar" * >
the elements being: Start of line, search forward in line for the string "foobar", search forward in line for the End of line marker.
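The effect of the whole-line search above is the same as this small Python filter, given here only as a model of the result (lines_containing is an invented name and LF newlines are assumed):

```python
def lines_containing(text, needle="foobar"):
    """Return every whole line of text that contains needle."""
    return [line for line in text.split("\n") if needle in line]
```

Each returned item is a complete line, from its start to just before its newline, which is what the start and end of line flags pin down in the search expression.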

Another example is in Numbering a List.

< and > are most likely to be useful at the start or end of a search expression. If used in the middle of an expression, in most cases a match would not be made, as you're already into the search expression and hence the current text line. But they can be useful, for example, in conjunction with **. For example:

Search "a" ** (< "b")
This will match "a" and then search forward through the text until it finds a line that starts with "b". Notice the (< "b") - the brackets group the two elements.

As an example of how NOT to use them, I thought it might be useful to add something to the ends of every line in a block:

Search >
Replace "something"
I suggest you do NOT try this with a live text! Try with a blank text with one (or more) lines. Though StrongED behaves very nicely - so you can recover - and does exactly what you have told it!

<< - Start of paragraph and >> - End of paragraph

These are the same type of flags as the start/end of line, but are set by two consecutive Return characters.

An example is in Numbering a List

<<< - Start of text and >>> - End of text

These are similar to the other start and end flags but are set to True at start or end of a whole text.

These can be useful if you want to add something at the starts or ends of several files.

( ) - Grouping

Parentheses can be used to group elements together so that they are treated as one element. Usually this will be used in conjunction with other constructs such as alternation (OR) or negative lookahead.

StrongED will apply alternation or negative lookahead only to the next element in the search expression. To make it look further ahead requires the elements to be placed inside parentheses.

Examples of Grouping

"a" | "b" "c"      matches "a" or "b", followed by "c", so matches "ac" or "bc"
"a" | ("b" "c")    matches "a" or "bc"

Another example is under OR.

There is a more involved example of grouping in our examples section.

There is an example of grouping used with a back reference to find html start and end tag pairs.

[ ] - Optional Element

Anything placed inside square brackets is optional, meaning that if it fails to match then the search is not stopped but simply continues with the next element in the search expression.

Example

Search "a" "b" ["xx"] "c" "d"
This matches 'abxxcd' but also 'abcd'

Optional or Repeat Matching?

It may be noticed that [  ] is very similar to {  }. Both will match zero times. But [  ] matches its element at most once, whereas {  } matches it as many times in succession as it is encountered. An example should make this clear:

string   ["a"]"b"   {"a"}"b"
b        b          b
ab       ab         ab
aab      ab         aab
aaab     ab         aaab

As an example {"a"}0:1 will match whether a is present or not.
["a"] will optionally match a - so will match whether a is present or not.
["a"] is then equivalent to {"a"}0:1
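The comparison can be reproduced with the nearest regular expression equivalents in Python. This is an analogy only (a?b for the optional form, a*b for the repeat form), and first_match is an invented helper:

```python
import re


def first_match(pattern, text):
    """Return the first piece of text matched by pattern, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


ROWS = ["b", "ab", "aab", "aaab"]
optional = [first_match(r"a?b", s) for s in ROWS]  # analogue of ["a"]"b"
repeat = [first_match(r"a*b", s) for s in ROWS]    # analogue of {"a"}"b"
```

The optional form never picks up more than one "a", while the repeat form takes them all.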

See also ~ - Ignore

~ - Ignore

The tilde ~ is best considered not as an element itself, but rather as a qualifier.

When placed before an element, ~ qualifies that element. This causes text that is matched by the qualified element to be ignored and the Text Match (Tm) pointer stays at the start of the ignored text. The Ta pointer searches ahead for a byte that doesn't match.

When a byte is found that does not match the qualified element, then the element itself is ignored and the Em pointer is moved to the next element, similar to grouped elements.

Since a qualified element is ignored when there is no match, beware of using a qualified element without a following element as, without a following element, the qualified element is re-tested ad infinitum - or until Escape is pressed!

See How a Search is Made to understand the pointers.

There is an example of Negative Lookahead
or ignore vowels

See also [ ] - Optional Element and ( ) - Grouping

Back References

Back references involve using already set markers later in a search string. The markers @0 to @9 can be set and once set the same search expression can then use the string between markers to search for that same string.

It is best not to rely on @0 being automatically set, but to set it explicitly before use. @9 of course won't be set automatically in any event.

There is an example of Back References demonstrating how to find repeat words in a text.

There is an example of back reference used with grouping to find html start and end tag pairs.
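If you have met back references in regular expressions, the idea is the same: text captured earlier in the match must appear again later. As a comparison only (this is Python's re syntax, not a StrongED expression, and the pattern is my own illustration rather than the example referred to above), repeated words can be found like this:

```python
import re

# \1 must match exactly the text that the first group (\w+) captured.
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b")


def first_repeat(text):
    """Return the first doubled word found in text, or None."""
    found = REPEATED_WORD.search(text)
    return found.group(0) if found else None
```

In StrongED the captured text is the range between two markers rather than a numbered group.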

&xx - Byte, Word and Halfword

The sequence &xx can be used to match a specific byte in the text. The 'xx' part is a two-digit hex value from 00 to FF (0 to 255), so it can match any character in an alphabet of 256 characters. Or it can match a byte in a file.

Variants are &xxxx and &xxxxxxxx which match 16-bit and 32-bit entities respectively so are mainly of use in program writing.

As a point of interest, the search is little-endian, and thus reversed from other searches. If you were to search for the word "byte" you would have to give the letters in the order "etyb" - so search for &65747962.
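The byte order can be checked with a couple of lines of Python (a convenience sketch only; word_for is an invented name and the input is assumed to be exactly four ASCII characters):

```python
def word_for(text):
    """Return the &xxxxxxxx form that matches four ASCII characters in the text."""
    data = text.encode("ascii")
    if len(data) != 4:
        raise ValueError("exactly four characters are needed for a 32-bit word")
    # Little-endian: the first character in the text is the lowest byte of the word.
    return "&" + format(int.from_bytes(data, "little"), "08X")
```

Calling word_for("byte") returns "&65747962", the value quoted above.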

\( and \{ Bracketed Text Blocks

These Block Shorthands can be used to find an entire block of text enclosed in braces {} or brackets (). Because braces and brackets have other uses, when used in this way the opening item must be preceded by a backslash.

This can be useful in languages such as PHP, Perl etc., that enclose function bodies in braces. You could use these shorthands in Advanced LoF, for instance, to list all functions. Or you could search for a function name and then use a block shorthand to skip the function body. But you can also use it to search for any text within brackets.

Brace. \{ Checks that the character at the search position is a { and then searches forward for the matching }. Nesting is taken into account, braces in comments and strings are ignored. The search continues after the matching }.

Bracket. \( Checks that the character at the search position is a ( and then searches forward for the matching ). Any nesting is taken into account, parentheses in comments and strings are ignored. The search continues after the matching ).

Note that round brackets are used in search expressions to group things together and { } is used for Repeat Element matching, hence the need to use the escape \ character to obtain this effect.
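The nesting part of this can be sketched in Python as follows. It is a simplified model: skip_block is an invented name, and unlike StrongED it does not ignore braces or brackets inside comments and strings.

```python
def skip_block(text, pos, opener="{", closer="}"):
    """Return the offset just after the closer matching the opener at pos, or -1."""
    if pos >= len(text) or text[pos] != opener:
        return -1  # the character at the search position must be the opener
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == opener:
            depth += 1
        elif text[i] == closer:
            depth -= 1
            if depth == 0:
                return i + 1  # the search continues after the matching closer
    return -1  # no matching closer before the end of the text
```

For "{a{b}c}d" starting at offset 0 the result is 7, the position of the "d", because the inner pair is counted and skipped.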

&xx=yy AND - Mask and Compare

If you understand binary and how bits relate to Hexadecimal bytes, this section could be for you.

Specifying &xx=yy (where xx and yy are hex values) in the search does a bit-wise AND of the current character with 'xx'. This is then compared to 'yy'. If there is a match, the search proceeds. If no match, then the search fails.

As an example

Search &E0=00
would find control characters - although there are better ways to do that.

A more useful example is

Search &80=80
which will find top-bit-set characters.
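The arithmetic behind the two examples can be tried out in Python (an illustration of the mask-and-compare test only; find_masked is an invented name and the text is treated as a sequence of bytes):

```python
def find_masked(data, mask, value):
    """Offsets of the bytes b in data for which (b AND mask) equals value."""
    return [i for i, b in enumerate(data) if (b & mask) == value]
```

For the bytes 'a', Tab, 'b', &E9, the test &E0=00 picks out only the Tab (code &09, top three bits clear) and &80=80 picks out only the &E9.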

Replace Syntax

If you have used Advanced Search and Replace you have probably encountered the error message Only strings, NL and @xx allowed in replace expression. This message is shorter than the full explanation.

A replace string consists of a number of elements. Each time a match is found and a replacement is made these elements are evaluated to obtain the actual replacement text. This version of StrongED allows a replace expression of a maximum 256 bytes. However the replacement text can be any length (memory permitting).

The elements in the replace expression can be any of:

NL $       Newline character(s). The actual character(s) inserted are dependent on the line-ending of the text the replacement is applied to.
Cnt        Inserts current value of the replace counter. After replace, the counter's value is increased. The counter is set up from the Search&Replace dialogue box.
Name       Name of a replace expression as defined in the ModeFile. When this is used it must be the only thing in the replace string.
Range      Text between two marks. This allows parts of the matched text to be preserved.
"String"   Literal text to insert. None of the characters in a string have special meaning with the exception of the character shorthands mentioned below. The double quotes are *not* inserted.

Character shorthands can also be used, both inside and outside of strings. Any shorthands not listed below will cause the character after the backslash to be inserted.

\b   0x08   Backspace
\e   0x1B   Escape character
\f   0x0C   Form feed
\l   0x0A   Line feed
\r   0x0D   Carriage return
\t   0x09   Tab character
\v   0x0B   Vertical tab
\x   0xnn   Character code nn
\"   0x22   Quote character
\\   0x5C   Backslash

The string shorthands \n or $ can be used in place of NL but, like NL, only outside of strings.

Hex Character Set

There are three ways of specifying the set of hexadecimal digits
  1. Hex
  2. \h
  3. '0-9a-fA-F'
However the sets involved are the characters which may (or may not) indicate a hexadecimal digit. Thus the letters a to f may, in the right context, be a hexadecimal digit. Otherwise they are simply letters. So to find a hex digit, you need to prefix the set with a relevant prefix, as used in the text you are editing, to indicate the presence of the hex digit. For example:
Search "&" {Hex}+
Other prefixes may be \x, # or 0x etc.
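The point about needing a prefix can be seen with a regular expression analogue in Python (re syntax, not StrongED; the sample line is made up):

```python
import re

# Analogue of  "&" {Hex}+  : an ampersand followed by one or more hex digits.
HEX_NUMBER = re.compile(r"&[0-9a-fA-F]+")


def hex_numbers(text):
    """Return every &-prefixed hex number in text."""
    return HEX_NUMBER.findall(text)
```

In the line "LDR R0,&FF00 ; cafe" only "&FF00" is reported; the word "cafe" consists entirely of hex digit characters but has no prefix, so it is left alone.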


Page Information

Document URI: http://stronged.torrens.org/man/search/advanced.html
Page first published: January 2018
Last modified: Mon, 08 Jul 2024 09:18:35 BST
© 2018 - 2024 Richard John Torrens.