Regular Expressions in R

A regular expression (regex) is a sequence of characters, some of them with special meaning, that defines a search pattern to be matched against text. Regular expressions serve many purposes, such as identifying sequences of digits, formatted addresses, special strings, parts of names and so on.

On Linux-based systems, regular expressions are commonly used to search text with the grep command. R provides a function named grep() for the same kind of task, as we will see in the following sections.

Components of Regular Expressions in R

A regular expression is built from ordinary characters plus special characters and symbols that give the search pattern its meaning. It helps to learn the most commonly used ones, which are listed below.

  • Dot (.) – Matches any single character; some regex engines exclude the newline character.
  • Pipe (|) – Specifies alternation: the pattern on either side of it may match.
  • Square brackets [] – Match any one of the characters listed inside the brackets.
  • Hyphen (-) – Specifies a character range inside square brackets, as in [a-m] or [A-Z].
  • Caret (^) – Placed first inside square brackets, excludes the listed characters; for example, [^0-9] matches any character that is not a digit.
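As a minimal sketch of these symbols in action, the snippet below tests each one with grepl(). The words vector is hypothetical sample data, chosen only for illustration.

```r
# Hypothetical sample data, not from the article.
words <- c("cat", "cot", "cut", "dog", "42")

dot_match   <- grepl("c.t", words)      # c, any one character, t
pipe_match  <- grepl("cat|dog", words)  # either "cat" or "dog"
set_match   <- grepl("c[ao]t", words)   # c, then a or o, then t
range_match <- grepl("^[a-m]", words)   # starts with a letter from a to m
neg_match   <- grepl("[^0-9]", words)   # contains at least one non-digit

print(dot_match)    # TRUE TRUE TRUE FALSE FALSE
print(pipe_match)   # TRUE FALSE FALSE TRUE FALSE
print(set_match)    # TRUE TRUE FALSE FALSE FALSE
print(range_match)  # TRUE TRUE TRUE TRUE FALSE
print(neg_match)    # TRUE TRUE TRUE TRUE FALSE
```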

Besides these, we need symbols known as anchors to construct regular expressions. Anchors match a position, such as the beginning or end of a word or a string, rather than a character. They are:

  • Caret (^) – Matches the beginning of the string.
  • Dollar ($) – Matches the end of the string.
  • \\< – Matches the beginning of a word (the backslash is doubled because this is how it is written in an R string).
  • \\> – Matches the end of a word.
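The anchors can be tried out with a short sketch like the following. The phrases vector is made up for illustration, and the patterns rely on R's default regex engine, which supports \\< and \\>.

```r
# Hypothetical sample data, not from the article.
phrases <- c("The boat", "Another boat trip", "boathouse rules", "sailboat")

starts_the <- grepl("^The", phrases)     # string begins with "The"
ends_boat  <- grepl("boat$", phrases)    # string ends with "boat"
word_start <- grepl("\\<boat", phrases)  # "boat" at the start of a word
word_end   <- grepl("boat\\>", phrases)  # "boat" at the end of a word

print(starts_the)  # TRUE FALSE FALSE FALSE
print(ends_boat)   # TRUE FALSE FALSE TRUE
print(word_start)  # TRUE TRUE TRUE FALSE
print(word_end)    # TRUE TRUE FALSE TRUE
```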

Often you also need to specify how many times something must occur. For example, a telephone number might be a string of exactly 10 digits. The number of occurrences is specified with symbols called quantifiers, which apply to the item immediately before them. These are:

  • Star (*) – Match the preceding item zero or more times.
  • Plus (+) – Match the preceding item one or more times.
  • Question mark (?) – Match the preceding item zero or one time, that is, the item is optional.
  • {n} – Match the preceding item exactly n times.
  • {n,} – Match the preceding item at least n times.
  • {,n} – Match the preceding item at most n times. Not every regex engine accepts this form; {0,n} is the portable spelling.
  • {n,m} – Match the preceding item at least n and at most m times.
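The quantifiers can be compared side by side with a small sketch such as this one. The input vector is illustrative, and each pattern is anchored with ^ and $ so that the whole string has to fit.

```r
# Hypothetical sample data: "a", then a varying number of "b", then "c".
x <- c("ac", "abc", "abbc", "abbbc")

star     <- grepl("^ab*c$", x)      # zero or more b
plus     <- grepl("^ab+c$", x)      # one or more b
optional <- grepl("^ab?c$", x)      # zero or one b
exactly2 <- grepl("^ab{2}c$", x)    # exactly two b
atleast2 <- grepl("^ab{2,}c$", x)   # two or more b
one_two  <- grepl("^ab{1,2}c$", x)  # one or two b

print(star)      # TRUE TRUE TRUE TRUE
print(plus)      # FALSE TRUE TRUE TRUE
print(optional)  # TRUE TRUE FALSE FALSE
print(exactly2)  # FALSE FALSE TRUE FALSE
print(atleast2)  # FALSE FALSE TRUE TRUE
print(one_two)   # FALSE TRUE TRUE FALSE
```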

Commonly needed sets of characters are grouped into character classes. Each class has a short symbol that matches any one character belonging to the class. Some of these are:

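  • \d – Matches a digit; \D matches a non-digit.
  • \w – Matches a word character (a letter, a digit or an underscore); \W matches any other character.
  • [[:xdigit:]] – Matches a hexadecimal digit. (\x does not name a class; in engines that support it, it introduces a character given by its hexadecimal code.)
  • \s – Matches a whitespace character; \S matches a non-whitespace character.

Inside an R string literal each backslash must be doubled, so \d is written as "\\d".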
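A rough sketch of the character classes, using a made-up tokens vector. Note the doubled backslashes required by R string literals; the hexadecimal test uses the POSIX class [[:xdigit:]].

```r
# Hypothetical sample data, not from the article.
tokens <- c("42", "abc", "a_1", " ", "ff", "xyz!")

digits_only <- grepl("^\\d+$", tokens)           # only digits
word_only   <- grepl("^\\w+$", tokens)           # only letters, digits, underscore
has_space   <- grepl("\\s", tokens)              # contains whitespace
hex_only    <- grepl("^[[:xdigit:]]+$", tokens)  # only 0-9, a-f, A-F

print(digits_only)  # TRUE FALSE FALSE FALSE FALSE FALSE
print(word_only)    # TRUE TRUE TRUE FALSE TRUE FALSE
print(has_space)    # FALSE FALSE FALSE TRUE FALSE FALSE
print(hex_only)     # TRUE TRUE FALSE FALSE TRUE FALSE
```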

Finally, to match any of these special characters literally, you must escape it. For example, the dot already carries a meaning in a regex, so to match an actual dot in your string you precede it with a backslash, as in \. (written "\\." in an R string literal).
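The effect of escaping can be seen in a small sketch like this one. The file names are hypothetical; the fixed = TRUE argument of grepl() is shown as an alternative that treats the whole pattern as literal text.

```r
# Hypothetical file names, not from the article.
files <- c("data.csv", "data_csv", "notes.txt")

unescaped <- grepl("data.csv", files)                # dot matches any character
escaped   <- grepl("data\\.csv", files)              # dot matches only a literal dot
literal   <- grepl("data.csv", files, fixed = TRUE)  # no regex interpretation at all

print(unescaped)  # TRUE TRUE FALSE
print(escaped)    # TRUE FALSE FALSE
print(literal)    # TRUE FALSE FALSE
```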

Examples of Regular Expressions in R

Now that we know the components of regular expressions in R, we can put them to use in some examples.

  • ^The – Matches every string that begins with The (including words such as There).
  • boat$ – Matches every string that ends with boat.
  • ^The.*boat$ – Matches every string that begins with The, continues with zero or more other characters and ends with boat.
  • ab*c – Matches ac, abc, abbc, abbbc, abbbbc, etc.
  • ab+c – Matches abc, abbc, abbbc, abbbbc, etc.
  • \d{3}-\d{4} – Matches three digits, followed by a hyphen, followed by four digits, for example 456-1223 or 222-7658.
  • a..... – Matches a lowercase a followed by any five characters, such as a six-letter word starting with a.
  • [^anc].+ – Matches any character other than a, n or c followed by at least one more character. Anchored as ^[^anc].+ it matches any string of length two or more that does not start with a, n or c.
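A couple of these patterns can be checked with a sketch along the following lines. The sentences and phone strings are invented for illustration; the second phone pattern adds ^ and $ to show the difference between finding the pattern somewhere in a string and matching the whole string.

```r
# Hypothetical sample data, not from the article.
sentences <- c("The old man sold the boat", "The boat sank",
               "A boat", "There is a boat")
phones <- c("456-1223", "222-7658", "4561223", "call 555-0199 now")

the_boat    <- grepl("^The.*boat$", sentences)    # begins with The, ends with boat
phone_any   <- grepl("\\d{3}-\\d{4}", phones)     # pattern occurs somewhere
phone_whole <- grepl("^\\d{3}-\\d{4}$", phones)   # whole string is the pattern

print(the_boat)     # TRUE FALSE FALSE TRUE
print(phone_any)    # TRUE TRUE FALSE TRUE
print(phone_whole)  # TRUE TRUE FALSE FALSE
```

Note that "There is a boat" matches because ^The only requires the string to start with those three letters.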

In this way, regular expressions can be constructed for a wide variety of patterns.

The grep() function in R

The grep() function matches a pattern against each element of a character vector and returns the indices of the elements that contain a match. Its arguments, and related functions such as grepl(), control the form in which the results are returned.

First, we create a character vector in which to search for patterns.


> strvec <- c("Beamite", "Gazelow", "Gazairy", "Pantheon", "Chimeton", "Sandite", "Zebrawl", "Barrazel", "Bellibou", "Sandapi" )

This is a vector of randomly generated fantasy character names. Let us start using grep() on it.


# Get the indices of all the names in the vector starting with B
> grep("^B",strvec)
[1] 1 8 9

If we want the names themselves rather than their indices, we add the value=TRUE argument to grep().


> grep("^B",strvec, value=TRUE)
[1] "Beamite"  "Barrazel" "Bellibou"

Similarly, to check whether a pattern matches each element of a character vector, use grepl() instead of grep(). It returns a logical vector with one TRUE or FALSE per element.

Suppose we want to know which of the names are exactly 7 characters long. We write the following regular expression and pass it as the pattern to grepl().


> grepl("^(.){7}$", strvec )
 [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
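Because grepl() returns a logical vector, it can be used directly to subset the original vector. The sketch below pulls out the seven-character names and cross-checks the result against nchar(), a base R function the article does not use.

```r
strvec <- c("Beamite", "Gazelow", "Gazairy", "Pantheon", "Chimeton",
            "Sandite", "Zebrawl", "Barrazel", "Bellibou", "Sandapi")

# Keep only the elements for which the pattern matches.
seven <- strvec[grepl("^.{7}$", strvec)]
print(seven)

# Cross-check: the same selection by counting characters.
by_length <- strvec[nchar(strvec) == 7]
print(identical(seven, by_length))
```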

Other Functions to Handle Regular Expressions in R

To locate the pattern within each string, use the regexpr() function. It returns the position of the first match in each element, together with the length of that match.


> regexpr("ite$",strvec)
 [1]  5 -1 -1 -1 -1  5 -1 -1 -1 -1
attr(,"match.length")
 [1]  3 -1 -1 -1 -1  3 -1 -1 -1 -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

Here we are searching for the strings ending in “ite”. The first and sixth elements of our vector end with it, and in both the match starts at position 5.

If the pattern doesn’t occur in an element, the function returns -1 for it. The match.length attribute of the result gives the length of each matched substring, which is 3 in the present case.
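As a sketch of how the positions and lengths fit together, the start position and the match.length attribute can be fed to substring() to cut out the matched text by hand. The variable names here are illustrative.

```r
strvec <- c("Beamite", "Gazelow", "Gazairy", "Pantheon", "Chimeton",
            "Sandite", "Zebrawl", "Barrazel", "Bellibou", "Sandapi")

m     <- regexpr("ite$", strvec)
start <- as.integer(m)             # start position of each match, -1 if none
len   <- attr(m, "match.length")   # length of each match, -1 if none
found <- start > 0                 # which elements matched at all

# Cut the matched text out of the elements that matched.
tails <- substring(strvec[found], start[found], start[found] + len[found] - 1)
print(which(found))  # 1 6
print(tails)         # "ite" "ite"
```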

To retrieve the matched substring itself instead of its position, use the regmatches() function.


> regmatches(strvec,regexpr("S.*", strvec) )
[1] "Sandite" "Sandapi"

This function takes the match positions computed by regexpr() and extracts the substrings that matched the expression. Elements without a match are dropped from the result.
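Since regexpr() reports only the first match per element, one possible way to collect every match is to pair regmatches() with gregexpr(), a base R relative of regexpr() that the article does not cover. The sketch below counts the letter e in each name; the result is a list with one entry per element, so elements without a match are kept as empty entries.

```r
strvec <- c("Beamite", "Gazelow", "Gazairy", "Pantheon", "Chimeton",
            "Sandite", "Zebrawl", "Barrazel", "Bellibou", "Sandapi")

# gregexpr() finds all matches; regmatches() then returns a list.
all_e   <- regmatches(strvec, gregexpr("e", strvec))
e_count <- lengths(all_e)   # number of matches per element

print(all_e[[1]])  # "e" "e"
print(e_count)     # 2 1 0 1 1 1 1 1 1 0
```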

Finally, to match a pattern and replace it with new text, use sub(). Suppose we want to change every B, upper or lower case, to D in our vector. We can start as follows.


> sub("[Bb]","D",strvec)
 [1] "Deamite"  "Gazelow"  "Gazairy"  "Pantheon" "Chimeton" "Sandite"  "ZeDrawl" 
 [8] "Darrazel" "Dellibou" "Sandapi"

Observe that sub() replaces only the first occurrence of the pattern in each element: the string that was “Bellibou” has become “Dellibou”, with its second b unchanged. To replace every occurrence, we use the gsub() function, where g stands for global substitution.


> gsub("[Bb]","D",strvec)
 [1] "Deamite"  "Gazelow"  "Gazairy"  "Pantheon" "Chimeton" "Sandite"  "ZeDrawl" 
 [8] "Darrazel" "DelliDou" "Sandapi"
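Note that the pattern [Bb] turns a lowercase b into an uppercase D, as in "ZeDrawl". If the case should be preserved, one option (a different technique from the article's, shown only as a sketch) is base R's chartr(), which translates characters one for one; two gsub() calls give the same result.

```r
strvec <- c("Beamite", "Gazelow", "Gazairy", "Pantheon", "Chimeton",
            "Sandite", "Zebrawl", "Barrazel", "Bellibou", "Sandapi")

# chartr() maps B to D and b to d, character by character (no regex involved).
translated <- chartr("Bb", "Dd", strvec)

# The same result with two global substitutions.
two_step <- gsub("b", "d", gsub("B", "D", strvec))

print(translated)
print(identical(translated, two_step))
```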

Conclusion

Regular expression support is an important feature of any programming language. The regular expression functions in R greatly ease data preprocessing tasks, where string handling is often not easy. Combining grep() and its relatives with a string processing package such as stringr helps programmers a lot.

Translated from: https://www.journaldev/36776/regular-expressions-in-r
