R Text Processing · 柳溪's Personal Blog

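1 String Counting and Character Translation

1.1 nchar and length
nchar returns the number of characters in each element of a vector; length returns the length of the vector (the number of elements).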

x <- c("Hellow", "World", "!")
nchar(x)  # 6 5 1

nchar("")   # 0: the string contains no characters
length("")  # 1: the vector still has one element

1.2 tolower, toupper, and chartr

DNA <- "AtGCtttACC"
tolower(DNA)              # "atgctttacc"
toupper(DNA)              # "ATGCTTTACC"
chartr("Tt", "Uu", DNA)   # "AuGCuuuACC": T becomes U, t becomes u
chartr("Tt", "UU", DNA)   # "AUGCUUUACC": both T and t become U

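2 String Concatenation

2.1 The paste function
paste joins vectors element by element into a character vector, recycling the shorter vectors. Arguments of other types are converted to character vectors first.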

paste("CK", 1:6, sep = "")      # "CK1" "CK2" "CK3" "CK4" "CK5" "CK6"
x <- list(a = "aaa", b = "bbb", c = "ccc")
y <- list(d = 1, e = 2)
paste(x, y, sep = "-")          # "aaa-1" "bbb-2" "ccc-1"
paste(x, y, collapse = ";")     # "aaa 1;bbb 2;ccc 1"
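The recycling rule and the difference between sep and collapse can be sketched with a small example. The variable names (labels, joined, ids) are only illustrative and do not come from the original post.

```r
# Recycling: the shorter vector c("a", "b") is reused to match length 4.
labels <- paste(c("a", "b"), 1:4, sep = "_")
print(labels)   # "a_1" "b_2" "a_3" "b_4"

# sep joins the pieces inside each element;
# collapse then joins the resulting elements into a single string.
joined <- paste(c("a", "b"), 1:4, sep = "_", collapse = "+")
print(joined)   # "a_1+b_2+a_3+b_4"

# paste0() is shorthand for paste(..., sep = "").
ids <- paste0("CK", 1:3)
print(ids)      # "CK1" "CK2" "CK3"
```

Without collapse the result has one element per recycled position; with collapse it is always a single string.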

3 String Splitting

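3.1 The strsplit function

strsplit splits on regular expressions by default. Its usage is:
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
x is a character vector; each element is split separately.
split is a character vector giving the strings at which to split. By default it is interpreted as a regular expression (fixed = FALSE). If you are not familiar with regular expressions, set fixed = TRUE so that split is matched exactly as written, as literal text. Literal matching is also faster.
perl = TRUE selects Perl-compatible regular expressions (PCRE); it does not depend on an installed version of Perl. For a long pattern, a correctly written expression combined with perl = TRUE can run faster.
useBytes sets whether matching is done byte by byte. The default is FALSE, meaning matching works on characters rather than bytes.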

text <- "Hello Adam!\nHello Ava!"
strsplit(text, " ")     # "Hello" "Adam!\nHello" "Ava!"
strsplit(text, "\\s")   # "Hello" "Adam!" "Hello" "Ava!"
strsplit(text, "")      # "H" "e" "l" "l" "o" " " "A" "d" "a" "m" "!" "\n" "H" "e" "l" "l" "o" " " "A" "v" "a" "!"
class(strsplit(text, " "))  # "list": one list element per element of x
class(text)                 # "character"
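The effect of fixed = TRUE is easiest to see with a separator that is also a regular-expression metacharacter. A minimal sketch, using a made-up example string (ip) that is not from the original post:

```r
ip <- "192.168.0.1"

# As a regular expression, "." matches any character, so every character
# is treated as a separator and only empty strings are left.
as_regex <- strsplit(ip, ".")[[1]]
print(as_regex)

# With fixed = TRUE the dot is matched literally.
as_fixed <- strsplit(ip, ".", fixed = TRUE)[[1]]
print(as_fixed)     # "192" "168" "0" "1"

# Escaping the dot in a regular expression gives the same result.
as_escaped <- strsplit(ip, "\\.")[[1]]
print(as_escaped)   # "192" "168" "0" "1"
```

The [[1]] is needed because strsplit always returns a list, even for a single input string.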

4 String Searching

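4.1 The grep and grepl functions
To match a metacharacter itself in a regular expression, for example to search text for a question mark "?", you must escape it, normally with a backslash "\". Note that in R you have to write two backslashes, because the backslash is also the escape character inside R string literals: "\\?" in code produces the regular expression \?.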

files <- list.files("/home/..")
files
grep("\\.idx$", files)    # indices of the file names ending in ".idx"
grepl("\\.idx$", files)   # TRUE/FALSE for every file name
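Because the listing above depends on the local file system, here is a self-contained sketch of the same escaping rule. The vector questions is invented for illustration.

```r
questions <- c("Ready?", "Go.", "Why? Because.", "file.idx")

# "\\?" in R source is the regular expression \? (a literal question mark).
has_qmark <- grepl("\\?", questions)
print(has_qmark)         # TRUE FALSE TRUE FALSE

# With fixed = TRUE no escaping is needed.
has_qmark_fixed <- grepl("?", questions, fixed = TRUE)
print(has_qmark_fixed)   # TRUE FALSE TRUE FALSE

# grep returns indices; value = TRUE returns the matching elements instead.
idx  <- grep("\\.idx$", questions)
hits <- grep("\\.idx$", questions, value = TRUE)
print(idx)               # 4
print(hits)              # "file.idx"
```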

4.2 regexpr, gregexpr, and regexec

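text <- c("Hellow, Adam!", "Hi, Adam!", "How are you, Adam.")
regexpr("Adam", text)    # position and length of the first match in each element
gregexpr("Adam", text)   # list with the positions of all matches in each element
regexec("Adam", text)    # list with the match and any parenthesized sub-matches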
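These three functions return match positions rather than the matched text. A common way to turn positions into substrings is regmatches; the sketch below uses its own example data (greetings and the other variable names are illustrative) so it does not overwrite text.

```r
greetings <- c("Hellow, Adam!", "Hi, Adam!", "No name here")

# regexpr: start of the first match per element, -1 when there is none.
first <- regexpr("Adam", greetings)
print(as.integer(first))              # 9 5 -1
print(regmatches(greetings, first))   # "Adam" "Adam" (unmatched elements are dropped)

# gregexpr: every match inside each element.
all_an <- regmatches("banana", gregexpr("an", "banana"))[[1]]
print(all_an)                         # "an" "an"

# regexec: the full match followed by each parenthesized group.
parts <- regmatches("Hi, Adam!", regexec("([A-Za-z]+), ([A-Za-z]+)", "Hi, Adam!"))[[1]]
print(parts)                          # "Hi, Adam" "Hi" "Adam"
```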

5 String Replacement

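5.1 The sub and gsub functions
sub and gsub are the string replacement functions, but strictly speaking R has no function that replaces text inside an existing string. R functions work on copies of their arguments (pass by value, not by reference), so sub and gsub return a new vector and leave the input unchanged. sub replaces only the first match in each element; gsub replaces every match.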

sub(pattern = "Adam", replacement = "world", text)      # "Hellow, world!" "Hi, world!" "How are you, world."
sub(pattern = "Adam|Hi", replacement = "world", text)   # "Hellow, world!" "world, Adam!" "How are you, world."
gsub(pattern = "Adam|Hi", replacement = "world", text)  # "Hellow, world!" "world, world!" "How are you, world."
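The pass-by-value point can be checked directly: the input keeps its old value until the result is assigned back. A minimal sketch with illustrative variable names:

```r
greeting <- "Hello, Adam!"

# sub returns a modified copy; greeting itself is not touched.
result <- sub("Adam", "world", greeting)
print(greeting)   # "Hello, Adam!"
print(result)     # "Hello, world!"

# To "replace in place", assign the result back to the same name.
greeting <- sub("Adam", "world", greeting)
print(greeting)   # "Hello, world!"
```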

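sub and gsub accept backreferences (an escape character followed by a digit) in the replacement, so the captured parts can stand in for the whole match. Parentheses define the groups, and the digit is the number of the group it refers to.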

sub(pattern = ".*(Hellow).*(Adam).*", replacement = "\\1\\2", text)  # "HellowAdam" "Hi, Adam!" "How are you, Adam."
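Backreferences are also handy for reordering captured pieces. The following sketch uses invented sample data (names_lf, dates_iso) to swap the order of groups:

```r
# Swap "Last, First" into "First Last": \\2 is the second group, \\1 the first.
names_lf <- c("Lovelace, Ada", "Turing, Alan")
names_fl <- sub("^([A-Za-z]+), ([A-Za-z]+)$", "\\2 \\1", names_lf)
print(names_fl)   # "Ada Lovelace" "Alan Turing"

# gsub applies the same rewrite to every match in the string.
dates_iso <- "2020-01-31 and 2021-12-05"
dates_dmy <- gsub("([0-9]{4})-([0-9]{2})-([0-9]{2})", "\\3/\\2/\\1", dates_iso)
print(dates_dmy)  # "31/01/2020 and 05/12/2021"
```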

6 String Extraction

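6.1 The substr and substring functions
substr(x, start, stop)
substr returns as many strings as the length of its first argument,
whereas substring returns as many strings as the longest of its three arguments, recycling the shorter vectors.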

x <- "123456789"
substr(x, c(2, 3), c(4, 5, 8))     # "234"
substring(x, c(2, 3), c(4, 5, 8))  # "234" "345" "2345678"

x <- c("123456789", "abcdefghijklmnopq")
substr(x, c(2, 3), c(4, 5, 8))  # "234" "cde"

DNA <- "GCAGCGCATATG"
substring(DNA, seq(1, 10, by = 3), seq(3, 12, by = 3))  # "GCA" "GCG" "CAT" "ATG"
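The codon example generalizes to cutting any string into fixed-width pieces. The helper below, split_fixed_width, is a hypothetical name introduced here for illustration; it is a sketch that assumes a single input string.

```r
# Hypothetical helper: cut one string into consecutive pieces of `width` characters.
split_fixed_width <- function(s, width) {
  if (nchar(s) == 0) return(character(0))    # nothing to split
  starts <- seq(1, nchar(s), by = width)     # start position of each piece
  substring(s, starts, starts + width - 1)   # a stop past the end is truncated
}

codons <- split_fixed_width("GCAGCGCATATG", 3)
print(codons)    # "GCA" "GCG" "CAT" "ATG"

# When the length is not a multiple of width, the last piece is shorter.
partial <- split_fixed_width("GCAGCGCATAT", 3)
print(partial)   # "GCA" "GCG" "CAT" "AT"
```

This relies on substring recycling over the vector of start positions, which substr would not do.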

7 Miscellaneous

7.1 The strtrim function

strtrim(c("abcdef", "abcdef", "abcdef"), c(1, 5, 10))  # "a" "abcde" "abcdef"

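7.2 The strwrap function
strwrap treats each string as the text of a paragraph (single newline characters inside the string count as ordinary whitespace) and breaks it into lines at word boundaries according to the paragraph format (indentation and width). Each line becomes one string in the result. For example: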

str1 <- "Each character string in the input is first split into paragraphs\n(or lines containing whitespace only). The paragraphs are then\nformatted by breaking lines at word boundaries. The target\ncolumns for wrapping lines and the indentation of the first and\nall subsequent lines of a paragraph can be controlled\nindependently."
str2 <- rep(str1, 2)
str2
strwrap(str2, width = 80, indent = 2)
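A shorter sketch makes the effect of width, indent, and exdent easier to inspect. The sample sentence and variable names are made up for illustration; the exact line breaks are left for you to check by running it.

```r
para <- paste("The quick brown fox jumps over the lazy dog.",
              "The quick brown fox jumps over the lazy dog.")

# indent applies to the first line of the paragraph, exdent to all later lines.
wrapped <- strwrap(para, width = 30, indent = 2, exdent = 4)
cat(wrapped, sep = "\n")

# A single newline inside the input is treated as whitespace, not as a line break.
one_line <- strwrap("one\ntwo", width = 80)
print(one_line)   # "one two"
```

The result is a character vector with one element per output line, so cat(..., sep = "\n") is a convenient way to view it as a paragraph.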

7.3 match and charmatch

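match("xx", c("abc", "xx", "xxx", "xx"))  # 2: position of the first exact match
charmatch("xx", "xx")    # 1: exact match
charmatch("xx", "xxa")   # 1: unique partial match at the start of the string
charmatch("xx", "axx")   # NA: "xx" is not at the start of "axx"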
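The difference between the two functions shows up when several candidates are involved: match only accepts exact matches, while charmatch also accepts a unique partial match at the start of a string. A short sketch with illustrative inputs:

```r
candidates <- c("abc", "xx", "xxx", "xx")

# match: exact matching only, NA when nothing matches.
exact <- match(c("xx", "zz"), candidates)
print(exact)       # 2 NA

# %in% is the logical counterpart of match.
present <- c("xx", "zz") %in% candidates
print(present)     # TRUE FALSE

# charmatch prefers a single exact match over partial matches.
preferred <- charmatch("xx", c("xxa", "xx"))
print(preferred)   # 2

# Several partial matches and no exact one: ambiguous, reported as 0.
ambiguous <- charmatch("x", c("xa", "xb"))
print(ambiguous)   # 0
```

A return value of 0 (ambiguous) is different from NA (no match), so code that uses charmatch usually needs to handle both.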