Pandas Text Processing

Examples of text processing operations in Pandas

This chapter covers string operations on the basic Series / Index objects. Later chapters show how to apply the same string functions to a DataFrame.

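Pandas provides a set of string functions that make it easy to work with string data. Importantly, these functions skip missing (NaN) values: a missing element stays missing in the result instead of raising an error.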

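Almost all of these methods mirror Python's built-in string methods (see: https://docs.python.org/3/library/stdtypes.html#string-methods). They are reached through the .str accessor of a Series or Index, which applies the operation to each element as a string.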
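As a minimal sketch of how these methods are reached in practice, the snippet below calls upper() through the .str accessor of a Series and compares it with a hand-written loop over plain Python strings. The sample values and the names `names`, `upper_names`, and `manual` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Tom", "William Rick", np.nan])

# Element-wise call through the .str accessor; the missing value stays missing.
upper_names = names.str.upper()

# Equivalent plain-Python loop, which has to guard against NaN by hand.
manual = [x.upper() if isinstance(x, str) else x for x in names]

print(upper_names.tolist())
print(manual)
```

The accessor form is shorter and handles the missing element without an explicit check.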

Let's look at how each operation works.

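Method	Description
lower()	Converts the strings in the Series/Index to lowercase.
upper()	Converts the strings in the Series/Index to uppercase.
len()	Computes the length of each string.
strip()	Strips whitespace (including newlines) from both sides of each string in the Series/Index.
split(' ')	Splits each string on the given pattern.
cat(sep=' ')	Concatenates the Series/Index elements with the given separator.
get_dummies()	Returns a DataFrame of one-hot encoded values.
contains(pattern)	Returns True for each element that contains the pattern, otherwise False.
replace(a,b)	Replaces the value a with the value b.
repeat(value)	Repeats each element the specified number of times.
count(pattern)	Returns the number of occurrences of the pattern in each element.
startswith(pattern)	Returns True if the element in the Series/Index starts with the pattern.
endswith(pattern)	Returns True if the element in the Series/Index ends with the pattern.
find(pattern)	Returns the position of the first occurrence of the pattern, or -1 if it does not occur.
findall(pattern)	Returns a list of all occurrences of the pattern.
swapcase()	Swaps upper and lower case.
islower()	Checks whether all characters in each string in the Series/Index are lowercase. Returns a Boolean.
isupper()	Checks whether all characters in each string in the Series/Index are uppercase. Returns a Boolean.
isnumeric()	Checks whether all characters in each string in the Series/Index are numeric. Returns a Boolean.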
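The table mentions Index objects, but the worked examples in this chapter all use a Series. As a small sketch, assuming a made-up Index of labels, the same .str methods can be chained on an Index:

```python
import pandas as pd

# Hypothetical labels with stray whitespace and mixed case.
idx = pd.Index([" Tom", "William Rick ", "John"])

# Chain two string methods; each call returns a new Index.
cleaned = idx.str.strip().str.lower()

# len() on an Index returns an Index of integer lengths.
lengths = cleaned.str.len()

print(list(cleaned))
print(list(lengths))
```

This is handy for normalizing column names, for example through `df.columns.str.strip()`.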

Let's create a Series and see how each of the functions above works.

Example

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
    print(s)

Output:

    0             Tom
    1    William Rick
    2            John
    3         Alber@t
    4             NaN
    5            1234
    6      SteveSmith
    dtype: object

lower()

Example

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
    print(s.str.lower())

Output:

    0             tom
    1    william rick
    2            john
    3         alber@t
    4             NaN
    5            1234
    6      stevesmith
    dtype: object

upper()

Example

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
    print(s.str.upper())

Output:

    0             TOM
    1    WILLIAM RICK
    2            JOHN
    3         ALBER@T
    4             NaN
    5            1234
    6      STEVESMITH
    dtype: object

len()

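Example

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
    print(s.str.len())

Output:

    0     3.0
    1    12.0
    2     4.0
    3     7.0
    4     NaN
    5     4.0
    6    10.0
    dtype: float64

The lengths are shown as floats because the missing element yields NaN, which forces a floating-point result.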

strip()

Example

    import pandas as pd

    # 'Tom ' has a trailing space and ' William Rick' has a leading space.
    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s)
    print("After Stripping:")
    print(s.str.strip())

Output:

    0             Tom 
    1     William Rick
    2             John
    3          Alber@t
    dtype: object
    After Stripping:
    0             Tom
    1    William Rick
    2            John
    3         Alber@t
    dtype: object

split(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s)
    print("Split Pattern:")
    print(s.str.split(' '))

Output:

    0             Tom 
    1     William Rick
    2             John
    3          Alber@t
    dtype: object
    Split Pattern:
    0              [Tom, ]
    1    [, William, Rick]
    2               [John]
    3            [Alber@t]
    dtype: object

The empty pieces come from the trailing space in 'Tom ' and the leading space in ' William Rick'.
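Leading and trailing spaces produce empty pieces when splitting on a single space. One way to avoid them, sketched here with the same sample data, is to strip first; the `expand=True` argument of split() then spreads the pieces across DataFrame columns. The names `parts` and `table` are illustrative.

```python
import pandas as pd

s = pd.Series(["Tom ", " William Rick", "John", "Alber@t"])

# Strip first so that edge spaces do not create empty pieces.
parts = s.str.strip().str.split(" ")

# expand=True returns a DataFrame with one column per piece;
# rows with fewer pieces are padded with missing values.
table = s.str.strip().str.split(" ", expand=True)

print(parts.tolist())
print(table.shape)
```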

cat(sep=pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s.str.cat(sep='_'))

Output:

    Tom _ William Rick_John_Alber@t
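When the Series contains missing values, cat() leaves them out of the joined string unless a placeholder is supplied through its `na_rep` parameter. A short sketch with made-up values:

```python
import numpy as np
import pandas as pd

letters = pd.Series(["a", np.nan, "b"])

# By default the missing element is omitted from the result.
joined = letters.str.cat(sep="_")

# na_rep substitutes a placeholder for each missing element.
joined_marked = letters.str.cat(sep="_", na_rep="?")

print(joined)
print(joined_marked)
```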

get_dummies()

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s.str.get_dummies())

Output:

        William Rick  Alber@t  John  Tom 
    0              0        0     0     1
    1              1        0     0     0
    2              0        0     1     0
    3              0        1     0     0
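get_dummies() also splits each string on a separator (the `sep` parameter, `'|'` by default) before encoding, so one element can set several indicator columns. A minimal sketch with hypothetical tag strings:

```python
import pandas as pd

# Hypothetical multi-label values separated by '|'.
tags = pd.Series(["a|b", "b", "a|c"])

# One indicator column per distinct label, sorted alphabetically.
dummies = tags.str.get_dummies(sep="|")

print(dummies)
```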

contains(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s.str.contains(' '))

Output:

    0     True
    1     True
    2    False
    3    False
    dtype: bool
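A common use of contains() is to build a Boolean mask for filtering. The sketch below assumes a Series with a missing value: passing `na=False` makes the missing element count as "no match" so the mask can be used for indexing, and `regex=False` treats the pattern as a literal string instead of a regular expression. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Tom", "William Rick", np.nan, "Alber@t"])

# na=False turns the missing element into False instead of NaN.
mask = s.str.contains("i", na=False)
selected = s[mask]

# regex=False matches the pattern literally.
has_at = s.str.contains("@", regex=False, na=False)

print(selected.tolist())
print(has_at.tolist())
```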

replace(a,b)

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s)
    print("After replacing @ with $:")
    print(s.str.replace('@', '$'))

Output:

    0             Tom 
    1     William Rick
    2             John
    3          Alber@t
    dtype: object
    After replacing @ with $:
    0             Tom 
    1     William Rick
    2             John
    3          Alber$t
    dtype: object
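Whether replace() interprets the pattern as a regular expression by default depends on the pandas version, so it can be safer to state the intent with the `regex` parameter. A brief sketch on the same sample data, with illustrative variable names:

```python
import pandas as pd

s = pd.Series(["Tom ", " William Rick", "John", "Alber@t"])

# Literal replacement: '@' becomes '$'.
literal = s.str.replace("@", "$", regex=False)

# Regular-expression replacement: remove every run of whitespace.
no_spaces = s.str.replace(r"\s+", "", regex=True)

print(literal.tolist())
print(no_spaces.tolist())
```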

repeat(value)

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s.str.repeat(2))

Output:

    0                      Tom Tom 
    1     William Rick William Rick
    2                      JohnJohn
    3                Alber@tAlber@t
    dtype: object

count(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print("Number of 'm' characters in each string:")
    print(s.str.count('m'))

Output:

    Number of 'm' characters in each string:
    0    1
    1    1
    2    0
    3    0
    dtype: int64

startswith(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print("Strings that start with 'T':")
    print(s.str.startswith('T'))

Output:

    Strings that start with 'T':
    0     True
    1    False
    2    False
    3    False
    dtype: bool

endswith(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print("Strings that end with 't':")
    print(s.str.endswith('t'))

Output:

    Strings that end with 't':
    0    False
    1    False
    2    False
    3     True
    dtype: bool

find(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s.str.find('e'))

Output:

    0   -1
    1   -1
    2   -1
    3    3
    dtype: int64

A value of -1 indicates that the pattern was not found in that element.

findall(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s.str.findall('e'))

Output:

    0     []
    1     []
    2     []
    3    [e]
    dtype: object

An empty list ([]) indicates that the pattern was not found in that element.
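The "no match" markers of find() and findall() can be put to use directly. The sketch below, on sample data without the stray spaces, keeps only the elements where find() did not return -1, and passes a regular expression to findall() to collect the capital letters of each element. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["Tom", "William Rick", "John", "Alber@t"])

# Position of the first 'i' in each element, -1 when absent.
positions = s.str.find("i")

# Keep only the elements in which 'i' occurs.
found = s[positions != -1]

# findall() accepts a regular expression; here, any capital letter.
capitals = s.str.findall(r"[A-Z]")

print(positions.tolist())
print(found.tolist())
print(capitals.tolist())
```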

swapcase()

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.swapcase())

Output:

    0             tOM
    1    wILLIAM rICK
    2            jOHN
    3         aLBER@T
    dtype: object

islower()

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.islower())

Output:

    0    False
    1    False
    2    False
    3    False
    dtype: bool

isupper()

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.isupper())

Output:

    0    False
    1    False
    2    False
    3    False
    dtype: bool

isnumeric()

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.isnumeric())

Output:

    0    False
    1    False
    2    False
    3    False
    dtype: bool
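The three checks above return False for every element because each sample string mixes upper and lower case. For contrast, here is a small sketch with made-up values chosen so that each check succeeds once:

```python
import pandas as pd

mixed = pd.Series(["tom", "JOHN", "1234", "Alber@t"])

lower_flags = mixed.str.islower()      # True only for 'tom'
upper_flags = mixed.str.isupper()      # True only for 'JOHN'
numeric_flags = mixed.str.isnumeric()  # True only for '1234'

print(lower_flags.tolist())
print(upper_flags.tolist())
print(numeric_flags.tolist())
```

A string of digits has no cased characters, so both islower() and isupper() report False for '1234'.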
