Working With Text

text1 = "Ethics are built right into the ideals and objectives of the United Nations "
len(text1) # The length of text1
76
text2 = text1.split(' ') # Return a list of the words in text1, separated by ' '.
len(text2)
14
text2
['Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations', '']
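
The empty string at the end of the list comes from the trailing space in text1. A minimal sketch of the difference between split(' ') and split() with no argument (the variable names tokens_with_empty and tokens_clean are illustrative, not from the course):

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# Splitting on a single space keeps an empty string for the trailing space.
tokens_with_empty = text1.split(' ')

# Splitting with no argument splits on runs of whitespace and drops empty strings.
tokens_clean = text1.split()

print(len(tokens_with_empty))  # 14
print(len(tokens_clean))       # 13
```

If the empty token is unwanted, split() with no argument is usually the simpler choice.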

List comprehension allows us to find specific words:

[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 letters
['Ethics', 'built', 'right', 'into', 'ideals', 'objectives', 'United', 'Nations']
[w for w in text2 if w.istitle()] # Capitalized words in text2
['Ethics', 'United', 'Nations']
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'
['Ethics', 'ideals', 'objectives', 'Nations']

We can find unique words using set().

text3 = 'To be or not to be'
text4 = text3.split(' ')
len(text4)
6
len(set(text4))
5
set(text4)
{'To', 'be', 'not', 'or', 'to'}
len(set([w.lower() for w in text4])) # .lower() returns a lowercase copy of the string.
4
set([w.lower() for w in text4])
{'be', 'not', 'or', 'to'}
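
The same result can be written as a set comprehension, which skips the intermediate list. A small sketch (unique_lower is an illustrative name):

```python
text3 = 'To be or not to be'
text4 = text3.split(' ')

# Set comprehension: lowercase each word and collect the distinct values.
unique_lower = {w.lower() for w in text4}

print(len(unique_lower))     # 4
print(sorted(unique_lower))  # ['be', 'not', 'or', 'to']
```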

Processing free-text

text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')
text6
['"Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations"', '#UNSG', '@', 'NY', 'Society', 'for', 'Ethical', 'Culture', 'bit.ly/2guVelr']

Finding hashtags:

[w for w in text6 if w.startswith('#')]
['#UNSG']

Finding callouts:

[w for w in text6 if w.startswith('@')]
['@']

Checking only the first character also matches a bare '@', which is not a callout. Regular expressions help with this kind of more complex parsing.

For example, '@[A-Za-z0-9_]+' matches any word that contains '@' followed by at least one:

  • capital letter ('A-Z')
  • lowercase letter ('a-z')
  • digit ('0-9')
  • or underscore ('_')

text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')
import re # import re - a module that provides support for regular expressions
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]
['@UN', '@UN_Women']
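
The pattern can also be applied to the whole string at once with re.findall, which avoids the split step. The sketch below also tries the \w shorthand, assuming ASCII-only text like this tweet, where \w matches the same characters as [A-Za-z0-9_] (callouts and callouts_shorthand are illustrative names):

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall scans the whole string, so no split is needed.
callouts = re.findall(r'@[A-Za-z0-9_]+', text7)

# \w is shorthand for a word character (letters, digits, underscore).
callouts_shorthand = re.findall(r'@\w+', text7)

print(callouts)            # ['@UN', '@UN_Women']
print(callouts_shorthand)  # ['@UN', '@UN_Women']
```

The bare '@' before 'NY' is skipped because the pattern requires at least one character after it.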

Working with Text Data in pandas

import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

# find the number of characters for each string in df['text']
df['text'].str.len()
0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()
0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')
0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

# find how many times a digit occurs in each string
df['text'].str.count(r'\d')
0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

# find all occurrences of the digits
df['text'].str.findall(r'\d')
0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')
0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object

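# replace weekdays with '???' (regex=True is required in pandas 2.0+, where str.replace defaults to literal matching)
df['text'].str.replace(r'\w+day\b', '???', regex=True)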
0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

# replace weekdays with 3 letter abbreviations (a callable replacement only works with regex=True)
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)
0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
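
The replacement function is called once per match and receives a match object. A minimal sketch of the same idea on a single string, using the standard-library re.sub rather than pandas (sentence, abbreviate and abbreviated are illustrative names):

```python
import re

sentence = "Monday: The doctor's appointment is at 2:45pm."

def abbreviate(match):
    # match.group(1) is the text captured by the first parenthesised group.
    return match.group(1)[:3]

abbreviated = re.sub(r'(\w+day\b)', abbreviate, sentence)

print(abbreviated)  # Mon: The doctor's appointment is at 2:45pm.
```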

# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')
    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10

# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')
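
To show what the named-group version returns, here is a small sketch on two of the sentences above, one with a single time and one with two. The names s, pattern and matches are illustrative; the printed layout may vary slightly between pandas versions.

```python
import pandas as pd

# One sentence with a single time and one sentence with two times.
s = pd.Series(["Monday: The doctor's appointment is at 2:45pm.",
               "Friday: Take the train at 08:10 am, arrive at 09:00am."])

pattern = r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))'
matches = s.str.extractall(pattern)

# One row per match; the columns take their names from the named groups.
print(matches)
print(list(matches.columns))     # ['time', 'hour', 'minute', 'period']
print(matches['time'].tolist())  # ['2:45pm', '08:10 am', '09:00am']
```

The result is indexed by the original row label plus a 'match' level, so the second sentence contributes two rows.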
