Applied Text Mining in Python Week 1 (notes)

Working With Text

text1 = "Ethics are built right into the ideals and objectives of the United Nations "
len(text1) # The length of text1
76

text2 = text1.split(' ') # Return a list of the words in text1, separated by ' '.
len(text2)
14
text2
['Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations', '']
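The empty string at the end of the list comes from the trailing space in text1. As a minimal sketch of the difference, calling split() with no argument splits on runs of whitespace and drops empty strings (the names by_space and by_whitespace are only illustrative):

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# split(' ') splits on every single space, so the trailing space yields ''
by_space = text1.split(' ')

# split() with no argument splits on runs of whitespace and drops empty strings
by_whitespace = text1.split()

print(len(by_space), repr(by_space[-1]))            # 14 ''
print(len(by_whitespace), repr(by_whitespace[-1]))  # 13 'Nations'
```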

List comprehension allows us to find specific words:

[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 letters
['Ethics', 'built', 'right', 'into', 'ideals', 'objectives', 'United', 'Nations']

[w for w in text2 if w.istitle()] # Capitalized words in text2
['Ethics', 'United', 'Nations']

[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'
['Ethics', 'ideals', 'objectives', 'Nations']

We can find unique words using set().

text3 = 'To be or not to be'
text4 = text3.split(' ')
len(text4)
6
len(set(text4))
5
set(text4)
{'To', 'be', 'not', 'or', 'to'}
len(set([w.lower() for w in text4])) # .lower() converts each word to lowercase.
4
set([w.lower() for w in text4])
{'be', 'not', 'or', 'to'}
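The same result can be written as a set comprehension, which avoids building the intermediate list. This is only a sketch of an equivalent form; unique_words is an illustrative name, and sorted() is used just to get a stable display order, since sets are unordered.

```python
text3 = 'To be or not to be'

# A set comprehension lowercases each word and collects the distinct values
unique_words = {w.lower() for w in text3.split(' ')}

print(len(unique_words))     # 4
print(sorted(unique_words))  # ['be', 'not', 'or', 'to']
```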

Processing free-text

text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')
text6
['"Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations"', '#UNSG', '@', 'NY', 'Society', 'for', 'Ethical', 'Culture', 'bit.ly/2guVelr']

Finding hashtags:

[w for w in text6 if w.startswith('#')]
['#UNSG']

Finding callouts:

[w for w in text6 if w.startswith('@')]
['@']

We can use regular expressions to help us with more complex parsing.

For example, '@[A-Za-z0-9_]+' matches an '@' followed by one or more characters, each of which is:

a capital letter ('A-Z'),
a lowercase letter ('a-z'),
a digit ('0-9'),
or an underscore ('_').

A bare '@' with nothing after it does not match.

text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')
import re # re is the standard-library module for regular expressions
[w for w in text8 if re.search(r'@[A-Za-z0-9_]+', w)]
['@UN', '@UN_Women']
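As an alternative sketch, re.findall can pull the callouts straight out of the full string without splitting it first. Here \w is used as shorthand for the character class above; note that in Python 3 it also matches non-ASCII word characters, so it is slightly broader than [A-Za-z0-9_]. The names callouts and hashtags are illustrative.

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall returns every non-overlapping match in the string, left to right
callouts = re.findall(r'@\w+', text7)
hashtags = re.findall(r'#\w+', text7)

print(callouts)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']
```

Keep in mind that re.search looks for the pattern anywhere in a word, so a token such as an email address would also count as a match; re.match only matches at the start of the string.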

Working with Text Data in pandas

import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

# find the number of characters for each string in df['text']
df['text'].str.len()
0    46
1    50
2    49
3    49
4    54
Name: text, dtype: int64

# find the number of tokens for each string in df['text']
df['text'].str.split().str.len()
0     7
1     8
2     8
3    10
4    10
Name: text, dtype: int64

# find which entries contain the word 'appointment'
df['text'].str.contains('appointment')
0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool
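The boolean Series returned by str.contains is typically used as a mask to select rows. A minimal sketch, assuming the same df as above (mask and appointments are illustrative names):

```python
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns=['text'])

# Boolean mask: True where the text contains 'appointment'
mask = df['text'].str.contains('appointment')

# Indexing with the mask keeps only the matching rows
appointments = df[mask]

print(mask.sum())                  # 2
print(appointments.index.tolist()) # [0, 1]
```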

# find how many times a digit occurs in each string
df['text'].str.count(r'\d')
0    3
1    4
2    3
3    4
4    8
Name: text, dtype: int64

# find all occurrences of the digits
df['text'].str.findall(r'\d')
0                   [2, 4, 5]
1                [1, 1, 3, 0]
2                   [7, 0, 0]
3                [1, 1, 1, 5]
4    [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object

# group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')
0               [(2, 45)]
1              [(11, 30)]
2               [(7, 00)]
3              [(11, 15)]
4    [(08, 10), (09, 00)]
Name: text, dtype: object
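str.findall behaves like re.findall applied to each string: with more than one capture group, each match becomes a tuple of the groups. The sketch below shows this on a single sentence with plain re and converts the string pairs to numbers (it deliberately ignores am/pm; pairs and minutes_since_midnight are illustrative names).

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# Two capture groups, so each match is an (hour, minute) tuple of strings
pairs = re.findall(r'(\d?\d):(\d\d)', sentence)
print(pairs)  # [('08', '10'), ('09', '00')]

# The groups are strings; convert them before doing arithmetic
minutes_since_midnight = [int(h) * 60 + int(m) for h, m in pairs]
print(minutes_since_midnight)  # [490, 540]
```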

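# replace weekdays with '???' (regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default)
df['text'].str.replace(r'\w+day\b', '???', regex=True)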
0          ???: The doctor's appointment is at 2:45pm.
1       ???: The dentist's appointment is at 11:30 am.
2          ???: At 7:00pm, there is a basketball game!
3         ???: Be back home by 11:15 pm at the latest.
4    ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

# replace weekdays with 3-letter abbreviations (a callable replacement requires regex=True)
df['text'].str.replace(r'(\w+day\b)', lambda m: m.group(1)[:3], regex=True)
0          Mon: The doctor's appointment is at 2:45pm.
1       Tue: The dentist's appointment is at 11:30 am.
2          Wed: At 7:00pm, there is a basketball game!
3         Thu: Be back home by 11:15 pm at the latest.
4    Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object

# create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')
    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10

# extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

# extract the entire time, the hours, the minutes, and the period with group names
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')
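With named groups, extractall returns a DataFrame whose columns carry the group names, with one row per match, indexed by the original row number and a match counter. The sketch below inspects that result and, as an illustrative extra step that is not part of the original notes, derives a 24-hour value in a hypothetical hour24 column.

```python
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns=['text'])

pattern = r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))'
times = df['text'].str.extractall(pattern)

print(list(times.columns))  # ['time', 'hour', 'minute', 'period']
print(len(times))           # 6 (the Friday sentence contains two times)

# Extracted groups are strings; convert before doing arithmetic.
# hour % 12 maps 12 to 0, then 12 is added back for 'pm'.
hour = times['hour'].astype(int)
is_pm = (times['period'] == 'pm').astype(int)
times['hour24'] = hour % 12 + 12 * is_pm

print(times['hour24'].tolist())  # [14, 11, 19, 23, 8, 9]
```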
