Text Analysis with NLTK

November 9, 2020
筑楼站长

NLTK (Natural Language Toolkit) is a powerful Python package that provides a set of natural language processing algorithms, including tokenization, part-of-speech tagging, stemming, named entity recognition, and classification.

Installing and importing NLTK:

pip install nltk

import nltk
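
Several of the examples in this post rely on data packages (tokenizer models, the stop word corpus, WordNet, the tagger model) that are not installed together with the library itself. A minimal sketch of a one-time setup step follows. The helper name `download_nltk_resources` is hypothetical, and the resource identifiers are assumptions that depend on the installed NLTK version: newer releases look for `punkt_tab` and `averaged_perceptron_tagger_eng` rather than the older names, so both are listed.

```python
# Data packages assumed to be needed by the examples in this post.
# The exact identifiers vary between NLTK releases, so both the older
# and the newer names are listed here.
NLTK_RESOURCES = (
    "punkt",                           # sentence/word tokenizer models
    "punkt_tab",                       # tokenizer data used by newer releases
    "stopwords",                       # stop word lists
    "wordnet",                         # dictionary used by WordNetLemmatizer
    "averaged_perceptron_tagger",      # POS tagger model
    "averaged_perceptron_tagger_eng",  # POS tagger data used by newer releases
)


def download_nltk_resources(resources=NLTK_RESOURCES):
    """Download each data package and return the names that failed."""
    import nltk  # imported here so this file loads even before `pip install nltk`

    failed = []
    for name in resources:
        # nltk.download returns True on success or if already up to date
        if not nltk.download(name, quiet=True):
            failed.append(name)
    return failed
```

Calling `download_nltk_resources()` once, with network access, should be enough; an empty list as the return value means every package was fetched.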

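I. Tokenization

Text is made up of paragraphs, paragraphs are made up of sentences, and sentences are made up of words. Tokenization is the first step of text analysis: it breaks a passage of text into smaller units such as sentences or words, and each unit is called a token. A token is therefore a word within a sentence, or a sentence within a paragraph. NLTK supports both sentence tokenization and word tokenization.

1. Sentence tokenization

Sentence tokenization splits a paragraph into sentences: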

from nltk.tokenize import sent_tokenize

# Sample paragraph; sent_tokenize needs the Punkt tokenizer data to be installed
text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)

Result of sentence tokenization:

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
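
To see why a dedicated sentence tokenizer is worth using, it can help to compare it with a naive approach. The sketch below is not how NLTK works; it is a plain regular-expression splitter (the function name `naive_sent_split` is made up for this illustration) that breaks after every period, question mark, or exclamation mark followed by whitespace.

```python
import re

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""


def naive_sent_split(paragraph):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return re.split(r"(?<=[.?!])\s+", paragraph.strip())


sentences = naive_sent_split(text)
for s in sentences:
    print(repr(s))
```

This yields five pieces rather than four, because the abbreviation "Mr." is mistaken for the end of a sentence. In the result above, sent_tokenize keeps "Hello Mr. Smith, how are you doing today?" in one piece.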

2. Word tokenization

Word tokenization splits text into words:

from nltk.tokenize import word_tokenize

# Same sample paragraph as in the sentence tokenization example
text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = word_tokenize(text)
print(tokenized_text)

Result of word tokenization:

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']

Note that after tokenization, punctuation marks appear in the result as separate tokens.

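II. Processing Tokens

Processing the tokens involves removing punctuation, removing stop words, and normalizing the vocabulary.

1. Removing punctuation

Apply the translation below to each token to remove its punctuation. string.punctuation contains the ASCII punctuation characters, and the translation table replaces each of them with a space.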

import string

s = 'abc.'
# Map every ASCII punctuation character to a space
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
print(s.translate(table))
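
Applied to a whole token list, there are at least two reasonable ways to go about this, sketched below with only the standard library. Both function names are hypothetical. The first drops tokens that consist solely of punctuation; the second deletes punctuation characters (rather than replacing them with spaces, which is a deliberate variation on the snippet above) and discards tokens that end up empty.

```python
import string

PUNCTUATION = set(string.punctuation)  # ASCII punctuation only


def drop_punctuation_tokens(tokens):
    """Keep tokens that contain at least one non-punctuation character."""
    return [t for t in tokens if not all(ch in PUNCTUATION for ch in t)]


def strip_punctuation_chars(tokens):
    """Delete punctuation characters inside tokens, discarding empty results."""
    table = str.maketrans("", "", string.punctuation)
    stripped = (t.translate(table) for t in tokens)
    return [t for t in stripped if t]


tokens = ["Hello", "Mr.", "Smith", ",", "pinkish-blue", ".", "should", "n't"]
print(drop_punctuation_tokens(tokens))
print(strip_punctuation_chars(tokens))
```

The first variant leaves "Mr.", "pinkish-blue", and "n't" untouched, while the second turns them into "Mr", "pinkishblue", and "nt". Which behavior is preferable depends on the later processing steps.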

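2. Removing stop words

Stop words are the noise words of a text: they are very frequent but carry little meaning on their own. Common English stop words include is, am, are, this, a, an, and the. NLTK's corpus collection includes a stop word list, and it is up to the user to filter those words out of the token list.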

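import nltk
from nltk.corpus import stopwords

# A set makes membership tests fast
stop_words = set(stopwords.words("english"))

# `text` is the sample paragraph defined in the tokenization examples
word_tokens = nltk.tokenize.word_tokenize(text.strip())
# The stop word list is lowercase, so compare lowercased tokens
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]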
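
The filtering step itself can be tried without any corpus download. The sketch below uses a tiny hand-picked stop word set as a stand-in for stopwords.words("english") (the set contents and the function name `remove_stop_words` are assumptions for illustration), and shows why letter case matters when the stop word list is lowercase.

```python
# A tiny hand-picked stop word set, standing in for stopwords.words("english")
STOP_WORDS = {"is", "am", "are", "this", "a", "an", "the", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["The", "weather", "is", "great", "and", "city", "is", "awesome"]
print(remove_stop_words(tokens))
# A case-sensitive check would keep "The", because the set is lowercase
print([t for t in tokens if t not in STOP_WORDS])
```

The first line printed contains only weather, great, city, and awesome; the second still contains "The".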

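III. Lexicon Normalization

Lexicon normalization reduces the derived forms of a word to a common base form. Stemming strips affixes to produce a stem, which is not necessarily a dictionary word, while lemmatization maps a word to its dictionary form (lemma). NLTK offers both: PorterStemmer for stemming and WordNetLemmatizer for lemmatization.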

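# Lemmatizer backed by WordNet (needs the WordNet data to be installed)
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

# Rule-based Porter stemmer (no extra data needed)
from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "flying"
# "v" tells the lemmatizer to treat the word as a verb
print("Lemmatized Word:", lem.lemmatize(word, "v"))
print("Stemmed Word:", stem.stem(word))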
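
The difference between the two approaches can be illustrated with a deliberately simplified sketch. Neither function below is the Porter algorithm or WordNet: `toy_stem` is a made-up suffix stripper and `toy_lemmatize` is a lookup in a hypothetical four-entry dictionary. The point is only that a stemmer applies string rules, while a lemmatizer consults a vocabulary.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule set, checked in order

# Hypothetical mini-dictionary standing in for a real lexical database
LEMMAS = {"flying": "fly", "flew": "fly", "flies": "fly", "studies": "study"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned as-is."""
    return LEMMAS.get(word, word)


for w in ("flying", "flew", "studies"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based function leaves the irregular form "flew" unchanged and turns "studies" into the non-word "studi", whereas the dictionary lookup returns "fly" and "study". The price is that a lemmatizer only knows the words in its dictionary.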

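IV. Part-of-Speech Tagging

The main goal of part-of-speech (POS) tagging is to identify the grammatical category of each word, such as noun, verb, or adjective. The tagger uses the word's context within the sentence to assign it the appropriate tag.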

sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens = nltk.word_tokenize(sent)
# pos_tag returns a list of (word, tag) tuples
print(nltk.pos_tag(tokens))
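
Since nltk.pos_tag returns a list of (word, tag) tuples, the result is easy to post-process with ordinary Python. The sketch below groups words by tag. The `tagged` list is a hand-written sample using Penn Treebank style tags, not captured tagger output, so the tags a real run assigns may differ; `group_by_tag` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written sample in the (word, tag) shape that nltk.pos_tag returns;
# the tags are illustrative, not captured tagger output.
tagged = [
    ("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN"),
    ("in", "IN"), ("Ulm", "NNP"), (",", ","), ("Germany", "NNP"),
    ("in", "IN"), ("1879", "CD"), (".", "."),
]


def group_by_tag(tagged_tokens):
    """Map each tag to the list of words carrying it, in sentence order."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[tag].append(word)
    return dict(groups)


groups = group_by_tag(tagged)
print(groups["NNP"])  # proper nouns
print(groups["CD"])   # cardinal numbers
```

Grouping like this is a simple way to pull out, for example, all proper nouns of a sentence before moving on to named entity recognition.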

V. Classification

Omitted.

