NLTK (Natural Language Toolkit) is a powerful Python package that provides a set of natural language processing algorithms, such as tokenization, part-of-speech (POS) tagging, stemming, named entity recognition, and classification.
Installing and importing NLTK
    pip install nltk

    import nltk
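I. Tokenization
Text is made up of paragraphs, paragraphs are made up of sentences, and sentences are made up of words. Tokenization is the first step of text analysis: it breaks a passage of text into smaller units such as words or sentences. Each unit is called a token, so a token can be a word within a sentence or a sentence within a paragraph. NLTK supports both sentence tokenization and word tokenization.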
1. Sentence tokenization
Sentence tokenization splits a paragraph into sentences:
    from nltk.tokenize import sent_tokenize

    # Needs the Punkt models: nltk.download("punkt") ("punkt_tab" on newer NLTK releases)
    text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
    tokenized_text = sent_tokenize(text)
    print(tokenized_text)
Result of sentence tokenization:
    ['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
2. Word tokenization
Word tokenization splits a sentence into words:
    from nltk.tokenize import word_tokenize

    text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
    tokenized_text = word_tokenize(text)
    print(tokenized_text)
Result of word tokenization:
    ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
    'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
    'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
As the output shows, punctuation marks are kept in the result as separate tokens.
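To make it clearer why punctuation shows up as separate tokens, here is a minimal sketch of a word tokenizer built only on the standard library's re module. It is not NLTK's algorithm: the function name simple_tokenize and the regular expression are assumptions for illustration, and unlike word_tokenize it splits "Mr." into "Mr" and "." and leaves contractions such as "shouldn't" as one token.

```python
import re

# Hypothetical pattern: a run of word characters (optionally joined by
# hyphens or apostrophes), or any single non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("The sky is pinkish-blue."))
# ['The', 'sky', 'is', 'pinkish-blue', '.']
```

Because each punctuation character matches the second alternative of the pattern, it comes out as a token of its own, which is the same effect seen in the NLTK output.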
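II. Processing tokens
Processing the tokens involves three steps: removing punctuation, removing stop words, and normalizing the vocabulary.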
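1. Removing punctuation
Apply the following to each token to strip punctuation from the string. string.punctuation contains all ASCII punctuation characters, and the translation table replaces each of them with a space.
    import string

    s = "abc."
    # Map every punctuation character to a space; returns "abc "
    s.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))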
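Applied to a whole token list, the same idea might look like the sketch below. The helper name strip_punctuation is hypothetical, and the choice to split on the inserted spaces (which also breaks hyphenated tokens apart and drops tokens that were pure punctuation) is an assumption rather than something the original text prescribes.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def strip_punctuation(tokens):
    """Replace punctuation with spaces in each token, then keep the non-empty pieces."""
    cleaned = []
    for token in tokens:
        # split() discards the inserted spaces, so "," vanishes entirely
        # and "pinkish-blue" becomes two tokens.
        cleaned.extend(token.translate(PUNCT_TO_SPACE).split())
    return cleaned

tokens = ["Hello", "Mr.", "Smith", ",", "pinkish-blue", "."]
print(strip_punctuation(tokens))
# ['Hello', 'Mr', 'Smith', 'pinkish', 'blue']
```

If hyphenated words should stay intact, one option is to filter out tokens that consist only of punctuation instead of translating every token.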
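2. Removing stop words
Stop words are noise words in text that carry little meaning on their own. Common English stop words include is, am, are, this, a, an, and the. NLTK's corpus collection includes a stop word list, and the user has to filter these words out of the token list explicitly.
    import nltk
    from nltk.corpus import stopwords

    # Needs the stop word corpus: nltk.download("stopwords")
    stop_words = stopwords.words("english")
    # text is the sample paragraph from the tokenization examples
    word_tokens = nltk.tokenize.word_tokenize(text.strip())
    filtered_sentence = [w for w in word_tokens if w not in stop_words]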
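One detail worth noting: NLTK's English stop word list is lowercase, so a case-sensitive comparison keeps capitalized tokens such as "The". The sketch below shows a case-insensitive filter. To stay self-contained it uses a small hypothetical set, STOP_WORDS, as a stand-in for stopwords.words("english"), and the helper name remove_stop_words is likewise an assumption.

```python
# Hypothetical stand-in for stopwords.words("english"); the real list is much longer.
STOP_WORDS = {"is", "am", "are", "this", "a", "an", "the"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "sky", "is", "pinkish-blue"]
print(remove_stop_words(tokens))
# ['sky', 'pinkish-blue']
```

Storing the stop words in a set rather than a list also makes each membership check constant time, which helps on longer texts.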
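III. Lexicon Normalization
Lexicon normalization reduces the derived forms of a word to a common base form. NLTK offers two approaches: stemming with the Porter stemmer, which strips affixes to produce a stem, and lemmatization with WordNet, which maps a word to its dictionary form.
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.stem.porter import PorterStemmer

    # The lemmatizer needs the WordNet corpus: nltk.download("wordnet")
    lem = WordNetLemmatizer()
    stem = PorterStemmer()

    word = "flying"
    print("Lemmatized Word:", lem.lemmatize(word, "v"))  # "v" marks the word as a verb
    print("Stemmed Word:", stem.stem(word))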
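As a further illustration of how stemming collapses derived forms, the sketch below runs the Porter stemmer over three related words. The word list is chosen here for illustration and does not come from the original text. PorterStemmer needs no extra corpus download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Three derived forms that share the same stem.
words = ["connection", "connected", "connecting"]
stems = [stemmer.stem(w) for w in words]
print(stems)
# ['connect', 'connect', 'connect']
```

Since stemming is rule-based suffix stripping, the result is not always a dictionary word. Lemmatization is usually the better fit when readable base forms are needed.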
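IV. Part-of-speech tagging
The main goal of part-of-speech (POS) tagging is to identify the grammatical category of a given word, such as noun, verb, or adjective. The tagger looks at the relationships within the sentence and assigns each word a corresponding tag.
    import nltk

    # Needs a tagger model, e.g. nltk.download("averaged_perceptron_tagger")
    sent = "Albert Einstein was born in Ulm, Germany in 1879."
    tokens = nltk.word_tokenize(sent)
    print(nltk.pos_tag(tokens))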
V. Classification
Omitted.
References: