在机器学习领域中,朴素贝叶斯是一种基于贝叶斯定理的简单概率分类器, 朴素贝叶斯在处理文本数据时可以得到较好的分类结果,被广泛应用于文本分类/垃圾邮件过滤/自然语言处理等场景。

使用Python进行文本分类

判断某句话是否为正常言论

导入数据

1
2
3
4
5
6
7
8
9
10
def loadDataSet():
postingList =
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
['mr', 'licks', 'ate', 'my', 'steak','how', 'to', 'stop', 'him'],
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
classVec = [0, 1, 0, 1, 0, 1] # 1 侮辱性文字, 0 正常言论
return postingList, classVec

创建字典

1
2
3
4
5
def createVocabList(dataSet):
vocabSet = set([]) # create empty set
for document in dataSet:
vocabSet = vocabSet | set(document) # union of the two sets
return list(vocabSet)

将某句话转换为向量

1
2
3
4
5
6
7
8
9
def setOfWords2Vec(vocabList, inputSet):
returnVec = [0]*len(vocabList)
for word in inputSet:
if word in vocabList:
returnVec[vocabList.index(word)] = 1
#词集模型 某个单词只能出现一次
else:
print("the word: %s is not in my Vocabulary!" % word)
return returnVec

从词向量计算概率

1
2
from numpy import *
import math
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def trainNB0(trainMatrix, trainCategory):
numTrainDocs = len(trainMatrix)
numWords = len(trainMatrix[0])
pAbusive = sum(trainCategory)/float(numTrainDocs)
p0Num = ones(numWords)
p1Num = ones(numWords) # change to ones()
p0Denom = 2.0
p1Denom = 2.0 # change to 2.0
for i in range(numTrainDocs):
if trainCategory[i] == 1:
p1Num += trainMatrix[i]
p1Denom += sum(trainMatrix[i])
else:
p0Num += trainMatrix[i]
p0Denom += sum(trainMatrix[i])
p1Vect = log(p1Num/p1Denom) # change to log()
p0Vect = log(p0Num/p0Denom) # change to log()
return p0Vect, p1Vect, pAbusive

参考文档