Latent Semantic Analysis
Latent Semantic Analysis (LSA; sometimes also called LSI, Latent Semantic Indexing) is a model commonly used in text modeling and analysis. Proposed by Deerwester, Dumais et al. in 1990, it applies two tools from matrix computation, the singular value decomposition (SVD) and low-rank approximation, to the representation matrix of a document collection, mapping documents into a latent semantic space and thereby achieving both a latent semantic representation of the documents and dimensionality reduction.
1. Document Representation
Document representation is a basic problem in text analysis: it converts human-readable documents into machine-readable data structures. A common approach is the vector space model (VSM), a simple and effective representation that is widely used in information retrieval systems.
The vector space model represents each document as a vector: every dimension corresponds to a term in the vocabulary, and the corresponding value is the (weighted) count of that term in the document,

$$d_j = (w_{1,j}, w_{2,j}, \dots, w_{M,j})^T,$$

where $w_{i,j}$ is the weight (e.g., the raw or tf-idf-weighted frequency) of term $i$ in document $j$, and $M$ is the vocabulary size.

With this representation it is easy to compute the similarity between two documents, or between a document and a query. The usual measure is cosine similarity (note that cosine is not a strict distance metric; in particular, the induced dissimilarity $1 - \cos$ does not satisfy the triangle inequality):

$$\mathrm{sim}(d_i, d_j) = \cos(d_i, d_j) = \frac{d_i^T d_j}{\|d_i\|_2 \, \|d_j\|_2}.$$

The VSM is simple and computationally convenient: document similarity coefficients follow directly from the vector representation.
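As a small illustration, cosine similarity between two term-count vectors takes only a few lines (the counts below are made up for the example):

```python
import numpy as np

# Two documents as term-count vectors over a shared 4-term vocabulary
# (hypothetical counts; any weighting such as tf-idf works the same way).
d1 = np.array([3.0, 0.0, 1.0, 2.0])
d2 = np.array([1.0, 2.0, 0.0, 2.0])

def cosine_sim(a, b):
    """Cosine similarity: dot product over the product of the norms."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(d1, d2))  # in [0, 1] for non-negative count vectors
```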
2. SVD
The singular value decomposition factors a matrix into the product of three matrices with special structure: for $X \in \mathbb{R}^{M \times N}$,

$$X = U \Sigma V^T,$$

where $U \in \mathbb{R}^{M \times M}$ is orthogonal and its columns are the left singular vectors; $\Sigma \in \mathbb{R}^{M \times N}$ is a rectangular diagonal matrix whose non-negative diagonal entries $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ are the singular values; and $V \in \mathbb{R}^{N \times N}$ is orthogonal, with the right singular vectors as its columns.
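A minimal NumPy sketch of the decomposition (the matrix below is an arbitrary example):

```python
import numpy as np

# A small term-document matrix X (M=4 terms, N=3 documents);
# the values are hypothetical term counts.
X = np.array([[2., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 3.]])

# Thin SVD: X = U @ diag(s) @ Vt, with orthonormal columns in U,
# orthonormal rows in Vt, and s the singular values in non-increasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The factorization reconstructs X exactly (up to floating-point error).
assert np.allclose(U @ np.diag(s) @ Vt, X)
# Columns of U are orthonormal.
assert np.allclose(U.T @ U, np.eye(3))
```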
3. Low-Rank Approximation
Low-rank matrix approximation is a minimization problem: subject to a constraint on the rank of the approximating matrix, minimize a cost function measuring the difference between the original (data) matrix and the approximation,

$$\min_{\hat{X}} \; \|X - \hat{X}\|_F \quad \text{s.t.} \quad \operatorname{rank}(\hat{X}) \le k,$$

where $\|\cdot\|_F$ is the Frobenius norm, $\|A\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$. By the Eckart-Young theorem, the solution is given by the truncated SVD, keeping the $k$ largest singular values:

$$\hat{X} = U_k \Sigma_k V_k^T,$$

where $U_k$ and $V_k$ contain the first $k$ left and right singular vectors and $\Sigma_k$ holds the $k$ largest singular values.
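The truncation and its optimality are easy to check numerically (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))          # hypothetical data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
# Best rank-k approximation under the Frobenius norm (Eckart-Young):
# keep only the k largest singular values and their vectors.
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

assert np.linalg.matrix_rank(Xk) == k
# The approximation error equals the energy in the discarded singular values.
err = np.linalg.norm(X - Xk, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```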
4. LSA
As noted above, LSA applies SVD and low-rank approximation to text analysis. The model computes the SVD of the representation matrix of the document collection (the term-document matrix $X$) and approximates it by a rank-$K$ truncation,

$$X \approx X_K = U_K \Sigma_K V_K^T,$$

which yields the representation matrices in the $K$-dimensional latent space: $U_K$ relates terms to latent semantics (term-topic), and $\Sigma_K V_K^T$ gives the coordinates of the documents in the latent semantic space (topic-document).
![1-1](http://www.52cs.org/wp-content/uploads/2014/08/1-1-300x93.png)
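The whole LSA pipeline, from term-document matrix to similarities in the latent space, can be sketched as follows (the toy matrix and $K$ are made up for the example):

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents);
# the counts are hypothetical.
X = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [0., 0., 0., 1.]])

K = 2                                   # dimension of the latent space
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, Sk, Vtk = U[:, :K], np.diag(s[:K]), Vt[:K, :]

# Document representations in the K-dim latent semantic space:
# the columns of Sk @ Vtk (one K-vector per document).
docs_latent = Sk @ Vtk

def cos(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarities are now computed between K-dim vectors, not M-dim ones.
sim_01 = cos(docs_latent[:, 0], docs_latent[:, 1])
```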
5. Extensions
LSI is limited by the form of the factorization itself: the SVD constrains $U$ and $V$ to be orthogonal, the resulting factors are dense and hard to interpret, and computing the SVD is expensive for large collections. Extensions therefore replace the SVD with other factorizations of the term-document matrix, such as non-negative matrix factorization (NMF) [3, 4, 6] and the regularized variants below.
![1-2](http://www.52cs.org/wp-content/uploads/2014/08/1-2-300x201.png)
5.1 RLSI (Regularized LSI)
Building on LSI, RLSI [Wang et al., 2013] adds sparsity control to the topic representation of documents and to the topic-term relationship, while dropping LSI's orthogonality constraints on $U$ and $V$. Given the term-document matrix $D \in \mathbb{R}^{M \times N}$, RLSI solves

$$\min_{U, V} \; \|D - UV\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|u_k\|_1 + \lambda_2 \sum_{n=1}^{N} \|v_n\|_2^2, \qquad (6)$$

where $U \in \mathbb{R}^{M \times K}$ is the term-topic matrix with columns $u_k$, $V \in \mathbb{R}^{K \times N}$ is the topic-document matrix with columns $v_n$, the $\ell_1$ penalty (weight $\lambda_1$) makes the topics sparse, and the $\ell_2$ penalty (weight $\lambda_2$) regularizes the document representations.
![1-3](http://www.52cs.org/wp-content/uploads/2014/08/1-3-300x108.png)
The objective above is not jointly convex in $U$ and $V$, but it is convex in $U$ with $V$ fixed and convex in $V$ with $U$ fixed. It is therefore minimized by alternating optimization: repeatedly fix $V$ and update $U$, then fix $U$ and update $V$.
![1-4](http://www.52cs.org/wp-content/uploads/2014/08/1-4-300x81.png)
(1) Given $V$, update $U$.

With $V$ fixed, (6) reduces to

$$\min_{U} \; \|D - UV\|_F^2 + \lambda_1 \sum_{k} \|u_k\|_1. \qquad (8)$$

Let

$$S = V V^T \in \mathbb{R}^{K \times K}, \qquad R = D V^T \in \mathbb{R}^{M \times K}.$$

Equation (8) can then be decomposed into $M$ independent optimization problems, one per term $m$, each of the form

$$\min_{\bar{u}_m} \; \|\bar{d}_m - V^T \bar{u}_m\|_2^2 + \lambda_1 \|\bar{u}_m\|_1, \qquad (9)$$

where $\bar{u}_m^T$ and $\bar{d}_m^T$ are the $m$-th rows of $U$ and $D$. Equation (9) is an $\ell_1$-regularized least-squares (lasso) problem and can be solved by coordinate descent with soft-thresholding. Writing $w_{mk} = r_{mk} - \sum_{l \neq k} s_{kl} u_{ml}$, with $s_{kl}$ and $r_{mk}$ the entries of $S$ and $R$, each coordinate is updated as

$$u_{mk} = \frac{\operatorname{sign}(w_{mk}) \, \max\!\left(|w_{mk}| - \lambda_1/2, \; 0\right)}{s_{kk}}.$$
![1-5](http://www.52cs.org/wp-content/uploads/2014/08/1-5-300x138.png)
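A sketch of the per-row update via coordinate descent with soft-thresholding. This is a simplified reading of the batch RLSI $U$-step, not the paper's exact pseudocode, and the data are made up:

```python
import numpy as np

def soft(x, tau):
    """Soft-thresholding operator: sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def update_u_row(S, r, lam1, n_sweeps=200):
    """Coordinate descent for one l1-regularized row subproblem (9).

    S = V V^T (K x K, assumed positive definite); r is this term's row
    of R = D V^T.
    """
    K = len(r)
    u = np.zeros(K)
    for _ in range(n_sweeps):
        for k in range(K):
            # Residual correlation excluding coordinate k.
            w = r[k] - S[k] @ u + S[k, k] * u[k]
            u[k] = soft(w, lam1 / 2.0) / S[k, k]
    return u

# Usage on random data: with lam1 = 0 the update solves S u = r exactly.
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 10))
d = rng.normal(size=10)
S, r = V @ V.T, V @ d
u = update_u_row(S, r, lam1=0.0)
```

With a large $\lambda_1$ the soft-thresholding zeroes out every coordinate, which is the sparsity mechanism at work.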
(2) Given $U$, update $V$.

Analogous to problem (1), updating $V$ with $U$ fixed reduces to

$$\min_{V} \; \|D - UV\|_F^2 + \lambda_2 \sum_{n} \|v_n\|_2^2, \qquad (12)$$

which splits into one $\ell_2$-regularized least-squares (ridge) problem per document. Equation (12) has the closed-form solution

$$V = (U^T U + \lambda_2 I)^{-1} U^T D.$$
![1-6](http://www.52cs.org/wp-content/uploads/2014/08/1-6-300x89.png)
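The closed-form update can be written directly; all columns are handled at once (shapes and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 8, 5, 3
D = rng.random((M, N))      # hypothetical term-document matrix
U = rng.random((M, K))      # current term-topic matrix
lam2 = 0.1

# Closed-form ridge update: V = (U^T U + lam2 I)^(-1) U^T D,
# solved as a linear system rather than forming the inverse.
V = np.linalg.solve(U.T @ U + lam2 * np.eye(K), U.T @ D)
```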
5.2 Online-RLSI
When processing streaming document data, documents arrive one at a time, and the topics they contain change over time. Online-RLSI can effectively track this topic evolution (topic tracking). Analogous to (6), the objective at time $t$ averages the regularized reconstruction loss over the documents seen so far; the first term of (14) is this accumulated reconstruction error over $d_1, \dots, d_t$. In online-RLSI, documents are treated as mutually independent: when document $d_t$ arrives, Online-RLSI obtains its topic representation $v_t$ from (15) by projecting $d_t$ onto the topics learned so far, keeping the earlier projections $v_1, \dots, v_{t-1}$ fixed, and then refreshes the term-topic matrix by minimizing (16). By analogy to batch RLSI, (15) is an approximation of (12), and (16) is an approximation of (14).
(1) Document projection

The document projection problem is an $\ell_2$-regularized least-squares (ridge) problem, whose closed-form solution is

$$v_t = (U_{t-1}^T U_{t-1} + \lambda_2 I)^{-1} U_{t-1}^T d_t. \qquad (15)$$
(2) Updating the topic-term matrix

Equation (16) is equivalent to a batch-style problem in which the data enter only through accumulated sufficient statistics. Letting

$$S_t = S_{t-1} + v_t v_t^T, \qquad R_t = R_{t-1} + d_t v_t^T,$$

the matrix $U_t$ is obtained exactly as in the batch $U$-update, with $S_t$ and $R_t$ playing the roles of $V V^T$ and $D V^T$.
![1-7](http://www.52cs.org/wp-content/uploads/2014/08/1-7-300x134.png)
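Putting the two online steps together, the loop might look roughly as follows. This is an assumed reading of the scheme, not the paper's algorithm; the parameter values and document stream are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, lam1, lam2 = 10, 3, 0.01, 0.1

U = np.abs(rng.normal(size=(M, K)))   # initial term-topic matrix (arbitrary)
S = np.zeros((K, K))                  # accumulates sum of v v^T
R = np.zeros((M, K))                  # accumulates sum of d v^T

def soft(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

for t in range(20):                   # stream of hypothetical documents
    d = rng.random(M)
    # (15): l2-regularized projection onto the current topics.
    v = np.linalg.solve(U.T @ U + lam2 * np.eye(K), U.T @ d)
    S += np.outer(v, v)
    R += np.outer(d, v)
    # One batch-style coordinate-descent sweep over U using S, R
    # (epsilon guards against a zero diagonal early in the stream).
    for m in range(M):
        for k in range(K):
            w = R[m, k] - S[k] @ U[m] + S[k, k] * U[m, k]
            U[m, k] = soft(w, lam1 / 2.0) / (S[k, k] + 1e-12)
```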
6. Conclusion
Compared with pLSA, LDA, DP and the other models introduced later, the LSI model (and its extensions) is relatively simple: it analyzes the term-document matrix using standard matrix factorizations only. At the same time, LSI has no corresponding probabilistic generative process and cannot be expressed as a probabilistic graphical model.
7. References
[1]. Dumais, S. T. (2004). Latent semantic analysis. Annual Review of Information Science and Technology, 38(1), 188-230.
[2]. Wikipedia. Latent Semantic Analysis. http://en.wikipedia.org/wiki/Latent_semantic_indexing
[3]. Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems (pp. 556-562).
[4]. Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research, 5, 1457-1469.
[5]. Strang, G. (2003). Introduction to Linear Algebra. Cambridge Publication.
[6]. Wikipedia. Non-negative Matrix Factorization. http://en.wikipedia.org/wiki/Non-negative_matrix_factorization
[7]. Wikipedia. Singular Value Decomposition. http://en.wikipedia.org/wiki/Singular_value_decomposition
[8]. Wang, Q., Xu, J., Li, H., & Craswell, N. (2013). Regularized Latent Semantic Indexing: A New Approach to Large-Scale Topic Modeling. ACM Trans. Inf. Syst. 31.