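Bag-of-n-grams generates many more distinct n-grams than bag-of-words generates distinct words. This increases the cost of storing the features, as well as the computational cost of model training and prediction. The number of data points stays the same, but the feature space now has many more dimensions, so the data is much sparser. The higher n is, the higher the storage and computation costs and the sparser the data. For these reasons, longer n-grams do not always improve model accuracy (or any other performance measure). People usually stop at n = 2 or 3. Longer n-grams are rarely used.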
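To make the growth in dimensionality and sparsity concrete, here is a minimal sketch in plain Python. The toy corpus, the whitespace tokenizer, and the helper names `ngrams` and `ngram_stats` are assumptions made for illustration and are not from the text. "Density" here means the fraction of nonzero cells in the document-by-feature matrix.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_stats(docs, n):
    """Return (vocabulary size, density) for whitespace-tokenized docs.

    Density is the share of (document, n-gram) cells that are nonzero
    in a bag-of-n-grams matrix built over the given documents.
    """
    per_doc = [set(ngrams(doc.lower().split(), n)) for doc in docs]
    vocab = set().union(*per_doc)
    nonzero = sum(len(grams) for grams in per_doc)
    density = nonzero / (len(docs) * len(vocab))
    return len(vocab), density


# Hypothetical toy corpus: few distinct words, many distinct orderings.
docs = [
    "the cat saw the dog and the dog saw the cat",
    "the dog saw the cat and the cat saw the dog",
    "the cat and the dog saw the cat and the dog",
]

stats = {n: ngram_stats(docs, n) for n in (1, 2, 3)}
for n, (size, density) in stats.items():
    print(f"n={n}: {size} distinct n-grams, density={density:.2f}")
```

On this corpus the vocabulary grows from 5 unigrams to 8 bigrams to 12 trigrams, while the matrix density falls from 1.00 to about 0.83 to about 0.67. The number of documents never changes. On real text the growth is typically much steeper, which is why n is usually capped at 2 or 3.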
Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press.
Sometimes people call it the document “vector.” The vector extends from the origin and ends at the specified point. For our purposes, “vector” and “point” are the same thing.