Very Shallow Study Notes on Word Sense Disambiguation

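(I wrote this a few months ago and am posting it here as a backup. The content may contain mistakes, so please defer to the original sources.)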


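The person I interviewed recently has an NLP background, so I did a quick cram session on the topic.
Wikipedia is always a good teacher! Reading it for long stretches is hard on the eyes, though.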

References

WSD (word sense disambiguation) is harder than POS (part-of-speech) tagging.
  • accuracy: about 70% for WSD vs. about 95% for POS tagging
  • WSD data is harder to annotate: it is hard for humans to know every sense of a word.

Both areas are dominated by machine learning methods.
  • WSD: SVM
  • POS: HMM
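ML and statistics have been storming into every field in recent years. I'm rather curious why, when POS tagging and WSD look so similar,
Wikipedia says WSD uses classifiers while POS tagging uses models such as HMMs that take word order into account.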
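
To make "taking word order into account" concrete, here is a minimal sketch of Viterbi decoding for a first-order HMM tagger. It is only an illustration, not how any particular tagger is implemented: the tag set and every probability in `START`, `TRANS`, and `EMIT` are made-up toy numbers, and the function assumes a non-empty, lowercased token list.

```python
# Toy HMM parameters. All values are hypothetical and chosen by hand
# for illustration; a real tagger would estimate them from a corpus.
STATES = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.5, "bark": 0.5},
    "VERB": {"bark": 1.0},
}


def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty token list."""
    # Each column maps a state to (best path probability, previous state).
    first = tokens[0]
    columns = [
        {s: (start_p[s] * emit_p[s].get(first, 0.0), None) for s in states}
    ]
    for token in tokens[1:]:
        prev = columns[-1]
        column = {}
        for s in states:
            # Try every previous state and keep the most probable way in.
            column[s] = max(
                (prev[r][0] * trans_p[r][s] * emit_p[s].get(token, 0.0), r)
                for r in states
            )
        columns.append(column)

    # Start from the best final state and follow the back-pointers.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


print(viterbi(["the", "dog", "bark"], STATES, START, TRANS, EMIT))
# ['DET', 'NOUN', 'VERB']
print(viterbi(["the", "bark"], STATES, START, TRANS, EMIT))
# ['DET', 'NOUN']
```

With these toy numbers, "bark" is tagged as a verb after a noun but as a noun right after a determiner, which is exactly the kind of decision that depends on the order of the surrounding tags.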

Two kinds of approaches to WSD

  • Deep approach:
    • encode human knowledge in a computer-readable format.
    • it's very hard to use in practice.
  • Shallow approach:
    • a statistical way.
    • it works most of the time; to resolve an ambiguous word, look at the words in a window around it.
Example
  • bass: low frequency or a kind of fish.
    • bass can be distinguished by counting word co-occurrence.
    • bass + sound -> low frequency
    • bass + fish or sea -> fish
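  • A hard example: A dog barks at a tree.
    • bark: the sound a dog makes, or the outer covering of a tree?
    • If we use window size <= 2, the window contains "dog" but not "tree", so the word is linked to the dog sense.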
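
The window idea in the examples above can be sketched in a few lines of Python. This is just a toy under loud assumptions: the `SENSE_CUES` table is hand-written and hypothetical (a real system would learn co-occurrence counts from a sense-annotated corpus), and the tokens are assumed to be lowercased and lemmatized already, so "barks" arrives as "bark".

```python
# Hypothetical cue words for each sense, written by hand for illustration.
SENSE_CUES = {
    "bass": {
        "low frequency": {"sound", "music"},
        "fish": {"fish", "sea"},
    },
    "bark": {
        "dog sound": {"dog"},
        "tree covering": {"tree"},
    },
}


def disambiguate(tokens, index, window=2):
    """Pick the sense whose cue words occur most often near tokens[index].

    Only the `window` tokens on each side of the target are inspected.
    Returns None when no cue word falls inside the window; ties go to
    the sense listed first in SENSE_CUES.
    """
    target = tokens[index]
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    context = left + right
    scores = {
        sense: sum(1 for word in context if word in cues)
        for sense, cues in SENSE_CUES[target].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


sentence = ["a", "dog", "bark", "at", "a", "tree"]
print(disambiguate(sentence, 2, window=2))  # dog sound
print(disambiguate(["sea", "bass"], 1))     # fish
```

With `window=2` the context of "bark" is "a dog ... at a", so only the dog cue is counted. Widening the window to 3 pulls in "tree" as well, and the two senses tie, which is why this sentence is the hard case.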

Misc
  • A naive baseline for POS tagging (tag each word with its most likely POS) achieves 90% accuracy. That result leaves me a bit speechless: from the very start there is at most 10% of room for improvement.
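
For reference, the naive baseline is roughly this simple. A minimal sketch, assuming a tiny hand-made tagged corpus (`TOY_CORPUS` is hypothetical, as is the choice of `NOUN` as the fallback tag for unseen words); it will not reproduce the 90% figure, it only shows the mechanics.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: sentences of (word, tag) pairs.
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("is", "VERB"), ("rough", "ADJ")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB"), ("loudly", "ADV")],
]


def train_baseline(tagged_sentences):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model, default="NOUN"):
    """Tag each token independently; unseen words get the default tag."""
    return [(token, model.get(token.lower(), default)) for token in tokens]


model = train_baseline(TOY_CORPUS)
print(tag(["the", "bark"], model))
# [('the', 'DET'), ('bark', 'VERB')]
```

Note that "bark" comes out as a verb even right after "the", because the baseline never looks at neighboring words. Cases like this make up the remaining errors that context-aware models have to win back.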

