Tuesday, February 2, 2010

Very shallow study notes on word sense disambiguation

(I wrote this a few months ago and am posting it here as a backup. The content may contain errors; please defer to the original sources.)


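A recent interview involved someone with an NLP background, so I did some quick cramming on the topic.
Wikipedia is always a good teacher! Reading it for long stretches is hard on the eyes, though.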

References

WSD (word sense disambiguation) is harder than POS (part-of-speech) tagging.
  • Accuracy: 70% for WSD vs. 95% for POS tagging.
  • WSD data is harder to annotate: it is hard for a human to know every sense of a word.

Both areas are dominated by machine learning methods.
  • WSD: SVM
  • POS: HMM
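In recent years ML / statistics has been barging into every field and taking over. POS tagging and WSD look so similar that I'm curious why
Wikipedia says WSD uses a classifier while POS tagging uses a model that takes word order into account, such as an HMM.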
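
To make "a model that takes word order into account" a bit more concrete, here is a minimal sketch of Viterbi decoding over a toy HMM. The tag set, the words, and every probability below are made up for illustration; none of them come from Wikipedia or from a real corpus, and a real tagger would estimate them from tagged text.

```python
import math

TAGS = ["DET", "NOUN", "VERB"]

# Made-up probabilities, for illustration only.
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dogs": 0.5, "bark": 0.4},
    "VERB": {"bark": 0.3, "dogs": 0.01},
}
UNSEEN = 1e-6  # small floor so unseen word/tag pairs are unlikely, not impossible


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []
    # Each column maps tag -> (log-prob of the best path ending in tag, previous tag).
    best = [{
        t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], UNSEEN)), None)
        for t in TAGS
    }]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            emit = math.log(EMIT[t].get(word, UNSEEN))
            # Best previous tag for reaching tag t at this position.
            prev = max(TAGS, key=lambda p: best[-1][p][0] + math.log(TRANS[p][t]))
            column[t] = (best[-1][prev][0] + math.log(TRANS[prev][t]) + emit, prev)
        best.append(column)
    # Trace back from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["the", "dogs", "bark"]))
print(viterbi(["the", "bark"]))
```

With these toy numbers, "bark" comes out as a verb after "dogs" and as a noun right after "the": the tag depends on the neighboring tags, not only on the word itself.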

Two kinds of approaches to WSD

  • Deep approach:
    • Encode human knowledge in a computer-readable format.
    • Very hard to use in practice.
  • Shallow approach:
    • A statistical approach.
    • It works most of the time; when a word is ambiguous, look at the words in a window around it to reduce the ambiguity.
Example
  • bass: a low-frequency sound, or a kind of fish.
    • The two senses can be told apart by counting which words co-occur with "bass".
    • bass + sound -> low frequency
    • bass + fish or sea -> fish
  • A hard example: A dog barks at a tree.
    • bark: the sound a dog makes, or the outer layer of a tree?
    • With a window size <= 2, "tree" falls outside the window, so "bark" gets tied to "dog".
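
The window idea in the examples above can be sketched roughly as follows. The cue-word lists in `SENSE_CUES` are hypothetical and hand-written just for this note; a real system would learn co-occurrence counts from a sense-tagged corpus. The function names are also my own.

```python
from collections import Counter

# Hypothetical cue words per sense, hand-picked for illustration only.
SENSE_CUES = {
    "bass": {
        "low frequency": {"sound", "music", "guitar"},
        "fish": {"fish", "sea", "river"},
    },
    "bark": {
        "dog sound": {"dog"},
        "tree covering": {"tree"},
    },
}


def context_window(tokens, index, size):
    """Return up to `size` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right


def disambiguate(tokens, index, lemma, size=2):
    """Pick the sense whose cue words occur most often inside the window."""
    window = context_window(tokens, index, size)
    scores = Counter()
    for sense, cues in SENSE_CUES[lemma].items():
        scores[sense] = sum(1 for word in window if word in cues)
    best_sense, best_score = scores.most_common(1)[0]
    # None means the window contained no cue word at all.
    return best_sense if best_score > 0 else None


tokens = "a dog barks at a tree".split()
print(disambiguate(tokens, 2, "bark", size=2))
```

With a window of 2 only "dog" is in range, so the sketch picks the dog sense. Widening the window to 3 pulls in "tree" as well and the two senses tie, which is exactly why this sentence is a hard example for pure co-occurrence counting.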

Misc
  • A naive approach to POS tagging (simply assign each word its most likely tag) achieves 90%. That result leaves me a bit speechless: from the very start there is at most 10% room for improvement.
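
For reference, the naive baseline can be sketched in a few lines: count how often each word carries each tag, then always pick the most frequent one. The training sentences and tag names below are made up for illustration, and falling back to NOUN for unseen words is my own assumption, not something from the source.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, model, default_tag="NOUN"):
    """Tag each word independently of its context; unseen words get `default_tag`."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]


# Tiny made-up training set, for illustration only.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("is", "VERB"), ("rough", "ADJ")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("peeled", "VERB")],
]

model = train_baseline(TRAIN)
print(tag_baseline(["the", "dogs", "bark"], model))
```

In this toy data "bark" is a noun more often than a verb, so the baseline tags it NOUN even in "the dogs bark". Ignoring context costs it the ambiguous words, yet most words are not ambiguous, which fits with how high the naive score already is.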

