Monday, March 26, 2012

Latent Semantic Analysis (LSA)

Does SQL 2005 offer any tools to perform Latent Semantic Analysis on large data sets? Say I have millions of daily search queries and I'd like to link queries to one another based on semantic content, with the goal of mapping them to larger "categories". One way of doing this is to compare word frequency and proximity to construct a semantic "weight space". This is the approach generally used in LSA.
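To make the word-frequency comparison concrete, here is a minimal Python sketch that turns each query into a term-count vector and compares two queries by cosine similarity. It is not full LSA: a real LSA pipeline would go on to apply a truncated SVD to the term-by-query matrix, which is left out here. The sample queries and the function names are made up for illustration.

```python
import math
from collections import Counter

def term_vector(query):
    """Bag-of-words term counts for one query (lowercased, whitespace-split)."""
    return Counter(query.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sample queries, not real data.
queries = [
    "cheap flights to paris",
    "paris flights cheap",
    "sql server data mining",
]
vectors = [term_vector(q) for q in queries]

sim_close = cosine(vectors[0], vectors[1])  # queries sharing three terms
sim_far = cosine(vectors[0], vectors[2])    # queries sharing no terms
```

Queries that share terms score above zero and queries with no terms in common score zero. Note that word order plays no part in this comparison.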

Anyone using SQL 2005 for this? Anyone see a way it could be done?

We do have the text mining transforms in Integration Services for determining words and phrases from a corpus of documents. We don't have anything specific to LSA, but if you have categories and run a logistic regression you may be able to pull the coefficients to get the information you need. For an example, see http://www.sqlserverdatamining.com/DMCommunity/TipsNTricks/1509.aspx. If all else fails, you can write custom algorithms that plug in to the architecture using either C++ or any .NET language.|||

Had a look at the Text Mining tutorial. It appears to me that the term extraction and term lookup transforms destroy the notion of proximity between the words in the phrase (seems to just create a frequency count of terms). I believe proximity is the key to deriving semantic meaning using techniques like LSA. I might be missing something, though. Can you see a way of using the transforms while maintaining this info?
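To illustrate what I mean, here is a small Python sketch (not the SSIS transforms themselves, just a stand-in under my own assumptions) contrasting a plain frequency count with a windowed co-occurrence count. The function names, the `window` parameter, and the sample phrases are hypothetical.

```python
from collections import Counter

def term_counts(tokens):
    """Plain frequency count: all positional information is discarded."""
    return Counter(tokens)

def cooccurrence(tokens, window=2):
    """Count unordered pairs of distinct terms that appear within
    `window` positions of each other."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs

# Two hypothetical phrases with the same words in a different order.
a = "data mining with sql server".split()
b = "sql mining server with data".split()

same_counts = term_counts(a) == term_counts(b)  # frequency view cannot tell them apart
pairs_a = cooccurrence(a, window=1)             # adjacent pairs only
pairs_b = cooccurrence(b, window=1)
```

The two phrases have identical frequency counts, but their adjacent-pair counts differ: "data" and "mining" are neighbours only in the first. That difference is the proximity information a frequency-only output would lose.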

FYI, here are some good LSA (LSI) links:

http://lsi.research.telcordia.com/lsi/papers/JASIS90.pdf

http://en.wikipedia.org/wiki/Latent_semantic_indexing

|||

You're right, the transforms do eliminate proximity, except that they extract key terms and phrases. That is, they will extract "Data Mining" as a single entity rather than necessarily separating it into "Data" and "Mining". You can specify the maximum phrase length in the options for the transform.
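As a rough illustration of the idea of a maximum phrase length, the Python sketch below counts every run of up to `max_phrase_len` consecutive words as a candidate phrase. This is a naive stand-in and does not reproduce the transform's actual extraction logic; the function name, parameter name, and sample text are assumptions.

```python
from collections import Counter

def candidate_phrases(tokens, max_phrase_len=2):
    """Count every run of 1..max_phrase_len consecutive tokens as a phrase."""
    counts = Counter()
    for n in range(1, max_phrase_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i : i + n])] += 1
    return counts

# Hypothetical sample text.
tokens = "data mining tools for data mining".split()
phrases = candidate_phrases(tokens, max_phrase_len=2)

# "data mining" is counted as one entity alongside the single words.
two_word_count = phrases["data mining"]
```

With the limit set to 2, "data mining" shows up as a phrase in its own right, while three-word runs such as "data mining tools" are never produced.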

I could imagine breaking up these phrases with a post-process and indicating their proximity. I don't know if that would help, though: you would still lose the proximity between the phrases and the other words and phrases.
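One possible shape for such a post-process, sketched in Python under my own assumptions: go back to the original token stream, record where each extracted phrase starts, and measure the gap between phrases. It only works if the original text is still available alongside the extracted terms. The function names and sample data are hypothetical.

```python
def phrase_positions(tokens, phrases):
    """Map each phrase to the list of token offsets where it starts."""
    positions = {p: [] for p in phrases}
    for p in phrases:
        parts = p.split()
        for i in range(len(tokens) - len(parts) + 1):
            if tokens[i : i + len(parts)] == parts:
                positions[p].append(i)
    return positions

def min_distance(positions, a, b):
    """Smallest gap in token offsets between any occurrence of a and of b,
    or None if either phrase never occurs."""
    gaps = [abs(i - j) for i in positions[a] for j in positions[b]]
    return min(gaps) if gaps else None

# Hypothetical sample text and extracted phrases.
tokens = "latent semantic analysis for data mining in sql server".split()
pos = phrase_positions(tokens, ["semantic analysis", "data mining", "sql server"])

near = min_distance(pos, "semantic analysis", "data mining")
far = min_distance(pos, "semantic analysis", "sql server")
```

Here "semantic analysis" is closer to "data mining" than to "sql server", which is the kind of phrase-to-phrase proximity the term counts alone do not keep.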
