
Simple implementation of N-Gram, TF-IDF and Cosine similarity in Python

I need to compare documents stored in a database and come up with a similarity score between 0 and 1.

The method I need to use has to be very simple: a standard implementation of n-grams (where it is possible to define how many grams to use), along with simple implementations of TF-IDF and cosine similarity.

Is there any program that can do this? Or should I start writing it from scratch?

Answers


Check out the NLTK package (http://www.nltk.org); it has everything you need.

For cosine similarity:


import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
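
For example (a minimal sketch; the two vectors are made up):

print cosine_distance(numpy.array([1.0, 2.0]), numpy.array([2.0, 4.0]))  # 1.0, parallel vectors
print cosine_distance(numpy.array([1.0, 0.0]), numpy.array([0.0, 1.0]))  # 0.0, orthogonal vectors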

For n-grams:


from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function. Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]
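
To use it on text, tokenize first (a minimal sketch; the sentence is made up):

print ngrams("to be or not to be".split(), 2)
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]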

For TF-IDF you will have to compute the distribution first. I am using Lucene to do that, but you may very well do something similar with NLTK, using FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
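
For instance, a minimal sketch of per-document term counts with FreqDist (the sample tokens are made up):

from nltk import FreqDist

fd = FreqDist("foo bar bar black sheep".split())
print fd['bar']       # raw count: 2
print fd.freq('bar')  # relative frequency: 0.4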

If you like PyLucene, this will tell you how to compute tf.idf:

# assumes reader (IndexReader), searcher (IndexSearcher), sim (a Similarity
# instance), tmap (a dict), fieldname, maxtokensperdoc, max_doc and
# cosine_normalization are defined elsewhere
# reader = lucene.IndexReader(FSDirectory.open(index_loc))
docs = reader.numDocs()
for i in xrange(docs):
    tfv = reader.getTermFreqVector(i, fieldname)
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([x**2 for x in rec.values()])**0.5
            for k, v in rec.items():
                rec[k] = v/denom
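
For reference, classic Lucene's DefaultSimilarity defines those two factors roughly as below; this is a sketch from memory, so treat it as an assumption rather than the exact implementation:

from math import sqrt, log

def default_tf(freq):
    return sqrt(freq)  # tf = square root of the raw term frequency

def default_idf(doc_freq, num_docs):
    return 1.0 + log(num_docs / float(doc_freq + 1))  # idf = 1 + log(N / (df + 1))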
Comment: there is no need to call sqrt() twice, since sqrt(a) * sqrt(b) = sqrt(a*b).


For our information retrieval course, we use some code written by our professor in Java. Sorry, there is no Python port. "It is being released for educational and research purposes only under the GNU General Public License."

You can go through the documentation: http://userweb.cs.utexas.edu/~mooney/ir-course/doc/

But more specifically, take a look at: http://userweb.cs.utexas.edu/users/mooney/ir-course/doc/ir/vsr/HashMapVector.html

You can download it from http://userweb.cs.utexas.edu/users/mooney/ir-course/


In case you're still interested in this problem, I've done something very similar using Lucene Java and Jython. Here are some snippets from my code.

Lucene preprocesses documents and queries using so-called analyzers. This one uses Lucene's built-in n-gram filter:

class NGramAnalyzer(Analyzer):
    '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
    def __init__(self, minlength, maxlength):
        self.minlength = minlength
        self.maxlength = maxlength
    def tokenStream(self, field, reader):
        lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
        return NGramTokenFilter(lower, self.minlength, self.maxlength)

To turn a list of ngrams into a Document:

doc = Document()
doc.add(Field('n-grams', ' '.join(ngrams),
              Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))

To store a document in an index:

wr = IndexWriter(index_dir, NGramAnalyzer(minlength, maxlength), True,
                 IndexWriter.MaxFieldLength.LIMITED)  # pass your n-gram bounds
wr.addDocument(doc)

Creating queries is a little more difficult, since Lucene's QueryParser expects a query language with special operators, quotation marks, etc., but it can be circumvented (as partially explained here).
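
One way around it is to skip QueryParser and build the query programmatically from the query document's n-grams. This is a sketch rather than the answerer's actual code; it assumes the standard Lucene classes BooleanQuery, TermQuery, Term and BooleanClause are imported and that searcher is an IndexSearcher over the index built above:

query = BooleanQuery()
for ng in query_ngrams:  # the same n-gram strings the index was built from
    query.add(TermQuery(Term('n-grams', ng)), BooleanClause.Occur.SHOULD)
top_docs = searcher.search(query, 10)  # the 10 most similar indexed documents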


Here is an answer with just python + numpy, in a nutshell:

Cosine:

def cosine_sim(u,v): 
    return np.dot(u,v)/(sqrt(np.dot(u,u)) * sqrt(np.dot(v,v))) 

N-grams:

def ngrams(sentence, n): 
    return zip(*[sentence.split()[i:] for i in range(n)]) 

TF-IDF (which is a bit weird, but it works):

def tfidf(corpus, vocab):
    """
    INPUT:

    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]),
              ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]),
              ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]

    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence',
             'sheep', 'this']

    OUTPUT:

    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300],
     [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0],
     [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try:
            return float(len(matrix)) / sum([1 for i,_ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0

    matrix = [{k:v for k,v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc,_ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix, doc, term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf*idf

    return [[tfidf[doc][term] for term in vocab] for doc,_ in enumerate(tfidf)]

Here is the long answer, with tests:

import numpy as np 
from math import sqrt, log 
from itertools import chain, product 
from collections import defaultdict 

def cosine_sim(u,v): 
    return np.dot(u,v)/(sqrt(np.dot(u,u)) * sqrt(np.dot(v,v))) 

def ngrams(sentence, n): 
    return zip(*[sentence.split()[i:] for i in range(n)]) 

def tfidf(corpus, vocab):
    """
    INPUT:

    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]),
              ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]),
              ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]

    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence',
             'sheep', 'this']

    OUTPUT:

    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300],
     [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0],
     [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try:
            return float(len(matrix)) / sum([1 for i,_ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0

    matrix = [{k:v for k,v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc,_ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix, doc, term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf*idf

    return [[tfidf[doc][term] for term in vocab] for doc,_ in enumerate(tfidf)]


def corpus2vectors(corpus):
    def vectorize(sentence, vocab):
        return [sentence.split().count(i) for i in vocab]
    vectorized_corpus = []
    vocab = sorted(set(chain(*[i.lower().split() for i in corpus])))
    for i in corpus:
        vectorized_corpus.append((i, vectorize(i, vocab)))
    return vectorized_corpus, vocab

def create_test_corpus(): 
    sent1 = "this is a foo bar" 
    sent2 = "foo bar bar black sheep" 
    sent3 = "this is a sentence" 

    all_sents = [sent1,sent2,sent3] 
    corpus, vocab = corpus2vectors(all_sents) 
    return corpus, vocab 

def test_cosine():
    corpus, vocab = create_test_corpus()

    for sentx, senty in product(corpus, corpus):
        print sentx[0]
        print senty[0]
        print "cosine =", cosine_sim(sentx[1], senty[1])
        print

def test_ngrams():
    corpus, vocab = create_test_corpus()
    for sentx in corpus:
        print sentx[0]
        print ngrams(sentx[0], 2)
        print ngrams(sentx[0], 3)
        print

def test_tfidf():
    corpus, vocab = create_test_corpus()
    print corpus
    print vocab
    print tfidf(corpus, vocab)

print "Testing cosine..." 
test_cosine() 
print 
print "Testing ngrams..." 
test_ngrams() 
print 
print "Testing tfidf..." 
test_tfidf() 
print 

[out]:

Testing cosine... 
this is a foo bar 
this is a foo bar 
cosine = 1.0 

this is a foo bar 
foo bar bar black sheep 
cosine = 0.507092552837 

this is a foo bar 
this is a sentence 
cosine = 0.67082039325 

foo bar bar black sheep 
this is a foo bar 
cosine = 0.507092552837 

foo bar bar black sheep 
foo bar bar black sheep 
cosine = 1.0 

foo bar bar black sheep 
this is a sentence 
cosine = 0.0 

this is a sentence 
this is a foo bar 
cosine = 0.67082039325 

this is a sentence 
foo bar bar black sheep 
cosine = 0.0 

this is a sentence 
this is a sentence 
cosine = 1.0 


Testing ngrams... 
this is a foo bar 
[('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar')] 
[('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar')] 

foo bar bar black sheep 
[('foo', 'bar'), ('bar', 'bar'), ('bar', 'black'), ('black', 'sheep')] 
[('foo', 'bar', 'bar'), ('bar', 'bar', 'black'), ('bar', 'black', 'sheep')] 

this is a sentence 
[('this', 'is'), ('is', 'a'), ('a', 'sentence')] 
[('this', 'is', 'a'), ('is', 'a', 'sentence')] 


Testing tfidf... 
[('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])] 
['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this'] 
[[0.30000000000000004, 0.30000000000000004, 0.0, 0.30000000000000004, 0.30000000000000004, 0.0, 0.0, 0.30000000000000004], [0.0, 0.6000000000000001, 0.6000000000000001, 0.30000000000000004, 0.0, 0.0, 0.6000000000000001, 0.0], [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]] 
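
To get the 0-to-1 similarity score the question asks for, feed the tfidf rows back into cosine_sim. A minimal sketch reusing the test corpus above:

corpus, vocab = create_test_corpus()
tfidf_matrix = tfidf(corpus, vocab)
for x, rowx in enumerate(tfidf_matrix):
    for y, rowy in enumerate(tfidf_matrix):
        print corpus[x][0], '/', corpus[y][0], '=', cosine_sim(rowx, rowy)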