
Simple implementation of N-Gram, TF-IDF and Cosine similarity in Python

I need to compare documents stored in a database and come up with a similarity score between 0 and 1.

The method I need to use has to be very simple: a standard implementation of n-grams (where it is possible to define how many grams to use), along with simple implementations of TF-IDF and cosine similarity.

Is there any program that can do this? Or should I start writing it from scratch?

Answers


Check out the NLTK package (http://www.nltk.org); it has everything you need.

For cosine similarity:


import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
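
For example (a minimal sketch; the two vectors are made up):

print cosine_distance(numpy.array([1.0, 2.0]), numpy.array([2.0, 4.0]))  # 1.0, parallel vectors
print cosine_distance(numpy.array([1.0, 0.0]), numpy.array([0.0, 1.0]))  # 0.0, orthogonal vectors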

For n-grams:


from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function. Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]
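
To use it on text, tokenize first (a minimal sketch; the sentence is made up):

print ngrams("to be or not to be".split(), 2)
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]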

For TF-IDF you will have to compute the distribution first. I am using Lucene to do that, but you may very well do something similar with NLTK, using FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
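
For instance, a minimal sketch of per-document term counts with FreqDist (the sample tokens are made up):

from nltk import FreqDist

fd = FreqDist("foo bar bar black sheep".split())
print fd['bar']       # raw count: 2
print fd.freq('bar')  # relative frequency: 0.4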

If you like PyLucene, this will tell you how to compute tf.idf:

# assumes reader (IndexReader), searcher (IndexSearcher), sim (a Similarity
# instance), tmap (a dict), fieldname, maxtokensperdoc, max_doc and
# cosine_normalization are defined elsewhere
# reader = lucene.IndexReader(FSDirectory.open(index_loc))
docs = reader.numDocs()
for i in xrange(docs):
    tfv = reader.getTermFreqVector(i, fieldname)
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([x**2 for x in rec.values()])**0.5
            for k, v in rec.items():
                rec[k] = v/denom
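
For reference, classic Lucene's DefaultSimilarity defines those two factors roughly as below; this is a sketch from memory, so treat it as an assumption rather than the exact implementation:

from math import sqrt, log

def default_tf(freq):
    return sqrt(freq)  # tf = square root of the raw term frequency

def default_idf(doc_freq, num_docs):
    return 1.0 + log(num_docs / float(doc_freq + 1))  # idf = 1 + log(N / (df + 1))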
Comment: there is no need to call sqrt() twice, since sqrt(a) * sqrt(b) = sqrt(a*b).


For our information retrieval course, we use some code written by our professor in Java. Sorry, there is no Python port. "It is being released for educational and research purposes only under the GNU General Public License."

You can go through the documentation: http://userweb.cs.utexas.edu/~mooney/ir-course/doc/

But more specifically, take a look at: http://userweb.cs.utexas.edu/users/mooney/ir-course/doc/ir/vsr/HashMapVector.html

You can download it from http://userweb.cs.utexas.edu/users/mooney/ir-course/


In case you're still interested in this problem, I've done something very similar using Lucene Java and Jython. Here are some snippets from my code.

Lucene preprocesses documents and queries using so-called analyzers. This one uses Lucene's built-in n-gram filter:

class NGramAnalyzer(Analyzer):
    '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
    def __init__(self, minlength, maxlength):
        self.minlength = minlength
        self.maxlength = maxlength
    def tokenStream(self, field, reader):
        lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
        return NGramTokenFilter(lower, self.minlength, self.maxlength)

To turn a list of ngrams into a Document:

doc = Document()
doc.add(Field('n-grams', ' '.join(ngrams),
              Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))

To store a document in an index:

wr = IndexWriter(index_dir, NGramAnalyzer(minlength, maxlength), True,
                 IndexWriter.MaxFieldLength.LIMITED)  # pass your n-gram bounds
wr.addDocument(doc)

Creating queries is a little more difficult, since Lucene's QueryParser expects a query language with special operators, quotation marks, etc., but it can be circumvented (as partially explained here).
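
One way around it is to skip QueryParser and build the query programmatically from the query document's n-grams. This is a sketch rather than the answerer's actual code; it assumes the standard Lucene classes BooleanQuery, TermQuery, Term and BooleanClause are imported and that searcher is an IndexSearcher over the index built above:

query = BooleanQuery()
for ng in query_ngrams:  # the same n-gram strings the index was built from
    query.add(TermQuery(Term('n-grams', ng)), BooleanClause.Occur.SHOULD)
top_docs = searcher.search(query, 10)  # the 10 most similar indexed documents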


Here is an answer with just python + numpy, in a nutshell:

Cosine:

def cosine_sim(u,v): 
    return np.dot(u,v)/(sqrt(np.dot(u,u)) * sqrt(np.dot(v,v))) 

N-grams:

def ngrams(sentence, n): 
    return zip(*[sentence.split()[i:] for i in range(n)]) 

TF-IDF (which is a bit weird, but it works):

def tfidf(corpus, vocab):
    """
    INPUT:

    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]),
              ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]),
              ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]

    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence',
             'sheep', 'this']

    OUTPUT:

    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300],
     [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0],
     [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try:
            return float(len(matrix)) / sum([1 for i,_ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0

    matrix = [{k:v for k,v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc,_ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix, doc, term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf*idf

    return [[tfidf[doc][term] for term in vocab] for doc,_ in enumerate(tfidf)]

Here is the long answer, with tests:

import numpy as np 
from math import sqrt, log 
from itertools import chain, product 
from collections import defaultdict 

def cosine_sim(u,v): 
    return np.dot(u,v)/(sqrt(np.dot(u,u)) * sqrt(np.dot(v,v))) 

def ngrams(sentence, n): 
    return zip(*[sentence.split()[i:] for i in range(n)]) 

def tfidf(corpus, vocab):
    """
    INPUT:

    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]),
              ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]),
              ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]

    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence',
             'sheep', 'this']

    OUTPUT:

    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300],
     [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0],
     [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try:
            return float(len(matrix)) / sum([1 for i,_ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0

    matrix = [{k:v for k,v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc,_ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix, doc, term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf*idf

    return [[tfidf[doc][term] for term in vocab] for doc,_ in enumerate(tfidf)]


def corpus2vectors(corpus):
    def vectorize(sentence, vocab):
        return [sentence.split().count(i) for i in vocab]
    vectorized_corpus = []
    vocab = sorted(set(chain(*[i.lower().split() for i in corpus])))
    for i in corpus:
        vectorized_corpus.append((i, vectorize(i, vocab)))
    return vectorized_corpus, vocab

def create_test_corpus(): 
    sent1 = "this is a foo bar" 
    sent2 = "foo bar bar black sheep" 
    sent3 = "this is a sentence" 

    all_sents = [sent1,sent2,sent3] 
    corpus, vocab = corpus2vectors(all_sents) 
    return corpus, vocab 

def test_cosine():
    corpus, vocab = create_test_corpus()

    for sentx, senty in product(corpus, corpus):
        print sentx[0]
        print senty[0]
        print "cosine =", cosine_sim(sentx[1], senty[1])
        print

def test_ngrams():
    corpus, vocab = create_test_corpus()
    for sentx in corpus:
        print sentx[0]
        print ngrams(sentx[0], 2)
        print ngrams(sentx[0], 3)
        print

def test_tfidf():
    corpus, vocab = create_test_corpus()
    print corpus
    print vocab
    print tfidf(corpus, vocab)

print "Testing cosine..." 
test_cosine() 
print 
print "Testing ngrams..." 
test_ngrams() 
print 
print "Testing tfidf..." 
test_tfidf() 
print 

[out]:

Testing cosine... 
this is a foo bar 
this is a foo bar 
cosine = 1.0 

this is a foo bar 
foo bar bar black sheep 
cosine = 0.507092552837 

this is a foo bar 
this is a sentence 
cosine = 0.67082039325 

foo bar bar black sheep 
this is a foo bar 
cosine = 0.507092552837 

foo bar bar black sheep 
foo bar bar black sheep 
cosine = 1.0 

foo bar bar black sheep 
this is a sentence 
cosine = 0.0 

this is a sentence 
this is a foo bar 
cosine = 0.67082039325 

this is a sentence 
foo bar bar black sheep 
cosine = 0.0 

this is a sentence 
this is a sentence 
cosine = 1.0 


Testing ngrams... 
this is a foo bar 
[('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar')] 
[('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar')] 

foo bar bar black sheep 
[('foo', 'bar'), ('bar', 'bar'), ('bar', 'black'), ('black', 'sheep')] 
[('foo', 'bar', 'bar'), ('bar', 'bar', 'black'), ('bar', 'black', 'sheep')] 

this is a sentence 
[('this', 'is'), ('is', 'a'), ('a', 'sentence')] 
[('this', 'is', 'a'), ('is', 'a', 'sentence')] 


Testing tfidf... 
[('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])] 
['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this'] 
[[0.30000000000000004, 0.30000000000000004, 0.0, 0.30000000000000004, 0.30000000000000004, 0.0, 0.0, 0.30000000000000004], [0.0, 0.6000000000000001, 0.6000000000000001, 0.30000000000000004, 0.0, 0.0, 0.6000000000000001, 0.0], [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]] 
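
To get the 0-to-1 similarity score the question asks for, feed the tfidf rows back into cosine_sim. A minimal sketch reusing the test corpus above:

corpus, vocab = create_test_corpus()
tfidf_matrix = tfidf(corpus, vocab)
for x, rowx in enumerate(tfidf_matrix):
    for y, rowy in enumerate(tfidf_matrix):
        print corpus[x][0], '/', corpus[y][0], '=', cosine_sim(rowx, rowy)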