¿La forma más eficaz de encontrar el recuento de coincidencias que tiene una cadena con una matriz de palabras?

digamos que tengo una cadena¿La forma más eficaz de encontrar el recuento de coincidencias que tiene una cadena con una matriz de palabras?

String test = "This is a test string and I have some stopwords in here";

y yo quiero ver cuántas veces las palabras de la matriz por debajo de partido en contra de mi cadena

psudocode

array = a,and,the,them,they,I

por lo que la respuesta sería "3"

curiosidad lo que la forma más eficiente de hacer eso en java es?

Fuente

2010-07-09 James

Interesante pregunta, a ver si puedo llegar a algo mejor que el algoritmo ingenuo – quantumSoup

¿Qué hay de repeticiones? Las respuestas a continuación leen datos en Conjuntos, que puntuarían 3 para "a y el" pero solo 1 para "a a a".¿Es ese el comportamiento deseado, o ambos deberían reportar 3? – Chadwick

probablemente me guardo las palabras de entrada en un HashSet y luego iterar sobre la matriz y ver si cada palabra de la matriz es .contains en el conjunto.

Aquí está en el código ... la entrada es "Around the world in 80 days".

import java.io.File; 
import java.io.FileNotFoundException; 
import java.util.ArrayList; 
import java.util.Arrays; 
import java.util.HashSet; 
import java.util.List; 
import java.util.Scanner; 
import java.util.Set; 

public class Main 
{ 
    public static void main(final String[] argv) 
     throws FileNotFoundException 
    { 
     final File  file; 
     final String[] wordsToFind; 

     file  = new File(argv[0]); 
     wordsToFind = getWordsToFind(file); 
     a(file, wordsToFind); 
     b(file, wordsToFind); 
     c(file, wordsToFind); 
     d(file, wordsToFind); 
    } 

    // this just reads the file into the disk cache 
    private static String[] getWordsToFind(final File file) 
     throws FileNotFoundException 
    { 
     final Scanner  scanner; 
     final Set<String> words; 

     scanner = new Scanner(file); 
     words = new HashSet<String>(); 

     while(scanner.hasNext()) 
     { 
      final String word; 

      word = scanner.next(); 
      words.add(word); 
     } 

     return (words.toArray(new String[words.size()])); 
    } 

    // bad way, read intpo a list and then iterate over the list until you find a match 
    private static void a(final File  file, 
          final String[] wordsToFind) 
     throws FileNotFoundException 
    { 
     final long start; 
     final long end; 
     final long total; 
     final Scanner  scanner; 
     final List<String> words; 
     int    matches; 

     scanner = new Scanner(file); 
     words = new ArrayList<String>(); 

     while(scanner.hasNext()) 
     { 
      final String word; 

      word = scanner.next(); 
      words.add(word); 
     } 

     start = System.nanoTime(); 

     { 
      matches = 0; 

      for(final String wordToFind : wordsToFind) 
      { 
       for(final String word : words) 
       { 
        if(word.equals(wordToFind)) 
        { 
         matches++; 
         break; 
        } 
       } 
      } 

      System.out.println(matches); 
     } 

     end = System.nanoTime(); 
     total = end - start; 
     System.out.println("a: " + total); 
    } 

    // slightly better way, read intpo a list and then iterate over the set (which reduces the number of things you progbably 
    // have to read until you find a match), until you find a match 
    private static void b(final File  file, 
          final String[] wordsToFind) 
     throws FileNotFoundException 
    { 
     final long start; 
     final long end; 
     final long total; 
     final Scanner  scanner; 
     final Set<String> words; 
     int    matches; 

     scanner = new Scanner(file); 
     words = new HashSet<String>(); 

     while(scanner.hasNext()) 
     { 
      final String word; 

      word = scanner.next(); 
      words.add(word); 
     } 

     start = System.nanoTime(); 

     { 
      matches = 0; 

      for(final String wordToFind : wordsToFind) 
      { 
       for(final String word : words) 
       { 
        if(word.equals(wordToFind)) 
        { 
         matches++; 
         break; 
        } 
       } 
      } 

      System.out.println(matches); 
     } 

     end = System.nanoTime(); 
     total = end - start; 
     System.out.println("b: " + total); 
    } 

    // my way 
    private static void c(final File  file, 
          final String[] wordsToFind) 
     throws FileNotFoundException 
    { 
     final long start; 
     final long end; 
     final long total; 
     final Scanner  scanner; 
     final Set<String> words; 
     int    matches; 

     scanner = new Scanner(file); 
     words = new HashSet<String>(); 

     while(scanner.hasNext()) 
     { 
      final String word; 

      word = scanner.next(); 
      words.add(word); 
     } 

     start = System.nanoTime(); 

     { 
      matches = 0; 

      for(final String wordToFind : wordsToFind) 
      { 
       if(words.contains(wordToFind)) 
       { 
        matches++; 
       } 
      } 

      System.out.println(matches); 
     } 

     end = System.nanoTime(); 
     total = end - start; 
     System.out.println("c: " + total); 
    } 

    // Nikita Rybak way 
    private static void d(final File  file, 
          final String[] wordsToFind) 
     throws FileNotFoundException 
    { 
     final long start; 
     final long end; 
     final long total; 
     final Scanner  scanner; 
     final Set<String> words; 
     int    matches; 

     scanner = new Scanner(file); 
     words = new HashSet<String>(); 

     while(scanner.hasNext()) 
     { 
      final String word; 

      word = scanner.next(); 
      words.add(word); 
     } 

     start = System.nanoTime(); 

     { 
      words.retainAll(new HashSet<String>(Arrays.asList(wordsToFind))); 
      matches = words.size(); 
      System.out.println(matches); 
     } 

     end = System.nanoTime(); 
     total = end - start; 
     System.out.println("d: " + total); 
    } 
}

resultados (después de algunas carreras, cada carrera es más o menos lo mismo aunque):

12596 
a: 2440699000 
12596 
b: 2531635000 
12596 
c: 4507000 
12596 
d: 5597000

Si modifica mediante la adición de "XXX" a cada una de las palabras getWordsToFind (por lo que no palabras se encuentran) se obtiene:

0 
a: 7415291000 
0 
b: 4688973000 
0 
c: 2849000 
0 
d: 7981000

y como complemento, he intentado sólo la búsqueda de la palabra "yo", y los resultados son los siguientes:

1 
a: 235000 
1 
b: 351000 
1 
c: 75000 
1 
d: 10725000

Fuente

2010-07-09 00:13:34 TofuBeer

¿Algo como esto? No estoy seguro acerca de "lo más eficiente", pero lo suficientemente simple.

Set<String> s1 = new HashSet<String>(Arrays.asList("This is a test string and I have some stopwords in here".split("\\s"))); 
Set<String> s2 = new HashSet<String>(Arrays.asList("a", "and", "the", "them", "they", "I")); 
s1.retainAll(s2); 
System.out.println(s1.size());

Solo intersección de dos conjuntos de palabras.

Fuente

2010-07-09 00:16:44

Definitivamente, la intersección de los conjuntos es la forma correcta de ir – Eric

lo más eficaz de hacerlo es una especie tanto 'prueba' y 'array' y luego iterar sobre ambos: n.log (n) + n

prueba -> [ 'a', 'y', 'tener', 'aquí', en, es, ..., 'Este'] gama -> [ 'a', 'y', 'la', 'ellos', 'ellos', 'I']

prueba array partidos 'a' 'a' 1 'a' 'y' 1 'y' 'y' 2 'y' 'tener' 2 'la' 'aquí' 2 'el' 'en' 2 'el' 'es' 2 ...

Fuente

2010-07-09 00:24:01

@Aircule, ya que ambas matrices ordenadas se atraviesan simultáneamente. El único problema con este algoritmo es que no es tan trivial implementar :) –

+1, pero no es la forma más eficiente. Mientras que aquí tenemos O (n log (n)), las operaciones hash toman un tiempo constante y, por lo tanto, la meta se puede lograr en O (n). –

@Aircule no es así. tiene dos índices diferentes (i y j) y aumenta el que tiene el menor valor: _if (a [i] <= b [j]) {++ i} else {++ j}; _ –

Una variación menor en la respuesta de Nikita (hasta 1 para Nikita). Si usa una Lista para s1, obtiene el número de ocurrencias (en caso de que una palabra aparezca varias veces en la oración).

List<String> s1 = new ArrayList<String>(Arrays.asList("This is a test string and I have some stopwords in here".split("\\s"))); 
Set<String> s2 = new HashSet<String>(Arrays.asList("a", "and", "the", "them", "they", "I")); 
s1.retainAll(s2); 
System.out.println(s1.size());

Fuente

2010-07-09 01:31:35 xagyg

almacenar sus cadenas en la tabla hash (HashMap de (String e Integer)), entonces iterador sobre el texto y aumentar el valor entero de la palabra coincidente en la tabla hash. luego iterador sobre hashtable y suma todos los valores enteros.

Fuente

2010-07-09 06:52:26

¿La forma más eficaz de encontrar el recuento de coincidencias que tiene una cadena con una matriz de palabras?

Respuesta

Cuestiones relacionadas