How to parse only the text from HTML

16

From the jsoup cookbook: http://jsoup.org/cookbook/extracting-data/attributes-text-html

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"; 
Document doc = Jsoup.parse(html); 
String text = doc.body().text(); // "An example link."

Source

2010-08-17 22:13:45

+0

How do you exclude invisible elements (for example, display: none)? – Ehsan

0

Well, here is a quick method I threw together once. It uses regular expressions to get the job done. Most people will agree that this is not a good way to go about it. So, use at your own risk.

public static String getPlainText(String html) { 
    String htmlBody = html.replaceAll("<hr>", ""); // one off for horizontal rule lines 
    String plainTextBody = htmlBody.replaceAll("<[^<>]+>([^<>]*)<[^<>]+>", "$1"); 
    plainTextBody = plainTextBody.replaceAll("<br ?/>", ""); 
    // decodeHtml is not shown in this answer; presumably it decodes HTML entities such as &amp;
    return decodeHtml(plainTextBody);
}

This was originally used in my API wrapper for the Stack Overflow API, so it was only tested on a small subset of HTML tags.

Source

2010-08-17 22:15:07 jjnguy

+0

Hmmm... why not use a simple regexp: 'replaceAll("<[^>]+>", "")'? – Crozin

+0

@Crozin, well, I was teaching myself how to use backreferences, I guess. It looks like yours would probably work too. – jjnguy

+0

This hurts! -> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – sleeplessnerd
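To make the difference between the two patterns discussed above concrete, here is a minimal sketch that runs both on the same inputs. The class and method names (StripTagsDemo, stripPairs, stripAll) are made up for illustration, and the decodeHtml step is left out because it is not shown in the answer. The pair-based pattern consumes tags two at a time, so an input with an odd number of tags can leave one behind.

```java
public class StripTagsDemo {
    // Pattern from the answer: matches tag + text + tag and keeps only the text.
    static String stripPairs(String html) {
        return html.replaceAll("<[^<>]+>([^<>]*)<[^<>]+>", "$1");
    }

    // Pattern suggested in the comments: removes every tag on its own.
    static String stripAll(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String evenTags = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        String oddTags = "<p>Hi<br>there</p>";

        System.out.println(stripPairs(evenTags)); // An example link.
        System.out.println(stripAll(evenTags));   // An example link.
        System.out.println(stripPairs(oddTags));  // Hithere</p>
        System.out.println(stripAll(oddTags));    // Hithere
    }
}
```

Neither pattern understands comments, scripts, or a '>' inside an attribute value, which is the point of the linked question.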

1

Using classes that are part of the JDK:

import java.io.*; 
import java.net.*; 
import javax.swing.text.*; 
import javax.swing.text.html.*; 

class GetHTMLText
{
    public static void main(String[] args)
        throws Exception
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet handle charsets properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Create a reader on the HTML content.

        Reader rd = getReader(args[0]);

        // Parse the HTML.

        kit.read(rd, doc, 0);

        // The HTML text is now stored in the document

        System.out.println(doc.getText(0, doc.getLength()));
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
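
If the HTML is already in a String rather than behind a URL or file, the same kit can read from a StringReader. The following is a rough sketch under that assumption; the class name HtmlStringToText and the helper toText are hypothetical, and the exact whitespace in the result may vary, so the caller trims it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;

public class HtmlStringToText {
    // Parses an in-memory HTML string with HTMLEditorKit and returns the document's text.
    static String toText(String html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // Same workaround as above: ignore any charset declared inside the HTML.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        // trim() because the parsed document may carry leading or trailing line breaks
        System.out.println(toText(html).trim());
    }
}
```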

Source

2010-08-17 23:14:11 camickr
