Retrieve keywords from the META tag in an HTML web page using Java


I want to retrieve the content words of an HTML web page, and the keywords contained in the meta tag of the same HTML web page, using Java.
For example, consider this HTML source code:

    <html>
    <head>
    <meta name = "keywords" content = "deception, intricacy, treachery">
    </head>
    <body>
    My very short html document. <br>
    It has just 2 'lines'.
    </body>
    </html>

The content words here are: my, very, short, html, document, it, has, just, lines

Note: punctuation and the number '2' are ruled out.

The keywords here are: deception, intricacy, treachery

I have created a class for this purpose called WebDoc, and this is as far as I have been able to get.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.Set;
    import java.util.TreeSet;

    public class WebDoc {

        protected URL _url;
        protected Set<String> _contentWords;
        protected Set<String> _keywords;

        public WebDoc(URL paramURL) {
            _url = paramURL;
        }

        public Set<String> getContents() throws IOException {
            Set<String> contentWords = new TreeSet<String>();
            BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream()));
            String inputLine;

            while ((inputLine = in.readLine()) != null) {
                // Process each line: strip the tags and store what is left.
                contentWords.add(removeTag(inputLine));
            }
            in.close();
            System.out.println(contentWords);
            _contentWords = contentWords;
            return contentWords;
        }

        // Removes anything that looks like an HTML tag, plus ampersands.
        public String removeTag(String html) {
            html = html.replaceAll("\\<.*?>", "");
            html = html.replaceAll("&", "");
            return html;
        }

        public Set<String> getKeywords() {
            // No idea!
            return null;
        }

        public URL getURL() {
            return _url;
        }

        @Override
        public String toString() {
            return null;
        }
    }

So, after the answer from RedSoxFan about the meta keywords, you only need to split the content lines. You can use a similar method there:

Instead of

    contentWords.add(removeTag(inputLine));

use

    contentWords.addAll(Arrays.asList(removeTag(inputLine).split("[^\\p{L}]+")));
  • .split(...) splits the line at runs of non-letters (I hope this works; please try it and report back), giving an array of substrings, each of which should consist only of letters, along with the occasional empty string.
  • Arrays.asList(...) wraps this array in a list.
  • addAll(...) adds all elements of that list to the set (but not duplicates).
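
To see what the three calls above do to one line, here is a minimal sketch. The class name SplitDemo and the helper wordsOf are made up for illustration and are not part of the WebDoc class; the input is a line of the sample page with its tags already removed.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class SplitDemo {

    // Hypothetical helper: splits one tag-free line at every run of
    // non-letters and collects the pieces in a sorted, duplicate-free set.
    static Set<String> wordsOf(String line) {
        Set<String> words = new TreeSet<String>();
        words.addAll(Arrays.asList(line.split("[^\\p{L}]+")));
        return words;
    }

    public static void main(String[] args) {
        // The line starts with a space, so split() yields a leading "".
        Set<String> words = wordsOf(" It has just 2 'lines'. ");
        System.out.println(words); // prints [, It, has, just, lines]

        // Dropping the empty string leaves only real words.
        words.remove("");
        System.out.println(words); // prints [It, has, just, lines]
    }
}
```

The number 2 and the quotes disappear because they are part of the separators. The empty string only shows up when a line begins with a non-letter; trailing empty strings are discarded by split itself.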

At the end, you should delete the empty string "" from the contentWords set.
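
The getKeywords() stub from the question is not covered by the steps above, and the answer they refer to is not reproduced here. As a rough sketch only, and not necessarily the approach that answer took, the keywords could be pulled out with a regular expression. The class name MetaKeywords and the method keywordsOf are hypothetical, and the pattern assumes the name attribute comes before content and both use double quotes, as in the sample page.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaKeywords {

    // Assumption: attributes appear as name = "keywords" content = "..."
    // in exactly that order, with double quotes, as in the sample page.
    private static final Pattern META_KEYWORDS = Pattern.compile(
            "<meta\\s+name\\s*=\\s*\"keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the comma-separated keywords of the
    // first matching meta tag, trimmed, or an empty set if there is none.
    static Set<String> keywordsOf(String html) {
        Set<String> keywords = new TreeSet<String>();
        Matcher matcher = META_KEYWORDS.matcher(html);
        if (matcher.find()) {
            for (String keyword : matcher.group(1).split(",")) {
                String trimmed = keyword.trim();
                if (!trimmed.isEmpty()) {
                    keywords.add(trimmed);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String html = "<html> <head> <meta name = \"keywords\" "
                + "content = \"deception, intricacy, treachery\"> </head> </html>";
        System.out.println(keywordsOf(html)); // prints [deception, intricacy, treachery]
    }
}
```

A regular expression like this is brittle against real-world markup (single quotes, reordered attributes, tags spanning several lines), so a proper HTML parser would be the safer choice outside of small experiments.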

