Retrieve keywords from the META tag in an HTML web page using Java
I want to retrieve all the content words from an HTML web page, and all the keywords contained in the META tag of the same HTML web page, using Java.
For example, consider this HTML source code:
<html>
  <head>
    <meta name = "keywords" content = "deception, intricacy, treachery">
  </head>
  <body>
    My very short html document.
    <br>
    It has just 2 'lines'.
  </body>
</html>
The content words here are: my, very, short, html, document, it, has, just, lines
Note: the punctuation and the number '2' are ruled out.
The keywords here are: deception, intricacy, treachery
I have created a class for this purpose called WebDoc, but this is as far as I have been able to get.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;

public class WebDoc {
    protected URL _url;
    protected Set<String> _contentWords;
    protected Set<String> _keywords;

    public WebDoc(URL paramURL) {
        _url = paramURL;
    }

    public Set<String> getContents() throws IOException {
        // URL url = new URL(url);
        Set<String> contentWords = new TreeSet<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Process each line.
            contentWords.add(removeTag(inputLine));
            // System.out.println(removeTag(inputLine));
        }
        in.close();
        System.out.println(contentWords);
        _contentWords = contentWords;
        return contentWords;
    }

    public String removeTag(String html) {
        html = html.replaceAll("\\<.*?>", "");
        html = html.replaceAll("&", "");
        return html;
    }

    public Set<String> getKeywords() {
        // no idea!
        return null;
    }

    public URL getURL() {
        return _url;
    }

    @Override
    public String toString() {
        return null;
    }
}
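For the part that getKeywords() leaves open, here is a minimal sketch of one possible approach: match the META tag with a regular expression and split its content attribute on commas. The class name MetaKeywordsSketch and the method extractKeywords are hypothetical names used only for illustration. The pattern assumes the name attribute comes before content and that both values are in double quotes, so it will miss other attribute orders or quoting styles; a real HTML parser would be more robust.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the WebDoc class from the question.
class MetaKeywordsSketch {
    // Matches <meta name="keywords" content="..."> with optional spaces around '='.
    // Assumption: name precedes content and both values use double quotes.
    private static final Pattern META_KEYWORDS = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns the comma-separated keywords of the first matching META tag,
    // or an empty set if the page has no such tag.
    static Set<String> extractKeywords(String html) {
        Set<String> keywords = new TreeSet<>();
        Matcher m = META_KEYWORDS.matcher(html);
        if (m.find()) {
            for (String keyword : m.group(1).split(",")) {
                String trimmed = keyword.trim();
                if (!trimmed.isEmpty()) {
                    keywords.add(trimmed);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String html = "<html> <head> <meta name = \"keywords\" "
            + "content = \"deception, intricacy, treachery\"> </head> </html>";
        System.out.println(extractKeywords(html)); // prints [deception, intricacy, treachery]
    }
}
```

Inside WebDoc, the same idea would need the page text read into a single string first, since getContents() currently processes the page line by line.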
So, after the answer from RedSoxFan on the meta keywords, you only need to split the content lines. You can use a similar method there:
Instead of
contentWords.add(removeTag(inputLine));
use
contentWords.addAll(Arrays.asList(removeTag(inputLine).split("[^\\p{L}]+")));
.split(...) splits the line at non-letters (I hope this works; please try it and report back), giving an array of substrings, each of which should contain only letters, plus possibly some empty strings.
Arrays.asList(...) wraps this array in a list (this needs import java.util.Arrays).
addAll(...) adds all elements of this list to the set (but not duplicates).
At the end you should delete the empty string "" from your contentWords set.
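The steps above can be sketched as a small standalone program. This is only an illustration under some assumptions: the class name ContentWordsSketch and the method contentWords are hypothetical, and removeTag is a simplified copy of the method from the question (it strips tags only).

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical standalone demo of the split / asList / addAll steps.
class ContentWordsSketch {
    // Simplified version of removeTag from the question: strips anything in <...>.
    static String removeTag(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Splits one line into words made only of letters and collects them in a set.
    static Set<String> contentWords(String line) {
        Set<String> words = new TreeSet<>();
        // Split at every run of non-letter characters (\p{L} is any Unicode letter).
        words.addAll(Arrays.asList(removeTag(line).split("[^\\p{L}]+")));
        // A line that starts with a non-letter yields a leading empty string.
        words.remove("");
        return words;
    }

    public static void main(String[] args) {
        String line = "<body> My very short html document. <br> It has just 2 'lines'. </body>";
        System.out.println(contentWords(line));
    }
}
```

Note that the words keep their original case here; lowercasing them before adding them to the set would be a separate step if case-insensitive content words are wanted.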