C# - Strip ALL HTML from a String?


I've seen regex that can remove tags, which is great, but I also have stuff like HTML entities (&quot; for example), etc.

This isn't an HTML file. It's a string. I'm pulling down data from SharePoint web services, which gives me the HTML that users might use or that gets generated, like:

<div>Hello! Please remember to clean the break room!!! &quot;bob&quote; <br> </div>

So, I'm parsing through 100-900 rows with 8-20 columns each.

Take a look at the HTML Agility Pack. It's an HTML parser that you can use to extract the InnerText from HTML nodes in a document.

As has been pointed out many times here on SO, you can't trust HTML parsing to a regular expression. There are times when it might be considered appropriate (for extremely limited tasks), but in general, HTML is too complex and too prone to irregularity. Bad things can happen when you try to parse HTML with regular expressions.
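To make that concrete, here is a small sketch of the kind of tag-stripping regex people usually reach for, and two inputs where it falls short. The class name `RegexStripDemo` and the pattern are illustrative assumptions, not something from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Naive approach (illustrative): delete anything that looks like <...>
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]+>", string.Empty);
    }

    public static void Main()
    {
        // Entities are not tags, so &quot; is left in the output untouched
        Console.WriteLine(StripTags("<div>&quot;bob&quot; <br> </div>"));

        // A '>' inside an attribute value ends the match too early,
        // so the fragment b">link is left behind in the result
        Console.WriteLine(StripTags("<a title=\"a > b\">link</a>"));
    }
}
```

Neither problem is exotic: user-generated markup like the SharePoint data in the question can easily contain both.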

Using a parser such as HAP gives you much more flexibility. A (rough) example of how you might use it for this task:

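    // Requires the HtmlAgilityPack package and "using System.Text;" for StringBuilder
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load("path to your HTML document");

    StringBuilder sb = new StringBuilder();
    foreach (var node in doc.DocumentNode.DescendantNodesAndSelf())
    {
        // Leaf nodes only, so text is not repeated by its ancestors
        if (!node.HasChildNodes)
        {
            sb.AppendLine(node.InnerText);
        }
    }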
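Since the question is about a string rather than a file, a variation along the following lines may be closer to what's needed. It's a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. `HtmlStripper` and `ToPlainText` are made-up names. `LoadHtml` parses markup from a string, and `HtmlEntity.DeEntitize` decodes entities such as &quot; that `InnerText` leaves encoded.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class HtmlStripper
{
    // Hypothetical helper: returns the text content of an HTML fragment held in a string
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file path

        // InnerText concatenates the text of all descendants; entities are still encoded
        string text = doc.DocumentNode.InnerText;

        // Decode entities (&amp;, &quot;, ...) and trim surrounding whitespace
        return HtmlEntity.DeEntitize(text).Trim();
    }
}
```

You could then call `HtmlStripper.ToPlainText(cellValue)` on each column value as you loop over the rows. An unknown entity like the &quote; in the sample will most likely come through unchanged, so check the output on real data.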

You can also perform XPath queries on your document, in case you're only interested in a specific node or set of nodes:

    var nodes = doc.DocumentNode.SelectNodes("your xpath query here");
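One thing to watch for with `SelectNodes`: it returns null rather than an empty collection when nothing matches. Here is a hedged sketch of a wrapper that handles that, again assuming the HtmlAgilityPack package. `XPathTextExtractor` and `TextOf` are illustrative names only.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class XPathTextExtractor
{
    // Hypothetical helper: returns the decoded text of every node matching the XPath query
    public static List<string> TextOf(string html, string xpath)
    {
        var results = new List<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when there are no matches
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }

        return results;
    }
}
```

For example, `XPathTextExtractor.TextOf(html, "//div")` would collect the text of each div element in the fragment.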

Hope this helps.

