c# - Strip ALL HTML from a String?

- August 15, 2011

I've seen that a regex can remove the tags, which is great, but I also have stuff like &nbsp; etc.

This isn't an HTML file, it's a string. I'm pulling down data from SharePoint web services, which gives me the HTML that users might enter or that gets generated, like:

<div>Hello! Please remember to clean the break room!!! &quot;bob&quote; <br> </div>

So, I'm parsing through 100-900 rows with 8-20 columns each.

Take a look at the HTML Agility Pack. It's an HTML parser that you can use to extract the InnerText from the HTML nodes in a document.

As has been pointed out many times here on SO, you can't trust HTML parsing to a regular expression. There are times when it might be considered appropriate (for extremely limited tasks), but in general HTML is too complex and too prone to irregularity. Bad things can happen when you try to parse HTML with regular expressions.
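To make that concrete, here is a minimal sketch of the usual tag-stripping regex tripping over two ordinary inputs. The class and method names are made up for illustration, and the pattern is just a common naive choice, not one taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveStripDemo
{
    // Naive approach: delete anything that looks like <...>.
    public static string StripTagsNaively(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }

    public static void Main()
    {
        // A '>' inside a quoted attribute value ends the match too early,
        // so the tail of the opening tag is left behind in the output.
        Console.WriteLine(StripTagsNaively("<a title=\"a > b\">link</a>"));

        // Entities are not tags, so the pattern never touches them.
        Console.WriteLine(StripTagsNaively("<div>tea&nbsp;time</div>"));
    }
}
```

The first call leaves  b">link behind instead of just link, and the second returns tea&nbsp;time with the entity still in place.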

Using a parser such as HAP gives you much more flexibility. A (rough) example of how you might use it for this task:

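// Requires the HtmlAgilityPack package and: using System.Text;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("path to your HTML document");

StringBuilder content = new StringBuilder();
foreach (var node in doc.DocumentNode.DescendantNodesAndSelf())
{
    // Only leaf nodes are appended, so container elements don't repeat their children's text
    if (!node.HasChildNodes)
    {
        content.AppendLine(node.InnerText);
    }
}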
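Since the question deals with strings rather than files, the same idea can be sketched with LoadHtml, which parses markup held in a string. The HtmlText.Strip helper below is a hypothetical name, and calling HtmlEntity.DeEntitize to turn entities such as &nbsp; and &quot; into plain characters is an assumption about what "strip all HTML" should mean here. Treat it as a starting point, not a definitive implementation.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class HtmlText
{
    // Hypothetical helper: returns the text of an HTML fragment without markup.
    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        // InnerText on the root concatenates the text of all descendants.
        string text = doc.DocumentNode.InnerText;

        // Decode entities (&quot;, &nbsp;, ...) and trim surrounding whitespace.
        return HtmlEntity.DeEntitize(text).Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Strip("<div>Hello! &quot;bob&quot; <br> </div>"));
    }
}
```

For a few hundred rows with a couple of dozen columns each, calling a helper like this once per cell should be cheap, though that is worth measuring on real data.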

You can also perform XPath queries on the document, in case you're only interested in a specific node or set of nodes:

var nodes = doc.DocumentNode.SelectNodes("your XPath query here");
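As one possible query, "//text()" selects every text node, which gives per-node control over what ends up in the result. The sketch below assumes that query, uses a made-up TextNodes.Extract helper, and guards against SelectNodes returning null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class TextNodes
{
    // Hypothetical helper: collects the non-empty text nodes of a fragment.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when there is no match.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null) return result;

        foreach (HtmlNode node in nodes)
        {
            // Decode entities and skip whitespace-only nodes.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) result.Add(text);
        }
        return result;
    }

    public static void Main()
    {
        foreach (string piece in Extract("<p>one</p><p>two</p>"))
        {
            Console.WriteLine(piece);
        }
    }
}
```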

Hope that helps.
