C# 取出HTML里面的文字

最新推荐文章于 2025-06-28 13:33:37 发布

原创最新推荐文章于 2025-06-28 13:33:37 发布 · 2.9k 阅读

0 ·

CC 4.0 BY-SA版权

本文为博主原创文章，未经博主允许不得转载。

文章标签：

#c#

C# 专栏收录该内容

59 篇文章

订阅专栏

本文介绍了一种使用C#实现的去除HTML标签的方法，通过正则表达式匹配并替换常见的HTML标签及特殊字符实体，确保纯文本内容的正确提取。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

        public static string StripHTML(string strHtml)
        {
            string[] aryReg ={
          @"<script[^>]*?>.*?</script>",

          @"<(\/\s*)?!?((\w+:)?\w+)(\w+(\s*=?\s*(([""'])(\\[""'tbnr]|[^\7])*?\7|\w+)|.{0})|\s)*?(\/\s*)?>",
          @"([\r\n])[\s]+",
          @"&(quot|#34);",
          @"&(amp|#38);",
          @"&(lt|#60);",
          @"&(gt|#62);", 
          @"&(nbsp|#160);", 
          @"&(iexcl|#161);",
          @"&(cent|#162);",
          @"&(pound|#163);",
          @"&(copy|#169);",
          @"&#(\d+);",
          @"-->",
          @"<!--.*\n"

         };

            string[] aryRep = {
           "",
           "",
           "",
           "\"",
           "&",
           "<",
           ">",
           " ",
           "\xa1",//chr(161),
           "\xa2",//chr(162),
           "\xa3",//chr(163),
           "\xa9",//chr(169),
           "",
           "\r\n",
           ""
          };

            string newReg = aryReg[0];
            string strOutput = strHtml;
            for (int i = 0; i < aryReg.Length; i++)
            {
                Regex regex = new Regex(aryReg[i], RegexOptions.IgnoreCase);
                strOutput = regex.Replace(strOutput, aryRep[i]);
            }

            strOutput.Replace("<", "");
            strOutput.Replace(">", "");
            strOutput.Replace("\r\n", "");


            return strOutput;
        }