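Reading the Contents of Word Documents and PDFs Programmatically in C#
2013-11-13

A recent project required reading the contents of doc, docx, and pdf files. After searching online for a while, I found some good components that can be used directly, which saved me from reinventing the wheel. Below is a brief overview of the components I used to implement this feature.

Doc documents: Microsoft Word 14.0 Object Library (a GAC component; Word must be installed before it can be called, and the COM version number depends on the installed version of Word)
Docx documents: Microsoft Word 14.0 Object Library (a GAC component; Word must be installed before it can be called, and the COM version number depends on the installed version of Word)
Pdf documents: PDFBox

DEMO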
/*
 * Author: GhostBear
 * Blog: Http://blog.csdn.net/ghostbear
 */
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;
using org.pdfbox.pdmodel;
using org.pdfbox.util;
using Microsoft.Office.Interop.Word;

namespace TestPdfReader
{
    class Program
    {
        static void Main(string[] args)
        {
            // PDF: load the file and extract its text with PDFBox
            PDDocument doc = PDDocument.load(@"C:\resume.pdf");
            PDFTextStripper pdfStripper = new PDFTextStripper();
            string text = pdfStripper.getText(doc);
            doc.close(); // release the PDF file once the text has been extracted

            // Turn tabs and line breaks into spaces, then remove all spaces
            string result = text.Replace("\t", " ").Replace("\r", " ").Replace("\n", " ").Replace(" ", "");
            Console.WriteLine(result);

            // Doc, Docx: both formats are opened the same way through Word
            object docPath = @"C:\resume.doc";
            object docxPath = @"C:\resume.docx"; // pass this to Open instead of docPath to read the docx file
            object missing = System.Reflection.Missing.Value;
            object readOnly = true;

            Application wordApp;
            wordApp = new Application();
            Document wordDoc = wordApp.Documents.Open(ref docPath, ref missing, ref readOnly,
                ref missing, ref missing, ref missing, ref missing, ref missing,
                ref missing, ref missing, ref missing, ref missing, ref missing,
                ref missing, ref missing, ref missing);
            string text2 = FilterString(wordDoc.Content.Text);

            // Close the document and quit Word so no WINWORD.EXE process is left behind
            wordDoc.Close(ref missing, ref missing, ref missing);
            wordApp.Quit(ref missing, ref missing, ref missing);

            Console.WriteLine(text2);
            Console.Read();
        }

        // Removes bell characters (\a), tabs, line feeds, and every other run of whitespace
        private static string FilterString(string input)
        {
            return Regex.Replace(input, @"(\a|\t|\n|\s+)", "");
        }
    }
}
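FilterString above removes every whitespace character. That works well for Chinese text, but it would glue the words of an English document together. Below is a minimal sketch of an alternative that keeps word boundaries by collapsing each run of whitespace into a single space. The names TextCleaner and NormalizeWhitespace are hypothetical and not part of the original demo. It also assumes the bell character (\a) in the original regex is there because Word emits it as a table-cell marker.

```csharp
using System;
using System.Text.RegularExpressions;

static class TextCleaner
{
    // Hypothetical helper, not part of the original demo.
    // Replaces bell characters (\a) with a space, collapses every run of
    // whitespace into one space, and trims both ends of the result.
    public static string NormalizeWhitespace(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        string withoutMarkers = input.Replace("\a", " ");
        return Regex.Replace(withoutMarkers, @"\s+", " ").Trim();
    }
}

class Demo
{
    static void Main()
    {
        // Made-up sample text imitating what a document might return
        string raw = "Name\tJohn Smith\a\r\nSkills:  C#,\n SQL\r\n";

        // prints "Name John Smith Skills: C#, SQL"
        Console.WriteLine(TextCleaner.NormalizeWhitespace(raw));
    }
}
```

In the demo, this helper could be called in place of FilterString, for example NormalizeWhitespace(wordDoc.Content.Text), when the documents contain space-separated languages.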
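Summary

To run this code under IIS, you need to configure DCOM for the "Microsoft Word 14.0 Object Library" component. For details, see the article "Word组件的DCOM配置" (DCOM configuration for the Word component).

Code download: http://download.csdn.net/detail/ghostbear/4847887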