1 answers
Well, you can always open these files in a hex editor or a text editor and see what they contain. RTF is fairly straightforward: it uses a plain-text markup language to describe its formatting, whereas most Office products use a binary format. Open one up and see what it looks like. OpenOffice has successfully reverse-engineered these formats, so you could look at its source code to see how it reads and writes the .doc, .xls and .rtf file formats (www.openoffice.org).
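To give a feel for how readable RTF markup is, here is a rough sketch (not a real parser) that strips control words and group braces from an RTF string. The class and method names are made up for illustration, and it assumes simple input: it does not decode \'hh escapes or \uN Unicode, and the contents of header groups such as the font table would leak into the output.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class RtfSketch
{
    static bool IsAsciiLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Drops control words (\b, \par, \fs24 ...) and braces, keeps the plain text.
    public static string ToPlainText(string rtf)
    {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < rtf.Length)
        {
            char c = rtf[i];
            if (c == '{' || c == '}' || c == '\r' || c == '\n')
            {
                i++; // group delimiters and raw line breaks carry no text
            }
            else if (c == '\\')
            {
                i++;
                if (i >= rtf.Length) break;
                if (!IsAsciiLetter(rtf[i]))
                {
                    // Control symbol: \\ \{ \} are escaped literals, the rest is skipped.
                    char sym = rtf[i];
                    if (sym == '\\' || sym == '{' || sym == '}') sb.Append(sym);
                    i++;
                    continue;
                }
                int start = i;
                while (i < rtf.Length && IsAsciiLetter(rtf[i])) i++;
                string word = rtf.Substring(start, i - start);
                // Optional (possibly negative) numeric parameter, e.g. \fs24.
                if (i + 1 < rtf.Length && rtf[i] == '-' && char.IsDigit(rtf[i + 1])) i++;
                while (i < rtf.Length && char.IsDigit(rtf[i])) i++;
                // A single space after a control word belongs to the control word.
                if (i < rtf.Length && rtf[i] == ' ') i++;
                if (word == "par" || word == "line") sb.Append('\n');
                else if (word == "tab") sb.Append('\t');
            }
            else
            {
                sb.Append(c);
                i++;
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(ToPlainText(@"{\rtf1\ansi Hello \b world\b0\par Bye}"));
    }
}
```

For the sample string in Main this prints "Hello world" and "Bye" on two lines. Real documents need a proper parser that tracks groups and skips header destinations.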
answered 9 months ago by:
1754
I'm actually working on a project that reads .doc and .rtf files, and Microsoft gives you a lot of help. The main problem is that the Word object libraries are unmanaged code while C# is managed code, so you need a bridge that lets your code talk to the Word object libraries. Luckily, the good people at Microsoft have already built this bridge: the Office XP Primary Interop Assemblies (PIAs), which you can download from the Microsoft web site.
You have to reference these in your project. Here is some code that opens a Word document, reads its text into memory, and then closes the file.
// Requires a reference to the Office XP PIAs (Microsoft.Office.Interop.Word).
// filetext and filename are assumed to be declared elsewhere (e.g. a string field and a method parameter).
Microsoft.Office.Interop.Word.Application wordApp = new Microsoft.Office.Interop.Word.ApplicationClass();
//wordApp.Visible = true;
object fileName = Globals.FileDirectory + filename;
object objFalse = false;
object objTrue = true;
object missing = System.Reflection.Missing.Value;
try
{
    // Arguments: FileName, ConfirmConversions = false, ReadOnly = true, AddToRecentFiles = false, ..., Visible = true, ...
    Microsoft.Office.Interop.Word.Document aDoc = wordApp.Documents.Open(ref fileName, ref objFalse, ref objTrue,
        ref objFalse, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref objTrue,
        ref missing, ref missing, ref missing);
    // Select the whole document and copy it to the clipboard.
    aDoc.ActiveWindow.Selection.WholeStory();
    aDoc.ActiveWindow.Selection.Copy();
    // Read the text back from the clipboard, then clear the clipboard.
    System.Windows.Forms.IDataObject data = System.Windows.Forms.Clipboard.GetDataObject();
    filetext = data.GetData(System.Windows.Forms.DataFormats.Text).ToString();
    System.Windows.Forms.Clipboard.SetDataObject(string.Empty);
}
catch (Exception e)
{
    // Let the user know what went wrong.
    Console.WriteLine(e.Message);
}
finally
{
    // Close all open documents and shut Word down, even if reading failed.
    wordApp.Documents.Close(ref missing, ref missing, ref missing);
    wordApp.Application.Quit(ref missing, ref missing, ref missing);
}
Don't forget to reference the Office XP PIA libraries in your project and use this for good and not evil.
answered 9 months ago by:
0
One more thing: you have to have Word installed on the server.
And yes, it took me a while to figure this out.
Someone should write an article on this. *cough* *cough* Salman.
answered 9 months ago by:
0
Thanks for your help. But the problem is that I have to install this application on machines where Word is not installed. Is there any licensing issue regarding the Microsoft Office interop DLL files? Please let me know.
Thanks
Atul
answered 9 months ago by:
0
Well, I don't know how you are going to open a Word document without Word.
Opening Word documents as binary files will give you a lot of junk used for formatting, and it will be very hard to sort out what you want to keep from what you want to bin.
I would look into what mad hatter suggested and use OpenOffice.
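To see that junk for yourself, a sketch along the following lines pulls runs of printable ASCII out of a file's raw bytes, much like the Unix strings tool. The class name, method name and minimum run length are assumptions for illustration. It will miss any text that Word stored as 16-bit Unicode, and it keeps field codes, style names and other non-content strings, which is why this route is hard.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
static class BinaryTextSketch
{
    // Returns every run of printable ASCII bytes that is at least minLength long.
    public static List<string> PrintableRuns(byte[] data, int minLength)
    {
        List<string> runs = new List<string>();
        StringBuilder current = new StringBuilder();
        foreach (byte b in data)
        {
            if (b >= 0x20 && b <= 0x7E)
            {
                current.Append((char)b);
            }
            else
            {
                // A non-printable byte ends the current run.
                if (current.Length >= minLength) runs.Add(current.ToString());
                current.Length = 0;
            }
        }
        if (current.Length >= minLength) runs.Add(current.ToString());
        return runs;
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: pass the path of the file to scan");
            return;
        }
        foreach (string run in PrintableRuns(File.ReadAllBytes(args[0]), 4))
            Console.WriteLine(run);
    }
}
```

Running it against a .doc file typically shows the body text mixed in with font names, paths and template data, with nothing in the output to tell them apart.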
answered 9 months ago by:
0
>is there any licensing issue regarding microsoft office interop dll file.
No, they did it for us.
Doesn't that make you feel special?
answered 9 months ago by:
0
Hi,
I have a program that reads .doc, .rtf and .xls files, but the problem is that I am not able to extract the exact content. The code is given below.
StreamReader reader = null;
StreamWriter writer = null;
// Maps each lower-cased word to a comma-separated list of the positions where it occurs.
SortedList table = new SortedList();
//Hashtable table = new Hashtable();
string logFile = "logfile.txt";
try
{
    reader = new StreamReader(textBox1.Text); // opens the file as text
    writer = new StreamWriter(logFile, false);
    int h = 0; // running word position
    // Iterate one word at a time; each occurrence appends its position to that word's entry.
    for (string line = reader.ReadLine(); line != null; line = reader.ReadLine())
    {
        string[] words = GetWords(line);
        foreach (string word in words)
        {
            string iword = word.ToLower();
            h++;
            if (table.ContainsKey(iword))
            {
                table[iword] = table[iword] + "," + "'" + h + "'";
            }
            else
            {
                table[iword] = "'" + h + "'";
            }
        }
    }
    foreach (DictionaryEntry entry in table)
    {
        writer.WriteLine("{0} ({1})", entry.Key, entry.Value);
    }
} // end try
catch (Exception c)
{
    // writer is still null if opening either file failed.
    if (writer != null)
        writer.WriteLine(c.Message);
}
finally
{
    if (reader != null)
        reader.Close();
    if (writer != null)
        writer.Close();
}
Please try this code and let me know. I also have some C++ code that reads the .doc file without using the Word library; it uses only a file object. I am trying to convert that program to C#.
Please give your comments or suggestions.
Thanks,
Atul
answered 9 months ago by:
0
hi,
I forgot to include these two methods:
static string[] GetWords(string line)
{
    ArrayList al = new ArrayList(); // for intermediate results
    int i = 0;
    string word;
    char[] characters = line.ToCharArray();
    while ((word = GetNextWord(line, characters, ref i)) != null)
        al.Add(word);
    string[] words = new string[al.Count];
    al.CopyTo(words);
    return words;
}
static string GetNextWord(string line, char[] characters, ref int i)
{
    // skip anything that is not a letter or digit
    while (i < characters.Length && !Char.IsLetterOrDigit(characters[i]))
        i++;
    if (i == characters.Length)
        return null;
    int start = i;
    // find the end of the word
    while (i < characters.Length && Char.IsLetterOrDigit(characters[i]))
        i++;
    // return the word
    return line.Substring(start, i - start);
}
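For what it's worth, the same splitting can probably be done in one call with a regular expression. This is a sketch of an alternative, not the code above: the class name is made up, and it assumes the pattern [\p{L}\p{Nd}]+ (letters and decimal digits) matches the same characters as Char.IsLetterOrDigit.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical alternative to the GetWords/GetNextWord pair, for illustration only.
static class WordSplitSketch
{
    // Returns every maximal run of letters and decimal digits in the line.
    public static string[] GetWords(string line)
    {
        MatchCollection matches = Regex.Matches(line, @"[\p{L}\p{Nd}]+");
        string[] words = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            words[i] = matches[i].Value;
        return words;
    }

    static void Main()
    {
        // prints: Hello|world|42nd|time
        Console.WriteLine(string.Join("|", GetWords("Hello, world! 42nd time")));
    }
}
```

Either version only splits text it is given. Fed the raw bytes of a .doc or .xls file through a StreamReader, both will happily return formatting junk as "words".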
answered 9 months ago by:
0
Were you able to do it? I need to read an Excel file too. Let me know how you did it.
answered 9 months ago by:
0
Guys, I am having a stupid problem with Automation in C#.
I can't access the items of the Documents collection by using [].
I get a compiler error.
I tried using .GetEnumerator() and looping with IEnumerator.MoveNext(), but even that returns an 'undefined value'.
This is funny, because while debugging at run time I can simply type
a.Documents[0] and it returns a value,
but the compiler won't take it.
How can I iterate through the Documents collection of an application, let's say
Microsoft.Office.Interop.Word.Application a = new ApplicationClass();
???
answered 9 months ago by:
0
Hi,
Well, if you don't want to use third-party components, you are stuck with automation. I think automation has many issues, so maybe you should think about writing your own methods for reading documents. Here is an <a target="_new" href="http://www.gemboxsoftware.com/Excel2007/DemoApp.htm">article that explains how to read/write XLSX (Excel 2007) files without automation</a>.
Mario
<a target="_new" href="http://www.gemboxsoftware.com">GemBox Software</a>
--
<a target="_new" href="http://www.gemboxsoftware.com/GBSpreadsheet.htm">GemBox.Spreadsheet for .NET - Easily read and write Excel (XLS, XLSX or CSV) or export to HTML files from your .NET apps</a>.
--
answered 9 months ago by:
0
This post was imported from csharpfriends. If you have a similar question, please ask it again.