Hi all,
I'm trying to read a PDF file and return the body as a string. I've found this code:
public static string ParsePdfText(string sourcePDF, int fromPageNum, int toPageNum)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    try
    {
        PdfReader reader = new PdfReader(sourcePDF);
        byte[] pageBytes = null;
        PRTokeniser token = null;
        int tknType = -1;
        string tknValue = string.Empty;
        if (fromPageNum == 0)
        {
            fromPageNum = 1;
        }
        if (toPageNum == 0)
        {
            toPageNum = reader.NumberOfPages;
        }
        if (fromPageNum > toPageNum)
        {
            throw new ApplicationException("Parameter error: The value of fromPageNum can " + "not be larger than the value of toPageNum");
        }
        for (int i = fromPageNum; i <= toPageNum; i += 1)
        {
            pageBytes = reader.GetPageContent(i);
            if ((pageBytes != null))
            {
                token = new PRTokeniser(pageBytes);
                while (token.NextToken())
                {
                    tknType = token.TokenType;
                    tknValue = token.StringValue;
                    if (tknType == PRTokeniser.TK_STRING)
                    {
                        sb.Append(token.StringValue);
                    }
                    //I need to add these additional tests to properly add whitespace to the output string
                    else if (tknType == 1 && tknValue == "-600")
                    {
                        sb.Append(" ");
                    }
                    else if (tknType == 10 && tknValue == "TJ")
                    {
                        sb.Append(" ");
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        return string.Empty;
    }
    return sb.ToString();
}
Unfortunately, the text returned is nothing more than some strange chars. The PDF I'm using to test is a simple file containing only one line of text.
The output is something similar to this:

5 answers
Sorry, I've no idea what could cause the empty strings but, reading through the comments, the iTextSharp approach seems to be very 'hit and miss' to me.
I'd have thought that the files for the first solution could be downloaded from this page and from this one. I'd just use the latest versions.
Rather than struggling with this rigmarole, have you thought about using a free utility such as this one, which you could start via System.Diagnostics.Process? There's also a command-line version of this particular one, which would let you do the conversion 'invisibly'. The only drawback is that it costs $35.
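For what it's worth, starting a converter through System.Diagnostics.Process might look roughly like the sketch below. The executable path and the "input output" argument order are purely assumptions for illustration (the class and method names are made up too), so check the documentation of whichever tool you actually install.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class PdfToTextRunner
{
    // converterExe and the "<input> <output>" argument order are hypothetical;
    // adjust both to match the command-line tool you end up using.
    public static string ConvertPdfToText(string converterExe, string sourcePdf)
    {
        // Write the converter's output to a unique temporary file.
        string outputFile = Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".txt");

        ProcessStartInfo psi = new ProcessStartInfo();
        psi.FileName = converterExe;
        psi.Arguments = string.Format("\"{0}\" \"{1}\"", sourcePdf, outputFile);
        psi.UseShellExecute = false;   // start the exe directly, not via the shell
        psi.CreateNoWindow = true;     // keep the conversion 'invisible'

        try
        {
            using (Process process = Process.Start(psi))
            {
                if (process == null)
                {
                    throw new InvalidOperationException("The converter could not be started.");
                }
                process.WaitForExit();
                if (process.ExitCode != 0)
                {
                    throw new InvalidOperationException(
                        "The converter exited with code " + process.ExitCode);
                }
            }
            return File.ReadAllText(outputFile);
        }
        finally
        {
            // Clean up the temporary file whether or not the conversion worked.
            if (File.Exists(outputFile))
            {
                File.Delete(outputFile);
            }
        }
    }
}
```

Unlike the original method, this throws on failure instead of returning an empty string, which should make it easier to tell a failed conversion from a PDF that genuinely contains no text.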
answered 5 months ago by:
11603
Judging by what it says in this article and the comments thereto (which contain a VB.Net version of your code), trying to use iTextSharp for this task is unreliable at best.
The author suggests using PDFBox/IKVM.NET instead, though a drawback here is that it's on the big side (16 MB).
Disappointing that the .NET Framework doesn't have any PDF tools, though this is probably because it's a proprietary format.
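As a rough idea of what the PDFBox route looks like from C#, here is a minimal sketch that keeps the same signature as your method. It assumes the IKVM-compiled PDFBox assemblies are referenced and that the namespaces are org.pdfbox.* as in the older 0.7.x builds (later Apache releases use org.apache.pdfbox.* and have changed some signatures), so treat the names as assumptions to verify against the version you download.

```csharp
using System;
// Assumption: PDFBox 0.7.x namespaces; Apache releases use org.apache.pdfbox.*
using org.pdfbox.pdmodel;
using org.pdfbox.util;

public static class PdfBoxTextExtractor
{
    public static string ParsePdfText(string sourcePDF, int fromPageNum, int toPageNum)
    {
        PDDocument doc = PDDocument.load(sourcePDF);
        try
        {
            // Same defaults as the original method: 0 means "first" / "last" page.
            if (fromPageNum == 0)
            {
                fromPageNum = 1;
            }
            if (toPageNum == 0)
            {
                toPageNum = doc.getNumberOfPages();
            }
            if (fromPageNum > toPageNum)
            {
                throw new ArgumentException(
                    "fromPageNum cannot be larger than toPageNum");
            }

            // PDFTextStripper decodes the text using the fonts' encodings,
            // rather than appending the raw string tokens of the content stream.
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(fromPageNum);
            stripper.setEndPage(toPageNum);
            return stripper.getText(doc);
        }
        finally
        {
            doc.close();
        }
    }
}
```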
answered 5 months ago by:
11603
I actually tried the solution in that article, and this one, but I wasn't able to find the correct files for the first one, and the second one just returns empty strings.
answered 5 months ago by:
1822
Any idea what would cause the empty strings? Also, where the **** do I get the files I need for the first solution? :-P
answered 5 months ago by:
1822
Thanks :)
I've been doing some further testing and it seems only some files return these strange chars / empty strings - your assumption of "hit and miss" seems dead on :-P
For now, I'll keep my iTextSharp solution for the beta, and I think I'll go with the command line version of a-pdf for the full version of my app once it's been tested ... who knows, the project might get scrapped anyway ;)
answered 5 months ago by:
1822