Perl Read Pdf File


Perl Read A File

How does Perl read in files, how does it tell it to advance to the next line in the text file, and how does it make it read all lines in the .txt file until, for example, it reaches item 'banana'?


Befall

3 Answers

Basically, there are two ways of reading files:

Slurping a file means reading the file all at once. This uses a lot of memory and takes a while, but afterwards the whole file contents are in memory and you can do what you want with it.
Reading a file line by line (in a while loop) is better if you don't want to read the entire file (for example, if you want to stop when you reach 'banana').

For both ways you need to create a FILEHANDLE using the 'open' command, like so:
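The answer's original snippet is not preserved here, so what follows is only a minimal sketch of opening a file for reading. It assumes a lexical handle and the three-argument form of open; the temporary file and its contents are made up so the example can run anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small sample file (name and contents are made up for illustration).
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "apple\nbanana\ncherry\n";
close($tmp) or die "Cannot close $path: $!";

# Open the file for reading; $fh is the filehandle.
open(my $fh, '<', $path) or die "Cannot open $path: $!";

# fileno returns a defined descriptor number for an open handle.
my $is_open = defined(fileno($fh)) ? 1 : 0;
print "handle is open\n" if $is_open;

close($fh) or die "Cannot close $path: $!";
```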

Then you can either slurp the file by putting it into an array:
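A minimal sketch of slurping into an array. It assumes an in-memory string standing in for a real file (Perl can open a handle on a reference to a scalar); with a real file you would pass its path to open instead.

```perl
use strict;
use warnings;

# Stand-in for a real file: a handle opened on a string.
my $text = "apple\nbanana\ncherry\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

# List context: every line is read at once, one array element per line.
my @lines = <$fh>;
close($fh);

chomp(@lines);   # strip the trailing newline from each element
print scalar(@lines), " lines read\n";
```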

or read the file line by line using a while loop:
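A minimal sketch of the line-by-line loop, again assuming an in-memory string in place of a real file. Only one line is held in $line at a time.

```perl
use strict;
use warnings;

my $text = "apple\nbanana\ncherry\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

my $count = 0;
while (my $line = <$fh>) {   # each pass reads the next line
    chomp $line;             # remove the trailing newline
    $count++;
    print "$count: $line\n";
}

close($fh);                  # done with the handle
```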

Afterwards, don't forget to close the file.

That's just the basics. There's a lot more to do with files, especially in exception handling (what to do when the file does not exist, is not readable, or is being written to), so you'll have to read up or ask away :)

Konerak

René and Konerak wrote a couple of pretty good responses that show how to open and read a file. Unfortunately, they have some issues in terms of promoting best practices. So, I'll come late to the party and try to add a clear explanation of the best-practice approach and why it is better.

What is a file handle?

A file handle is a name we use that represents the file itself. When you want to operate on a file (read it, write to it, move around, etc.) use the file handle to indicate which file to operate on. A file handle is distinct from the file's name or path.

Variable scope and file handles

A variable's scope determines in what parts of a program the variable can be seen. In general, it is a good idea to keep the scope of every variable as small as possible so that different parts of a complex program don't break each other.

The easiest way to strictly control a variable's scope in Perl is to make it a lexical variable. Lexical variables are only visible inside the block in which they are declared. Use my to declare a lexical variable: my $foo;

Perl file handles can be global or lexical. When you use open with a bare word (a literal string without quotes or a sigil), you create a global handle. When you call open on an undefined lexical scalar, you create a lexical handle.

The big problem with global file handles is that they are visible anywhere in the program. So if I create a file handle named FOO in a subroutine, I have to be very careful to ensure that I don't use the same name in another routine, or if I use the same name, I must be absolutely certain that under no circumstances can they conflict with each other. The simple alternative is to use a lexical handle that cannot have the same kind of name conflicts.

Another benefit of lexical handles is that it is easy to pass them around as subroutine arguments.
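A small sketch of passing a lexical handle to a subroutine. The helper count_lines is hypothetical, and the in-memory string stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the lines remaining on any readable handle.
sub count_lines {
    my ($fh) = @_;
    my $count = 0;
    while (defined(my $line = <$fh>)) {
        $count++;
    }
    return $count;
}

my $text = "one\ntwo\nthree\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

# The handle is an ordinary scalar, so it is passed like any other argument.
my $n = count_lines($fh);
close($fh);

print "$n lines\n";
```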

The open function

The open function has all kinds of features. It can run subprocesses, read files, and even provide a handle for the contents of a scalar. You can feed it many different types of argument lists. It is very powerful and flexible, but these features come with some gotchas (executing subprocesses is not something you want to do by accident).

For the simple case of opening a file, it is best to always use the 3-argument form, open(FILEHANDLE, MODE, FILEPATH), because it prevents unintended activation of all those special features:

FILEHANDLE is the file handle to open.

MODE is how to open the file: > for overwrite, >> for write in append mode, +> for read and write (this truncates the file first; +< reads and writes without truncating), and < for read.

FILEPATH is the path to the file to open.

On success, open returns a true value. On failure, $! is set to indicate the error, and a false value is returned.

So, to make a lexical file handle with a 3-argument open that we can use to read a file:
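The original snippet is missing, so this is a sketch of what such a call can look like. The sample file is created only so the example runs; its name and contents are assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sample file so the sketch is runnable; the contents are made up.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "first line\n";
close($tmp) or die "Cannot close $path: $!";

# 'my $fh' declares the lexical handle right inside the call to open.
my $opened = open(my $fh, '<', $path);   # true on success, false on failure

# Read one line only if the open succeeded.
my $first = $opened ? <$fh> : undef;
print "read: $first" if defined $first;
```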

The logical return values make it easy to check for errors:

I like to bring the error handling down to a new line and indent it, but that's personal style.
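A sketch of that error-checking style, with the "or die" clause on its own indented line. The path is deliberately a file that does not exist, and the eval wrapper is only there so the example can show the message instead of exiting.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/no-such-file.txt";   # deliberately missing

my $opened = eval {
    open(my $fh, '<', $path)
        or die "Cannot open '$path' for reading: $!\n";
    1;   # reached only when open succeeded
};
my $message = $@;

print "open failed as expected: $message" unless $opened;
```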


Closing handles

When you use global handles it is critical to carefully, explicitly close each and every handle when you are done with it. Failure to do so can lead to odd bugs and maintainability problems.

Lexical handles automatically close when the variable is destroyed (when the reference count drops to 0, usually when the variable goes out of scope).

When using lexical handles it is common to rely on the implicit closure of handles rather than explicitly closing them.
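A sketch of implicit closure, assuming a made-up file name in a temporary directory. The write handle lives only inside the bare block, and the data can be read back right after the block ends.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/greeting.txt";   # hypothetical file name

{
    open(my $out, '>', $path) or die "Cannot open $path: $!";
    print {$out} "hello\n";
}   # $out goes out of scope here: the handle is flushed and closed

# The data is already on disk, so a fresh handle can read it back.
open(my $in, '<', $path) or die "Cannot open $path: $!";
my $content = do { local $/; <$in> };
close($in);

print "read back: $content";
```

One trade-off: an implicit close cannot report an error, so when writing data that matters, an explicit close with an error check is still worth the extra line.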

Diamonds are a Perl's best friend.

The diamond operator, <>, allows us to iterate over a file handle. Like open it has superpowers. We'll ignore most of them for now. (Search for info on the input record separator, output record separator and the NULL file handle to learn about them, though.)

The important thing is that in scalar context (e.g. assigning to a scalar) it acts like a readline function. In list context (e.g. assigning to an array) it acts like a read_all_lines function.
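A short sketch of the two contexts, using an in-memory string as a stand-in for a real file.

```perl
use strict;
use warnings;

my $text = "first\nsecond\nthird\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

my $one  = <$fh>;   # scalar context: reads only the next line
my @rest = <$fh>;   # list context: reads all remaining lines

close($fh);

print "one line, then ", scalar(@rest), " more\n";
```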

Imagine you want to read a data file with three header lines (date, time and location) and a bunch of data lines:
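The original sample file and code are not preserved, so this sketch assumes a layout of one header value per line followed by one data value per line. All values are made up.

```perl
use strict;
use warnings;

# Assumed layout: date, time, location, then the data lines.
my $text = "2010-05-12\n14:30\nOslo\n1.5\n2.5\n3.5\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

chomp(my $date     = <$fh>);   # scalar context: one line each
chomp(my $time     = <$fh>);
chomp(my $location = <$fh>);
chomp(my @data     = <$fh>);   # list context: everything that is left

close($fh);

print "$location on $date at $time: ", scalar(@data), " readings\n";
```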

It's common to hear people talk about slurping a file. This means to read the whole file into a variable at once.
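One common way to slurp into a single scalar is to undefine the input record separator, $/, for a limited scope. This is a sketch with an in-memory string in place of a real file, not necessarily the approach the original snippet used.

```perl
use strict;
use warnings;

my $text = "line one\nline two\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

my $contents = do {
    local $/;   # no record separator inside this block only
    <$fh>;      # so a single read returns the whole file
};

close($fh);

print length($contents), " characters read\n";
```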

Putting it all together
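The original code for this section is missing. As a stand-in, here is a sketch that combines the pieces discussed so far: a lexical handle, 3-argument open, error checking, the diamond operator in both contexts, and implicit close. The helper read_data_file and the sample data are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: reads three header lines, then the data lines.
sub read_data_file {
    my ($path) = @_;

    open(my $fh, '<', $path)
        or die "Cannot open '$path' for reading: $!\n";

    chomp(my $date     = <$fh>);
    chomp(my $time     = <$fh>);
    chomp(my $location = <$fh>);
    chomp(my @data     = <$fh>);

    # No explicit close: $fh is closed when it goes out of scope.
    return {
        date     => $date,
        time     => $time,
        location => $location,
        data     => \@data,
    };
}

# Sample input, made up for illustration.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "2010-05-12\n14:30\nOslo\n1.5\n2.5\n3.5\n";
close($tmp) or die "Cannot close $path: $!";

my $record = read_data_file($path);
print "$record->{location}: ", scalar @{ $record->{data} }, " readings\n";
```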

Putting it all together - special extra credit edition

Why so many different ways? Why so many gotchas?

Perl is an old language; it has baggage dating all the way back to 1987. Over the years various design issues were found and fixes were made--but only rarely were fixes allowed to harm backwards compatibility.

Further, Perl is designed to give you the flexibility to do what you want to, when you want to. It is very permissive. The good thing about this is that you can reach down into the murky depths and do really cool magical stuff. The bad thing is that it is easy to shoot yourself in the foot if you forget to temper your exuberance and fail to focus on producing readable code.

Just because you've got more than enough rope, doesn't mean that you have to hang yourself.

daotoad

First, you have to open the file:

You might want to check if the opening of the file was successful:

After opening the file, you can read line by line from $SOME_FILEHANDLE. You get the next line with the <$SOME_FILEHANDLE> construct:

$next_line is undefined after reading the last line. So, you can put the entire thing into a while loop:

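This works because the read returns an undefined value at the end of the file, which ends the loop. (Perl implicitly wraps this kind of while condition in defined, so a line containing only 0 does not end the loop early.)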
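A sketch of that loop, using an in-memory string in place of a real file. The input is chosen so that the last line is a bare 0 with no trailing newline, which is a false value in Perl; the loop still processes it because the condition is tested for definedness.

```perl
use strict;
use warnings;

# Made-up input: the final line is "0" without a newline.
my $text = "apple\n0";
open(my $SOME_FILEHANDLE, '<', \$text)
    or die "Cannot open in-memory file: $!";

my @lines;
while (my $next_line = <$SOME_FILEHANDLE>) {
    chomp $next_line;
    push @lines, $next_line;
}
close($SOME_FILEHANDLE);

print scalar(@lines), " lines read\n";
```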

If you want to exit the loop when 'banana' is encountered, you will probably use a regular expression to check for banana:

The last operator exits the while loop and it is 'triggered' when $next_line matches banana.
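A sketch of the early exit, again with made-up in-memory input. Note that the pattern /banana/ matches anywhere in the line, so "banana split" also stops the loop.

```perl
use strict;
use warnings;

my $text = "apple\nbanana split\ncherry\n";
open(my $SOME_FILEHANDLE, '<', \$text)
    or die "Cannot open in-memory file: $!";

my @before_banana;
while (my $next_line = <$SOME_FILEHANDLE>) {
    last if $next_line =~ /banana/;   # leave the loop; 'cherry' is never read
    chomp $next_line;
    push @before_banana, $next_line;
}
close($SOME_FILEHANDLE);

print "lines before banana: @before_banana\n";
```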

René Nyffenegger

I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from the command line (i.e., using Perl's system function) to extract text from PDF files, and this method works fine.

The problem is that we have symbols like α, β and other special characters in the PDF files which are not being displayed in the generated txt file. Also, a few extra spaces are being added randomly in the text.

Is there a better and more reliable way to extract text from PDF files such that the text will include all the symbols like α, β etc and the text will exactly match the text in the PDF (i.e without extra spaces)?

Pawan Rao

9 Answers

You can use these modules to extract text from a PDF.

From CPAN

This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.

All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.

joe

You may never get an appropriate solution to your problem. The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course).

Andrew Barnett

I'm not a Perl user but I imagine you'll struggle to find a better free text extractor than pdftotext.

pdftotext usually recognises non-ASCII characters fine. Is it possible that it is extracting them correctly, but the app you're using to view the text file isn't using the correct encoding? If pdftotext on Windows is the same as the one on my Linux system, then it defaults to exporting as UTF-8.
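To illustrate the encoding point in Perl, here is a small sketch using the core Encode module. It shows how a correctly extracted α, stored as UTF-8, looks wrong when the same bytes are interpreted as Latin-1. The choice of Latin-1 as the "wrong" viewer encoding is an assumption for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $alpha = "\x{3B1}";                      # Greek small letter alpha
my $bytes = encode('UTF-8', $alpha);        # what a UTF-8 text file contains
my $wrong = decode('ISO-8859-1', $bytes);   # what a Latin-1 viewer would show
my $right = decode('UTF-8', $bytes);        # what a UTF-8 viewer would show

binmode(STDOUT, ':encoding(UTF-8)');
printf "bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $bytes;
print  "read as Latin-1: $wrong\n";
print  "read as UTF-8:   $right\n";
```

If the extracted file shows two odd characters where a single Greek letter should be, the extraction itself is probably fine and the file just needs to be read as UTF-8.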

James Healy

Well, I tried 2-3 Perl modules like CAM::PDF and PDF::API2, but the problem remains the same! I'm parsing a PDF file containing main pages. CAM::PDF and PDF::API2 parse the plain text very well. However, they are not able to parse the code snippets [code snippets are usually in a different font and encoding than plain text].

Mandar Pande

PDF2TXT.py: this is what I use. Although it is Python, it works flawlessly.

Ryan Ward

James Healy is correct. After trying CAM::PDF and PDF::API2 (I've had some success reading text with the former), downloading pdftotext worked great for a number of my implementations.

If you are on Windows, download the precompiled xpdf binary here: http://www.foolabs.com/xpdf/download.html

Then, if you need to run this within Perl, use system, e.g.: system("C:\\Utilities\\xpdfbin-win-3.04\\bin64\\pdftotext.exe $saveName");

where $saveName is the full path to your PDF file.

This hopefully leaves you with a text file you can open and parse in perl.
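A sketch of the whole round trip, under a few assumptions: pdftotext is reachable under the name passed in, the helpers pdftotext_command and extract_text are hypothetical, and the -enc UTF-8 option is used to request UTF-8 output explicitly. The list form of system avoids building one command string, which helps with paths that contain spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the argument list for pdftotext.
# -enc UTF-8 asks pdftotext to write its output as UTF-8.
sub pdftotext_command {
    my ($exe, $pdf, $txt) = @_;
    return ($exe, '-enc', 'UTF-8', $pdf, $txt);
}

# Hypothetical helper: converts $pdf to $txt and returns the extracted text.
sub extract_text {
    my ($exe, $pdf, $txt) = @_;

    system(pdftotext_command($exe, $pdf, $txt)) == 0
        or die "pdftotext failed with status $?\n";

    # Decode the output as UTF-8 so characters such as α and β survive.
    open(my $fh, '<:encoding(UTF-8)', $txt)
        or die "Cannot open '$txt' for reading: $!\n";
    local $/;
    my $text = <$fh>;
    return $text;
}

# Runs only when a PDF path is given on the command line.
if (@ARGV) {
    my ($pdf) = @ARGV;
    (my $txt = $pdf) =~ s/\.pdf\z/.txt/i;
    $txt .= '.txt' if $txt eq $pdf;   # avoid overwriting the input
    binmode(STDOUT, ':encoding(UTF-8)');
    print extract_text('pdftotext', $pdf, $txt);
}
```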

user3869653

I tried this module, which works fine for special characters in PDFs.

selva kumar

Take a look at PDFBox. It is a library, but I think it also comes with a tool for text extraction.

Per Arneng