Bug 380429

Summary:	app-text/pdfshuffler : will not allow me to export files which it has opened into a .pdf. once I click 'save' the box just hangs indefinitely.
Product:	Gentoo Linux	Reporter:	greenbean127
Component:	Current packages	Assignee:	No maintainer - Look at https://wiki.gentoo.org/wiki/Project:Proxy_Maintainers if you want to take care of it <maintainer-needed>
Status:	RESOLVED WORKSFORME
Severity:	normal
Priority:	Normal
Version:	unspecified
Hardware:	AMD64
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---

Description greenbean127 2011-08-24 02:37:02 UTC

I open up multiple pdf files and manipulate them as desired.  When I click 'export' to save the newly built document as a .pdf, the dialog box appears. I click 'save' and the button changes color to show it's been clicked, but then it just hangs there and does nothing.  If I click 'cancel', nothing happens.  I have to actually close the dialog box using the close button at the top of the window.

Reproducible: Always

Steps to Reproduce:
1. Open up several .pdf's and manipulate them.
2. Click export.
3. Click save.
Actual Results:  
Dialog box just hangs and file is not saved.

Expected Results:  
Should create a new .pdf in the specified directory.

none.

Comment 1 Rafał Mużyło 2011-08-24 03:50:06 UTC

Any debug messages in the terminal ?

Comment 2 greenbean127 2011-08-24 04:31:37 UTC

(In reply to comment #1)
> Any debug messages in the terminal ?

Yes, there are. I should have provided an example in the bug report, sorry.  Here you go:

Traceback (most recent call last):
  File "/usr/bin/pdfshuffler", line 417, in choose_export_pdf_name
    self.export_to_file(file_out)
  File "/usr/bin/pdfshuffler", line 438, in export_to_file
    pdfdoc_inp = PdfFileReader(file(pdfdoc.copyname, 'rb'))
  File "/usr/lib64/python2.7/site-packages/pyPdf/pdf.py", line 374, in __init__
    self.read(stream)
  File "/usr/lib64/python2.7/site-packages/pyPdf/pdf.py", line 751, in read
    offset, generation = line[:16].split(" ")
ValueError: too many values to unpack

Comment 3 greenbean127 2011-08-24 15:19:16 UTC

I found this bug reported on both the Red Hat and Debian bug trackers.  On the Debian tracker, someone pointed out that the bug only seems to happen with pdfs created by simple-scan.  I was able to reproduce this on my system.

If I open multiple pdfs that were not created by simple-scan, then I am able to export the pages and create a new .pdf document by clicking 'export' and then 'save'.  It should be noted that the pdf files I used successfully in pdfshuffler were not created on my system at all.

I was able to reproduce the problem again in pdfshuffler by trying to merge and export pdf files that I created myself using simple-scan.  Here is the output from the terminal:

Traceback (most recent call last):
  File "/usr/bin/pdfshuffler", line 417, in choose_export_pdf_name
    self.export_to_file(file_out)
  File "/usr/bin/pdfshuffler", line 438, in export_to_file
    pdfdoc_inp = PdfFileReader(file(pdfdoc.copyname, 'rb'))
  File "/usr/lib64/python2.7/site-packages/pyPdf/pdf.py", line 374, in __init__
    self.read(stream)
  File "/usr/lib64/python2.7/site-packages/pyPdf/pdf.py", line 751, in read
    offset, generation = line[:16].split(" ")
ValueError: too many values to unpack
 
I hope this helps.  Thanks again :)

Comment 4 greenbean127 2011-08-24 16:10:43 UTC

The problem certainly has to do with simple-scan created .pdfs.  I just scanned 2 documents using xsane, imported them into pdfshuffler and manipulated them.  I was able to save the two documents as a single document as expected.  There appears to be something in the way simple-scan creates pdfs that pdfshuffler does not like.  I am not sure how to proceed from here, as this is my first bug filing.  Thanks.

Comment 5 Rafał Mużyło 2011-08-24 20:23:26 UTC

Well, you may argue that it's pyPDF that's broken, but actually Simple Scan produces corrupted pdfs.
It took some googling, but the result was:
PDF Reference, section 3.4.3 
...
Following this line are the cross-reference entries themselves, one per line.
Each entry is exactly 20 bytes long, including the end-of-line marker.
The format of an in-use entry is as follows:
nnnnnnnnnn ggggg n eol

where
nnnnnnnnnn is a 10-digit byte offset
ggggg is a 5-digit generation number
n is a literal keyword identifying this as an in-use entry
eol is a 2-character end-of-line sequence

...
The cross-reference entry for a free object has essentially the same format, except that the keyword is f instead of n and the interpretation of the first item is different:
nnnnnnnnnn ggggg f eol

where
nnnnnnnnnn is the 10-digit object number of the next free object
ggggg is a 5-digit generation number
f is a literal keyword identifying this as a free entry
eol is a 2-character end-of-line sequence
...
If the file’s end-of-line marker is a single character (either a carriage return or a line feed), it is preceded by a single space; if the marker is 2 characters (both a carriage return and a line feed), it is not preceded by a space.

Well, many pdf files seem to forget the space, if they use just '\n',
but what Simple Scan puts in is:
"%010zu 0000 n\n".printf (offset)
so not only does it miss the space (pyPDF has a workaround for this, although given the entry format above that workaround is not really correct),
but the generation number is also one digit short, and that is what causes the failure.
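
To make the failure concrete, here is a minimal sketch that applies the same slice-and-split seen in the traceback (`line[:16].split(" ")`) to a well-formed entry and to an entry built from the Simple Scan format string quoted above. The helper name `split_xref_prefix` and the sample offset 15 are made up for illustration; this is not pyPDF's actual code, only the one expression from the traceback.

```python
# Illustrative only: reproduces the slicing from the traceback, not pyPDF itself.

def split_xref_prefix(line):
    """Split the first 16 characters of an xref entry on single spaces."""
    return line[:16].split(" ")

# Well-formed in-use entry: 10-digit offset, 5-digit generation,
# keyword, then a 2-character end-of-line (space + line feed here).
good_entry = "0000000015 00000 n \n"

# What the format string "%010zu 0000 n\n" produces for offset 15
# (Python's %010d stands in for C's %010zu).
bad_entry = "%010d 0000 n\n" % 15

print(len(good_entry), split_xref_prefix(good_entry))
print(len(bad_entry), split_xref_prefix(bad_entry))

try:
    offset, generation = split_xref_prefix(bad_entry)
except ValueError as exc:
    # The 4-digit generation leaves a trailing space inside the 16-character
    # slice, so the split yields a third, empty field.
    print("unpack failed:", exc)
```

With the short generation number the 16-character slice ends in a space, so the split returns three fields instead of two and the two-name unpacking raises ValueError.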

Comment 6 greenbean127 2011-08-25 06:16:14 UTC

(In reply to comment #5)

That's interesting.  Thanks for your help.  This is my first time participating in the bug process. Should I file a bug with pyPDF then, or is this bug sufficient?  I am sure we would want the issue resolved, right?  Thanks again for all your help.  It's been a good learning experience.

Comment 7 Rafał Mużyło 2011-08-25 18:16:04 UTC

Well, it's a bit complicated.

IMHO:
- pdfshuffler might use "try...except..." block around 'pdfdoc_inp = PdfFileReader(file(pdfdoc.copyname, 'rb'))' line to drop broken files out of queue
- pyPDF should do nothing - it's not supposed to handle arbitrarily corrupted files
- simple-scan needs to be fixed upstream - I suspect https://bugs.launchpad.net/simple-scan/+bug/662144 is an old report regarding this very problem


Due to the number of packages involved, I'm just CCing, instead of assigning.
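
As a rough sketch of the "try...except..." idea for pdfshuffler, the loop below skips files whose reader constructor raises instead of letting the exception escape into the dialog callback. The names `load_readable`, `open_reader` and `fake_reader` are hypothetical; in pdfshuffler the callable would wrap the `PdfFileReader(file(pdfdoc.copyname, 'rb'))` call from the traceback, and a stand-in is used here so the example runs without pyPDF.

```python
def load_readable(paths, open_reader):
    """Return (readers, skipped); files that fail to open go into skipped."""
    readers, skipped = [], []
    for path in paths:
        try:
            readers.append(open_reader(path))
        except (ValueError, IOError) as exc:
            # Remember the broken file and the reason so the user can be told.
            skipped.append((path, str(exc)))
    return readers, skipped


def fake_reader(path):
    # Stand-in for the real PDF reader: pretends simple-scan files are broken.
    if "simple-scan" in path:
        raise ValueError("too many values to unpack")
    return "reader for " + path


readers, skipped = load_readable(["a.pdf", "simple-scan.pdf", "b.pdf"], fake_reader)
print(readers)
print(skipped)
```

Collecting the skipped files rather than silently dropping them would let the application report which inputs were left out of the export.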

Comment 8 Rafał Mużyło 2011-09-02 11:30:20 UTC

I've managed to reach the author of simple-scan. Its next release will no longer produce broken pdfs and will have an option to fix the broken files generated by older versions. After all, the fix for both the app and the files is trivial.
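
For illustration, one way such a file fix could look is sketched below. This is not the code from simple-scan; `repair_xref_entries` and the sample bytes are hypothetical. It assumes a single cross-reference table at the end of the file (so the startxref offset, which points at the start of the table, stays valid when entries grow) and that the short-entry pattern does not also occur by chance inside binary stream data.

```python
import re

# Matches an entry such as b"0000000015 0000 n\n": 10-digit offset,
# 4-digit generation, keyword, bare line feed.
SHORT_ENTRY = re.compile(rb"^(\d{10}) (\d{4}) ([nf])\n", re.MULTILINE)


def repair_xref_entries(data):
    """Rewrite short entries as 20-byte entries (5-digit generation, space + LF)."""
    return SHORT_ENTRY.sub(rb"\1 0\2 \3 \n", data)


# Made-up table: the first entry is already well-formed, the second is short.
sample = b"xref\n0 2\n0000000000 65535 f \n0000000015 0000 n\n"
fixed = repair_xref_entries(sample)
print(fixed)
```

Padding the generation number with a leading zero keeps its value, and well-formed entries are left untouched because they do not match the 4-digit pattern.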

Comment 9 Pacho Ramos gentoo-dev 2012-03-04 09:56:36 UTC

I guess this is solved now with newer simple-scan versions...