
Mailing List Archive: Wikipedia: Wikitech

Extension:Pdfhandler

 

 



lugusto at gmail

Dec 24, 2008, 12:57 PM

Post #1 of 6 (1238 views)
Extension:Pdfhandler

What is the current status of Extension:Pdfhandler and bugzilla:11215? Does it
need more testing, or will it never be enabled because of Adobe patents? How
can Wikimedians, geeks and non-geeks alike, help debug that extension?

I'm asking because I have approximately 30 GB of public domain scans in .pdf
format to upload to Commons over the next few months (see
http://en.wikisource.org/w/index.php?oldid=928004#Royal_Society_Digital_Archive_only_for_3_Months_FREE
for further information) and because I fully agree with the reasons
listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=11215#c3

[Yeah, I'm asking this on Christmas Day... Extension:Pdfhandler and
Extension:ABC would make good Christmas gifts for the Wikisource wikis ;-) ]
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


brion at wikimedia

Dec 24, 2008, 3:52 PM

Post #2 of 6 (1169 views)
Re: Extension:Pdfhandler [In reply to]

On 12/24/08 12:57 PM, Luiz Augusto wrote:
> What is the current status of Extension:Pdfhandler and bugzilla:11215? Does it
> need more testing, or will it never be enabled because of Adobe patents? How
> can Wikimedians, geeks and non-geeks alike, help debug that extension?

AFAIK there's no patent/openness issues at stake as long as encrypted
PDFs aren't being used...

> I'm asking because I have approximately 30 GB of public domain scans in .pdf
> format to upload to Commons over the next few months (see
> http://en.wikisource.org/w/index.php?oldid=928004#Royal_Society_Digital_Archive_only_for_3_Months_FREE
> for further information) and because I fully agree with the reasons
> listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=11215#c3

It seemed to work ok the last time I tested it and did some tweaks...
Before a live test deployment, the software dependencies will need to be
installed, which should be pretty straightforward.
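
For anyone preparing a test install, here is a minimal sketch of a pre-deployment check. It assumes the dependencies are Ghostscript (`gs`), ImageMagick (`convert`) and `pdfinfo`; this thread does not list them, so treat those names as assumptions and adjust them to whatever the extension's documentation specifies.

```python
import shutil

# Assumed tool names; the thread does not say which dependencies are needed.
ASSUMED_TOOLS = ("gs", "convert", "pdfinfo")


def find_missing(tools=ASSUMED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the executables are on PATH; it says nothing about versions or configuration.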

> [Yeah, I'm asking this on Christmas Day... Extension:Pdfhandler and
> Extension:ABC would make good Christmas gifts for the Wikisource wikis ;-) ]

:)

-- brion



nospam at vyznev

Dec 25, 2008, 9:52 AM

Post #3 of 6 (1165 views)
Re: Extension:Pdfhandler [In reply to]

Luiz Augusto wrote:
>
> I'm asking because I have approximately 30 GB of public domain scans in .pdf
> format to upload to Commons over the next few months (see
> http://en.wikisource.org/w/index.php?oldid=928004#Royal_Society_Digital_Archive_only_for_3_Months_FREE
> for further information) and because I fully agree with the reasons
> listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=11215#c3

Assuming that these are scanned documents that haven't been vectorized,
have you considered converting them to DjVu format? Not only does
Wikimedia currently have better support for it than PDF, but you might
realize some file size savings. Apparently, there's software out there
to more or less automate it.

Of course, that doesn't in any way preclude or remove the need for
_also_ improving our PDF support. But PDF, as common and useful as it
is, might not be the optimal format here.

--
Ilmari Karonen




lugusto at gmail

Dec 25, 2008, 11:40 AM

Post #4 of 6 (1162 views)
Re: Extension:Pdfhandler [In reply to]

On Thu, Dec 25, 2008 at 3:52 PM, Ilmari Karonen <nospam [at] vyznev> wrote:

> Luiz Augusto wrote:
> >
> > I'm asking because I have approximately 30 GB of public domain scans in .pdf
> > format to upload to Commons over the next few months (see
> > http://en.wikisource.org/w/index.php?oldid=928004#Royal_Society_Digital_Archive_only_for_3_Months_FREE
> > for further information) and because I fully agree with the reasons
> > listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=11215#c3
>
> Assuming that these are scanned documents that haven't been vectorized,
> have you considered converting them to DjVu format? Not only does
> Wikimedia currently have better support for it than PDF, but you might
> realize some file size savings. Apparently, there's software out there
> to more or less automate it.


Someone asked this on en.wikisource and I replied with this:
http://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&diff=prev&oldid=928130

DjVu (or at least all the conversion tools and configuration options that I've
tried in the past months, including the LizardTech Document Express Enterprise
pdf2djvu and png2djvu options) is a lossy format. If I convert a .pdf
downloaded from Google Book Search I get a low-quality file (70 dpi or
150 dpi per page), but if I extract the images from the same .pdf file using
Adobe Acrobat Pro 8 I get a 600 dpi JPEG for each page (OCR software
normally recommends using 300 dpi images).


>
> Of course, that doesn't in any way preclude or remove the need for
> _also_ improving our PDF support.


Surely :)


> But PDF, as common and useful as it
> is, might not be the optimal format here.
>

Well, all digitized works from all the libraries that I know of (from Europe,
the United States and Brazil) are available only in .pdf format. The
Internet Archive is the only one to make both .pdf and .djvu available for
the same book (the .djvu version from IA is also a low-quality file, but it
is at least delivered with high-quality OCR embedded in the .djvu file, thanks
to some closed-source, paid OCR software [ABBYY FineReader, I believe]).


nospam at vyznev

Dec 25, 2008, 1:44 PM

Post #5 of 6 (1165 views)
Re: Extension:Pdfhandler [In reply to]

Luiz Augusto wrote:
>
> Someone asked this on en.wikisource and I replied with this:
> http://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&diff=prev&oldid=928130
>
> DjVu (or at least all the conversion tools and configuration options that I've
> tried in the past months, including the LizardTech Document Express Enterprise
> pdf2djvu and png2djvu options) is a lossy format. If I convert a .pdf
> downloaded from Google Book Search I get a low-quality file (70 dpi or
> 150 dpi per page), but if I extract the images from the same .pdf file using
> Adobe Acrobat Pro 8 I get a 600 dpi JPEG for each page (OCR software
> normally recommends using 300 dpi images).

Hmm... well, that sucks. :( DjVu is indeed a lossy format, but there's
lossy and then there's lossy. Minor loss of fidelity in the
reproduction of the paper grain at 600 dpi would be perfectly reasonable
-- but turning a 600 dpi scan into 150 dpi certainly isn't.
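
To put rough numbers on that loss: pixel count scales with the square of the resolution, so a 150 dpi copy of a 600 dpi scan keeps only 1/16 of the pixels. The sketch below works this out for the resolutions mentioned in this thread. The 6 x 9 inch page size is a hypothetical value chosen for illustration; nothing in the thread states the actual page dimensions.

```python
def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size (inches) at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def retained_fraction(source_dpi, target_dpi):
    """Fraction of the source pixels that remain after downsampling."""
    return (target_dpi / source_dpi) ** 2


# Hypothetical page size in inches; not taken from the thread.
PAGE = (6.0, 9.0)

if __name__ == "__main__":
    for dpi in (600, 300, 150, 70):
        w, h = pixel_size(*PAGE, dpi)
        share = retained_fraction(600, dpi)
        print(f"{dpi:>3} dpi: {w} x {h} px, {share:.1%} of the 600 dpi pixels")
```

This only counts pixels; it says nothing about the additional loss from whatever compression the converter applies.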

Unfortunately, I'm not familiar enough with the DjVu format to suggest
any solutions. I'd assume the format must have support for higher
resolutions, even if most conversion software might not readily support
it, but never having actually used any such software I can't really say.

--
Ilmari Karonen



jayvdb at gmail

Dec 25, 2008, 3:46 PM

Post #6 of 6 (1164 views)
Re: Extension:Pdfhandler [In reply to]

On Fri, Dec 26, 2008 at 5:40 AM, Luiz Augusto <lugusto [at] gmail> wrote:
> On Thu, Dec 25, 2008 at 3:52 PM, Ilmari Karonen <nospam [at] vyznev> wrote:
>
>> Luiz Augusto wrote:
>> >
>> > I'm asking because I have approximately 30 GB of public domain scans in .pdf
>> > format to upload to Commons over the next few months (see
>> > http://en.wikisource.org/w/index.php?oldid=928004#Royal_Society_Digital_Archive_only_for_3_Months_FREE
>> > for further information) and because I fully agree with the reasons
>> > listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=11215#c3
>>
>> Assuming that these are scanned documents that haven't been vectorized,
>> have you considered converting them to DjVu format? Not only does
>> Wikimedia currently have better support for it than PDF, but you might
>> realize some file size savings. Apparently, there's software out there
>> to more or less automate it.

Large batches of scans should be converted to DjVu, as it is a better
format. PDF support will be useful for small tasks where the person
already has a PDF (or it has already been uploaded to Commons) and
doesn't want to learn lots of tools before starting to see results.
In other words, PDF support will make Wikisource more accessible.

> Someone asked this on en.wikisource and I replied with this:
> http://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&diff=prev&oldid=928130
>
> DjVu (or at least all the conversion tools and configuration options that I've
> tried in the past months, including the LizardTech Document Express Enterprise
> pdf2djvu and png2djvu options) is a lossy format. If I convert a .pdf
> downloaded from Google Book Search I get a low-quality file (70 dpi or
> 150 dpi per page), but if I extract the images from the same .pdf file using
> Adobe Acrobat Pro 8 I get a 600 dpi JPEG for each page (OCR software
> normally recommends using 300 dpi images).

My understanding is that the compression is optional, and the lossy
compression is much better than the equivalent lossy compression of
PDF.

I think it is the free PDF-to-image extraction tools that are causing
your problems.

>> Of course, that doesn't in any way preclude or remove the need for
>> _also_ improving our PDF support.
>
>
> Surely :)
>
>
>> But PDF, as common and useful as it
>> is, might not be the optimal format here.
>>
>
> Well, all digitized works from all the libraries that I know of (from Europe,
> the United States and Brazil) are available only in .pdf format. The
> Internet Archive is the only one to make both .pdf and .djvu available for
> the same book (the .djvu version from IA is also a low-quality file, but it
> is at least delivered with high-quality OCR embedded in the .djvu file, thanks
> to some closed-source, paid OCR software [ABBYY FineReader, I believe]).

I have found the djvu files from IA to be of an appropriate quality,
especially for transcription purposes. The PDFs are usually much
larger, and not much better quality.

--
John Vandenberg

