[Haskell-cafe] Converting wiki pages into pdf

Michael Snoyman michael at snoyman.com
Fri Sep 9 14:44:25 CEST 2011


On Fri, Sep 9, 2011 at 3:16 PM, mukesh tiwari
<mukeshtiwari.iiitm at gmail.com> wrote:
>
> Thank you all for replying. I managed to write a Python script. It depends
> on PyQt4. I am curious if we have anything like PyQt4 in Haskell.
>
> import sys
> from PyQt4.QtCore import *
> from PyQt4.QtGui import *
> from PyQt4.QtWebKit import *
>
> #http://www.rkblog.rk.edu.pl/w/p/webkit-pyqt-rendering-web-pages/
> #http://pastebin.com/xunfQ959
> #http://bharatikunal.wordpress.com/2010/01/31/converting-html-to-pdf-with-python-and-qt/
> #http://www.riverbankcomputing.com/pipermail/pyqt/2009-January/021592.html
>
> # Slot for loadFinished: print the loaded page to the PDF printer and quit.
> def convertFile():
>         web.print_(printer)
>         print "done"
>         QApplication.exit()
>
>
> if __name__ == "__main__":
>         url = raw_input("enter url:")
>         filename = raw_input("enter file name:")
>         app = QApplication(sys.argv)
>         web = QWebView()          # off-screen WebKit view that loads the page
>         web.load(QUrl(url))
>         #web.show()
>         printer = QPrinter(QPrinter.HighResolution)
>         printer.setPageSize(QPrinter.A4)
>         printer.setOutputFormat(QPrinter.PdfFormat)
>         printer.setOutputFileName(filename + ".pdf")
>         QObject.connect(web, SIGNAL("loadFinished(bool)"), convertFile)
>         sys.exit(app.exec_())
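
One simple route from Haskell, rather than driving WebKit directly, is to
shell out to the wkhtmltopdf tool I mention at the end of this message. A
minimal, untested sketch, assuming wkhtmltopdf is installed and on the PATH:

import System.Environment ( getArgs )
import System.Process ( rawSystem )

-- Render a URL to PDF by delegating the WebKit work to the external
-- wkhtmltopdf program, e.g.:  runghc ToPdf.hs http://example.com out
main :: IO ()
main = do
  [ url, name ] <- getArgs          -- expects exactly two arguments
  exitCode <- rawSystem "wkhtmltopdf" [ url, name ++ ".pdf" ]
  print exitCode
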
>
>
> On Fri, Sep 9, 2011 at 11:03 AM, Matti Oinas <matti.oinas at gmail.com> wrote:
>>
>> The whole Wikipedia database can also be downloaded if that is any help.
>>
>> http://en.wikipedia.org/wiki/Wikipedia:Database_download
>>
>> There is also text on that site saying "Please do not use a web
>> crawler to download large numbers of articles. Aggressive crawling of
>> the server can cause a dramatic slow-down of Wikipedia."
>>
>> Matti
>>
>> 2011/9/9 Kyle Murphy <orclev at gmail.com>:
>> > It's worth pointing out at this point (as alluded to by Conrad) that what
>> > you're attempting might be considered somewhat rude, and possibly slightly
>> > illegal (depending on the insanity of the legal system in question).
>> > Automated site scraping (what you're essentially doing) is generally
>> > frowned upon by most hosts unless it follows some very specific guidelines,
>> > usually at a minimum respecting the restrictions specified in the
>> > robots.txt file contained in the domain's root. Furthermore, depending on
>> > the type of data in question, and if a EULA was agreed to if the site
>> > requires an account, doing any kind of automated processing might be
>> > disallowed. Now, I think Wikipedia has a fairly lenient policy, or at least
>> > I hope it does considering it's community-driven, but depending on how much
>> > of Wikipedia you're planning on crawling you might at the very least
>> > consider severely throttling the process to keep from sucking up too much
>> > bandwidth.
>> >
>> > On the topic of how to actually perform that crawl, you should probably
>> > check out the format of the link provided in the download PDF element.
>> > After looking at an article (note, I'm basing this off a quick glance at a
>> > single page) it looks like you should be able to modify the URL provided in
>> > the "Permanent link" element to generate the PDF link by changing the title
>> > argument to arttitle, adding a new title argument with the value
>> > "Special:Book", and adding the new arguments "bookcmd=render_article" and
>> > "writer=rl". For example, if the permanent link to the article is:
>> >
>> > http://en.wikipedia.org/w/index.php?title=Shapinsay&oldid=449266269
>> >
>> > then the PDF URL is:
>> >
>> > http://en.wikipedia.org/w/index.php?arttitle=Shapinsay&oldid=449266269&title=Special:Book&bookcmd=render_article&writer=rl
>> >
>> > This is all rather hacky as well, and none of it has been tested, so it
>> > might not actually work, although I see no reason why it shouldn't. It's
>> > also fragile: if Wikipedia changes just about anything it could all break,
>> > but that's the risk you run any time you resort to site scraping.
>> >
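For what it's worth, here is that URL rewrite as a small, untested Haskell
function, using exactly the parameter names described above (the article
title and oldid are just the example values from this message):

-- Build the speculative "render this article as a PDF" URL from an
-- article title and an oldid, following the scheme sketched above.
renderUrl :: String -> String -> String
renderUrl article oldid =
  "http://en.wikipedia.org/w/index.php?arttitle=" ++ article
    ++ "&oldid=" ++ oldid
    ++ "&title=Special:Book&bookcmd=render_article&writer=rl"

-- e.g. renderUrl "Shapinsay" "449266269"
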
>> > -R. Kyle Murphy
>> > --
>> > Curiosity was framed, Ignorance killed the cat.
>> >
>> >
>> > On Thu, Sep 8, 2011 at 23:40, Conrad Parker <conrad at metadecks.org>
>> > wrote:
>> >>
>> >> On Sep 9, 2011 7:33 AM, "mukesh tiwari" <mukeshtiwari.iiitm at gmail.com>
>> >> wrote:
>> >> >
>> >> > Thank you for the reply, Daniel. Considering my limited knowledge of
>> >> > web programming and JavaScript, first I need to simulate some sort of
>> >> > browser in my program which will run the JavaScript and generate the
>> >> > PDF. After that I can download the PDF. Is this what you mean? Is
>> >> > Network.Browser any help for this purpose? Is there a way to solve
>> >> > this problem?
>> >> > Sorry for the many questions, but this is my first web application
>> >> > program and I am trying hard to finish it.
>> >> >
>> >>
>> >> Have you tried finding out if simple URLs exist for this that don't
>> >> require JavaScript? Does Wikipedia have a policy on this?
>> >>
>> >> Conrad.
>> >>
>> >> >
>> >> > On Fri, Sep 9, 2011 at 4:17 AM, Daniel Patterson
>> >> > <lists.haskell at dbp.mm.st> wrote:
>> >> >>
>> >> >> It looks to me like the link is generated by JavaScript, so unless you
>> >> >> can script an actual browser into the loop, it may not be a viable
>> >> >> approach.
>> >> >>
>> >> >> On Sep 8, 2011, at 3:57 PM, mukesh tiwari wrote:
>> >> >>
>> >> >> > I tried to use the PDF-generation facilities. I wrote a script which
>> >> >> > generates the rendering URL. When I paste the rendering URL into a
>> >> >> > browser it generates the download file, but when I try to get the
>> >> >> > tags, the result is empty. Could someone please tell me what is wrong
>> >> >> > with the code?
>> >> >> > Thank you,
>> >> >> > Mukesh Tiwari
>> >> >> >
>> >> >> > import Network.HTTP
>> >> >> > import Text.HTML.TagSoup
>> >> >> > import Data.Maybe
>> >> >> >
>> >> >> > parseHelp :: Tag String -> Maybe String
>> >> >> > parseHelp ( TagOpen _ y ) =
>> >> >> >   if filter ( \( _ , b ) -> b == "Download a PDF version of this wiki page" ) y /= []
>> >> >> >     then Just $ "http://en.wikipedia.org" ++ snd ( y !! 0 )
>> >> >> >     else Nothing
>> >> >> >
>> >> >> > parse :: [ Tag String ] -> Maybe String
>> >> >> > parse [] = Nothing
>> >> >> > parse ( x : xs )
>> >> >> >   | isTagOpen x = case parseHelp x of
>> >> >> >                     Just s  -> Just s
>> >> >> >                     Nothing -> parse xs
>> >> >> >   | otherwise = parse xs
>> >> >> >
>> >> >> > main = do
>> >> >> >   x <- getLine
>> >> >> >   tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest x )  -- open url
>> >> >> >   let lst = head . sections ( ~== "<div class=portal id=p-coll-print_export>" ) $ tags_1
>> >> >> >       url = fromJust . parse $ lst  -- rendering url
>> >> >> >   putStrLn url
>> >> >> >   tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest url )
>> >> >> >   print tags_2
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
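Since that rendering URL triggers a file download in the browser, the
response is presumably the PDF bytes (or a redirect to them) rather than an
HTML page, so running it through parseTags won't show anything useful. A
minimal, untested sketch of fetching it as raw bytes instead, still with
the HTTP package but using a ByteString request:

import Network.HTTP
import Network.URI ( parseURI )
import qualified Data.ByteString as B

-- Fetch a URL as raw bytes and write them straight to a file. Assumes
-- the server answers with the document body itself (simpleHTTP does not
-- follow redirects or wait for any "rendering in progress" page).
savePdf :: String -> FilePath -> IO ()
savePdf url out =
  case parseURI url of
    Nothing  -> putStrLn ( "could not parse URL: " ++ url )
    Just uri -> do
      body <- getResponseBody =<< simpleHTTP ( mkRequest GET uri :: Request B.ByteString )
      B.writeFile out body
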
>> >> >>
>> >> >
>> >> >
>> >> >
>> >>
>> >>
>> >
>> >
>> >
>> >
>>
>>
>>
>> --
>> /*******************************************************************/
>>
>> try {
>>    log.trace("Id=" + request.getUser().getId() + " accesses " +
>> manager.getPage().getUrl().toString())
>> } catch(NullPointerException e) {}
>>
>> /*******************************************************************/
>>
>> This is real code, but please make the world a bit better place and
>> don’t do it, ever.
>>
>> *
>> http://www.javacodegeeks.com/2011/01/10-tips-proper-application-logging.html
>> *
>
>
>
>

I've actually used wkhtmltopdf[1] for this kind of stuff in the past.

[1] http://code.google.com/p/wkhtmltopdf/
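
On the command line that's roughly "wkhtmltopdf <url> <output.pdf>"
(assuming the tool is installed), which is also what the System.Process
sketch near the top of this message shells out to.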


