[Haskell-cafe] Converting wiki pages into pdf

Fri Sep 9 14:16:34 CEST 2011

Thank you all for replying. I managed to write a Python script. It depends
on PyQt4. I am curious whether we have anything like PyQt4 in Haskell.

import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

#http://www.rkblog.rk.edu.pl/w/p/webkit-pyqt-rendering-web-pages/
#http://pastebin.com/xunfQ959
#http://bharatikunal.wordpress.com/2010/01/31/converting-html-to-pdf-with-python-and-qt/
#http://www.riverbankcomputing.com/pipermail/pyqt/2009-January/021592.html

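# Slot for loadFinished: print the loaded page to the PDF printer, then quit.
# Uses the module-level names web and printer that are set up below.
def convertFile( ):
        web.print_( printer )
        print "done"
        QApplication.exit()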

if __name__=="__main__":
        url = raw_input("enter url:")
        filename = raw_input("enter file name:")
        app = QApplication( sys.argv )
        web = QWebView()
        web.load(QUrl( url ))
        #web.show()
        printer = QPrinter( QPrinter.HighResolution )
        printer.setPageSize( QPrinter.A4 )
        printer.setOutputFormat( QPrinter.PdfFormat )
        printer.setOutputFileName(  filename + ".pdf" )
        QObject.connect( web ,  SIGNAL("loadFinished(bool)"), convertFile  )
        sys.exit(app.exec_())

On Fri, Sep 9, 2011 at 11:03 AM, Matti Oinas <matti.oinas at gmail.com> wrote:

> The whole wikipedia database can also be downloaded if that is any help.
>
> http://en.wikipedia.org/wiki/Wikipedia:Database_download
>
> There is also text in that site saying "Please do not use a web
> crawler to download large numbers of articles. Aggressive crawling of
> the server can cause a dramatic slow-down of Wikipedia."
>
> Matti
>
> 2011/9/9 Kyle Murphy <orclev at gmail.com>:
> > It's worth pointing out at this point (as alluded to by Conrad) that what
> > you're attempting might be considered somewhat rude, and possibly slightly
> > illegal (depending on the insanity of the legal system in question).
> > Automated site scraping (what you're essentially doing) is generally
> > frowned upon by most hosts unless it follows some very specific
> > guidelines, usually at a minimum respecting the restrictions specified in
> > the robots.txt file contained in the domain's root. Furthermore, depending
> > on the type of data in question, and if a EULA was agreed to if the site
> > requires an account, doing any kind of automated processing might be
> > disallowed. Now, I think Wikipedia has a fairly lenient policy, or at
> > least I hope it does considering it's community driven, but depending on
> > how much of Wikipedia you're planning on crawling you might at the very
> > least consider severely throttling the process to keep from sucking up too
> > much bandwidth.
> >
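
A rough sketch of that advice in Python 3 (the script at the top of this
message is Python 2), using only the standard library. The user agent string,
the 5 second default delay, and all three function names are assumptions
chosen for illustration, not anything Wikipedia prescribes.

```python
import time
from urllib import robotparser

# Hypothetical user agent; choose one that identifies your own script.
USER_AGENT = "wiki2pdf-sketch"


def make_robot_parser(robots_lines):
    """Build a parser from the lines of an already downloaded robots.txt."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def throttled(items, delay_seconds=5.0, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item
```

The idea would be to filter the list of article URLs with allowed_urls first,
then loop over throttled(...) and fetch one page per iteration.
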
> > On the topic of how to actually perform that crawl, you should probably
> > check out the format of the link provided in the download PDF element.
> > After looking at an article (note, I'm basing this off a quick glance at a
> > single page) it looks like you should be able to modify the URL provided
> > in the "Permanent link" element to generate the PDF link by changing the
> > title argument to arttitle, adding a new title argument with the value
> > "Special:Book", and adding the new arguments "bookcmd=render_article" and
> > "writer=rl". For example if the permanent link to the article is:
> >
> > http://en.wikipedia.org/w/index.php?title=Shapinsay&oldid=449266269
> >
> > Then the PDF URL is:
> >
> > http://en.wikipedia.org/w/index.php?arttitle=Shapinsay&oldid=449266269&title=Special:Book&bookcmd=render_article&writer=rl
> >
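
The rewrite described in the quoted paragraph can be sketched in Python 3
with urllib.parse. The function name pdf_url_from_permalink is made up for
this example, it uses the "writer=rl" argument as named in the prose, and,
like the suggestion itself, it has not been tested against the live site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def pdf_url_from_permalink(permalink):
    """Turn a "Permanent link" URL into the PDF-rendering URL described above."""
    parts = urlsplit(permalink)
    params = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        # The article name moves from "title" to "arttitle".
        if key == "title":
            key = "arttitle"
        params.append((key, value))
    # New arguments that ask the Special:Book page to render the article.
    params += [
        ("title", "Special:Book"),
        ("bookcmd", "render_article"),
        ("writer", "rl"),
    ]
    # safe=":" keeps "Special:Book" readable instead of encoding the colon.
    query = urlencode(params, safe=":")
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))
```

Called on the Shapinsay permanent link from the example, this returns that
link with title renamed to arttitle and the three new arguments appended.
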
> > This is all rather hacky as well, and none of it has been tested so it
> > might not actually work, although I see no reason why it shouldn't. It's
> > also fragile, as if Wikipedia changes just about anything it could all
> > break, but that's the risk you run anytime you resort to site scraping.
> >
> > -R. Kyle Murphy
> > --
> > Curiosity was framed, Ignorance killed the cat.
> >
> >
> > On Thu, Sep 8, 2011 at 23:40, Conrad Parker <conrad at metadecks.org> wrote:
> >>
> >> On Sep 9, 2011 7:33 AM, "mukesh tiwari" <mukeshtiwari.iiitm at gmail.com>
> >> wrote:
> >> >
> >> > Thank you for the reply, Daniel. Considering my limited knowledge of web
> >> > programming and JavaScript, I first need to simulate some sort of
> >> > browser in my program, which will run the JavaScript and generate the
> >> > PDF. After that I can download the PDF. Is this what you mean? Is
> >> > Network.Browser any help for this purpose? Is there a way to solve this
> >> > problem?
> >> > Sorry for the many questions, but this is my first web application
> >> > program and I am trying hard to finish it.
> >> >
> >>
> >> Have you tried finding out if simple URLs exist for this, that don't
> >> require Javascript? Does Wikipedia have a policy on this?
> >>
> >> Conrad.
> >>
> >> >
> >> > On Fri, Sep 9, 2011 at 4:17 AM, Daniel Patterson
> >> > <lists.haskell at dbp.mm.st> wrote:
> >> >>
> >> >> It looks to me that the link is generated by javascript, so unless you
> >> >> can script an actual browser into the loop, it may not be a viable
> >> >> approach.
> >> >>
> >> >> On Sep 8, 2011, at 3:57 PM, mukesh tiwari wrote:
> >> >>
> >> >> > I tried to use the PDF-generation facilities. I wrote a script which
> >> >> > generates the rendering URL. When I paste the rendering URL into a
> >> >> > browser it generates the download file, but when I try to get the
> >> >> > tags, the result is empty. Could someone please tell me what is wrong
> >> >> > with the code?
> >> >> > Thank You
> >> >> > Mukesh Tiwari
> >> >> >
> >> >> > import Network.HTTP
> >> >> > import Text.HTML.TagSoup
> >> >> > import Data.Maybe
> >> >> >
> >> >> > parseHelp :: Tag String -> Maybe String
> >> >> > parseHelp ( TagOpen _ y ) =
> >> >> >   if ( filter ( \( a , b ) -> b == "Download a PDF version of this wiki page" ) y ) /= []
> >> >> >     then Just $ "http://en.wikipedia.org" ++ ( snd $ y !! 0 )
> >> >> >     else Nothing
> >> >> >
> >> >> >
> >> >> > parse :: [ Tag String ] -> Maybe String
> >> >> > parse [] = Nothing
> >> >> > parse ( x : xs )
> >> >> >   | isTagOpen x = case parseHelp x of
> >> >> >                        Just s -> Just s
> >> >> >                        Nothing -> parse xs
> >> >> >   | otherwise = parse xs
> >> >> >
> >> >> >
> >> >> > main = do
> >> >> >       x <- getLine
> >> >> >       tags_1 <-  fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest x ) --open url
> >> >> >       let lst =  head . sections ( ~== "<div class=portal id=p-coll-print_export>" ) $ tags_1
> >> >> >           url =  fromJust . parse $ lst  --rendering url
> >> >> >       putStrLn url
> >> >> >       tags_2 <-  fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest url )
> >> >> >       print tags_2
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > _______________________________________________
> >> >> > Haskell-Cafe mailing list
> >> >> > Haskell-Cafe at haskell.org
> >> >> > http://www.haskell.org/mailman/listinfo/haskell-cafe
> >> >>
> >> >
> >>
> >
>
>
>
> --
> /*******************************************************************/
>
> try {
>    log.trace("Id=" + request.getUser().getId() + " accesses " + manager.getPage().getUrl().toString())
> } catch(NullPointerException e) {}
>
> /*******************************************************************/
>
> This is a real code, but please make the world a bit better place and
> don’t do it, ever.
>
> http://www.javacodegeeks.com/2011/01/10-tips-proper-application-logging.html
>