[Haskell-cafe] Converting wiki pages into pdf

Matti Oinas matti.oinas at gmail.com
Fri Sep 9 07:33:48 CEST 2011


The whole Wikipedia database can also be downloaded, if that is any help.

http://en.wikipedia.org/wiki/Wikipedia:Database_download

There is also a notice on that site saying "Please do not use a web
crawler to download large numbers of articles. Aggressive crawling of
the server can cause a dramatic slow-down of Wikipedia."
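
If you only need the article data, grabbing a dump file involves no
crawling at all. A minimal sketch (the file name below is only an
example, check the download page for the current ones, and for the
multi-gigabyte dumps a dedicated downloader is a better idea than
holding the whole body in memory like this):

import Network.HTTP

-- Download a single dump file over plain HTTP and save it locally.
main :: IO ()
main = do
  let url = "http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml"
  body <- getResponseBody =<< simpleHTTP ( getRequest url )
  writeFile "enwiki-latest-abstract.xml" body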

Matti

2011/9/9 Kyle Murphy <orclev at gmail.com>:
> It's worth pointing out at this point (as alluded to by Conrad) that what
> you're attempting might be considered somewhat rude, and possibly slightly
> illegal (depending on the insanity of the legal system in question).
> Automated site scraping (which is essentially what you're doing) is generally
> frowned upon by most hosts unless it follows some very specific guidelines,
> usually at a minimum respecting the restrictions specified in the robots.txt
> file in the domain's root. Furthermore, depending on the type of data in
> question, and whether a EULA was agreed to (if the site requires an account),
> doing any kind of automated processing might be disallowed. Now, I think
> Wikipedia has a fairly lenient policy, or at least I hope it does considering
> it's community-driven, but depending on how much of Wikipedia you're planning
> to crawl you should at the very least consider severely throttling the
> process to keep from sucking up too much bandwidth.
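>
> To make "throttling" concrete, something along these lines is what I
> have in mind. An untested sketch, and the 5-second delay is an
> arbitrary number of my own, not an official limit:
>
> import Network.HTTP
> import Control.Concurrent ( threadDelay )
> import Control.Monad ( forM )
>
> -- Fetch a list of URLs, sleeping between requests so we don't
> -- hammer the server.
> fetchPolitely :: [String] -> IO [String]
> fetchPolitely urls = forM urls $ \u -> do
>   body <- getResponseBody =<< simpleHTTP ( getRequest u )
>   threadDelay ( 5 * 1000000 )  -- threadDelay takes microseconds
>   return body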
>
> On the topic of how to actually perform that crawl, you should probably
> check out the format of the link provided in the download PDF element. After
> looking at an article (note, I'm basing this off a quick glance at a single
> page) it looks like you should be able to modify the URL provided in the
> "Permanent link" element to generate the PDF link by changing the title
> argument to arttitle, adding a new title argument with the value
> "Special:Book", and adding the new arguments "bookcmd=render_article" and
> "writer=rl". For example if the permanent link to the article is:
>
> http://en.wikipedia.org/w/index.php?title=Shapinsay&oldid=449266269
>
> Then the PDF URL is:
>
> http://en.wikipedia.org/w/index.php?arttitle=Shapinsay&oldid=449266269&title=Special:Book&bookcmd=render_article&writer=rl
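>
> In code, that rewrite could look something like this. Again untested,
> and the parameter names are only my guesses from that one page; there's
> no substring-replace in the Prelude so I've sketched one:
>
> import Data.List ( isPrefixOf )
>
> -- Replace the first occurrence of one substring with another.
> replaceFirst :: String -> String -> String -> String
> replaceFirst _ _ [] = []
> replaceFirst old new s@( c : cs )
>   | old `isPrefixOf` s = new ++ drop ( length old ) s
>   | otherwise          = c : replaceFirst old new cs
>
> -- Rewrite a "Permanent link" URL into the guessed PDF rendering URL.
> toPdfUrl :: String -> String
> toPdfUrl link = replaceFirst "title=" "arttitle=" link
>              ++ "&title=Special:Book&bookcmd=render_article&writer=rl"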
>
> This is all rather hacky as well, and none of it has been tested, so it
> might not actually work, although I see no reason why it shouldn't. It's
> also fragile: if Wikipedia changes just about anything it could all break,
> but that's the risk you run any time you resort to site scraping.
>
> -R. Kyle Murphy
> --
> Curiosity was framed, Ignorance killed the cat.
>
>
> On Thu, Sep 8, 2011 at 23:40, Conrad Parker <conrad at metadecks.org> wrote:
>>
>> On Sep 9, 2011 7:33 AM, "mukesh tiwari" <mukeshtiwari.iiitm at gmail.com>
>> wrote:
>> >
>> > Thank you for the reply, Daniel. Considering my limited knowledge of web
>> > programming and JavaScript, it seems I first need to simulate some sort of
>> > browser in my program which will run the JavaScript and generate the
>> > PDF, and after that I can download the PDF. Is that what you mean? Is
>> > Network.Browser any help for this purpose? Is there a way to solve this
>> > problem?
>> > Sorry for the many questions, but this is my first web application program
>> > and I am trying hard to finish it.
>> >
>>
>> Have you tried finding out if simple URLs exist for this, that don't
>> require Javascript? Does Wikipedia have a policy on this?
>>
>> Conrad.
>>
>> >
>> > On Fri, Sep 9, 2011 at 4:17 AM, Daniel Patterson
>> > <lists.haskell at dbp.mm.st> wrote:
>> >>
>> >> It looks to me like the link is generated by JavaScript, so unless you
>> >> can script an actual browser into the loop, it may not be a viable
>> >> approach.
>> >>
>> >> On Sep 8, 2011, at 3:57 PM, mukesh tiwari wrote:
>> >>
>> >> > I tried to use the PDF-generation facilities. I wrote a script which
>> >> > generates the rendering URL. When I paste the rendering URL into a
>> >> > browser it generates the download file, but when I try to get the
>> >> > tags, the result is empty. Could someone please tell me what is
>> >> > wrong with the code?
>> >> > Thank You
>> >> > Mukesh Tiwari
>> >> >
>> >> > import Network.HTTP
>> >> > import Text.HTML.TagSoup
>> >> > import Data.Maybe
>> >> >
>> >> > -- If this open tag carries the "Download a PDF version of this wiki
>> >> > -- page" title attribute, build the full link from its first attribute
>> >> > -- (the href); otherwise give Nothing.
>> >> > parseHelp :: Tag String -> Maybe String
>> >> > parseHelp ( TagOpen _ y )
>> >> >   | any ( \( _ , b ) -> b == "Download a PDF version of this wiki page" ) y
>> >> >               = Just $ "http://en.wikipedia.org" ++ snd ( head y )
>> >> >   | otherwise = Nothing
>> >> > parseHelp _   = Nothing
>> >> >
>> >> > -- Walk the tag list and return the first link parseHelp accepts.
>> >> > parse :: [ Tag String ] -> Maybe String
>> >> > parse [] = Nothing
>> >> > parse ( x : xs )
>> >> >   | isTagOpen x = case parseHelp x of
>> >> >                     Just s  -> Just s
>> >> >                     Nothing -> parse xs
>> >> >   | otherwise   = parse xs
>> >> >
>> >> > main :: IO ()
>> >> > main = do
>> >> >   x <- getLine
>> >> >   tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest x )  -- open url
>> >> >   let lst = head . sections ( ~== "<div class=portal id=p-coll-print_export>" ) $ tags_1
>> >> >       url = fromJust . parse $ lst  -- rendering url
>> >> >   putStrLn url
>> >> >   tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest url )
>> >> >   print tags_2



-- 
/*******************************************************************/

try {
   log.trace("Id=" + request.getUser().getId() + " accesses " +
       manager.getPage().getUrl().toString());
} catch (NullPointerException e) {}

/*******************************************************************/

This is real code, but please make the world a bit better place and
don't do it, ever.

* http://www.javacodegeeks.com/2011/01/10-tips-proper-application-logging.html *


