Let's have a look at a config file!
<ripping-session>
  <epub-title><![CDATA[File epub title]]></epub-title>
  <epub-language>it</epub-language>
  <epub-filename>BERTFilename.epub</epub-filename>
  <!-- Optional proxy information required if you are behind a proxy -->
  <net-proxy-host>10.1.1.1</net-proxy-host>
  <net-proxy-port>8080</net-proxy-port>
  <net-proxy-username><![CDATA[proxyuser]]></net-proxy-username>
  <net-proxy-password><![CDATA[proxypwd]]></net-proxy-password>
  <list-provider id="Booksblog">
    <provider-title><![CDATA[Booksblog]]></provider-title>
    <list>
      <page-url>
        <![CDATA[http://www.booksblog.it/post/6560/confessioni-da-lettore]]>
      </page-url>
      <page-url>
        <![CDATA[http://www.booksblog.it/post/6562/quali-sono-i-vostri-guru-della-lettura]]>
      </page-url>
    </list>
    <generic-processor processImages="true">
      <contents-selector><![CDATA[div.articolo]]></contents-selector>
      <title-selector><![CDATA[h1]]></title-selector>
      <sub-title-selector><![CDATA[]]></sub-title-selector>
      <meta-info-selector><![CDATA[small]]></meta-info-selector>
      <body-paragraphs-selector><![CDATA[div.contenuto > p]]></body-paragraphs-selector>
      <comment-selector><![CDATA[li[id^=comment-]]]></comment-selector>
      <comment-author-selector>
        <![CDATA[div.comment_head_left > small]]>
      </comment-author-selector>
      <comment-meta-info-selector>
        <![CDATA[div.comment_head_left > h4]]>
      </comment-meta-info-selector>
      <comment-body-paragraphs-selector>
        <![CDATA[div.comment_text]]>
      </comment-body-paragraphs-selector>
    </generic-processor>
  </list-provider>
  <feed-provider id="NazioneIndiana" maxEntries="5">
    <provider-title><![CDATA[[BLOG] Nazione Indiana]]></provider-title>
    <feed-url><![CDATA[http://feeds2.feedburner.com/NazioneIndiana]]></feed-url>
    <processor processImages="true" className="org.bert.ebooks.processors.NazioneIndiana"/>
  </feed-provider>
</ripping-session>
This is an exhaustive example of BERT's XML config file.
The epub-* tags set the EPUB metadata (title and language) and the output filename.
The net-proxy-* tags are only needed if you want to use BERT behind a network proxy. Otherwise, remove those lines or comment them out: if you are not behind a proxy and you leave those tags in place, the ripping session will be very slow.
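As a sketch of the advice above, a session that is not behind a proxy could keep the net-proxy-* tags in the file but wrap them in a single XML comment. The values are the placeholder ones from the example; the provider definitions are left out for brevity.

```xml
<ripping-session>
  <epub-title><![CDATA[File epub title]]></epub-title>
  <epub-language>it</epub-language>
  <epub-filename>BERTFilename.epub</epub-filename>
  <!-- Not behind a proxy: the proxy tags below are disabled
  <net-proxy-host>10.1.1.1</net-proxy-host>
  <net-proxy-port>8080</net-proxy-port>
  <net-proxy-username><![CDATA[proxyuser]]></net-proxy-username>
  <net-proxy-password><![CDATA[proxypwd]]></net-proxy-password>
  -->
  <!-- Provider definitions go here -->
</ripping-session>
```

XML comments cannot be nested, so any explanatory comment that sat next to the proxy tags has to be merged into the enclosing comment or moved outside it.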
Next come the Provider definitions: one or more, as many as you want. This example has two Providers. The first one is a ListProvider, in which you list, under page-url tags, the URLs of the single posts you want to rip (one or more, as many as you wish). The second one is a FeedProvider, in which you specify, under the feed-url tag, the RSS resource from which the post URLs are read. In a FeedProvider you should limit the number of URLs to process (an RSS source can potentially produce thousands of URLs) by setting the maxEntries attribute. Each Provider must have an ID, and it should be unique within a single ripping session. If you repeat the same ID in two (or more) different Providers, the last one defined wins (they are kept in a LinkedHashMap).
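To illustrate the ID rule, here is a sketch of a session that reuses an ID by mistake. The ID "Blog", the titles, the output filename, and the maxEntries values are hypothetical; the feed URL and the processor class are the ones from the example. Going by the last-defined-wins behaviour described above, only the second Provider would be expected to take part in the ripping session.

```xml
<ripping-session>
  <epub-title><![CDATA[Duplicate ID example]]></epub-title>
  <epub-language>it</epub-language>
  <epub-filename>DuplicateId.epub</epub-filename>

  <!-- First definition with id="Blog": expected to be replaced -->
  <feed-provider id="Blog" maxEntries="5">
    <provider-title><![CDATA[First definition]]></provider-title>
    <feed-url><![CDATA[http://feeds2.feedburner.com/NazioneIndiana]]></feed-url>
    <processor processImages="true" className="org.bert.ebooks.processors.NazioneIndiana"/>
  </feed-provider>

  <!-- Second definition with the same id: the last one defined wins -->
  <feed-provider id="Blog" maxEntries="10">
    <provider-title><![CDATA[Second definition]]></provider-title>
    <feed-url><![CDATA[http://feeds2.feedburner.com/NazioneIndiana]]></feed-url>
    <processor processImages="true" className="org.bert.ebooks.processors.NazioneIndiana"/>
  </feed-provider>
</ripping-session>
```

Giving each Provider a distinct ID (for example "BlogShort" and "BlogLong") avoids the silent replacement.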
Each Provider has its own Processor. In our example, the NazioneIndiana Provider has its own Java class implementing org.bert.ebooks.BlogEntryProcessor: all the processing logic is encapsulated in the class org.bert.ebooks.processors.NazioneIndiana. In the Booksblog Provider, the Processor is an instance of org.bert.ebooks.processors.GenericProcessor (see the detail section), which is less powerful but much easier to set up: no Java know-how is required.
A CDATA section is needed for those tags whose content could contain special XML characters (for example, the "&" in a URL).
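As a small illustration of why CDATA helps, the fragment below writes the same URL twice: once inside a CDATA section, and once as plain element text with the ampersand escaped. The URL on example.org is hypothetical. In standard XML both forms yield the same text content, so this is a sketch of the general rule rather than a statement about how BERT reads its config.

```xml
<list>
  <!-- Inside CDATA the "&" can be written literally -->
  <page-url><![CDATA[http://www.example.org/post?id=1&lang=it]]></page-url>
  <!-- Without CDATA the "&" has to be escaped as an entity -->
  <page-url>http://www.example.org/post?id=1&amp;lang=it</page-url>
</list>
```

A bare "&" outside a CDATA section makes the file ill-formed, which is the reason the example config wraps every URL and selector in CDATA.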