# Pastebin 3ky2SOp2

## Dokuwiki

Dokuwiki is a simple PHP-based wiki engine, often used by small projects that don't need the overhead of Mediawiki. Wikis can be a challenge for web scrapers, because the many generated pages can trap a crawler in an effectively infinite loop of links. By the end of it, you may have archived a pile of useless or duplicate pages, or nothing useful at all. The better approach is to grab the list of pages from the sitemap, then export each page as Dokuwiki source code. Thankfully, [Dokuwiki has just this function](https://www.dokuwiki.org/export) (if it's not disabled).

1. First, grab the sitemap page and extract all the page links from it. The sitemap can usually be reached by clicking the Sitemap button on the bottom toolbar, which points to `doku.php?do=index`.
2. Next, append the URL argument `&do=export_raw` to the end of each page link, so it looks something like `doku.php?id=shii&do=export_raw`. This is the equivalent of clicking "View Pagesource" on the top toolbar.
   * If you are unable to obtain this page, you'll have to scrape the HTML instead and convert it to Markdown on your own.
3. Now scrape that list of links to obtain `.txt` files containing the Dokuwiki source code of each page. (A script sketch covering steps 1-3 follows this list.)
4. Finally, search the `.txt` files for image links, which are in the format `{{image.jpg?30x30}}`, and scrape those images as well. (A second sketch for this step is shown below.)
   * If you scraped the HTML, you can of course just extract the image links straight from it.
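
Here is a minimal Python sketch of steps 1 through 3. The base URL `https://wiki.example.com/doku.php`, the `pages/` output directory, and the one-second delay are illustrative assumptions, and it assumes default (non-rewritten) `doku.php?id=...` URLs with the sitemap and raw export features enabled:

```python
# Sketch of steps 1-3: sitemap -> page list -> raw Dokuwiki source.
import re
import time
from pathlib import Path
from urllib.parse import urljoin, parse_qs, urlparse

import requests

BASE = "https://wiki.example.com/doku.php"   # hypothetical wiki root
OUT = Path("pages")
OUT.mkdir(exist_ok=True)

# Step 1: fetch the sitemap and pull every page link out of it.
sitemap_html = requests.get(BASE, params={"do": "index"}, timeout=30).text
hrefs = re.findall(r'href="([^"]*doku\.php\?id=[^"]+)"', sitemap_html)

page_ids = set()
for href in hrefs:
    qs = parse_qs(urlparse(urljoin(BASE, href)).query)
    if "id" in qs:
        page_ids.add(qs["id"][0])

# Steps 2-3: request each page with &do=export_raw and save the wiki source.
for page_id in sorted(page_ids):
    resp = requests.get(BASE, params={"id": page_id, "do": "export_raw"}, timeout=30)
    if resp.ok:
        # Namespace separators (":") are not valid in filenames, so replace them.
        (OUT / (page_id.replace(":", "_") + ".txt")).write_text(resp.text, encoding="utf-8")
    time.sleep(1)  # be polite to the server
```

Only links containing `doku.php?id=` are treated as pages, so namespace index links (`?idx=...`) on the sitemap are skipped.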
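
And a sketch of step 4, working from the `.txt` files saved above. The regex, the `media/` directory, and the wiki URL are again assumptions for illustration; on a default install, internal media files are served through `lib/exe/fetch.php?media=...`:

```python
# Sketch of step 4: pull {{...}} media links out of the saved page sources
# and download the referenced files.
import re
from pathlib import Path

import requests

FETCH = "https://wiki.example.com/lib/exe/fetch.php"  # hypothetical wiki root

# Dokuwiki embeds media as {{name.jpg?30x30}} or {{ns:name.png|caption}};
# capture what is inside the braces, then drop any size/caption arguments.
MEDIA_RE = re.compile(r"\{\{\s*([^}|?]+?)\s*(?:\?[^}|]*)?(?:\|[^}]*)?\}\}")

media_names = set()
for txt in Path("pages").glob("*.txt"):
    for name in MEDIA_RE.findall(txt.read_text(encoding="utf-8")):
        # Keep internal media only; skip external images and rss> feeds.
        if "://" not in name and not name.startswith("rss>"):
            media_names.add(name)

Path("media").mkdir(exist_ok=True)
for name in sorted(media_names):
    resp = requests.get(FETCH, params={"media": name}, timeout=30)
    if resp.ok:
        # Namespaced names like wiki:image.jpg contain ":", so replace it
        # to get a valid filename.
        Path("media", name.replace(":", "_")).write_bytes(resp.content)
```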