# Pastebin S9yyaJSA

Dagobah is a major Flash site in 4chan's history that retains many of the Flash swfs made by /f/ from 2008 to today. While it seems lively now, it could have issues in the future. I'm putting together this scraper to grab any kind of paginated gallery and scrape the tags into a database.

## Macrochan Scraper

The Macrochan Scraper was a set of quick-and-dirty scraper scripts that pull images and their tags off of the site into a database.

## DB Schema for Dagobah

### Gallery Parsing

Parsing the gallery items is harder, because the fields of each item are separated by bare markup rather than wrapped in labeled elements, so index-based scraping will have to be used instead. All of the gallery entries can be found under `div#flashlist`.

```
F 3 anime songs tags: Loop, anime, epilepsy
rating: 2.5 (34)
views: 4729
comments: 4
size: 1.3 MB
date: 26.01.2012
```

If you forget to grab this metadata from the gallery but have all the filenames, you can also grab it from the comments pages.

### Positional Scraping

The technique I use to serialize this poorly designed list is Positional Scraping: each element of an item is processed based solely on its index in the list. Since we know that a robot, not a human, generated this list, it always follows a pattern. Thankfully, the size and format of the list don't change, so we don't need crazy heuristics based on the number of elements.

In the gallery view above, the links are encapsulated by a `div` with the id `flashlist`, which is easy to grab with PyQuery's `div#flashlist` selector. An even better way to get the elements of the flashlist is to select only the links that match `a.flash` (the `flash` class attribute) inside that `div`.

Now we get to Positional Scraping itself. Using PyQuery, I selected all `a.flash` links and filtered out all `span.value` tags so I could see the bare element titles, using an enumerated `for` loop to index all displayed elements.
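Below is a minimal sketch of what that enumeration pass could look like. The page URL follows the format given in the Gallery Parser section later on; the exact loop body and the `first_item` variable are assumptions for illustration, not the original script.

```
from pyquery import PyQuery as pq

# load one gallery page (URL format described in the Gallery Parser section below)
d = pq(url='http://dagobah.net/?page=3&o=n&d=i')

# take the first a.flash item inside div#flashlist and map out its child elements
first_item = d('div#flashlist a.flash').eq(0)
for i, span in enumerate(first_item.children().items()):
    if span.is_('span.value'):
        continue  # skip the hidden value spans, but keep counting them
    print("[%d] %s" % (i, span.text()))
```

A pass like this over a single gallery item prints an index map similar to the following.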
```
[0] 4chan vs reddit
[1] tags:
[3] rating:
[5] (26)
[6] views:
[10] size:
[12] date:
```

From here, we can tell that indices 0 and 5 carry their value directly in the displayed element. The others come in label/value pairs: indices 1+2 hold the tags (the value span is actually named `description`), indices 3+4 hold the rating, indices 6+7 hold the views, and so on. The hidden `span.value` tag at the second index of each pair contains the value for the label that precedes it, so we just grab that span and display its text contents.

That data can also be processed before displaying. For example, the string of comma-separated tags is converted into a Python list using a [list comprehension](http://stackoverflow.com/a/4071407), and the date is converted to ISO format.

```
# index 1+2: tags
if i == 2:
    # extract tags using a list comprehension: http://stackoverflow.com/a/4071407
    tags = [tag.strip() for tag in span.text().split(',')]
    print("tags: %s" % tags)

# index 3+4: rating
if i == 4:
    print("rating: %s" % span.text())

# index 10+11: filesize
if i == 11:
    print("size: %s" % span.text())

# index 12+13: date
if i == 13:
    # date format is DD.MM.YYYY, which is weird; convert to a YYYY-MM-DD datetime object
    date_raw = [int(part) for part in span.text().split('.')]
    date = datetime(date_raw[2], date_raw[1], date_raw[0])
    print("date: %s" % date.date())
```

For indices 0 and 5, which carry their value directly, we can just match the corresponding span classes (`flashname` and `amount`). A bit of text processing with regex is useful for stripping unnecessary characters (such as parentheses).

```
# index 0: flashname
if span.hasClass("flashname"):
    print("title: %s" % span.text())

# index 5: amount
if span.hasClass("amount"):
    # remove parentheses
    print("amount: %s" % re.sub(r'[()]', '', span.text()))
```

Notice that some `span.value` tags hide entire fields: in the gap between the displayed indices 6 and 10, the comment count is tucked away.

```
# index 6+7: # of views
if i == 7:
    print("views: %s" % span.text())

# index 8+9: # of comments
if i == 9:
    print("comments: %s" % span.text())
```

Finally, to serialize the data for other programs to use, we simply store it in a dictionary instead of printing it.
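As an illustration, here is a hedged sketch of that dictionary step, reusing the `i`/`span` loop variables from the snippets above; the key names and type conversions are my own choices, not necessarily the original script's.

```
import re

item = {}
for i, span in enumerate(first_item.children().items()):  # first_item: one a.flash element (see the sketch above)
    if span.hasClass("flashname"):
        item['title'] = span.text()
    elif span.hasClass("amount"):
        item['amount'] = int(re.sub(r'[()]', '', span.text()))  # the parenthesised count, e.g. "(26)"
    elif i == 2:
        item['tags'] = [tag.strip() for tag in span.text().split(',')]
    elif i == 4:
        item['rating'] = float(span.text())
    elif i == 7:
        item['views'] = int(span.text())
```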
### Table `files`

* `filename` (Primary Key) - The filename is always unique, and forms the URL.
  * _Gallery_ - Found in ``
* `id` - Flash ID as displayed in the gallery. It is also a candidate for the primary key, but since the filename is always unique we chose to use that instead, as it's more descriptive.
  * _Gallery_ - Found in ``
* `title` - The title of the flash file.
  * _Gallery_ - Found in `3 anime songs`
  * _Viewer_ - Found in `404`, where `404` is the name
* `rating` - A float value from 1-5.
  * _Gallery_ -
  * _Viewer_ - Found in the rating widget text `Currently 3.39/5`.
* `raters` - Amount of people who actually rated the post.
  * _Gallery_
* `votes` - Amount of user votes reported.
  * Found in `(123 votes)`
* `youtubeid` (optional) - Some newer Dagobah links provide a YouTube embed instead. Get the ID from it.
  * _Viewer_

### Table `tags`

A table listing all possible tags.

* `tagname` (Primary Key) - The name of the tag itself.

### Table `taglink`

A one-to-many linking table should be used to store tags.

* `filename` (Foreign Key) - The filename that the tag will be associated with.
* `tag` (Foreign Key) - The tag that the filename will be associated with.
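Since the Database Manager section below proposes SQLAlchemy, here is a rough sketch of how the three tables above could be declared. The class names, the SQLite filename, and the use of the SQLAlchemy 1.4+ declarative API are assumptions for illustration.

```
from sqlalchemy import Column, String, Integer, Float, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class File(Base):
    __tablename__ = 'files'
    filename = Column(String, primary_key=True)  # always unique, forms the URL
    id = Column(Integer)                         # Flash ID as displayed in the gallery
    title = Column(String)
    rating = Column(Float)
    raters = Column(Integer)
    votes = Column(Integer)
    youtubeid = Column(String, nullable=True)    # only present for newer YouTube-embed entries

class Tag(Base):
    __tablename__ = 'tags'
    tagname = Column(String, primary_key=True)

class TagLink(Base):
    __tablename__ = 'taglink'
    # one row per (filename, tag) pair
    filename = Column(String, ForeignKey('files.filename'), primary_key=True)
    tag = Column(String, ForeignKey('tags.tagname'), primary_key=True)

engine = create_engine('sqlite:///dagobah.db')  # placeholder database filename
Base.metadata.create_all(engine)
```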
### URL Schema

* `http://dagobah.net/flash/404.swf` - File viewing URL.
* `http://dagobah.net/flash/download/404.swf` - File download URL.
* `http://dagobah.net/flashswf/404.swf` - File embed URL.
* `http://dagobah.net/flash/404.swf/comments` - Comment URL.
* `http://dagobah.net/t100/404.swf` - 100x100px thumbnail URL.

## The Depaginator

The proposed **BASC Depaginator** will be an all-purpose, fully configurable tool for archiving paginated image/file galleries. These sites do not lend themselves well to archival with a brute-force scraper: such scrapers often fall into infinite loops or fail to grab everything, and frequently end up with piles of useless data. In addition, these sites often rely on complex display systems that are easy for humans to use but turn into horrific spaghetti for any robot to parse through. Finally, brute-force robots add significant strain to the server.

The Depaginator has several tasks:

1. Create a SQLite database with a schema that fits the task of storing and reporting the metadata we want.
2. Parse the paginated galleries one page at a time. This will usually lead you to the URL of the image/file viewer (which can be stored in the database), though sometimes you may be lucky enough to get some extra metadata.
   * Store the URLs in the database, with the filename as the primary key.
3. Grab all URLs from the database and access each image/file viewer page. From there, grab as much metadata as possible, grab a link to the actual image, and store it all in the database.
   * For items such as tags or categories, a separate linking table with a one-to-many relationship from tag to file may be necessary.
4. Download all images from the website. It might be helpful to generate a txt dump of all file URLs, so they can be grabbed using Wget or grab-site.

### Database Manager

This is also a good chance to learn some SQLAlchemy, which is a better way to work with SQL.

### Gallery Parser

This parser generates a list of all files in the gallery. Generally, it grabs a certain number of image URLs from one page, then jumps to the next page and grabs again, and so on. The user sets the total number of pages on the site; if the site doesn't tell you, you'll have to figure it out by navigating to the last page that exists. On Dagobah, the URL format for page 3 is: `http://dagobah.net/?page=3&o=n&d=i`

### Image Nabber

In this step, we actually go into each Image Viewer URL and extract any metadata from there, especially the actual image URLs. You could also download the images during this step, but it is better to grab all the URLs and metadata first.

### Comment Scraper

Thankfully, Dagobah has a simple URL scheme where every single image at `http://dagobah.net/flash/404.swf` has a comments section at `http://dagobah.net/flash/404.swf/comments`. Thus, we can create a `comments` table with a foreign key `filename` that links each comment to its record in the `files` table.
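A rough sketch of that comment-scraping pass using PyQuery follows; `div.comment` is a placeholder selector (the real comment markup would need to be inspected), and the function names are hypothetical.

```
from pyquery import PyQuery as pq

def comment_url(filename):
    # per the URL schema above: viewer URL + "/comments"
    return "http://dagobah.net/flash/%s/comments" % filename

def scrape_comments(filenames):
    rows = []
    for filename in filenames:
        d = pq(url=comment_url(filename))
        for c in d('div.comment').items():  # placeholder selector
            rows.append({'filename': filename, 'text': c.text()})
    return rows
```

Each returned row can then be inserted into the `comments` table keyed on `filename`.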
### Export to JSON or TXT

The final step is to export all image URLs to a TXT file, or to a JSON file that can hold the image/file viewer URLs as well. This makes it possible to use an external tool such as `wget` for the images, or grab-site, which creates WARCs.

### Scraper

Alternatively, we could build a dedicated downloader that handles the downloading itself. This makes it possible to set specific ranges to download, spreading the archival workload across machines for parallel scraping and diversity of IP ranges.
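To make the TXT export step above concrete, here is a minimal sketch that dumps one download URL per line; the database and output filenames are placeholders, and it assumes the `files` table described earlier.

```
import sqlite3

def export_urls(db_path='dagobah.db', out_path='urls.txt'):
    conn = sqlite3.connect(db_path)
    with open(out_path, 'w') as f:
        for (filename,) in conn.execute("SELECT filename FROM files"):
            # per the URL schema: the direct file download URL
            f.write("http://dagobah.net/flash/download/%s\n" % filename)
    conn.close()
```

The resulting list can then be fed to `wget -i urls.txt`.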