`, which is easy to grab with pyquery's `div#flashlist`.
But a better method for grabbing the elements of the flashlist is to select only the links inside that `div` that match `a.flash` (anchors with the `flash` class attribute).
Now we get to Positional Scraping. Using PyQuery, I selected all the `a.flash` links, filtered out the `span.value` tags so only the bare element titles remained, and used an enumerated `for` loop to print the index of each displayed element.
```
[0]
4chan vs reddit
[1]
tags:
[3]
rating:
[5]
(26)
[6]
views:
[10]
size:
[12]
date:
```
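Roughly, the loop that produced this listing looks like the following — a minimal sketch, assuming the gallery page is loaded into a PyQuery object and that each `a.flash` link contains the `span` tags enumerated above:
```
from pyquery import PyQuery as pq

# load the gallery page (sketch; could also be built from a saved HTML file)
doc = pq(url="http://dagobah.net/")

for link in doc("div#flashlist a.flash").items():
    # walk every span inside the link; skip the span.value tags so only
    # the bare element titles are shown, along with their position
    for i, span in enumerate(link("span").items()):
        if not span.hasClass("value"):
            print("[%d]" % i)
            print(span.text())
```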
From here, we can tell that index 0 and index 5 carry the value in the element title itself. For the others, index 2 corresponds to `description` (actually, the tags), index 4 corresponds to `rating`, index 7 corresponds to `views`, and so on.
The `span.value` tags at those indices contain the values corresponding to the preceding title element, so we just grab that element and display its text contents.
That data can also be processed before displaying it. For example, the comma-separated string of tags is converted into a Python list using a [list comprehension](http://stackoverflow.com/a/4071407), and the date is converted to ISO format.
```
# inside the enumerate loop over each link's span tags
# (requires `from datetime import datetime` at the top of the script)

# index 1+2: tags
if i == 2:
# extract tags using list comprehension: http://stackoverflow.com/a/4071407
tags = [tag.strip() for tag in span.text().split(',')]
print("tags: %s" % tags)
# index 3+4: rating
if i == 4:
print("rating: %s" % span.text())
# index 10+11: filesize
if i == 11:
print("size: %s" % span.text())
# index 12+13: date
if i == 13:
# date format is DD.MM.YYYY, which is weird. Convert to YYYY-MM-DD datetime object
date_raw = [int(part) for part in span.text().split('.')]
date = datetime(date_raw[2], date_raw[1], date_raw[0])
print("date: %s" % date.date())
```
For indices 0 and 5, which carry the value in the element title itself, we can just match the corresponding `span` tags by class directly. A bit of text processing with regex is useful for removing unnecessary characters (such as parentheses).
```
# (requires `import re` at the top of the script)
# index 0: flashname
if span.hasClass("flashname"):
print("title: %s" % span.text())
# index 5: amount
if span.hasClass("amount"):
# remove parentheses
print("amount: %s" % re.sub(r'[()]', '', span.text()))
```
Notice that the listing above skips from index 6 straight to index 10: the `span` tags hidden in that gap hold the comment count.
```
# index 6+7: # of views
if i == 7:
print("views: %s" % span.text())
# index 8+9: # of comments
if i == 9:
print("comments: %s" % span.text())
```
Finally, to serialize the data for other programs to use, instead of printing it we simply store it in a dictionary.
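A sketch of that last step, reusing the same loop and index checks as above (the dictionary keys are my own naming, and `doc` is the PyQuery object from earlier):
```
import re
from datetime import datetime

flashes = []
for link in doc("div#flashlist a.flash").items():
    flash = {}
    for i, span in enumerate(link("span").items()):
        if span.hasClass("flashname"):
            flash["title"] = span.text()
        elif span.hasClass("amount"):
            flash["amount"] = re.sub(r'[()]', '', span.text())
        elif i == 2:
            flash["tags"] = [tag.strip() for tag in span.text().split(',')]
        elif i == 4:
            flash["rating"] = span.text()
        elif i == 7:
            flash["views"] = span.text()
        elif i == 9:
            flash["comments"] = span.text()
        elif i == 11:
            flash["size"] = span.text()
        elif i == 13:
            parts = [int(part) for part in span.text().split('.')]
            flash["date"] = datetime(parts[2], parts[1], parts[0]).date().isoformat()
    flashes.append(flash)
```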
### Table `files`
* `filename` (Primary Key) - The filename is always unique, and forms the URL.
    * _Gallery_
* `id` - Flash ID as displayed in the gallery. It is also a candidate for the primary key, but since the filename is always unique we chose that instead, as it is more descriptive.
    * _Gallery_
* `title` - The title of the flash file.
    * _Gallery_ - Found in the element containing `3 anime songs`.
    * _Viewer_ - Found in the element containing `404`, where `404` is the name.
* `rating` - A float value from 1-5.
    * _Gallery_
    * _Viewer_ - Found in the element containing `Currently 3.39/5`.
* `raters` - Number of people who actually rated the post.
    * _Gallery_
* `votes` - Number of user votes reported.
    * Found in the element containing `(123 votes)`.
* `youtubeid` (optional) - Some newer Dagobah links provide a YouTube embed instead. Get the ID from it.
    * _Viewer_
### Table `tags`
A table listing all possible tags.
* `tagname` (Primary Key) - The name of the tag itself.
### Table `taglink`
A linking table should be used to associate tags with files; this is a many-to-many relationship, since each file can have many tags and each tag can apply to many files.
* `filename` (Foreign Key) - The filename that the tag will be associated with.
* `tag` (Foreign Key) - The tag that the filename will be associated with.
### URL Schema
* `http://dagobah.net/flash/404.swf` - File viewing URL.
* `http://dagobah.net/flash/download/404.swf` - File download URL.
* `http://dagobah.net/flashswf/404.swf` - File embed URL.
* `http://dagobah.net/flash/404.swf/comments` - Comment URL.
* `http://dagobah.net/t100/404.swf` - 100x100px thumbnail URL.
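Since every one of these URLs derives from the filename, a small helper can build the whole set — a sketch; the dictionary keys are my own naming:
```
def dagobah_urls(filename):
    """Build the URL set for a given flash file, following the schema above."""
    base = "http://dagobah.net"
    return {
        "view": "%s/flash/%s" % (base, filename),
        "download": "%s/flash/download/%s" % (base, filename),
        "embed": "%s/flashswf/%s" % (base, filename),
        "comments": "%s/flash/%s/comments" % (base, filename),
        "thumbnail": "%s/t100/%s" % (base, filename),
    }

# dagobah_urls("404.swf")["download"] -> http://dagobah.net/flash/download/404.swf
```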
## The Depaginator
The proposed **BASC Depaginator** will be an all purpose, fully configurable tool for archiving paginated image/file galleries.
These sites do not lend themselves well to archival with a brute-force scraper: such scrapers often fall into infinite loops or fail to grab everything, and frequently end up with piles of useless data.
In addition, the sites often rely on complex display systems that were made to be easy for humans to use, but turn into horrific spaghetti for any robot to parse through. Finally, these robots add significant strain to the server.
The Depaginator has several tasks:
1. Create a SQLite database with a schema that fits the task of storing and reporting the metadata we want.
2. Parse the paginated galleries one page at a time. This will usually lead you to a URL of the image/file viewer (which can be stored in the database), though sometimes you may be lucky enough to get some extra metadata.
* Store the URLs in the database along with the filename as the primary key.
3. Grab all URLs from the database and access each image/file viewer page. From here, grab as much metadata as possible, grab a link to the actual image, and store it in the database.
* For items such as tags or categories, a separate linking table handling the many-to-many relationship between tags and files may be necessary.
4. Download all images from the website. It might be helpful to generate a txt dump of all file URLs, so they can be grabbed using Wget or grab-site.
### Database Manager
This is also a good chance to learn some SQLAlchemy; it's a better way to work with SQL.
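As a rough sketch, here is the schema above expressed as SQLAlchemy declarative models (the class names and the composite key on `taglink` are my own assumptions):
```
from sqlalchemy import Column, ForeignKey, Float, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class File(Base):
    __tablename__ = 'files'
    filename = Column(String, primary_key=True)
    id = Column(Integer)
    title = Column(String)
    rating = Column(Float)
    raters = Column(Integer)
    votes = Column(Integer)
    youtubeid = Column(String, nullable=True)

class Tag(Base):
    __tablename__ = 'tags'
    tagname = Column(String, primary_key=True)

class TagLink(Base):
    __tablename__ = 'taglink'
    # composite primary key is my assumption; the schema only names the two FKs
    filename = Column(String, ForeignKey('files.filename'), primary_key=True)
    tag = Column(String, ForeignKey('tags.tagname'), primary_key=True)

engine = create_engine('sqlite:///dagobah.db')
Base.metadata.create_all(engine)
```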
### Gallery Parser
This parser generates a list of all files in the gallery. Generally, it grabs a certain number of image URLs from one page, then jumps to the next page and grabs again, and so on.
The user sets the total number of pages on the site. If the site doesn't tell you, you'll have to figure it out by navigating to the last page that exists.
On Dagobah, the format is as follows for page 3: `http://dagobah.net/?page=3&o=n&d=i`
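A sketch of the pagination loop using that URL format (`requests` is my choice for the HTTP fetch here; the PyQuery parsing loop shown earlier would consume each page):
```
import time
import requests

def gallery_pages(total_pages, delay=1.0):
    """Yield the HTML of each gallery page, one page at a time."""
    for page in range(1, total_pages + 1):
        url = "http://dagobah.net/?page=%d&o=n&d=i" % page
        yield requests.get(url).text
        time.sleep(delay)  # be polite to the server
```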
### Image Nabber
In this step, we actually go into each image/file viewer URL and extract any metadata found there, especially the actual image URLs.
You could also download the images in this step, but it is better to grab all the URLs and metadata first.
### Comment Scraper
Thankfully, Dagobah has a simple URL scheme where every single image at `http://dagobah.net/flash/404.swf` has a comments section at `http://dagobah.net/flash/404.swf/comments`.
Thus, we can create a `comments` table with a foreign key `filename` that links each comment to the record in the `files` table.
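In the SQLAlchemy style used above, the `comments` table might look like this — a sketch; every column except the `filename` foreign key is an assumption about what we would store:
```
from sqlalchemy import Column, ForeignKey, Integer, String

class Comment(Base):  # Base as defined in the Database Manager sketch
    __tablename__ = 'comments'
    id = Column(Integer, primary_key=True)    # surrogate key (assumed)
    filename = Column(String, ForeignKey('files.filename'))
    author = Column(String)                   # assumed field
    posted = Column(String)                   # assumed field
    body = Column(String)                     # assumed field
```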
### Export to JSON or TXT
The final step is to export all image URLs to a TXT file, or to a JSON file that can also include the image/file viewer URLs. This makes it possible to use an external tool such as `wget` for the images, or grab-site, which creates WARCs.
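A sketch of that export, assuming the `dagobah.db` database built earlier; it writes one download URL per line for `wget`/grab-site, plus a JSON dump that includes the viewer URLs:
```
import json
import sqlite3

conn = sqlite3.connect('dagobah.db')
filenames = [row[0] for row in conn.execute('SELECT filename FROM files')]

with open('urls.txt', 'w') as f:
    for name in filenames:
        f.write("http://dagobah.net/flash/download/%s\n" % name)

with open('files.json', 'w') as f:
    json.dump([{"filename": name,
                "viewer": "http://dagobah.net/flash/%s" % name}
               for name in filenames], f, indent=2)
```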
### Scraper
Alternatively, we could make a specific downloader which will handle the scraping. This makes it possible to set certain ranges to download, to spread the archival workload across machines for parallel scraping and diversity in IP ranges.