Creation log - BASC-Archiver

This is a long journal and explanation of the process of creating the BA-4chan-thread-archiver script. It provides a rare insight into the process that goes into making even the simplest scripts.

Before making this script, I knew next to nothing about Python (although I did have a background in basic programming and bash scripting), so the entire thing was a

Commenting

Generally, the comments in the source code of a program is it's most important portion. Even experienced programmers may need an explanation why something is done the way it is. It's like annotations on music for musicians.

While I was learning programming, I sometimes tried my hand at reading and commenting source code of certain scripts and programs; this helped to make sure that I actually knew what it was doing, and that I would know that in the future.

Something that annoys me is when I have to work on source code with a general dearth of comments. Apparently, all programmers enter a state where they become fluent in the language and no longer need an explanation in English; however, this doesn't last forever, and definitely causes an impediment for those who may take over the effort.

Setting HTML to local folders

Originally, my plan was to automatically generate HTML from the JSON downloaded. However, perhaps the easier way was to download the HTML that 4chan already generated, and convert it to use the images the script already downloaded.

To do so, these find and replace functions converted absolute links to relative links to the downloaded images.

"// -> "http://
http://images.4chan.org/\w/src/ -> img/
http://\d.thumbs.4chan.org/\w/thumb/ -> thumb/

The function

To perform the regex, I needed a suitable function. Rather than reinvent the wheel, I borrowed code from StackOverflow.

# Regex function originally by steveha on StackOverflow: 
# http://stackoverflow.com/questions/1597649/replace-strings-in-files-by-python
import re
import os
def file_replace(fname, pat, s_after):
# first, see if the pattern is even in the file.
with open(fname) as f:
    if not any(re.search(pat, line) for line in f):
    return # pattern does not occur in file so we are done.

# pattern is in the file, so perform replace operation.
with open(fname) as f:
    out_fname = fname + ".tmp"
    out = open(out_fname, "w")
    for line in f:
    out.write(re.sub(pat, s_after, line))
    out.close()
    os.rename(out_fname, fname)

The Regex Expressions

This is the test file I created to work with the regex, known as testfile.html

<html>"//images.4chan.org/k/src/4238749327.img"</html>
<a="//9.thumbs.4chan.org/k/thumb/"<test>
<"//static.4chan.org/css/jfoajweaf.478.css">
<"//static.4chan.org/css/yuiyiiu.283.css">

So we can now use our regex.

html_path = "testfile.html"
file_replace(html_path, '"//', "http://")
file_replace(html_path, "http://images.4chan.org/\w/src/", "img/")
file_replace(html_path, "http://\d.thumbs.4chan.org/\w/thumb/", "thumb/")

Downloading and linking to a local copy of the CSS

CSS can just stay linked to 4chan, but 4chan tends to change the CSS all the time, so it's a good idea to use a local snapshot for best compatiblity.

In the HTML of threads, they have some weird CSS number in addition, for some reason... It seems to be consistent, so we should have a copy locally.

Convert link to plain file path:

http://static.4chan.org/css/(\w+).\d+.css -> css/\1.css

(FUTURE) Downloading a local copy of the CSS

We need to grab the names of all those css files, and use them to save a copy of the stylesheets locally under css/.

I used this tutorial and this stackoverflow to understand how to grab all matching lines.

Use beautifulsoup to find links with <link href=> tag

Function to find text in file

# Return lines matching a regex, code by Ben Welsh
# http://palewi.re/posts/2008/04/05/python-recipe-open-a-file-read-through-it-print-each-line-matching-a-search-term/
# Notice: `\1` notation could be interpreted by python as `\x01`! Escape it with a second backslash: `\\1`
def file_find(fname, pat)
f = open(fname, "r")

for line in f:
    if re.match(pat, line):
    print line,

Regex for linking to a local copy of the CSS

The below regex used a \1 replace, but it was giving me problems in python. Apparently, since this character was in a string literal, python converted the \1 to a \x01 character before it got to the regex replace function. The recommended way to send this raw string was to use r"\1". However, since the function added an extra layer of abstraction, there was no way to use that.

After hours of searching, the solution was to escape the \1 with a backslash, resulting in \\1. When the string finally reached the replace function, it would be interpreted as \1.

file_replace(html_path, "http://static.4chan.org/css/(\w+).\d+.css", "css/\\1.css")

Pretty Print JSON

By adding sort_keys=True, indent=2, separators=(',', ': ') as arguments to the json.dump() function, we can make the JSON much more readable. It only takes a kilobyte or two extra, so it's worth it.

Make a list of all external links quoted in comments

In the original version of the 4chandownloader script, the script would create a list of all rapidshare download links found in the thread; however, this function was removed when the new author transitioned to 4chan API.

Sometimes the external links form an integral part of the story, or contain links to filesharing services of interest. The user might want to download such files, or save those sites in case they go down.

I created a restored version that searches comments from the 4chan API and grabs all external URLs, rather than 3 filesharing sites. This set of commands is tacked to the image download loop. It then stores the URLs in an external_links.txt file subdivided with newlines, for the user to read or for wget to parse.

This presents a challenge; 4chan generally does not allow certain URL links on their site (to combat spam), so users tend to write URLs in ways that fool the site's URL regex (insert spaces, etc). We need something that will pull in even those links.

On DaringFireball, there is a monster regex that will match nearly any URL. A version for Python can be found on StackOverflow.. We use this one.

Additionally, when rendering to HTML, 4chan often adds <wbr> tags, indicating an optional line break. This tag tends to end up in the middle of URLs, screwing them up. So this statement is added to get rid of them:

# We need to get rid of all <wbr> tags before parsing
cleaned_com = re.sub(r'\<wbr\>', '', post['com'])

Using the py4chan API wrapper

After searching github repositories, I discovered that Edgeworth Euler of ED.ch fame created a python wrapper for the 4chan API. This simplified and replaced some of the messy code for handling JSON that I was using.

I had to upload the py4chan code to PyPI so that the entire library could be installed using pip.

py4chan had a very nice Thread.Files() function that returns all the URLs to images in the thread on 4chan. However, there was no equivalent for thumbnails, so I created a Thread.Thumbs() function to do so, and sent a pull request to the author.

py4chan allowed me to add a 404 checking subroutine, so that the script will stop if the thread is deleted or the user's connection drops.

I still maintain the original script as 4chan-thread-archiver-orig and keep it up-to-date, just in case.

Save to default path if a path is not given.

The original script required users to input a path. I believed that it should set a default path if none is given instead.

Because the argument didn't work otherwise, I changed the argument to use [--path=<string>] rather than <path>. Thanks to docopts, changing the parameters is a very easy and visual process.

The below code checks if the path argument is not given, and substitutes a default folder name.

if (path == None):
  path = os.path.join(os.getcwd() + os.path.sep + _DEFAULT_FOLDER)

The path is composed of the current working directory and the default folder name. Since os.path.join did not add a path seperator for some reason, I added my own with os.path.sep. This variable is used because Windows uses a \ for path seperators, while Linux and Mac use /.

(FUTURE) Moving the functions to a class, and making a GUI

(FUTURE) Metadata file

The metadata file would be in JSON or YAML format, and contain information about the thread useful to a reader.

The script could prompt the user to add a title or quick description

(FUTURE) Abstracted Interface and PyQt GUI

Something like the Chandler.

We would have to convert all the functions into a class, with a GUI and CLI interface script.

(FUTURE) chan.zip file format

For ease of transmission, this script should give the ability to transfer in .chan.zip format. All this does is zip up the folder, just like an .epub; saving a bit of space. This will provide a universal archive format for transferring information between archives, or between anons on 4chan, which may happen a lot.

We don't have to restrict ourselves to .zip; in theory, we can use any archive format that is convenient. However, to reduce incompatibility, we should restrict ourselves to the best formats for the job: .zip and .7z

.zip - PKZIP, the standard archive format for Windows-based systems.
.7z or .lzma or .xz - LZMA, the open-source heavy compression format. It is the king of the archivers; however, it demands a lot more power and time than .zip, and may not be as suitable for mobile devices.
.tar.gz and .tar.bz2 - GNU zip and bzip2, the standard archive formats for Unix-based systems. The fundamental flaw of these formats is that they rely on .tar, an antiquated tape-backup archive format that does not allow straight file deletion. However, they do compress and decompress better than zip, despite generally using the same algorithm.

(FUTURE) Implement useful functions from other CSS/JS mods

Expand image in the current HTML
IQDB/TinEye/Google/SauceNao

(FUTURE) Fix backquotes

Javascript backquotes are broken; add a .html extension.

Orig: g/33013702/33013702#p33013796
Fix: g/33013702/33013702.html#p33013796

(SOLVED) post['com'] key error bug

on these threads (which just happen to be stickies): http://boards.4chan.org/k/res/14013417 http://boards.4chan.org/g/res/33013702

py4chan spits out this error:

Traceback (most recent call last):
  File "4chan-thread-archiver", line 263, in <module>
main(args)
  File "4chan-thread-archiver", line 243, in main
list_external_links(curr_thread, dst_dir)
  File "4chan-thread-archiver", line 152, in list_external_links
if not linkregex.search(reply.Comment):
  File "/usr/lib/python2.7/site-packages/py4chan/__init__.py", line 309, in Comment
return self._data['com'] or None
KeyError: 'com'

My script that doesn't use py4chan spits out this error:

Traceback (most recent call last):
  File "4chan-thread-archiver-orig", line 239, in <module>
main(args)
  File "4chan-thread-archiver-orig", line 218, in main
get_files(dst_dir, board, thread, nothumbs, thumbsonly)
  File "4chan-thread-archiver-orig", line 191, in get_files
list_external_links(json_thread, dst_dir)
  File "4chan-thread-archiver-orig", line 156, in list_external_links
if not linkregex.search(post['com']):
KeyError: 'com'

However, the in the Python interpreter, Post.Comment works fine.

Python 2.7.3 (default, Feb 26 2013, 22:57:37) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import py4chan
>>> k = py4chan.Board("k")
>>> thread = k.getThread(14013417)
>>> thread.Sticky
True
>>> reply = thread.replies[1]
>>> reply.Comment
u"Once you've chosen a frame size, <etc.> anyway."

Even without py4chan, the result is the same.

>>> json_thread = requests.get(FOURCHAN_API_URL % (board, thread))
>>> post = json_thread.json['posts'][0]
>>> post['com']
u'Welcome to /k/. In this thread you will find basic knowledge to get you to know weapons and how to use them safely plus some more detailed guides. All the useful information is gathered in the following infographics.<br><br>Safety is the most important thing so we will start with safety instructions.'

Checking the JSON, every single post has an associated "com" value, so there's no problem in the JSON itself. What is going on?

Finally, I inserted this statement into list_external_links() where the loop checks the comment field for links. Maybe it was choking on a certain post?

for post in json_thread.json['posts']:
print post['com']

The idea is that if the comment field existed, the loop would return the value of the comment all the way until it failed, and I could check that value to see what was up. Right after it printed the last comment, the script failed, and I checked what entry was right after the last working comment.

{
  "ext": ".png",
  "filename": "1302620113762",
  "fsize": 1462132,
  "h": 1225,
  "md5": "wblCqtPMg5D6OdNXVIaN9w==",
  "name": "Anonymous",
  "no": 14013856,
  "now": "01/02/13(Wed)16:43",
  "resto": 14013417,
  "tim": 1357162982849,
  "time": 1357162982,
  "tn_h": 95,
  "tn_w": 125,
  "w": 1600
}

I was right. This bug involved nonexistent "com" fields, which are theoretically optional on the 4chan API; but in practice, it wasn't easy to trick 4chan into posting nothing. As a result, posts without comments were rare enough for me to forget about them.

Now I had to somehow handle nonexistent "com" fields in the program itself. Thanks to handy-dandy StackOverflow, I found that I could use a try/except loop or a special if statement. I already had a loop, so I added the if statement to my list_external_links() for loop.

4chan-thread-archiver-orig

# comments are optional in 4chan API
if "com" not in post:
continue

And for the py4chan based one, I had to modify the library itself; the library was handling optional attributes incorrectly when polled. The pull request is below.

And now it finally dumps correctly!

Pull request: BUG: Optional attributes aren't returning `None` when polled

Certain attributes in the 4chan API are optional and are not passed in the JSON. When an attribute does not exist for a post, the py4chan library should return the None object.

Going by your code, you've tried to do that with this statement:

return self._data['sub'] or None

However, it doesn't return None as expected when the attribute doesn't exist. It instead gives a KeyError that causes scripts to fail in a cryptic manner:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/py4chan/__init__.py", line 305, in Subject
return self._data['sub'] or None
KeyError: 'sub'

Solution B

For some reason, the FileSize attribute uses this statement that returns None correctly:

return self._data.get('fsize', None)

So I simply replaced the faulty statements with this one in this pull request. Now the error disappeared.

Why wasn't this one used in the first place?

Solution A (not used)

Another solution is to check if the attribute exists with a special if statement.

if 'sub' not in self._data:
return None
else:
return self._data['com']

And now the library works fine.