The thread archive format can become a universal standard for all 4chan archivers and drive the creation and improvement of both personal and public archives.

Standard 4chan Thread Archive Format

Even though every archive file contains only one thread from one board on one site, it must still follow this folder structure, which makes archives easy to store once extracted and easy to zip back up:

  • <website-name> - The common name of the website.
    • <board-url-acronym> - On Futaba-style imageboards, each board has a URL acronym, such as /b/ or /r9k/.
      • <thread-id> - The folder that actually contains the archive in question. It is identified and dated by its thread ID (e.g. 172839).
        • Contents of the archive go here
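
As a minimal sketch of the folder structure above, the helper below builds the path of a thread folder from its three components. The function name `thread_dir` and the `root` argument are hypothetical and only serve the illustration; the standard itself prescribes only the nesting.

```python
from pathlib import Path


def thread_dir(root, website, board, thread_id):
    """Return <root>/<website-name>/<board-url-acronym>/<thread-id>.

    `root` is an assumed extraction directory, not part of the standard.
    """
    board = board.strip("/")  # accept "/b/" as well as "b"
    return Path(root) / website / board / str(thread_id)


if __name__ == "__main__":
    # Example values only: site "4chan", board /b/, thread 172839.
    print(thread_dir("archive", "4chan", "/b/", 172839))
```

Stripping the slashes from the board acronym keeps the folder name valid on every platform while still letting callers pass the familiar `/b/` form.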

Folder Design (Chart)

Here is an example in practice, for a thread with the URL :

- /             # root of static website/archive
|-- /b/
|-- /b/list.json    # "database" of current threads archived in the site. 
|-- /b/list.html
|-- /b/style.css    # custom section style for all threads. This is copied into every thread folder.
|-- /b/28392029
    |-- /b/28392029/images/
    |-- /b/28392029/thumbs/
    |-- /b/28392029/style.css   # An Exact Copy of the `style.css` file in the folder above.
    |-- /b/28392029/thread.json
    |-- /b/28392029/thread.html
    |-- /b/28392029/metadata.json   # contains useful metadata about the thread.

Here is the standard format for the contents of the archive itself:

  • thumbs/ - A folder containing the thumbnails from the 4chan thread.
  • images/ (optional) - A folder containing the full size images from the 4chan thread.
  • <thread-id>.json - The full thread in JSON format, generated by the 4chan API. Perfect for mobile viewing apps or import/export by databases.
  • external_links.txt - A file that lists every web link referenced in the thread. Useful for creating WebCite snapshots to prevent link rot.
  • <thread-id>.html - Contains an HTML-viewable version of the 4chan thread. (We currently just slurp down the entire HTML from 4chan, but one day we will create a JSON-to-HTML converter for cleaner and smaller dumps.)
  • css/ - Contains any CSS that the HTML file needs. (Will be deprecated and embedded in the HTML once we create a JSON-to-HTML converter.)
  • metadata.json - Contains metadata in JSON format about the thread, such as titles, tags, or comments. Very useful for large thread archives.
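
To illustrate how external_links.txt could be derived from the thread JSON, here is a rough sketch. It assumes the JSON has a top-level `posts` list whose entries carry the comment HTML in a `com` field, and that long URLs may be broken up by `<wbr>` tags. The function name `external_links` and the URL pattern are illustrative choices, not part of the standard.

```python
import html
import json
import re

# Deliberately simple URL pattern; an assumption, not a full URL grammar.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def external_links(thread_json_text):
    """Return the unique http(s) links found in a thread, in first-seen order."""
    thread = json.loads(thread_json_text)
    links = []
    for post in thread.get("posts", []):
        comment = post.get("com", "")          # posts without text have no "com"
        comment = comment.replace("<wbr>", "")  # rejoin URLs split by <wbr>
        comment = re.sub(r"<[^>]+>", " ", comment)  # drop remaining tags
        comment = html.unescape(comment)        # "&amp;" -> "&"
        for url in URL_RE.findall(comment):
            if url not in links:
                links.append(url)
    return links
```

Writing the result to disk is then a matter of joining the list with newlines and saving it as external_links.txt inside the thread folder.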

Converting Existing Legacy 4chan Archives

Before the invention of this script, the 4chan API, and the standard, 4chan anons typically used the right-click Save As (Webpage, Complete) function in Firefox or IE. This would save only the thumbnails and a snapshot of the HTML thread itself.

However, this method had serious drawbacks. It did not save the full-size images, and the threads were stored as raw HTML that is difficult to work with.

While we can't restore the lost full-size images, we can reorganize the files and folders to fit the standard, and retroactively convert the HTML into the new 4chan API JSON format.

This way, we can convert these legacy archives into the new universal standard.
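
The reorganization step could look roughly like the sketch below. It assumes the legacy save consists of `<name>.html` plus a sibling `<name>_files` folder, treats every image in that folder as a thumbnail, and copies rather than moves so the original save stays intact. The function name `convert_legacy_save` and its arguments are hypothetical. It does not rewrite links inside the HTML and does not perform the HTML-to-JSON conversion.

```python
import shutil
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif"}


def convert_legacy_save(html_file, thread_id, dest_root):
    """Copy a legacy "Webpage, Complete" save into a <thread-id> folder."""
    html_file = Path(html_file)
    # Assumed naming of the companion folder: "<name>_files".
    files_dir = html_file.with_name(html_file.stem + "_files")

    thread_dir = Path(dest_root) / str(thread_id)
    thumbs = thread_dir / "thumbs"
    css = thread_dir / "css"
    thumbs.mkdir(parents=True, exist_ok=True)
    css.mkdir(exist_ok=True)

    # The saved page becomes <thread-id>.html.
    shutil.copy2(html_file, thread_dir / f"{thread_id}.html")

    if files_dir.is_dir():
        for item in files_dir.iterdir():
            suffix = item.suffix.lower()
            if suffix in IMAGE_EXTS:
                shutil.copy2(item, thumbs / item.name)  # legacy saves hold thumbnails only
            elif suffix == ".css":
                shutil.copy2(item, css / item.name)
            # Scripts and other leftovers are intentionally not copied.
    return thread_dir
```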

Archive Indexing System (Static HTML for non-database systems)

Here is a simple thread archive indexing system that uses static HTML and does not require any software or database on the HTML host. The index only needs to be regenerated whenever a thread is added.

This static HTML indexing system creates the following new files:

  • index.html
  • index.json
  • saved-threads - Contains the threads, either extracted or still in files

(no changes of any kind are made to the thread archives themselves)
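
A minimal sketch of such an index generator follows. It assumes extracted threads sit directly under saved-threads as `<thread-id>` folders, that metadata.json may carry a `title` key (a hypothetical field name), and that anything in saved-threads that is not a folder is a still-packed archive to be skipped. The function name `build_index` is likewise illustrative. It only reads the thread folders and writes index.json and index.html next to saved-threads.

```python
import html
import json
from pathlib import Path


def build_index(site_root):
    """Regenerate index.json and index.html from the saved-threads folder."""
    site_root = Path(site_root)
    entries = []
    for thread_dir in sorted((site_root / "saved-threads").iterdir()):
        if not thread_dir.is_dir():
            continue  # assumed to be a still-packed archive file; skipped here
        meta_path = thread_dir / "metadata.json"
        meta = {}
        if meta_path.is_file():
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        entries.append({
            "id": thread_dir.name,
            "title": meta.get("title", ""),  # "title" is an assumed key
            "path": f"saved-threads/{thread_dir.name}/",
        })

    (site_root / "index.json").write_text(
        json.dumps(entries, indent=2), encoding="utf-8"
    )

    # One list item per thread; fall back to the thread ID when no title is set.
    rows = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(
            html.escape(e["path"], quote=True),
            html.escape(e["title"] or e["id"]),
        )
        for e in entries
    )
    page = "<!DOCTYPE html>\n<html><body>\n<ul>\n" + rows + "\n</ul>\n</body></html>\n"
    (site_root / "index.html").write_text(page, encoding="utf-8")
    return entries
```

Running `build_index(".")` from the site root after adding a thread would refresh both index files without touching the archives themselves.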