11 KiB
etherpad changesets¶
url = "https://pad.xpub.nl/p/swarm02.md/export/etherpad"
from urllib.request import urlopen import json
data = json.load(urlopen(url))
data['pad:swarm02.md'].keys()
dict_keys(['atext', 'pool', 'head', 'chatHead', 'publicStatus', 'passwordHash', 'savedRevisions'])
The "head" gives the number of the last revisions.
data['pad:swarm02.md']['head']
2708
The first revision is number 0 and represents the initial welcome text.
data['pad:swarm02.md:revs:0']
{'changeset': 'Z:1>6b|5+6b$Welcome to Etherpad!\n\nThis pad text is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents!\n\nGet involved with Etherpad at http://etherpad.org\n', 'meta': {'author': '', 'timestamp': 1580310574755, 'atext': {'text': 'Welcome to Etherpad!\n\nThis pad text is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents!\n\nGet involved with Etherpad at http://etherpad.org\n\n', 'attribs': '|6+6c'}}}
data['pad:swarm02.md:revs:2708']
data['pad:swarm02.md:revs:2709']
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-7-f3133eb2f9ee> in <module> ----> 1 data['pad:swarm02.md:revs:2709'] KeyError: 'pad:swarm02.md:revs:2709'
print(json.dumps(data, indent=2))
EasySync¶
EasySync is the protocol developed for the etherpad software. The idea was to have a compact representation of a document as series of compact change sets -- ie descriptions of how a particular text was changed / edited -- rather than larger snapshots of the whole text over and over again. The design of the system reflects also its nature as a distributed system, the idea for the compactness and granularity of the changeset was to make it possible to combine the editing of multiple people making edits at possibly the same time. (The algorithm used in the end is called the Operational Transformation.
The design of changesets also reflects Engineering culture -- they are referred to as operations, like the machine operations used to program a computer. EasySync is in fact a sort of computer design where the basic operations manipulate a state-machine that represents a text.
import re
# e.g. Z:9kj>1|8=al=o4*1a|1+1$ def changeset_parse (c) : changeset_pat = re.compile(r'^Z:([0-9a-z]+)([><])([0-9a-z]+)(.+?)\$') op_pat = re.compile(r'(\|([0-9a-z]+)([\+\-\=])([0-9a-z]+))|([\*\+\-\=])([0-9a-z]+)') def parse_op (m): g = m.groups() if g[0]: if g[2] == "+": op = "insert" elif g[2] == "-": op = "delete" else: op = "hold" return { 'raw': m.group(0), 'op': op, 'lines': int(g[1], 36), 'chars': int(g[3], 36) } elif g[4] == "*": return { 'raw': m.group(0), 'op': 'attr', 'index': int(g[5], 36) } else: if g[4] == "+": op = "insert" elif g[4] == "-": op = "delete" else: op = "hold" return { 'raw': m.group(0), 'op': op, 'chars': int(g[5], 36) } m = changeset_pat.search(c) bank = c[m.end():] g = m.groups() ops_raw = g[3] op = None ret = {} ret['raw'] = c ret['source_length'] = int(g[0], 36) ret['final_op'] = g[1] ret['final_diff'] = int(g[2], 36) ret['ops_raw'] = ops_raw ret['ops'] = ops = [] ret['bank'] = bank ret['bank_length'] = len(bank) for m in op_pat.finditer(ops_raw): ops.append(parse_op(m)) return ret def perform_changeset_curline (text, c): textpos = 0 curline = 0 curline_charpos = 0 curline_insertchars = 0 bank = c['bank'] bankpos = 0 newtext = '' current_attributes = [] # loop through the operations # rebuilding the final text for op in c['ops']: if op['op'] == "attr": current_attributes.append(op['index']) elif op['op'] == "insert": newtextposition = len(newtext) insertion_text = bank[bankpos:bankpos+op['chars']] newtext += insertion_text bankpos += op['chars'] if 'lines' in op: curline += op['lines'] curline_charpos = 0 else: curline_charpos += op['chars'] curline_insertchars = op['chars'] # todo PROCESS attributes # NB on insert, the (original/old/previous) textpos does *not* increment... elif op['op'] == "delete": newtextposition = len(newtext) # is this right? # todo PROCESS attributes textpos += op['chars'] elif op['op'] == "hold": newtext += text[textpos:textpos+op['chars']] textpos += op['chars'] if 'lines' in op: curline += op['lines'] curline_charpos = 0 else: curline_charpos += op['chars'] # append rest of old text... newtext += text[textpos:] return newtext, curline, curline_charpos, curline_insertchars
changeset_parse(data['pad:swarm02.md:revs:2708']['changeset'])
Questions¶
But how compact is EasySync really? ... Despite it's claims of conciseness, the last change set for example above represents the typing of a space. The fact that each changeset is represented as a row in database, with a timestamp and other metadata (representing for instance author) adds another layer of information, and this then thousands of times to represent the history of a document.
In our practice at Constant, we need to regularly (maybe one a year) rebuild the etherpad database when it grows to be unmanageaably large (maybe 10 GB). This is due to the fact that the full history of all documents as changesets is much too expansive.