You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

11 KiB

etherpad changesets

In [1]:
url = "https://pad.xpub.nl/p/swarm02.md/export/etherpad"
In [2]:
from urllib.request import urlopen
import json
In [3]:
data = json.load(urlopen(url))
In [ ]:
 
In [4]:
data['pad:swarm02.md'].keys()
Out[4]:
dict_keys(['atext', 'pool', 'head', 'chatHead', 'publicStatus', 'passwordHash', 'savedRevisions'])

The "head" gives the number of the last revisions.

In [5]:
data['pad:swarm02.md']['head']
Out[5]:
2708

The first revision is number 0 and represents the initial welcome text.

In [6]:
data['pad:swarm02.md:revs:0']
Out[6]:
{'changeset': 'Z:1>6b|5+6b$Welcome to Etherpad!\n\nThis pad text is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents!\n\nGet involved with Etherpad at http://etherpad.org\n',
 'meta': {'author': '',
  'timestamp': 1580310574755,
  'atext': {'text': 'Welcome to Etherpad!\n\nThis pad text is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents!\n\nGet involved with Etherpad at http://etherpad.org\n\n',
   'attribs': '|6+6c'}}}

data['pad:swarm02.md:revs:2708']

In [7]:
data['pad:swarm02.md:revs:2709']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-7-f3133eb2f9ee> in <module>
----> 1 data['pad:swarm02.md:revs:2709']

KeyError: 'pad:swarm02.md:revs:2709'
In [ ]:
print(json.dumps(data, indent=2))

EasySync

EasySync is the protocol developed for the etherpad software. The idea was to have a compact representation of a document as series of compact change sets -- ie descriptions of how a particular text was changed / edited -- rather than larger snapshots of the whole text over and over again. The design of the system reflects also its nature as a distributed system, the idea for the compactness and granularity of the changeset was to make it possible to combine the editing of multiple people making edits at possibly the same time. (The algorithm used in the end is called the Operational Transformation.

The design of changesets also reflects Engineering culture -- they are referred to as operations, like the machine operations used to program a computer. EasySync is in fact a sort of computer design where the basic operations manipulate a state-machine that represents a text.

In [ ]:
import re
In [ ]:
# e.g. Z:9kj>1|8=al=o4*1a|1+1$
def changeset_parse (c) :
    changeset_pat = re.compile(r'^Z:([0-9a-z]+)([><])([0-9a-z]+)(.+?)\$')
    op_pat = re.compile(r'(\|([0-9a-z]+)([\+\-\=])([0-9a-z]+))|([\*\+\-\=])([0-9a-z]+)')

    def parse_op (m):
        g = m.groups()
        if g[0]:
            if g[2] == "+":
                op = "insert"
            elif g[2] == "-":
                op = "delete"
            else:
                op = "hold"
            return {
                'raw': m.group(0),
                'op': op,
                'lines': int(g[1], 36),
                'chars': int(g[3], 36)
            }
        elif g[4] == "*":
            return {
                'raw': m.group(0),
                'op': 'attr',
                'index': int(g[5], 36)
            }
        else:
            if g[4] == "+":
                op = "insert"
            elif g[4] == "-":
                op = "delete"
            else:
                op = "hold"
            return {
                'raw': m.group(0),
                'op': op,
                'chars': int(g[5], 36)
            }

    m = changeset_pat.search(c)
    bank = c[m.end():]
    g = m.groups()
    ops_raw = g[3]
    op = None

    ret = {}
    ret['raw'] = c
    ret['source_length'] = int(g[0], 36)
    ret['final_op'] = g[1]
    ret['final_diff'] = int(g[2], 36)
    ret['ops_raw'] = ops_raw
    ret['ops'] = ops = []
    ret['bank'] = bank
    ret['bank_length'] = len(bank)
    for m in op_pat.finditer(ops_raw):
        ops.append(parse_op(m))
    return ret

def perform_changeset_curline (text, c):
    textpos = 0
    curline = 0
    curline_charpos = 0
    curline_insertchars = 0
    bank = c['bank']
    bankpos = 0
    newtext = ''
    current_attributes = []

    # loop through the operations
    # rebuilding the final text
    for op in c['ops']:
        if op['op'] == "attr":
            current_attributes.append(op['index'])
        elif op['op'] == "insert":
            newtextposition = len(newtext)
            insertion_text = bank[bankpos:bankpos+op['chars']]
            newtext += insertion_text
            bankpos += op['chars']
            if 'lines' in op:
                curline += op['lines']
                curline_charpos = 0
            else:
                curline_charpos += op['chars']
                curline_insertchars = op['chars']
            # todo PROCESS attributes
            # NB on insert, the (original/old/previous) textpos does *not* increment...
        elif op['op'] == "delete":
            newtextposition = len(newtext) # is this right?
            # todo PROCESS attributes
            textpos += op['chars']

        elif op['op'] == "hold":
            newtext += text[textpos:textpos+op['chars']]
            textpos += op['chars']
            if 'lines' in op:
                curline += op['lines']
                curline_charpos = 0
            else:
                curline_charpos += op['chars']

    # append rest of old text...
    newtext += text[textpos:]
    return newtext, curline, curline_charpos, curline_insertchars
In [ ]:
changeset_parse(data['pad:swarm02.md:revs:2708']['changeset'])

Questions

But how compact is EasySync really? ... Despite it's claims of conciseness, the last change set for example above represents the typing of a space. The fact that each changeset is represented as a row in database, with a timestamp and other metadata (representing for instance author) adds another layer of information, and this then thousands of times to represent the history of a document.

In our practice at Constant, we need to regularly (maybe one a year) rebuild the etherpad database when it grows to be unmanageaably large (maybe 10 GB). This is due to the fact that the full history of all documents as changesets is much too expansive.

In [ ]: