{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# etherpad changesets"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://pad.xpub.nl/p/swarm02.md/export/etherpad\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from urllib.request import urlopen\n",
"import json"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data = json.load(urlopen(url))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['atext', 'pool', 'head', 'chatHead', 'publicStatus', 'passwordHash', 'savedRevisions'])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['pad:swarm02.md'].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"head\" gives the number of the latest revision."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2708"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['pad:swarm02.md']['head']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first revision is number 0 and represents the initial welcome text."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'changeset': 'Z:1>6b|5+6b$Welcome to Etherpad!\\n\\nThis pad text is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents!\\n\\nGet involved with Etherpad at http://etherpad.org\\n',\n",
" 'meta': {'author': '',\n",
" 'timestamp': 1580310574755,\n",
" 'atext': {'text': 'Welcome to Etherpad!\\n\\nThis pad text is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents!\\n\\nGet involved with Etherpad at http://etherpad.org\\n\\n',\n",
" 'attribs': '|6+6c'}}}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['pad:swarm02.md:revs:0']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
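"The last revision is stored under `data['pad:swarm02.md:revs:2708']`, the key that matches `head`. Asking for the next revision number raises a `KeyError`."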
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "'pad:swarm02.md:revs:2709'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-7-f3133eb2f9ee>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'pad:swarm02.md:revs:2709'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m: 'pad:swarm02.md:revs:2709'"
]
}
],
"source": [
"data['pad:swarm02.md:revs:2709']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(json.dumps(data, indent=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## EasySync"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"EasySync is the *protocol* developed for the etherpad software. The idea was to have a compact representation of a document as a series of compact *changesets* -- i.e. descriptions of how a particular text was changed / edited -- rather than larger snapshots of the whole text over and over again. The design also reflects its nature as a distributed system: changesets are compact and fine-grained so that the edits of multiple people, possibly made at the same time, can be combined. (The algorithm used in the end is called *Operational Transformation*.)\n",
"\n",
"The design of changesets also reflects engineering culture -- they are referred to as operations, like the machine operations used to program a computer. EasySync is in fact a sort of computer design in which the basic operations manipulate a state machine that represents a text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# e.g. Z:9kj>1|8=al=o4*1a|1+1$\n",
"def changeset_parse (c) :\n",
"    # header is Z:<source length><sign><length change><ops>$<char bank>; all numbers are base 36\n",
"    # (.*?) rather than (.+?) so that a changeset with no ops still parses\n",
"    changeset_pat = re.compile(r'^Z:([0-9a-z]+)([><])([0-9a-z]+)(.*?)\\$')\n",
"    # an op is |lines<+-=>chars (spans newlines), *index (attribute) or <+-=>chars\n",
"    op_pat = re.compile(r'(\\|([0-9a-z]+)([\\+\\-\\=])([0-9a-z]+))|([\\*\\+\\-\\=])([0-9a-z]+)')\n",
"\n",
"    def parse_op (m):\n",
"        g = m.groups()\n",
"        if g[0]:\n",
"            if g[2] == \"+\":\n",
"                op = \"insert\"\n",
"            elif g[2] == \"-\":\n",
"                op = \"delete\"\n",
"            else:\n",
"                op = \"hold\"\n",
"            return {\n",
"                'raw': m.group(0),\n",
"                'op': op,\n",
"                'lines': int(g[1], 36),\n",
"                'chars': int(g[3], 36)\n",
"            }\n",
"        elif g[4] == \"*\":\n",
"            return {\n",
"                'raw': m.group(0),\n",
"                'op': 'attr',\n",
"                'index': int(g[5], 36)\n",
"            }\n",
"        else:\n",
"            if g[4] == \"+\":\n",
"                op = \"insert\"\n",
"            elif g[4] == \"-\":\n",
"                op = \"delete\"\n",
"            else:\n",
"                op = \"hold\"\n",
"            return {\n",
"                'raw': m.group(0),\n",
"                'op': op,\n",
"                'chars': int(g[5], 36)\n",
"            }\n",
"\n",
"    m = changeset_pat.search(c)\n",
"    if m is None:\n",
"        raise ValueError('not a changeset: %r' % (c,))\n",
"    # everything after the $ is the bank of characters used by the insert ops\n",
"    bank = c[m.end():]\n",
"    g = m.groups()\n",
"    ops_raw = g[3]\n",
"\n",
"    ret = {}\n",
"    ret['raw'] = c\n",
"    ret['source_length'] = int(g[0], 36)\n",
"    ret['final_op'] = g[1]\n",
"    ret['final_diff'] = int(g[2], 36)\n",
"    ret['ops_raw'] = ops_raw\n",
"    ret['ops'] = ops = []\n",
"    ret['bank'] = bank\n",
"    ret['bank_length'] = len(bank)\n",
"    for m in op_pat.finditer(ops_raw):\n",
"        ops.append(parse_op(m))\n",
"    return ret\n",
"\n",
"def perform_changeset_curline (text, c):\n",
"    textpos = 0\n",
"    curline = 0\n",
"    curline_charpos = 0\n",
"    curline_insertchars = 0\n",
"    bank = c['bank']\n",
"    bankpos = 0\n",
"    newtext = ''\n",
"    current_attributes = []\n",
"\n",
"    # loop through the operations\n",
"    # rebuilding the final text\n",
"    for op in c['ops']:\n",
"        if op['op'] == \"attr\":\n",
"            current_attributes.append(op['index'])\n",
"        elif op['op'] == \"insert\":\n",
"            # inserted characters come from the bank, not from the old text\n",
"            insertion_text = bank[bankpos:bankpos+op['chars']]\n",
"            newtext += insertion_text\n",
"            bankpos += op['chars']\n",
"            if 'lines' in op:\n",
"                curline += op['lines']\n",
"                curline_charpos = 0\n",
"            else:\n",
"                curline_charpos += op['chars']\n",
"                curline_insertchars = op['chars']\n",
"            # todo PROCESS attributes\n",
"            # NB on insert, the (original/old/previous) textpos does *not* increment...\n",
"        elif op['op'] == \"delete\":\n",
"            # todo PROCESS attributes\n",
"            # a delete skips over old text without copying it to the new text\n",
"            textpos += op['chars']\n",
"\n",
"        elif op['op'] == \"hold\":\n",
"            newtext += text[textpos:textpos+op['chars']]\n",
"            textpos += op['chars']\n",
"            if 'lines' in op:\n",
"                curline += op['lines']\n",
"                curline_charpos = 0\n",
"            else:\n",
"                curline_charpos += op['chars']\n",
"\n",
"    # append rest of old text...\n",
"    newtext += text[textpos:]\n",
"    return newtext, curline, curline_charpos, curline_insertchars\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"changeset_parse(data['pad:swarm02.md:revs:2708']['changeset'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Questions\n",
"But how compact is EasySync really? ... Despite its claims of conciseness, the last changeset above, for example, represents the typing of a space. The fact that each changeset is stored as a row in a database, with a timestamp and other metadata (for instance the author), adds another layer of information, and this is repeated thousands of times to represent the history of a document.\n",
"\n",
"In our practice at Constant, we need to regularly (maybe once a year) rebuild the etherpad database when it grows unmanageably large (maybe 10 GB). This is because the full history of all documents, stored as changesets, takes up far too much space."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Links\n",
"\n",
"* https://github.com/ether/etherpad-lite/blob/develop/doc/easysync/easysync-full-description.pdf\n",
"* https://diversions.constantvzw.org/wiki/index.php?title=Eventual_Consistency\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}