S13-Words-for-the-Future-no.../working-with-datasets/json-making-datasets.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Making (mini) datasets using JSON files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ways of thickening, layering, ..., text."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----------------------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## JSON?\n",
    "\n",
    "JSON is an open standard file format that stores information as **key-value pairs** inside objects.\n",
    "\n",
    "It looks like this:\n",
    "\n",
    "```\n",
    "{\n",
    "    \"name\": \"XPUB1\",\n",
    "    \"date\": \"26-10-2020\",\n",
    "    \"number_of_students\": 10\n",
    "}\n",
    "```\n",
    "\n",
    "In a JSON file, you can store **strings** and **numbers**, but also **lists** and nested objects (the equivalent of a Python **dictionary**).\n",
    "\n",
    "Note, for example, that the number `10` is written without `\"`s around it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dictionaries & JSON files\n",
    "\n",
    "Python uses a **dictionary** to store data in the same **key, value paired** way. \n",
    "\n",
    "`dataset = {}`\n",
    "\n",
    "We will use a dictionary to store the data, and then save it as a JSON file.\n",
    "\n",
    "In this way, we can *add* one (or multiple) layer(s) of *value* to a word, such as:\n",
    "\n",
    "```\n",
    "{\n",
    "    \"common\": \"English\",\n",
    "    \"language\": \"communication system\",\n",
    "    \"formal\": 6,\n",
    "    \"semantic\": [\"language\", \"formal\", \"informat\"],\n",
    "    \"English\": { \n",
    "            \"type\": \"word\",\n",
    "            \"number of letters\": 7\n",
    "        }\n",
    "}\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A small dictionary to start with (one word and its POS tag):\n",
    "dataset = {'are': 'VBP'}\n",
    "# Adding a new key to the dictionary, assigning a string as value:\n",
    "dataset['new'] = 'NEW WORD'\n",
    "# or assigning a number as value (this overwrites the previous value):\n",
    "dataset['new'] = 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Printing the tag of a word in the dictionary\n",
    "print(dataset['are'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Printing all the keys in the dataset\n",
    "print(dataset.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making a sample dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is sample data, a list of words and POS tags:\n",
    "data = [('Common', 'JJ'), ('languages', 'NNS'), ('like', 'IN'), ('English', 'NNP'), ('are', 'VBP'), ('both', 'DT'), ('formal', 'JJ'), ('and', 'CC'), ('semantic', 'JJ'), (';', ':'), ('although', 'IN'), ('their', 'PRP$'), ('scope', 'NN'), ('extends', 'VBZ'), ('beyond', 'IN'), ('the', 'DT'), ('formal', 'JJ'), (',', ','), ('anything', 'NN'), ('that', 'WDT'), ('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('a', 'DT'), ('computer', 'NN'), ('control', 'NN'), ('language', 'NN'), ('can', 'MD'), ('also', 'RB'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('common', 'JJ'), ('language', 'NN'), ('.', '.')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'are': 'VBP', 'extends': 'VBZ', 'be': 'VB', 'expressed': 'VBN'}\n"
     ]
    }
   ],
   "source": [
    "# Making a dataset with only verbs\n",
    "dataset = {}\n",
    "\n",
    "for word, tag in data:\n",
    "    if 'VB' in tag:\n",
    "        dataset[word] = tag\n",
    "\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving as JSON file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"are\": \"VBP\",\n",
      "    \"extends\": \"VBZ\",\n",
      "    \"be\": \"VB\",\n",
      "    \"expressed\": \"VBN\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "out = json.dumps(dataset, indent=4)\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('json-dataset.json', 'w') as f:\n",
    "    f.write(out)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
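
The steps in the notebook above (filtering words, adding a layer of information, saving as JSON) can be sketched in one short script. This is only a minimal sketch under a few assumptions: the word list is shortened, the nested `number of letters` field is a hypothetical extra layer, and `json.dump` / `json.load` are used in place of the notebook's `json.dumps` plus `f.write`, so that the saved file can also be read back and checked.

```python
import json

# Sample (word, POS tag) pairs, a shortened version of the list in the notebook
data = [('languages', 'NNS'), ('are', 'VBP'), ('formal', 'JJ'),
        ('extends', 'VBZ'), ('be', 'VB')]

# Keep only the verbs: their tags all start with 'VB'
dataset = {word: tag for word, tag in data if tag.startswith('VB')}

# Add a nested layer of information to one word (hypothetical extra field)
dataset['extends'] = {'tag': 'VBZ', 'number of letters': len('extends')}

# Write the dictionary to a file; json.dump converts it to JSON text
with open('json-dataset.json', 'w') as f:
    json.dump(dataset, f, indent=4)

# Read the file back into a Python dictionary
with open('json-dataset.json') as f:
    loaded = json.load(f)

print(loaded)
```

Reading the file back is a quick way to confirm that everything stored in the dictionary survives the trip through JSON. Strings, numbers, lists and nested dictionaries do; Python tuples, for instance, would come back as lists.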