{ "cells": [ { "cell_type": "markdown", "id": "b774afd4-4a80-48c5-a09a-7623429c7f0a", "metadata": {}, "source": [ "# Mashup()\n" ] }, { "cell_type": "markdown", "id": "626002a1-958c-4a31-bf75-50cd5dfab433", "metadata": {}, "source": [ "**Mashup()** is a function that compares **two similar texts** and produces **a third text** that randomly chooses between the original texts. \n", "Wherever the two texts differ, the new text takes one of the two variants at random.\n", "\n", "A function that: \n", "1. takes **2 similar texts** as input (example: 2 different translations of a poem)\n", "2. finds the **fixed_words** (the words that both texts share) and uses them as the fixed text for the new piece of text\n", "3. puts the results together into **a new piece**, randomly choosing between the differing options \n", "4. produces an html output that **highlights the different random choices** of the translations (planned: the current version only prints the chosen slices)\n" ] }, { "cell_type": "markdown", "id": "b3e00ed0-59e2-4e1c-aa23-fd412fc8524b", "metadata": {}, "source": [ "**how to use**: this function needs two texts with the same number of lines, because it goes through the two texts and compares them line by line. It can also be used with lists of strings." ] }, { "cell_type": "markdown", "id": "04e48da5-eeae-440d-8f8a-5952d6706d81", "metadata": {}, "source": [ "**input:** 2 texts that are similar but not exact copies of each other" ] }, { "cell_type": "markdown", "id": "e8b2ab5c-3b41-4dce-8490-8d7052cdb858", "metadata": {}, "source": [ "**output:** a new text that showcases the differences // a new text made out of random choices // *still not clear yet* // extracted pdf/txt file?" 
] }, { "cell_type": "code", "execution_count": 2, "id": "e6d5f744-d079-4a51-a5c2-6845a7a7ac56", "metadata": {}, "outputs": [], "source": [ "# define texts\n", "\n", "text1= '''\n", "The glasses were empty\n", "The bottle was shattered\n", "The bed was wide open\n", "The door was tight shuttered\n", "Each shard was a star\n", "Of bliss and of beauty\n", "That flashed on the floor\n", "All dusty and dirty\n", "And I was dead drunk\n", "Lit up wildly ablaze\n", "You were drunk and alive\n", "In a naked embrace!\n", "'''\n", "text2= '''\n", "So the glasses were empty\n", "and the bottle broken\n", "And the bed was wide open\n", "and the door closed\n", "And all of the glass stars\n", "of happiness and beauty \n", "were sparkling in the dust\n", "of the poorly dusted room.\n", "And I was dead drunk\n", "And I was a bonfire\n", "And you were alive, drunk,\n", "all naked in my arms.\n", "'''\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "a8f10c45-0ce7-4ef8-9a5f-6a92f69902a2", "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "code", "execution_count": 3, "id": "3c62af9c-b078-4117-b84b-e191048c6a93", "metadata": {}, "outputs": [], "source": [ "import difflib\n", "from random import choice\n", "\n", "def mashup(text1, text2): #takes 2 texts (strings or lists of lines)\n", "\n", "    text1 = text1.splitlines() if isinstance(text1, str) else list(text1) #split texts in lines (a list of strings is used as it is)\n", "    text2 = text2.splitlines() if isinstance(text2, str) else list(text2)\n", "\n", "    fixed_words = [] #fixed_words = words that are the same in both texts // a list of lists of words, one list per line\n", "    for line_A, line_B in zip(text1, text2): #first loop: read line by line from both texts at the same time (=zip)\n", "        words_A = line_A.split() #split lines in lists of words\n", "        words_B = line_B.split()\n", "\n", "        d = difflib.Differ() #Differ compares two sequences and marks every item: ('-' only in text1), ('+' only in text2), (' ' in both = fixed_words), ('?' hint lines that are not words)\n", "        diff = d.compare(words_A, words_B) #compare the two lists of words\n", "\n", "        linelist = [] #empty list for the fixed words of this line\n", "        for result in diff: #second loop: goes through every marked word of the current line\n", "            code, word = result.split(' ', 1) #split result of diff in code ['+', '-', '?' or ''] and the word itself\n", "            word = word.strip() #remove leftover whitespace around the word\n", "            if code == '': #an empty code means that the word can be found in both texts\n", "                linelist.append(word) #if this happens, put the word in the linelist\n", "        fixed_words.append(linelist) #after the inner loop, add linelist to fixed_words (once per line)\n", "\n", "    length = len(fixed_words) #number of line pairs that were compared\n", "    for linenumber in range(length): #go through the lines by their number\n", "        cut_left1 = 0 #the first slice starts at position 0 (the left side of the line) in both texts\n", "        cut_left2 = 0\n", "        search_1 = 0 #position from which the next fixed word is searched\n", "        search_2 = 0\n", "        words_1 = text1[linenumber].split() #split the current line of each text in words\n", "        words_2 = text2[linenumber].split()\n", "        if len(fixed_words[linenumber]) > 0: #if this line has at least one fixed word\n", "            for fixed_word in fixed_words[linenumber]: #go through the fixed words of this line, from left to right\n", "                cut_right1 = words_1.index(fixed_word, search_1) #position of the fixed word, searching only to the right of the previous fixed word\n", "                cut_right2 = words_2.index(fixed_word, search_2) #in both texts\n", "\n", "                slice_1 = words_1[cut_left1 : cut_right1] #slice_1 goes from the previous fixed word (included) up to this one (excluded)\n", "                slice_2 = words_2[cut_left2 : cut_right2]\n", "                print(choice([slice_1, slice_2])) #choose one of the two slices at random\n", "\n", "                cut_left1 = cut_right1 #the next slice starts at this fixed word\n", "                cut_left2 = cut_right2\n", "                search_1 = cut_right1 + 1 #and the next search starts right after it\n", "                search_2 = cut_right2 + 1\n", "\n", "            slice_1 = words_1[cut_left1 :] #from the last fixed_word found to the end of the line\n", "            slice_2 = words_2[cut_left2 :]\n", "            print(choice([slice_1, slice_2])) #choose\n", "        else:\n", "            slice_1 = words_1[cut_left1 :] #no fixed words in this line: the two slices are the two whole lines\n", "            slice_2 = words_2[cut_left2 :]\n", "            print(choice([slice_1, slice_2])) #choose\n", "        print('--------')" ] },
{ "cell_type": "code", "execution_count": 4, "id": "78c7643c-e797-451e-872d-e6dd81f0052c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[]\n", "--------\n", "['So', 'the']\n", "['glasses']\n", "['were']\n", "['empty']\n", "--------\n", "['and', 'the']\n", "['bottle', 'was', 'shattered']\n", "--------\n", "['The']\n", "['bed']\n", "['was']\n", "['wide']\n", "['open']\n", "--------\n", "['and', 'the']\n", "['door', 'was', 'tight', 'shuttered']\n", "--------\n", "['And', 'all', 'of', 'the', 'glass', 'stars']\n", "--------\n", "['of', 'happiness']\n", "['and']\n", "['beauty']\n", "--------\n", "['were', 'sparkling', 'in']\n", "['the', 'dust']\n", "--------\n", "['All', 'dusty', 'and', 'dirty']\n", "--------\n", "[]\n", "['And']\n", "['I']\n", "['was']\n", "['dead']\n", "['drunk']\n", "--------\n", "['And', 'I', 'was', 'a', 'bonfire']\n", "--------\n", "['And', 'you']\n", "['were', 'alive,', 'drunk,']\n", "--------\n", "['In', 'a']\n", "['naked', 'in', 'my', 'arms.']\n", "--------\n" ] } ], "source": [ "mashup(text1,text2)" ] },
{ "cell_type": "code", "execution_count": null, "id": "07f2e386-3380-4dc0-b680-d5b1288e7794", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0b127fbb-6e70-4614-a727-2ec476e43236", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" },
"language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 5 }