Paul K space : OpenAI POC Implementation

Implement a proof of concept demonstration, for use internally or with a client, to show how AI can be leveraged to respond to customer queries using the customer's website as the source of data for the AI

Using the reference material in the openai-cookbook, two Python applications have been created, with both functions delivered in Docker containers. Adapting the application from its original form has led to the decisions recorded below

There may be multiple hard-coded values and temporary keys in files during a rapid-delivery POC such as this. Once the final design is approved and everything functions as expected, standard secret-stored values and non-development keys will be used

Requirements

For this POC the Azure subscriptions, access and resources are all deployed via Terraform and GitHub Actions

  • the Subscription, Service Principal and GitHub Repository are controlled by the EA Subscription Vending and are outside user control

  • All team members are assigned Owner on the subscription

  • an OpenAI personal key with a 60 tokens/minute limit was used for configuration but provides little value for testing - a business API key is required

ToDo


Linked tickets are ticked as completed once they are “functionally complete”

  • subject to changes

  • unfinished but non-blocking items aren't counted if no functional impediment


Systems Overview / High Level Summary

Take a question from a user and, using pre-parsed data from crawling the source URL, respond with an answer using specific encoding and model settings

There are 4 stages to the operation of the POC, broken down by major function

The WebCrawler

  1. the WebCrawler container launches, accepting the target URL as an input. The WebCrawler starts at the target URL and builds an in-memory map of links to download

  2. once the list has been fully updated, the WebCrawler loads each page, extracts the content, converts it from HTML to text, and writes it to a file under the folder /text/<target url>

  3. the /text folder is a remote file mount to a storage account, allowing data sharing with the other processes

sample /text folder entries
.
└── text
    ├── ai.pknw1.co.uk
    │   ├── ai.pknw1.co.uk_._cv.txt.txt
    │   └── ai.pknw1.co.uk_.txt
    └── spacex.com
        ├── spacex.com_launches_transporter-1-mission.txt
        ├── spacex.com_launches_transporter-2-mission.txt
        ├── spacex.com_launches_transporter-3-mission.txt
        ├── spacex.com_launches_transporter-4.txt
        ├── spacex.com_launches_transporter-5.txt
        ├── spacex.com_launches_turksat-5a-mission.txt
        ├── spacex.com_launches_turksat-5b.txt
        ├── spacex.com_launches_ussf-44.txt
        ├── spacex.com_legal.txt
        ├── spacex.com_mission.txt
        ├── spacex.com_rideshare.txt
        ├── spacex.com_space-track.org.txt
        ├── spacex.com_starshield.txt
        ├── spacex.com_supplier.txt
        ├── spacex.com_updates.txt
        ├── spacex.com_updates_#sustained-lunar-exploration.txt
        ├── spacex.com_updates_crew-2-mission.txt
        ├── spacex.com_updates_crew-3.txt
        ├── spacex.com_updates_crs-21-mission-launch.txt
        ├── spacex.com_updates_dart.txt
        ├── spacex.com_updates_inspiration4.txt
        ├── spacex.com_updates_nasa-certification-11-10-2020.txt
        ├── spacex.com_updates_spaceforce-selection-09-25-2020.txt
        ├── spacex.com_updates_starship-moon-announcement_index.html.txt
        ├── spacex.com_vehicles_dragon.txt
        ├── spacex.com_vehicles_falcon-9.txt
        ├── spacex.com_vehicles_falcon-heavy.txt
        └── spacex.com_vehicles_starship.txt
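
The crawl steps and the file naming seen in the sample tree can be sketched as below. This is a minimal illustration, not the POC's actual code: `discover_links`, `text_path_for` and the injected `fetch` callable are hypothetical names, and the naming rule (domain plus path with "/" replaced by "_") is inferred from the sample entries only.

```python
from collections import deque
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover_links(start_url, fetch):
    """Breadth-first walk of same-domain links, returned in discovery order.

    `fetch` is a hypothetical callable: URL in, HTML text out.
    """
    domain = urlparse(start_url).netloc
    seen = [start_url]
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)
            # Stay on the target domain and skip links already queued.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.append(absolute)
                queue.append(absolute)
    return seen


def text_path_for(url, root="/text"):
    """Map a URL to /text/<domain>/<domain>_<path with / as _>.txt (assumed rule)."""
    parsed = urlparse(url)
    flat = (parsed.netloc + parsed.path).replace("/", "_")
    return str(PurePosixPath(root) / parsed.netloc / f"{flat}.txt")
```

Passing `fetch` in as a parameter keeps the link discovery testable without network access; a real run would supply a function that downloads the page.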

The Data Processor

The data processor container reads the raw text from the files under /text and performs a number of cleansing and normalising tasks before formatting the data and storing it in /scraped/scraped.csv

  1. currently has no trigger defined

  2. accepts the URL as input, using that to find and retrieve files
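
A minimal sketch of the processing stage follows, assuming the cleansing is whitespace normalisation and that the CSV has `title` and `text` columns. Both are assumptions; the real cleansing tasks and column names are not listed on this page. `process_domain` and `normalise` are hypothetical names.

```python
import csv
import re
from pathlib import Path


def normalise(text):
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def process_domain(domain, text_root="/text", out_file="/scraped/scraped.csv"):
    """Read every crawled file for one domain and write one CSV row per file."""
    rows = []
    for path in sorted(Path(text_root, domain).glob("*.txt")):
        # Drop the "<domain>_" prefix and turn separators into spaces.
        title = path.stem[len(domain) + 1:].replace("_", " ").replace("-", " ")
        rows.append((title, normalise(path.read_text(encoding="utf-8"))))

    out = Path(out_file)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["title", "text"])  # assumed column names
        writer.writerows(rows)
    return len(rows)
```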

The Tokeniser

The tokeniser is the main interface between our data and OpenAI, preparing the data so that it can be used meaningfully

  1. formats and splits the data into “tokens” - the token count will often be around 125% of the word count of the source data

  2. calculates token totals and formats the data into rows of /embeddings/embeddings.csv

  3. uses openai.Embedding.create to obtain the embedding vectors to associate with the data and stores them in /embeddings/embeddings.csv

    1. encoding can be set here - must match the UI setting

    2. engine can be set here - must match the UI setting
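
The splitting step can be sketched as below, assuming text is split on sentence boundaries and grouped up to a token limit, as the openai-cookbook example does. `split_into_chunks` is a hypothetical name, and `count_tokens` is an injected callable so that the encoding stays configurable; in practice it would typically wrap a tokeniser library, which is not shown here.

```python
def split_into_chunks(text, max_tokens, count_tokens):
    """Group sentences into chunks of at most `max_tokens` tokens.

    `count_tokens` is a hypothetical callable: text in, token count out.
    A single sentence longer than the limit still becomes its own chunk.
    """
    chunks = []
    current = []
    current_tokens = 0
    for sentence in text.split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue
        n = count_tokens(sentence)
        # Flush the running chunk before it would exceed the limit.
        if current and current_tokens + n > max_tokens:
            chunks.append(". ".join(current) + ".")
            current, current_tokens = [], 0
        current.append(sentence.rstrip("."))
        current_tokens += n
    if current:
        chunks.append(". ".join(current) + ".")
    return chunks


# Usage with a crude word-count stand-in for a real tokeniser:
word_count = lambda s: len(s.split())
print(split_into_chunks("One two three. Four five. Six seven eight nine.", 5, word_count))
```

Each resulting chunk would then be sent to openai.Embedding.create and stored alongside its token total.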

Query/User Interface

The UI container has a CLI mode and a WebUI mode, and performs the same discrete functions in either

  1. Provide the user interface for the user to pose a query

  2. Prompt for user input

  3. send the user data via openai.Embedding.create to openai.com and receive a response

  4. Display the response to the user
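
How stored embeddings might be matched to a query is sketched below. It assumes the approach of the openai-cookbook example, where the query vector is compared with each stored vector and the closest rows are used as context; this page does not confirm that the POC does exactly this. `build_context` and the row layout `(text, n_tokens, vector)` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def build_context(question_vector, rows, max_tokens):
    """Pick the closest rows until the token budget is used up.

    `rows` is a list of (text, n_tokens, vector) tuples, an assumed layout.
    """
    ranked = sorted(rows, key=lambda row: cosine_distance(question_vector, row[2]))
    context = []
    used = 0
    for text, n_tokens, _vector in ranked:
        if used + n_tokens > max_tokens:
            break
        context.append(text)
        used += n_tokens
    return context
```

The query vector and the stored vectors must come from the same engine and encoding, which is why those settings have to match between the Tokeniser and the UI.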

This implementation uses the OpenAI Embeddings function, but can be adapted to use other functions across a wide range of the available models

The WebCrawler, Processor and Tokeniser are non-interactive processes, and the data crawl alone can be VERY time intensive.

Due to the POC nature of the implementation, the processes, while interdependent, have been split apart to minimise the risk of data loss in the event of an uncaught exception (which can reset the entire process!)

All modules can handle multiple data sets, as the data is segregated by domain. Further iterations may consider

  • webcrawler process over multiple sources and updating rather than from scratch every time

  • automatic triggering of the process container when new raw data is available

  • automatic triggering of the tokenisation process when new scraped data is available

  • the facility in the UI to select which data source to query (easy quick win)


Revisions

revision 1.0.0 of the high level design

The initial build was a monolithic code base, deployed from a single container, which lost all its data when an error occurred

  • the uncaught exceptions led to splitting the apps apart

  • the 5.7 GB single image was slow, so it evolved into a common base image on which each component installs only its unique requirements

  • containers using local mounts

Original version

revision 1.0.1 of the high level design

With the application split into 4 functions, we can run each instance when required, once the previous step is complete

  • share data via a storage account mount for each

    • /text

    • /scraped

    • /embeddings

This revision allows more flexibility in swapping out components and in adding extra steps or processing without breaking the original workflow

1.0.1
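
Because each stage only depends on the previous stage's output being present in the shared mounts, the next stage to run can be derived from the folder contents. The sketch below is an illustration under that assumption; `next_stage` and the stage names it returns are hypothetical, while the paths follow the mounts listed above.

```python
from pathlib import Path


def next_stage(domain, root="/"):
    """Return the first stage whose input exists but whose output is missing."""
    root = Path(root)
    text_dir = root / "text" / domain
    has_text = text_dir.is_dir() and any(text_dir.glob("*.txt"))
    has_scraped = (root / "scraped" / "scraped.csv").is_file()
    has_embeddings = (root / "embeddings" / "embeddings.csv").is_file()

    if not has_text:
        return "webcrawler"
    if not has_scraped:
        return "processor"
    if not has_embeddings:
        return "tokeniser"
    return "ui"
```

A check like this only looks at whether the files exist, not whether they are up to date, so it would not detect a re-crawl that should trigger reprocessing.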

ToDo

  1. refinement and due diligence on User stories

    1. rapid test of Completion

    2. Completion/Embedding process differences and switching

    3. Switching models and code changes

  2. additional source data ingestion from files or other sources (to accept PDF etc.), delivering the data into the /files folder in the same way as the Web Crawler

  3. can we merge Embeddings after processing - update and improve rather than restart

  4. setup guide for repository workflows and Terraform/Azure integration using GitHub Actions and Secrets

  5. automation and triggering of the 3 data collection and parsing processes

  6. adaptation of web-crawler to allow multiple sites to be crawled

  7. adaptation of web-ui to allow selection of a pre-embedded data source to query

  8. other integrations

Decision Tracking

#    Resolved    Reviewed
1    PK
2    PK
3

1

  • Observation/Issue: the application initially carried out its functions in sequence, causing wasted time and data loss when the process terminated before it had started writing to disk

  • Decision: split the application into major function areas that can be run depending on the availability of data

    • Web Crawl

    • Processing

    • Tokenisation

    • Main Interface/WebUI

2

  • Observation/Issue: very long installation time to install all the dependencies via pip3

  • Decision: containerise and create a base image with the common components to use as the base for the other elements of the application

3

  • Observation/Issue: large images produced during Docker build

  • Decision: for each of the functions

    • deploy a new .venv

    • remove all pip packages

    • pip3 install the required Python imports

      • run the application and install missing components

    • create requirements.txt by running pip freeze > requirements.txt