Paul K space : OpenAI POC Implementation

Implement a proof-of-concept demonstration, for use internally or with a client, showing how they can leverage AI to respond to customer queries using the customer website as the source of data for training the AI.

Using the reference material in the openai-cookbook, two Python applications have been created, with both functions delivered in Docker containers. Evolving the application from its original form has led to the decisions recorded below.

There may be multiple hard-coded values and temporary keys in files while delivering a rapid POC such as this. Once the final design is approved and everything functions as expected, standard secret-stored values and non-development keys will be used.

Requirements

For this POC the Azure subscriptions, access and resources are all deployed via Terraform and GitHub Actions

The Subscription, Service Principal and GitHub Repository are controlled by the EA Subscription Vending and are outside user control
All team members are assigned Owner on the subscription
An OpenAI personal key with a 60 tokens/minute limit was used for configuration but provides little value for testing, for which we require a business API key

ToDo

Delivery Platform

Azure Subscription [ subscription_id c144229f-b1c1-4a8a-9c11-b5c292f3608b]
Access for Team to Azure

Source Code

GitHub Repo in Contino Org [ contino/openai_webcrawler_poc.git ]
Access for Team

Build and Deployment [ GitHub Actions - under development ]
OpenAI API [openai.com]

Org [ available here ]
API Key [ available here ]

Data - any URL

The crawler can take a VERY long time to collect data

Systems Overview / High Level Summary

Take a question from a user and, using pre-parsed data from crawling the source URL, respond with an answer using specific encoding and model settings

There are 4 stages to the operation of the POC, broken down by major function

The WebCrawler

The WebCrawler container launches, accepting the target URL as an input. The WebCrawler starts at the target URL and builds an in-memory map of links to download
Once the list is complete, the WebCrawler loads each page, grabs the content, converts it from HTML to text, and writes it to a file under the folder /text/<target url>
The /text folder is a remote file mount to a storage account, allowing data to be shared with the other processes

sample /text folder entries

.
└── text
    ├── ai.pknw1.co.uk
    │   ├── ai.pknw1.co.uk_._cv.txt.txt
    │   └── ai.pknw1.co.uk_.txt
    └── spacex.com
        ├── spacex.com_launches_transporter-1-mission.txt
        ├── spacex.com_launches_transporter-2-mission.txt
        ├── spacex.com_launches_transporter-3-mission.txt
        ├── spacex.com_launches_transporter-4.txt
        ├── spacex.com_launches_transporter-5.txt
        ├── spacex.com_launches_turksat-5a-mission.txt
        ├── spacex.com_launches_turksat-5b.txt
        ├── spacex.com_launches_ussf-44.txt
        ├── spacex.com_legal.txt
        ├── spacex.com_mission.txt
        ├── spacex.com_rideshare.txt
        ├── spacex.com_space-track.org.txt
        ├── spacex.com_starshield.txt
        ├── spacex.com_supplier.txt
        ├── spacex.com_updates.txt
        ├── spacex.com_updates_#sustained-lunar-exploration.txt
        ├── spacex.com_updates_crew-2-mission.txt
        ├── spacex.com_updates_crew-3.txt
        ├── spacex.com_updates_crs-21-mission-launch.txt
        ├── spacex.com_updates_dart.txt
        ├── spacex.com_updates_inspiration4.txt
        ├── spacex.com_updates_nasa-certification-11-10-2020.txt
        ├── spacex.com_updates_spaceforce-selection-09-25-2020.txt
        ├── spacex.com_updates_starship-moon-announcement_index.html.txt
        ├── spacex.com_vehicles_dragon.txt
        ├── spacex.com_vehicles_falcon-9.txt
        ├── spacex.com_vehicles_falcon-heavy.txt
        └── spacex.com_vehicles_starship.txt
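
The file names in the sample listing are consistent with dropping the scheme from each page URL and replacing "/" with "_", which is how the openai-cookbook crawler example names its output. A minimal sketch of that mapping, assuming this naming rule holds for the POC; the function name `url_to_text_path` is hypothetical and not taken from the POC code:

```python
from urllib.parse import urlparse


def url_to_text_path(url: str, root: str = "text") -> str:
    """Map a page URL to the file the crawler would write under /text.

    Assumed rule (inferred from the sample listing): drop the scheme,
    replace "/" with "_", append ".txt", and group files by domain.
    """
    domain = urlparse(url).netloc
    # Everything after "scheme://", e.g. "spacex.com/launches/ussf-44"
    without_scheme = url.split("://", 1)[1]
    filename = without_scheme.replace("/", "_") + ".txt"
    return f"{root}/{domain}/{filename}"
```

Under this rule a site's root page ends in a bare "_", which matches entries such as ai.pknw1.co.uk_.txt in the listing.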

The Data Processor

The data processor container reads the raw text from the files under /text and performs a number of cleansing and normalising tasks before formatting the data and storing it in /scraped/scraped.csv

currently has no trigger defined
accepts the URL as input, using that to find and retrieve files
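
The cleansing step can be sketched roughly as follows. This is an illustration only: it assumes the normalisation is mainly whitespace clean-up and that the CSV holds one row per source file. The column names `fname` and `text` and both function names are assumptions borrowed from the style of the openai-cookbook example, not confirmed details of the POC.

```python
import csv
import re
from pathlib import Path


def normalise(text: str) -> str:
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def build_scraped_csv(text_root: str, domain: str, out_csv: str) -> int:
    """Read every .txt file for one domain and write one CSV row per file.

    Returns the number of data rows written. Column names are illustrative.
    """
    source_dir = Path(text_root) / domain
    out_path = Path(out_csv)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    rows = 0
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["fname", "text"])
        for txt_file in sorted(source_dir.glob("*.txt")):
            raw = txt_file.read_text(encoding="utf-8")
            writer.writerow([txt_file.stem, normalise(raw)])
            rows += 1
    return rows
```

A call such as `build_scraped_csv("/text", "spacex.com", "/scraped/scraped.csv")` would process one crawled domain, using the mount points described above.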

The Tokeniser

The tokeniser is the main interface between our data and OpenAI, preparing the data so that it can be used meaningfully

Format and split the data into “tokens”, whose count will often be around 125% of the word count of the source data
Calculate token totals and format the data into a row of /embeddings/embeddings.csv
Use openai.Embedding.create to obtain the embedding vector for each piece of data and store it in /embeddings/embeddings.csv
1. encoding can be set here and must match the UI setting
2. engine can be set here and must match the UI setting
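
To illustrate the split step, the sketch below groups sentences into chunks that stay under a token budget. It is a simplified stand-in, not the POC code: the token count uses the 125%-of-words rule of thumb mentioned above rather than a real tokeniser, the 500-token default is an assumed limit, and the names `estimate_tokens` and `split_into_chunks` are hypothetical. A real encoder can be passed in through `count_tokens`.

```python
import math
import re
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Rough estimate only: about 125% of the word count, rounded up."""
    return math.ceil(len(text.split()) * 1.25)


def split_into_chunks(
    text: str,
    max_tokens: int = 500,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> List[str]:
    """Group whole sentences into chunks of at most max_tokens each.

    A single sentence larger than max_tokens is kept as its own
    oversized chunk rather than being cut mid-sentence.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each resulting chunk would then become one row to embed. Whichever counter is used here should agree with the encoding chosen for the UI.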

Query/User Interface

The UI container has a CLI mode and a WebUI mode, and in either mode performs the same discrete functions

Provide the user interface for the user to pose a query
Prompt for user input
Send the user data via openai.Embedding.create to openai.com and receive a response
Display the response to the user

This implementation uses the OpenAI Embeddings function, but can be adapted to use other functions across a wide range of the available models
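
One common way to answer a question from embeddings, as in the openai-cookbook example, is to embed the question and rank the stored rows by cosine similarity to it. Whether the POC UI does exactly this is an assumption. The sketch below makes no API call and works on vectors already held in memory; `cosine_similarity` and `rank_rows` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_rows(
    question_vec: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    top_n: int = 3,
) -> List[Tuple[float, str]]:
    """Return the top_n (score, text) pairs most similar to the question."""
    scored = [(cosine_similarity(question_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

In such a design the question vector and the stored vectors must come from the same engine, which is why the tokeniser and UI settings have to match.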

The WebCrawler, Processor and Tokeniser are non-interactive processes, and the data crawl alone can be VERY time intensive.

Due to the POC nature of the implementation, the processes, while interdependent, have been split apart to minimise the risk of data loss in the event of an uncaught exception (which can reset the entire process!)

All modules can handle multiple data sets, as the data is segregated by domain. Further iterations may consider:

running the WebCrawler over multiple sources, and updating existing data rather than starting from scratch every time
automatic triggering of the process container when new raw data is available
automatic triggering of the tokenisation process when new scraped data is available
the facility in the UI to select which data source to query (easy quick win)

Revisions

revision 1.0.0 of the high level design

The initial build was a single monolithic code base, deployed from a single container, which lost all the data when an error occurred

The uncaught exceptions led to splitting the apps apart
The 5.7 GB single image was slow, so the build evolved to a common base image on which each component installs only its unique requirements
Containers use local mounts

Original version

revision 1.0.1 of the high level design

With the application split into 4 functions, we can run each instance when required, once the previous step is complete

share data via a storage account mount for each
- /text
- /scraped
- /embeddings

This revision allows more flexibility in swapping out components and makes it possible to add extra steps or processing without breaking the original workflow
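
Because the stages only communicate through the shared folders, "the previous step is complete" can be approximated by checking whether a stage's input folder contains any files. The sketch below shows that idea only; the stage names and the any-file-present rule are assumptions, and it cannot tell a finished crawl from one still in progress.

```python
from pathlib import Path
from typing import List, Union

# (stage name, input folder) pairs in pipeline order.
# The WebCrawler is omitted because its input is a URL, not a folder.
STAGES = [
    ("processor", "text"),
    ("tokeniser", "scraped"),
    ("ui", "embeddings"),
]


def ready_stages(root: Union[str, Path]) -> List[str]:
    """List the stages whose input folder under root holds at least one file."""
    ready = []
    for name, input_dir in STAGES:
        folder = Path(root) / input_dir
        if folder.is_dir() and any(p.is_file() for p in folder.rglob("*")):
            ready.append(name)
    return ready
```

With the mounts at the filesystem root, `ready_stages("/")` would report which containers have data to work on.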

1.0.1

ToDo

Refinement and due diligence on user stories
1. rapid test of Completion
2. Completion/Embedding process differences and switching
3. switching models and code changes
Additional source data ingestion, from files or other sources, to accept PDF etc. and deliver the data into the /files folder as for the WebCrawler
Can we merge embeddings after processing, to update and improve rather than restart?
Setup guide for repository workflows and Terraform/Azure integration using GitHub Actions and Secrets
Automation and triggering of the 3 data collection and parsing processes
Adaptation of the WebCrawler to allow multiple sites to be crawled
Adaptation of the WebUI to allow selection of a pre-embedded data source to query
Other integrations

Decision Tracking

#	Resolved	Reviewed
1	PK
2	PK
3

#	Observation/Issue	Decision
1	The application initially carried out functions in sequence, causing wasted time and data loss when the process terminated before it had started writing to disk	Split the application into major function areas that can be run depending on the availability of data: Web Crawl, Processing, Tokenisation, Main Interface/WebUI
2	Very long installation time to install all the dependencies via pip3	Containerise and create a base image with the common components to use as the base for the other elements of the application
3	Large images produced during Docker build	For each of the functions: deploy a new `.venv`; remove all pip packages; `pip3 install` the packages required by the Python `imports`; run the application and install any missing components; create `requirements.txt` with `pip freeze > requirements.txt`


Project Links

Trello Board
