I have come across the topic "Search on Google Cloud Platform" a few times, i.e. people asking about different ways to implement search. I'm not sure whether it's part of some Google Cloud test or exam, but it's an interesting topic, and GCP offers several ways to do it, so I decided to write a series on it. In each article I want to describe one way of implementing it, along with a code explanation and load testing.
The task description goes like this: imagine you run an eshop and you want to implement autocomplete for your product descriptions, so that when users type some words into the search box they get products which contain those words. How would you do it on GCP so that it's scalable, fast, etc.?
Text search by itself is a major topic with many functionalities and approaches, and I don't consider myself an expert at all, so pardon me if I omit something.
To get some realistic data: Best Buy has a GitHub repository which contains data about some 50,000 products. To really simulate a big eshop, I found a dataset on Kaggle, https://www.kaggle.com/c/acm-sf-chapter-hackathon-big/data (it's necessary to register in order to get the product files), which contains over a million products, and that is a number which should be more fun to work with. I wrote a small script to extract the necessary information, since the data is in XML and spread over multiple files. I won't extract all the data, since there are about 70 fields per product.
All code is on GitHub at https://github.com/zdenulo/gcp-search and it's written in Python 3 (some parts, like this one, run only in Python 2).
To implement the functionality, we will need a frontend with an input field to enter the text to be searched, which then displays the results. For this I will use the jQuery autocomplete library, which makes a request to the server with the input query and then displays the results automatically.
Next, we will need a backend server, for which I will use Google App Engine (GAE) (both Standard and Flexible, mostly Flexible), since it's easy to deploy and it scales automatically.
And of course, we will need some storage where the product data used for search will be stored, which is the whole essence of this series. In truth, eshops have a more complex database architecture, but I'm simplifying here because we are interested only in the search functionality. Normally you would have the usual stuff (dozens of properties related to a product) stored somewhere in a database, the stuff that will be searched (the product name) only in the search engine, and a reference between them.
After downloading product_data.tar.gz from the Kaggle website and unpacking it, running the script extract_product_data.py will extract some information from the multiple XML files into one CSV file. There are dozens of fields per product, but I am saving only a few, and perhaps I won't even use all of those. Obviously the product name is the most important one. The CSV file isn't included in the repository since it's ~260 MB big :).
The frontend is simple and straightforward.
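The actual extraction script is in the repository; just to illustrate the idea, a minimal version could look roughly like this (the folder layout and the element names such as product, sku and name are assumptions here, and the real XML has many more fields):

import csv
import glob
import xml.etree.ElementTree as ET

# hypothetical sketch of the extraction step; the real extract_product_data.py
# in the repository handles the actual XML structure and more fields
with open('products.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['sku', 'name'])
    for xml_file in glob.glob('product_data/*.xml'):  # assumed folder layout
        tree = ET.parse(xml_file)
        for product in tree.getroot().iter('product'):  # assumed element name
            sku = product.findtext('sku', default='')
            name = product.findtext('name', default='')
            if name:
                writer.writerow([sku, name])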
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Autocomplete</title>
    <link rel="stylesheet" href="//code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">
    <script src="https://code.jquery.com/jquery-1.12.4.js"></script>
    <script src="https://code.jquery.com/ui/1.12.1/jquery-ui.js"></script>
    <script>
        $(function() {
            $("#search").autocomplete({
                source: function (request, response) {
                    $.ajax({
                        dataType: "json",
                        url: "/search",
                        data: {query: request.term},
                        success: function (data) {
                            response(data);
                        }
                    })
                },
                minLength: 2
            });
        });
    </script>
</head>
<body>
<div class="ui-widget">
    <form>
        <input type="text" id="search" size="55">
    </form>
</div>
</body>
</html>
Basically, as I wrote earlier, I am using the jQuery autocomplete library, which with a few settings automatically makes queries to the /search URL, sending the typed query and rendering the received results.
As I wrote in the beginning, the first service/product I will use is the Search API, which is integrated into Google App Engine Standard.
Here is a high-level overview of the Search API:
There are many interesting search features:
I am using App Engine since it's the only way to use the Search API, but besides that it's lightweight, easy to deploy, and scales up and down automatically.
The code for the web application is in the folder gae_search_api/webapp. As mentioned earlier, the application runs on GAE Standard (Python 2). I will explain the most important parts.
search_base.py contains the class SearchEngine, which wraps all operations related to search.
search_api.py contains the implementation of the SearchEngine class using the Search API; here is the full code.
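The actual base class is in the repository; conceptually it is just a small interface that every search backend in this series implements, roughly along these lines (a sketch, with only the method names taken from the implementation below):

class SearchEngine(object):
    """Base interface which every search backend in this series implements."""

    def search(self, query):
        """Returns a list of matching products for the given query string."""
        raise NotImplementedError

    def insert(self, item):
        """Indexes a single product (a dict with at least 'name' and 'sku')."""
        raise NotImplementedError

    def insert_bulk(self, items):
        """Indexes a list of products in one batch."""
        raise NotImplementedError

    def delete_all(self):
        """Removes all indexed products."""
        raise NotImplementedError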
from search_base import SearchEngine
from google.appengine.api import search


class SearchAPI(SearchEngine):
    """GAE Search API implementation, can be used only within GAE"""

    def __init__(self, client=None):
        self.client = search.Index('products')  # setting Index

    def search(self, query):
        """Making search with Search API and returning results"""
        try:
            search_results = self.client.search(query)
            results = search_results.results
            output = []
            for item in results:
                out = {
                    'value': item.field('name').value,
                    'label': item.field('name').value,
                    'sku': item.field('sku').value
                }
                output.append(out)
        except Exception:
            output = []
        return output

    def insert(self, item):
        """Inserts a document into the Search Index"""
        doc = search.Document(
            fields=[
                search.TextField(name='name', value=item['name']),
                search.TextField(name='sku', value=item['sku']),
            ]
        )
        self.client.put(doc)

    def insert_bulk(self, items):
        """Inserts a list of documents into the Search Index in one call"""
        docs = []
        for item in items:
            doc = search.Document(
                fields=[
                    search.TextField(name='name', value=item['name']),
                    search.TextField(name='sku', value=item['sku']),
                ]
            )
            docs.append(doc)
        self.client.put(docs)

    def delete_all(self):
        """Deletes all documents from the Search Index in batches"""
        while True:
            document_ids = [
                document.doc_id
                for document in self.client.get_range(ids_only=True)]
            # If no IDs were returned, we've deleted everything.
            if not document_ids:
                break
            # Delete the documents for the given IDs
            self.client.delete(document_ids)
There is not much to explain, except that I am inserting only two product fields, name and sku, both stored in the document as TextFields.
The web application (file main.py) is written in Flask and implements a few general URLs for saving product data (because we can use the Search API only within a GAE application), search and delete, and of course it renders the HTML page for autocomplete.
import logging

from flask import Flask, render_template, request
from flask.json import jsonify
from google.appengine.ext import deferred

from search_api import SearchAPI

app = Flask(__name__)

search_client = SearchAPI()


@app.route('/')
def index():
    return render_template('index.html')


@app.route('/search')
def search():
    """based on the user query it executes a search and returns a list of items as json"""
    query = request.args.get('query', '')
    results = search_client.search(query)
    return jsonify(results)


@app.route('/upload', methods=['POST'])
def upload():
    """gets a single product and saves it into the search index"""
    json_data = request.get_json()
    search_client.insert(json_data)
    return 'ok'


@app.route('/upload_bulk', methods=['POST'])
def upload_bulk():
    """gets a list of products and saves them into the search index"""
    json_data = request.get_json()
    logging.info("received {} items".format(len(json_data)))
    search_client.insert_bulk(json_data)
    return 'ok'


@app.route('/delete')
def delete():
    """deletes all items in the search index"""
    deferred.defer(search_client.delete_all)
    return 'ok'
To deploy the GAE web application, you need to have the Cloud SDK installed. Before deploying, you first need to install some libraries locally (they will be uploaded with the application).
In the webapp folder, execute the command:
>pip install -r requirements.txt -t lib
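After that, deployment itself is a single command (the project id is a placeholder here; app.yaml is the standard GAE configuration file in the webapp folder):
>gcloud app deploy app.yaml --project YOUR_PROJECT_ID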
In the folder load_data there is a script, upload.py, which reads data from the CSV file and makes requests to the GAE application. Since I am doing batch imports, there is a limit of 200 documents per indexing request, as well as roughly 250 documents per second, so I am sending 200 products in one request and making a small pause between requests. I don't remember exactly how long it took to upload all the data, but it was something like 3 hours or maybe even more. I guess that's no problem if you only upload it once.
Now, if you have uploaded the application as well as the data, you can try searching on your app's URL:
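The real upload.py is in the repository; in essence it does something along these lines (the URL, CSV column names and pause length are illustrative):

import csv
import time

import requests

UPLOAD_URL = 'https://YOUR_PROJECT_ID.appspot.com/upload_bulk'  # placeholder
BATCH_SIZE = 200  # Search API limit for documents per indexing request

batch = []
with open('products.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        batch.append({'name': row['name'], 'sku': row['sku']})
        if len(batch) == BATCH_SIZE:
            requests.post(UPLOAD_URL, json=batch)
            batch = []
            time.sleep(1)  # stay under the ~250 documents per second limit
if batch:
    requests.post(UPLOAD_URL, json=batch)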
The Search API returns 20 results (the default number) of product names which contain the word "mouse". It supports pagination, i.e. continuing to fetch more results, which could be implemented as an extra feature. This would also be a great case for faceted search, which allows refining search results. Maybe in some other article I will create an example with faceted search.
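For completeness, pagination in the Search API works with cursors; a minimal sketch (not part of the app) could look like this:

from google.appengine.api import search

index = search.Index('products')
# first page: pass an empty cursor so the results carry one back
options = search.QueryOptions(limit=20, cursor=search.Cursor())
results = index.search(search.Query(query_string='mouse', options=options))
# results.cursor can be passed into the next QueryOptions to fetch the next page
next_options = search.QueryOptions(limit=20, cursor=results.cursor)
next_results = index.search(search.Query(query_string='mouse', options=next_options))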
Of course, playing with the webapp as a single user is no problem, but it would be interesting to see how search behaves (responds) with multiple users. That's why I will do distributed load testing using Kubernetes and the load testing framework Locust, based on this article: https://cloud.google.com/solutions/distributed-load-testing-using-kubernetes. The GitHub repository referenced in that article is out of date (Kubernetes version), so I was using this one instead: https://github.com/fawaz-moh/distributed-load-testing-using-kubernetes.
The load-testing folder contains everything needed to set up load testing. This is also a several-step effort, so I'll briefly explain how to set it up. First, we create a Docker image which contains the Locust files for load testing (I'm not going into details). Then we create a Kubernetes cluster on Google Kubernetes Engine, deploy the Docker image, and initiate the load testing, which makes requests and gathers statistics about response times. The step-by-step process is explained in the Readme file in the load-testing folder, so I won't go into details here.
With the Locust framework, I parsed words from the product names and use those to make search queries. The Locust configuration allows setting the hatch rate (number of users added per second) and the final number of users. Every user waits between 1 and 5 seconds between requests.
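The actual locustfile in the repository parses the words from the product data; a simplified sketch of the idea, using the older HttpLocust-style API that was current at the time, could look like this (the word list and class names are illustrative):

import random

from locust import HttpLocust, TaskSet, task

# in the real script these words are parsed from the product names
SEARCH_WORDS = ['mouse', 'keyboard', 'laptop']


class SearchTaskSet(TaskSet):
    @task
    def search(self):
        # pick a random word and hit the /search endpoint like the frontend does
        word = random.choice(SEARCH_WORDS)
        self.client.get('/search', params={'query': word})


class SearchUser(HttpLocust):
    task_set = SearchTaskSet
    min_wait = 1000  # wait between 1 and 5 seconds between requests (ms)
    max_wait = 5000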
The cluster is a default one with 3 nodes of the n1-standard-1 VM type, and I am using preemptible VMs to save money :). This allows running 12 slaves which make requests.
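For reference, creating such a cluster boils down to roughly one command (the cluster name and zone are placeholders):
>gcloud container clusters create locust-cluster --zone europe-west1-b --num-nodes 3 --machine-type n1-standard-1 --preemptible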
So here are some graphs and stats.
This is a graph of the number of requests per second; as displayed at the end, it was around 630 RPS, which is a decent load. The whole load test lasted around 10 minutes.
The average response time varied; you can see that in the beginning it was higher due to new instances being created to serve requests. The growth of the number of users was linear.
The stats are also interesting: out of 286,119 requests, there were only 3 with errors; the median response time was 57 ms and the average 181 ms.
Here is also a screenshot from the GAE dashboard where the number of instances is displayed over time.
And finally, an excerpt from the logs.
The point of this load test was to demonstrate how the Search API scales beyond a single-user load, and together with App Engine it handled it without problems. This playing around cost me ~$16.
A more detailed and thorough description with examples is in the official documentation: https://cloud.google.com/appengine/docs/standard/python/search/.
In conclusion, the Search API has great search capabilities, and with no configuration it's easy to use directly in code. A disadvantage can be (depending on the case) the higher price and the lock-in to GAE Standard.
In the next article we will look at Cloud Datastore and see how we can use it to make text search queries.