The Future of Database Queries: Evaluating Text-to-SQL and Text-to-NoSQL with AI

Auzair
April 19, 2024
23 min read

Introduction

In the wake of ChatGPT and other large language models (LLMs) gaining prominence, the fascination with Retrieval Augmented Generation (RAG) — essentially conversing directly with your data — has skyrocketed. While the concept of querying databases using natural language is captivating, the practical implementation of such RAG applications poses significant challenges.
This article delves into the exciting realm of using LLMs to transform natural language into SQL and NoSQL queries. Imagine the ease of fetching and filtering data just by typing out your thoughts. However, as straightforward as this may sound, the process is fraught with complexities. LLMs, for all their intelligence, are still prone to errors and can sometimes produce inaccurate or fabricated information.
Despite these challenges, the focus of our discussion will not be on the pitfalls alone but rather on comparing two powerful methods of data interaction: Text-to-SQL versus Text-to-NoSQL. Which approach is more effective? Which offers more accurate results or lower latency? And, when dealing with simpler data formats like CSV or Excel files, which method should you prefer?
Join me as we explore these questions, offering insights and perhaps even answers on how best to leverage these groundbreaking technologies in real-world applications.

Text-to-SQL

The Text-to-SQL process involves providing an LLM with the schema of a database table, sometimes accompanied by an example row, to contextualize the data structure. This setup allows the model to understand what kind of information is stored and how it is organized.
When a user poses a query in natural language, the LLM utilizes this contextual information to generate a corresponding SQL query. This query isn't just a direct transformation of text; it's an intelligent creation that considers the user's intent and the database's architecture.
Here's how it typically works in a practical setting:
1. Input Preparation: The user's natural language query is received.
2. Query Generation: The LLM processes this input, along with the provided schema and potentially an example row, to generate a SQL query.
3. Execution: The SQL query is executed within a database environment, such as SQLite in Python.
4. Output: The result of the query is returned to the user, completing the cycle from natural language to database response.
Below is a Python function that illustrates this process in action:
import json
import os
import sqlite3

import openai


def read_sql_query(user_input: str, table_schema: str, database: str):
    """Generates an SQL query based on user input and retrieves data from an SQLite database.

    :param user_input: User's question or input that guides the query generation.
    :type user_input: str
    :param table_schema: A string representation of the table schema.
    :type table_schema: str
    :param database: The path to the SQLite database file.
    :type database: str
    :return: A list of dictionaries containing data retrieved from the SQLite database, or an error message.
    :rtype: list[dict] or str
    """
    attempts = 0
    connection = None
    while attempts < 3:
        try:
            user_message = user_input

            system_temp = f"""
            You are a SQL expert. Create a syntactically correct SQL query for the user question:

            This is the table schema: "{table_schema}"
            Return the SQL query in JSON.
            The output should be like "SQL_query": "json_object".
            Unless the user specifies in the question a specific number of examples to obtain, limit your query to at most 10 results using the LIMIT clause in the SQL query.
            Always use the field names as they appear in the table. Be careful not to query for fields that do not exist.
            ###{user_message}###
            """

            client = openai.OpenAI(
                api_key=os.environ.get("OPENAI_API_KEY"),
            )
            response = client.chat.completions.create(
                model="gpt-3.5-turbo-1106",
                temperature=0,
                response_format={"type": "json_object"},
                messages=[
                    {"role": "system", "content": system_temp}
                ]
            )

            output_str = response.choices[0].message.content

            # Attempt to parse the output string as JSON; it should contain the 'SQL_query' key
            output_json = json.loads(output_str)
            print(output_json)

            # Connect to the SQLite database
            connection = sqlite3.connect(database)
            cursor = connection.cursor()

            # Execute the generated SQL query
            cursor.execute(output_json['SQL_query'])

            # Fetch all rows from the query result
            rows = cursor.fetchall()

            # Convert rows to a list of dictionaries using column names as keys
            columns = [description[0] for description in cursor.description]
            data = [dict(zip(columns, row)) for row in rows]

            return data

        except sqlite3.OperationalError:
            # The generated SQL was invalid for this database; try again
            print("Operational error, retrying...")
            attempts += 1
        except Exception as e:
            # Handle other exceptions here if necessary
            return str(e)
        finally:
            # Always close the connection if it was opened
            if connection:
                connection.close()
                connection = None

    return "Operational error after multiple retries."
(Note: While there are several methods to extract database schemas, we'll skip that part for brevity in this discussion.)
Now, let's demonstrate the practicality of Text-to-SQL using Movies.csv, a dataset that appears in many examples. To test the functionality, we'll pose a straightforward query to our system.
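For context, here is one way the test setup might be prepared: loading the CSV into a SQLite file with pandas and building a simple schema string from PRAGMA table_info. This is a minimal sketch under those assumptions; the table name Movies, the file database.db, and the schema_dict variable are chosen to match the call below rather than taken from the original setup.

import sqlite3

import pandas as pd

# Load the CSV into a SQLite database file (paths and table name are assumptions).
df = pd.read_csv("Movies.csv")
connection = sqlite3.connect("database.db")
df.to_sql("Movies", connection, if_exists="replace", index=False)

# Build a simple {column_name: column_type} schema mapping from PRAGMA table_info.
cursor = connection.execute("PRAGMA table_info(Movies)")
schema_dict = {row[1]: row[2] for row in cursor.fetchall()}
connection.close()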
Here's the function call that tests the system:
read_sql_query(user_input="What is the top rated movie?", table_schema=schema_dict, database='database.db')

SQL Query Generated by LLM:

{'SQL_query': 'SELECT * FROM Movies ORDER BY "Movie Rating" DESC LIMIT 1'}

Output:

[{
    'Unnamed: 0': 0,
    'Movie Name': 'The Shawshank Redemption',
    'Year of Release': '(1994)',
    'Watch Time': '142 min',
    'Movie Rating': 9.3,
    'Meatscore of movie': '81 ',
    'Votes': '34,709',
    'Gross': '$28.34M',
    'Description': 'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.'
}]
As illustrated, the query successfully identifies 'The Shawshank Redemption' as the top-rated movie according to the dataset.

Text-to-NoSQL

Moving from SQL to NoSQL databases, the approach remains fundamentally similar, yet it adapts to the distinct structure and capabilities of NoSQL systems. In the case of MongoDB, one of the most powerful features is its aggregation pipeline, which allows for complex data processing and transformation directly within the database.
Similar to the Text-to-SQL process, we provide the LLM with an understanding of the database's structure. However, instead of generating SQL queries, the LLM now crafts MongoDB aggregation pipelines based on the user's natural language queries. These pipelines are sequences of data processing stages that transform, filter, and aggregate data efficiently.
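To make this concrete, a hand-written pipeline (not LLM output) might look like the following; the collection and field names (listings, bedrooms, city, price) are purely illustrative:

# An illustrative aggregation pipeline: filter, group, then cap the result size.
pipeline = [
    {"$match": {"bedrooms": {"$gte": 2}}},                          # keep listings with at least 2 bedrooms
    {"$group": {"_id": "$city", "avg_price": {"$avg": "$price"}}},  # average price per city
    {"$limit": 10},                                                 # return at most 10 groups
]
# results = db["listings"].aggregate(pipeline)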

The following is the function for Text-to-NoSQL:

import json
import os

import pymongo
from openai import OpenAI


def read_mongodb_query(user_input: str, table_schema: str, schema_description: str, database: str, collectionName: str):
    """Generates a MongoDB raw aggregation pipeline based on user input and retrieves data from a MongoDB collection.

    :param user_input: User's question or input that guides the query generation.
    :type user_input: str
    :param table_schema: A string representation of the table schema.
    :type table_schema: str
    :param schema_description: A string that describes the schema of the data.
    :type schema_description: str
    :param database: The name of the MongoDB database to query.
    :type database: str
    :param collectionName: The name of the MongoDB collection to query.
    :type collectionName: str
    :return: A list of dictionaries containing data retrieved from the MongoDB collection, or an error message.
    :rtype: list[dict] or str
    """
    max_retries = 3
    retries = 0
    mongo_client = None
    user_message = user_input

    system_temp = f"""
    You are a MongoDB expert. Create a syntactically correct raw aggregation query for the user question:

    This is the table schema: "{table_schema}"
    The following is an example of one MongoDB document: {schema_description}
    Return the raw aggregation pipeline in JSON.
    The output should be like "pipeline": "json_object".
    Unless the user specifies in the question a specific number of examples to obtain, limit your query to at most 10 results using the $limit stage in the aggregation pipeline.
    Always use the field names as they appear in the collection. Be careful not to query for fields that do not exist.
    When dealing with dates, and if the question involves "today", use MongoDB's $currentDate, $dateFromString, or similar operators to work with the current date.
    DO NOT use $trim etc. to convert the price in the raw aggregation; the data is already a double.
    DO NOT use $lookup.
    ###{user_message}###
    """
    while retries < max_retries:
        try:
            client = OpenAI(
                api_key=os.environ.get("OPENAI_API_KEY"),
            )
            response = client.chat.completions.create(
                model="gpt-3.5-turbo-1106",
                response_format={"type": "json_object"},
                messages=[
                    {"role": "system", "content": system_temp}
                ]
            )

            output_str = response.choices[0].message.content

            # Attempt to parse the output string as JSON and check for the 'pipeline' key
            output_json = json.loads(output_str)
            if 'pipeline' not in output_json:
                raise ValueError("Invalid format: 'pipeline' key not found")

            # MongoDB connection parameters
            host = "localhost"
            port = 27017
            database_name = database

            # Connect to MongoDB
            mongo_client = pymongo.MongoClient(host, port)
            db = mongo_client[database_name]
            collection = db[collectionName]

            # Perform the query and retrieve data
            result = collection.aggregate(output_json['pipeline'])
            print(f"(TEST QUERY: {output_json['pipeline']})")

            # Convert the result to a list of dictionaries
            data = [item for item in result]

            # Close the MongoDB connection
            mongo_client.close()

            return data
        except (json.JSONDecodeError, ValueError) as e:
            print(f"Error encountered: {e}. Retrying...")
            retries += 1
            if retries == max_retries:
                # Close the MongoDB connection if it was opened
                if mongo_client:
                    mongo_client.close()
                return "Error: Could not figure out the query after maximum retries."
Here we'll use property_rentals.csv, another commonly available CSV dataset.
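As with the SQLite example, the collection needs to be populated first. Below is a minimal sketch of one way to do that with pandas and pymongo; the database and collection names are placeholders (the database and property_collection variables in the call that follows would hold whatever names you chose).

import pandas as pd
import pymongo

# Load the CSV and insert each row as a document (names are placeholders).
df = pd.read_csv("property_rentals.csv")
mongo_client = pymongo.MongoClient("localhost", 27017)
collection = mongo_client["rentals_db"]["property_rentals"]
collection.insert_many(df.to_dict(orient="records"))
mongo_client.close()

With the data in place, we can call the function: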
read_mongodb_query(user_input="what is the average of the prices", table_schema=schema_dict, schema_description=example_schema, database=database, collectionName=property_collection)
Aggregation pipeline generated by LLM:
[{'$group': {'_id': None, 'average_price': {'$avg': {'$toDouble': '$price'}}}}, {'$limit': 10}]
Output:
{'_id': None, 'average_price': 227.61161116111612}
This example illustrates how the LLM formulates a MongoDB aggregation pipeline.

Comparing Text-to-SQL and Text-to-NoSQL: Performance, Accuracy, and Practicality

Having explored both Text-to-SQL and Text-to-NoSQL techniques, the question arises: which approach is better? The answer varies depending on several factors, including performance, accuracy, and the specific use case.

Latency and Performance

In terms of response time, Text-to-SQL generally demonstrates faster query generation than Text-to-NoSQL with MongoDB's aggregation pipelines. The difference primarily stems from the complexity of NoSQL operations: even basic queries in MongoDB can require multiple stages in an aggregation pipeline, making them slower for the LLM to generate than equivalent SQL queries.
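One rough way to check this for your own data is to time both helper functions. The sketch below assumes the read_sql_query and read_mongodb_query functions and their arguments (schema_dict, example_schema, database, property_collection) from the earlier sections are already defined; in a real comparison you would run many questions against comparable datasets and average the results.

import time

def time_call(fn, *args, **kwargs):
    # Measure the wall-clock time of a single call; returns (seconds, result).
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result

sql_seconds, _ = time_call(read_sql_query, user_input="What is the top rated movie?",
                           table_schema=schema_dict, database="database.db")
nosql_seconds, _ = time_call(read_mongodb_query, user_input="what is the average of the prices",
                             table_schema=schema_dict, schema_description=example_schema,
                             database=database, collectionName=property_collection)
print(f"Text-to-SQL: {sql_seconds:.2f}s, Text-to-NoSQL: {nosql_seconds:.2f}s")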

Accuracy and Reliability

Accuracy is another critical factor where Text-to-SQL tends to outshine Text-to-NoSQL. SQL queries are often more straightforward and less prone to errors, particularly in scenarios involving complex calculations or multiple filtering steps. LLMs handling NoSQL queries can sometimes 'hallucinate' or generate inaccurate data manipulations, especially under complex querying conditions. This makes Text-to-SQL more reliable for precise data retrieval and operations that involve complex statistical calculations.

Practical Considerations

It's essential to consider the database schema complexity. For SQL databases with numerous interconnected tables, fully injecting a complex schema into an LLM can be impractical due to constraints like token limits in the model. This can lead to failures or incomplete schema understanding, which impacts the quality of the generated queries.
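If you want to check ahead of time whether a schema will fit, you can count its tokens before building the prompt. Below is a small sketch using the tiktoken library; the 8,000-token threshold is an arbitrary safety margin chosen for illustration, not a hard limit of the model.

import tiktoken

def schema_token_count(table_schema: str, model: str = "gpt-3.5-turbo") -> int:
    # Count how many tokens the schema string alone would occupy in the prompt.
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(table_schema))

if schema_token_count(str(schema_dict)) > 8000:
    print("Schema is very large; consider including only the relevant tables.")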
On the other hand, when using formats like CSV for testing both SQL and NoSQL methods, the simplicity of the data format does not fully exploit the strengths and capabilities of either approach. While simpler, this does not reflect the performance and utility of these methods in handling more complex, real-world database structures.

Use Case Specificity

The choice between Text-to-SQL and Text-to-NoSQL ultimately depends on the specific use case:
• Text-to-SQL is preferable for environments where precise and accurate retrieval of structured data is critical, especially when dealing with statistical data or when the database schema is highly relational.
• Text-to-NoSQL may be more suitable where the data is hierarchical or loosely structured, or where database operations require more flexible and dynamic data manipulation.

Conclusion

Both Text-to-SQL and Text-to-NoSQL have their respective strengths and weaknesses. The decision on which to use should be guided by the specific requirements of the application, including the need for speed, accuracy, and the complexity of the data and queries involved. As technology and LLM capabilities evolve, these distinctions may blur, but for now, a careful evaluation of each method’s merits and limitations is essential for optimal implementation.
Tags: LLM, Text to SQL, Text to NoSQL, Large Language Models, OpenAI, AI, Queryloop