Why You Can't Trust Your NULLs in Osquery
The pain of empty strings in SQLite
A basic assumption of nearly any database is that the absence of data in a table
is recorded as NULL
. This fundamental principle guides many of the approaches an
individual would take to querying the data within. The behavior of common
operators like COUNT
, ISNULL
, IFNULL
, MAX
, are predicated on the
interpretation of NULL
rows.
…but, if you are using the popular open-source endpoint visibility tool osquery, proceed with caution, I am here to show you the quicksand —
A Brief Background on Osquery
Osquery is an endpoint visibility tool that translates native API calls into SQLite virtual tables, allowing a Mac, Linux, or Windows computer to be queried in real-time as if it were a relational database:
SELECT name, pid, parent FROM processes LIMIT 5;
+----------------+-----+--------+
| name | pid | parent |
+----------------+-----+--------+
| kernel_task | 0 | 0 |
| launchd | 1 | 0 |
| syslogd | 67 | 1 |
| UserEventAgent | 68 | 1 |
| uninstalld | 71 | 1 |
+----------------+-----+--------+
Created at and later open-sourced by Facebook, osquery is now under the ownership and governance of The Osquery Foundation, which is under The Linux Foundation. Osquery is the underlying agent for dozens of security solutions and runs on millions of devices across the globe.
If you aren’t familiar with osquery, you can get started exploring it in less than 5 minutes by installing the agent on your device:
Osquery / Downloads / Official
For those of you who are already familiar with osquery, or using another osquery endpoint manager like Kolide, feel free to dive in and get started exploring why what you thought was NULL may not be what you expect.
Empty String != NULL
A painful lesson I learned early on in my days of writing SQL in osquery was realizing that its output is littered with empty strings.
What is an empty string, you might ask? Let’s take a look:
WITH
demonstration_table (example_name,example_data,example_datatype) AS (
VALUES
('first','foo','not empty string'),
('second','','empty string'),
('third',NULL,'NULL')
)
SELECT * FROM demonstration_table;
+--------------+--------------+------------------+
| example_name | example_data | example_datatype |
+--------------+--------------+------------------+
| first | foo | not empty string |
| second | | empty string |
| third | | NULL |
+--------------+--------------+------------------+
The second value ''
is what is commonly referred to as an empty string.
Despite containing no content, ''
represents data.
The third value, on the other hand, is a NULL
, which means the absence of
data.
While these two values (the second and third rows in the column: example_data
)
look the same, they are very different, and the way the SQLite engine
interprets them could not be more dissimilar.
There are many ways in which databases, including osquery, handle empty strings
differently from NULL
s.
Where NOT NULL and Where Is NULL
The first example of difficulties you may encounter is in situations in which
you attempt to query for data where a specific column is or is not NULL
. Let’s
take the block_devices
table, for example. This table outputs a list of any
connected block devices. In this case, my hard drive (with its various volumes)
and some external media (thumb drives and external hard drives).
It also returns the basic mapping of these volumes to their respective parent volumes (e.g. a hard drive may have many partitions).
SELECT name, parent, label FROM block_devices;
+----------------+------------+----------------------------+
| name | parent | label |
+----------------+------------+----------------------------+
| /dev/disk0 | | APPLE SSD AP1024M Media |
| /dev/disk0s1 | /dev/disk0 | EFI System Partition |
| /dev/disk0s2 | /dev/disk0 | Untitled 2 |
| /dev/disk1 | | AppleAPFSMedia |
| /dev/disk1s1 | /dev/disk1 | Macintosh HD - Data |
| /dev/disk1s2 | /dev/disk1 | Preboot |
| /dev/disk1s3 | /dev/disk1 | Recovery |
| /dev/disk1s4 | /dev/disk1 | VM |
| /dev/disk1s5 | /dev/disk1 | Macintosh HD |
| /dev/disk1s6 | /dev/disk1 | Update |
| /dev/disk1s5s1 | /dev/disk1 | com.apple.os.update-5DC2B8 |
| /dev/disk2 | | WD My Passport 25E2 Media |
| /dev/disk2s1 | /dev/disk2 | EFI System Partition |
| /dev/disk2s2 | /dev/disk2 | Untitled 2 |
| /dev/disk2s3 | /dev/disk2 | Booter |
| /dev/disk3 | | Patriot Memory Media |
+----------------+------------+----------------------------+
The parent volumes are those with no data in the parent column (because they have no parent of their own). Let’s say we wanted to return just these parent volumes; we might attempt to do so by running the following:
osquery> SELECT name, parent, label FROM block_devices WHERE parent ISNULL;
osquery>
No results are returned! This is because, as we demonstrated in the previous section, these empty rows are not actually NULLs, they are empty strings. Let’s amend our SQL to return the desired output:
SELECT name, parent, label FROM block_devices WHERE parent = '';
+------------+--------+---------------------------+
| name | parent | label |
+------------+--------+---------------------------+
| /dev/disk0 | | APPLE SSD AP1024M Media |
| /dev/disk1 | | AppleAPFSMedia |
| /dev/disk2 | | WD My Passport 25E2 Media |
| /dev/disk3 | | Patriot Memory Media |
+------------+--------+---------------------------+
Changing our WHERE
clause to: WHERE parent = ''
results in the output we
expect. This basic example lays the foundation for all of the examples to
follow.
NULL != NULL
Though we didn’t return NULL
values in our previous example it is important to
point out the significance of the ISNULL
operator.
A nuance of true NULL
values is that NULL
does not equal itself and NULL
is not LIKE
itself either. You may wonder what that means, let’s see it in
practice below:
SELECT
1=1,
NULL LIKE NULL,
NULL LIKE '%',
NULL = NULL,
NULL ISNULL;
+-----+----------------+---------------+-------------+-------------+
| 1=1 | NULL LIKE NULL | NULL LIKE '%' | NULL = NULL | NULL ISNULL |
+-----+----------------+---------------+-------------+-------------+
| 1 | | | | 1 |
+-----+----------------+---------------+-------------+-------------+
You must use the operator ISNULL
to find real NULL
values.
Inconsistencies of Behavior When Aggregating
Many of the issues with empty strings stem from the way they are treated when performing aggregation operations.
COUNT treats empty strings as NOT NULL
When the COUNT
operator is used on a column, it records a count of any rows
which are not NULL
. We can see this behavior using the query below:
WITH
demonstration_table (example_data,example_datatype) AS (
VALUES
('foo','not empty string'),
('','empty string'),
(NULL,'NULL')
)
SELECT COUNT(example_data) FROM demonstration_table;
+---------------------+
| COUNT(example_data) |
+---------------------+
| 2 |
+---------------------+
While we would expect to see a count of 1
, counting only the row containing
the example data: 'foo'
, our empty string is counted alongside 'foo'
as the
presence of row data.
We have to be mindful of that when looking for something based on the presence
of data in a WHERE
statement. For example, WHERE path NOT NULL
may not
return the output we expect.
AVG and other aggregation
The confusion of empty strings vs NULL
extends into any SQLite arithmetic due
to the different ways each is handled in aggregation functions.
Let’s take a look at the AVG
function which averages the contents of the data
it operates on.
We will do a basic example first to verify the behavior:
WITH
example_data (data_type,sample_data) AS (
VALUES
('integer',50),
('integer',100))
SELECT AVG(sample_data) FROM example_data;
+------------------+
| AVG(sample_data) |
+------------------+
| 75.0 |
+------------------+
As we expect: (50+100)/2 = 75
But what happens when we throw a NULL
row in the mix?
WITH
example_data (data_type,sample_data) AS (
VALUES
('true-null',NULL),
('integer',50),
('integer',100))
SELECT AVG(sample_data) FROM example_data;
+------------------+
| AVG(sample_data) |
+------------------+
| 75.0 |
+------------------+
The absence of data is not data, so the NULL
row is ignored when
calculating the average of the values. Let’s see what happens when we
substitute our empty string ''
for NULL
:
WITH
example_data (data_type,sample_data) AS (
VALUES
('empty string',''),
('integer',50),
('integer',100))
SELECT AVG(sample_data) FROM example_data;
+------------------+
| AVG(sample_data) |
+------------------+
| 50.0 |
+------------------+
The empty string is interpreted as 0
by SQLite leading to it being calculated
as part of our average and subsequently producing a value, we may have otherwise
not expected.
Empty strings being interpreted as 0
may produce other inconsistencies you do
not expect:
value * NULL = NULL
value * empty string = 0
Let’s see that in osqueryi:
WITH
example_data (value_1,value_2,data_type) AS (
VALUES
(2,'','empty string'),
(2,NULL,'NULL'),
(2,50,'integer'),
(2,100,'integer')
)
SELECT
value_1,
value_2,
data_type,
(value_1 * value_2) AS multiplied_output
FROM example_data;
+---------+---------+--------------+-------------------+
| value_1 | value_2 | data_type | multiplied_output |
+---------+---------+--------------+-------------------+
| 2 | | empty string | 0 |
| 2 | | NULL | |
| 2 | 50 | integer | 100 |
| 2 | 100 | integer | 200 |
+---------+---------+--------------+-------------------+
Boolean interpretation
Boolean values of TRUE
and FALSE
are interpreted by SQLite as 1
and 0
respectively. This means that an average of boolean data is susceptible to the
same problems mentioned above if empty strings are present:
WITH
example_data (data_type,sample_data) AS (
VALUES
('true-null',NULL),
('boolean-true',TRUE),
('boolean-false',FALSE),
('empty string','')
)
SELECT AVG(sample_data) FROM example_data;
+-------------------+
| AVG(sample_data) |
+-------------------+
| 0.333333333333333 |
+-------------------+
Complicating things further: Boolean fields may actually be Tri-State
Many tables in osquery have columns that appear to encode boolean
(TRUE
,FALSE
) data. However, the prevalence of empty strings complicates the
utility of these columns.
Let’s take a look at the chrome_extensions
table to understand why this might
cause us some confusion:
SELECT name, state
FROM users CROSS JOIN chrome_extensions USING(uid);
+----------------------------------------------+-------+
| name | state |
+----------------------------------------------+-------+
| Gmail | 1 |
| Application Launcher For Drive (by Google) | 1 |
| Advanced REST client | 1 |
| The Camelizer | 0 |
| EditThisCookie | 1 |
| Privacy Badger | 1 |
| React Developer Tools | 1 |
| 1Password extension (desktop app required) | 1 |
| Take Webpage Screenshots Entirely - FireShot | 1 |
| Endpoint Verification | 1 |
| Chrome Web Store Payments | 1 |
| Google Drive | 1 |
| Chrome Media Router | |
| Chrome Media Router | 1 |
| Application Launcher For Drive (by Google) | 0 |
| Google Docs Offline | 1 |
| Chrome Web Store Payments | 1 |
| Sheets | 1 |
| Gmail | 1 |
| Adobe Acrobat | 0 |
If we had naively assumed the state
column was a guaranteed boolean, we would
have been mistaken and missed our Chrome Media Router
row.
And again, that empty row? You guessed it, empty string:
SELECT name,state
FROM users CROSS JOIN chrome_extensions USING(uid)
WHERE state = '';
+---------------------------------+-------+
| name | state |
+---------------------------------+-------+
| Chrome Media Router | |
+---------------------------------+-------+
If we weren’t paying attention, it would be easy to get caught by this inconsistent data.
WITH
chrome_extensions_sum AS (
SELECT
COUNT(*) AS total,
SUM(CASE WHEN state = TRUE THEN 1 END) AS enabled,
SUM(CASE WHEN state = FALSE THEN 1 END) AS disabled
FROM users CROSS JOIN chrome_extensions USING(uid)
)
SELECT *,
enabled + disabled AS enabled_plus_disabled
FROM chrome_extensions_sum;
+-------+---------+----------+-----------------------+
| total | enabled | disabled | enabled_plus_disabled |
+-------+---------+----------+-----------------------+
| 53 | 44 | 8 | 52 |
+-------+---------+----------+-----------------------+
Our reliance on the TRUE
, FALSE
operators to evaluate this column do not
account for the possibility of an empty string.
Sometimes Osquery Uses NULL? 🤷
Like so many things in life, there are few hard and fast rules. While there are
a litany of examples of osquery tables with columns that encode empty strings
instead of NULLs
– there are also outliers.
For example, the socket
column in the process_open_sockets
table correctly
encodes NULL
values:
SELECT
pid, socket, remote_address, local_port, path, state
FROM process_open_sockets
WHERE remote_address = '54.230.75.76';
+------+--------+----------------+------------+------+-------------+
| pid | socket | remote_address | local_port | path | state |
+------+--------+----------------+------------+------+-------------+
| 1464 | | 54.230.75.76 | 49501 | | ESTABLISHED |
+------+--------+----------------+------------+------+-------------+
SELECT
pid, socket, remote_address, local_port, path, state
FROM process_open_sockets
WHERE remote_address = '54.230.75.76'
AND socket ISNULL;
+------+--------+----------------+------------+------+-------------+
| pid | socket | remote_address | local_port | path | state |
+------+--------+----------------+------------+------+-------------+
| 1464 | | 54.230.75.76 | 49501 | | ESTABLISHED |
+------+--------+----------------+------------+------+-------------+
Finally! A bit of expected behavior!
🤨 Wait…
The path
column in that result above also has nothing.
🤔 🤔 🤔
Certainly, it will also be a true NULL
, let’s verify…
SELECT
COUNT(*)
FROM process_open_sockets
WHERE remote_address = '54.230.75.76'
AND path ISNULL;
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
😑 Welp, it was a good shot. So much for that moment of hope.
But! It does reveal a characteristic that can help us avoid ambiguity when writing our queries.
Patterns which you can expect to observe:
Columns that are encoded as a
TEXT
field will typically use empty strings in place ofNULL
;Columns that are numeric (eg.
INT
,BIGINT
,DOUBLE
) will typically encode the absence of data asNULL
.
Be Careful Concatenating Results With True NULLs
One peculiarity which you should be aware of when using osquery is that the
common concatenate operator ||
will not produce output when you concatenate a
NULL
with any other string.
You can work around this behavior by instead relying on the CONCAT
operator.
Let’s see the difference in output in osqueryi below:
WITH
example_data (data_type,sample_data) AS (
VALUES
('true-null',NULL),
('boolean-true',TRUE),
('boolean-false',FALSE),
('empty string','')
)
SELECT
CONCAT(data_type,',',sample_data) AS concat_example,
(data_type || ',' || sample_data) AS double_pipe_example
FROM example_data;
+-----------------+---------------------+
| concat_example | double_pipe_example |
+-----------------+---------------------+
| true-null, | |
| boolean-true,1 | boolean-true,1 |
| boolean-false,0 | boolean-false,0 |
| empty string, | empty string, |
+-----------------+---------------------+
As we can see above, the true NULL
row produces no concatenated output in the
||
generated column. If we weren’t rigorously testing our output, we might have
been unaware that we were dropping data on the floor!
When ‘null’ is not NULL
Sadly, even with the rules above, there are stranger examples of things happening in query outputs.
What about when 'null'
is not actually NULL
…
Let’s take a look at the firefox_addons
table.
SELECT
name,
creator
FROM users
CROSS JOIN firefox_addons USING(uid);
+---------------------------------+-------------------+
| name | creator |
+---------------------------------+-------------------+
| Adobe Acrobat | Adobe Systems Inc.|
| Form Autofill | null |
| Firefox Screenshots | Mozilla |
| Web Compatibility Interventions | null |
| WebCompat Reporter | Thomas Wisniewski |
| DoH Roll-Out | null |
| Picture-In-Picture | null |
| Google | null |
| Amazon.com | null |
| Bing | null |
| DuckDuckGo | null |
| eBay | null |
| Wikipedia (en) | null |
| System theme | Mozilla |
| Dark | Mozilla |
| Light | Mozilla |
| Firefox Alpenglow | null |
| Reset Search Defaults | null |
| Add-ons Search Detection | null |
| LastPass: Free Password Manager | LastPass |
| 1Password – Password Manager | AgileBits |
+---------------------------------+-------------------+
Oh no! This is not at all what we expect to see. There are rows with the
literal text string 'null'
.
Let’s take a look at the data source osquery uses to populate the
firefox_addons
table, a JSON file:
~/Library/Application Support/Firefox/Profiles/%.default/extensions.json
While the value is correctly encoded as NULL
in the JSON file, somehow, this
data is being parsed as a literal string 'null'
instead.
This is a fantastic example of another general rule that you should keep in mind when working with osquery:
Data is populated from disparate sources and broadly speaking, osquery relays these sources with minimal consideration for whether they are misformatted, misencoded, or inconsistent.
Never forget:
osquery is merely a messenger; you must check your results!
How to Avoid the Pitfalls of Empty Strings
The following three rules will help ensure you do not fall victim to any pitfalls described above.
Consider each table’s schema (whether a column is
NUMERIC
vs.TEXT
)Sanitize the output of your queries to ensure they cannot silently fail in ways you do not predict.
Test your queries rigorously before blindly relying on their output.
1. Consider each table’s schema
As we discussed above in the process_open_sockets
example, there is a typical
pattern in osquery that TEXT
columns will typically encode empty-strings and
NUMERIC
columns will typically encode NULL
. Knowing the type of columns you
are working with will help you predict how they will behave.
When writing a new query, you can spot-check the schema by referencing the
osquery Schema Docs
or by typing the .schema table_name
command in osqueryi:
osquery> .schema process_open_sockets
CREATE TABLE process_open_sockets(
`pid` INTEGER,
`fd` BIGINT,
`socket` BIGINT,
`family` INTEGER,
`protocol` INTEGER,
`local_address` TEXT,
`remote_address` TEXT,
`local_port` INTEGER,
`remote_port` INTEGER,
`path` TEXT,
`state` TEXT,
`net_namespace` TEXT HIDDEN,
PRIMARY KEY (`pid`)) WITHOUT ROWID;
Understandably, many queries you write will call columns from more than one table. You can use another unique trick to determine the data types you are dealing with for a given query.
The .types
command will give you the datatypes from all the columns present
in your query’s output:
osquery> .types SELECT * FROM users, logged_in_users WHERE username = user;
+-------------+---------+
| name | type |
+-------------+---------+
| uid | BIGINT |
| gid | BIGINT |
| uid_signed | BIGINT |
| gid_signed | BIGINT |
| username | TEXT |
| description | TEXT |
| directory | TEXT |
| shell | TEXT |
| uuid | TEXT |
| is_hidden | INTEGER |
| type | TEXT |
| user | TEXT |
| tty | TEXT |
| host | TEXT |
| time | BIGINT |
| pid | INTEGER |
+-------------+---------+
2. Sanitize the output of your queries
Due to the sometimes unpredictable nature of NULL
vs empty string, it is often
best to take a multi-pronged approach and try to sanitize any strings you wish
to perform NULL
dependent operations (COUNT
etc.) on.
Thankfully, SQLite and, by extension, osquery support the
NULLIF
operator.
This is more or less a shortcut to a CASE
statement, but it is still handy to
know. You supply the column name you wish to nullify empty strings for and then
declare ''
as the rule for nullification. We can take any of our earlier
examples and nullify the problematic rows using this approach:
SELECT
name,
NULLIF(creator,'null') AS creator
FROM users
CROSS JOIN firefox_addons USING(uid);
+---------------------------------+-------------------+
| name | creator |
+---------------------------------+-------------------+
| Adobe Acrobat | Adobe Systems Inc.|
| Form Autofill | |
| Firefox Screenshots | Mozilla |
| Web Compatibility Interventions | |
| WebCompat Reporter | Thomas Wisniewski |
| DoH Roll-Out | |
| Picture-In-Picture | |
| Google | |
| Amazon.com | |
| Bing | |
| DuckDuckGo | |
| eBay | |
| Wikipedia (en) | |
| System theme | Mozilla |
| Dark | Mozilla |
| Light | Mozilla |
| Firefox Alpenglow | |
| Reset Search Defaults | |
| Add-ons Search Detection | |
| LastPass: Free Password Manager | LastPass |
| 1Password – Password Manager | AgileBits |
+---------------------------------+-------------------+
Let’s verify that our COUNT
function is now working the way we anticipate:
SELECT
COUNT(NULLIF(creator,'null')) AS creator_present_count
FROM users
CROSS JOIN firefox_addons USING(uid);
+-----------------------+
| creator_present_count |
+-----------------------+
| 8 |
+-----------------------+
So there we have it! If there is a table where you suspect empty strings are
being recorded in place of NULL
all you have to do is make sure to nullify that
column using the NULLIF
operator or to wrap your aggregated column inside a
NULLIF
.
3. Test your queries rigorously
Check twice, then check again.
The last point cannot be emphasized enough; you should never assume osquery is just going to ‘do the right thing.’
Independently verify the output of your queries in osqueryi on representative devices and supply sample data (including empty strings) using CTEs if you want to play it extra safe.
Never assume that just because output looks consistent at a glance that it will
always remain so across a diverse fleet of endpoints, the chrome_extensions
example earlier demonstrated how even a single row might be missing a given
column.
Further Reading
Interested in more how-to guides on osquery? I recommend reading some of my other posts: