How Elasticsearch can help you find suspicious activity on your website

We offer Habrahabr readers a translation of the article "Spotting bad actors: what your logs can tell you about protecting your business" from the official Elasticsearch blog. The article explains how to use Elasticsearch to analyze web server logs and detect suspicious activity on a site.

Let's think about what we do, and when we do it, if someone tries to hack our site. First, we most often try to eliminate the threat only after attackers have already found a vulnerability on the site and exploited it. Second, the only immediate tool against attackers is often blocking IP addresses, but that is not very effective if we do not know all the addresses the attack is coming from.

But how much would the situation change if we could obtain, in advance, comprehensive information about all IP addresses and subnets that show suspicious activity, and block them? Sounds great, doesn't it?

We can easily do this with Elasticsearch.

The search engine's arsenal includes a wonderful plugin, Netrisk, which takes care of the log analysis and uses a Sankey diagram (see picture) to show the size and concentration of suspicious activity in different segments of traffic.


Preparation

Let's start by installing Netrisk. To do so, run the following command from your Elasticsearch home directory (version 1.4.0 is required):


bin/plugin -install markharwood/netrisk

Great, the plugin is now installed. But that's not all. It expects an index with a correctly configured mapping for the field that stores IP addresses. I'll explain the details later; for now, you just need to create an index and populate it with a small amount of test data by running the following shell script:


$ES_HOME/plugins/netrisk/exampleData/indexAnonData.sh

Attention! This script creates an index named "mylogs". If you already have an index with that name, it will be deleted.

If you followed all the instructions above, you can now open the plugin page: localhost:9200/_plugin/netrisk/


Run

If you look at the data generated by the script, it may seem too sparse for serious analysis: the only valuable thing we have is the HTTP status of the server's responses. In fact, that is enough to detect suspicious behavior. Usually the web server returns statuses in the 2xx or 3xx range, but when something goes wrong it returns a status in the 4xx or 5xx range. For example, this happens when someone tries to access a nonexistent page. So we can get a list of all suspicious requests to the server with the query:


status:[400 TO 599]

Netrisk uses the standard Lucene query parser (the same one Kibana uses), so you can extend the query that identifies suspicious traffic with additional filters joined by OR. For example, this way we can tell the system that requests to the site that do not specify a UserAgent should also be considered suspicious.
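
As a rough illustration of combining conditions with OR, the sketch below builds such a query string in Python. The field name `agent` is an assumption about how the user agent is stored in your logs (your index may name it differently), and `_missing_:` is the query_string syntax for "this field has no value" in the Elasticsearch 1.x releases this article targets. The helper `any_of` is hypothetical, not part of Netrisk.

```python
def any_of(*clauses):
    """Join Lucene query clauses with OR, wrapping each one in parentheses."""
    return " OR ".join("({})".format(clause) for clause in clauses)


# Error responses, exactly as in the query above.
bad_status = "status:[400 TO 599]"
# Assumption: the user agent is indexed in a field called "agent".
no_user_agent = "_missing_:agent"

suspicious = any_of(bad_status, no_user_agent)
print(suspicious)
```

The resulting string can be pasted into the Netrisk query box in place of the simpler status-only query.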

It is important to understand that the query we define at this stage does not assert that every log entry matching it is bad. We are only assuming that such entries may be suspicious. Elasticsearch then analyzes the whole log to find high concentrations of suspicious requests coming from a specific IP address or subnet.

If we run the plugin with the query above, Netrisk shows us a Sankey diagram of the suspicious traffic reaching the site. Here is how to read the chart:

Line thickness represents the number of "bad" requests, but this is not the most important thing!
Much more important is the color of the line, which depends on which requests dominate: bright red means that almost all requests are bad, while green means that almost all requests can be considered good. If you hover the cursor over a line, you can see the actual numbers behind the color.
The diagram shows which IP addresses belong to each subnet. This can be very valuable to webmasters when determining where malicious traffic comes from: specific IP addresses or whole subnets.
Clicking a specific IP address opens the Project Honey Pot website, where you can see comments from other webmasters about that IP address.

You have probably noticed that the color of a line can change from red to green as you move from left to right. This is because on the left side of the diagram each node usually represents a small group of IP addresses whose requests were identified as suspicious. The nodes further to the right represent subnets that contain a large number of IP addresses with varied behavior. However, some subnets will be entirely red. Most likely this means that nobody in that region except attackers is interested in your site (for example, if you have a Russian website and red traffic is coming from China, this most likely means that Chinese hackers want to harm your resource).


How does it work?

A little about data mapping

The queries Netrisk issues rely on statistics about IP addresses and subnets stored in the index. For this analysis we cannot simply index each IP address as a string. We need to split it into tokens that represent the IP address and the subnets it belongs to (for example, the IP address 186.28.25.186 is split into the tokens 186.28.25.186, 186.28.25, 186.28 and 186). This can be done with the following mapping rules:

Mapping rules

curl -XPOST "http://localhost:9200/mylogs" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "ip4_analyzer": {
          "tokenizer": "ip4_hierarchy"
        }
      },
      "tokenizer": {
        "ip4_hierarchy": {
          "type": "PathHierarchy",
          "delimiter": "."
        }
      }
    }
  },
  "mappings": {
    "log": {
      "properties": {
        "remote_host": {
          "type": "string",
          "index": "not_analyzed",
          "fields": {
            "subs": {
              "type": "string",
              "index_analyzer": "ip4_analyzer",
              "search_analyzer": "keyword"
            }
          }
        }
      }
    }
  }
}'

This approach lets us search quickly at all four levels of the hierarchy of each IP address at once (a similar approach can be applied to IPv6 addresses).
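
To see what the tokenization described above produces, here is a minimal Python sketch that mimics the output of a path-hierarchy tokenizer with "." as the delimiter. It only illustrates the idea; the function name `ip_hierarchy` is hypothetical, and the real work is done by Elasticsearch at index time.

```python
def ip_hierarchy(ip, delimiter="."):
    """Return every prefix of the address, from the widest subnet to the full IP."""
    parts = ip.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


tokens = ip_hierarchy("186.28.25.186")
print(tokens)
```

Because every prefix is stored as its own token, a single term lookup on remote_host.subs matches all log entries from a given subnet, whatever its depth.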

What's inside?

Netrisk takes your query, which defines what should be considered "bad requests" (or, more precisely, "potentially bad requests"). After filtering the data, Netrisk uses the significant_terms aggregation to determine which IP addresses or subnets suspicious requests most often come from. The query template looks like this:


curl -XGET "http://localhost:9200/anonlogs/_search?search_type=count" -d '{
  "query": {
    "query_string": {
      "query": "status:[400 TO 599]"
    }
  },
  "aggs": {
    "sigIps": {
      "significant_terms": {
        "field": "remote_host.subs",
        "size": 50,
        "shard_size": 50000,
        "gnd": {}
      }
    }
  }
}'

This query selects the 50 most suspicious IP addresses and subnets. A few points deserve attention:

To get correct data we need a high shard_size value. This is demanding on memory and the network, and the index needs room for a large number of unique terms. If we do not index full IP addresses in the remote_host.subs field, the load will decrease, but so will the depth of the results.

This single query performs the bulk analysis and gives us basic information about the offenders in our system, but it is worth refining some details before deciding to block IP addresses. For that we can use the following query:

Request

{
  "query": {
    "terms": {
      "remote_host.subs": [
        "256.52",
        "186.34.56"
      ]
    }
  },
  "aggs": {
    "ips": {
      "filters": {
        "filters": {
          "256.52": {
            "term": {
              "remote_host.subs": "256.52"
            }
          },
          "186.34.56": {
            "term": {
              "remote_host.subs": "186.34.56"
            }
          }
        }
      },
      "aggs": {
        "badTraffic": {
          "filter": {
            "query": {
              "query_string": {
                "query": "status:[400 TO 599]"
              }
            }
          }
        },
        "uniqueIps": {
          "cardinality": {
            "field": "remote_host"
          }
        }
      }
    }
  }
}

For each suspicious IP address or subnet listed in the query, we get the following information:

Total number of requests (good and bad);
Total number of bad requests;
Total number of unique IP addresses in the subnet.
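
To make these three numbers concrete, here is a small Python sketch that reads them out of a response to the query above and computes the share of bad requests per subnet. The response fragment is made up for illustration (the counts are not real data); its shape assumes the usual layout of named `filters` buckets with nested `filter` and `cardinality` aggregations. The function `summarize` is hypothetical.

```python
# Hypothetical response fragment; all numbers are invented for illustration.
response = {
    "aggregations": {
        "ips": {
            "buckets": {
                "256.52": {
                    "doc_count": 200,
                    "badTraffic": {"doc_count": 190},
                    "uniqueIps": {"value": 12},
                },
                "186.34.56": {
                    "doc_count": 1000,
                    "badTraffic": {"doc_count": 50},
                    "uniqueIps": {"value": 300},
                },
            }
        }
    }
}


def summarize(resp):
    """Return {subnet: (total, bad, unique_ips, bad_ratio)} for each bucket."""
    summary = {}
    for subnet, bucket in resp["aggregations"]["ips"]["buckets"].items():
        total = bucket["doc_count"]                # good and bad requests
        bad = bucket["badTraffic"]["doc_count"]    # requests matching the "bad" query
        unique_ips = bucket["uniqueIps"]["value"]  # approximate distinct IP count
        ratio = bad / total if total else 0.0
        summary[subnet] = (total, bad, unique_ips, ratio)
    return summary


for subnet, (total, bad, ips, ratio) in sorted(summarize(response).items()):
    print("{}: {}/{} bad ({:.0%}), {} unique IPs".format(subnet, bad, total, ratio, ips))
```

A subnet with a high share of bad requests spread over few unique addresses is a stronger candidate for blocking than a large subnet where bad requests are a small minority.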

Now that we have the chart and detailed statistics on suspicious IP addresses, we can make the final decision on blocking.


Conclusion

Tracking the behavior of entities such as IP addresses through web server log analysis is a computationally complex task, and what we have covered here is only the tip of the iceberg.

Here are a few more interesting examples of behavioral analysis that you can implement yourself:

How much time do visitors spend on my website?

Which IP addresses behave like bots (loading only the page markup, without CSS and JavaScript)?

Which page of the website is most often the first or last one visited?
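
As one possible starting point for the question about bots, the sketch below flags IP addresses that requested pages but never any CSS or JavaScript. It works on a handful of in-memory records made up for illustration. The record layout (`ip`, `path`) and the function `markup_only_ips` are assumptions, and a real implementation would more likely run as an aggregation over the indexed logs than in Python.

```python
from collections import defaultdict

# Hypothetical log records, invented for illustration.
records = [
    {"ip": "10.0.0.1", "path": "/index.html"},
    {"ip": "10.0.0.1", "path": "/static/site.css"},
    {"ip": "10.0.0.1", "path": "/static/app.js"},
    {"ip": "10.0.0.2", "path": "/index.html"},
    {"ip": "10.0.0.2", "path": "/about.html"},
]

# Simplification: paths with query strings (e.g. "/app.js?v=1") are not handled.
ASSET_SUFFIXES = (".css", ".js")


def markup_only_ips(log):
    """Return the sorted IPs that never requested a CSS or JavaScript file."""
    fetched_assets = defaultdict(bool)
    for rec in log:
        is_asset = rec["path"].lower().endswith(ASSET_SUFFIXES)
        fetched_assets[rec["ip"]] = fetched_assets[rec["ip"]] or is_asset
    return sorted(ip for ip, got in fetched_assets.items() if not got)


print(markup_only_ips(records))
```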

Article based on information from habrahabr.ru
