Homing in on the Homeless – the Splunkish way

Have you noticed Splunk just released a new version, including new data visualizations? I had been eager to start playing with one of the new charts when yesterday I came across a blog post by Bob Rudis, co-author of the Data-Driven Security book and a former member of Verizon's DBIR team.

In that post, @hrbrmstr presents readers with a dataviz challenge based on data from the U.S. Department of Housing and Urban Development (HUD) related to homeless population estimates. So I've decided to give it a go with Splunk.

Even though we can't compare the power of R and other stats/dataviz-focused programming languages with the current Splunk search language (SPL), this exercise may serve to demonstrate some of the capabilities of Splunk Enterprise.

Sidenote: In case you are into Machine Learning (ML) and Splunk, it’s also worth checking the new ML stuff just released along with Splunk 6.4, including the awesome ML Toolkit showcase app.

The challenge is basically about asking insightful, relevant questions of the HUD data sets and generating visualizations that help answer those questions.

What can the data sets tell us about the homeless population issue?

The following are the questions I try to answer, considering the one proposed in the challenge post: which "states" have the worst problem in terms of homeless people?

  1. Which states currently have the largest homeless population per capita?
  2. Which states currently have the largest absolute homeless population?
  3. Which states are succeeding or failing at lowering the figures compared to previous years?

I am far from considering myself a data scientist (I was looking up the standard deviation formula the other day), but I love playing with data like many other Infosec folks in our community. So please take it easy on newbies!

Since we are dealing with data points representing estimates, and this is a sort of experiment/lab, take the results with a grain of salt and consider prefixing the statements found here with "according to the data sets… here's what that Splunk guy verified".

Which states currently have the largest homeless population per capita?

For this one, it’s pretty straightforward to go with a Column chart for quick results. Another approach would be to gather map data and work on a Choropleth chart.

Basically, after calculating the normalized values (homeless per 100k population), I filter in only the US states at the top of the list, limiting it to 10 values. They are then sorted by the 2015 values and displayed in the chart below:
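In SPL terms, the gist of it looks something like the sketch below. This is a minimal sketch, assuming a hypothetical homeless_ratio.csv lookup that already holds the prepared fields (Year, Name, Total, ratio); the full preparation query is listed at the end of this post:

| inputlookup homeless_ratio.csv
| where Year=2015
| sort 10 -ratio
| table Name ratio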


Homeless per 100k of population – Top 10 US states

The District of Columbia clearly stands out, followed by Hawaii and New York. That's one I would never have guessed. But there seems to be some explanation for it.

Which states currently have the largest absolute homeless population?

In this case, only the homeless figures are considered when extracting the top 10 states. Below are the US states where most of the homeless population lives, based on the latest numbers (2015); click to enlarge.
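The selection logic is the same as in the previous sketch, only sorted by the raw counts instead of the ratio (again assuming the hypothetical prepared lookup):

| inputlookup homeless_ratio.csv
| where Year=2015
| sort 10 -Total
| table Name Total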


Homeless by absolute values – Top 10 US states

As many would guess, New York and California lead here. Those two states, along with Florida and Texas, have clearly been at the top of the list since 2007.

Which states are succeeding or failing at lowering the figures compared to previous years?

Here we make use of a new visualization called the Horizon chart. In case you are not familiar with it, I encourage you to check this link, where everything you need to know about it is carefully explained.

Basically, it eases the challenge of visualizing multiple (time) series in less space (height) by using layered bands with different color codes to represent relative positive/negative values, and different color shades (intensity) to represent the actual measured values (data points).

After crafting the SPL query, here's the result (3 bands, smoothed edges) for all 50 states plus DC present in the data sets.


So how do you read this visualization? Keep in mind the chart is based on the same prepared data used in the first chart (homeless per 100k population).

Red means a data point is higher than the previous measurement (more homeless per capita), whereas blue represents a negative difference between the current and previous measurements (fewer homeless per capita). This way, the chart also conveys trends, possibly uncovering changes of direction over time.

The more intense the color, the higher the (absolute) value. You can also picture it as a stacked area chart that doesn't need extra height for rendering.

The numbers listed on the right-hand side represent the difference between adjacent data points in the timeline (current versus previous). For instance, last year's ratio (2015) for Washington decreased by ~96 compared to the previous year (2014).
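Those year-over-year deltas are easy to sanity-check straight from the search bar. A minimal sketch, once more against the hypothetical prepared lookup, using the delta command to subtract each year's ratio from the next:

| inputlookup homeless_ratio.csv
| where Name="Washington"
| sort Year
| delta ratio AS yoy_change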

On a Splunk dashboard, or in the search interface (web GUI), there's also an interactive line that displays the relative values as the user hovers over a point in the timeline, which is really handy (seen below).


The original data files are provided below and also referenced from the challenge's blog and GitHub pages. I used an xlsx2csv one-liner before handling the data in Splunk (there are many other ways to do it, though).

HUD’s homeless population figures (per State)
US Population (per State)

The Splunk query used to generate the input data for the Horizon chart is listed below. It looks a bit hacky, but it does the job well without too much effort.

| inputlookup 2007-2015-PIT-Counts-by-State.csv
| streamstats last(eval(case(match(Total_Homeless, "Total"), Total_Homeless))) as _time_Homeless
| where NOT State_Homeless="State"
| rex mode=sed field=_time_Homeless "s|(^[^\d]+)(\d+)|\2-01-01|"
| rename *_Homeless AS *
| join max=0 type=inner _time State [
  | inputlookup uspop.csv
  | table iso_3166_2 name
  | map maxsearches=51 search="
    | inputlookup uspop.csv WHERE iso_3166_2=\"$iso_3166_2$\"
    | table X*
    | transpose column_name=\"_time\"
    | rename \"row 1\" AS \"Population\"
    | eval State=\"$iso_3166_2$\"
    | eval Name=\"$name$\"
  "
  | rex mode=sed field=_time "s|(^[^\d]+)(\d+)|\2-01-01|"
]
| eval _time=strptime(_time, "%Y-%m-%d")
| eval ratio=round((100000*Total)/Population)
| chart useother=f limit=51 values(ratio) AS ratio over _time by Name

Want to check out more write-ups like these? I did one in Portuguese about Brazil's federal budget (also based on Splunk charts). Perhaps I will update it soon with new charts and a short English version.

Blame it on YOU for the damn false-positives!

Below is a list of 6 facts (and counting) you should know before whining and complaining about the infamous false-positive (FP) topic. If you've been there, feel free to comment and share your pain or your own facts.

As you know, FPs are everywhere, multiplying just like Gremlins after a shower! [Misled Millennials, click here] In fact, they seem to be part of 12 in every 10 Security Monitoring projects out there, including those relying heavily on SIEM technology.

So here’s the list of facts and some quick directions:

#1 Canned content sucks

Bad news: out-of-the-box content doesn't go well with Security Monitoring, my friend. Your adversaries are quite creative, and you need to catch up.

But before bashing vendors, think about it from their perspective. Would you release a product without any content? Canned rules are there to give an idea of the product's usage; they are meant for general cases. You need to RTFM and spend a great deal of time evaluating and testing the product's features to enable customization.

If you are into #photography, let me ask you: are you still shooting on automatic, generating lower-resolution JPEGs (more shots!), with your fully customizable camera? If you want to boost your chances of getting that perfect shot, you need to go manual and experiment with all the possible settings.

The technology is just another tool. Regardless of the feature set, you should be able to get the best out of it by translating your knowledge into a tailor-made rule set.

#2 You don’t have a scope

Since it's pretty much utopia to have a risk analysis result as the input for your Security Monitoring program, you need to find another way to define an initial scope. This will drive the custom rules development cycle and define the monitoring coverage (detective controls).

We tend to go for the low-hanging fruit or quick wins, but even then you still need to spend some time defining your goals. And here the "Use Cases" conversation starts, perhaps one of the most important in the whole program!

Actually, it deserves a full article of its own, but I will leave you with a simple yet interesting approach presented by Augusto Barros (Gartner). I liked how he pretty much defines the scope as the intersection between Importance (value) and Feasibility (effort).

#3 People > Technology

No matter how cool and easy to use the technology is, if you don’t have a skilled team to work on #1 (RTFM and experiment) and #2 (define goals), you will likely end up with a bunch of unattended alerts.

Therefore, don't hire deployers; hire Security Content Designers and Security Content Engineers, and enable them to extract the value from your security arsenal (investment).

#4 Exception handling is not optional

When I say tailor-made, I mean your rule should already address exceptions to a certain extent; that's also called whitelisting.

If, during development, you realize a rule is generating too many alerts, try to anticipate the analyst's call and filter out scenarios that are not worth investigating.

The technology MUST provide easy ways to quickly add/modify/remove exceptions. Bonus points for products providing auditing (who, when, why).
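To make the idea concrete, here's a minimal SPL sketch of the pattern, assuming a hypothetical proxy index and a whitelist.csv lookup holding a dest field. The rule logic stays untouched while exceptions live in the lookup, which is easy to add to, modify and review:

index=proxy action=blocked
| search NOT [ | inputlookup whitelist.csv | fields dest ]
| stats count BY dest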

#5 Aggregation is key

Have you noticed how many alerts relate to the very same case or target?

Some technologies (and security engineers), SIEMs especially, dare to generate the very same alert, multiple times, simply because the time range considered for processing the events overlaps, being longer than the rule's execution interval itself. "Run every hour, checking the last 5 days." WTH?

What about aggregating on the target (or victim, if you will)? Not all products provide such functionality, but most SIEMs do. So instead of generating multiple alerts for every single hit on your data stream, why not consolidate those events into a single alert? Experiment with aggregating on unique targets to start with.
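In SPL terms, an aggregated rule might look like the sketch below (hypothetical ids index and field names): one result row, and hence one alert, per target, over a non-overlapping time range that matches an hourly schedule:

index=ids severity=high earliest=-1h@h latest=@h
| stats count AS hits values(signature) AS signatures max(_time) AS last_seen BY dest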

#6 Threat Intel data != IDS signature

There's not much to say here. With all the Threat Intel hype nowadays, just go ahead and read Alex Sieira's nice blog post: Threat Intelligence Indicators are not Signatures.

Help spread the love for FPs and share your thoughts!

 

Not everything that happens in Vegas, stays there!

I'm here waiting for my next flight and thought about writing this short post, describing a recent experience I've had.

Last week, I had the pleasure of joining all Splunkers during our Sales Kickoff (SKO) event in Las Vegas. What can I say? My first time in the world capital of entertainment: amazing atmosphere, great stories and an awesome event.

Besides putting faces to names (or emails), it was a great opportunity to discuss and get instant feedback from a bunch of smart people. That, by the way, is one of the many characteristics of this big data giant: people are accessible and open to ideas and (technical) discussions.

Ultimately, contrary to the cliche, all of that will definitely not stay in Vegas, but will be part of another nice experience in my career that I will keep sharing. It's been an exciting ride so far at Splunk! I hope to be there next year, wherever it's going to be!

Think about UX and Security next time you get onboard!

It works! Another title bait. Perhaps it should have been titled "Oslo's Flytoget: a good UX/Security balance?". Anyway, since you are here, I hope you enjoy the read.

Besides visiting many places, meeting interesting people and experiencing different cultures, working as a consultant gives you the chance to face unique human–computer interaction experiences.

I actually started my career as a web designer and have been into this UX thing ever since, so it's great to spot different User Experience (UX) implementations.

San Francisco’s BART

BART stands for Bay Area Rapid Transit, and it's probably the most used public transport between SFO and the city center. If you've ever been there, you will probably recall its ticket machines from the images below.

Look at how many buttons, slots and signs! That's scary. My first time there, I felt stupid trying to get it to work. After a few attempts I decided to step away and watch someone doing it correctly (or getting embarrassed as well).


But rather than describing what a struggle it is to buy a ticket for BART, which seems to be well known, I will simply compare it to an experience I've had in another country with a similar goal: the airport ↔ city ride.

"Wait, what if I use a Clipper card?" Well, check the official video instructions on how to top it up and tell me what you think!

In case you don't know BART or haven't checked the videos, please take a moment to imagine how to operate this thing from the pictures below. And keep in mind we are into IT/computers; now think of the elderly and other humans who are not so tech savvy.


BART's ticket machine home screen


To buy a ticket, after checking the destination's ticket price on the wall, you must input the exact price by adding/subtracting values with the side buttons!

At this point, we should agree that the city of the Golden Gate Bridge, where a lot of startups and great minds constantly come up with cool ideas, deserves a better interface for its rail/subway system.

How does Oslo’s Flytoget work?

Basically, you swipe and go.

Well, all you need to do is swipe your credit card at one of the "ticket machines" and board the train. That's it. Done. End. No plastic/paper ticket, no ticket gate/ratchet/turnstile.

Apart from the fact that BART is located in a country where a lot of people drive rather than use public transportation (sources: Vehicles per Capita and Public Transportation usage), and that it is not an express service like Flytoget, I am simply comparing the interfaces of two train ticket systems.

Here are the pictures from the simplest ticket interface I have ever seen:


Flytoget’s ticketing machine


Single operation: swipe the card.


After successfully swiping the card, a green message is shown (assuming red otherwise?)

An additional step is needed after swiping the card in case you are departing from the airport: there's a touch screen (image below) where you tap the icon corresponding to your final destination. Since the ticket price is fixed, I guess that's for analytics reasons.


There's also a mobile app, with no card swiping needed, though I've never tried it (perhaps less CC data exposure there?). It's not an exceptional UX example, but it's better than most systems out there.

Everything comes with a price: Security x UX

How are they handling or storing my credit card data? What about a receipt? How can I expense the cost of the journey if my company does not accept credit card statements? Here, Security/Privacy might become an issue.

A receipt is available on Flytoget's website within 24 hours after the journey. First you create an account, and then you link your CC number to your profile (!).

Now, needless to say, they must be storing some credit card data, likely including parts of the CC number. If you know how it works, please leave a comment below.

It's easy to suggest we need a balance between UX and Security when there are actually so many variables involved. But IMO, we need to think about business success first, which is directly tied to UX (way beyond Security?).

If following the rules (laws, regulations, etc.) allows such an option (swipe and go), it should be considered. Also, users should know their CC might be exposed from the moment it's shipped, just as they should know about the refund policy in case of CC theft or fraud.

From the user's perspective, depriving yourself of those solutions sometimes makes little sense. That becomes even more interesting from the UX designer's perspective, considering most users will not even bother evaluating those risks.

Should we provide a unique user experience (UX) at the price of increased risk? Or should we provide better Security at the price of an average UX? That's just one of the dilemmas UX/Infosec professionals face.

UX pros should consider Security as part of their designs, just as we, Sec pros, should consider UX when planning our strategies and actions.

 

Splunkers on Twitter

Below is a list of Splunk users I am following on Twitter, including Splunkers, partners and awesome customers. Most of them are also into #Infosec. The list is not sorted in any particular order.

Missing someone, maybe you?! Please feel free to contact me with suggestions. In case you want to follow a list, it is also available on Twitter here.

Ryan Kovar @meansec
Staff Security Strategist @Splunk. Enjoys clicking too fast, long walks in the woods, and data visualizations.

Holger Sesterhenn @sesterhenn_splk
Sales Engineer, CISSP, Security Know-How, Machinedata, Security Intelligence, IoT, Industrie 4.0, BigData, Hadoop, NoSQL, User Behavior Analytics

The Dark Overlord @StephenGailey
Towering intellect; effortlessly charming…

Cédric @_CLX
Let me grep you. #infosec and useless stuff. Using security buzzwords since 2005. https://github.com/c-x

Brad Shoop @bradshoop
Security Onion for Splunk app developer, infosec, devops, infrastructure, cloud and homebrewer.

monzy merza @monzymerza
Chief Security Evangelist @Splunk. Thoughts are my own.

Damien Dallimore @damiendallimore
Splunk Dev Evangelist, Golfer, Rugby Player, Musician, Scuba Diver, Thai linguist, Chef.

Adam Sealey @AdamSealey
Information security, both applied and research. CSIRT, DFIR, and analytics Generalist geek. Husband & father of 3. Tweets are my own.

Hacker Hurricane @HackerHurricane
Austin TX. area Information Security Professional

Mika Borner @my2ndhead
Splunk Artisan. Because Splunking is an art.

David Shpritz @automine
I Splunk all the things. Blieve, hon. Splunk, Web App Sec, Open source, EDC

Dimitri McKay @dimitrimckay
Glazed donut connoisseur, plus size hand model, technologist, splunker, replicant, security nerd, CISSP, MMA fighter, zombie killer & lover of pitbulls.

Luke Murphey @LukeMurphey
Developer of network security solutions at #splunk. Founding member of Threatfactor (http://ThreatFactor.com ) and Converged Security (acquired by GlassHouse).

Sebastien Tricaud @tricaud
Principal Security Strategist @Splunk. Playing with data, binary-ascii-utf16-whatever. Opinions are my own, not my employers. Re-tweeting != Agreeing

Dave Herrald @daveherrald
dad | husband | splunk security architect | GIAC GSE | tweets=mine

Ryan Chapman @rj_chap
Security enthusiast. Incident response analyst. Malware hobbyist. Retro game lover. Husband and father. TnVsbGl1cyBpbiB2ZXJiYS4= http://github.com/BechtelCIRT

Michael Porath @poezn
Product Manager for Data Visualization @splunk. Bay Area based Swiss Information Scientist

skywalka @skywalka
my daughter, basketball, hip hop, film, comics, linux, puppet, splunk, nagios, and sensu keep me awake

James Bower (Hando) @jamesbower
Pentester / Threat intelligence / #OSINT / #Honeypots / #Bro_IDS / #Splunk | #Python / Follower of Christ and occasional blogger – http://jamesbower.com

georgestarcher @georgestarcher
Information Security, Log analysis and Splunk, Forensics, Podcasting. Photography and OSX Fan. GnuPGP key ID: 875A3320BD558C9E

Brian Warehime @brian_warehime
Security Analyst | Threat Researcher | #Honeypots | #Splunk | #Python | #OSINT | #DFIR

Michel Oosterhof @micheloosterhof
Splunk // My opinions are my own.

Hal Rottenberg @halr9000
I am the Lorax. I speak for the Developers! @Splunk, Author, Podcaster @powerscripting , Speaker, #PowerShell MVP, #CiscoChampion, husband, father of four!

Jason McCord @digirati82
Security analyst, software developer, #Splunk fan. Log everything. #WLS #DFIR

Matthias Maier @Matthias_BY

Siri De Licori @siridelicori

Challenge your MSSP/SOC/CSIRT: what metrics can they provide you?

I was trying to recall a famous quote about "metrics" to include here, and this is what Mr. Google hinted at: "If you can't measure it, you can't manage it."

The quote has a few variations, but that seems to be the most famous one. Perhaps now it will finally stick. So, does it make sense, or is it just another unquestioned corporate adage?

Basically, the idea here is to give you more food for thought in case you are into this metrics thing and trying to apply it to Security Operations.

Actually, let me start by saying I like measuring data, so metrics are an interesting topic to me. Simply put, communicating your effort and progress to management is way easier if you can come up with a metric from which they understand what you are doing and why.

As usual, bonus points if a metric ties to a business goal (more on that below). Working on good, easily digestible metrics also saves management time, a resource that is not there only for you, nor can it be allocated quickly. Therefore, selecting key metrics and meaningful charts is an opportunity security practitioners cannot miss in order to keep their budgets flowing in.

many questions, few metrics

How do you evaluate the work done by your SOC or SecOps team? How do you verify that your MSSP is providing a good service?

Within Security Operations, and I dare use this term to refer to the tasks carried out by MSSPs, SOCs or CSIRTs, you should generate metrics that help or enable answering the following questions:

  1. How many investigations ended up being a false positive (FP) or a real threat (TP)?
  2. From the above answers, which scenarios show up most often? Is there a technology, NIDS signature, correlation rule or process clearly performing better (or worse) than the others?
  3. Which analysts are involved in the process of developing or tuning the signatures/rules that lead to real investigations?
  4. In a multi-tier environment, which analysts were responsible for the triage of most FP cases?
  5. MSSP only: are customers responding to or interacting with cases raised to their security teams?
linking Metrics to benefits

Now, read question #1 and ask yourself: do you really believe a properly deployed security infrastructure will never, ever detect a real threat? So why are you still paying an MSSP to provide you with anything but FPs? Checkbox Security?

No wonder your Snort/Bro guy with a single sensor is able to provide 10 times more consumable alerts than your 5 super-duper Checkpoint NG IPS blades. Track the answers to questions #2 and #3 to find out why.

From #4 you will have a better idea of where to invest your training budget and which analysts might need some mentoring.

A high number of evaluated incidents doesn't mean people are busy with analysis, nor does it mean good work. The higher the FP rate in the SOC's escalations, the less interest your customer will have, which indicates less engagement in following up on the investigations. Refer to #5.

And what about the relationship with business goals? That's easier to exemplify for MSSPs: sound metrics performing as expected are the best ammunition you can bring to the table for contract renewals or (oops!) upselling.

Here are some examples of (measurable!) metrics:

  • Alerts to Escalations ratio
  • Escalations to real investigations ratio
  • Alerts per shift/analyst
  • Time to triage (evaluate a new alert)
  • Time to close an investigation (by outcome)
  • Number of FPs/TPs per rule, signature, use case

If you embrace Gamification, there are many more that might be interesting, for example: the escalations to real investigations (TPs) ratio per analyst or shift.

No Case Management = No Game

An investigation must have a start and an end; otherwise it's impossible to measure its output. Even if you want to monitor an attacker's behavior for a while, that decision (observe, follow up) was most likely the result of an investigation.

Now, scroll up to the list and ask yourself how many of those questions are easily answered by hooking into the ticket or case management database. Data mining your case management DB might be challenging, but it's definitely worth it.
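As a sketch of how little SPL that takes, assuming a hypothetical case_mgmt index where closed tickets carry outcome and rule_name fields, the FP rate per rule could be pulled with something like:

index=case_mgmt status=closed
| stats count(eval(outcome="FP")) AS fp count(eval(outcome="TP")) AS tp BY rule_name
| eval fp_rate=round(100 * fp / (fp + tp), 1)
| sort -fp_rate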

"I don't have a case management system!" Then go get one before you start the metrics conversation. If you don't have an incident workflow in place, those systems might even drive you towards designing one.

Happy to discuss this stuff further? Feel free to comment here or message me on Twitter.

My TOP 5 Security (and techie) talks from Splunk .conf 2015

If you are into Security and didn't have the opportunity to attend the Splunk conference in Las Vegas this year (maybe you were busy playing Blackjack instead?), here's what you cannot miss.

The list is not sorted in any particular order and, whenever possible, entries include presenters’ Twitter handles as well as takeaways or comments that might help you choose where to start.

  1. Security Operations Use Cases at Bechtel (recording / slides)
    That's the coolest customer talk of the ones I was able to watch. The presenters (@ltawfall / @rj_chap) discussed some interesting use cases and provided a lot of input for those willing to make Splunk their security nerve center.
  2. Finding Advanced Attacks and Malware with Only 6 Windows EventIDs (recording / slides)
    This presentation is a must for those willing to monitor Windows events, either via native or 3rd-party endpoint solutions. @HackerHurricane really knows his stuff, which is no surprise for someone who calls himself a Malware Archaeologist.
  3. Hunting the Known Unknowns (with DNS) (recording / slides)
    If you are looking for concrete security use case ideas to build on DNS data, this one is a gold mine. Don't forget to provide feedback to Ryan Kovar and Steve Brant; I'm sure they will like it.
  4. Building a Cyber Security Program with Splunk App for Enterprise Security (recording / slides)
    The Enterprise Security (ES) app relies heavily on accelerated data models, so besides interesting tips on how to leverage ES, Jeff Campbell provides ways to optimize your setup, showing what goes on under the hood.
  5. Build A Sample App to Streamline Security Operations – And Put It to Use Immediately (recording)
    This talk was delivered by Splunkers @dimitrimckay and @daveherrald. They presented an example of how to build custom content on top of ES to enhance the context around an asset, which is packaged as an app available on GitHub.

Now, in case you are not into Security but also enjoy watching hardcore, techie talks, here’s my TOP 5 list:

  1. Optimizing Splunk Knowledge Objects – A Tale of Unintended Consequences (recording / slides)
    Martin gives an a-w-e-s-o-m-e presentation on Knowledge Objects, unraveling what happens under the hood when using tags and eventtypes. Want to give him feedback? Martin is often found on IRC; join #splunk and say 'Hi'!
  2. Machine Learning and Analytics in Splunk (recording / slides)
    If you are into ML and the likes of R programming, the app presented here will definitely catch your attention. Just take a quick look at the slides to see what I mean. A lot of use cases for Security here as well.
  3. Beyond the Lookup Glass: Stepping Beyond Basic Lookups (recording)
    Wanna know about the challenges with CSV lookups and the KV store in big deployments? Stop here. Kudos to Duane Waddle and @georgestarcher!
  4. Splunk Search Pro Tips (recording / slides)
    Just do the following: open the video recording and skip to around the 30-minute mark (magic!). Now, try not to watch the entire presentation, and thank Dan Aiello.
  5. Building Your App on an Accelerated Data Model (recording / slides)
    In this presentation, the creator of uberAgent, @HelgeKlein, describes how to make the most of data models in great detail.

Still eager for more Security-related Splunk .conf content? Simply pick one below (recordings only).

For all presentations (recordings and slides), please visit the conference website.

Splunk > Self-Learning Path & The Community Factor

Splunk is gaining tremendous traction in the market due to its ability to harness the value of machine data. The idea here is to highlight a few reasons for that success: its free-access and community-driven approaches.

Being familiar with the ways knowledge can be freely attained is a great advantage. Coupled with your curiosity, pretty much nothing more is needed to become an independent learner these days.

Below you will find the main references I’ve been using to learn Splunk and get up to speed with this great technology.

Splunk Platform: Free, Easy Access

Splunk provides free access to its flagship product, Splunk Enterprise. Users evaluating the product can also get a free, perpetual license. That means no initial costs for installing and evaluating most of its primary capabilities.

For developers, there is also a developer license, which allows indexing up to 10GB of data a day.

TLDR? Just hit Play!

Besides the excellent Just Ask campaign, the following short videos help show Splunk's benefits:

Looking for more technical stuff that's easy to follow and digest? Below is a YouTube playlist with demo-like lessons available from Splunk's channel:

Besides, if you are an Infosec pro, don’t forget to check the current Security related apps at the portal. Aside from that, below you will find a few videos that might trigger inspiration for further research and ideas:

Q&A Forum, IRC and Wiki

The Splunk Answers forum is a really important knowledge base, and here's why:

  • The discussions are built around questions and answers, so entries tend to be clear and narrowed to a specific topic, oftentimes matching an issue you are currently facing;
  • Not only Splunk team members provide answers; it's common to get responses from partners and, of course, the whole Splunk community, including end users;
  • Script/code as well as images are allowed, for easier understanding of a question or an answer. Top contributors are also awarded points and badges to promote user interaction;
  • There is a sort of rating for answers, so users can also rely on that when choosing where to start.

I was also surprised when I joined the IRC channel, as several Splunk staff members (PS, Devel, Support) take part in the discussions there. Sometimes an answer not found in the documentation, or even a bug report, might well be the subject of a quick chat.

Besides that, there is, of course, a Splunk Wiki! As with the other examples listed here, it's also community driven, so anyone is able to add and edit content.

Documentation Portal

Splunk provides a well-organized documentation portal, which serves as a quick reference guide (e.g., for search commands) and also enables you to learn about more advanced topics such as Distributed Deployment or the Common Information Model Add-on Manual.

There are also some dedicated tutorials available, such as the Search Tutorial. Below I am listing some doc bookmarks that I constantly query:

It's worth noting that most areas of the documentation portal provide a Comments section, where the answer to your issue might be found, so always keep an eye on it.

UPDATE 9-Mar-15: Also, don’t forget to bookmark Splexicon, a documentation reference that defines technical terms that are specific to Splunk. Definitions include links to related information from the Splunk documentation.

Cheatsheets

For those Splunk pros out there who love having those neat docs around, there are some cool versions available for Splunk as well. Some of them are listed below:

The Community Factor: BIG Win!

Community engagement is a huge win in terms of knowledge sharing and as a business strength. Simply setting up a web forum doesn't enable community integration. In my opinion, here are some of the great initiatives Splunk has been carrying out to accomplish that:

Missing something? Just let me know so I can add it here as well.

Visualizing Brazil's Federal Budget

Before I dare to talk about the budget process, I'd like you to quickly consider the following sentence:

For the Education area, the Budget foresees the application of R$ 82.3 billion in expenses related to the maintenance and development of education.

What does that mean? Besides indicating that a certain amount (over 82 billion!) will go to the Education area, what else can be understood here? Is that a lot or a little? And what if the sentence were phrased like this?

The National Congress approved the 2014 Budget in the early hours of Wednesday (the 18th), forecasting revenue of R$ 2.488 trillion. For the Education area, the Budget foresees the application of R$ 82.3 billion in expenses related to the maintenance and development of education.

Easier to picture now? For many, perhaps; for the vast majority, they are just numbers of enormous proportions.

But why do these numbers matter?

The two sentences quoted at the beginning of this post sit exactly 19 paragraphs apart and were transcribed from the news story below, published on Globo's news portal (G1):

Congress approves the 2014 Budget ("Updated 18/12/2013 10h59")

In fact, that is more or less how we used to consume news. Nowadays, infographics are widely used to ease the communication of ideas, especially when representing proportions. However, no diagram or infographic was used in that publication, at least not by the time I viewed it (February 4, 2015).

Even without reaching for a calculator, or doing some quick mental math, one can see the disparity between these numbers is huge, with orders of magnitude significantly far apart. But does that matter? Read below what the population, the one that has the right (or rather, the obligation) to vote, has to say about where Brazil needs more attention:

Brazilians rank health, security and education as priorities for 2014

Beyond the headline, all you need to know is this: "Almost half of the Brazilian population (49%) says improving health services should be the federal government's priority for 2014, the year a new President of the Republic is elected."

Hold on to that information for a moment: health, security and education.

What is a budget, after all?

The famous sociologist Betinho defined it more or less like this:

It is through the public budget that one reads the soul of the ruler. It is because it holds the evidence of an unfair game that the budget is so complicated, technical, hidden, disguised, elusive.

That word fits us Brazilians like a glove, see for yourself. "Orçar" (to budget) comes from the Italian "orzare", which means "to sail against the wind", since the operation implies making an estimate (via Consultório Etimológico). Wikipedia, in turn, puts it this way: a budget is the part of a strategic financial plan comprising the forecast of future revenues and expenses for the administration of a given fiscal year.

Throughout this text I refer to the federal (Union) budget, so it's worth remembering that states and municipalities also have their own budgets, funded by state and municipal taxes, levies and fees, in addition to transfers from the federal government.

Just like you do in that password-free Excel spreadsheet of yours, the government must also manage how it will spend or invest its income (revenue, tax collection), its money. You earn, or expect to earn, 2,000 reais a month, out of which you must pay essential expenses such as food, taxes, housing, fees, education, transportation, levies, and so on.

Likewise, the government expects to earn (collect) a few trillion reais a year, bearing in mind that part of the country's budget must go to the essential areas: health, security, pensions, salaries, etc. But what criteria are used to allocate the money responsibly, with the population's best interest in mind?

Basically, if the money is well invested, the chances of prosperity improve; otherwise, it's not sailing against the wind, it's facing a terrible storm!

In short, just as the country must plan, that is, envision actions in advance, you should also set a destination and goals for your money, which doesn't seem to happen with at least 22% of the population, those with access to the overdraft facility ("cheque especial") that is special in name only.

The 2014 Budget

Nowadays it's possible to use technology (software) that enables data analysis and visualization in a simplified way, which is extremely important, especially in a hyper-connected world where the volume of generated data grows very quickly.

SIGA Brasil is a public budget information system that provides broad, simplified access to SIAFI and other databases on public plans and budgets through a single query tool.

Thus, using data from that system together with data analysis platforms, it's possible to generate visualizations, the basis for the infographics and diagrams we see around.

In the budget, each destination area has a specific sub-area or purpose (e.g., Education → Higher Education); however, to keep the chart below focused on the main themes, not all purposes are broken out.

The chart below considers the amounts paid in 2014. The wider the line, the larger the amount paid or allocated to a given area or sub-area:

Source: http://www8a.senado.gov.br/dwweb/abreDoc.html?docId=4434917

When comparing this visualization with traditional spreadsheets, or even with pie charts, some advantages stand out, chiefly the idea of flow and a better sense of proportionality.

What is clear?

  • More than half of the budget, over R$ 1.2 trillion, goes to "Encargos Especiais" (Special Charges), which looks like an area created to accommodate public debt payments;
  • Social security is very costly, which is no news for the country with the most generous pension regime in the world. Some world statistics here, for comparison;
  • Given that the destinations are sorted by amounts paid, Public Security gets lost among the other least-funded areas on the list.

What is not clear?

  • What is done with the money allocated to "Outras Transferências" (Other Transfers) and "Outros Encargos Especiais" (Other Special Charges), both stemming from "Encargos Especiais"? After all, we are talking about more than R$ 290 billion!

Public Security

When comparing the area for Public Security expenses or investments, slightly over R$ 7 billion (well under 1% of the total), with the whole budget, one notices, even without knowing the exact figures, that the disproportion is huge.

In the chart below, the red area proportionally represents the amount allocated to Public Security, which clearly shows that disproportion:

Do you think this has anything to do with the all-time record of more than 56,000 homicides in Brazil in 2014? Or with the fact that the country has the largest fleet of armored cars in the world? Partly, yes! But have no doubt that the Federal Police and the Federal Highway Police squeeze blood from a stone with this negligible investment.

Security, like other budget destinations, is a responsibility shared between the federal government and the states, so it would be unwise to draw any deeper parallel between the lack of security and the federal investment in Public Security alone, without considering other, more specific (state/municipal) budgets.

However, considering that at least one third of that money goes to salaries, very little is left for investments in the area itself. Just think of equipment, training, weaponry, police department infrastructure and so on.

To put it in perspective, in 2010 the Federal Police, the force that investigates criminal offenses against the political order (see Mensalão, Lava Jato/My Way), secures the borders, and fights drug trafficking and other federal crimes, had little more than R$ 28 million to invest. A similar amount was invested in a Rio de Janeiro startup that makes it easier to hail a taxi!

For the record, do you know where Brazil spent (sorry, "invested") the same amount in 2014? That World Cup where we took a 7-1 beating from Germany.

Education & Health

Does the same apply to the Education and Health areas? See below how these areas compare to the whole, using the same approach shown above:

Regarding Education, as a share of GDP Brazil invests above the average of the Organisation for Economic Co-operation and Development (OECD) countries; in this case, we are talking about almost R$ 81 billion.

Some say the problem is not a lack of investment or resources in the area, but a lack of competence in public administration, a lack of management. Raising teachers' salaries is a clear need, but how should that be addressed? Merely increasing public investment does not seem to be the only or the best way.

The same OECD indicates that educating a Brazilian costs, on average, one third of what is spent on a student in rich countries, relating investment to the number of students (population). Hence the need for per capita statistics, as highlighted by a teacher in the BBC Brasil report.

Besides economic and social analyses, the OECD is also responsible for developing and administering PISA, the mathematics, reading and science assessment aimed at students who have just completed basic education. In recent years, Brazil has ranked below the top 50, a strong indication of the poor quality of education in the country.

The report for 2012, the year the last assessments were administered, is available below:

PISA 2012 Results in Focus

Contrary to what was promised in the 2014 campaign, whose motto was "Pátria Educadora" ("Educating Homeland"), cuts to Education have already been announced, which will further harm the development of the area, sometimes called the engine of progress.

As for Health, things get even worse, much worse actually. The amount allocated to this area is just over R$ 80 billion. How do you invest in the area and still meet the demand for health services of a population of more than 200 million people?

Many factors are involved in assessing the quality of health services: shortage of doctors and professionals, poor management of resources, too much bureaucracy, and the list goes on. Based on the latest TCU report, Globo produced a story that reflects the scenario well. No wonder doctors fall into despair here and there.

Findings like the one below are part of the report:

The number of doctors per thousand inhabitants in the country's state capitals is, on average, 4.56, while in the countryside the indicator drops to 1.11. There are significant variations among Brazilian states: Maranhão, the state with the lowest relative number, has 0.71 doctor per thousand inhabitants; in the Distrito Federal, the number rises to 4.09, an index comparable to Norway's.

But is this investment really small? See what the World Health Organization (WHO) itself says about Brazil, transcribed from a story by the online outlet Último Segundo:

Although supplementary to the public health system, private health plans in Brazil invest more in the sector than the federal government does in SUS (the Unified Health System). This is the only such case in the world. The organization's report concluded that, except for Brazil, nowhere else with universal public health care does the private system invest more.

The excerpt below is even more worrying:

According to ANS (the National Supplementary Health Agency), private operators disbursed R$ 90.5 billion in 2013 paying for hospitalizations, appointments and lab tests to serve a total of 50 million clients. This year, SUS received R$ 91.6 billion from the Union to reach 200 million patients.

In effect, then, SUS must serve a demand of at least 150 million people. With that in mind, it's worth visualizing how Health spending is subdivided:

Although it contains no figures, it gives a sense of the proportion invested per program. If you want to check the actions, the expenses and even the companies awarded within each one, the details are available on the Portal da Transparência, a website maintained by the government itself. Below are the details of the "Saneamento Básico" (Basic Sanitation) program, for example:

Direct expenses of the "Saneamento Básico" program

Note that besides Health, this program also receives funds from the Sanitation and Environmental Management areas. More details can also be found in SIGA itself.

UPDATE: Many people don't know it, but ordinary citizens can directly help define the priorities for spending their municipality's budget. Learn more about Participatory Budgeting ("Orçamento Participativo") in an excellent article published on the Consciência Política portal.

Public Debt

Now that you know most of the country's money goes to public debt payments, you must be wondering: what debt is this? Who are the creditors? To better understand the whys of the debt, I suggest reading the following articles from the Auditoria Cidadã project, written in simple, easy-to-understand language:

Some hypotheses can only be truly verified through more detailed analyses, which is quite a challenging task, even for economists and experts on the subject.

Still, just to expose the debt data in a visually different way, I will cover this subject soon here on the blog, which, besides spreading the information, may spark ideas and feedback for future posts.

* This is actually the first post I write in the Consciência Política category.

My 1st Splunk app: RAW Charts

After some days playing around with a few interesting apps, I've decided to give it a try and learn how to integrate the RAW data visualization project into Splunk.

It turns out that, by reading the right (latest) App Development documentation (thanks, IRC!) and checking good examples, it's quite an easy job, especially if you are already familiar with web development technologies (HTML, JS/jQuery and the like).

Here’s a bit of motivation to do it:

  • Connecting with the Splunk community;
  • Getting up to speed with the Splunk Web Framework for quickly developing custom content (views, dashboards, apps, etc.);
  • Easily visualizing search results in different formats by leveraging the search bar functionality, rather than editing hard-coded dashboard searches;
  • Helping to spread the word about the power of data visualization by demonstrating the incredible D3 library and the RAW project;
  • Having fun! (a must for any learning experience nowadays, right?)

RAW project?

I won't dare describe it better than the creators of this great project:

“The missing link between spreadsheets and vector graphics.”

A more detailed description is also found in the project's README file:

RAW is an open web tool developed at the DensityDesign Research Lab (Politecnico di Milano) to create custom vector-based visualizations on top of the amazing d3.js library by Mike Bostock. Primarily conceived as a tool for designers and vis geeks, RAW aims at providing a missing link between spreadsheet applications (e.g. Microsoft Excel, Apple Numbers, Google Docs, OpenRefine, …) and vector graphics editors (e.g. Adobe Illustrator, Inkscape, …).

What you can do instead is simply browse the project interface here: app.raw.densitydesign.org. Paste your data or just pick one of the data samples to realize how easy it is to create a chart without a single line of code.

And since we are talking about single lines of code, let's get straight to the point. Here's a quick and dirty hack for automatically copying the search results into RAW's workflow:

$scope.text = localStorage.getItem('searchresults')

In fact, I'm not sure if that's the optimal way to accomplish it, but that's the only change needed within RAW's code (controllers.js). The wonderful Italian mafia team at Density Design might be reading this now, so guys, please advise! (I know you are very busy.)

Nevertheless, after a quick read through AngularJS, that change looks like a quick win. What it does is tell the browser to load the data from local storage into RAW's textarea. Local storage? Remember cookies and the HotDog editor? That's history! Actually, not.

The Splunk Code

Using the Web Framework Toolkit, creating an app is really easy. Just run the splunkdj createapp <app-name> command and start customizing the default view that comes built in, home.html. Here's the main piece of code used for this app (JavaScript block):

{% block js %}
<script>

function createIframe(){
    // reset div contents
    document.getElementById("raw-charts").innerHTML = "";

    // create an iframe
    var rawframe = document.createElement("iframe");
    rawframe.id = "rawframe";
    rawframe.src = "{{STATIC_URL}}{{app_name}}/raw/index.html";
    rawframe.scrolling = "no";
    rawframe.style.border = "none";
    rawframe.width = "100%";
    rawframe.height = "3700px";

    // insert iframe
    document.getElementById("raw-charts").appendChild(rawframe);

};

var deps = [
	"splunkjs/ready!",
	"splunkjs/mvc/searchmanager"
];

require(deps, function(mvc) {

	// this guy handles the search/results
	var SearchManager = require("splunkjs/mvc/searchmanager");

	// initial search definition
	var mainSearch = new SearchManager({
		id: "search1",
		//search: "startminutesago=1 index=_internal | stats c by group | head 2",
		search: "",
		max_count: 999999,
		preview: false,
		cache: false
	});

	// count: 0 needed for avoiding the 100 limit (Thanks IRC #splunk!)
	var myResults = mainSearch.data("results", {count: 0});

	// tested with "on search:done" but unexpected results happened
	myResults.on("data", function() {  

		// field names separated by comma
		var searchresults = myResults.data().fields.join();

		// debug code
		//console.log(myResults.collection());

		// loop through the result set
		for (var i=0; i < myResults.data().rows.length; i++) {
			searchresults = searchresults + '\n' + myResults.data().rows[i];
		}

		// better than cookie!
		localStorage.setItem('searchresults',searchresults);

		// search loaded, triggering iframe creation
		createIframe();

	});

	// keep search bar and manager in sync
	var searchbar1 = mvc.Components.getInstance('searchbar1');
	var search1 = mvc.Components.getInstance('search1');

	searchbar1.on('change', function(){
		search1.settings.unset('search');
		search1.settings.set('search', searchbar1.val());
	});
});

</script>

{% endblock js %}

The initial page of the app loads an empty search bar with a table view component right below it. After running a search, the table displays the search results and also triggers the RAW workflow by loading the textarea with the table's content.

Meet the workflow

In a nutshell, the visualization workflow works like Splunk's default one: the user runs a search command, formats the results, and finally clicks the "Visualization" tab. Likewise, using this app the user is also able to customize chart options and export the results in different formats.

First Example

Here's the first example in action, reachable via the Chart Examples menu. The data comes from the Transport for London data portal; this specific data set (CSV) is a sample from the Rolling Origin & Destination Survey (RODS), available under the "Network Statistics" section of the portal.

Before handling the CSV file, the following command is needed to clean up the file header, basically replacing slashes and spaces with a "_" char:

sed -i '1,1s/[[:blank:]]*\/[[:blank:]]*\|\([[:alnum:]]\)[[:blank:]]\+\([[:alnum:]]\)/\1_\2/g;' rods-access-mode-2010-sample.csv

After clicking the example link, the search bar gets preloaded with a specific search command, which triggers the table reload:

The results are synced to RAW's input component, which is fully editable, just in case:

The user is then able to choose a chart type (multiple are available). Here, the Alluvial/Sankey diagram is chosen:

There's also an option for adding your own chart, in case you are willing to integrate your own D3 code with the project.

The next step is to select which fields (columns) will be part of the diagram/chart, and also how they will map to the chart's components (dimensions, steps, hierarchy, etc.). A nice drag-and-drop interface eases the job.

Just follow the step-by-step instructions included within the example. The final mapping setup should look like the following:

Finally, here's the resulting chart:

As you can see from this simple example, the chart conveys the idea of flow and proportionality among the dimensions better than the usual charting options out there.

Optionally, the user is able to customize colors, sorting and other settings, which may differ depending on the chart chosen. Exporting options are also available (SVG/HTML, PNG, etc.).

Second Example

The second example leverages data from the World Bank data portal related to Internet subscribers. For this case, I've decided to apply a few constraints so that the results become a bit simpler to render:

  • Only a few countries are filtered in;
  • Time period considered is 2000-2009.

Following roughly the same steps described in the previous example, the search bar gets preloaded with a search command and the user is instructed to follow a few steps to generate the graph, in this case a Bump Chart, similar to the one featured in the NYT.

I hope the screenshots speak for themselves (click for full size). Detailed instructions are available in the app's documentation and examples.

Here's a list of currently supported charts/diagrams: Sankey/Alluvial, Bump Chart, Circle Packing, Circular/Cluster Dendrogram, Clustered Force Layout, Convex Hull, Delaunay Triangulation, Hexagonal Binning, Parallel Coordinates, Reingold-Tilford Tree, Streamgraph, Treemap, Voronoi Tessellation.

Comments and suggestions are more than welcome! The app is available at Splunk's app portal, and I will later upload the code to a common place (GitHub?) to make it easier for everyone to access and modify it.