# The proof that your assumptions about traffic are right

Big cities in developing countries suffer from many problems: housing, violence, jobs, logistics, and so on. Right now I am writing this blog post from São Paulo, Brazil, where one thing is on the mind of every engineer, something we call the “Brazil cost”.

The Brazil cost refers to the bottlenecks in our logistics that make all goods and services cost more due to the lack of infrastructure such as railroads and ports. Roads are no different.

As a civil engineering student at USP, I take some time to study how people behave on the streets, and here are some of my findings.

# Left lanes go faster than the others

On urban and rural roads, it's well known that the local lanes, the ones on the right, are occupied by heavy vehicles like trucks and buses, which drive slower. Lighter cars, however, can drive in any lane, and the composition of a road (that is, the proportion of heavy and light vehicles) is mostly cars. So could that make the local lanes faster than the express ones?

As we can see, the lane with the highest speed is lane 1, the one closest to the center of the road and the least used by heavy vehicles.
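A quick way to check this on the radar dataframe is to group the records by lane and compare mean speeds. The tiny dataframe and the `lane` and `speed` column names below are assumptions for illustration, not the real schema of the dataset:

```python
import pandas as pd

# Hypothetical sample of radar records; the column names and values
# are made up for illustration, not taken from the real dataset.
df = pd.DataFrame({
    "lane":  [1, 1, 2, 2, 3, 3],   # 1 = leftmost (express) lane
    "speed": [82, 78, 65, 70, 52, 55],  # km/h
})

# Mean speed per lane, from the leftmost to the rightmost lane
mean_speed = df.groupby("lane")["speed"].mean()
print(mean_speed)
```

With this sample, lane 1 comes out fastest, matching the pattern described above.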

# Do heavy vehicles really occupy local lanes?

Above is the correlation matrix of the features in the dataframe of radar data that is the data source for this post. It is easy to see that large *lengths* correlate with higher *lane numbers*. That means buses and trucks really do occupy the local lanes (since the lanes are numbered from left to right).
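The length-lane correlation can be sketched the same way. Again the data and the `length` and `lane` column names are assumptions; the point is only that `DataFrame.corr()` gives the matrix and that a positive entry means longer vehicles tend toward higher-numbered lanes:

```python
import pandas as pd

# Toy radar records: vehicle length in meters and lane number.
# These values are invented for illustration, not the real data.
df = pd.DataFrame({
    "length": [4.2, 4.5, 12.0, 13.5, 4.1, 18.0],
    "lane":   [1,   1,   3,    3,    2,   3],
})

# Pearson correlation matrix; a positive length-lane entry means
# longer vehicles tend to sit in higher-numbered (rightmost) lanes
corr = df.corr()
print(corr.loc["length", "lane"])
```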

# How well can a Random Forest model predict new data?

The linear model, always a good head start, does little for the prediction when fitted to the data, reaching an R² score of 3%. The search for a better model passes through many regressors, such as decision trees, but the trick here is to strengthen them with an ensemble method.
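As a sketch of that baseline, here is a linear fit on synthetic data (the real radar dataset isn't reproduced here): the target depends nonlinearly on one feature, so a straight line explains very little of it, much like the weak R² above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the radar features: three fake columns and a
# target that depends nonlinearly on the first one. None of this is
# the real dataset; it only illustrates the weakness of a linear fit.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 3))
y = 60 + 10 * np.cos(2 * np.pi * X[:, 0]) + rng.normal(0, 8, size=500)

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
print(f"R2 of the linear baseline: {r2:.2f}")
```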

When working with real-world data, it's not uncommon for models to score poorly. Improving the results often comes down to processing the data or choosing the best hyperparameters, and an improvement of as little as 1% is greeted with joy.

With this approach, the Random Forest algorithm reached an R² score of 13% using 50 binary trees and a max depth of 5 nodes. At this point, computational cost plays a big role. With more estimators, more depth, and other tuned hyperparameters it is possible to reach a higher score, although the model takes more time and computational power to run, and that is why this is a turning point.
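A minimal sketch of that setup, again on synthetic data rather than the real radar records; only the hyperparameters (50 trees, max depth 5) come from the text. On the same kind of nonlinear target that defeats the linear baseline, the ensemble recovers a noticeably better held-out score:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic data again (not the real radar records); only the
# hyperparameters, 50 trees and max depth 5, come from the text.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 3))
y = 60 + 10 * np.cos(2 * np.pi * X[:, 0]) + rng.normal(0, 8, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
forest.fit(X_train, y_train)
r2 = r2_score(y_test, forest.predict(X_test))
print(f"R2 on held-out data: {r2:.2f}")
```

Scoring on a held-out split, rather than the training data, is what makes the comparison honest: more trees or more depth would raise the training score almost for free, but only a genuine fit raises this one.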

Do you have any more assumptions about traffic? Want to help build this project? Feel free to send me a message or fork this GitHub repo!